.. SPDX-License-Identifier: GPL-2.0

.. _xfs_online_fsck_design:

.. Mapping of heading styles within this document:
   Heading 1 uses "====" above and below
   Heading 2 uses "===="
   Heading 3 uses "----"
   Heading 4 uses "````"
   Heading 5 uses "^^^^"
   Heading 6 uses "~~~~"
   Heading 7 uses "...."

   Sections are manually numbered because apparently that's what everyone
   does in the kernel.

======================
XFS Online Fsck Design
======================

This document captures the design of the online filesystem check feature for
XFS.
The purpose of this document is threefold:

- To help kernel distributors understand exactly what the XFS online fsck
  feature is, and issues about which they should be aware.

- To help people reading the code to familiarize themselves with the relevant
  concepts and design points before they start digging into the code.

- To help developers maintaining the system by capturing the reasons
  supporting higher level decision making.

As the online fsck code is merged, the links in this document to topic
branches will be replaced with links to code.

This document is licensed under the terms of the GNU General Public License,
v2.
The primary author is Darrick J. Wong.

This design document is split into seven parts.
Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process
works and how it is tested to ensure correct functionality.
Part 4 discusses the user interface and the intended usage modes of the new
program.
Parts 5 and 6 show off the high level components and how they fit together,
and then present case studies of how each repair function actually works.
Part 7 sums up what has been discussed so far and speculates about what else
might be built atop online fsck.

.. contents:: Table of Contents
   :local:

1. What is a Filesystem Check?
==============================

A Unix filesystem has four main responsibilities:

- Provide a hierarchy of names through which application programs can
  associate arbitrary blobs of data for any length of time,

- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.

Metadata directly supporting these functions (e.g. files, directories, space
mappings) are sometimes called primary metadata.
Secondary metadata (e.g. reverse mapping and directory parent pointers)
support operations internal to the filesystem, such as internal consistency
checking and reorganization.
Summary metadata, as the name implies, condense information contained in
primary metadata for performance reasons.
The filesystem check (fsck) tool examines all the metadata in a filesystem
to look for errors.
In addition to looking for obvious metadata corruptions, fsck also
cross-references different types of metadata records with each other to look
for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
to correct any problems found.
As a word of caution -- the primary goal of most Linux fsck tools is to
restore the filesystem metadata to a consistent state, not to maximize the
data recovered.
That precedent will not be challenged here.

Filesystems of the 20th century generally lacked any redundancy in the ondisk
format, which means that fsck can only respond to errors by erasing files
until errors are no longer detected.
More recent filesystem designs contain enough redundancy in their metadata
that it is now possible to regenerate data structures when non-catastrophic
errors occur; this capability aids both strategies.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| System administrators avoid data loss by increasing the number of        |
| separate storage systems through the creation of backups; and they avoid |
| downtime by increasing the redundancy of each storage system through the |
| creation of RAID arrays.                                                 |
| fsck tools address only the first problem.                               |
+--------------------------------------------------------------------------+

TLDR; Show Me the Code!
-----------------------

Code is posted to the kernel.org git trees as follows:
`kernel changes
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
`userspace changes
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_,
and `QA test changes
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
Each kernel patchset adding an online repair function will use the same
branch name across the kernel, xfsprogs, and fstests git repos.

Existing Tools
--------------

The online fsck tool described here will be the third tool in the history of
XFS (on Linux) to check and repair filesystems.
Two programs precede it:

The first program, ``xfs_check``, was created as part of the XFS debugger
(``xfs_db``) and can only be used with unmounted filesystems.
It walks all metadata in the filesystem looking for inconsistencies in the
metadata, though it lacks any ability to repair what it finds.
Due to its high memory requirements and inability to repair things, this
program is now deprecated and will not be discussed further.

The second program, ``xfs_repair``, was created to be faster and more robust
than the first program.
Like its predecessor, it can only be used with unmounted filesystems.
It uses extent-based in-memory data structures to reduce memory consumption,
and tries to schedule readahead I/O appropriately to reduce I/O waiting time
while it scans the metadata of the entire filesystem.
The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and the directory tree by erasing things as
needed to eliminate problems.
Space usage metadata are rebuilt from the observed file metadata.
Problem Statement
-----------------

The current XFS tools leave several problems unsolved:

1. **User programs** suddenly **lose access** to the filesystem when
   unexpected shutdowns occur as a result of silent corruptions in the
   metadata.
   These occur **unpredictably** and often without warning.

2. **Users** experience a **total loss of service** during the recovery
   period after an **unexpected shutdown** occurs.

3. **Users** experience a **total loss of service** if the filesystem is
   taken offline to **look for problems** proactively.

4. **Data owners** cannot **check the integrity** of their stored data
   without reading all of it.
   This may expose them to substantial billing costs when a linear media scan
   performed by the storage system administrator might suffice.

5. **System administrators** cannot **schedule** a maintenance window to deal
   with corruptions if they **lack the means** to assess filesystem health
   while the filesystem is online.

6. **Fleet monitoring tools** cannot **automate periodic checks** of
   filesystem health when doing so requires **manual intervention** and
   downtime.

7. **Users** can be tricked into **doing things they do not desire** when
   malicious actors **exploit quirks of Unicode** to place misleading names
   in directories.

Given this definition of the problems to be solved and the actors who would
benefit, the proposed solution is a third fsck tool that acts on a running
filesystem.

This new third program has three components: an in-kernel facility to check
metadata, an in-kernel facility to repair metadata, and a userspace driver
program to drive fsck activity on a live filesystem.
``xfs_scrub`` is the name of the driver program.
The rest of this document presents the goals and use cases of the new fsck
tool, describes its major design points in connection to those goals, and
discusses the similarities and differences with existing tools.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| Throughout this document, the existing offline fsck tool can also be     |
| referred to by its current name "``xfs_repair``".                        |
| The userspace driver program for the new online fsck tool can be         |
| referred to as "``xfs_scrub``".                                          |
| The kernel portion of online fsck that validates metadata is called      |
| "online scrub", and the portion of the kernel that fixes metadata is     |
| called "online repair".                                                  |
+--------------------------------------------------------------------------+
The naming hierarchy is broken up into objects known as directories and files
and the physical space is split into pieces known as allocation groups.
Sharding enables better performance on highly parallel systems and helps to
contain the damage when corruptions occur.
The division of the filesystem into principal objects (allocation groups and
inodes) means that there are ample opportunities to perform targeted checks
and repairs on a subset of the filesystem.

While this is going on, other parts continue processing IO requests.
Even if a piece of filesystem metadata can only be regenerated by scanning
the entire system, the scan can still be done in the background while other
file operations continue.

In summary, online fsck takes advantage of resource sharding and redundant
metadata to enable targeted checking and repair operations while the system
is running.
This capability will be coupled to automatic system management so that
autonomous self-healing of XFS maximizes service availability.

2. Theory of Operation
======================

Because it is necessary for online fsck to lock and scan live metadata
objects, online fsck consists of three separate code components.
The first is the userspace driver program ``xfs_scrub``, which is responsible
for identifying individual metadata items, scheduling work items for them,
reacting to the outcomes appropriately, and reporting results to the system
administrator.
The second and third are in the kernel, which implements functions to check
and repair each type of online fsck work item.
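To make the driver/kernel split concrete, here is a minimal sketch of how a
driver program might submit a single work item to the kernel checking
facility through the metadata scrub ioctl.
It assumes the ``XFS_IOC_SCRUB_METADATA`` ioctl and the
``struct xfs_scrub_metadata`` control structure exported through
``xfs_fs.h``; the helper name and the exact header path are illustrative,
not part of the design.

.. code-block:: c

   #include <stdint.h>
   #include <sys/ioctl.h>
   #include <xfs/xfs.h>    /* scrub ABI: struct xfs_scrub_metadata et al. */

   /*
    * Ask the kernel to check the free space btree of allocation group
    * @agno.  @fd is any open file on the target filesystem.  Returns -1
    * if the scrub call itself failed, 1 if the btree is corrupt, and 0
    * if it checks out clean.
    */
   static int scrub_one_bnobt(int fd, uint32_t agno)
   {
           struct xfs_scrub_metadata sm = {
                   .sm_type = XFS_SCRUB_TYPE_BNOBT,
                   .sm_agno = agno,
           };

           if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
                   return -1;      /* errno says why the check didn't run */

           /* On return, sm_flags carries the kernel's verdict. */
           if (sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
                              XFS_SCRUB_OFLAG_XCORRUPT))
                   return 1;       /* corrupt; schedule a repair */

           return 0;               /* clean */
   }

A real driver repeats this pattern for every metadata object in the
filesystem, one scrub item at a time.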
However, both of these limitations are acceptable tradeoffs to satisfy the different motivations of online fsck, which are to }(hj.hhhNhNubj)}(h**minimize system downtime**h]hminimize system downtime}(hjJhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj.ubh and to }(hj.hhhNhNubj)}(h(**increase predictability of operation**h]h$increase predictability of operation}(hj\hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj.ubh.}(hj.hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhjhhubh)}(h.. _scrubphases:h]h}(h]h ]h"]h$]h&]j scrubphasesuh1hhMhjhhhhubeh}(h]jTah ]h"]scopeah$]h&]uh1hhjjhhhhhKubh)}(hhh](h)}(hPhases of Workh]hPhases of Work}(hjhhhNhNubah}(h]h ]h"]h$]h&]jjpuh1hhjhhhhhMubh)}(hXThe userspace driver program ``xfs_scrub`` splits the work of checking and repairing an entire filesystem into seven phases. Each phase concentrates on checking specific types of scrub items and depends on the success of all previous phases. The seven phases are as follows:h](hThe userspace driver program }(hjhhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh splits the work of checking and repairing an entire filesystem into seven phases. Each phase concentrates on checking specific types of scrub items and depends on the success of all previous phases. The seven phases are as follows:}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjhhubji)}(hhh](h)}(hCollect geometry information about the mounted filesystem and computer, discover the online fsck capabilities of the kernel, and open the underlying storage devices. h]h)}(hCollect geometry information about the mounted filesystem and computer, discover the online fsck capabilities of the kernel, and open the underlying storage devices.h]hCollect geometry information about the mounted filesystem and computer, discover the online fsck capabilities of the kernel, and open the underlying storage devices.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hX#Check allocation group metadata, all realtime volume metadata, and all quota files. Each metadata structure is scheduled as a separate scrub item. If corruption is found in the inode header or inode btree and ``xfs_scrub`` is permitted to perform repairs, then those scrub items are repaired to prepare for phase 3. Repairs are implemented by using the information in the scrub item to resubmit the kernel scrub call with the repair flag enabled; this is discussed in the next section. Optimizations and all other repairs are deferred to phase 4. h]h)}(hX"Check allocation group metadata, all realtime volume metadata, and all quota files. Each metadata structure is scheduled as a separate scrub item. If corruption is found in the inode header or inode btree and ``xfs_scrub`` is permitted to perform repairs, then those scrub items are repaired to prepare for phase 3. Repairs are implemented by using the information in the scrub item to resubmit the kernel scrub call with the repair flag enabled; this is discussed in the next section. Optimizations and all other repairs are deferred to phase 4.h](hCheck allocation group metadata, all realtime volume metadata, and all quota files. Each metadata structure is scheduled as a separate scrub item. If corruption is found in the inode header or inode btree and }(hjhhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubhXD is permitted to perform repairs, then those scrub items are repaired to prepare for phase 3. 
Repairs are implemented by using the information in the scrub item to resubmit the kernel scrub call with the repair flag enabled; this is discussed in the next section. Optimizations and all other repairs are deferred to phase 4.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hXyCheck all metadata of every file in the filesystem. Each metadata structure is also scheduled as a separate scrub item. If repairs are needed and ``xfs_scrub`` is permitted to perform repairs, and there were no problems detected during phase 2, then those scrub items are repaired immediately. Optimizations, deferred repairs, and unsuccessful repairs are deferred to phase 4. h]h)}(hXxCheck all metadata of every file in the filesystem. Each metadata structure is also scheduled as a separate scrub item. If repairs are needed and ``xfs_scrub`` is permitted to perform repairs, and there were no problems detected during phase 2, then those scrub items are repaired immediately. Optimizations, deferred repairs, and unsuccessful repairs are deferred to phase 4.h](hCheck all metadata of every file in the filesystem. Each metadata structure is also scheduled as a separate scrub item. If repairs are needed and }(hjhhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh is permitted to perform repairs, and there were no problems detected during phase 2, then those scrub items are repaired immediately. Optimizations, deferred repairs, and unsuccessful repairs are deferred to phase 4.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hX All remaining repairs and scheduled optimizations are performed during this phase, if the caller permits them. Before starting repairs, the summary counters are checked and any necessary repairs are performed so that subsequent repairs will not fail the resource reservation step due to wildly incorrect summary counters. Unsuccessful repairs are requeued as long as forward progress on repairs is made somewhere in the filesystem. Free space in the filesystem is trimmed at the end of phase 4 if the filesystem is clean. h]h)}(hX All remaining repairs and scheduled optimizations are performed during this phase, if the caller permits them. Before starting repairs, the summary counters are checked and any necessary repairs are performed so that subsequent repairs will not fail the resource reservation step due to wildly incorrect summary counters. Unsuccessful repairs are requeued as long as forward progress on repairs is made somewhere in the filesystem. Free space in the filesystem is trimmed at the end of phase 4 if the filesystem is clean.h]hX All remaining repairs and scheduled optimizations are performed during this phase, if the caller permits them. Before starting repairs, the summary counters are checked and any necessary repairs are performed so that subsequent repairs will not fail the resource reservation step due to wildly incorrect summary counters. Unsuccessful repairs are requeued as long as forward progress on repairs is made somewhere in the filesystem. Free space in the filesystem is trimmed at the end of phase 4 if the filesystem is clean.}(hj*hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM#hj&ubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hXcBy the start of this phase, all primary and secondary filesystem metadata must be correct. Summary counters such as the free space counts and quota resource counts are checked and corrected. 
   Directory entry names and extended attribute names are checked for
   suspicious entries such as control characters or confusing Unicode
   sequences appearing in names.

6. If the caller asks for a media scan, read all allocated and written data
   file extents in the filesystem.
   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
   If media errors occur, they will be mapped to the owning files and
   reported.

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.
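The fragment below sketches the resubmission mentioned in phase 2: when a
scrub item reports corruption, the same item is submitted again through the
``XFS_IOC_SCRUB_METADATA`` ioctl with the repair flag set.
The ioctl, structure, and flag names are from the XFS userspace API headers
shipped by xfsprogs; the helper function and the abbreviated error handling
are inventions of this sketch, not the actual ``xfs_scrub`` source.

.. code-block:: c

	#include <sys/ioctl.h>
	#include <xfs/xfs.h>

	/*
	 * Check one AG's free space btree; resubmit with the repair flag
	 * if corruption is found.  fd is any open file on the filesystem.
	 */
	static int check_and_repair_bnobt(int fd, __u32 agno)
	{
		struct xfs_scrub_metadata sm = {
			.sm_type = XFS_SCRUB_TYPE_BNOBT,
			.sm_agno = agno,
		};

		/* First pass: check only. */
		if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
			return -1;
		if (!(sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
			return 0;

		/* Resubmit the same scrub item with repairs enabled. */
		sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
		return ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm);
	}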
This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
later in this document.

Steps for Each Scrub Item
-------------------------

The kernel scrub code uses a three-step strategy for checking and repairing
the one aspect of a metadata object represented by a scrub item (outlined in
code after the list):

1. The scrub item of interest is checked for corruptions; opportunities for
   optimization; and for values that are directly controlled by the system
   administrator but look suspicious.
   If the item is not corrupt or does not need optimization, resources are
   released and the positive scan results are returned to userspace.
   If the item is corrupt or could be optimized but the caller does not
   permit this, resources are released and the negative scan results are
   returned to userspace.
   Otherwise, the kernel moves on to the second step.

2. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other
   metadata rather than try to salvage the existing structure.
   If the repair fails, the scan results from the first step are returned to
   userspace.
   Otherwise, the kernel moves on to the third step.

3. In the third step, the kernel runs the same checks over the new metadata
   item to assess the efficacy of the repairs.
   The results of the reassessment are returned to userspace.
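In pseudocode, the three steps form a check/repair/re-check sequence.
Every name in this outline is a hypothetical stand-in for the real entry
points in the kernel's ``fs/xfs/scrub/`` directory; the two helpers are
declared but deliberately left undefined.

.. code-block:: c

	#include <stdbool.h>

	struct scrub_context {
		bool corrupt;		/* set by the check step */
		bool repair_allowed;	/* caller passed the repair flag */
	};

	static int check_metadata(struct scrub_context *sc);	/* steps 1, 3 */
	static int repair_metadata(struct scrub_context *sc);	/* step 2 */

	int scrub_one_item(struct scrub_context *sc)
	{
		int error;

		/* Step 1: look for corruption or chances to optimize. */
		error = check_metadata(sc);
		if (error)
			return error;
		if (!sc->corrupt || !sc->repair_allowed)
			return 0;	/* report the scan results */

		/* Step 2: rebuild the structure from other metadata. */
		error = repair_metadata(sc);
		if (error)
			return 0;	/* return the step-1 results instead */

		/* Step 3: re-check the rebuilt item to assess the repair. */
		return check_metadata(sc);
	}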
Classification of Metadata
--------------------------

Each type of metadata object (and therefore each type of scrub item) is
classified as follows:

Primary Metadata
````````````````

Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or because they
index objects created by the user.
Most filesystem objects fall into this class:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

Scrub obeys the same rules as regular filesystem accesses for resource and
lock acquisition.

Primary metadata objects are the simplest for scrub to process.
The principal filesystem object (either an allocation group or an inode)
that owns the item being scrubbed is locked to guard against concurrent
updates.
The check function examines every record associated with the type for
obvious errors and cross-references healthy records against other metadata
to look for inconsistencies.
Repairs for this class of scrub item are simple, since the repair function
starts by holding all the resources acquired in the previous step.
The repair function scans available metadata as needed to record all the
observations needed to complete the structure.
Next, it stages the observations in a new ondisk structure and commits it
atomically to complete the repair.
Finally, the storage from the old data structure is carefully reaped.
Because ``xfs_scrub`` locks a primary object for the duration of the repair,
this is effectively an offline repair operation performed on a subset of the
filesystem.
This minimizes the complexity of the repair code because it is not necessary
to handle concurrent updates from other threads, nor is it necessary to
access any other part of the filesystem.
As a result, indexed structures can be rebuilt very quickly, and programs
trying to access the damaged structure will be blocked until repairs
complete.
The only infrastructure needed by the repair code is the staging area for
observations and a means to write new structures to disk.
Despite these limitations, the advantage that online repair holds is clear:
targeted work on individual shards of the filesystem avoids total loss of
service.

This mechanism is described in section 2.1 ("Off-Line Algorithm") of
V. Srinivasan and M. J. Carey,
`"Performance of On-Line Index Construction Algorithms"
<https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
*Extending Database Technology*, pp. 293-309, 1992.

Most primary metadata repair functions stage their intermediate results in
an in-memory array prior to formatting the new ondisk structure, which is
very similar to the list-based algorithm discussed in section 2.3
("List-Based Algorithms") of Srinivasan.
However, any data structure builder that maintains a resource lock for the
duration of the repair is *always* an offline algorithm.
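The lock-scan-stage-commit pattern described above can be summarized as
follows.
This is a structural outline only: every name is a hypothetical stand-in
for the real repair code in ``fs/xfs/scrub/``, and the helpers are declared
but not defined.

.. code-block:: c

	struct repair_ctx;

	void lock_owner(struct repair_ctx *rc);
	void unlock_owner(struct repair_ctx *rc);
	int scan_other_metadata(struct repair_ctx *rc);	/* record observations */
	void sort_observations(struct repair_ctx *rc);	/* in-memory array */
	int commit_new_structure(struct repair_ctx *rc);/* atomic switch-over */
	int reap_old_blocks(struct repair_ctx *rc);

	int repair_primary_index(struct repair_ctx *rc)
	{
		int error;

		/* The owning AG or inode stays locked for the whole repair. */
		lock_owner(rc);

		/* Collect every observation needed to rebuild the structure. */
		error = scan_other_metadata(rc);
		if (error)
			goto out_unlock;

		/* Stage, format, and atomically commit the new structure. */
		sort_observations(rc);
		error = commit_new_structure(rc);
		if (error)
			goto out_unlock;

		/* Carefully reap the storage of the old structure. */
		error = reap_old_blocks(rc);
	out_unlock:
		unlock_owner(rc);
		return error;
	}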
.. _secondary_metadata:

Secondary Metadata
``````````````````

Metadata structures in this category reflect records found in primary
metadata, but are only needed for online fsck or for reorganization of the
filesystem.

Secondary metadata include:

- Reverse mapping information

- Directory parent pointers

This class of metadata is difficult for scrub to process because scrub
attaches to the secondary object but needs to check primary metadata, which
runs counter to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild
the metadata.
Check functions can be limited in scope to reduce runtime.
Repairs, however, require a full scan of primary metadata, which can take a
long time to complete.
Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
duration of the repair.

Instead, repair functions set up an in-memory staging structure to store
observations.
Depending on the requirements of the specific repair function, the staging
index will either have the same format as the ondisk structure or a design
specific to that repair function.
The next step is to release all locks and start the filesystem scan.
When the repair scanner needs to record an observation, the staging data are
locked long enough to apply the update.
While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used
to write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure is carefully reaped.
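The ordering of staging, hooking, scanning, and the final atomic commit can
be outlined as follows.
As before, every name in this sketch is a hypothetical stand-in, declared
but not defined; it illustrates only the shape of the algorithm described
in the previous paragraph.

.. code-block:: c

	struct repair_ctx;

	void setup_staging(struct repair_ctx *rc);
	void free_staging(struct repair_ctx *rc);
	void enable_hooks(struct repair_ctx *rc);	/* capture live updates */
	void disable_hooks(struct repair_ctx *rc);
	int scan_primary_metadata(struct repair_ctx *rc);
	void lock_owner(struct repair_ctx *rc);
	void unlock_owner(struct repair_ctx *rc);
	int commit_new_structure(struct repair_ctx *rc);
	int reap_old_blocks(struct repair_ctx *rc);

	int repair_secondary_index(struct repair_ctx *rc)
	{
		int error;

		setup_staging(rc);

		/* Hook the filesystem so that concurrent updates reach the
		 * staging data while the scan runs without long-term locks. */
		enable_hooks(rc);
		error = scan_primary_metadata(rc);
		if (error)
			goto out_teardown;

		/* Re-lock the owner, then write and commit the new ondisk
		 * structure atomically from the staged live data. */
		lock_owner(rc);
		error = commit_new_structure(rc);
		unlock_owner(rc);

	out_teardown:
		disable_hooks(rc);
		free_staging(rc);
		if (!error)
			error = reap_old_blocks(rc);
		return error;
	}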
Introducing concurrency helps online repair avoid various locking problems,
but comes at a high cost to code complexity.
Live filesystem code has to be hooked so that the repair function can
observe updates in progress.
The staging area has to become a fully functional parallel structure so that
updates can be merged from the hooks.
Finally, the hook, the filesystem scan, and the inode locking model must be
sufficiently well integrated that a hook event can decide if a given update
should be applied to the staging structure.

In theory, the scrub implementation could apply these same techniques for
primary metadata, but doing so would make it massively more complex and less
performant.
Programs attempting to access the damaged structures are not blocked from
operation, which may cause application failure or an unplanned filesystem
shutdown.

Inspiration for the secondary metadata repair strategy was drawn from
section 2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without
Side-File") and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan,
`"Algorithms for Creating Indexes for Very Large Tables Without Quiescing
Updates" <https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.

The sidecar index mentioned above bears some resemblance to the side file
method mentioned in Srinivasan and Mohan.
Their method consists of an index builder that extracts relevant record data
to build the new structure as quickly as possible; and an auxiliary
structure that captures all updates that would be committed to the index by
other threads were the new index already online.
After the index building scan finishes, the updates recorded in the side
file are applied to the new index.
To avoid conflicts between the index builder and other writer threads, the
builder maintains a publicly visible cursor that tracks the progress of the
scan through the record space.
To avoid duplication of work between the side file and the index builder,
side file updates are elided when the record ID for the update is greater
than the cursor position within the record ID space; a sketch of this
elision rule follows.
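The sketch below is a self-contained model of the elision rule from
Srinivasan and Mohan, not XFS code: an update behind the published scan
cursor must be captured in the side file, while an update ahead of it will
be observed by the builder's own scan.

.. code-block:: c

	#include <stdint.h>

	struct index_builder {
		uint64_t scan_cursor;	/* highest record ID scanned so far */
	};

	struct side_file_entry {
		uint64_t record_id;
		uint64_t new_value;
	};

	/*
	 * Returns 1 if the update must be captured in the side file, or 0
	 * if it is elided because the index builder's scan has not yet
	 * reached this record and will pick it up later.
	 */
	int side_file_wants_update(const struct index_builder *ib,
				   const struct side_file_entry *update)
	{
		return update->record_id <= ib->scan_cursor;
	}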
To minimize changes to the rest of the codebase, XFS online repair keeps the
replacement index hidden until it's completely ready to go.
In other words, there is no attempt to expose the keyspace of the new index
while repair is running.
The complexity of such an approach would be very high and perhaps more
appropriate to building *new* indices.

**Future Work Question**: Can the full scan and live update code used to
facilitate a repair also be used to implement a comprehensive check?

*Answer*: In theory, yes.
Check would be much stronger if each scrub function employed these live
scans to build a shadow copy of the metadata and then compared the shadow
records to the ondisk records.
However, doing that is a fair amount more work than what the checking
functions do now.
The live scans and hooks were developed much later.
That in turn increases the runtime of those scrub functions.

Summary Information
```````````````````

Metadata structures in this last category summarize the contents of primary
metadata records.
These are often used to speed up resource usage queries, and are many times
smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses.
Graefe, }(hjhhhNhNubj)}(h`"Concurrent Queries and Updates in Summary Views and Their Indexes" `_h]hG“Concurrent Queries and Updates in Summary Views and Their Indexes”}(hjhhhNhNubah}(h]h ]h"]h$]h&]nameC"Concurrent Queries and Updates in Summary Views and Their Indexes"jjChttp://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdfuh1jhjubh)}(hF h]h}(h]Aconcurrent-queries-and-updates-in-summary-views-and-their-indexesah ]h"]C"concurrent queries and updates in summary views and their indexes"ah$]h&]refurijuh1hjyKhjubh, 2011.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj5hhubh)}(hXSince quotas are non-negative integer counts of resource usage, online quotacheck can use the incremental view deltas described in section 2.14 to track pending changes to the block and inode usage counts in each transaction, and commit those changes to a dquot side file when the transaction commits. Delta tracking is necessary for dquots because the index builder scans inodes, whereas the data structure being rebuilt is an index of dquots. Link count checking combines the view deltas and commit step into one because it sets attributes of the objects being scanned instead of writing them to a separate data structure. Each online fsck function will be discussed as case studies later in this document.h]hXSince quotas are non-negative integer counts of resource usage, online quotacheck can use the incremental view deltas described in section 2.14 to track pending changes to the block and inode usage counts in each transaction, and commit those changes to a dquot side file when the transaction commits. Delta tracking is necessary for dquots because the index builder scans inodes, whereas the data structure being rebuilt is an index of dquots. Link count checking combines the view deltas and commit step into one because it sets attributes of the objects being scanned instead of writing them to a separate data structure. Each online fsck function will be discussed as case studies later in this document.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM%hj5hhubeh}(h]jah ]h"]summary informationah$]h&]uh1hhjFhhhhhMubeh}(h]jah ]h"]classification of metadataah$]h&]uh1hhjjhhhhhM]ubh)}(hhh](h)}(hRisk Managementh]hRisk Management}(hjhhhNhNubah}(h]h ]h"]h$]h&]jjEuh1hhjhhhhhM2ubh)}(hDuring the development of online fsck, several risk factors were identified that may make the feature unsuitable for certain distributors and users. Steps can be taken to mitigate or eliminate those risks, though at a cost to functionality.h]hDuring the development of online fsck, several risk factors were identified that may make the feature unsuitable for certain distributors and users. Steps can be taken to mitigate or eliminate those risks, though at a cost to functionality.}(hj,hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM4hjhhubh)}(hhh](h)}(hX**Decreased performance**: Adding metadata indices to the filesystem increases the time cost of persisting changes to disk, and the reverse space mapping and directory parent pointers are no exception. System administrators who require the maximum performance can disable the reverse mapping features at format time, though this choice dramatically reduces the ability of online fsck to find inconsistencies and repair them. h]h)}(hX**Decreased performance**: Adding metadata indices to the filesystem increases the time cost of persisting changes to disk, and the reverse space mapping and directory parent pointers are no exception. 
Risk Management
---------------

During the development of online fsck, several risk factors were identified
that may make the feature unsuitable for certain distributors and users.
Steps can be taken to mitigate or eliminate those risks, though at a cost to
functionality.

- **Decreased performance**: Adding metadata indices to the filesystem
  increases the time cost of persisting changes to disk, and the reverse
  space mapping and directory parent pointers are no exception.
  System administrators who require the maximum performance can disable the
  reverse mapping features at format time, though this choice dramatically
  reduces the ability of online fsck to find inconsistencies and repair
  them.

- **Incorrect repairs**: As with all software, there might be defects in
  the software that result in incorrect repairs being written to the
  filesystem.
  Systematic fuzz testing (detailed in the next section) is employed by the
  authors to find bugs early, but it might not catch everything.
  The kernel build system provides Kconfig options
  (``CONFIG_XFS_ONLINE_SCRUB`` and ``CONFIG_XFS_ONLINE_REPAIR``) to enable
  distributors to choose not to accept this risk.
  The xfsprogs build system has a configure option (``--enable-scrub=no``)
  that disables building of the ``xfs_scrub`` binary, though this is not a
  risk mitigation if the kernel functionality remains enabled.
- **Inability to repair**: Sometimes, a filesystem is too badly damaged to
  be repairable.
  If the keyspaces of several metadata indices overlap in some manner but a
  coherent narrative cannot be formed from records collected, then the
  repair fails.
  To reduce the chance that a repair will fail with a dirty transaction and
  render the filesystem unusable, the online repair functions have been
  designed to stage and validate all new records before committing the new
  structure.

- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
  devices, opening files by handle, ignoring Unix discretionary access
  control, and the ability to perform administrative changes.
  Running this automatically in the background scares people, so the
  systemd background service is configured to run with only the privileges
  required.
  Obviously, this cannot address certain problems like the kernel crashing
  or deadlocking, but it should be sufficient to prevent the scrub process
  from escaping and reconfiguring the system.
  The cron job does not have this protection.
- **Fuzz Kiddiez**: There are many people now who seem to think that
  running automated fuzz testing of ondisk artifacts to find mischievous
  behavior and spraying exploit code onto the public mailing list for
  instant zero-day disclosure is somehow of some social benefit.
  In the view of this author, the benefit is realized only when the fuzz
  operators help to **fix** the flaws, but this opinion apparently is not
  widely shared among security "researchers".
  The XFS maintainers' continuing ability to manage these events presents
  an ongoing risk to the stability of the development process.
  Automated testing should front-load some of the risk while the feature is
  considered EXPERIMENTAL.

Many of these risks are inherent to software programming.
Despite this, it is hoped that this new functionality will prove useful in
reducing unexpected downtime.

3. Testing Plan
===============

As stated before, fsck tools have three main goals:

1. Detect inconsistencies in the metadata;

2. Eliminate those inconsistencies; and
3. Minimize further loss of data.

Demonstrations of correct operation are necessary to build users' confidence
that the software behaves within expectations.
Unfortunately, it was not really feasible to perform regular exhaustive
testing of every aspect of a fsck tool until the introduction of low-cost
virtual machines with high-IOPS storage.
With ample hardware availability in mind, the testing strategy for the
online fsck project involves differential analysis against the existing fsck
tools and systematic testing of every attribute of every type of metadata
object.
Testing can be split into four major categories, as discussed below.

Integrated Testing with fstests
-------------------------------

The primary goal of any free software QA effort is to make testing as
inexpensive and widespread as possible to maximize the scaling advantages of
community.
In other words, testing should maximize the breadth of filesystem
configuration scenarios and hardware setups.
This improves code quality by enabling the authors of online fsck to find
and fix bugs early, and helps developers of new features to find integration
issues earlier in their development effort.

The Linux filesystem community shares a common QA testing suite,
`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
functional and regression testing.
Even before development work began on online fsck, fstests (when run on XFS)
would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test
and scratch filesystems between each test.
This provides a level of assurance that the kernel and the fsck tools stay
in alignment about what constitutes consistent metadata.
During development of the online checking code, fstests was modified to run
``xfs_scrub -n`` between each test to ensure that the new checking code
produces the same results as the two existing fsck tools.

To start development of online repair, fstests was modified to run
``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
This ensures that offline repair does not crash, leave a corrupt filesystem
after it exits, or trigger complaints from the online check.
This also established a baseline for what can and cannot be repaired
offline.
To complete the first phase of development of online repair, fstests was
modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
This enables a comparison of the effectiveness of online repair as compared
to the existing offline repair tools.

General Fuzz Testing of Metadata Blocks
---------------------------------------

XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.

Before development of online fsck even began, a set of fstests was created
to test the rather common fault that entire metadata blocks get corrupted.
This required the creation of fstests library code that can create a
filesystem containing every possible type of metadata object.
Next, individual test cases were created to create a test filesystem,
identify a single block of a specific type of metadata object, trash it with
the existing ``blocktrash`` command in ``xfs_db``, and test the reaction of
a particular metadata validation strategy.

This earlier test suite enabled XFS developers to test the ability of the
in-kernel validation functions, and of the offline fsck tool, to detect and
eliminate inconsistent metadata.
This part of the test suite was extended to cover online fsck in exactly the
same manner.

In other words, for a given fstests filesystem configuration:

* For each metadata object existing on the filesystem:

  * Write garbage to it

  * Test the reactions of:

    1. The kernel verifiers to stop obviously bad metadata

    2. Offline repair (``xfs_repair``) to detect and fix

    3. Online repair (``xfs_scrub``) to detect and fix

Targeted Fuzz Testing of Metadata Records
-----------------------------------------

The testing plan for online fsck includes extending the existing fs testing
infrastructure to provide a much more powerful facility: targeted fuzz
testing of every metadata field of every metadata object in the filesystem.
``xfs_db`` can modify every field of every metadata structure in every block
in the filesystem to simulate the effects of memory corruption and software
bugs.
Given that fstests already contains the ability to create a filesystem
containing every metadata format known to the filesystem, ``xfs_db`` can be
used to perform exhaustive fuzz testing!

For a given fstests filesystem configuration:

* For each metadata object existing on the filesystem...

  * For each record inside that metadata object...

    * For each field inside that record...

      * For each conceivable type of transformation that can be applied to
        a bit field (the transformations are sketched after this list)...

        1. Clear all bits

        2. Set all bits

        3. Toggle the most significant bit

        4. Toggle the middle bit

        5. Toggle the least significant bit

        6. Add a small quantity

        7. Subtract a small quantity

        8. Randomize the contents

      * ...test the reactions of:

        1. The kernel verifiers to stop obviously bad metadata

        2. Offline checking (``xfs_repair -n``)
        3. Offline repair (``xfs_repair``)

        4. Online checking (``xfs_scrub -n``)

        5. Online repair (``xfs_scrub``)

        6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if
           online repair doesn't succeed)
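The eight transformations listed above can be expressed compactly in C.
This self-contained helper is hypothetical and exists only to make the list
concrete; the actual fuzzing is driven through ``xfs_db``, not through code
like this.

.. code-block:: c

	#include <stdint.h>
	#include <stdlib.h>

	/* Apply transformation 'method' (1-8) to a field of 'width' bits.
	 * Assumes 1 <= width <= 64; rand() is used unseeded for brevity. */
	uint64_t fuzz_field(uint64_t value, unsigned int width, int method)
	{
		uint64_t mask = (width >= 64) ? ~0ULL : (1ULL << width) - 1;

		switch (method) {
		case 1: return 0;				/* clear all bits */
		case 2: return mask;				/* set all bits */
		case 3: return value ^ (1ULL << (width - 1));	/* flip MSB */
		case 4: return value ^ (1ULL << (width / 2));	/* flip middle */
		case 5: return value ^ 1;			/* flip LSB */
		case 6: return (value + 3) & mask;		/* add a little */
		case 7: return (value - 3) & mask;		/* subtract a little */
		case 8: return (((uint64_t)rand() << 32) ^ rand()) & mask;
		}
		return value;
	}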
Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) h](h)}(h...test the reactions of:h]h...test the reactions of:}(hju"hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjq"ubji)}(hhh](h)}(h3The kernel verifiers to stop obviously bad metadatah]h)}(hj"h]h3The kernel verifiers to stop obviously bad metadata}(hj"hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj"ubah}(h]h ]h"]h$]h&]uh1hhj"ubh)}(h$Offline checking (``xfs_repair -n``)h]h)}(hj"h](hOffline checking (}(hj"hhhNhNubj)}(h``xfs_repair -n``h]h xfs_repair -n}(hj"hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj"ubh)}(hj"hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj"ubah}(h]h ]h"]h$]h&]uh1hhj"ubh)}(hOffline repair (``xfs_repair``)h]h)}(hj"h](hOffline repair (}(hj"hhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hj"hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj"ubh)}(hj"hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj"ubah}(h]h ]h"]h$]h&]uh1hhj"ubh)}(h"Online checking (``xfs_scrub -n``)h]h)}(hj"h](hOnline checking (}(hj"hhhNhNubj)}(h``xfs_scrub -n``h]h xfs_scrub -n}(hj"hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj"ubh)}(hj"hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj"ubah}(h]h ]h"]h$]h&]uh1hhj"ubh)}(hOnline repair (``xfs_scrub``)h]h)}(hj#h](hOnline repair (}(hj#hhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hj##hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#ubh)}(hj#hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj#ubah}(h]h ]h"]h$]h&]uh1hhj"ubh)}(h[Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) h]h)}(hZBoth repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)h](hBoth repair tools (}(hjE#hhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hjM#hhhNhNubah}(h]h ]h"]h$]h&]uh1jhjE#ubh and then }(hjE#hhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hj_#hhhNhNubah}(h]h ]h"]h$]h&]uh1jhjE#ubh$ if online repair doesn’t succeed)}(hjE#hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjA#ubah}(h]h ]h"]h$]h&]uh1hhj"ubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjq"ubeh}(h]h ]h"]h$]h&]uh1hhjn"ubah}(h]h ]h"]h$]h&]jJj uh1hhhhMhj!ubeh}(h]h ]h"]h$]h&]uh1hhj!ubah}(h]h ]h"]h$]h&]jJj uh1hhhhMhj!ubeh}(h]h ]h"]h$]h&]uh1hhj!ubah}(h]h ]h"]h$]h&]jJj uh1hhhhMhjp!ubeh}(h]h ]h"]h$]h&]uh1hhjm!ubah}(h]h ]h"]h$]h&]jJj uh1hhhhMhj[!ubeh}(h]h ]h"]h$]h&]uh1hhjX!hhhNhNubah}(h]h ]h"]h$]h&]jJj uh1hhhhMhj!hhubh)}(h)This is quite the combinatoric explosion!h]h)This is quite the combinatoric explosion!}(hj#hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj!hhubh)}(hXFortunately, having this much test coverage makes it easy for XFS developers to check the responses of XFS' fsck tools. Since the introduction of the fuzz testing framework, these tests have been used to discover incorrect repair code and missing functionality for entire classes of metadata objects in ``xfs_repair``. The enhanced testing was used to finalize the deprecation of ``xfs_check`` by confirming that ``xfs_repair`` could detect at least as many corruptions as the older tool.h](hX1Fortunately, having this much test coverage makes it easy for XFS developers to check the responses of XFS’ fsck tools. Since the introduction of the fuzz testing framework, these tests have been used to discover incorrect repair code and missing functionality for entire classes of metadata objects in }(hj#hhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hj#hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#ubh?. 
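
To make the transformation list concrete, the following is a minimal C sketch
of what the eight bit field transformations might look like when applied to
fields up to 64 bits wide.
The enum and function names here are illustrative only; they roughly mirror
the set of fuzz actions that ``xfs_db`` provides, but ``xfs_db`` implements
them internally with its own field abstractions.

.. code-block:: c

    #include <stdint.h>
    #include <stdlib.h>

    enum fuzz_action {
        FUZZ_ZEROES,        /* 1. clear all bits */
        FUZZ_ONES,          /* 2. set all bits */
        FUZZ_FIRSTBIT,      /* 3. toggle the most significant bit */
        FUZZ_MIDDLEBIT,     /* 4. toggle the middle bit */
        FUZZ_LASTBIT,       /* 5. toggle the least significant bit */
        FUZZ_ADD,           /* 6. add a small quantity */
        FUZZ_SUB,           /* 7. subtract a small quantity */
        FUZZ_RANDOM,        /* 8. randomize the contents */
    };

    /* Apply one fuzz transformation to a field that is nbits wide. */
    static uint64_t fuzz_field(uint64_t value, unsigned int nbits,
                               enum fuzz_action action)
    {
        uint64_t mask = (nbits >= 64) ? ~0ULL : (1ULL << nbits) - 1;

        switch (action) {
        case FUZZ_ZEROES:
            return 0;
        case FUZZ_ONES:
            return mask;
        case FUZZ_FIRSTBIT:
            return value ^ (1ULL << (nbits - 1));
        case FUZZ_MIDDLEBIT:
            return value ^ (1ULL << (nbits / 2));
        case FUZZ_LASTBIT:
            return value ^ 1;
        case FUZZ_ADD:
            return (value + 7) & mask;
        case FUZZ_SUB:
            return (value - 7) & mask;
        case FUZZ_RANDOM:
            return (((uint64_t)random() << 32) ^ random()) & mask;
        }
        return value;
    }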

Fortunately, having this much test coverage makes it easy for XFS developers
to check the responses of XFS' fsck tools.
Since the introduction of the fuzz testing framework, these tests have been
used to discover incorrect repair code and missing functionality for entire
classes of metadata objects in ``xfs_repair``.
The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
confirming that ``xfs_repair`` could detect at least as many corruptions as
the older tool.

These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
allow the online fsck developers to compare online fsck against offline fsck,
and they enable XFS developers to find deficiencies in the code base.

Proposed patchsets include
`general fuzzer improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
`fuzzing baselines
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
and `improvements in fuzz testing comprehensiveness
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.

Stress Testing
--------------

A unique requirement of online fsck is the ability to operate on a filesystem
concurrently with regular workloads.
Although it is of course impossible to run ``xfs_scrub`` with *zero*
observable impact on the running system, the online repair code should never
introduce inconsistencies into the filesystem metadata, and regular workloads
should never notice resource starvation.
To verify that these conditions are being met, fstests has been enhanced in
the following ways:

* For each scrub item type, create a test to exercise checking that item type
  while running ``fsstress``.
* For each scrub item type, create a test to exercise repairing that item
  type while running ``fsstress``.
* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
  filesystem doesn't cause problems.
* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
  force-repairing the whole filesystem doesn't cause problems.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress``
  while freezing and thawing the filesystem.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress``
  while remounting the filesystem read-only and read-write.
* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)

Success is defined by the ability to run all of these tests without observing
any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
check warnings, or any other sort of mischief.

Proposed patchsets include
`general stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
and the
`evolution of existing per-function stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.

4. User Interface
=================

The primary user of online fsck is the system administrator, just like
offline repair.
Online fsck presents two modes of operation to administrators: a foreground
CLI process for online fsck on demand, and a background service that performs
autonomous checking and repair.

Checking on Demand
------------------

For administrators who want the absolute freshest information about the
metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
a command line.
The program checks every piece of metadata in the filesystem while the
administrator waits for the results to be reported, just like the existing
``xfs_repair`` tool.
Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
option to increase the verbosity of the information reported.

A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
correction capabilities of the hardware to check data file contents.
The media scan is not enabled by default because it may dramatically increase
program runtime and consume a lot of bandwidth on older storage hardware.

The output of a foreground invocation is captured in the system log.

The ``xfs_scrub_all`` program walks the list of mounted filesystems and
initiates ``xfs_scrub`` for each of them in parallel.
It serializes scans for any filesystems that resolve to the same top level
kernel block device to prevent resource overconsumption.

Background Service
------------------

To reduce the workload of system administrators, the ``xfs_scrub`` package
provides a suite of `systemd <https://systemd.io/>`_ timers and services that
run online fsck automatically on weekends by default.
The background service configures scrub to run with as little privilege as
possible, the lowest CPU and IO priority, and in a CPU-constrained single
threaded mode.
This can be tuned by the systemd administrator at any time to suit the
latency and throughput requirements of customer workloads.

The output of the background service is also captured in the system log.
If desired, reports of failures (either due to inconsistencies or mere
runtime errors) can be emailed automatically by setting the ``EMAIL_ADDR``
environment variable in the following service files:

* ``xfs_scrub_fail@.service``
* ``xfs_scrub_media_fail@.service``
* ``xfs_scrub_all_fail.service``

The decision to enable the background scan is left to the system
administrator.
This can be done by enabling either of the following services:

* ``xfs_scrub_all.timer`` on systemd systems
* ``xfs_scrub_all.cron`` on non-systemd systems

This automatic weekly scan is configured out of the box to perform an
additional media scan of all file data once per month.
This is less foolproof than, say, storing file data block checksums, but much
more performant if application software provides its own integrity checking,
redundancy can be provided elsewhere above the filesystem, or the storage
device's integrity guarantees are deemed sufficient.

The systemd unit file definitions have been subjected to a security audit
(as of systemd 249) to ensure that the xfs_scrub processes have as little
access to the rest of the system as possible.
This was performed via ``systemd-analyze security``, after which privileges
were restricted to the minimum required; sandboxing and system call filtering
were enabled to the maximal extent possible; and access to the filesystem
tree was restricted to the minimum needed to start the program and access the
filesystem being scanned.
The service definition files restrict CPU usage to 80% of one CPU core, and
apply as nice of a priority to IO and CPU scheduling as possible.
This measure was taken to minimize delays in the rest of the filesystem.
No such hardening has been performed for the cron job.

Proposed patchset:
`Enabling the xfs_scrub background service
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.

Health Reporting
----------------

XFS caches a summary of each filesystem's health status in memory.
The information is updated whenever ``xfs_scrub`` is run, or whenever
inconsistencies are detected in the filesystem metadata during regular
operations.
System administrators should use the ``health`` command of ``xfs_spaceman``
to download this information into a human-readable format.
If problems have been observed, the administrator can schedule a reduced
service window to run the online repair tool to correct the problem.
Failing that, the administrator can decide to schedule a maintenance window
to run the traditional offline repair tool to correct the problem.

**Future Work Question**: Should the health reporting integrate with the new
inotify fs error notification system?
Would it be helpful for sysadmins to have a daemon to listen for corruption
notifications and initiate a repair?

*Answer*: These questions remain unanswered, but should be a part of the
conversation with early adopters and potential downstream users of XFS.
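
One input to that conversation: the kernel already exposes filesystem error
notifications to userspace through the fanotify ``FAN_FS_ERROR`` facility.
The following is a minimal, hypothetical sketch of a daemon that waits for
such notifications on a filesystem mounted at ``/mnt``; decoding of the
per-event information records and all real error handling are omitted, and
nothing here is part of any proposed patchset.

.. code-block:: c

    #include <sys/fanotify.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t len;

        /* FAN_REPORT_FID is required to receive FAN_FS_ERROR events. */
        int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);

        if (fd < 0)
            return 1;

        /* Watch the entire filesystem backing /mnt for errors. */
        if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                          FAN_FS_ERROR, AT_FDCWD, "/mnt") < 0)
            return 1;

        while ((len = read(fd, buf, sizeof(buf))) > 0) {
            struct fanotify_event_metadata *md;

            md = (struct fanotify_event_metadata *)buf;
            for (; FAN_EVENT_OK(md, len); md = FAN_EVENT_NEXT(md, len)) {
                if (md->mask & FAN_FS_ERROR)
                    printf("fs error reported; consider running xfs_scrub\n");
            }
        }
        return 0;
    }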

Proposed patchsets include
`wiring up health reports to correction returns
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
and `preservation of sickness info during memory reclaim
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.

5. Kernel Algorithms and Data Structures
========================================

This section discusses the key algorithms and data structures of the kernel
code that provide the ability to check and repair metadata while the system
is running.
The first chapters in this section reveal the pieces that provide the
foundation for checking metadata.
The remainder of this section presents the mechanisms through which XFS
regenerates itself.

Self Describing Metadata
------------------------

Starting with XFS version 5 in 2012, XFS updated the format of nearly every
ondisk block header to record a magic number, a checksum, a universally
"unique" identifier (UUID), an owner code, the ondisk address of the block,
and a log sequence number.
When loading a block buffer from disk, the magic number, UUID, owner, and
ondisk address confirm that the retrieved block matches the specific owner of
the current filesystem, and that the information contained in the block is
supposed to be found at the ondisk address.
The first three components enable checking tools to disregard alleged
metadata that doesn't belong to the filesystem, and the fourth component
enables the filesystem to detect lost writes.
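
The following illustrative C fragment shows the kinds of tests a buffer
verifier applies to a v5-style self-describing block header.
The structure layout and helper names are invented for clarity and do not
match the kernel's actual definitions.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    struct block_hdr {              /* simplified v5-style header */
        uint32_t magic;             /* structure type */
        uint8_t  uuid[16];          /* filesystem UUID */
        uint64_t owner;             /* owning AG or inode */
        uint64_t blkno;             /* where this block should live */
        uint64_t lsn;               /* last log sequence number */
        uint32_t crc;               /* checksum of the block */
    };

    /* Hypothetical helpers for this sketch. */
    bool crc_is_valid(const void *block, unsigned int len);
    bool uuid_matches_fs(const uint8_t *uuid);

    static bool verify_block(const struct block_hdr *hdr, const void *block,
                             unsigned int len, uint32_t want_magic,
                             uint64_t want_blkno)
    {
        if (!crc_is_valid(block, len))
            return false;       /* torn or corrupt write */
        if (hdr->magic != want_magic)
            return false;       /* wrong structure type */
        if (!uuid_matches_fs(hdr->uuid))
            return false;       /* block from some other filesystem */
        if (hdr->blkno != want_blkno)
            return false;       /* misplaced write */
        return true;
    }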

Whenever a file system operation modifies a block, the change is submitted to
the log as part of a transaction.
The log then processes these transactions, marking them done once they are
safely persisted to storage.
The logging code maintains the checksum and the log sequence number of the
last transactional update.
Checksums are useful for detecting torn writes and other discrepancies that
can be introduced between the computer and its storage devices.
Sequence number tracking enables log recovery to avoid applying out of date
log updates to the filesystem.

These two features improve overall runtime resiliency by providing a means
for the filesystem to detect obvious corruption when reading metadata blocks
from disk, but these buffer verifiers cannot provide any consistency checking
between metadata structures.

For more information, please see the documentation for
Documentation/filesystems/xfs/xfs-self-describing-metadata.rst

Reverse Mapping
---------------

The original design of XFS (circa 1993) is an improvement upon 1980s Unix
filesystem design.
In those days, storage density was expensive, CPU time was scarce, and
excessive seek time could kill performance.
For performance reasons, filesystem authors were reluctant to add redundancy
to the filesystem, even at the cost of data integrity.
Filesystem designers in the early 21st century chose different strategies to
increase internal redundancy -- either storing nearly identical copies of
metadata, or more space-efficient encoding techniques.

For XFS, a different redundancy strategy was chosen to modernize the design:
a secondary space usage index that maps allocated disk extents back to their
owners.
By adding a new index, the filesystem retains most of its ability to scale
well to heavily threaded workloads involving large datasets, since the
primary file metadata (the directory tree, the file block map, and the
allocation groups) remain unchanged.
Like any system that improves redundancy, the reverse-mapping feature
increases overhead costs for space mapping activities.
However, it has two critical advantages: first, the reverse index is key to
enabling online fsck and other requested functionality such as free space
defragmentation, better media failure reporting, and filesystem shrinking.
Second, the different ondisk storage format of the reverse mapping btree
defeats device-level deduplication because the filesystem requires real
redundancy.

+------------------------------------------------------------------------+
| **Sidebar**:                                                           |
+------------------------------------------------------------------------+
| A criticism of adding the secondary index is that it does nothing to   |
| improve the robustness of user data storage itself.                    |
| This is a valid point, but adding a new index for file data block      |
| checksums increases write amplification by turning data overwrites    |
| into copy-writes, which age the filesystem prematurely.                |
| In keeping with thirty years of precedent, users who want file data    |
| integrity can supply as powerful a solution as they require.           |
| As for metadata, the complexity of adding a new secondary index of     |
| space usage is much less than adding volume management and storage     |
| device mirroring to XFS itself.                                        |
| Perfection of RAID and volume management is best left to existing      |
| layers in the kernel.                                                  |
+------------------------------------------------------------------------+

The information captured in a reverse space mapping record is as follows:

.. code-block:: c

    struct xfs_rmap_irec {
        xfs_agblock_t    rm_startblock;   /* extent start block */
        xfs_extlen_t     rm_blockcount;   /* extent length */
        uint64_t         rm_owner;        /* extent owner */
        uint64_t         rm_offset;       /* offset within the owner */
        unsigned int     rm_flags;        /* state flags */
    };

The first two fields capture the location and size of the physical space, in
units of filesystem blocks.
The owner field tells scrub which metadata structure or file inode has been
assigned this space.
For space allocated to files, the offset field tells scrub where the space
was mapped within the file fork.
Finally, the flags field provides extra information about the space usage --
is this an attribute fork extent?  A file mapping btree extent?  Or an
unwritten data extent?

Online filesystem checking judges the consistency of each primary metadata
record by comparing its information against all other space indices.
The reverse mapping index plays a key role in the consistency checking
process because it contains a centralized alternate copy of all space
allocation information.
Program runtime and ease of resource acquisition are the only real limits to
what online checking can consult.
For example, a file data extent mapping can be checked against the following
(a sketch of this cross-check follows the list):

* The absence of an entry in the free space information.
* The absence of an entry in the inode index.
* The absence of an entry in the reference count data if the file is not
  marked as having shared extents.
* The correspondence of an entry in the reverse mapping information.
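
To illustrate, here is a highly simplified sketch of the cross-referencing of
one file data extent.
The helper names are hypothetical stand-ins for the kernel's actual btree
query functions, and all locking, AG iteration, and error handling are
omitted.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t xfs_agblock_t;     /* simplified for this sketch */
    typedef uint32_t xfs_extlen_t;

    struct scrub_ctx;                   /* stand-in for the scrub context */

    /* Hypothetical query helpers; the kernel has analogous functions. */
    bool bnobt_has_extent(struct scrub_ctx *sc, xfs_agblock_t bno,
                          xfs_extlen_t len);
    bool inobt_has_inodes_at(struct scrub_ctx *sc, xfs_agblock_t bno,
                             xfs_extlen_t len);
    bool refcountbt_is_shared(struct scrub_ctx *sc, xfs_agblock_t bno,
                              xfs_extlen_t len);
    bool rmapbt_has_mapping(struct scrub_ctx *sc, xfs_agblock_t bno,
                            xfs_extlen_t len, uint64_t owner,
                            uint64_t offset);
    int mark_corrupt(struct scrub_ctx *sc);

    /* Cross-reference one file data extent against the space indices. */
    int xchk_file_extent(struct scrub_ctx *sc, xfs_agblock_t bno,
                         xfs_extlen_t len, uint64_t ino, uint64_t off,
                         bool is_reflinked)
    {
        /* No part of the extent may also be listed as free space. */
        if (bnobt_has_extent(sc, bno, len))
            return mark_corrupt(sc);
        /* The extent must not overlap the inode chunks. */
        if (inobt_has_inodes_at(sc, bno, len))
            return mark_corrupt(sc);
        /* Only files marked as reflinked may have refcount records. */
        if (!is_reflinked && refcountbt_is_shared(sc, bno, len))
            return mark_corrupt(sc);
        /* The rmap index must record exactly this owner and offset. */
        if (!rmapbt_has_mapping(sc, bno, len, ino, off))
            return mark_corrupt(sc);
        return 0;
    }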

There are several observations to make about reverse mapping indices:

1. Reverse mappings can provide a positive affirmation of correctness if any
   of the above primary metadata are in doubt.
   The checking code for most primary metadata follows a path similar to the
   one outlined above.

2. Proving the consistency of secondary metadata with the primary metadata
   is difficult because that requires a full scan of all primary space
   metadata, which is very time intensive.
   For example, checking a reverse mapping record for a file extent mapping
   btree block requires locking the file and searching the entire btree to
   confirm the block.
   Instead, scrub relies on rigorous cross-referencing during the primary
   space mapping structure checks.

3. Consistency scans must use non-blocking lock acquisition primitives if
   the required locking order is not the same order used by regular
   filesystem operations.
   For example, if the filesystem normally takes a file ILOCK before taking
   the AGF buffer lock but scrub wants to take a file ILOCK while holding an
   AGF buffer lock, scrub cannot block on that second acquisition.
   This means that forward progress during this part of a scan of the
   reverse mapping data cannot be guaranteed if system load is heavy.

In summary, reverse mappings play a key role in reconstruction of primary
metadata.
The details of how these records are staged, written to disk, and committed
into the filesystem are covered in subsequent sections.

Checking and Cross-Referencing
------------------------------

The first step of checking a metadata structure is to examine every record
contained within the structure and its relationship with the rest of the
system.
XFS contains multiple layers of checking to try to prevent inconsistent
metadata from wreaking havoc on the system.
Each of these layers contributes information that helps the kernel to make
three decisions about the health of a metadata structure (a sketch of how
userspace observes these decisions follows the list):

- Is a part of this structure obviously corrupt
  (``XFS_SCRUB_OFLAG_CORRUPT``)?
- Is this structure inconsistent with the rest of the system
  (``XFS_SCRUB_OFLAG_XCORRUPT``)?
- Is there so much damage around the filesystem that cross-referencing is
  not possible (``XFS_SCRUB_OFLAG_XFAIL``)?
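
The structure, ioctl, and flag names below are taken from the XFS scrub
ioctl interface; the exact header to include may vary (xfsprogs installs
``xfs/xfs.h``), and the error handling here is deliberately minimal.
This hypothetical snippet sketches how a program might invoke a single scrub
request and interpret the outcome flags.

.. code-block:: c

    #include <xfs/xfs.h>        /* struct xfs_scrub_metadata et al. */
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        struct xfs_scrub_metadata sm = {
            .sm_type = XFS_SCRUB_TYPE_BNOBT,    /* free space by block */
            .sm_agno = 0,                       /* allocation group 0 */
        };
        int fd = open("/mnt", O_RDONLY);

        if (fd < 0 || ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
            return 1;

        if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
            printf("structure is obviously corrupt\n");
        if (sm.sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)
            printf("structure disagrees with cross-references\n");
        if (sm.sm_flags & XFS_SCRUB_OFLAG_XFAIL)
            printf("cross-referencing could not be attempted\n");
        return 0;
    }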

Validation of Userspace-Controlled Record Attributes
````````````````````````````````````````````````````

Various pieces of filesystem metadata are directly controlled by userspace,
so validation work cannot be more precise than checking that a value is
within the possible range.
These fields include:

- Filesystem labels
- File timestamps
- File permissions
- File size
- File flags
- Names present in directory entries, extended attribute keys, and
  filesystem labels
- Extended attribute key namespaces
- Extended attribute values
- File data block contents
- Quota limits
- Quota timer expiration (if resource usage exceeds the soft limit)

Cross-Referencing Space Metadata
````````````````````````````````

After internal block checks, the next higher level of checking is
cross-referencing records between metadata structures.
For regular runtime code, the cost of these checks is considered to be
prohibitively expensive, but as scrub is dedicated to rooting out
inconsistencies, it must pursue all avenues of inquiry.
The exact set of cross-referencing is highly dependent on the context of the
data structure being checked.

The XFS btree code has keyspace scanning functions that online fsck uses to
cross reference one structure with another.
Specifically, scrub can scan the key space of an index to determine if that
keyspace is fully, sparsely, or not at all mapped to records.
For the reverse mapping btree, it is possible to mask parts of the key for
the purposes of performing a keyspace scan so that scrub can decide if the
rmap btree contains records mapping a certain extent of physical space
without the sparseness of the rest of the rmap keyspace getting in the way.

Btree blocks undergo the following checks before cross-referencing (a sketch
of the last check follows the list):

- Does the type of data stored in the block match what scrub is expecting?
- Does the block belong to the owning structure that asked for the read?
- Do the records fit within the block?
- Are the records contained inside the block free of obvious corruptions?
- Are the name hashes in the correct order?
- Do node pointers within the btree point to valid block addresses for the
  type of btree?
- Do child pointers point towards the leaves?
- Do sibling pointers point across the same level?
- For each node block record, does the record key accurately reflect the
  contents of the child block?
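
The last of these checks can be pictured as comparing each node record's key
with the low key of the child block that the record points to, one level
down.
A hypothetical sketch, with invented types and accessors standing in for the
kernel's actual btree code:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    struct btree_key { uint64_t k; };
    struct btree_block;

    /* Hypothetical accessors for this sketch. */
    struct btree_key node_key(const struct btree_block *node, int i);
    struct btree_key child_low_key(const struct btree_block *child);
    int block_level(const struct btree_block *blk);

    /*
     * The i'th key in a node block must match the lowest key stored in
     * the i'th child, and that child must sit exactly one level closer
     * to the leaves than the node itself.
     */
    static bool check_node_record(const struct btree_block *node, int i,
                                  const struct btree_block *child)
    {
        if (block_level(child) != block_level(node) - 1)
            return false;   /* child pointers must point down one level */
        return node_key(node, i).k == child_low_key(child).k;
    }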

Space allocation records are cross-referenced as follows:

1. Any space mentioned by any metadata structure is cross-referenced as
   follows:

   - Does the reverse mapping index list only the appropriate owner as the
     owner of each block?
   - Are none of the blocks claimed as free space?
   - If these aren't file data blocks, are none of the blocks claimed as
     space shared by different owners?

2. Btree blocks are cross-referenced as follows:

   - Everything in class 1 above.
   - If there's a parent node block, do the keys listed for this block match
     the keyspace of this block?
   - Do the sibling pointers point to valid blocks?  Of the same level?
   - Do the child pointers point to valid blocks?  Of the next level down?

3. Free space btree records are cross-referenced as follows:

   - Everything in class 1 and 2 above.
   - Does the reverse mapping index list no owners of this space?
   - Is this space not claimed by the inode index for inodes?
   - Is it not mentioned by the reference count index?
   - Is there a matching record in the other free space btree?
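
Many of the cross-references above reduce to the keyspace scan primitive
described earlier: ask an index whether a given range of keys is not at all,
sparsely, or fully mapped to records.
The enum below is an illustrative stand-in for the kernel's actual
definitions; it shows how a caller might interpret the three outcomes:

.. code-block:: c

    #include <stdbool.h>

    /* Illustrative three-way result of a btree keyspace scan. */
    enum recpacking {
        RECPACKING_EMPTY,   /* no records map any part of the range */
        RECPACKING_SPARSE,  /* some, but not all, of the range is mapped */
        RECPACKING_FULL,    /* every key in the range is mapped */
    };

    /* e.g. free space btrees: the extent must not be free at all. */
    static bool xref_is_not_free(enum recpacking res)
    {
        return res == RECPACKING_EMPTY;
    }

    /* e.g. rmap btree: the extent must be completely owned. */
    static bool xref_is_fully_owned(enum recpacking res)
    {
        return res == RECPACKING_FULL;
    }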
repair.}(hj4hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM!hjd0hhubeh}(h]jah ]h"] cross-referencing space metadataah$]h&]uh1hhj,hhhhhMubh)}(hhh](h)}(hChecking Extended Attributesh]hChecking Extended Attributes}(hj5hhhNhNubah}(h]h ]h"]h$]h&]jjuh1hhj}5hhhhhM1ubh)}(hXcExtended attributes implement a key-value store that enable fragments of data to be attached to any file. Both the kernel and userspace can access the keys and values, subject to namespace and privilege restrictions. Most typically these fragments are metadata about the file -- origins, security contexts, user-supplied labels, indexing information, etc.h]hXcExtended attributes implement a key-value store that enable fragments of data to be attached to any file. Both the kernel and userspace can access the keys and values, subject to namespace and privilege restrictions. Most typically these fragments are metadata about the file -- origins, security contexts, user-supplied labels, indexing information, etc.}(hj5hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM3hj}5hhubh)}(hXNames can be as long as 255 bytes and can exist in several different namespaces. Values can be as large as 64KB. A file's extended attributes are stored in blocks mapped by the attr fork. The mappings point to leaf blocks, remote value blocks, or dabtree blocks. Block 0 in the attribute fork is always the top of the structure, but otherwise each of the three types of blocks can be found at any offset in the attr fork. Leaf blocks contain attribute key records that point to the name and the value. Names are always stored elsewhere in the same leaf block. Values that are less than 3/4 the size of a filesystem block are also stored elsewhere in the same leaf block. Remote value blocks contain values that are too large to fit inside a leaf. If the leaf information exceeds a single filesystem block, a dabtree (also rooted at block 0) is created to map hashes of the attribute names to leaf blocks in the attr fork.h]hXNames can be as long as 255 bytes and can exist in several different namespaces. Values can be as large as 64KB. A file’s extended attributes are stored in blocks mapped by the attr fork. The mappings point to leaf blocks, remote value blocks, or dabtree blocks. Block 0 in the attribute fork is always the top of the structure, but otherwise each of the three types of blocks can be found at any offset in the attr fork. Leaf blocks contain attribute key records that point to the name and the value. Names are always stored elsewhere in the same leaf block. Values that are less than 3/4 the size of a filesystem block are also stored elsewhere in the same leaf block. Remote value blocks contain values that are too large to fit inside a leaf. If the leaf information exceeds a single filesystem block, a dabtree (also rooted at block 0) is created to map hashes of the attribute names to leaf blocks in the attr fork.}(hj5hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM:hj}5hhubh)}(hChecking an extended attribute structure is not so straightforward due to the lack of separation between attr blocks and index blocks. Scrub must read each block mapped by the attr fork and ignore the non-leaf blocks:h]hChecking an extended attribute structure is not so straightforward due to the lack of separation between attr blocks and index blocks. 
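As an illustration of this style of cross-referencing, the following is a
minimal sketch of checking one free space record against the reverse
mapping index.  The helper names and cursor plumbing here are assumptions
made for the example, not the exact in-kernel scrub API:

.. code-block:: c

	/* Sketch only: helper names are illustrative, not the real API. */
	static void
	xchk_xref_free_extent(
		struct xfs_scrub	*sc,
		xfs_agblock_t		agbno,
		xfs_extlen_t		len)
	{
		bool			has_owner = true;
		int			error;

		/* Free space must have no owners in the rmap index. */
		error = xchk_rmap_has_owner(sc->sa.rmap_cur, agbno, len,
				&has_owner);
		if (error)
			return;
		if (has_owner)
			xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
	}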
Proposed patchsets are the series to find gaps in
`refcount btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
`inode btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_,
and
`rmap btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_
records; to find
`mergeable records
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
and to
`improve cross referencing with rmap
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
before starting a repair.

Checking Extended Attributes
````````````````````````````

Extended attributes implement a key-value store that enables fragments of
data to be attached to any file.  Both the kernel and userspace can access
the keys and values, subject to namespace and privilege restrictions.
Most typically these fragments are metadata about the file -- origins,
security contexts, user-supplied labels, indexing information, etc.

Names can be as long as 255 bytes and can exist in several different
namespaces.  Values can be as large as 64KB.  A file's extended attributes
are stored in blocks mapped by the attr fork.  The mappings point to leaf
blocks, remote value blocks, or dabtree blocks.  Block 0 in the attribute
fork is always the top of the structure, but otherwise each of the three
types of blocks can be found at any offset in the attr fork.  Leaf blocks
contain attribute key records that point to the name and the value.  Names
are always stored elsewhere in the same leaf block.  Values that are less
than 3/4 the size of a filesystem block are also stored elsewhere in the
same leaf block.  Remote value blocks contain values that are too large to
fit inside a leaf.  If the leaf information exceeds a single filesystem
block, a dabtree (also rooted at block 0) is created to map hashes of the
attribute names to leaf blocks in the attr fork.

Checking an extended attribute structure is not so straightforward due to
the lack of separation between attr blocks and index blocks.  Scrub must
read each block mapped by the attr fork and ignore the non-leaf blocks:

1. Walk the dabtree in the attr fork (if present) to ensure that there are
   no irregularities in the blocks or dabtree mappings that do not point to
   attr leaf blocks.

2. Walk the blocks of the attr fork looking for leaf blocks.  For each
   entry inside a leaf:

   a. Validate that the name does not contain invalid characters, as shown
      in the sketch after this list.

   b. Read the attr value.  This performs a named lookup of the attr name
      to ensure the correctness of the dabtree.  If the value is stored in
      a remote block, this also validates the integrity of the remote
      value block.
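The name check in step 2a might look like the following sketch.  The
function name is hypothetical; ``XATTR_NAME_MAX`` is the regular kernel
limit of 255 bytes mentioned above:

.. code-block:: c

	#include <linux/limits.h>

	/* Illustrative only; not the actual scrub helper. */
	static bool
	xchk_attr_name_valid(
		const unsigned char	*name,
		unsigned int		namelen)
	{
		unsigned int		i;

		if (namelen == 0 || namelen > XATTR_NAME_MAX)
			return false;

		/* Embedded null bytes would corrupt name lookups. */
		for (i = 0; i < namelen; i++)
			if (name[i] == 0)
				return false;

		return true;
	}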
Checking and Cross-Referencing Directories
``````````````````````````````````````````

Checking operations involving :ref:`parents <dirparent>` and
:ref:`file link counts <nlinks>` are discussed in more detail in later
sections.

Checking Directory/Attribute Btrees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As stated in previous sections, the directory/attribute btree (dabtree)
index maps user-provided names to improve lookup times by avoiding linear
scans.  Internally, it maps a 32-bit hash of the name to a block offset
within the appropriate file fork.

The internal structure of a dabtree closely resembles the btrees that
record fixed-size metadata records -- each dabtree block contains a magic
number, a checksum, sibling pointers, a UUID, a tree level, and a log
sequence number.  The format of leaf and node records is the same -- each
entry points to the next level down in the hierarchy, with dabtree node
records pointing to dabtree leaf blocks, and dabtree leaf records pointing
to non-dabtree blocks elsewhere in the fork.

Checking and cross-referencing the dabtree is very similar to what is done
for space btrees:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?  (See the sketch after this
  list.)

- Do node pointers within the dabtree point to valid fork offsets for
  dabtree blocks?

- Do leaf pointers within the dabtree point to valid fork offsets for
  directory or attr leaf blocks?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each dabtree node record, does the record key accurately reflect the
  contents of the child dabtree block?

- For each dabtree leaf record, does the record key accurately reflect the
  contents of the directory or attr block?
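The name hash ordering check can be sketched as follows.  The record layout
shown is a toy stand-in for the real ondisk ``struct xfs_da_node_entry``;
only the ordering logic is the point:

.. code-block:: c

	/* Toy stand-in for the ondisk dabtree entry layout. */
	struct dabtree_entry {
		uint32_t	hashval;	/* hash of the name */
		uint32_t	before;		/* fork offset of child */
	};

	/* Illustrative check: hashes must be in non-decreasing order. */
	static bool
	dabtree_hashes_in_order(
		const struct dabtree_entry	*entries,
		unsigned int			nr)
	{
		unsigned int			i;

		for (i = 1; i < nr; i++)
			if (entries[i - 1].hashval > entries[i].hashval)
				return false;

		return true;
	}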
Cross-Referencing Summary Counters
``````````````````````````````````

XFS maintains three classes of summary counters: available resources, quota
resource usage, and file link counts.

In theory, the amount of available resources (data blocks, inodes, realtime
extents) can be found by walking the entire filesystem.  This would make for
very slow reporting, so a transactional filesystem can maintain summaries of
this information in the superblock.  Cross-referencing these values against
the filesystem metadata should be a simple matter of walking the free space
and inode metadata in each AG and the realtime bitmap, but there are
complications that will be discussed in :ref:`more detail <fscounters>`
later.
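In outline, the comparison against the superblock counters looks like this
sketch.  The ``percpu_counter_sum`` call and the ``m_fdblocks`` counter are
real kernel constructs; the helper names and the (deliberately naive)
comparison logic are assumptions:

.. code-block:: c

	/* Sketch: compare a computed free block count to the incore one. */
	static void
	xchk_fdblocks_compare(
		struct xfs_scrub	*sc,
		uint64_t		computed_fdblocks)
	{
		struct xfs_mount	*mp = sc->mp;
		int64_t			incore;

		/* Sum the per-CPU free block counter. */
		incore = percpu_counter_sum(&mp->m_fdblocks);

		/*
		 * A mismatch implies either a corrupt counter or counters
		 * that moved while the scan ran; real code must recheck
		 * before deciding.
		 */
		if (incore != computed_fdblocks)
			xchk_set_corrupt(sc);	/* hypothetical helper */
	}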
:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
checking are sufficiently complicated to warrant separate sections.

Post-Repair Reverification
``````````````````````````

After performing a repair, the checking code is run a second time to
validate the new structure, and the results of the health assessment are
recorded internally and returned to the calling process.  This step is
critical for enabling system administrators to monitor the status of the
filesystem and the progress of any repairs.  For developers, it is a useful
means to judge the efficacy of error detection and correction in the online
and offline checking tools.

Eventual Consistency vs. Online Fsck
------------------------------------

Complex operations can make modifications to multiple per-AG data
structures with a chain of transactions.  These chains, once committed to
the log, are restarted during log recovery if the system crashes while
processing the chain.  Because the AG header buffers are unlocked between
transactions within a chain, online checking must coordinate with chained
operations that are in progress to avoid incorrectly detecting
inconsistencies due to pending chains.  Furthermore, online repair must not
run when operations are pending because the metadata are temporarily
inconsistent with each other, and rebuilding is not possible.

Only online fsck has this requirement of total consistency of AG metadata,
and online fsck should be relatively rare as compared to filesystem change
operations.  Online fsck coordinates with transaction chains as follows,
with a sketch of the loop after the list:

- For each AG, maintain a count of intent items targeting that AG.  The
  count should be bumped whenever a new item is added to the chain.  The
  count should be dropped when the filesystem has locked the AG header
  buffers and finished the work.

- When online fsck wants to examine an AG, it should lock the AG header
  buffers to quiesce all transaction chains that want to modify that AG.
  If the count is zero, proceed with the checking operation.  If it is
  nonzero, cycle the buffer locks to allow the chain to make forward
  progress.
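Under the assumption of a per-AG atomic counter and hypothetical lock
helpers, the coordination loop might look like this; the real mechanism is
the intent drain described later in this section:

.. code-block:: c

	/* Sketch only: field and helper names are illustrative. */
	static int
	xchk_ag_lock_quiesced(
		struct xfs_scrub	*sc,
		struct xfs_perag	*pag)
	{
		while (!fatal_signal_pending(current)) {
			/* Lock out new chains targeting this AG. */
			xchk_ag_lock_headers(sc, pag);

			/* No pending intents?  The AG is consistent. */
			if (atomic_read(&pag->pag_nr_intents) == 0)
				return 0;

			/* Cycle the locks so the chain can finish. */
			xchk_ag_unlock_headers(sc, pag);
			cond_resched();
		}

		return -EINTR;
	}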
This may lead to online fsck taking a long time to complete, but regular
filesystem updates take precedence over background checking activity.
Details about the discovery of this situation are presented in the
:ref:`next section <chain_coordination>`, and details about the solution
are presented :ref:`after that <intent_drains>`.

.. _chain_coordination:

Discovery of the Problem
````````````````````````

Midway through the development of online scrubbing, the fsstress tests
uncovered a misinteraction between online fsck and compound transaction
chains created by other writer threads that resulted in false reports of
metadata inconsistency.  The root cause of these reports is the eventual
consistency model introduced by the expansion of deferred work items and
compound transaction chains when reverse mapping and reflink were
introduced.

Originally, transaction chains were added to XFS to avoid deadlocks when
unmapping space from files.  Deadlock avoidance rules require that AGs only
be locked in increasing order, which makes it impossible (say) to use a
single transaction to free a space extent in AG 7 and then try to free a
now superfluous block mapping btree block in AG 3.  To avoid these kinds of
deadlocks, XFS creates Extent Freeing Intent (EFI) log items to commit to
freeing some space in one transaction while deferring the actual metadata
updates to a fresh transaction.  The transaction sequence looks like this:
1. The first transaction contains a physical update to the file's block
   mapping structures to remove the mapping from the btree blocks.  It then
   attaches to the in-memory transaction an action item to schedule
   deferred freeing of space.  Concretely, each transaction maintains a
   list of ``struct xfs_defer_pending`` objects, each of which maintains a
   list of ``struct xfs_extent_free_item`` objects.  Returning to the
   example above, the action item tracks the freeing of both the unmapped
   space from AG 7 and the block mapping btree (BMBT) block from AG 3.
   Deferred frees recorded in this manner are committed in the log by
   creating an EFI log item from the ``struct xfs_extent_free_item`` object
   and attaching the log item to the transaction.  When the log is
   persisted to disk, the EFI item is written into the ondisk transaction
   record.  EFIs can list up to 16 extents to free, all sorted in AG order.

2. The second transaction contains a physical update to the free space
   btrees of AG 3 to release the former BMBT block and a second physical
   update to the free space btrees of AG 7 to release the unmapped file
   space.  Observe that the physical updates are resequenced in the correct
   order when possible.  Attached to the transaction is an extent free done
   (EFD) log item.  The EFD contains a pointer to the EFI logged in
   transaction #1 so that log recovery can tell if the EFI needs to be
   replayed.

If the system goes down after transaction #1 is written back to the
filesystem but before #2 is committed, a scan of the filesystem metadata
would show inconsistent filesystem metadata because there would not appear
to be any owner of the unmapped space.  Happily, log recovery corrects this
inconsistency for us -- when recovery finds an intent log item but does not
find a corresponding intent done item, it will reconstruct the incore state
of the intent item and finish it.  In the example above, the log must
replay both frees described in the recovered EFI to complete the recovery
phase.
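Trimmed-down versions of the EFI and EFD payloads make the pairing
explicit.  These are simplified, illustrative layouts modeled on the real
``struct xfs_efi_log_format`` and ``struct xfs_efd_log_format``:

.. code-block:: c

	struct efi_extent {
		uint64_t	ext_start;	/* first block to free */
		uint32_t	ext_len;	/* length in blocks */
	};

	/* Intent: "this space will be freed", up to 16 extents. */
	struct efi_log_format {
		uint64_t		efi_id;		/* unique intent id */
		uint32_t		efi_nextents;
		struct efi_extent	efi_extents[];	/* sorted in AG order */
	};

	/* Done: points back at the intent so recovery can pair them. */
	struct efd_log_format {
		uint64_t		efd_efi_id;	/* matches efi_id */
		uint32_t		efd_nextents;	/* extents freed */
	};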
There are subtleties to XFS' transaction chaining strategy to consider:

- Log items must be added to a transaction in the correct order to prevent
  conflicts with principal objects that are not held by the transaction.
  In other words, all per-AG metadata updates for an unmapped block must be
  completed before the last update to free the extent, and extents should
  not be reallocated until that last update commits to the log.

- AG header buffers are released between each transaction in a chain.
  This means that other threads can observe an AG in an intermediate state,
  but as long as the first subtlety is handled, this should not affect the
  correctness of filesystem operations.

- Unmounting the filesystem flushes all pending work to disk, which means
  that offline fsck never sees the temporary inconsistencies caused by
  deferred work item processing.

In this manner, XFS employs a form of eventual consistency to avoid
deadlocks and increase parallelism.
]h"]h$]h&]uh1hhj;hhhhhNubh)}(h.A reverse mapping update for the freelist fix h]h)}(h-A reverse mapping update for the freelist fixh]h-A reverse mapping update for the freelist fix}(hjs<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMehjo<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h/An update to the reference counting informationh]h)}(hj<h]h/An update to the reference counting information}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMghj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h0A reverse mapping update for the refcount updateh]h)}(hj<h]h0A reverse mapping update for the refcount update}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhhj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h"Fixing the freelist (a third time)h]h)}(hj<h]h"Fixing the freelist (a third time)}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMihj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h.A reverse mapping update for the freelist fix h]h)}(h-A reverse mapping update for the freelist fixh]h-A reverse mapping update for the freelist fix}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMjhj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(hCFreeing any space that was unmapped and not owned by any other fileh]h)}(hj<h]hCFreeing any space that was unmapped and not owned by any other file}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMlhj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h#Fixing the freelist (a fourth time)h]h)}(hj<h]h#Fixing the freelist (a fourth time)}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMmhj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h.A reverse mapping update for the freelist fix h]h)}(h-A reverse mapping update for the freelist fixh]h-A reverse mapping update for the freelist fix}(hj=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMnhj=ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h1Freeing the space used by the block mapping btreeh]h)}(hj,=h]h1Freeing the space used by the block mapping btree}(hj.=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMphj*=ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h"Fixing the freelist (a fifth time)h]h)}(hjC=h]h"Fixing the freelist (a fifth time)}(hjE=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMqhjA=ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h.A reverse mapping update for the freelist fix h]h)}(h-A reverse mapping update for the freelist fixh]h-A reverse mapping update for the freelist fix}(hj\=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMrhjX=ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubeh}(h]h ]h"]h$]h&]jJj uh1hhhhM]hj:hhubh)}(hX%Free list fixups are not usually needed more than once per AG per transaction chain, but it is theoretically possible if space is very tight. For copy-on-write updates this is even worse, because this must be done once to remove the space from a staging area and again to map it into the file!h]hX%Free list fixups are not usually needed more than once per AG per transaction chain, but it is theoretically possible if space is very tight. For copy-on-write updates this is even worse, because this must be done once to remove the space from a staging area and again to map it into the file!}(hjv=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMthj:hhubh)}(hXTo deal with this explosion in a calm manner, XFS expands its use of deferred work items to cover most reverse mapping updates and all refcount updates. This reduces the worst case size of transaction reservations by breaking the work into a long chain of small updates, which increases the degree of eventual consistency in the system. 
To deal with this explosion in a calm manner, XFS expands its use of
deferred work items to cover most reverse mapping updates and all refcount
updates.  This reduces the worst case size of transaction reservations by
breaking the work into a long chain of small updates, which increases the
degree of eventual consistency in the system.  Again, this generally isn't
a problem because XFS orders its deferred work items carefully to avoid
resource reuse conflicts between unsuspecting threads.

However, online fsck changes the rules -- remember that although physical
updates to per-AG structures are coordinated by locking the buffers for AG
headers, buffer locks are dropped between transactions.  Once scrub
acquires resources and takes locks for a data structure, it must do all the
validation work without releasing the lock.  If the main lock for a space
btree is an AG header buffer lock, scrub may have interrupted another
thread that is midway through finishing a chain.  For example, if a thread
performing a copy-on-write has completed a reverse mapping update but not
the corresponding refcount update, the two AG btrees will appear
inconsistent to scrub and an observation of corruption will be recorded.
This observation will not be correct.  If a repair is attempted in this
state, the results will be catastrophic!

Several other solutions to this problem were evaluated upon discovery of
this flaw and rejected:

1. Add a higher level lock to allocation groups and require writer threads
   to acquire the higher level lock in AG order before making any changes.
   This would be very difficult to implement in practice because it is
   difficult to determine which locks need to be obtained, and in what
   order, without simulating the entire operation.  Performing a dry run of
   a file operation to discover necessary locks would make the filesystem
   very slow.

2. Make the deferred work coordinator code aware of consecutive intent
   items targeting the same AG and have it hold the AG header buffers
   locked across the transaction roll between updates.  This would
   introduce a lot of complexity into the coordinator since it is only
   loosely coupled with the actual deferred work items.  It would also fail
   to solve the problem because deferred work items can generate new
   deferred subtasks, but all subtasks must be complete before work can
   start on a new sibling task.

3. Teach online fsck to walk all transactions waiting for whichever lock(s)
   protect the data structure being scrubbed to look for pending
   operations.  The checking and repair operations must factor these
   pending operations into the evaluations being performed.  This solution
   is a nonstarter because it is *extremely* invasive to the main
   filesystem.
.. _intent_drains:

Intent Drains
`````````````

Online fsck uses an atomic intent item counter and lock cycling to
coordinate with transaction chains.  There are two key properties to the
drain mechanism.  First, the counter is incremented when a deferred work
item is *queued* to a transaction, and it is decremented after the
associated intent done log item is *committed* to another transaction.  The
second property is that deferred work can be added to a transaction without
holding an AG header lock, but per-AG work items cannot be marked done
without locking that AG header buffer to log the physical updates and the
intent done log item.  The first property enables scrub to yield to running
transaction chains, which is an explicit deprioritization of online fsck to
benefit file operations.  The second property of the drain is key to the
correct coordination of scrub, since scrub will always be able to decide if
a conflict is possible.

For regular filesystem code, the drain works as follows, with a sketch
after the list:

1. Call the appropriate subsystem function to add a deferred work item to a
   transaction.

2. The function calls ``xfs_defer_drain_bump`` to increase the counter.

3. When the deferred item manager wants to finish the deferred work item,
   it calls ``->finish_item`` to complete it.

4. The ``->finish_item`` implementation logs some changes and calls
   ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any
   threads waiting on the drain.

5. The subtransaction commits, which unlocks the resource associated with
   the intent item.
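A minimal model of the producer side follows, using the counter operations
named above.  The two-field structure reflects the counter-plus-waitqueue
design described in this section; the function bodies are simplified
sketches, not the exact kernel code:

.. code-block:: c

	struct xfs_defer_drain {
		atomic_t		dr_count;	/* pending intents */
		struct wait_queue_head	dr_waiters;	/* scrubbers asleep */
	};

	static inline void
	xfs_defer_drain_bump(struct xfs_defer_drain *dr)
	{
		atomic_inc(&dr->dr_count);
	}

	static inline void
	xfs_defer_drain_drop(struct xfs_defer_drain *dr)
	{
		/* Wake scrub only when the last intent is finished. */
		if (atomic_dec_and_test(&dr->dr_count) &&
		    wq_has_sleeper(&dr->dr_waiters))
			wake_up(&dr->dr_waiters);
	}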
For scrub, the drain works as follows, again with a sketch after the list:

1. Lock the resource(s) associated with the metadata being scrubbed.  For
   example, a scan of the refcount btree would lock the AGI and AGF header
   buffers.

2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there
   are no chains in progress and the operation may proceed.

3. Otherwise, release the resources grabbed in step 1.

4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``),
   then go back to step 1 unless a signal has been caught.
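The consumer side can be sketched as a lock-cycling loop; the lock helpers
here are hypothetical stand-ins for the real resource acquisition code:

.. code-block:: c

	/* Sketch of steps 1-4 above; names are illustrative. */
	static int
	xchk_drain_and_lock(
		struct xfs_scrub	*sc,
		struct xfs_defer_drain	*dr)
	{
		int	error = 0;

		while (!error) {
			/* Step 1: take the AG header locks. */
			xchk_lock_ag_headers(sc);	/* hypothetical */

			/* Step 2: no pending chains?  Proceed. */
			if (atomic_read(&dr->dr_count) == 0)
				return 0;

			/* Step 3: back off so chains can finish. */
			xchk_unlock_ag_headers(sc);	/* hypothetical */

			/* Step 4: sleep until the drain empties. */
			error = wait_event_killable(dr->dr_waiters,
					atomic_read(&dr->dr_count) == 0);
		}

		return error;
	}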
To avoid polling in step 4, the drain provides a waitqueue for scrub
threads to be woken up whenever the intent count drops to zero.

The proposed patchset is the
`scrub intent drain series
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.

.. _jump_labels:

Static Keys (aka Jump Label Patching)
`````````````````````````````````````

Online fsck for XFS separates the regular filesystem from the checking and
repair code as much as possible.  However, there are a few parts of online
fsck (such as the intent drains, and later, live update hooks) where it is
useful for the online fsck code to know what's going on in the rest of the
filesystem.  Since it is not expected that online fsck will be constantly
running in the background, it is very important to minimize the runtime
overhead imposed by these hooks when online fsck is compiled into the
kernel but not actively running on behalf of userspace.  Taking locks in
the hot path of a writer thread to access a data structure only to find
that no further action is necessary is expensive -- on the author's
computer, this has an overhead of 40-50ns per access.  Fortunately, the
kernel supports dynamic code patching, which enables XFS to replace a
static branch to hook code with ``nop`` sleds when online fsck isn't
running.  This sled has an overhead of however long it takes the
instruction decoder to skip past the sled, which seems to be on the order
of less than 1ns and does not access memory outside of instruction
fetching.
When online fsck enables the static key, the sled is replaced with an
unconditional branch to call the hook code.  The switchover is quite
expensive (~22000ns) but is paid entirely by the program that invoked
online fsck, and can be amortized if multiple threads enter online fsck at
the same time, or if multiple filesystems are being checked at the same
time.  Changing the branch direction requires taking the CPU hotplug lock,
and since CPU initialization requires memory allocation, online fsck must
be careful not to change a static key while holding any locks or resources
that could be accessed in the memory reclaim paths.  To minimize contention
on the CPU hotplug lock, care should be taken not to enable or disable
static keys unnecessarily.

Because static keys are intended to minimize hook overhead for regular
filesystem operations when xfs_scrub is not running, the intended usage
patterns are as follows, with a sketch of the pattern after the list:

- The hooked part of XFS should declare a static-scoped static key that
  defaults to false.  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of
  this.  The static key itself should be declared as a ``static`` variable.

- When deciding to invoke code that's only used by scrub, the regular
  filesystem should call the ``static_branch_unlikely`` predicate to avoid
  the scrub-only hook code if the static key is not enabled.

- The regular filesystem should export helper functions that call
  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
  static key.  Wrapper functions make it easy to compile out the relevant
  code if the kernel distributor turns off online fsck at build time.

- Scrub functions wanting to turn on scrub-only XFS functionality should
  call ``xchk_fsgates_enable`` from the setup function to enable a specific
  hook.  This must be done before obtaining any resources that are used by
  memory reclaim.  Callers had better be sure they really need the
  functionality gated by the static key; the ``TRY_HARDER`` flag is useful
  here.
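The first three points combine into a pattern like the following.
``DEFINE_STATIC_KEY_FALSE``, ``static_branch_unlikely``,
``static_branch_inc``, and ``static_branch_dec`` are the regular kernel
jump label API; the hook itself is hypothetical:

.. code-block:: c

	#include <linux/jump_label.h>

	static DEFINE_STATIC_KEY_FALSE(xfs_example_hooks_switch);

	void xfs_example_hook_slowpath(struct xfs_mount *mp);

	/* Hot path: compiles to a nop sled until scrub flips the key. */
	static inline void
	xfs_example_hook(struct xfs_mount *mp)
	{
		if (static_branch_unlikely(&xfs_example_hooks_switch))
			xfs_example_hook_slowpath(mp);
	}

	/* Wrappers exported to scrub; easy to compile out entirely. */
	void
	xfs_example_hooks_enable(void)
	{
		static_branch_inc(&xfs_example_hooks_switch);
	}

	void
	xfs_example_hooks_disable(void)
	{
		static_branch_dec(&xfs_example_hooks_switch);
	}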
Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
handle locking AGI and AGF buffers for all scrubber functions.  If it
detects a conflict between scrub and the running transactions, it will try
to wait for intents to complete.  If the caller of the helper has not
enabled the static key, the helper will return -EDEADLOCK, which should
result in the scrub being restarted with the ``TRY_HARDER`` flag set.  The
scrub setup function should detect that flag, enable the static key, and
try the scrub again.  Scrub teardown disables all static keys obtained by
``xchk_fsgates_enable``.

For more information, please see the kernel documentation in
Documentation/staging/static-keys.rst.

.. _xfile:

Pageable Kernel Memory
----------------------

Some online checking functions work by scanning the filesystem to build a
shadow copy of an ondisk metadata structure in memory and comparing the two
copies.  For online repair to rebuild a metadata structure, it must compute
the record set that will be stored in the new structure before it can
persist that new structure to disk.  Ideally, repairs complete with a
single atomic commit that introduces a new data structure.  To meet these
goals, the kernel needs to collect a large amount of information in a place
that doesn't require the correct operation of the filesystem.

Kernel memory isn't suitable because:
h]h)}(hmAllocating a contiguous region of memory to create a C array is very difficult, especially on 32-bit systems.h]hmAllocating a contiguous region of memory to create a C array is very difficult, especially on 32-bit systems.}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMChjBubah}(h]h ]h"]h$]h&]uh1hhjAhhhhhNubh)}(hLinked lists of records introduce double pointer overhead which is very high and eliminate the possibility of indexed lookups. h]h)}(h~Linked lists of records introduce double pointer overhead which is very high and eliminate the possibility of indexed lookups.:h]h~Linked lists of records introduce double pointer overhead which is very high and eliminate the possibility of indexed lookups.}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMFhjBubah}(h]h ]h"]h$]h&]uh1hhjAhhhhhNubh)}(hIKernel memory is pinned, which can drive the system into OOM conditions. h]h)}(hHKernel memory is pinned, which can drive the system into OOM conditions.h]hHKernel memory is pinned, which can drive the system into OOM conditions.}(hj4BhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMIhj0Bubah}(h]h ]h"]h$]h&]uh1hhjAhhhhhNubh)}(hJThe system might not have sufficient memory to stage all the information. h]h)}(hIThe system might not have sufficient memory to stage all the information.h]hIThe system might not have sufficient memory to stage all the information.}(hjLBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMKhjHBubah}(h]h ]h"]h$]h&]uh1hhjAhhhhhNubeh}(h]h ]h"]h$]h&]jJj uh1hhhhMChjAhhubh)}(hXdAt any given time, online fsck does not need to keep the entire record set in memory, which means that individual records can be paged out if necessary. Continued development of online fsck demonstrated that the ability to perform indexed data storage would also be very useful. Fortunately, the Linux kernel already has a facility for byte-addressable and pageable storage: tmpfs. In-kernel graphics drivers (most notably i915) take advantage of tmpfs files to store intermediate data that doesn't need to be in memory at all times, so that usage precedent is already established. Hence, the ``xfile`` was born!h](hXSAt any given time, online fsck does not need to keep the entire record set in memory, which means that individual records can be paged out if necessary. Continued development of online fsck demonstrated that the ability to perform indexed data storage would also be very useful. Fortunately, the Linux kernel already has a facility for byte-addressable and pageable storage: tmpfs. In-kernel graphics drivers (most notably i915) take advantage of tmpfs files to store intermediate data that doesn’t need to be in memory at all times, so that usage precedent is already established. 
Hence, the }(hjfBhhhNhNubj)}(h ``xfile``h]hxfile}(hjnBhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjfBubh was born!}(hjfBhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMMhjAhhubj)}(hhh]j)}(hhh](j)}(hhh]h}(h]h ]h"]h$]h&]colwidthKJuh1jhjBubj)}(hhh](j)}(hhh]j)}(hhh]h)}(h**Historical Sidebar**:h](j)}(h**Historical Sidebar**h]hHistorical Sidebar}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjBubh:}(hjBhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMYhjBubah}(h]h ]h"]h$]h&]uh1jhjBubah}(h]h ]h"]h$]h&]uh1jhjBubj)}(hhh]j)}(hhh](h)}(hThe first edition of online repair inserted records into a new btree as it found them, which failed because filesystem could shut down with a built data structure, which would be live after recovery finished.h]hThe first edition of online repair inserted records into a new btree as it found them, which failed because filesystem could shut down with a built data structure, which would be live after recovery finished.}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM[hjBubh)}(hThe second edition solved the half-rebuilt structure problem by storing everything in memory, but frequently ran the system out of memory.h]hThe second edition solved the half-rebuilt structure problem by storing everything in memory, but frequently ran the system out of memory.}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM_hjBubh)}(hyThe third edition solved the OOM problem by using linked lists, but the memory overhead of the list pointers was extreme.h]hyThe third edition solved the OOM problem by using linked lists, but the memory overhead of the list pointers was extreme.}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMbhjBubeh}(h]h ]h"]h$]h&]uh1jhjBubah}(h]h ]h"]h$]h&]uh1jhjBubeh}(h]h ]h"]h$]h&]uh1jhjBubeh}(h]h ]h"]h$]h&]colsKuh1jhjBubah}(h]h ]h"]h$]h&]uh1jhjAhhhhhNubh)}(hhh](h)}(hxfile Access Modelsh]hxfile Access Models}(hjChhhNhNubah}(h]h ]h"]h$]h&]jj0uh1hhjChhhhhMgubh)}(hBA survey of the intended uses of xfiles suggested these use cases:h]hBA survey of the intended uses of xfiles suggested these use cases:}(hj'ChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMihjChhubji)}(hhh](h)}(hbArrays of fixed-sized records (space management btrees, directory and extended attribute entries) h]h)}(haArrays of fixed-sized records (space management btrees, directory and extended attribute entries)h]haArrays of fixed-sized records (space management btrees, directory and extended attribute entries)}(hjSparse arrays of fixed-sized records (quotas and link counts) h]h)}(h=Sparse arrays of fixed-sized records (quotas and link counts)h]h=Sparse arrays of fixed-sized records (quotas and link counts)}(hjTChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMnhjPCubah}(h]h ]h"]h$]h&]uh1hhj5ChhhhhNubh)}(hcLarge binary objects (BLOBs) of variable sizes (directory and extended attribute names and values) h]h)}(hbLarge binary objects (BLOBs) of variable sizes (directory and extended attribute names and values)h]hbLarge binary objects (BLOBs) of variable sizes (directory and extended attribute names and values)}(hjlChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMphjhCubah}(h]h ]h"]h$]h&]uh1hhj5ChhhhhNubh)}(h2Staging btrees in memory (reverse mapping btrees) h]h)}(h1Staging btrees in memory (reverse mapping btrees)h]h1Staging btrees in memory (reverse mapping btrees)}(hjChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMshjCubah}(h]h ]h"]h$]h&]uh1hhj5ChhhhhNubh)}(h/Arbitrary contents (realtime space management) h]h)}(h.Arbitrary contents (realtime space management)h]h.Arbitrary contents (realtime space management)}(hjChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMuhjCubah}(h]h ]h"]h$]h&]uh1hhj5ChhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjChhhhhMkubh)}(hXYTo support the 
first four use cases, high level data structures wrap the xfile to share functionality between online fsck functions. The rest of this section discusses the interfaces that the xfile presents to four of those five higher level data structures. The fifth use case is discussed in the :ref:`realtime summary ` case study.h](hX*To support the first four use cases, high level data structures wrap the xfile to share functionality between online fsck functions. The rest of this section discusses the interfaces that the xfile presents to four of those five higher level data structures. The fifth use case is discussed in the }(hjChhhNhNubh)}(h#:ref:`realtime summary `h]j)}(hjCh]hrealtime summary}(hjChhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjCubah}(h]h ]h"]h$]h&]refdocj refdomainjCreftyperef refexplicitrefwarnj rtsummaryuh1hhhhMwhjCubh case study.}(hjChhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMwhjChhubh)}(hXXFS is very record-based, which suggests that the ability to load and store complete records is important. To support these cases, a pair of ``xfile_load`` and ``xfile_store`` functions are provided to read and persist objects into an xfile that treat any error as an out of memory error. For online repair, squashing error conditions in this manner is an acceptable behavior because the only reaction is to abort the operation back to userspace.h](hXFS is very record-based, which suggests that the ability to load and store complete records is important. To support these cases, a pair of }(hjChhhNhNubj)}(h``xfile_load``h]h xfile_load}(hjChhhNhNubah}(h]h ]h"]h$]h&]uh1jhjCubh and }(hjChhhNhNubj)}(h``xfile_store``h]h xfile_store}(hjDhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjCubhX functions are provided to read and persist objects into an xfile that treat any error as an out of memory error. For online repair, squashing error conditions in this manner is an acceptable behavior because the only reaction is to abort the operation back to userspace.}(hjChhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM~hjChhubh)}(hXHowever, no discussion of file access idioms is complete without answering the question, "But what about mmap?" It is convenient to access storage directly with pointers, just like userspace code does with regular memory. Online fsck must not drive the system into OOM conditions, which means that xfiles must be responsive to memory reclamation. tmpfs can only push a pagecache folio to the swap cache if the folio is neither pinned nor locked, which means the xfile must not pin too many folios.h]hXHowever, no discussion of file access idioms is complete without answering the question, “But what about mmap?” It is convenient to access storage directly with pointers, just like userspace code does with regular memory. Online fsck must not drive the system into OOM conditions, which means that xfiles must be responsive to memory reclamation. tmpfs can only push a pagecache folio to the swap cache if the folio is neither pinned nor locked, which means the xfile must not pin too many folios.}(hjDhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjChhubh)}(hX Short term direct access to xfile contents is done by locking the pagecache folio and mapping it into kernel address space. Object load and store uses this mechanism. Folio locks are not supposed to be held for long periods of time, so long term direct access to xfile contents is done by bumping the folio refcount, mapping it into kernel address space, and dropping the folio lock. 
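As a concrete illustration of the load/store idiom described above, here is a minimal sketch that stages a hypothetical fixed-size record in an xfile; the ``xfile_load`` and ``xfile_store`` signatures, and the assumption that a never-written range loads back as zeroes, are taken from this section's description rather than from any authoritative API:

.. code-block:: c

	struct xexample_rec {
		u64	key;
		u64	count;
	};

	static int
	xexample_bump_count(struct xfile *xf, u64 key)
	{
		struct xexample_rec	rec;
		loff_t			pos = key * sizeof(rec);
		int			error;

		/* Assumed: a never-written range loads back as all zeroes. */
		error = xfile_load(xf, &rec, sizeof(rec), pos);
		if (error)
			return error;	/* treated as an out of memory error */

		rec.key = key;
		rec.count++;

		/* Persist the record; tmpfs may page it out under pressure. */
		return xfile_store(xf, &rec, sizeof(rec), pos);
	}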
These long term users *must* be responsive to memory reclaim by hooking into the shrinker infrastructure to know when to release folios.h](hXShort term direct access to xfile contents is done by locking the pagecache folio and mapping it into kernel address space. Object load and store uses this mechanism. Folio locks are not supposed to be held for long periods of time, so long term direct access to xfile contents is done by bumping the folio refcount, mapping it into kernel address space, and dropping the folio lock. These long term users }(hj(DhhhNhNubj7)}(h*must*h]hmust}(hj0DhhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj(Dubhl be responsive to memory reclaim by hooking into the shrinker infrastructure to know when to release folios.}(hj(DhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjChhubh)}(hX'The ``xfile_get_folio`` and ``xfile_put_folio`` functions are provided to retrieve the (locked) folio that backs part of an xfile and to release it. The only code to use these folio lease functions are the xfarray :ref:`sorting` algorithms and the :ref:`in-memory btrees`.h](hThe }(hjHDhhhNhNubj)}(h``xfile_get_folio``h]hxfile_get_folio}(hjPDhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjHDubh and }(hjHDhhhNhNubj)}(h``xfile_put_folio``h]hxfile_put_folio}(hjbDhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjHDubh functions are provided to retrieve the (locked) folio that backs part of an xfile and to release it. The only code to use these folio lease functions are the xfarray }(hjHDhhhNhNubh)}(h:ref:`sorting`h]j)}(hjvDh]hsorting}(hjxDhhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjtDubah}(h]h ]h"]h$]h&]refdocj refdomainjDreftyperef refexplicitrefwarnj xfarray_sortuh1hhhhMhjHDubh algorithms and the }(hjHDhhhNhNubh)}(h :ref:`in-memory btrees`h]j)}(hjDh]hin-memory btrees}(hjDhhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjDubah}(h]h ]h"]h$]h&]refdocj refdomainjDreftyperef refexplicitrefwarnjxfbtreeuh1hhhhMhjHDubh.}(hjHDhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjChhubeh}(h]j6ah ]h"]xfile access modelsah$]h&]uh1hhjAhhhhhMgubh)}(hhh](h)}(hxfile Access Coordinationh]hxfile Access Coordination}(hjDhhhNhNubah}(h]h ]h"]h$]h&]jjRuh1hhjDhhhhhMubh)}(hX For security reasons, xfiles must be owned privately by the kernel. They are marked ``S_PRIVATE`` to prevent interference from the security system, must never be mapped into process file descriptor tables, and their pages must never be mapped into userspace processes.h](hTFor security reasons, xfiles must be owned privately by the kernel. They are marked }(hjDhhhNhNubj)}(h ``S_PRIVATE``h]h S_PRIVATE}(hjDhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjDubh to prevent interference from the security system, must never be mapped into process file descriptor tables, and their pages must never be mapped into userspace processes.}(hjDhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjDhhubh)}(hXTo avoid locking recursion issues with the VFS, all accesses to the shmfs file are performed by manipulating the page cache directly. xfile writers call the ``->write_begin`` and ``->write_end`` functions of the xfile's address space to grab writable pages, copy the caller's buffer into the page, and release the pages. xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly before copying the contents into the caller's buffer. In other words, xfiles ignore the VFS read and write code paths to avoid having to create a dummy ``struct kiocb`` and to avoid taking inode and freeze locks. 
tmpfs cannot be frozen, and xfiles must not be exposed to userspace.h](hTo avoid locking recursion issues with the VFS, all accesses to the shmfs file are performed by manipulating the page cache directly. xfile writers call the }(hjDhhhNhNubj)}(h``->write_begin``h]h ->write_begin}(hjEhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjDubh and }(hjDhhhNhNubj)}(h``->write_end``h]h ->write_end}(hjEhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjDubh functions of the xfile’s address space to grab writable pages, copy the caller’s buffer into the page, and release the pages. xfile readers call }(hjDhhhNhNubj)}(h``shmem_read_mapping_page_gfp``h]hshmem_read_mapping_page_gfp}(hj&EhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjDubh to grab pages directly before copying the contents into the caller’s buffer. In other words, xfiles ignore the VFS read and write code paths to avoid having to create a dummy }(hjDhhhNhNubj)}(h``struct kiocb``h]h struct kiocb}(hj8EhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjDubhq and to avoid taking inode and freeze locks. tmpfs cannot be frozen, and xfiles must not be exposed to userspace.}(hjDhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjDhhubh)}(hX5If an xfile is shared between threads to stage repairs, the caller must provide its own locks to coordinate access. For example, if a scrub function stores scan results in an xfile and needs other threads to provide updates to the scanned data, the scrub function must provide a lock for all threads to share.h]hX5If an xfile is shared between threads to stage repairs, the caller must provide its own locks to coordinate access. For example, if a scrub function stores scan results in an xfile and needs other threads to provide updates to the scanned data, the scrub function must provide a lock for all threads to share.}(hjPEhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjDhhubh)}(h .. _xfarray:h]h}(h]h ]h"]h$]h&]jxfarrayuh1hhMhjDhhhhubeh}(h]jXah ]h"]xfile access coordinationah$]h&]uh1hhjAhhhhhMubh)}(hhh](h)}(hArrays of Fixed-Sized Recordsh]hArrays of Fixed-Sized Records}(hjsEhhhNhNubah}(h]h ]h"]h$]h&]jjtuh1hhjpEhhhhhMubh)}(hX.In XFS, each type of indexed space metadata (free space, inodes, reference counts, file fork space, and reverse mappings) consists of a set of fixed-size records indexed with a classic B+ tree. Directories have a set of fixed-size dirent records that point to the names, and extended attributes have a set of fixed-size attribute keys that point to names and values. Quota counters and file link counters index records with numbers. During a repair, scrub needs to stage new records during the gathering step and retrieve them during the btree building step.h]hX.In XFS, each type of indexed space metadata (free space, inodes, reference counts, file fork space, and reverse mappings) consists of a set of fixed-size records indexed with a classic B+ tree. Directories have a set of fixed-size dirent records that point to the names, and extended attributes have a set of fixed-size attribute keys that point to names and values. Quota counters and file link counters index records with numbers. During a repair, scrub needs to stage new records during the gathering step and retrieve them during the btree building step.}(hjEhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjpEhhubh)}(hXAlthough this requirement can be satisfied by calling the read and write methods of the xfile directly, it is simpler for callers for there to be a higher level abstraction to take care of computing array offsets, to provide iterator functions, and to deal with sparse records and sorting. 
The ``xfarray`` abstraction presents a linear array for fixed-size records atop the byte-accessible xfile.

.. _xfarray_access_patterns:

Array Access Patterns
^^^^^^^^^^^^^^^^^^^^^

Array access patterns in online fsck tend to fall into three categories. Iteration of records is assumed to be necessary for all cases and will be covered in the next section.

The first type of caller handles records that are indexed by position. Gaps may exist between records, and a record may be updated multiple times during the collection step. In other words, these callers want a sparse linearly addressed table file. The typical use cases are quota records and file link count records. Access to array elements is performed programmatically via the ``xfarray_load`` and ``xfarray_store`` functions, which wrap the similarly-named xfile functions to provide loading and storing of array elements at arbitrary array indices. Gaps are defined to be null records, and null records are defined to be a sequence of all zero bytes. Null records are detected by calling ``xfarray_element_is_null``. They are created either by calling ``xfarray_unset`` to null out an existing record or by never storing anything to an array index.

The second type of caller handles records that are not indexed by position and do not require multiple updates to a record. The typical use case here is rebuilding space btrees and key/value btrees.
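Returning to the first, positional pattern for a moment: a minimal sketch using hypothetical per-inode link count records, with the ``xfarray_*`` signatures assumed from the description above (the append-based second pattern continues just below):

.. code-block:: c

	struct xexample_nlink {
		u64	nlink;
	};

	static int
	xexample_get_nlink(struct xfarray *array, xfarray_idx_t inum, u64 *nlink)
	{
		struct xexample_nlink	rec;
		int			error;

		/* Assumed: loads from gaps return the all-zeroes null record. */
		error = xfarray_load(array, inum, &rec);
		if (error)
			return error;

		/* A null record means no link count was ever recorded. */
		if (xfarray_element_is_null(array, &rec))
			*nlink = 0;
		else
			*nlink = rec.nlink;
		return 0;
	}

	static int
	xexample_set_nlink(struct xfarray *array, xfarray_idx_t inum, u64 nlink)
	{
		struct xexample_nlink	rec = { .nlink = nlink };

		/* Unsetting nulls out the record, recreating a gap. */
		if (nlink == 0)
			return xfarray_unset(array, inum);

		/* Records can be stored at arbitrary indices and updated later. */
		return xfarray_store(array, inum, &rec);
	}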
These callers can add records to the array without caring about array indices via the ``xfarray_append`` function, which stores a record at the end of the array. For callers that require records to be presentable in a specific order (e.g. rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted records; this function will be covered later.h](hXThe second type of caller handles records that are not indexed by position and do not require multiple updates to a record. The typical use case here is rebuilding space btrees and key/value btrees. These callers can add records to the array without caring about array indices via the }(hj/FhhhNhNubj)}(h``xfarray_append``h]hxfarray_append}(hj7FhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj/Fubh function, which stores a record at the end of the array. For callers that require records to be presentable in a specific order (e.g. rebuilding btree data), the }(hj/FhhhNhNubj)}(h``xfarray_sort``h]h xfarray_sort}(hjIFhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj/FubhN function can arrange the sorted records; this function will be covered later.}(hj/FhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjEhhubh)}(hXThe third type of caller is a bag, which is useful for counting records. The typical use case here is constructing space extent reference counts from reverse mapping information. Records can be put in the bag in any order, they can be removed from the bag at any time, and uniqueness of records is left to callers. The ``xfarray_store_anywhere`` function is used to insert a record in any null record slot in the bag; and the ``xfarray_unset`` function removes a record from the bag.h](hX?The third type of caller is a bag, which is useful for counting records. The typical use case here is constructing space extent reference counts from reverse mapping information. Records can be put in the bag in any order, they can be removed from the bag at any time, and uniqueness of records is left to callers. The }(hjaFhhhNhNubj)}(h``xfarray_store_anywhere``h]hxfarray_store_anywhere}(hjiFhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjaFubhQ function is used to insert a record in any null record slot in the bag; and the }(hjaFhhhNhNubj)}(h``xfarray_unset``h]h xfarray_unset}(hj{FhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjaFubh( function removes a record from the bag.}(hjaFhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjEhhubh)}(hThe proposed patchset is the `big in-memory array `_.h](hThe proposed patchset is the }(hjFhhhNhNubj)}(hn`big in-memory array `_h]hbig in-memory array}(hjFhhhNhNubah}(h]h ]h"]h$]h&]namebig in-memory arrayjjUhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-arrayuh1jhjFubh)}(hX h]h}(h]big-in-memory-arrayah ]h"]big in-memory arrayah$]h&]refurijFuh1hjyKhjFubh.}(hjFhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjEhhubeh}(h](jjEeh ]h"](array access patternsxfarray_access_patternseh$]h&]uh1hhjpEhhhhhMj}jFjEsj}jEjEsubh)}(hhh](h)}(hIterating Array Elementsh]hIterating Array Elements}(hjFhhhNhNubah}(h]h ]h"]h$]h&]jjuh1hhjFhhhhhMubh)}(hMost users of the xfarray require the ability to iterate the records stored in the array. Callers can probe every possible array index with the following:h]hMost users of the xfarray require the ability to iterate the records stored in the array. 
Callers can probe every possible array index with the following:

.. code-block:: c

	xfarray_idx_t i;
	foreach_xfarray_idx(array, i) {
	    xfarray_load(array, i, &rec);
	    /* do something with rec */
	}

All users of this idiom must be prepared to handle null records or must already know that there aren't any.

For xfarray users that want to iterate a sparse array, the ``xfarray_iter`` function ignores indices in the xfarray that have never been written to by calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas of the array that are not populated with memory pages. Once it finds a page, it will skip the zeroed areas of the page.

.. code-block:: c

	xfarray_idx_t i = XFARRAY_CURSOR_INIT;
	while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
	    /* do something with rec */
	}

.. _xfarray_sort:

Sorting Array Elements
^^^^^^^^^^^^^^^^^^^^^^

During the fourth demonstration of online repair, a community reviewer remarked that for performance reasons, online repair ought to load batches of records into btree record blocks instead of inserting records into a new btree one at a time. The btree insertion code in XFS is responsible for maintaining correct ordering of the records, so naturally the xfarray must also support sorting the record set prior to bulk loading.

Case Study: Sorting xfarrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The sorting algorithm used in the xfarray is actually a combination of adaptive quicksort and a heapsort subalgorithm in the spirit of `Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and `pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux kernel.
To sort records in a reasonably short amount of time, ``xfarray`` takes advantage of the binary subpartitioning offered by quicksort, but it also uses heapsort to hedge against performance collapse if the chosen quicksort pivots are poor. Both algorithms are (in general) O(n * lg(n)), but there is a wide performance gulf between the two implementations.h](hThe sorting algorithm used in the xfarray is actually a combination of adaptive quicksort and a heapsort subalgorithm in the spirit of }(hjGhhhNhNubj)}(h:`Sedgewick `_h]h Sedgewick}(hjGhhhNhNubah}(h]h ]h"]h$]h&]name Sedgewickjj+https://algs4.cs.princeton.edu/23quicksort/uh1jhjGubh)}(h. h]h}(h] sedgewickah ]h"] sedgewickah$]h&]refurijGuh1hjyKhjGubh and }(hjGhhhNhNubj)}(h,`pdqsort `_h]hpdqsort}(hjGhhhNhNubah}(h]h ]h"]h$]h&]namepdqsortjjhttps://github.com/orlp/pdqsortuh1jhjGubh)}(h" h]h}(h]pdqsortah ]h"]pdqsortah$]h&]refurijGuh1hjyKhjGubhb, with customizations for the Linux kernel. To sort records in a reasonably short amount of time, }(hjGhhhNhNubj)}(h ``xfarray``h]hxfarray}(hjGhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjGubhX" takes advantage of the binary subpartitioning offered by quicksort, but it also uses heapsort to hedge against performance collapse if the chosen quicksort pivots are poor. Both algorithms are (in general) O(n * lg(n)), but there is a wide performance gulf between the two implementations.}(hjGhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM+hjGhhubh)}(hThe Linux kernel already contains a reasonably fast implementation of heapsort. It only operates on regular C arrays, which limits the scope of its usefulness. There are two key places where the xfarray uses it:h]hThe Linux kernel already contains a reasonably fast implementation of heapsort. It only operates on regular C arrays, which limits the scope of its usefulness. There are two key places where the xfarray uses it:}(hjHhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM7hjGhhubh)}(hhh](h)}(h9Sorting any record subset backed by a single xfile page. h]h)}(h8Sorting any record subset backed by a single xfile page.h]h8Sorting any record subset backed by a single xfile page.}(hjHhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM;hjHubah}(h]h ]h"]h$]h&]uh1hhjHhhhhhNubh)}(hLoading a small number of xfarray records from potentially disparate parts of the xfarray into a memory buffer, and sorting the buffer. h]h)}(hLoading a small number of xfarray records from potentially disparate parts of the xfarray into a memory buffer, and sorting the buffer.h]hLoading a small number of xfarray records from potentially disparate parts of the xfarray into a memory buffer, and sorting the buffer.}(hj/HhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM=hj+Hubah}(h]h ]h"]h$]h&]uh1hhjHhhhhhNubeh}(h]h ]h"]h$]h&]jJj uh1hhhhM;hjGhhubh)}(hIn other words, ``xfarray`` uses heapsort to constrain the nested recursion of quicksort, thereby mitigating quicksort's worst runtime behavior.h](hIn other words, }(hjIHhhhNhNubj)}(h ``xfarray``h]hxfarray}(hjQHhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjIHubhw uses heapsort to constrain the nested recursion of quicksort, thereby mitigating quicksort’s worst runtime behavior.}(hjIHhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM@hjGhhubh)}(hXChoosing a quicksort pivot is a tricky business. A good pivot splits the set to sort in half, leading to the divide and conquer behavior that is crucial to O(n * lg(n)) performance. A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`) runtime. 
The xfarray sort routine tries to avoid picking a bad pivot by sampling nine records into a memory buffer and using the kernel heapsort to identify the median of the nine.h](hChoosing a quicksort pivot is a tricky business. A good pivot splits the set to sort in half, leading to the divide and conquer behavior that is crucial to O(n * lg(n)) performance. A poor pivot barely splits the subset at all, leading to O(n }(hjiHhhhNhNubh superscript)}(h:sup:`2`h]h2}(hjsHhhhNhNubah}(h]h ]h"]h$]h&]uh1jqHhjiHubh) runtime. The xfarray sort routine tries to avoid picking a bad pivot by sampling nine records into a memory buffer and using the kernel heapsort to identify the median of the nine.}(hjiHhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMChjGhhubh)}(hXMost modern quicksort implementations employ Tukey's "ninther" to select a pivot from a classic C array. Typical ninther implementations pick three unique triads of records, sort each of the triads, and then sort the middle value of each triad to determine the ninther value. As stated previously, however, xfile accesses are not entirely cheap. It turned out to be much more performant to read the nine elements into a memory buffer, run the kernel's in-memory heapsort on the buffer, and choose the 4th element of that buffer as the pivot. Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for low-effort robust (resistant) location in large samples`, in *Contributions to Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press, 1978), pp. 251–257.h](hXWMost modern quicksort implementations employ Tukey’s “ninther” to select a pivot from a classic C array. Typical ninther implementations pick three unique triads of records, sort each of the triads, and then sort the middle value of each triad to determine the ninther value. As stated previously, however, xfile accesses are not entirely cheap. It turned out to be much more performant to read the nine elements into a memory buffer, run the kernel’s in-memory heapsort on the buffer, and choose the 4th element of that buffer as the pivot. Tukey’s ninthers are described in J. W. Tukey, }(hjHhhhNhNubhtitle_reference)}(hV`The ninther, a technique for low-effort robust (resistant) location in large samples`h]hTThe ninther, a technique for low-effort robust (resistant) location in large samples}(hjHhhhNhNubah}(h]h ]h"]h$]h&]uh1jHhjHubh, in }(hjHhhhNhNubj7)}(h9*Contributions to Survey Sampling and Applied Statistics*h]h7Contributions to Survey Sampling and Applied Statistics}(hjHhhhNhNubah}(h]h ]h"]h$]h&]uh1j6hjHubh<, edited by H. David, (Academic Press, 1978), pp. 251–257.}(hjHhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMLhjGhhubh)}(hXThe partitioning of quicksort is fairly textbook -- rearrange the record subset around the pivot, then set up the current and next stack frames to sort with the larger and the smaller halves of the pivot, respectively. This keeps the stack space requirements to log2(record count).h]hXThe partitioning of quicksort is fairly textbook -- rearrange the record subset around the pivot, then set up the current and next stack frames to sort with the larger and the smaller halves of the pivot, respectively. This keeps the stack space requirements to log2(record count).}(hjHhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMZhjGhhubh)}(hXCAs a final performance optimization, the hi and lo scanning phase of quicksort keeps examined xfile pages mapped in the kernel for as long as possible to reduce map/unmap cycles. 
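To make the pivot selection above concrete, here is a sketch of the median-of-nine sampling, assuming a fixed example record type (the real code is generic over record size) and a subset of at least nine records:

.. code-block:: c

	#include <linux/sort.h>
	#include <linux/string.h>

	struct xexample_rec {
		u64	key;
	};

	static int
	xexample_cmp(const void *a, const void *b)
	{
		const struct xexample_rec *ra = a;
		const struct xexample_rec *rb = b;

		if (ra->key < rb->key)
			return -1;
		return ra->key > rb->key;
	}

	/* Pick a quicksort pivot as the median of nine sampled records. */
	static int
	xexample_pick_pivot(struct xfarray *array, xfarray_idx_t lo,
			xfarray_idx_t hi, struct xexample_rec *pivot)
	{
		struct xexample_rec	samples[9];
		xfarray_idx_t		step = (hi - lo) / 8;
		int			i, error;

		/* Load nine roughly evenly spaced records into a buffer. */
		for (i = 0; i < 9; i++) {
			error = xfarray_load(array, lo + i * step, &samples[i]);
			if (error)
				return error;
		}

		/* The kernel's sort() is a heapsort, run on the tiny buffer... */
		sort(samples, 9, sizeof(struct xexample_rec), xexample_cmp, NULL);

		/* ...and the middle element becomes the pivot. */
		memcpy(pivot, &samples[4], sizeof(*pivot));
		return 0;
	}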
Surprisingly, this mapping optimization reduces overall sort runtime by nearly half again after accounting for the application of heapsort directly onto xfile pages.

.. _xfblob:

Blob Storage
````````````

Extended attributes and directories add an additional requirement for staging records: arbitrary byte sequences of finite length. Each directory entry record needs to store the entry name, and each extended attribute needs to store both the attribute name and value. The names, keys, and values can consume a large amount of memory, so the ``xfblob`` abstraction was created to simplify management of these blobs atop an xfile.

Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve and persist objects. The store function returns a magic cookie for every object that it persists. Later, callers provide this cookie to ``xfblob_load`` to recall the object. The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate`` function frees them all because compaction is not needed.

The details of repairing directories and extended attributes will be discussed in a subsequent section about atomic file content exchanges.
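A minimal sketch of the cookie idiom just described, staging a hypothetical directory entry name; the ``xfblob_*`` signatures are assumptions based on the text:

.. code-block:: c

	struct xexample_dirent {
		xfblob_cookie	name_cookie;
		u8		namelen;
		/* other staged dirent fields */
	};

	static int
	xexample_stage_name(struct xfblob *blobs, struct xexample_dirent *entry,
			const unsigned char *name, u8 namelen)
	{
		entry->namelen = namelen;

		/* Store the name bytes; the cookie recalls them later. */
		return xfblob_store(blobs, &entry->name_cookie, name, namelen);
	}

	static int
	xexample_recall_name(struct xfblob *blobs,
			const struct xexample_dirent *entry, unsigned char *name)
	{
		/* Load back exactly the bytes stored under this cookie. */
		return xfblob_load(blobs, entry->name_cookie, name,
				entry->namelen);
	}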
However, it should be noted that these repair functions only use blob storage to cache a small number of entries before adding them to a temporary ondisk file, which is why compaction is not required.

The proposed patchset is at the start of the `extended attribute repair <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.

.. _xfbtree:

In-Memory B+Trees
`````````````````

The chapter about :ref:`secondary metadata <secondary_metadata>` mentioned that checking and repairing of secondary metadata commonly requires coordination between a live metadata scan of the filesystem and writer threads that are updating that metadata. Keeping the scan data up to date requires the ability to propagate metadata updates from the filesystem into the data being collected by the scan. This *can* be done by appending concurrent updates into a separate log file and applying them before writing the new metadata to disk, but this leads to unbounded memory consumption if the rest of the system is very busy. Another option is to skip the side-log and commit live updates from the filesystem directly into the scan data, which trades more overhead for a lower maximum memory requirement. In both cases, the data structure holding the scan results must support indexed access to perform well.
Given that indexed lookups of scan data are required for both strategies, online fsck employs the second strategy of committing live updates directly into scan data. Because xfarrays are not indexed and do not enforce record ordering, they are not suitable for this task. Conveniently, however, XFS has a library to create and maintain ordered reverse mapping records: the existing rmap btree code! If only there were a means to create one in memory.

Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a regular file, which means that the kernel can create byte or block addressable virtual address spaces at will. The XFS buffer cache specializes in abstracting IO to block-oriented address spaces, which means that adaptation of the buffer cache to interface with xfiles enables reuse of the entire btree library. Btrees built atop an xfile are collectively known as ``xfbtrees``. The next few sections describe how they actually work.

The proposed patchset is the `in-memory btree <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_ series.

Using xfiles as a Buffer Cache Target
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Two modifications are necessary to support xfiles as a buffer cache target. The first is to make it possible for the ``struct xfs_buftarg`` structure to host the ``struct xfs_buf`` rhashtable, because normally those are held by a per-AG structure. The second change is to modify the buffer ``ioapply`` function to "read" cached pages from the xfile and "write" cached pages back to the xfile. Multiple access to individual buffers is controlled by the ``xfs_buf`` lock, since the xfile does not provide any locking on its own.
With this adaptation in place, users of the xfile-backed buffer cache use exactly the same APIs as users of the disk-backed buffer cache. The separation between xfile and buffer cache implies higher memory usage since they do not share pages, but this property could some day enable transactional updates to an in-memory btree. Today, however, it simply eliminates the need for new code.h](huTwo modifications are necessary to support xfiles as a buffer cache target. The first is to make it possible for the }(hjJhhhNhNubj)}(h``struct xfs_buftarg``h]hstruct xfs_buftarg}(hjJhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjJubh structure to host the }(hjJhhhNhNubj)}(h``struct xfs_buf``h]hstruct xfs_buf}(hjJhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjJubhn rhashtable, because normally those are held by a per-AG structure. The second change is to modify the buffer }(hjJhhhNhNubj)}(h ``ioapply``h]hioapply}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjJubh function to “read” cached pages from the xfile and “write” cached pages back to the xfile. Multiple access to individual buffers is controlled by the }(hjJhhhNhNubj)}(h ``xfs_buf``h]hxfs_buf}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjJubhX lock, since the xfile does not provide any locking on its own. With this adaptation in place, users of the xfile-backed buffer cache use exactly the same APIs as users of the disk-backed buffer cache. The separation between xfile and buffer cache implies higher memory usage since they do not share pages, but this property could some day enable transactional updates to an in-memory btree. Today, however, it simply eliminates the need for new code.}(hjJhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjJhhubeh}(h]jwah ]h"]%using xfiles as a buffer cache targetah$]h&]uh1hhjIhhhhhMubh)}(hhh](h)}(h Space Management with an xfbtreeh]h Space Management with an xfbtree}(hj5KhhhNhNubah}(h]h ]h"]h$]h&]jjuh1hhj2KhhhhhMubh)}(hXiSpace management for an xfile is very simple -- each btree block is one memory page in size. These blocks use the same header format as an on-disk btree, but the in-memory block verifiers ignore the checksums, assuming that xfile memory is no more corruption-prone than regular DRAM. Reusing existing code here is more important than absolute memory efficiency.h]hXiSpace management for an xfile is very simple -- each btree block is one memory page in size. These blocks use the same header format as an on-disk btree, but the in-memory block verifiers ignore the checksums, assuming that xfile memory is no more corruption-prone than regular DRAM. Reusing existing code here is more important than absolute memory efficiency.}(hjCKhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj2Khhubh)}(hThe very first block of an xfile backing an xfbtree contains a header block. The header describes the owner, height, and the block number of the root xfbtree block.h]hThe very first block of an xfile backing an xfbtree contains a header block. The header describes the owner, height, and the block number of the root xfbtree block.}(hjQKhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj2Khhubh)}(hXtTo allocate a btree block, use ``xfile_seek_data`` to find a gap in the file. If there are no gaps, create one by extending the length of the xfile. Preallocate space for the block with ``xfile_prealloc``, and hand back the location. 
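An illustrative sketch of this allocation scheme follows; the loop structure and the exact ``xfile_seek_data`` and ``xfile_prealloc`` signatures are assumptions, and the matching free path (via ``xfile_discard``) is described next in the text:

.. code-block:: c

	static int
	xexample_xfbtree_alloc_block(struct xfile *xf, loff_t *isizep,
			loff_t *posp)
	{
		loff_t		pos = 0;
		loff_t		data;
		int		error;

		/* Scan one btree block (page) at a time looking for a hole. */
		while (pos < *isizep) {
			data = xfile_seek_data(xf, pos);
			if (data < 0 || data > pos)
				break;	/* pos is not backed by data: reuse it */
			pos += PAGE_SIZE;
		}

		/* No gaps found: extend the xfile by one block instead. */
		if (pos >= *isizep)
			*isizep = pos + PAGE_SIZE;

		/* Make sure memory is actually allocated for the new block. */
		error = xfile_prealloc(xf, pos, PAGE_SIZE);
		if (error)
			return error;

		*posp = pos;
		return 0;
	}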
To free an xfbtree block, use ``xfile_discard`` (which internally uses ``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.h](hTo allocate a btree block, use }(hj_KhhhNhNubj)}(h``xfile_seek_data``h]hxfile_seek_data}(hjgKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_Kubh to find a gap in the file. If there are no gaps, create one by extending the length of the xfile. Preallocate space for the block with }(hj_KhhhNhNubj)}(h``xfile_prealloc``h]hxfile_prealloc}(hjyKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_Kubh<, and hand back the location. To free an xfbtree block, use }(hj_KhhhNhNubj)}(h``xfile_discard``h]h xfile_discard}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_Kubh (which internally uses }(hj_KhhhNhNubj)}(h``FALLOC_FL_PUNCH_HOLE``h]hFALLOC_FL_PUNCH_HOLE}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_Kubh+) to remove the memory page from the xfile.}(hj_KhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj2Khhubeh}(h]jah ]h"] space management with an xfbtreeah$]h&]uh1hhjIhhhhhMubh)}(hhh](h)}(hPopulating an xfbtreeh]hPopulating an xfbtree}(hjKhhhNhNubah}(h]h ]h"]h$]h&]jjuh1hhjKhhhhhMubh)}(hRAn online fsck function that wants to create an xfbtree should proceed as follows:h]hRAn online fsck function that wants to create an xfbtree should proceed as follows:}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjKhhubji)}(hhh](h)}(h*Call ``xfile_create`` to create an xfile. h]h)}(h)Call ``xfile_create`` to create an xfile.h](hCall }(hjKhhhNhNubj)}(h``xfile_create``h]h xfile_create}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjKubh to create an xfile.}(hjKhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjKubah}(h]h ]h"]h$]h&]uh1hhjKhhhhhNubh)}(hcCall ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure pointing to the xfile. h]h)}(hbCall ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure pointing to the xfile.h](hCall }(hj LhhhNhNubj)}(h``xfs_alloc_memory_buftarg``h]hxfs_alloc_memory_buftarg}(hjLhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj LubhA to create a buffer cache target structure pointing to the xfile.}(hj LhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjLubah}(h]h ]h"]h$]h&]uh1hhjKhhhhhNubh)}(hXPass the buffer cache target, buffer ops, and other information to ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an initial root block to the xfile. Each btree type should define a wrapper that passes necessary arguments to the creation function. For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of all the necessary details for callers. h]h)}(hXPass the buffer cache target, buffer ops, and other information to ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an initial root block to the xfile. Each btree type should define a wrapper that passes necessary arguments to the creation function. For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of all the necessary details for callers.h](hCPass the buffer cache target, buffer ops, and other information to }(hj6LhhhNhNubj)}(h``xfbtree_init``h]h xfbtree_init}(hj>LhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj6Lubh to initialize the passed in }(hj6LhhhNhNubj)}(h``struct xfbtree``h]hstruct xfbtree}(hjPLhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj6Lubh and write an initial root block to the xfile. Each btree type should define a wrapper that passes necessary arguments to the creation function. 
For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of all the necessary details for callers.

4. Pass the xfbtree object to the btree cursor creation function for the btree type. Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this for callers.

5. Pass the btree cursor to the regular btree functions to make queries against and to update the in-memory btree. For example, a btree cursor for an rmap xfbtree can be passed to the ``xfs_rmap_*`` functions just like any other btree cursor. See the :ref:`next section <xfbtree_commit>` for information on dealing with xfbtree updates that are logged to a transaction.

6. When finished, delete the btree cursor, destroy the xfbtree object, free the buffer target, and then destroy the xfile to release all resources. A sketch of this whole sequence appears below.

.. _xfbtree_commit:

Committing Logged xfbtree Buffers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although it is a clever hack to reuse the rmap btree code to handle the staging structure, the ephemeral nature of the in-memory btree block storage presents some challenges of its own.
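Here is the sketch promised above, tying together the creation steps for an in-memory rmap btree before turning to the logging problem; every function signature below is an assumption based on the names in the text:

.. code-block:: c

	static int
	xexample_rmap_xfbtree(struct xfs_mount *mp, xfs_agnumber_t agno)
	{
		struct xfbtree		xfbt;
		struct xfs_buftarg	*btp;
		struct xfs_btree_cur	*cur;
		struct xfile		*xf;
		int			error;

		/* Step 1: create the xfile (signature assumed). */
		error = xfile_create("rmap staging", 0, &xf);
		if (error)
			return error;

		/* Step 2: wrap it in a buffer cache target (signature assumed). */
		error = xfs_alloc_memory_buftarg(mp, xf, &btp);
		if (error)
			goto out_xfile;

		/* Step 3: write the initial root block via the rmapbt wrapper. */
		error = xfs_rmapbt_mem_create(mp, agno, btp, &xfbt);
		if (error)
			goto out_btp;

		/* Step 4: create a btree cursor for the in-memory btree. */
		cur = xfs_rmapbt_mem_cursor(mp, NULL, agno, &xfbt);

		/* Step 5: regular xfs_rmap_* queries and updates go here. */

		/* Step 6 (and error unwind): cursor, xfbtree, buftarg, xfile. */
		xfs_btree_del_cursor(cur, error);
		xfbtree_destroy(&xfbt);
	out_btp:
		xfs_free_buftarg(btp);
	out_xfile:
		xfile_destroy(xf);
		return error;
	}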
The XFS transaction manager must not commit buffer log items for buffers backed by an xfile because the log format does not understand updates for devices other than the data device. An ephemeral xfbtree probably will not exist by the time the AIL checkpoints log transactions back into the filesystem, and certainly won't exist during log recovery. For these reasons, any code updating an xfbtree in transaction context must remove the buffer log items from the transaction and write the updates into the backing xfile before committing or cancelling the transaction.h]hXAlthough it is a clever hack to reuse the rmap btree code to handle the staging structure, the ephemeral nature of the in-memory btree block storage presents some challenges of its own. The XFS transaction manager must not commit buffer log items for buffers backed by an xfile because the log format does not understand updates for devices other than the data device. An ephemeral xfbtree probably will not exist by the time the AIL checkpoints log transactions back into the filesystem, and certainly won’t exist during log recovery. For these reasons, any code updating an xfbtree in transaction context must remove the buffer log items from the transaction and write the updates into the backing xfile before committing or cancelling the transaction.}(hj9MhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj(Mhhubh)}(hlThe ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement this functionality as follows:h](hThe }(hjGMhhhNhNubj)}(h``xfbtree_trans_commit``h]hxfbtree_trans_commit}(hjOMhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjGMubh and }(hjGMhhhNhNubj)}(h``xfbtree_trans_cancel``h]hxfbtree_trans_cancel}(hjaMhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjGMubh3 functions implement this functionality as follows:}(hjGMhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj(Mhhubji)}(hhh](h)}(h:Find each buffer log item whose buffer targets the xfile. h]h)}(h9Find each buffer log item whose buffer targets the xfile.h]h9Find each buffer log item whose buffer targets the xfile.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj|Mubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubh)}(h1Record the dirty/ordered status of the log item. h]h)}(h0Record the dirty/ordered status of the log item.h]h0Record the dirty/ordered status of the log item.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjMubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubh)}(h%Detach the log item from the buffer. h]h)}(h$Detach the log item from the buffer.h]h$Detach the log item from the buffer.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjMubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubh)}(h+Queue the buffer to a special delwri list. h]h)}(h*Queue the buffer to a special delwri list.h]h*Queue the buffer to a special delwri list.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjMubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubh)}(hiClear the transaction dirty flag if the only dirty log items were the ones that were detached in step 3. h]h)}(hhClear the transaction dirty flag if the only dirty log items were the ones that were detached in step 3.h]hhClear the transaction dirty flag if the only dirty log items were the ones that were detached in step 3.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjMubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubh)}(h_Submit the delwri list to commit the changes to the xfile, if the updates are being committed. 
h]h)}(h^Submit the delwri list to commit the changes to the xfile, if the updates are being committed.h]h^Submit the delwri list to commit the changes to the xfile, if the updates are being committed.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjMubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhj(MhhhhhM ubh)}(hwAfter removing xfile logged buffers from the transaction in this manner, the transaction can be committed or cancelled.h]hwAfter removing xfile logged buffers from the transaction in this manner, the transaction can be committed or cancelled.}(hjNhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj(Mhhubeh}(h](jj Meh ]h"](!committing logged xfbtree buffersxfbtree_commiteh$]h&]uh1hhjIhhhhhMj}j%NjMsj}j MjMsubeh}(h](jXjIeh ]h"](in-memory b+treesxfbtreeeh$]h&]uh1hhjAhhhhhMj}j/NjIsj}jIjIsubeh}(h](jjAeh ]h"](pageable kernel memoryxfileeh$]h&]uh1hhjv*hhhhhM4j}j9NjAsj}jAjAsubh)}(hhh](h)}(hBulk Loading of Ondisk B+Treesh]hBulk Loading of Ondisk B+Trees}(hjANhhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj>NhhhhhM ubh)}(hXAs mentioned previously, early iterations of online repair built new btree structures by creating a new btree and adding observations individually. Loading a btree one record at a time had a slight advantage of not requiring the incore records to be sorted prior to commit, but was very slow and leaked blocks if the system went down during a repair. Loading records one at a time also meant that repair could not control the loading factor of the blocks in the new btree.h]hXAs mentioned previously, early iterations of online repair built new btree structures by creating a new btree and adding observations individually. Loading a btree one record at a time had a slight advantage of not requiring the incore records to be sorted prior to commit, but was very slow and leaked blocks if the system went down during a repair. Loading records one at a time also meant that repair could not control the loading factor of the blocks in the new btree.}(hjONhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj>Nhhubh)}(hX"Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for rebuilding a btree index from a collection of records -- bulk btree loading. This was implemented rather inefficiently code-wise, since ``xfs_repair`` had separate copy-pasted implementations for each btree type.h](hFortunately, the venerable }(hj]NhhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hjeNhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj]Nubh tool had a more efficient means for rebuilding a btree index from a collection of records -- bulk btree loading. This was implemented rather inefficiently code-wise, since }(hj]NhhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hjwNhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj]Nubh> had separate copy-pasted implementations for each btree type.}(hj]NhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM( hj>Nhhubh)}(hTo prepare for online fsck, each of the four bulk loaders were studied, notes were taken, and the four were refactored into a single generic btree bulk loading mechanism. Those notes in turn have been refreshed and are presented below.h]hTo prepare for online fsck, each of the four bulk loaders were studied, notes were taken, and the four were refactored into a single generic btree bulk loading mechanism. 
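Before turning to bulk loading in detail, the six xfbtree commit steps listed in the previous section can be condensed into a sketch of ``xfbtree_trans_commit``. The delwri queue and submit calls and the log item fields are real kernel interfaces; the xfile test and the detach helper are illustrative::

    LIST_HEAD(buffer_list);
    struct xfs_log_item     *lip, *n;
    struct xfs_buf          *bp;
    bool                    other_dirty_items = false;

    list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
            /* step 1: find buffer log items whose buffer targets the xfile */
            if (!log_item_is_xfile_buffer(lip, xfbt)) {     /* illustrative */
                    if (test_bit(XFS_LI_DIRTY, &lip->li_flags))
                            other_dirty_items = true;
                    continue;
            }

            bp = buf_log_item_to_buffer(lip);               /* illustrative */
            /* steps 2 and 3: record the dirty/ordered state, then detach */
            detach_buf_log_item(lip, bp);                   /* illustrative */
            /* step 4: queue the buffer to a special delwri list */
            xfs_buf_delwri_queue(bp, &buffer_list);
    }

    /* step 5: clear the dirty flag if only the detached items were dirty */
    if (!other_dirty_items)
            tp->t_flags &= ~XFS_TRANS_DIRTY;

    /* step 6: write the queued buffers into the backing xfile */
    error = xfs_buf_delwri_submit(&buffer_list);

``xfbtree_trans_cancel`` would follow the same steps, except that the final submission in step 6 applies only when the updates are being committed.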
Those notes in turn have been refreshed and are presented below.}(hjNhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM- hj>Nhhubh)}(hhh](h)}(hGeometry Computationh]hGeometry Computation}(hjNhhhNhNubah}(h]h ]h"]h$]h&]jj0 uh1hhjNhhhhhM3 ubh)}(hXRThe zeroth step of bulk loading is to assemble the entire record set that will be stored in the new btree, and sort the records. Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the btree from the record set, the type of btree, and any load factor preferences. This information is required for resource reservation.h](hThe zeroth step of bulk loading is to assemble the entire record set that will be stored in the new btree, and sort the records. Next, call }(hjNhhhNhNubj)}(h$``xfs_btree_bload_compute_geometry``h]h xfs_btree_bload_compute_geometry}(hjNhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjNubh to compute the shape of the btree from the record set, the type of btree, and any load factor preferences. This information is required for resource reservation.}(hjNhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM5 hjNhhubh)}(hFirst, the geometry computation computes the minimum and maximum records that will fit in a leaf block from the size of a btree block and the size of the block header. Roughly speaking, the maximum number of records is::h]hFirst, the geometry computation computes the minimum and maximum records that will fit in a leaf block from the size of a btree block and the size of the block header. Roughly speaking, the maximum number of records is:}(hjNhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM; hjNhhubj+)}(h2maxrecs = (block_size - header_size) / record_sizeh]h2maxrecs = (block_size - header_size) / record_size}hjNsbah}(h]h ]h"]h$]h&]hhuh1j+hhhM@ hjNhhubh)}(hThe XFS design specifies that btree blocks should be merged when possible, which means the minimum number of records is half of maxrecs::h]hThe XFS design specifies that btree blocks should be merged when possible, which means the minimum number of records is half of maxrecs:}(hjNhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMB hjNhhubj+)}(hminrecs = maxrecs / 2h]hminrecs = maxrecs / 2}hjNsbah}(h]h ]h"]h$]h&]hhuh1j+hhhME hjNhhubh)}(hX The next variable to determine is the desired loading factor. This must be at least minrecs and no more than maxrecs. Choosing minrecs is undesirable because it wastes half the block. Choosing maxrecs is also undesirable because adding a single record to each newly rebuilt leaf block will cause a tree split, which causes a noticeable drop in performance immediately afterwards. The default loading factor was chosen to be 75% of maxrecs, which provides a reasonably compact structure without any immediate split penalties::h]hX The next variable to determine is the desired loading factor. This must be at least minrecs and no more than maxrecs. Choosing minrecs is undesirable because it wastes half the block. Choosing maxrecs is also undesirable because adding a single record to each newly rebuilt leaf block will cause a tree split, which causes a noticeable drop in performance immediately afterwards. 
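(Looking ahead: the complete computation developed over the rest of this section can be condensed into a self-contained C sketch. Every name below is illustrative; the real helper is ``xfs_btree_bload_compute_geometry``.) ::

    struct sketch_geom {
            unsigned int            nlevels;        /* height of the new btree */
            unsigned long long      nblocks;        /* total blocks needed */
    };

    static void
    sketch_compute_geometry(
            unsigned long long      nr_records,
            unsigned int            leaf_maxrecs,
            unsigned int            node_maxrecs,
            int                     space_is_tight,
            struct sketch_geom      *geo)
    {
            /* default_load_factor = (maxrecs + minrecs) / 2 = 75% of maxrecs */
            unsigned int            leaf_load = space_is_tight ? leaf_maxrecs :
                                    (leaf_maxrecs + leaf_maxrecs / 2) / 2;
            unsigned int            node_load = space_is_tight ? node_maxrecs :
                                    (node_maxrecs + node_maxrecs / 2) / 2;
            unsigned long long      nblocks;

            /* leaf_blocks = ceil(record_count / leaf_load_factor) */
            nblocks = (nr_records + leaf_load - 1) / leaf_load;
            geo->nlevels = 1;
            geo->nblocks = nblocks;

            /* add node levels until the current level needs only one block */
            while (nblocks > 1) {
                    nblocks = (nblocks + node_load - 1) / node_load;
                    geo->nblocks += nblocks;
                    geo->nlevels++;
            }
    }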
The default loading factor was chosen to be 75% of maxrecs, which provides a reasonably compact structure without any immediate split penalties:}(hjOhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMG hjNhhubj+)}(h-default_load_factor = (maxrecs + minrecs) / 2h]h-default_load_factor = (maxrecs + minrecs) / 2}hjOsbah}(h]h ]h"]h$]h&]hhuh1j+hhhMP hjNhhubh)}(hcIf space is tight, the loading factor will be set to maxrecs to try to avoid running out of space::h]hbIf space is tight, the loading factor will be set to maxrecs to try to avoid running out of space:}(hj"OhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMR hjNhhubj+)}(h?leaf_load_factor = enough space ? default_load_factor : maxrecsh]h?leaf_load_factor = enough space ? default_load_factor : maxrecs}hj0Osbah}(h]h ]h"]h$]h&]hhuh1j+hhhMU hjNhhubh)}(hwLoad factor is computed for btree node blocks using the combined size of the btree key and pointer as the record size::h]hvLoad factor is computed for btree node blocks using the combined size of the btree key and pointer as the record size:}(hj>OhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMW hjNhhubj+)}(hmaxrecs = (block_size - header_size) / (key_size + ptr_size) minrecs = maxrecs / 2 node_load_factor = enough space ? default_load_factor : maxrecsh]hmaxrecs = (block_size - header_size) / (key_size + ptr_size) minrecs = maxrecs / 2 node_load_factor = enough space ? default_load_factor : maxrecs}hjLOsbah}(h]h ]h"]h$]h&]hhuh1j+hhhMZ hjNhhubh)}(haOnce that's done, the number of leaf blocks required to store the record set can be computed as::h]hbOnce that’s done, the number of leaf blocks required to store the record set can be computed as:}(hjZOhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM^ hjNhhubj+)}(h3leaf_blocks = ceil(record_count / leaf_load_factor)h]h3leaf_blocks = ceil(record_count / leaf_load_factor)}hjhOsbah}(h]h ]h"]h$]h&]hhuh1j+hhhMa hjNhhubh)}(h]The number of node blocks needed to point to the next level down in the tree is computed as::h]h\The number of node blocks needed to point to the next level down in the tree is computed as:}(hjvOhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMc hjNhhubj+)}(hin_blocks = (n == 0 ? leaf_blocks : node_blocks[n]) node_blocks[n + 1] = ceil(n_blocks / node_load_factor)h]hin_blocks = (n == 0 ? leaf_blocks : node_blocks[n]) node_blocks[n + 1] = ceil(n_blocks / node_load_factor)}hjOsbah}(h]h ]h"]h$]h&]hhuh1j+hhhMf hjNhhubh)}(hThe entire computation is performed recursively until the current level only needs one block. The resulting geometry is as follows:h]hThe entire computation is performed recursively until the current level only needs one block. The resulting geometry is as follows:}(hjOhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMi hjNhhubh)}(hhh](h)}(hFor AG-rooted btrees, this level is the root level, so the height of the new tree is ``level + 1`` and the space needed is the summation of the number of blocks on each level. 
h]h)}(hFor AG-rooted btrees, this level is the root level, so the height of the new tree is ``level + 1`` and the space needed is the summation of the number of blocks on each level.h](hUFor AG-rooted btrees, this level is the root level, so the height of the new tree is }(hjOhhhNhNubj)}(h ``level + 1``h]h level + 1}(hjOhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjOubhM and the space needed is the summation of the number of blocks on each level.}(hjOhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMm hjOubah}(h]h ]h"]h$]h&]uh1hhjOhhhhhNubh)}(hFor inode-rooted btrees where the records in the top level do not fit in the inode fork area, the height is ``level + 2``, the space needed is the summation of the number of blocks on each level, and the inode fork points to the root block. h]h)}(hFor inode-rooted btrees where the records in the top level do not fit in the inode fork area, the height is ``level + 2``, the space needed is the summation of the number of blocks on each level, and the inode fork points to the root block.h](hlFor inode-rooted btrees where the records in the top level do not fit in the inode fork area, the height is }(hjOhhhNhNubj)}(h ``level + 2``h]h level + 2}(hjOhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjOubhw, the space needed is the summation of the number of blocks on each level, and the inode fork points to the root block.}(hjOhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMq hjOubah}(h]h ]h"]h$]h&]uh1hhjOhhhhhNubh)}(hXFor inode-rooted btrees where the records in the top level can be stored in the inode fork area, then the root block can be stored in the inode, the height is ``level + 1``, and the space needed is one less than the summation of the number of blocks on each level. This only becomes relevant when non-bmap btrees gain the ability to root in an inode, which is a future patchset and only included here for completeness. h]h)}(hXFor inode-rooted btrees where the records in the top level can be stored in the inode fork area, then the root block can be stored in the inode, the height is ``level + 1``, and the space needed is one less than the summation of the number of blocks on each level. This only becomes relevant when non-bmap btrees gain the ability to root in an inode, which is a future patchset and only included here for completeness.h](hFor inode-rooted btrees where the records in the top level can be stored in the inode fork area, then the root block can be stored in the inode, the height is }(hjOhhhNhNubj)}(h ``level + 1``h]h level + 1}(hjPhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjOubh, and the space needed is one less than the summation of the number of blocks on each level. This only becomes relevant when non-bmap btrees gain the ability to root in an inode, which is a future patchset and only included here for completeness.}(hjOhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMv hjOubah}(h]h ]h"]h$]h&]uh1hhjOhhhhhNubeh}(h]h ]h"]h$]h&]jJjKuh1hhhhMm hjNhhubh)}(h .. 
_newbt:h]h}(h]h ]h"]h$]h&]jnewbtuh1hhM} hjNhhhhubeh}(h]j6 ah ]h"]geometry computationah$]h&]uh1hhj>NhhhhhM3 ubh)}(hhh](h)}(hReserving New B+Tree Blocksh]hReserving New B+Tree Blocks}(hj` for more details.h](hqEFIs have a role to play during the commit and reaping phases; please see the next section and the section about }(hjPhhhNhNubh)}(h:ref:`reaping`h]j)}(hjPh]hreaping}(hjPhhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjPubah}(h]h ]h"]h$]h&]refdocj refdomainjPreftyperef refexplicitrefwarnjreapinguh1hhhhM hjPubh for more details.}(hjPhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj9Phhubh)}(hX)Proposed patchsets are the `bitmap rework `_ and the `preparation for bulk loading btrees `_.h](hProposed patchsets are the }(hjPhhhNhNubj)}(hs`bitmap rework `_h]h bitmap rework}(hjPhhhNhNubah}(h]h ]h"]h$]h&]name bitmap reworkjj`https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-reworkuh1jhjPubh)}(hc h]h}(h] bitmap-reworkah ]h"] bitmap reworkah$]h&]refurijPuh1hjyKhjPubh and the }(hjPhhhNhNubj)}(h`preparation for bulk loading btrees `_h]h#preparation for bulk loading btrees}(hjPhhhNhNubah}(h]h ]h"]h$]h&]name#preparation for bulk loading btreesjjhhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loadinguh1jhjPubh)}(hk h]h}(h]#preparation-for-bulk-loading-btreesah ]h"]#preparation for bulk loading btreesah$]h&]refurijPuh1hjyKhjPubh.}(hjPhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj9Phhubeh}(h](jX j1Peh ]h"](reserving new b+tree blocksnewbteh$]h&]uh1hhj>NhhhhhM j}jQj'Psj}j1Pj'Psubh)}(hhh](h)}(hWriting the New Treeh]hWriting the New Tree}(hjQhhhNhNubah}(h]h ]h"]h$]h&]jjt uh1hhjQhhhhhM ubh)}(hThis part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims a block from the reserved list, writes the new btree block header, fills the rest of the block with records, and adds the new leaf block to a list of written blocks::h](h1This part is pretty simple -- the btree builder (}(hj%QhhhNhNubj)}(h``xfs_btree_bulkload``h]hxfs_btree_bulkload}(hj-QhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj%Qubh) claims a block from the reserved list, writes the new btree block header, fills the rest of the block with records, and adds the new leaf block to a list of written blocks:}(hj%QhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubj+)}(h;┌────┐ │leaf│ │RRR │ └────┘h]h;┌────┐ │leaf│ │RRR │ └────┘}hjEQsbah}(h]h ]h"]h$]h&]hhuh1j+hhhM hjQhhubh)}(hGSibling pointers are set every time a new block is added to the level::h]hFSibling pointers are set every time a new block is added to the level:}(hjSQhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubj+)}(h┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ └────┘ └────┘ └────┘h]h┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ └────┘ └────┘ └────┘}hjaQsbah}(h]h ]h"]h$]h&]hhuh1j+hhhM hjQhhubh)}(hWhen it finishes writing the record leaf blocks, it moves on to the node blocks To fill a node block, it walks each block in the next level down in the tree to compute the relevant keys and write them into the parent node::h]hWhen it finishes writing the record leaf blocks, it moves on to the node blocks To fill a node block, it walks each block in the next level down in the tree to compute the relevant keys and write them into the parent node:}(hjoQhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubj+)}(hX ┌────┐ ┌────┐ │node│──────→│node│ │PP │←──────│PP │ └────┘ └────┘ ↙ ↘ ↙ ↘ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ 
└────┘ └────┘ └────┘h]hX ┌────┐ ┌────┐ │node│──────→│node│ │PP │←──────│PP │ └────┘ └────┘ ↙ ↘ ↙ ↘ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ └────┘ └────┘ └────┘}hj}Qsbah}(h]h ]h"]h$]h&]hhuh1j+hhhM hjQhhubh)}(hFWhen it reaches the root level, it is ready to commit the new btree!::h]hEWhen it reaches the root level, it is ready to commit the new btree!:}(hjQhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubj+)}(hXs ┌─────────┐ │ root │ │ PP │ └─────────┘ ↙ ↘ ┌────┐ ┌────┐ │node│──────→│node│ │PP │←──────│PP │ └────┘ └────┘ ↙ ↘ ↙ ↘ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ └────┘ └────┘ └────┘h]hXs ┌─────────┐ │ root │ │ PP │ └─────────┘ ↙ ↘ ┌────┐ ┌────┐ │node│──────→│node│ │PP │←──────│PP │ └────┘ └────┘ ↙ ↘ ↙ ↘ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ └────┘ └────┘ └────┘}hjQsbah}(h]h ]h"]h$]h&]hhuh1j+hhhM hjQhhubh)}(hXThe first step to commit the new btree is to persist the btree blocks to disk synchronously. This is a little complicated because a new btree block could have been freed in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to remove the (stale) buffer from the AIL list before it can write the new blocks to disk. Blocks are queued for IO using a delwri list and written in one large batch with ``xfs_buf_delwri_submit``.h](hThe first step to commit the new btree is to persist the btree blocks to disk synchronously. This is a little complicated because a new btree block could have been freed in the recent past, so the builder must use }(hjQhhhNhNubj)}(h``xfs_buf_delwri_queue_here``h]hxfs_buf_delwri_queue_here}(hjQhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh to remove the (stale) buffer from the AIL list before it can write the new blocks to disk. Blocks are queued for IO using a delwri list and written in one large batch with }(hjQhhhNhNubj)}(h``xfs_buf_delwri_submit``h]hxfs_buf_delwri_submit}(hjQhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh.}(hjQhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubh)}(hX/Once the new blocks have been persisted to disk, control returns to the individual repair function that called the bulk loader. The repair function must log the location of the new root in a transaction, clean up the space reservations that were made for the new btree, and reap the old metadata blocks:h]hX/Once the new blocks have been persisted to disk, control returns to the individual repair function that called the bulk loader. The repair function must log the location of the new root in a transaction, clean up the space reservations that were made for the new btree, and reap the old metadata blocks:}(hjQhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubji)}(hhh](h)}(h+Commit the location of the new btree root. h]h)}(h*Commit the location of the new btree root.h]h*Commit the location of the new btree root.}(hjQhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjQubah}(h]h ]h"]h$]h&]uh1hhjQhhhhhNubh)}(hXFor each incore reservation: a. Log Extent Freeing Done (EFD) items for all the space that was consumed by the btree builder. The new EFDs must point to the EFIs attached to the reservation to prevent log recovery from freeing the new blocks. b. For unclaimed portions of incore reservations, create a regular deferred extent free work item to be free the unused space later in the transaction chain. c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the reservation of the committing transaction. 
If the btree loading code suspects this might be about to happen, it must call ``xrep_defer_finish`` to clear out the deferred work and obtain a fresh transaction. h](h)}(hFor each incore reservation:h]hFor each incore reservation:}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjRubji)}(hhh](h)}(hLog Extent Freeing Done (EFD) items for all the space that was consumed by the btree builder. The new EFDs must point to the EFIs attached to the reservation to prevent log recovery from freeing the new blocks. h]h)}(hLog Extent Freeing Done (EFD) items for all the space that was consumed by the btree builder. The new EFDs must point to the EFIs attached to the reservation to prevent log recovery from freeing the new blocks.h]hLog Extent Freeing Done (EFD) items for all the space that was consumed by the btree builder. The new EFDs must point to the EFIs attached to the reservation to prevent log recovery from freeing the new blocks.}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjRubah}(h]h ]h"]h$]h&]uh1hhjRubh)}(hFor unclaimed portions of incore reservations, create a regular deferred extent free work item to free the unused space later in the transaction chain. h]h)}(hFor unclaimed portions of incore reservations, create a regular deferred extent free work item to free the unused space later in the transaction chain.h]hFor unclaimed portions of incore reservations, create a regular deferred extent free work item to free the unused space later in the transaction chain.}(hj3RhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj/Rubah}(h]h ]h"]h$]h&]uh1hhjRubh)}(hXThe EFDs and EFIs logged in steps 2a and 2b must not overrun the reservation of the committing transaction. If the btree loading code suspects this might be about to happen, it must call ``xrep_defer_finish`` to clear out the deferred work and obtain a fresh transaction. h]h)}(hXThe EFDs and EFIs logged in steps 2a and 2b must not overrun the reservation of the committing transaction. If the btree loading code suspects this might be about to happen, it must call ``xrep_defer_finish`` to clear out the deferred work and obtain a fresh transaction.h](hThe EFDs and EFIs logged in steps 2a and 2b must not overrun the reservation of the committing transaction. If the btree loading code suspects this might be about to happen, it must call }(hjKRhhhNhNubj)}(h``xrep_defer_finish``h]hxrep_defer_finish}(hjSRhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjKRubh? to clear out the deferred work and obtain a fresh transaction.}(hjKRhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjGRubah}(h]h ]h"]h$]h&]uh1hhjRubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhjRubeh}(h]h ]h"]h$]h&]uh1hhjQhhhNhNubh)}(haClear out the deferred work a second time to finish the commit and clean the repair transaction. h]h)}(h`Clear out the deferred work a second time to finish the commit and clean the repair transaction.h]h`Clear out the deferred work a second time to finish the commit and clean the repair transaction.}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj}Rubah}(h]h ]h"]h$]h&]uh1hhjQhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjQhhhhhM ubh)}(hXThe transaction rolling in steps 2c and 3 represents a weakness in the repair algorithm, because a log flush and a crash before the end of the reap step can result in space leaking. Online repair functions minimize the chances of this occurring by using very large transactions, which each can accommodate many thousands of block freeing instructions. 
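The reservation walk in steps 2 and 3 can be sketched as a loop. In this fragment, only ``xrep_defer_finish`` is a function named by this document; the reservation fields and the remaining helpers are illustrative::

    list_for_each_entry(resv, &newbt->resv_list, list) {
            /* step 2a: log EFDs for the space the btree builder consumed */
            if (resv->used > 0)
                    log_efd_for_consumed_space(tp, resv);   /* illustrative */

            /* step 2b: defer freeing of any unclaimed reservation */
            if (resv->used < resv->len)
                    defer_free_unused_space(tp, resv);      /* illustrative */

            /* step 2c: roll before the transaction reservation overruns */
            if (transaction_reservation_low(tp)) {          /* illustrative */
                    error = xrep_defer_finish(sc);
                    if (error)
                            return error;
            }
    }

    /* step 3: finish the deferred work and clean the repair transaction */
    return xrep_defer_finish(sc);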
Repair moves on to reaping the old blocks, which will be presented in a subsequent :ref:`section` after a few case studies of bulk loading.h](hXThe transaction rolling in steps 2c and 3 represent a weakness in the repair algorithm, because a log flush and a crash before the end of the reap step can result in space leaking. Online repair functions minimize the chances of this occurring by using very large transactions, which each can accommodate many thousands of block freeing instructions. Repair moves on to reaping the old blocks, which will be presented in a subsequent }(hjRhhhNhNubh)}(h:ref:`section`h]j)}(hjRh]hsection}(hjRhhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjRubah}(h]h ]h"]h$]h&]refdocj refdomainjRreftyperef refexplicitrefwarnjreapinguh1hhhhM hjRubh* after a few case studies of bulk loading.}(hjRhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubh)}(hhh](h)}(h&Case Study: Rebuilding the Inode Indexh]h&Case Study: Rebuilding the Inode Index}(hjRhhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhjRhhhhhM ubh)}(h;The high level process to rebuild the inode index btree is:h]h;The high level process to rebuild the inode index btree is:}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjRhhubji)}(hhh](h)}(hWalk the reverse mapping records to generate ``struct xfs_inobt_rec`` records from the inode chunk information and a bitmap of the old inode btree blocks. h]h)}(hWalk the reverse mapping records to generate ``struct xfs_inobt_rec`` records from the inode chunk information and a bitmap of the old inode btree blocks.h](h-Walk the reverse mapping records to generate }(hjRhhhNhNubj)}(h``struct xfs_inobt_rec``h]hstruct xfs_inobt_rec}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubhU records from the inode chunk information and a bitmap of the old inode btree blocks.}(hjRhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjRubah}(h]h ]h"]h$]h&]uh1hhjRhhhhhNubh)}(h1Append the records to an xfarray in inode order. h]h)}(h0Append the records to an xfarray in inode order.h]h0Append the records to an xfarray in inode order.}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjSubah}(h]h ]h"]h$]h&]uh1hhjRhhhhhNubh)}(hUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the inode btree. If the free space inode btree is enabled, call it again to estimate the geometry of the finobt. h]h)}(hUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the inode btree. If the free space inode btree is enabled, call it again to estimate the geometry of the finobt.h](hUse the }(hj5ShhhNhNubj)}(h$``xfs_btree_bload_compute_geometry``h]h xfs_btree_bload_compute_geometry}(hj=ShhhNhNubah}(h]h ]h"]h$]h&]uh1jhj5Subh function to compute the number of blocks needed for the inode btree. If the free space inode btree is enabled, call it again to estimate the geometry of the finobt.}(hj5ShhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj1Subah}(h]h ]h"]h$]h&]uh1hhjRhhhhhNubh)}(h=Allocate the number of blocks computed in the previous step. h]h)}(hCommit the location of the new btree root block(s) to the AGI.h]h>Commit the location of the new btree root block(s) to the AGI.}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjSubah}(h]h ]h"]h$]h&]uh1hhjRhhhhhNubh)}(h>Reap the old btree blocks using the bitmap created in step 1. 
h]h)}(h=Reap the old btree blocks using the bitmap created in step 1.h]h=Reap the old btree blocks using the bitmap created in step 1.}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjSubah}(h]h ]h"]h$]h&]uh1hhjRhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjRhhhhhM ubh)}(hDetails are as follows.h]hDetails are as follows.}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM! hjRhhubh)}(hX7The inode btree maps inumbers to the ondisk location of the associated inode records, which means that the inode btrees can be rebuilt from the reverse mapping information. Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` marks the location of the old inode btree blocks. Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the location of at least one inode cluster buffer. A cluster is the smallest number of ondisk inodes that can be allocated or freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.h](hThe inode btree maps inumbers to the ondisk location of the associated inode records, which means that the inode btrees can be rebuilt from the reverse mapping information. Reverse mapping records with an owner of }(hjShhhNhNubj)}(h``XFS_RMAP_OWN_INOBT``h]hXFS_RMAP_OWN_INOBT}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1jhjSubh` marks the location of the old inode btree blocks. Each reverse mapping record with an owner of }(hjShhhNhNubj)}(h``XFS_RMAP_OWN_INODES``h]hXFS_RMAP_OWN_INODES}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1jhjSubh marks the location of at least one inode cluster buffer. A cluster is the smallest number of ondisk inodes that can be allocated or freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.}(hjShhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM# hjRhhubh)}(hXFor the space represented by each inode cluster, ensure that there are no records in the free space btrees nor any records in the reference count btree. If there are, the space metadata inconsistencies are reason enough to abort the operation. Otherwise, read each cluster buffer to check that its contents appear to be ondisk inodes and to decide if the file is allocated (``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``). Accumulate the results of successive inode cluster buffer reads until there is enough information to fill a single inode chunk record, which is 64 consecutive numbers in the inumber keyspace. If the chunk is sparse, the chunk record may include holes.h](hXvFor the space represented by each inode cluster, ensure that there are no records in the free space btrees nor any records in the reference count btree. If there are, the space metadata inconsistencies are reason enough to abort the operation. Otherwise, read each cluster buffer to check that its contents appear to be ondisk inodes and to decide if the file is allocated (}(hjThhhNhNubj)}(h``xfs_dinode.i_mode != 0``h]hxfs_dinode.i_mode != 0}(hjThhhNhNubah}(h]h ]h"]h$]h&]uh1jhjTubh ) or free (}(hjThhhNhNubj)}(h``xfs_dinode.i_mode == 0``h]hxfs_dinode.i_mode == 0}(hj-ThhhNhNubah}(h]h ]h"]h$]h&]uh1jhjTubh). Accumulate the results of successive inode cluster buffer reads until there is enough information to fill a single inode chunk record, which is 64 consecutive numbers in the inumber keyspace. If the chunk is sparse, the chunk record may include holes.}(hjThhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM- hjRhhubh)}(hX3Once the repair function accumulates one chunk's worth of data, it calls ``xfarray_append`` to add the inode btree record to the xfarray. 
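A sketch of that accumulation, using the incore form of the inode chunk record; ``xfarray_append`` and ``XFS_INODES_PER_CHUNK`` are real interfaces, while the chunk-assembly helpers are illustrative::

    struct xfs_inobt_rec_incore     irec = { };

    irec.ir_startino = chunk_agino;         /* 64-inumber aligned start */
    irec.ir_count = XFS_INODES_PER_CHUNK;   /* 64 consecutive inumbers */

    /* fold each cluster buffer's allocated/free state into the record */
    for_each_cluster_in_chunk(cluster, chunk_agino) {       /* illustrative */
            /* a zero i_mode means the ondisk inode is free */
            accumulate_cluster_free_mask(&irec, cluster);   /* illustrative */
    }

    error = xfarray_append(ichunk_records, &irec);
    if (error)
            return error;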
This xfarray is walked twice during the btree creation step -- once to populate the inode btree with all inode chunk records, and a second time to populate the free inode btree with records for chunks that have free non-sparse inodes. The number of records for the inode btree is the number of xfarray records, but the record count for the free inode btree has to be computed as inode chunk records are stored in the xfarray.h](hKOnce the repair function accumulates one chunk’s worth of data, it calls }(hjEThhhNhNubj)}(h``xfarray_append``h]hxfarray_append}(hjMThhhNhNubah}(h]h ]h"]h$]h&]uh1jhjETubhX to add the inode btree record to the xfarray. This xfarray is walked twice during the btree creation step -- once to populate the inode btree with all inode chunk records, and a second time to populate the free inode btree with records for chunks that have free non-sparse inodes. The number of records for the inode btree is the number of xfarray records, but the record count for the free inode btree has to be computed as inode chunk records are stored in the xfarray.}(hjEThhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM9 hjRhhubh)}(hThe proposed patchset is the `AG btree repair `_ series.h](hThe proposed patchset is the }(hjeThhhNhNubj)}(hq`AG btree repair `_h]hAG btree repair}(hjmThhhNhNubah}(h]h ]h"]h$]h&]nameAG btree repairjj\https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btreesuh1jhjeTubh)}(h_ h]h}(h]ag-btree-repairah ]h"]ag btree repairah$]h&]refurij}Tuh1hjyKhjeTubh series.}(hjeThhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMB hjRhhubeh}(h]j ah ]h"]&case study: rebuilding the inode indexah$]h&]uh1hhjQhhhhhM ubh)}(hhh](h)}(h1Case Study: Rebuilding the Space Reference Countsh]h1Case Study: Rebuilding the Space Reference Counts}(hjThhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhjThhhhhMH ubh)}(hXReverse mapping records are used to rebuild the reference count information. Reference counts are required for correct operation of copy on write for shared file data. Imagine the reverse mapping entries as rectangles representing extents of physical blocks, and that the rectangles can be laid down to allow them to overlap each other. From the diagram below, it is apparent that a reference count record must start or end wherever the height of the stack changes. In other words, the record emission stimulus is level-triggered::h]hXReverse mapping records are used to rebuild the reference count information. Reference counts are required for correct operation of copy on write for shared file data. Imagine the reverse mapping entries as rectangles representing extents of physical blocks, and that the rectangles can be laid down to allow them to overlap each other. From the diagram below, it is apparent that a reference count record must start or end wherever the height of the stack changes. In other words, the record emission stimulus is level-triggered:}(hjThhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMJ hjThhubj+)}(hX █ ███ ██ █████ ████ ███ ██████ ██ ████ ███████████ ████ █████████ ████████████████████████████████ ███████████ ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ 2 1 23 21 3 43 234 2123 1 01 2 3 0h]hX █ ███ ██ █████ ████ ███ ██████ ██ ████ ███████████ ████ █████████ ████████████████████████████████ ███████████ ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ 2 1 23 21 3 43 234 2123 1 01 2 3 0}hjTsbah}(h]h ]h"]h$]h&]hhuh1j+hhhMT hjThhubh)}(hXPThe ondisk reference count btree does not store the refcount == 0 cases because the free space btree already records which blocks are free. 
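As a small worked example of the level-triggered rule (the extents are invented for illustration), suppose three reverse mappings cover blocks [0,4), [2,8), and [6,8)::

    mappings:  A = [0,4)   B = [2,8)   C = [6,8)

    block:     0  1  2  3  4  5  6  7
    height:    1  1  2  2  1  1  2  2

The stack height changes at blocks 2, 4, and 6, so the emitted refcount records are [2,4) = 2 and [6,8) = 2. The single-owner ranges [0,2) and [4,6), and the free space on either side, are not recorded, for the reasons given in the surrounding text.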
Extents being used to stage copy-on-write operations should be the only records with refcount == 1. Single-owner file blocks aren't recorded in either the free space or the reference count btrees.h]hXRThe ondisk reference count btree does not store the refcount == 0 cases because the free space btree already records which blocks are free. Extents being used to stage copy-on-write operations should be the only records with refcount == 1. Single-owner file blocks aren’t recorded in either the free space or the reference count btrees.}(hjThhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM[ hjThhubh)}(h?The high level process to rebuild the reference count btree is:h]h?The high level process to rebuild the reference count btree is:}(hjThhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMb hjThhubji)}(hhh](h)}(hXWalk the reverse mapping records to generate ``struct xfs_refcount_irec`` records for any space having more than one reverse mapping and add them to the xfarray. Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray because these are extents allocated to stage a copy on write operation and are tracked in the refcount btree. Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old refcount btree blocks. h](h)}(hXWWalk the reverse mapping records to generate ``struct xfs_refcount_irec`` records for any space having more than one reverse mapping and add them to the xfarray. Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray because these are extents allocated to stage a copy on write operation and are tracked in the refcount btree.h](h-Walk the reverse mapping records to generate }(hjThhhNhNubj)}(h``struct xfs_refcount_irec``h]hstruct xfs_refcount_irec}(hjThhhNhNubah}(h]h ]h"]h$]h&]uh1jhjTubhn records for any space having more than one reverse mapping and add them to the xfarray. Any records owned by }(hjThhhNhNubj)}(h``XFS_RMAP_OWN_COW``h]hXFS_RMAP_OWN_COW}(hjUhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjTubh are also added to the xfarray because these are extents allocated to stage a copy on write operation and are tracked in the refcount btree.}(hjThhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMd hjTubh)}(h_Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old refcount btree blocks.h](hUse any records owned by }(hjUhhhNhNubj)}(h``XFS_RMAP_OWN_REFC``h]hXFS_RMAP_OWN_REFC}(hj&UhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjUubh1 to create a bitmap of old refcount btree blocks.}(hjUhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMk hjTubeh}(h]h ]h"]h$]h&]uh1hhjThhhhhNubh)}(hSort the records in physical extent order, putting the CoW staging extents at the end of the xfarray. This matches the sorting order of records in the refcount btree. h]h)}(hSort the records in physical extent order, putting the CoW staging extents at the end of the xfarray. This matches the sorting order of records in the refcount btree.h]hSort the records in physical extent order, putting the CoW staging extents at the end of the xfarray. This matches the sorting order of records in the refcount btree.}(hjHUhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMn hjDUubah}(h]h ]h"]h$]h&]uh1hhjThhhhhNubh)}(hoUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the new tree. 
h]h)}(hnUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the new tree.h](hUse the }(hj`UhhhNhNubj)}(h$``xfs_btree_bload_compute_geometry``h]h xfs_btree_bload_compute_geometry}(hjhUhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj`UubhB function to compute the number of blocks needed for the new tree.}(hj`UhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMr hj\Uubah}(h]h ]h"]h$]h&]uh1hhjThhhhhNubh)}(h=Allocate the number of blocks computed in the previous step. h]h)}(hReap the old btree blocks using the bitmap created in step 1. h]h)}(h=Reap the old btree blocks using the bitmap created in step 1.h]h=Reap the old btree blocks using the bitmap created in step 1.}(hjUhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM| hjUubah}(h]h ]h"]h$]h&]uh1hhjThhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjThhhhhMd ubh)}(hDetails are as follows; the same algorithm is used by ``xfs_repair`` to generate refcount information from reverse mapping records.h](h6Details are as follows; the same algorithm is used by }(hjUhhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hjVhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjUubh? to generate refcount information from reverse mapping records.}(hjUhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM~ hjThhubh)}(hhh]h)}(hX\Until the reverse mapping btree runs out of records: - Retrieve the next record from the btree and put it in a bag. - Collect all records with the same starting block from the btree and put them in the bag. - While the bag isn't empty: - Among the mappings in the bag, compute the lowest block number where the reference count changes. This position will be either the starting block number of the next unprocessed reverse mapping or the next block after the shortest mapping in the bag. - Remove all mappings from the bag that end at this position. - Collect all reverse mappings that start at this position from the btree and put them in the bag. - If the size of the bag changed and is greater than one, create a new refcount record associating the block number range that we just walked to the size of the bag. h](h)}(h4Until the reverse mapping btree runs out of records:h]h4Until the reverse mapping btree runs out of records:}(hj%VhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj!Vubh)}(hhh](h)}(h=Retrieve the next record from the btree and put it in a bag. h]h)}(h` section. Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and removed via ``xfarray_unset``. Bag members are examined through ``xfarray_iter`` loops.h](hLThe bag-like structure in this case is a type 2 xfarray as discussed in the }(hjVhhhNhNubh)}(h7:ref:`xfarray access patterns`h]j)}(hjWh]hxfarray access patterns}(hjWhhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjWubah}(h]h ]h"]h$]h&]refdocj refdomainjWreftyperef refexplicitrefwarnjxfarray_access_patternsuh1hhhhM hjVubh6 section. Reverse mappings are added to the bag using }(hjVhhhNhNubj)}(h``xfarray_store_anywhere``h]hxfarray_store_anywhere}(hj%WhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjVubh and removed via }(hjVhhhNhNubj)}(h``xfarray_unset``h]h xfarray_unset}(hj7WhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjVubh#. 
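A condensed sketch of the sweep follows. The xfarray calls are the ones named in this section; the cursor and bag helpers wrapping them are illustrative, and a real implementation would also skip emission when the height did not actually change::

    while (rmaps_remain(rmap_cur)) {
            /* prime the bag with the mappings at the next start block */
            rec_start = next_rmap_startblock(rmap_cur);
            add_rmaps_starting_at(bag, rmap_cur, rec_start);
                                    /* wraps xfarray_store_anywhere */

            while (!bag_is_empty(bag)) {
                    /* lowest block where the reference count changes */
                    cbno = min(next_rmap_startblock(rmap_cur),
                               shortest_mapping_end(bag));

                    /* close out the range just walked, if it was shared */
                    if (bag_size(bag) > 1)
                            emit_refcount_record(out, rec_start, cbno,
                                            bag_size(bag));

                    remove_rmaps_ending_at(bag, cbno);  /* xfarray_unset */
                    add_rmaps_starting_at(bag, rmap_cur, cbno);
                    rec_start = cbno;
            }
    }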
Bag members are examined through }(hjVhhhNhNubj)}(h``xfarray_iter``h]h xfarray_iter}(hjIWhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjVubh loops.}(hjVhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjThhubh)}(hThe proposed patchset is the `AG btree repair `_ series.h](hThe proposed patchset is the }(hjaWhhhNhNubj)}(hq`AG btree repair `_h]hAG btree repair}(hjiWhhhNhNubah}(h]h ]h"]h$]h&]nameAG btree repairjj\https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btreesuh1jhjaWubh)}(h_ h]h}(h]id4ah ]h"]h$]ag btree repairah&]refurijyWuh1hjyKhjaWubh series.}(hjaWhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjThhubeh}(h]j ah ]h"]1case study: rebuilding the space reference countsah$]h&]uh1hhjQhhhhhMH ubh)}(hhh](h)}(h0Case Study: Rebuilding File Fork Mapping Indicesh]h0Case Study: Rebuilding File Fork Mapping Indices}(hjWhhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhjWhhhhhM ubh)}(hDThe high level process to rebuild a data/attr fork mapping btree is:h]hDThe high level process to rebuild a data/attr fork mapping btree is:}(hjWhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjWhhubji)}(hhh](h)}(hWalk the reverse mapping records to generate ``struct xfs_bmbt_rec`` records from the reverse mapping records for that inode and fork. Append these records to an xfarray. Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` records. h]h)}(hWalk the reverse mapping records to generate ``struct xfs_bmbt_rec`` records from the reverse mapping records for that inode and fork. Append these records to an xfarray. Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` records.h](h-Walk the reverse mapping records to generate }(hjWhhhNhNubj)}(h``struct xfs_bmbt_rec``h]hstruct xfs_bmbt_rec}(hjWhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjWubh records from the reverse mapping records for that inode and fork. Append these records to an xfarray. Compute the bitmap of the old bmap btree blocks from the }(hjWhhhNhNubj)}(h``BMBT_BLOCK``h]h BMBT_BLOCK}(hjWhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjWubh records.}(hjWhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjWubah}(h]h ]h"]h$]h&]uh1hhjWhhhhhNubh)}(hoUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the new tree. h]h)}(hnUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the new tree.h](hUse the }(hjWhhhNhNubj)}(h$``xfs_btree_bload_compute_geometry``h]h xfs_btree_bload_compute_geometry}(hjXhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjWubhB function to compute the number of blocks needed for the new tree.}(hjWhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjWubah}(h]h ]h"]h$]h&]uh1hhjWhhhhhNubh)}(h'Sort the records in file offset order. h]h)}(h&Sort the records in file offset order.h]h&Sort the records in file offset order.}(hj$XhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj Xubah}(h]h ]h"]h$]h&]uh1hhjWhhhhhNubh)}(hIf the extent records would fit in the inode fork immediate area, commit the records to that immediate area and skip to step 8. h]h)}(hIf the extent records would fit in the inode fork immediate area, commit the records to that immediate area and skip to step 8.h]hIf the extent records would fit in the inode fork immediate area, commit the records to that immediate area and skip to step 8.}(hjReap the old btree blocks using the bitmap created in step 1. 
h]h)}(h=Reap the old btree blocks using the bitmap created in step 1.h]h=Reap the old btree blocks using the bitmap created in step 1.}(hjXhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjXubah}(h]h ]h"]h$]h&]uh1hhjWhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjWhhhhhM ubh)}(hXThere are some complications here: First, it's possible to move the fork offset to adjust the sizes of the immediate areas if the data and attr forks are not both in BMBT format. Second, if there are sufficiently few fork mappings, it may be possible to use EXTENTS format instead of BMBT, which may require a conversion. Third, the incore extent map must be reloaded carefully to avoid disturbing any delayed allocation extents.h]hXThere are some complications here: First, it’s possible to move the fork offset to adjust the sizes of the immediate areas if the data and attr forks are not both in BMBT format. Second, if there are sufficiently few fork mappings, it may be possible to use EXTENTS format instead of BMBT, which may require a conversion. Third, the incore extent map must be reloaded carefully to avoid disturbing any delayed allocation extents.}(hjXhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjWhhubh)}(hThe proposed patchset is the `file mapping repair `_ series.h](hThe proposed patchset is the }(hjXhhhNhNubj)}(hy`file mapping repair `_h]hfile mapping repair}(hjXhhhNhNubah}(h]h ]h"]h$]h&]namefile mapping repairjj`https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappingsuh1jhjXubh)}(hc h]h}(h]file-mapping-repairah ]h"]file mapping repairah$]h&]refurijXuh1hjyKhjXubh series.}(hjXhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjWhhubh)}(h .. _reaping:h]h}(h]h ]h"]h$]h&]jreapinguh1hhM hjWhhhhubeh}(h]j ah ]h"]0case study: rebuilding file fork mapping indicesah$]h&]uh1hhjQhhhhhM ubeh}(h]jz ah ]h"]writing the new treeah$]h&]uh1hhj>NhhhhhM ubeh}(h]j ah ]h"]bulk loading of ondisk b+treesah$]h&]uh1hhjv*hhhhhM ubh)}(hhh](h)}(hReaping Old Metadata Blocksh]hReaping Old Metadata Blocks}(hj)YhhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj&YhhhhhM ubh)}(hXWhenever online fsck builds a new data structure to replace one that is suspect, there is a question of how to find and dispose of the blocks that belonged to the old structure. The laziest method of course is not to deal with them at all, but this slowly leads to service degradations as space leaks out of the filesystem. Hopefully, someone will schedule a rebuild of the free space information to plug all those leaks. Offline repair rebuilds all space metadata after recording the usage of the files and directories that it decides not to clear, hence it can build new structures in the discovered free space and avoid the question of reaping.h]hXWhenever online fsck builds a new data structure to replace one that is suspect, there is a question of how to find and dispose of the blocks that belonged to the old structure. The laziest method of course is not to deal with them at all, but this slowly leads to service degradations as space leaks out of the filesystem. Hopefully, someone will schedule a rebuild of the free space information to plug all those leaks. Offline repair rebuilds all space metadata after recording the usage of the files and directories that it decides not to clear, hence it can build new structures in the discovered free space and avoid the question of reaping.}(hj7YhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubh)}(hXAs part of a repair, online fsck relies heavily on the reverse mapping records to find space that is owned by the corresponding rmap owner yet truly free. 
Cross referencing rmap records with other rmap records is necessary because there may be other data structures that also think they own some of those blocks (e.g. crosslinked trees). Permitting the block allocator to hand them out again will not push the system towards consistency.h]hXAs part of a repair, online fsck relies heavily on the reverse mapping records to find space that is owned by the corresponding rmap owner yet truly free. Cross referencing rmap records with other rmap records is necessary because there may be other data structures that also think they own some of those blocks (e.g. crosslinked trees). Permitting the block allocator to hand them out again will not push the system towards consistency.}(hjEYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubh)}(h_For space metadata, the process of finding extents to dispose of generally follows this format:h]h_For space metadata, the process of finding extents to dispose of generally follows this format:}(hjSYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubji)}(hhh](h)}(hCreate a bitmap of space used by data structures that must be preserved. The space reservations used to create the new metadata can be used here if the same rmap owner code is used to denote all of the objects being rebuilt. h]h)}(hCreate a bitmap of space used by data structures that must be preserved. The space reservations used to create the new metadata can be used here if the same rmap owner code is used to denote all of the objects being rebuilt.h]hCreate a bitmap of space used by data structures that must be preserved. The space reservations used to create the new metadata can be used here if the same rmap owner code is used to denote all of the objects being rebuilt.}(hjhYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdYubah}(h]h ]h"]h$]h&]uh1hhjaYhhhhhNubh)}(hSurvey the reverse mapping data to create a bitmap of space owned by the same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved. h]h)}(hSurvey the reverse mapping data to create a bitmap of space owned by the same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.h](hNSurvey the reverse mapping data to create a bitmap of space owned by the same }(hjYhhhNhNubj)}(h``XFS_RMAP_OWN_*``h]hXFS_RMAP_OWN_*}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjYubh1 number for the metadata that is being preserved.}(hjYhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj|Yubah}(h]h ]h"]h$]h&]uh1hhjaYhhhhhNubh)}(hUse the bitmap disunion operator to subtract (1) from (2). The remaining set bits represent candidate extents that could be freed. The process moves on to step 4 below. h]h)}(hUse the bitmap disunion operator to subtract (1) from (2). The remaining set bits represent candidate extents that could be freed. The process moves on to step 4 below.h]hUse the bitmap disunion operator to subtract (1) from (2). The remaining set bits represent candidate extents that could be freed. The process moves on to step 4 below.}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjYubah}(h]h ]h"]h$]h&]uh1hhjaYhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhj&YhhhhhM ubh)}(hXDRepairs for file-based metadata such as extended attributes, directories, symbolic links, quota files and realtime bitmaps are performed by building a new structure attached to a temporary file and exchanging all mappings in the file forks. 
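The three-step format above amounts to a bitmap subtraction. A minimal sketch, with the bitmap type and helpers as illustrative stand-ins for the kernel's scrub bitmap code::

    struct xbitmap  preserved;      /* step 1: space that must be kept */
    struct xbitmap  owned;          /* step 2: same XFS_RMAP_OWN_* owner */

    /* ... populate both bitmaps as described above ... */

    /* step 3: owned &= ~preserved */
    error = xbitmap_disunion(&owned, &preserved);
    if (error)
            return error;

    /* every extent still set in "owned" is a candidate for disposal */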
Afterward, the mappings in the old file fork are the candidate blocks for disposal.h]hXDRepairs for file-based metadata such as extended attributes, directories, symbolic links, quota files and realtime bitmaps are performed by building a new structure attached to a temporary file and exchanging all mappings in the file forks. Afterward, the mappings in the old file fork are the candidate blocks for disposal.}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubh)}(h7The process for disposing of old extents is as follows:h]h7The process for disposing of old extents is as follows:}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubji)}(hhh](h)}(hXBFor each candidate extent, count the number of reverse mapping records for the first block in that extent that do not have the same rmap owner for the data structure being repaired. - If zero, the block has a single owner and can be freed. - If not, the block is part of a crosslinked structure and must not be freed. h](h)}(hFor each candidate extent, count the number of reverse mapping records for the first block in that extent that do not have the same rmap owner for the data structure being repaired.h]hFor each candidate extent, count the number of reverse mapping records for the first block in that extent that do not have the same rmap owner for the data structure being repaired.}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjYubh)}(hhh](h)}(h8If zero, the block has a single owner and can be freed. h]h)}(h7If zero, the block has a single owner and can be freed.h]h7If zero, the block has a single owner and can be freed.}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjYubah}(h]h ]h"]h$]h&]uh1hhjYubh)}(hLIf not, the block is part of a crosslinked structure and must not be freed. h]h)}(hKIf not, the block is part of a crosslinked structure and must not be freed.h]hKIf not, the block is part of a crosslinked structure and must not be freed.}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjZubah}(h]h ]h"]h$]h&]uh1hhjYubeh}(h]h ]h"]h$]h&]jJjKuh1hhhhM hjYubeh}(h]h ]h"]h$]h&]uh1hhjYhhhNhNubh)}(hStarting with the next block in the extent, figure out how many more blocks have the same zero/nonzero other owner status as that first block. h]h)}(hStarting with the next block in the extent, figure out how many more blocks have the same zero/nonzero other owner status as that first block.h]hStarting with the next block in the extent, figure out how many more blocks have the same zero/nonzero other owner status as that first block.}(hj8ZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj4Zubah}(h]h ]h"]h$]h&]uh1hhjYhhhhhNubh)}(hIf the region is crosslinked, delete the reverse mapping entry for the structure being repaired and move on to the next region. h]h)}(hIf the region is crosslinked, delete the reverse mapping entry for the structure being repaired and move on to the next region.h]hIf the region is crosslinked, delete the reverse mapping entry for the structure being repaired and move on to the next region.}(hjPZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjLZubah}(h]h ]h"]h$]h&]uh1hhjYhhhhhNubh)}(htIf the region is to be freed, mark any corresponding buffers in the buffer cache as stale to prevent log writeback. h]h)}(hsIf the region is to be freed, mark any corresponding buffers in the buffer cache as stale to prevent log writeback.h]hsIf the region is to be freed, mark any corresponding buffers in the buffer cache as stale to prevent log writeback.}(hjhZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdZubah}(h]h ]h"]h$]h&]uh1hhjYhhhhhNubh)}(hFree the region and move on. 
h]h)}(hFree the region and move on.h]hFree the region and move on.}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj|Zubah}(h]h ]h"]h$]h&]uh1hhjYhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkstartKuh1jhhj&YhhhhhM ubh)}(hHowever, there is one complication to this procedure. Transactions are of finite size, so the reaping process must be careful to roll the transactions to avoid overruns. Overruns come from two sources:h]hHowever, there is one complication to this procedure. Transactions are of finite size, so the reaping process must be careful to roll the transactions to avoid overruns. Overruns come from two sources:}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubji)}(hhh](h)}(h:EFIs logged on behalf of space that is no longer occupied h]h)}(h9EFIs logged on behalf of space that is no longer occupiedh]h9EFIs logged on behalf of space that is no longer occupied}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjZubah}(h]h ]h"]h$]h&]uh1hhjZhhhhhNubh)}(h#Log items for buffer invalidations h]h)}(h"Log items for buffer invalidationsh]h"Log items for buffer invalidations}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjZubah}(h]h ]h"]h$]h&]uh1hhjZhhhhhNubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhj&YhhhhhM ubh)}(hThis is also a window in which a crash during the reaping process can leak blocks. As stated earlier, online repair functions use very large transactions to minimize the chances of this occurring.h]hThis is also a window in which a crash during the reaping process can leak blocks. As stated earlier, online repair functions use very large transactions to minimize the chances of this occurring.}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubh)}(hThe proposed patchset is the `preparation for bulk loading btrees `_ series.h](hThe proposed patchset is the }(hjZhhhNhNubj)}(h`preparation for bulk loading btrees `_h]h#preparation for bulk loading btrees}(hjZhhhNhNubah}(h]h ]h"]h$]h&]name#preparation for bulk loading btreesjjhhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loadinguh1jhjZubh)}(hk h]h}(h]id5ah ]h"]h$]#preparation for bulk loading btreesah&]refurij[uh1hjyKhjZubh series.}(hjZhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubh)}(hhh](h)}(h0Case Study: Reaping After a Regular Btree Repairh]h0Case Study: Reaping After a Regular Btree Repair}(hj#[hhhNhNubah}(h]h ]h"]h$]h&]jj0 uh1hhj [hhhhhM$ ubh)}(hX:Old reference count and inode btrees are the easiest to reap because they have rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees. Creating a list of extents to reap the old btree blocks is quite simple, conceptually:h](hvOld reference count and inode btrees are the easiest to reap because they have rmap records with special owner codes: }(hj1[hhhNhNubj)}(h``XFS_RMAP_OWN_REFC``h]hXFS_RMAP_OWN_REFC}(hj9[hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj1[ubh for the refcount btree, and }(hj1[hhhNhNubj)}(h``XFS_RMAP_OWN_INOBT``h]hXFS_RMAP_OWN_INOBT}(hjK[hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj1[ubh| for the inode and free inode btrees. Creating a list of extents to reap the old btree blocks is quite simple, conceptually:}(hj1[hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM& hj [hhubji)}(hhh](h)}(hJLock the relevant AGI/AGF header buffers to prevent allocation and frees. 
h]h)}(hILock the relevant AGI/AGF header buffers to prevent allocation and frees.h]hILock the relevant AGI/AGF header buffers to prevent allocation and frees.}(hjj[hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM, hjf[ubah}(h]h ]h"]h$]h&]uh1hhjc[hhhhhNubh)}(hFor each reverse mapping record with an rmap owner corresponding to the metadata structure being rebuilt, set the corresponding range in a bitmap. h]h)}(hFor each reverse mapping record with an rmap owner corresponding to the metadata structure being rebuilt, set the corresponding range in a bitmap.h]hFor each reverse mapping record with an rmap owner corresponding to the metadata structure being rebuilt, set the corresponding range in a bitmap.}(hj[hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM. hj~[ubah}(h]h ]h"]h$]h&]uh1hhjc[hhhhhNubh)}(h~Walk the current data structures that have the same rmap owner. For each block visited, clear that range in the above bitmap. h]h)}(h}Walk the current data structures that have the same rmap owner. For each block visited, clear that range in the above bitmap.h]h}Walk the current data structures that have the same rmap owner. For each block visited, clear that range in the above bitmap.}(hj[hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM1 hj[ubah}(h]h ]h"]h$]h&]uh1hhjc[hhhhhNubh)}(hEach set bit in the bitmap represents a block that could be a block from the old data structures and hence is a candidate for reaping. In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)`` are the blocks that might be freeable. h]h)}(hEach set bit in the bitmap represents a block that could be a block from the old data structures and hence is a candidate for reaping. In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)`` are the blocks that might be freeable.h](hEach set bit in the bitmap represents a block that could be a block from the old data structures and hence is a candidate for reaping. In other words, }(hj[hhhNhNubj)}(h7``(rmap_records_owned_by & ~blocks_reachable_by_walk)``h]h3(rmap_records_owned_by & ~blocks_reachable_by_walk)}(hj[hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj[ubh' are the blocks that might be freeable.}(hj[hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM4 hj[ubah}(h]h ]h"]h$]h&]uh1hhjc[hhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhj [hhhhhM, ubh)}(hIf it is possible to maintain the AGF lock throughout the repair (which is the common case), then step 2 can be performed at the same time as the reverse mapping record walk that creates the records for the new btree.h]hIf it is possible to maintain the AGF lock throughout the repair (which is the common case), then step 2 can be performed at the same time as the reverse mapping record walk that creates the records for the new btree.}(hj[hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM9 hj [hhubeh}(h]j6 ah ]h"]0case study: reaping after a regular btree repairah$]h&]uh1hhj&YhhhhhM$ ubh)}(hhh](h)}(h-Case Study: Rebuilding the Free Space Indicesh]h-Case Study: Rebuilding the Free Space Indices}(hj[hhhNhNubah}(h]h ]h"]h$]h&]jjR uh1hhj[hhhhhM> ubh)}(hCommit the locations of the new btree root blocks to the AGF. h]h)}(h=Commit the locations of the new btree root blocks to the AGF.h]h=Commit the locations of the new btree root blocks to the AGF.}(hj\hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMQ hj\ubah}(h]h ]h"]h$]h&]uh1hhj\hhhhhNubh)}(hReap the old btree blocks by looking for space that is not recorded by the reverse mapping btree, the new free space btrees, or the AGFL. 
h]h)}(hReap the old btree blocks by looking for space that is not recorded by the reverse mapping btree, the new free space btrees, or the AGFL.h]hReap the old btree blocks by looking for space that is not recorded by the reverse mapping btree, the new free space btrees, or the AGFL.}(hj\hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMS hj\ubah}(h]h ]h"]h$]h&]uh1hhj\hhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhj[hhhhhMB ubh)}(hXRepairing the free space btrees has three key complications over a regular btree repair:h]hXRepairing the free space btrees has three key complications over a regular btree repair:}(hj\hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMV hj[hhubh)}(hFirst, free space is not explicitly tracked in the reverse mapping records. Hence, the new free space records must be inferred from gaps in the physical space component of the keyspace of the reverse mapping btree.h]hFirst, free space is not explicitly tracked in the reverse mapping records. Hence, the new free space records must be inferred from gaps in the physical space component of the keyspace of the reverse mapping btree.}(hj]hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMY hj[hhubh)}(hXxSecond, free space repairs cannot use the common btree reservation code because new blocks are reserved out of the free space btrees. This is impossible when repairing the free space btrees themselves. However, repair holds the AGF buffer lock for the duration of the free space index reconstruction, so it can use the collected free space information to supply the blocks for the new free space btrees. It is not necessary to back each reserved extent with an EFI because the new free space btrees are constructed in what the ondisk filesystem thinks is unowned space. However, if reserving blocks for the new btrees from the collected free space information changes the number of free space records, repair must re-estimate the new free space btree geometry with the new record count until the reservation is sufficient. As part of committing the new btrees, repair must ensure that reverse mappings are created for the reserved blocks and that unused reserved blocks are inserted into the free space btrees. Deferred rmap and freeing operations are used to ensure that this transition is atomic, similar to the other btree repair functions.h]hXxSecond, free space repairs cannot use the common btree reservation code because new blocks are reserved out of the free space btrees. This is impossible when repairing the free space btrees themselves. However, repair holds the AGF buffer lock for the duration of the free space index reconstruction, so it can use the collected free space information to supply the blocks for the new free space btrees. It is not necessary to back each reserved extent with an EFI because the new free space btrees are constructed in what the ondisk filesystem thinks is unowned space. However, if reserving blocks for the new btrees from the collected free space information changes the number of free space records, repair must re-estimate the new free space btree geometry with the new record count until the reservation is sufficient. As part of committing the new btrees, repair must ensure that reverse mappings are created for the reserved blocks and that unused reserved blocks are inserted into the free space btrees. 
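The first complication amounts to walking the reverse mapping records in physical-address order and emitting a free extent for every gap between them. A sketch of that inference, using a hypothetical record-stashing callback::

    /*
     * Sketch: each gap between rmap records (and after the last record)
     * becomes a candidate free space record.  xrep_abt_stash() is a
     * hypothetical helper that saves the new record for bulk loading.
     */
    struct xrep_abt {
            struct xfs_scrub        *sc;
            xfs_agblock_t           next_agbno; /* first unaccounted block */
    };

    static int
    xrep_abt_walk_rmap(
            struct xrep_abt                 *ra,
            const struct xfs_rmap_irec      *rec)
    {
            int                             error;

            /* The space before this record must have been free. */
            if (rec->rm_startblock > ra->next_agbno) {
                    error = xrep_abt_stash(ra, ra->next_agbno,
                                    rec->rm_startblock - ra->next_agbno);
                    if (error)
                            return error;
            }

            /* Records can overlap, so only ever move the cursor forward. */
            ra->next_agbno = max(ra->next_agbno,
                                 rec->rm_startblock + rec->rm_blockcount);
            return 0;
    }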
Deferred rmap and freeing operations are used to ensure that this transition is atomic, similar to the other btree repair functions.}(hj]hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM] hj[hhubh)}(hXThird, finding the blocks to reap after the repair is not overly straightforward. Blocks for the free space btrees and the reverse mapping btrees are supplied by the AGFL. Blocks put onto the AGFL have reverse mapping records with the owner ``XFS_RMAP_OWN_AG``. This ownership is retained when blocks move from the AGFL into the free space btrees or the reverse mapping btrees. When repair walks reverse mapping records to synthesize free space records, it creates a bitmap (``ag_owner_bitmap``) of all the space claimed by ``XFS_RMAP_OWN_AG`` records. The repair context maintains a second bitmap corresponding to the rmap btree blocks and the AGFL blocks (``rmap_agfl_bitmap``). When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap & ~rmap_agfl_bitmap)`` computes the extents that are used by the old free space btrees. These blocks can then be reaped using the methods outlined above.h](hThird, finding the blocks to reap after the repair is not overly straightforward. Blocks for the free space btrees and the reverse mapping btrees are supplied by the AGFL. Blocks put onto the AGFL have reverse mapping records with the owner }(hj#]hhhNhNubj)}(h``XFS_RMAP_OWN_AG``h]hXFS_RMAP_OWN_AG}(hj+]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#]ubh. This ownership is retained when blocks move from the AGFL into the free space btrees or the reverse mapping btrees. When repair walks reverse mapping records to synthesize free space records, it creates a bitmap (}(hj#]hhhNhNubj)}(h``ag_owner_bitmap``h]hag_owner_bitmap}(hj=]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#]ubh) of all the space claimed by }(hj#]hhhNhNubj)}(h``XFS_RMAP_OWN_AG``h]hXFS_RMAP_OWN_AG}(hjO]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#]ubhs records. The repair context maintains a second bitmap corresponding to the rmap btree blocks and the AGFL blocks (}(hj#]hhhNhNubj)}(h``rmap_agfl_bitmap``h]hrmap_agfl_bitmap}(hja]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#]ubh<). When the walk is complete, the bitmap disunion operation }(hj#]hhhNhNubj)}(h)``(ag_owner_bitmap & ~rmap_agfl_bitmap)``h]h%(ag_owner_bitmap & ~rmap_agfl_bitmap)}(hjs]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#]ubh computes the extents that are used by the old free space btrees. These blocks can then be reaped using the methods outlined above.}(hj#]hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMp hj[hhubh)}(hThe proposed patchset is the `AG btree repair `_ series.h](hThe proposed patchset is the }(hj]hhhNhNubj)}(hq`AG btree repair `_h]hAG btree repair}(hj]hhhNhNubah}(h]h ]h"]h$]h&]nameAG btree repairjj\https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btreesuh1jhj]ubh)}(h_ h]h}(h]id6ah ]h"]h$]ag btree repairah&]refurij]uh1hjyKhj]ubh series.}(hj]hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj[hhubh)}(h.. _rmap_reap:h]h}(h]h ]h"]h$]h&]j rmap-reapuh1hhM hj[hhhhubeh}(h]jX ah ]h"]-case study: rebuilding the free space indicesah$]h&]uh1hhj&YhhhhhM> ubh)}(hhh](h)}(h:Case Study: Reaping After Repairing Reverse Mapping Btreesh]h:Case Study: Reaping After Repairing Reverse Mapping Btrees}(hj]hhhNhNubah}(h]h ]h"]h$]h&]jjt uh1hhj]hhhhhM ubh)}(hXOld reverse mapping btrees are less difficult to reap after a repair. As mentioned in the previous section, blocks on the AGFL, the two free space btree blocks, and the reverse mapping btree blocks all have reverse mapping records with ``XFS_RMAP_OWN_AG`` as the owner. 
The full process of gathering reverse mapping records and building a new btree is described in the case study of :ref:`live rebuilds of rmap data `, but a crucial point from that discussion is that the new rmap btree will not contain any records for the old rmap btree, nor will the old btree blocks be tracked in the free space btrees. The list of candidate reaping blocks is computed by setting the bits corresponding to the gaps in the new rmap btree records, and then clearing the bits corresponding to extents in the free space btrees and the current AGFL blocks. The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the methods outlined above.h](hOld reverse mapping btrees are less difficult to reap after a repair. As mentioned in the previous section, blocks on the AGFL, the two free space btree blocks, and the reverse mapping btree blocks all have reverse mapping records with }(hj]hhhNhNubj)}(h``XFS_RMAP_OWN_AG``h]hXFS_RMAP_OWN_AG}(hj]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj]ubh as the owner. The full process of gathering reverse mapping records and building a new btree is described in the case study of }(hj]hhhNhNubh)}(h/:ref:`live rebuilds of rmap data `h]j)}(hj]h]hlive rebuilds of rmap data}(hj]hhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhj]ubah}(h]h ]h"]h$]h&]refdocj refdomainj^reftyperef refexplicitrefwarnj rmap_repairuh1hhhhM hj]ubhX, but a crucial point from that discussion is that the new rmap btree will not contain any records for the old rmap btree, nor will the old btree blocks be tracked in the free space btrees. The list of candidate reaping blocks is computed by setting the bits corresponding to the gaps in the new rmap btree records, and then clearing the bits corresponding to extents in the free space btrees and the current AGFL blocks. The result }(hj]hhhNhNubj)}(h/``(new_rmapbt_gaps & ~(agfl | bnobt_records))``h]h+(new_rmapbt_gaps & ~(agfl | bnobt_records))}(hj^hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj]ubh- are reaped using the methods outlined above.}(hj]hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj]hhubh)}(hyThe rest of the process of rebuilding the reverse mapping btree is discussed in a separate :ref:`case study`.h](hZThe rest of the process of rebuilding the reverse mapping btree is discussed in a separate }(hj4^hhhNhNubh)}(h:ref:`case study`h]j)}(hj>^h]h case study}(hj@^hhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhj<^ubah}(h]h ]h"]h$]h&]refdocj refdomainjJ^reftyperef refexplicitrefwarnj rmap_repairuh1hhhhM hj4^ubh.}(hj4^hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj]hhubh)}(hThe proposed patchset is the `AG btree repair `_ series.h](hThe proposed patchset is the }(hjf^hhhNhNubj)}(hq`AG btree repair `_h]hAG btree repair}(hjn^hhhNhNubah}(h]h ]h"]h$]h&]nameAG btree repairjj\https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btreesuh1jhjf^ubh)}(h_ h]h}(h]id7ah ]h"]h$]ag btree repairah&]refurij~^uh1hjyKhjf^ubh series.}(hjf^hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj]hhubeh}(h](jz j]eh ]h"](:case study: reaping after repairing reverse mapping btrees rmap_reapeh$]h&]uh1hhj&YhhhhhM j}j^j]sj}j]j]subh)}(hhh](h)}(hCase Study: Rebuilding the AGFLh]hCase Study: Rebuilding the AGFL}(hj^hhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj^hhhhhM ubh)}(hCThe allocation group free block list (AGFL) is repaired as follows:h]hCThe allocation group free block list (AGFL) is repaired as follows:}(hj^hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj^hhubji)}(hhh](h)}(hhCreate a bitmap for all the space that the reverse mapping data claims is owned by ``XFS_RMAP_OWN_AG``. 
h]h)}(hgCreate a bitmap for all the space that the reverse mapping data claims is owned by ``XFS_RMAP_OWN_AG``.h](hSCreate a bitmap for all the space that the reverse mapping data claims is owned by }(hj^hhhNhNubj)}(h``XFS_RMAP_OWN_AG``h]hXFS_RMAP_OWN_AG}(hj^hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj^ubh.}(hj^hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj^ubah}(h]h ]h"]h$]h&]uh1hhj^hhhhhNubh)}(hISubtract the space used by the two free space btrees and the rmap btree. h]h)}(hHSubtract the space used by the two free space btrees and the rmap btree.h]hHSubtract the space used by the two free space btrees and the rmap btree.}(hj^hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj^ubah}(h]h ]h"]h$]h&]uh1hhj^hhhhhNubh)}(hSubtract any space that the reverse mapping data claims is owned by any other owner, to avoid re-adding crosslinked blocks to the AGFL. h]h)}(hSubtract any space that the reverse mapping data claims is owned by any other owner, to avoid re-adding crosslinked blocks to the AGFL.h]hSubtract any space that the reverse mapping data claims is owned by any other owner, to avoid re-adding crosslinked blocks to the AGFL.}(hj_hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj_ubah}(h]h ]h"]h$]h&]uh1hhj^hhhhhNubh)}(h1Once the AGFL is full, reap any blocks leftover. h]h)}(h0Once the AGFL is full, reap any blocks leftover.h]h0Once the AGFL is full, reap any blocks leftover.}(hj _hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj_ubah}(h]h ]h"]h$]h&]uh1hhj^hhhhhNubh)}(hAThe next operation to fix the freelist will right-size the list. h]h)}(h@The next operation to fix the freelist will right-size the list.h]h@The next operation to fix the freelist will right-size the list.}(hj8_hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj4_ubah}(h]h ]h"]h$]h&]uh1hhj^hhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhj^hhhhhM ubh)}(hSee `fs/xfs/scrub/agheader_repair.c `_ for more details.h](hSee }(hjR_hhhNhNubj)}(h`fs/xfs/scrub/agheader_repair.c `_h]hfs/xfs/scrub/agheader_repair.c}(hjZ_hhhNhNubah}(h]h ]h"]h$]h&]namefs/xfs/scrub/agheader_repair.cjjfhttps://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.cuh1jhjR_ubh)}(hi h]h}(h]fs-xfs-scrub-agheader-repair-cah ]h"]fs/xfs/scrub/agheader_repair.cah$]h&]refurijj_uh1hjyKhjR_ubh for more details.}(hjR_hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj^hhubeh}(h]j ah ]h"]case study: rebuilding the agflah$]h&]uh1hhj&YhhhhhM ubeh}(h](j jYeh ]h"](reaping old metadata blocksreapingeh$]h&]uh1hhjv*hhhhhM j}j_jYsj}jYjYsubh)}(hhh](h)}(hInode Record Repairsh]hInode Record Repairs}(hj_hhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj_hhhhhM ubh)}(hXmInode records must be handled carefully, because they have both ondisk records ("dinodes") and an in-memory ("cached") representation. There is a very high potential for cache coherency issues if online fsck is not careful to access the ondisk metadata *only* when the ondisk metadata is so badly damaged that the filesystem cannot load the in-memory representation. When online fsck wants to open a damaged file for scrubbing, it must use specialized resource acquisition functions that return either the in-memory representation *or* a lock on whichever object is necessary to prevent any update to the ondisk location.h](hXInode records must be handled carefully, because they have both ondisk records (“dinodes”) and an in-memory (“cached”) representation. 
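Returning to the AGFL rebuild described above, the steps might be condensed into a sketch like the following, again with hypothetical bitmap helpers::

    /* Sketch of the AGFL repair steps; helper names are hypothetical. */
    static int
    xrep_agfl_sketch(
            struct xfs_scrub        *sc)
    {
            struct xbitmap          candidates;
            unsigned int            nr = 0;
            xfs_agblock_t           agbno;
            int                     error;

            xbitmap_init(&candidates);

            /* Space that the rmap data says is owned by XFS_RMAP_OWN_AG... */
            error = xbitmap_set_from_rmap_owner(sc, XFS_RMAP_OWN_AG,
                            &candidates);
            if (error)
                    goto out;

            /*
             * ...minus the free space btree blocks, the rmap btree
             * blocks, and anything claimed by other owners (steps 2-3).
             */
            error = xbitmap_clear_ag_btrees_and_crosslinks(sc, &candidates);
            if (error)
                    goto out;

            /* Fill the AGFL until it is full... */
            for_each_xbitmap_block(&candidates, agbno) {
                    if (nr == xfs_agfl_size(sc->mp))
                            break;
                    xrep_agfl_append(sc, agbno);
                    nr++;
            }

            /* ...and reap whatever candidate blocks are left over. */
            error = xrep_reap_remaining(sc, &candidates);
    out:
            xbitmap_destroy(&candidates);
            return error;
    }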
There is a very high potential for cache coherency issues if online fsck is not careful to access the ondisk metadata }(hj_hhhNhNubj7)}(h*only*h]honly}(hj_hhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj_ubhX when the ondisk metadata is so badly damaged that the filesystem cannot load the in-memory representation. When online fsck wants to open a damaged file for scrubbing, it must use specialized resource acquisition functions that return either the in-memory representation }(hj_hhhNhNubj7)}(h*or*h]hor}(hj_hhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj_ubhV a lock on whichever object is necessary to prevent any update to the ondisk location.}(hj_hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj_hhubh)}(hX4The only repairs that should be made to the ondisk inode buffers are whatever is necessary to get the in-core structure loaded. This means fixing whatever is caught by the inode cluster buffer and inode fork verifiers, and retrying the ``iget`` operation. If the second ``iget`` fails, the repair has failed.h](hThe only repairs that should be made to the ondisk inode buffers are whatever is necessary to get the in-core structure loaded. This means fixing whatever is caught by the inode cluster buffer and inode fork verifiers, and retrying the }(hj_hhhNhNubj)}(h``iget``h]higet}(hj_hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_ubh operation. If the second }(hj_hhhNhNubj)}(h``iget``h]higet}(hj_hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_ubh fails, the repair has failed.}(hj_hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj_hhubh)}(hXOnce the in-memory representation is loaded, repair can lock the inode and can subject it to comprehensive checks, repairs, and optimizations. Most inode attributes are easy to check and constrain, or are user-controlled arbitrary bit patterns; these are both easy to fix. Dealing with the data and attr fork extent counts and the file block counts is more complicated, because computing the correct value requires traversing the forks, or if that fails, leaving the fields invalid and waiting for the fork fsck functions to run.h]hXOnce the in-memory representation is loaded, repair can lock the inode and can subject it to comprehensive checks, repairs, and optimizations. Most inode attributes are easy to check and constrain, or are user-controlled arbitrary bit patterns; these are both easy to fix. Dealing with the data and attr fork extent counts and the file block counts is more complicated, because computing the correct value requires traversing the forks, or if that fails, leaving the fields invalid and waiting for the fork fsck functions to run.}(hj`hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj_hhubh)}(hThe proposed patchset is the `inode `_ repair series.h](hThe proposed patchset is the }(hj`hhhNhNubj)}(hd`inode `_h]hinode}(hj`hhhNhNubah}(h]h ]h"]h$]h&]nameinodejjYhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodesuh1jhj`ubh)}(h\ h]h}(h]inodeah ]h"]inodeah$]h&]refurij.`uh1hjyKhj`ubh repair series.}(hj`hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj_hhubeh}(h]j ah ]h"]inode record repairsah$]h&]uh1hhjv*hhhhhM ubh)}(hhh](h)}(hQuota Record Repairsh]hQuota Record Repairs}(hjP`hhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhjM`hhhhhM ubh)}(hSimilar to inodes, quota records ("dquots") also have both ondisk records and an in-memory representation, and hence are subject to the same cache coherency issues. Somewhat confusingly, both are known as dquots in the XFS codebase.h]hSimilar to inodes, quota records (“dquots”) also have both ondisk records and an in-memory representation, and hence are subject to the same cache coherency issues. 
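Returning to the inode record repair flow just described, the load-then-retry logic might look like this sketch, where ``xrep_fix_ondisk_inode`` is a hypothetical stand-in for the dinode fixups::

    /*
     * Sketch: if the inode cluster buffer or inode fork verifiers
     * reject the ondisk inode, repair just enough to load the incore
     * inode, then retry.  A second iget failure fails the repair.
     */
    static int
    xrep_inode_iget_retry(
            struct xfs_scrub        *sc,
            xfs_ino_t               ino,
            struct xfs_inode        **ipp)
    {
            int                     error;

            error = xfs_iget(sc->mp, sc->tp, ino, 0, 0, ipp);
            if (error != -EFSCORRUPTED && error != -EFSBADCRC)
                    return error;

            /* Fix whatever the verifiers caught... */
            error = xrep_fix_ondisk_inode(sc, ino);  /* hypothetical */
            if (error)
                    return error;

            /* ...and try exactly once more. */
            return xfs_iget(sc->mp, sc->tp, ino, 0, 0, ipp);
    }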
Somewhat confusingly, both are known as dquots in the XFS codebase.}(hj^`hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjM`hhubh)}(hXThe only repairs that should be made to the ondisk quota record buffers are whatever is necessary to get the in-core structure loaded. Once the in-memory representation is loaded, the only attributes needing checking are obviously bad limits and timer values.h]hXThe only repairs that should be made to the ondisk quota record buffers are whatever is necessary to get the in-core structure loaded. Once the in-memory representation is loaded, the only attributes needing checking are obviously bad limits and timer values.}(hjl`hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjM`hhubh)}(h~Quota usage counters are checked, repaired, and discussed separately in the section about :ref:`live quotacheck `.h](hZQuota usage counters are checked, repaired, and discussed separately in the section about }(hjz`hhhNhNubh)}(h#:ref:`live quotacheck `h]j)}(hj`h]hlive quotacheck}(hj`hhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhj`ubah}(h]h ]h"]h$]h&]refdocj refdomainj`reftyperef refexplicitrefwarnj quotacheckuh1hhhhM hjz`ubh.}(hjz`hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjM`hhubh)}(hThe proposed patchset is the `quota `_ repair series.h](hThe proposed patchset is the }(hj`hhhNhNubj)}(hc`quota `_h]hquota}(hj`hhhNhNubah}(h]h ]h"]h$]h&]namequotajjXhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotauh1jhj`ubh)}(h[ h]h}(h]quotaah ]h"]quotaah$]h&]refurij`uh1hjyKhj`ubh repair series.}(hj`hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjM`hhubh)}(h.. _fscounters:h]h}(h]h ]h"]h$]h&]j fscountersuh1hhM hjM`hhhhubeh}(h]j ah ]h"]quota record repairsah$]h&]uh1hhjv*hhhhhM ubh)}(hhh](h)}(h Freezing to Fix Summary Countersh]h Freezing to Fix Summary Counters}(hj`hhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj`hhhhhM ubh)}(hX Filesystem summary counters track availability of filesystem resources such as free blocks, free inodes, and allocated inodes. This information could be compiled by walking the free space and inode indexes, but this is a slow process, so XFS maintains a copy in the ondisk superblock that should reflect the ondisk metadata, at least when the filesystem has been unmounted cleanly. For performance reasons, XFS also maintains incore copies of those counters, which are key to enabling resource reservations for active transactions. Writer threads reserve the worst-case quantities of resources from the incore counter and give back whatever they don't use at commit time. It is therefore only necessary to serialize on the superblock when the superblock is being committed to disk.h]hXFilesystem summary counters track availability of filesystem resources such as free blocks, free inodes, and allocated inodes. This information could be compiled by walking the free space and inode indexes, but this is a slow process, so XFS maintains a copy in the ondisk superblock that should reflect the ondisk metadata, at least when the filesystem has been unmounted cleanly. For performance reasons, XFS also maintains incore copies of those counters, which are key to enabling resource reservations for active transactions. Writer threads reserve the worst-case quantities of resources from the incore counter and give back whatever they don’t use at commit time. 
It is therefore only necessary to serialize on the superblock when the superblock is being committed to disk.}(hj`hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj`hhubh)}(hXThe lazy superblock counter feature introduced in XFS v5 took this even further by training log recovery to recompute the summary counters from the AG headers, which eliminated the need for most transactions even to touch the superblock. The only time XFS commits the summary counters is at filesystem unmount. To reduce contention even further, the incore counter is implemented as a percpu counter, which means that each CPU is allocated a batch of blocks from a global incore counter and can satisfy small allocations from the local batch.h]hXThe lazy superblock counter feature introduced in XFS v5 took this even further by training log recovery to recompute the summary counters from the AG headers, which eliminated the need for most transactions even to touch the superblock. The only time XFS commits the summary counters is at filesystem unmount. To reduce contention even further, the incore counter is implemented as a percpu counter, which means that each CPU is allocated a batch of blocks from a global incore counter and can satisfy small allocations from the local batch.}(hj ahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj`hhubh)}(hXThe high-performance nature of the summary counters makes it difficult for online fsck to check them, since there is no way to quiesce a percpu counter while the system is running. Although online fsck can read the filesystem metadata to compute the correct values of the summary counters, there's no way to hold the value of a percpu counter stable, so it's quite possible that the counter will be out of date by the time the walk is complete. Earlier versions of online scrub would return to userspace with an incomplete scan flag, but this is not a satisfying outcome for a system administrator. For repairs, the in-memory counters must be stabilized while walking the filesystem metadata to get an accurate reading and install it in the percpu counter.h]hXThe high-performance nature of the summary counters makes it difficult for online fsck to check them, since there is no way to quiesce a percpu counter while the system is running. Although online fsck can read the filesystem metadata to compute the correct values of the summary counters, there’s no way to hold the value of a percpu counter stable, so it’s quite possible that the counter will be out of date by the time the walk is complete. Earlier versions of online scrub would return to userspace with an incomplete scan flag, but this is not a satisfying outcome for a system administrator. For repairs, the in-memory counters must be stabilized while walking the filesystem metadata to get an accurate reading and install it in the percpu counter.}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj`hhubh)}(hXTo satisfy this requirement, online fsck must prevent other programs in the system from initiating new writes to the filesystem, it must disable background garbage collection threads, and it must wait for existing writer programs to exit the kernel. Once that has been established, scrub can walk the AG free space indexes, the inode btrees, and the realtime bitmap to compute the correct value of all four summary counters. 
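A sketch of this special freeze, assuming hypothetical freeze, thaw, and calculation helpers; the list below enumerates how it differs from a regular VFS freeze::

    /*
     * Sketch: stop all writer threads, recompute the summary counters
     * from the AG headers and the realtime bitmap, and install the
     * results in the percpu counters.  The xchk_fscounters_* helpers
     * are hypothetical.
     */
    static int
    xrep_fscounters_sketch(
            struct xfs_scrub        *sc)
    {
            struct xfs_mount        *mp = sc->mp;
            uint64_t                icount, ifree, fdblocks, frextents;
            int                     error;

            error = xchk_fscounters_freeze(sc);
            if (error)
                    return error;

            /* No writers are running, so these results cannot go stale. */
            error = xchk_fscounters_calc(sc, &icount, &ifree, &fdblocks,
                            &frextents);
            if (error)
                    goto out_thaw;

            percpu_counter_set(&mp->m_icount, icount);
            percpu_counter_set(&mp->m_ifree, ifree);
            percpu_counter_set(&mp->m_fdblocks, fdblocks);
            percpu_counter_set(&mp->m_frextents, frextents);

    out_thaw:
            xchk_fscounters_thaw(sc);
            return error;
    }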
This is very similar to a filesystem freeze, though not all of the pieces are necessary:h]hXTo satisfy this requirement, online fsck must prevent other programs in the system from initiating new writes to the filesystem, it must disable background garbage collection threads, and it must wait for existing writer programs to exit the kernel. Once that has been established, scrub can walk the AG free space indexes, the inode btrees, and the realtime bitmap to compute the correct value of all four summary counters. This is very similar to a filesystem freeze, though not all of the pieces are necessary:}(hj)ahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj`hhubh)}(hhh](h)}(hThe final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to prevent other threads from thawing the filesystem, or other scrub threads from initiating another fscounters freeze. h]h)}(hThe final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to prevent other threads from thawing the filesystem, or other scrub threads from initiating another fscounters freeze.h](h.The final freeze state is set one higher than }(hj>ahhhNhNubj)}(h``SB_FREEZE_COMPLETE``h]hSB_FREEZE_COMPLETE}(hjFahhhNhNubah}(h]h ]h"]h$]h&]uh1jhj>aubhx to prevent other threads from thawing the filesystem, or other scrub threads from initiating another fscounters freeze.}(hj>ahhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj:aubah}(h]h ]h"]h$]h&]uh1hhj7ahhhhhNubh)}(hIt does not quiesce the log. h]h)}(hIt does not quiesce the log.h]hIt does not quiesce the log.}(hjhahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM" hjdaubah}(h]h ]h"]h$]h&]uh1hhj7ahhhhhNubeh}(h]h ]h"]h$]h&]jJjKuh1hhhhM hj`hhubh)}(hWith this code in place, it is now possible to pause the filesystem for just long enough to check and correct the summary counters.h]hWith this code in place, it is now possible to pause the filesystem for just long enough to check and correct the summary counters.}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM$ hj`hhubj)}(hhh]j)}(hhh](j)}(hhh]h}(h]h ]h"]h$]h&]colwidthKJuh1jhjaubj)}(hhh](j)}(hhh]j)}(hhh]h)}(h**Historical Sidebar**:h](j)}(h**Historical Sidebar**h]hHistorical Sidebar}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1jhjaubh:}(hjahhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM( hjaubah}(h]h ]h"]h$]h&]uh1jhjaubah}(h]h ]h"]h$]h&]uh1jhjaubj)}(hhh]j)}(hhh](h)}(hX The initial implementation used the actual VFS filesystem freeze mechanism to quiesce filesystem activity. With the filesystem frozen, it is possible to resolve the counter values with exact precision, but there are many problems with calling the VFS methods directly:h]hX The initial implementation used the actual VFS filesystem freeze mechanism to quiesce filesystem activity. With the filesystem frozen, it is possible to resolve the counter values with exact precision, but there are many problems with calling the VFS methods directly:}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM* hjaubh)}(hhh](h)}(h~Other programs can unfreeze the filesystem without our knowledge. This leads to incorrect scan results and incorrect repairs. h]h)}(h}Other programs can unfreeze the filesystem without our knowledge. This leads to incorrect scan results and incorrect repairs.h]h}Other programs can unfreeze the filesystem without our knowledge. This leads to incorrect scan results and incorrect repairs.}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM0 hjaubah}(h]h ]h"]h$]h&]uh1hhjaubh)}(hXlAdding an extra lock to prevent others from thawing the filesystem required the addition of a ``->freeze_super`` function to wrap ``freeze_fs()``. 
This in turn caused other subtle problems because it turns out that the VFS ``freeze_super`` and ``thaw_super`` functions can drop the last reference to the VFS superblock, and any subsequent access becomes a UAF bug! This can happen if the filesystem is unmounted while the underlying block device has frozen the filesystem. This problem could be solved by grabbing extra references to the superblock, but it felt suboptimal given the other inadequacies of this approach. h]h)}(hXkAdding an extra lock to prevent others from thawing the filesystem required the addition of a ``->freeze_super`` function to wrap ``freeze_fs()``. This in turn caused other subtle problems because it turns out that the VFS ``freeze_super`` and ``thaw_super`` functions can drop the last reference to the VFS superblock, and any subsequent access becomes a UAF bug! This can happen if the filesystem is unmounted while the underlying block device has frozen the filesystem. This problem could be solved by grabbing extra references to the superblock, but it felt suboptimal given the other inadequacies of this approach.h](h^Adding an extra lock to prevent others from thawing the filesystem required the addition of a }(hjbhhhNhNubj)}(h``->freeze_super``h]h->freeze_super}(hj bhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjbubh function to wrap }(hjbhhhNhNubj)}(h``freeze_fs()``h]h freeze_fs()}(hjbhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjbubhN. This in turn caused other subtle problems because it turns out that the VFS }(hjbhhhNhNubj)}(h``freeze_super``h]h freeze_super}(hj0bhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjbubh and }(hjbhhhNhNubj)}(h``thaw_super``h]h thaw_super}(hjBbhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjbubhXi functions can drop the last reference to the VFS superblock, and any subsequent access becomes a UAF bug! This can happen if the filesystem is unmounted while the underlying block device has frozen the filesystem. This problem could be solved by grabbing extra references to the superblock, but it felt suboptimal given the other inadequacies of this approach.}(hjbhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM3 hjbubah}(h]h ]h"]h$]h&]uh1hhjaubh)}(hThe log need not be quiesced to check the summary counters, but a VFS freeze initiates one anyway. This adds unnecessary runtime to live fscounter fsck operations. h]h)}(hThe log need not be quiesced to check the summary counters, but a VFS freeze initiates one anyway. This adds unnecessary runtime to live fscounter fsck operations.h]hThe log need not be quiesced to check the summary counters, but a VFS freeze initiates one anyway. This adds unnecessary runtime to live fscounter fsck operations.}(hjdbhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM@ hj`bubah}(h]h ]h"]h$]h&]uh1hhjaubh)}(hpQuiescing the log means that XFS flushes the (possibly incorrect) counters to disk as part of cleaning the log. h]h)}(hoQuiescing the log means that XFS flushes the (possibly incorrect) counters to disk as part of cleaning the log.h]hoQuiescing the log means that XFS flushes the (possibly incorrect) counters to disk as part of cleaning the log.}(hj|bhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMD hjxbubah}(h]h ]h"]h$]h&]uh1hhjaubh)}(hA bug in the VFS meant that freeze could complete even when sync_filesystem fails to flush the filesystem and returns an error. This bug was fixed in Linux 5.17.h]h)}(hA bug in the VFS meant that freeze could complete even when sync_filesystem fails to flush the filesystem and returns an error. 
This bug was fixed in Linux 5.17.h]hA bug in the VFS meant that freeze could complete even when sync_filesystem fails to flush the filesystem and returns an error. This bug was fixed in Linux 5.17.}(hjbhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMG hjbubah}(h]h ]h"]h$]h&]uh1hhjaubeh}(h]h ]h"]h$]h&]jJjKuh1hhhhM0 hjaubeh}(h]h ]h"]h$]h&]uh1jhjaubah}(h]h ]h"]h$]h&]uh1jhjaubeh}(h]h ]h"]h$]h&]uh1jhjaubeh}(h]h ]h"]h$]h&]colsKuh1jhjaubah}(h]h ]h"]h$]h&]uh1jhj`hhhNhNubh)}(hThe proposed patchset is the `summary counter cleanup `_ series.h](hThe proposed patchset is the }(hjbhhhNhNubj)}(hz`summary counter cleanup `_h]hsummary counter cleanup}(hjbhhhNhNubah}(h]h ]h"]h$]h&]namesummary counter cleanupjj]https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscountersuh1jhjbubh)}(h` h]h}(h]summary-counter-cleanupah ]h"]summary counter cleanupah$]h&]refurijbuh1hjyKhjbubh series.}(hjbhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhML hj`hhubeh}(h](j j`eh ]h"]( freezing to fix summary counters fscounterseh$]h&]uh1hhjv*hhhhhM j}jcj`sj}j`j`subh)}(hhh](h)}(hFull Filesystem Scansh]hFull Filesystem Scans}(hj chhhNhNubah}(h]h ]h"]h$]h&]jj* uh1hhjchhhhhMR ubh)}(hXCertain types of metadata can only be checked by walking every file in the entire filesystem to record observations and comparing the observations against what's recorded on disk. Like every other type of online repair, repairs are made by writing those observations to disk in a replacement structure and committing it atomically. However, it is not practical to shut down the entire filesystem to examine hundreds of billions of files because the downtime would be excessive. Therefore, online fsck must build the infrastructure to manage a live scan of all the files in the filesystem. There are two questions that need to be solved to perform a live walk:h]hXCertain types of metadata can only be checked by walking every file in the entire filesystem to record observations and comparing the observations against what’s recorded on disk. Like every other type of online repair, repairs are made by writing those observations to disk in a replacement structure and committing it atomically. However, it is not practical to shut down the entire filesystem to examine hundreds of billions of files because the downtime would be excessive. Therefore, online fsck must build the infrastructure to manage a live scan of all the files in the filesystem. There are two questions that need to be solved to perform a live walk:}(hjchhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMT hjchhubh)}(2hhh](h)}(h`_ in *Lions' Commentary on UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson, `"Implementation of the File System" `_, from *The UNIX Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp. 1913-4.h](h_In the original Unix filesystems of the 1970s, each directory entry contained an index number (}(hj{chhhNhNubj7)}(h *inumber*h]hinumber}(hjchhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj{cubh3) which was used as an index into an ondisk array (}(hj{chhhNhNubj7)}(h*itable*h]hitable}(hjchhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj{cubh) of fixed-size records (}(hj{chhhNhNubj7)}(h*inodes*h]hinodes}(hjchhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj{cubhe) describing a file’s attributes and its data block mapping. This system is described by J. 
Lions, }(hj{chhhNhNubj)}(hB`"inode (5659)" `_h]h“inode (5659)”}(hjchhhNhNubah}(h]h ]h"]h$]h&]name"inode (5659)"jj.http://www.lemis.com/grog/Documentation/Lions/uh1jhj{cubh)}(h1 h]h}(h] inode-5659ah ]h"]"inode (5659)"ah$]h&]refurijcuh1hjyKhj{cubh in }(hj{chhhNhNubj7)}(h(*Lions' Commentary on UNIX, 6th Edition*h]h(Lions’ Commentary on UNIX, 6th Edition}(hjchhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj{cubh, (Dept. of Computer Science, the University of New South Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson, }(hj{chhhNhNubj)}(hc`"Implementation of the File System" `_h]h'“Implementation of the File System”}(hjchhhNhNubah}(h]h ]h"]h$]h&]name#"Implementation of the File System"jj:https://archive.org/details/bstj57-6-1905/page/n8/mode/1upuh1jhj{cubh)}(h= h]h}(h]!implementation-of-the-file-systemah ]h"]#"implementation of the file system"ah$]h&]refurijcuh1hjyKhj{cubh, from }(hj{chhhNhNubj7)}(h*The UNIX Time-Sharing System*h]hThe UNIX Time-Sharing System}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj{cubh=, (The Bell System Technical Journal, July 1978), pp. 1913-4.}(hj{chhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMi hjjchhubh)}(hXXFS retains most of this design, except now inumbers are search keys over all the space in the data section filesystem. They form a continuous keyspace that can be expressed as a 64-bit integer, though the inodes themselves are sparsely distributed within the keyspace. Scans proceed in a linear fashion across the inumber keyspace, starting from ``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``. Naturally, a scan through a keyspace requires a scan cursor object to track the scan progress. Because this keyspace is sparse, this cursor contains two parts. The first part of this scan cursor object tracks the inode that will be examined next; call this the examination cursor. Somewhat less obviously, the scan cursor object must also track which parts of the keyspace have already been visited, which is critical for deciding if a concurrent filesystem update needs to be incorporated into the scan data. Call this the visited inode cursor.h](hX[XFS retains most of this design, except now inumbers are search keys over all the space in the data section filesystem. They form a continuous keyspace that can be expressed as a 64-bit integer, though the inodes themselves are sparsely distributed within the keyspace. Scans proceed in a linear fashion across the inumber keyspace, starting from }(hj'dhhhNhNubj)}(h``0x0``h]h0x0}(hj/dhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj'dubh and ending at }(hj'dhhhNhNubj)}(h``0xFFFFFFFFFFFFFFFF``h]h0xFFFFFFFFFFFFFFFF}(hjAdhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj'dubhX#. Naturally, a scan through a keyspace requires a scan cursor object to track the scan progress. Because this keyspace is sparse, this cursor contains two parts. The first part of this scan cursor object tracks the inode that will be examined next; call this the examination cursor. Somewhat less obviously, the scan cursor object must also track which parts of the keyspace have already been visited, which is critical for deciding if a concurrent filesystem update needs to be incorporated into the scan data. 
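To make the two-part cursor concrete, here is a hypothetical rendering of the scan cursor object; the in-kernel structure differs in detail::

    /*
     * Sketch of a coordinated inode scan cursor.  The examination
     * cursor names the inode to examine next; the visited cursor
     * records how much of the sparse inumber keyspace has already
     * been covered.
     */
    struct xchk_iscan_sketch {
            struct mutex    lock;           /* serializes cursor updates */
            xfs_ino_t       cursor_ino;     /* examination cursor */
            xfs_ino_t       visited_ino;    /* keyspace visited so far */
    };

    /*
     * A concurrent update is relevant to the scan only if the scan has
     * already visited that part of the keyspace; otherwise the scan
     * will pick up the new state by itself later.
     */
    static inline bool
    xchk_iscan_want_live_update(
            struct xchk_iscan_sketch        *iscan,
            xfs_ino_t                       ino)
    {
            return ino <= iscan->visited_ino;
    }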
Call this the visited inode cursor.}(hj'dhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMv hjjchhubh)}(hVAdvancing the scan cursor is a multi-step process encapsulated in ``xchk_iscan_iter``:h](hBAdvancing the scan cursor is a multi-step process encapsulated in }(hjYdhhhNhNubj)}(h``xchk_iscan_iter``h]hxchk_iscan_iter}(hjadhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjYdubh:}(hjYdhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjjchhubji)}(hhh](h)}(hLock the AGI buffer of the AG containing the inode pointed to by the visited inode cursor. This guarantees that inodes in this AG cannot be allocated or freed while advancing the cursor. h]h)}(hLock the AGI buffer of the AG containing the inode pointed to by the visited inode cursor. This guarantees that inodes in this AG cannot be allocated or freed while advancing the cursor.h]hLock the AGI buffer of the AG containing the inode pointed to by the visited inode cursor. This guarantees that inodes in this AG cannot be allocated or freed while advancing the cursor.}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj|dubah}(h]h ]h"]h$]h&]uh1hhjydhhhhhNubh)}(hUse the per-AG inode btree to look up the next inumber after the one that was just visited, since it may not be keyspace adjacent. h]h)}(hUse the per-AG inode btree to look up the next inumber after the one that was just visited, since it may not be keyspace adjacent.h]hUse the per-AG inode btree to look up the next inumber after the one that was just visited, since it may not be keyspace adjacent.}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdubah}(h]h ]h"]h$]h&]uh1hhjydhhhhhNubh)}(hXIf there are no more inodes left in this AG: a. Move the examination cursor to the point of the inumber keyspace that corresponds to the start of the next AG. b. Adjust the visited inode cursor to indicate that it has "visited" the last possible inode in the current AG's inode keyspace. XFS inumbers are segmented, so the cursor needs to be marked as having visited the entire keyspace up to just before the start of the next AG's inode keyspace. c. Unlock the AGI and return to step 1 if there are unexamined AGs in the filesystem. d. If there are no more AGs to examine, set both cursors to the end of the inumber keyspace. The scan is now complete. h](h)}(h,If there are no more inodes left in this AG:h]h,If there are no more inodes left in this AG:}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdubji)}(hhh](h)}(hoMove the examination cursor to the point of the inumber keyspace that corresponds to the start of the next AG. h]h)}(hnMove the examination cursor to the point of the inumber keyspace that corresponds to the start of the next AG.h]hnMove the examination cursor to the point of the inumber keyspace that corresponds to the start of the next AG.}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdubah}(h]h ]h"]h$]h&]uh1hhjdubh)}(hXAdjust the visited inode cursor to indicate that it has "visited" the last possible inode in the current AG's inode keyspace. XFS inumbers are segmented, so the cursor needs to be marked as having visited the entire keyspace up to just before the start of the next AG's inode keyspace. h]h)}(hXAdjust the visited inode cursor to indicate that it has "visited" the last possible inode in the current AG's inode keyspace. XFS inumbers are segmented, so the cursor needs to be marked as having visited the entire keyspace up to just before the start of the next AG's inode keyspace.h]hX%Adjust the visited inode cursor to indicate that it has “visited” the last possible inode in the current AG’s inode keyspace. 
XFS inumbers are segmented, so the cursor needs to be marked as having visited the entire keyspace up to just before the start of the next AG’s inode keyspace.}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdubah}(h]h ]h"]h$]h&]uh1hhjdubh)}(hSUnlock the AGI and return to step 1 if there are unexamined AGs in the filesystem. h]h)}(hRUnlock the AGI and return to step 1 if there are unexamined AGs in the filesystem.h]hRUnlock the AGI and return to step 1 if there are unexamined AGs in the filesystem.}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdubah}(h]h ]h"]h$]h&]uh1hhjdubh)}(htIf there are no more AGs to examine, set both cursors to the end of the inumber keyspace. The scan is now complete. h]h)}(hsIf there are no more AGs to examine, set both cursors to the end of the inumber keyspace. The scan is now complete.h]hsIf there are no more AGs to examine, set both cursors to the end of the inumber keyspace. The scan is now complete.}(hj ehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj eubah}(h]h ]h"]h$]h&]uh1hhjdubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhjdubeh}(h]h ]h"]h$]h&]uh1hhjydhhhNhNubh)}(hXOtherwise, there is at least one more inode to scan in this AG: a. Move the examination cursor ahead to the next inode marked as allocated by the inode btree. b. Adjust the visited inode cursor to point to the inode just prior to where the examination cursor is now. Because the scanner holds the AGI buffer lock, no inodes could have been created in the part of the inode keyspace that the visited inode cursor just advanced. h](h)}(h?Otherwise, there is at least one more inode to scan in this AG:h]h?Otherwise, there is at least one more inode to scan in this AG:}(hj1ehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj-eubji)}(hhh](h)}(h\Move the examination cursor ahead to the next inode marked as allocated by the inode btree. h]h)}(h[Move the examination cursor ahead to the next inode marked as allocated by the inode btree.h]h[Move the examination cursor ahead to the next inode marked as allocated by the inode btree.}(hjFehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjBeubah}(h]h ]h"]h$]h&]uh1hhj?eubh)}(hX Adjust the visited inode cursor to point to the inode just prior to where the examination cursor is now. Because the scanner holds the AGI buffer lock, no inodes could have been created in the part of the inode keyspace that the visited inode cursor just advanced. h]h)}(hXAdjust the visited inode cursor to point to the inode just prior to where the examination cursor is now. Because the scanner holds the AGI buffer lock, no inodes could have been created in the part of the inode keyspace that the visited inode cursor just advanced.h]hXAdjust the visited inode cursor to point to the inode just prior to where the examination cursor is now. Because the scanner holds the AGI buffer lock, no inodes could have been created in the part of the inode keyspace that the visited inode cursor just advanced.}(hj^ehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjZeubah}(h]h ]h"]h$]h&]uh1hhj?eubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhj-eubeh}(h]h ]h"]h$]h&]uh1hhjydhhhNhNubh)}(hX[Get the incore inode for the inumber of the examination cursor. By maintaining the AGI buffer lock until this point, the scanner knows that it was safe to advance the examination cursor across the entire keyspace, and that it has stabilized this next inode so that it cannot disappear from the filesystem until the scan releases the incore inode. h]h)}(hXZGet the incore inode for the inumber of the examination cursor. 
By maintaining the AGI buffer lock until this point, the scanner knows that it was safe to advance the examination cursor across the entire keyspace, and that it has stabilized this next inode so that it cannot disappear from the filesystem until the scan releases the incore inode.h]hXZGet the incore inode for the inumber of the examination cursor. By maintaining the AGI buffer lock until this point, the scanner knows that it was safe to advance the examination cursor across the entire keyspace, and that it has stabilized this next inode so that it cannot disappear from the filesystem until the scan releases the incore inode.}(hjehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj~eubah}(h]h ]h"]h$]h&]uh1hhjydhhhhhNubh)}(h=Drop the AGI lock and return the incore inode to the caller. h]h)}(h`_ series. The first user of the new functionality is the `online quotacheck `_ series.h](hThe proposed patches are the }(hjfhhhNhNubj)}(hj`inode scanner `_h]h inode scanner}(hjfhhhNhNubah}(h]h ]h"]h$]h&]name inode scannerjjWhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscanuh1jhjfubh)}(hZ h]h}(h] inode-scannerah ]h"] inode scannerah$]h&]refurijfuh1hjyKhjfubh8 series. The first user of the new functionality is the }(hjfhhhNhNubj)}(ht`online quotacheck `_h]honline quotacheck}(hjghhhNhNubah}(h]h ]h"]h$]h&]nameonline quotacheckjj]https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheckuh1jhjfubh)}(h` h]h}(h]online-quotacheckah ]h"]online quotacheckah$]h&]refurijguh1hjyKhjfubh series.}(hjfhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjjchhubeh}(h](jO jiceh ]h"](coordinated inode scansiscaneh$]h&]uh1hhjchhhhhMg j}j2gj_csj}jicj_csubh)}(hhh](h)}(hInode Managementh]hInode Management}(hj:ghhhNhNubah}(h]h ]h"]h$]h&]jjk uh1hhj7ghhhhhM ubh)}(hXIn regular filesystem code, references to allocated XFS incore inodes are always obtained (``xfs_iget``) outside of transaction context because the creation of the incore context for an existing file does not require metadata updates. However, it is important to note that references to incore inodes obtained as part of file creation must be performed in transaction context because the filesystem must ensure the atomicity of the ondisk inode btree index updates and the initialization of the actual ondisk inode.h](h[In regular filesystem code, references to allocated XFS incore inodes are always obtained (}(hjHghhhNhNubj)}(h ``xfs_iget``h]hxfs_iget}(hjPghhhNhNubah}(h]h ]h"]h$]h&]uh1jhjHgubhX) outside of transaction context because the creation of the incore context for an existing file does not require metadata updates. However, it is important to note that references to incore inodes obtained as part of file creation must be performed in transaction context because the filesystem must ensure the atomicity of the ondisk inode btree index updates and the initialization of the actual ondisk inode.}(hjHghhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj7ghhubh)}(hReferences to incore inodes are always released (``xfs_irele``) outside of transaction context because there are a handful of activities that might require ondisk updates:h](h1References to incore inodes are always released (}(hjhghhhNhNubj)}(h ``xfs_irele``h]h xfs_irele}(hjpghhhNhNubah}(h]h ]h"]h$]h&]uh1jhjhgubhm) outside of transaction context because there are a handful of activities that might require ondisk updates:}(hjhghhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj7ghhubh)}(hhh](h)}(hSThe VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode release. 
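Returning to the coordinated scan, a checking function might drive ``xchk_iscan_iter`` with a loop like this sketch, assuming (this is an assumption, not a statement of the kernel's exact convention) that the iterator returns 1 with a stabilized inode, 0 at the end of the inumber keyspace, and a negative errno on failure::

    /* Sketch: visit every allocated inode in the filesystem. */
    static int
    xchk_scan_everything(
            struct xfs_scrub        *sc,
            struct xchk_iscan       *iscan)
    {
            struct xfs_inode        *ip;
            int                     error;

            while ((error = xchk_iscan_iter(iscan, &ip)) == 1) {
                    error = xchk_gather_observations(sc, ip); /* hypothetical */
                    xchk_irele(sc, ip);
                    if (error)
                            break;
            }
            return error < 0 ? error : 0;
    }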
h]h)}(hRThe VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode release.h](h6The VFS may decide to kick off writeback as part of a }(hjghhhNhNubj)}(h ``DONTCACHE``h]h DONTCACHE}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1jhjgubh inode release.}(hjghhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjgubah}(h]h ]h"]h$]h&]uh1hhjghhhhhNubh)}(h2Speculative preallocations need to be unreserved. h]h)}(h1Speculative preallocations need to be unreserved.h]h1Speculative preallocations need to be unreserved.}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjgubah}(h]h ]h"]h$]h&]uh1hhjghhhhhNubh)}(hAn unlinked file may have lost its last reference, in which case the entire file must be inactivated, which involves releasing all of its resources in the ondisk metadata and freeing the inode. h]h)}(hAn unlinked file may have lost its last reference, in which case the entire file must be inactivated, which involves releasing all of its resources in the ondisk metadata and freeing the inode.h]hAn unlinked file may have lost its last reference, in which case the entire file must be inactivated, which involves releasing all of its resources in the ondisk metadata and freeing the inode.}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjgubah}(h]h ]h"]h$]h&]uh1hhjghhhhhNubeh}(h]h ]h"]h$]h&]jJjKuh1hhhhM hj7ghhubh)}(hXThese activities are collectively called inode inactivation. Inactivation has two parts -- the VFS part, which initiates writeback on all dirty file pages, and the XFS part, which cleans up XFS-specific information and frees the inode if it was unlinked. If the inode is unlinked (or unconnected after a file handle operation), the kernel drops the inode into the inactivation machinery immediately.h]hXThese activities are collectively called inode inactivation. Inactivation has two parts -- the VFS part, which initiates writeback on all dirty file pages, and the XFS part, which cleans up XFS-specific information and frees the inode if it was unlinked. If the inode is unlinked (or unconnected after a file handle operation), the kernel drops the inode into the inactivation machinery immediately.}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj7ghhubh)}(hbDuring normal operation, resource acquisition for an update follows this order to avoid deadlocks:h]hbDuring normal operation, resource acquisition for an update follows this order to avoid deadlocks:}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj7ghhubji)}(hhh](h)}(hInode reference (``iget``). h]h)}(hInode reference (``iget``).h](hInode reference (}(hjhhhhNhNubj)}(h``iget``h]higet}(hjhhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjhubh).}(hjhhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj hubah}(h]h ]h"]h$]h&]uh1hhjhhhhhhNubh)}(hFFilesystem freeze protection, if repairing (``mnt_want_write_file``). 
h]h)}(hEFilesystem freeze protection, if repairing (``mnt_want_write_file``).h](h,Filesystem freeze protection, if repairing (}(hj8hhhhNhNubj)}(h``mnt_want_write_file``h]hmnt_want_write_file}(hj@hhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj8hubh).}(hj8hhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj4hubah}(h]h ]h"]h$]h&]uh1hhjhhhhhhNubh)}(h`_ and `dir iget usage `_.h](h"Proposed patchsets include fixing }(hjQjhhhNhNubj)}(hr`scrub iget usage `_h]hscrub iget usage}(hjYjhhhNhNubah}(h]h ]h"]h$]h&]namescrub iget usagejj\https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixesuh1jhjQjubh)}(h_ h]h}(h]scrub-iget-usageah ]h"]scrub iget usageah$]h&]refurijijuh1hjyKhjQjubh and }(hjQjhhhNhNubj)}(ht`dir iget usage `_h]hdir iget usage}(hj{jhhhNhNubah}(h]h ]h"]h$]h&]namedir iget usagejj`https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixesuh1jhjQjubh)}(hc h]h}(h]dir-iget-usageah ]h"]dir iget usageah$]h&]refurijjuh1hjyKhjQjubh.}(hjQjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM8 hjihhubh)}(h .. _ilocking:h]h}(h]h ]h"]h$]h&]jilockinguh1hhM> hjihhhhubeh}(h]j ah ]h"]iget and irele during a scrubah$]h&]uh1hhj7ghhhhhM ubh)}(hhh](h)}(hLocking Inodesh]hLocking Inodes}(hjjhhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhjjhhhhhMA ubh)}(hXIn regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks in a well-known order: parent → child when updating the directory tree, and in numerical order of the addresses of their ``struct inode`` object otherwise. For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page faults. If two MMAPLOCKs must be acquired, they are acquired in numerical order of the addresses of their ``struct address_space`` objects. Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be acquired before transactions are allocated. If two ILOCKs must be acquired, they are acquired in inumber order.h](hIn regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks in a well-known order: parent → child when updating the directory tree, and in numerical order of the addresses of their }(hjjhhhNhNubj)}(h``struct inode``h]h struct inode}(hjjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjjubh object otherwise. For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page faults. If two MMAPLOCKs must be acquired, they are acquired in numerical order of the addresses of their }(hjjhhhNhNubj)}(h``struct address_space``h]hstruct address_space}(hjjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjjubh objects. Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be acquired before transactions are allocated. If two ILOCKs must be acquired, they are acquired in inumber order.}(hjjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMC hjjhhubh)}(hXInode lock acquisition must be done carefully during a coordinated inode scan. Online fsck cannot abide these conventions, because for a directory tree scanner, the scrub process holds the IOLOCK of the file being scanned and it needs to take the IOLOCK of the file at the other end of the directory link. If the directory tree is corrupt because it contains a cycle, ``xfs_scrub`` cannot use the regular inode locking functions and avoid becoming trapped in an ABBA deadlock.h](hXpInode lock acquisition must be done carefully during a coordinated inode scan. 
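Consistent with the ordering rules just listed and with the lock ordering described for regular filesystem code, a repair setup path might acquire its resources in roughly this order. Error unwinding is elided, and the shape of the function is illustrative only::

    /* Sketch: resource acquisition order for a repair. */
    static int
    xrep_setup_sketch(
            struct xfs_scrub        *sc,
            xfs_ino_t               ino)
    {
            int                     error;

            /* 1: inode reference. */
            error = xfs_iget(sc->mp, NULL, ino, 0, 0, &sc->ip);
            if (error)
                    return error;

            /* 2: freeze protection, since this is a repair. */
            error = mnt_want_write_file(sc->file);
            if (error)
                    return error;

            /* IOLOCK and MMAPLOCK before allocating a transaction... */
            xfs_ilock(sc->ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);

            error = xfs_trans_alloc_empty(sc->mp, &sc->tp);
            if (error)
                    return error;

            /* ...and the ILOCK last, for the metadata updates. */
            xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
            return 0;
    }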
Online fsck cannot abide these conventions, because for a directory tree scanner, the scrub process holds the IOLOCK of the file being scanned and it needs to take the IOLOCK of the file at the other end of the directory link. If the directory tree is corrupt because it contains a cycle, }(hjjhhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hjkhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjjubh_ cannot use the regular inode locking functions and avoid becoming trapped in an ABBA deadlock.}(hjjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMN hjjhhubh)}(hX[Solving both of these problems is straightforward -- any time online fsck needs to take a second lock of the same class, it uses trylock to avoid an ABBA deadlock. If the trylock fails, scrub drops all inode locks and uses trylock loops to (re)acquire all necessary resources. Trylock loops enable scrub to check for pending fatal signals, which is how scrub avoids deadlocking the filesystem or becoming an unresponsive process. However, trylock loops mean that online fsck must be prepared to measure the resource being scrubbed before and after the lock cycle to detect changes and react accordingly.h]hX[Solving both of these problems is straightforward -- any time online fsck needs to take a second lock of the same class, it uses trylock to avoid an ABBA deadlock. If the trylock fails, scrub drops all inode locks and uses trylock loops to (re)acquire all necessary resources. Trylock loops enable scrub to check for pending fatal signals, which is how scrub avoids deadlocking the filesystem or becoming an unresponsive process. However, trylock loops mean that online fsck must be prepared to measure the resource being scrubbed before and after the lock cycle to detect changes and react accordingly.}(hjkhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMV hjjhhubh)}(h.. _dirparent:h]h}(h]h ]h"]h$]h&]j dirparentuh1hhMa hjjhhhhubeh}(h](j jjeh ]h"](locking inodesilockingeh$]h&]uh1hhj7ghhhhhMA j}j6kjjsj}jjjjsubh)}(hhh](h)}(h&Case Study: Finding a Directory Parenth]h&Case Study: Finding a Directory Parent}(hj>khhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj;khhhhhMd ubh)}(hXConsider the directory parent pointer repair code as an example. Online fsck must verify that the dotdot dirent of a directory points up to a parent directory, and that the parent directory contains exactly one dirent pointing down to the child directory. Fully validating this relationship (and repairing it if possible) requires a walk of every directory on the filesystem while holding the child locked, and while updates to the directory tree are being made. 
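The trylock strategy might look like the following sketch; ``xfs_ilock_nowait`` and ``xchk_should_terminate`` exist in the kernel, though the loop shown here is illustrative::

    /*
     * Sketch: take a second ILOCK of the same class without risking
     * an ABBA deadlock.  On failure, drop everything, check for a
     * pending fatal signal, and try again.
     */
    static int
    xchk_ilock_two(
            struct xfs_scrub        *sc,
            struct xfs_inode        *child,
            struct xfs_inode        *parent)
    {
            int                     error = 0;

            while (true) {
                    xfs_ilock(child, XFS_ILOCK_EXCL);
                    if (xfs_ilock_nowait(parent, XFS_ILOCK_EXCL))
                            return 0;       /* both locks acquired */

                    xfs_iunlock(child, XFS_ILOCK_EXCL);
                    if (xchk_should_terminate(sc, &error))
                            return error;   /* fatal signal pending */
                    delay(1);               /* brief backoff */
            }
    }

After a lock cycle like this, the caller must re-measure the resource being scrubbed, because it may have changed while it was unlocked.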
.. _dirparent:

Case Study: Finding a Directory Parent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider the directory parent pointer repair code as an example.
Online fsck must verify that the dotdot dirent of a directory points up to a
parent directory, and that the parent directory contains exactly one dirent
pointing down to the child directory.
Fully validating this relationship (and repairing it if possible) requires a
walk of every directory on the filesystem while holding the child locked,
and while updates to the directory tree are being made.
The coordinated inode scan provides a way to walk the filesystem without the
possibility of missing an inode.
The child directory is kept locked to prevent updates to the dotdot dirent,
but if the scanner fails to lock a parent, it can drop and relock both the
child and the prospective parent.
If the dotdot entry changes while the directory is unlocked, then a move or
rename operation must have changed the child's parentage, and the scan can
exit early.
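A sketch of that early-exit check might look like this; the helper is
hypothetical, though ``xfs_dir_lookup`` and ``xfs_name_dotdot`` are the
regular directory code::

    /*
     * Sketch: after a lock cycle, decide whether a rename raced with the
     * scan by reloading the child's dotdot entry.
     */
    static int example_recheck_dotdot(struct xfs_scrub *sc,
                                      xfs_ino_t old_parent_ino)
    {
            xfs_ino_t parent_ino;
            int error;

            error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot,
                            &parent_ino, NULL);
            if (error)
                    return error;

            /* Someone moved the child; the scan can exit early. */
            if (parent_ino != old_parent_ino)
                    return -ECANCELED;
            return 0;
    }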
The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

.. _fshooks:

Filesystem Hooks
````````````````

The second piece of support that online fsck functions need during a full
filesystem scan is the ability to stay informed about updates being made by
other threads in the filesystem, since comparisons against the past are
useless in a dynamic environment.
Two pieces of Linux kernel infrastructure enable online fsck to monitor
regular filesystem operations: filesystem hooks and
:ref:`static keys<jump_labels>`.

Filesystem hooks convey information about an ongoing filesystem operation to
a downstream consumer.
In this case, the downstream consumer is always an online fsck function.
Because multiple fsck functions can run in parallel, online fsck uses the
Linux notifier call chain facility to dispatch updates to any number of
interested fsck processes.
Call chains are a dynamic list, which means that they can be configured at
run time.
Because these hooks are private to the XFS module, the information passed
along contains exactly what the checking function needs to update its
observations.

The current implementation of XFS hooks uses SRCU notifier chains to reduce
the impact to highly threaded workloads.
Regular blocking notifier chains use a rwsem and seem to have a much lower
overhead for single-threaded applications.
However, it may turn out that the combination of blocking chains and static
keys is a more performant combination; more study is needed here.

The following pieces are necessary to hook a certain point in the filesystem
(a lifecycle sketch appears at the end of this list's discussion):

- A ``struct xfs_hooks`` object must be embedded in a convenient place such
  as a well-known incore filesystem object.

- Each hook must define an action code and a structure containing more
  context about the action.

- Hook providers should provide appropriate wrapper functions and structs
  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of
  type checking to ensure correct usage.

- A callsite in the regular filesystem code must be chosen to call
  ``xfs_hooks_call`` with the action code and data structure.
  This place should be adjacent to (and not earlier than) the place where
  the filesystem update is committed to the transaction.
  In general, when the filesystem calls a hook chain, it should be able to
  handle sleeping and should not be vulnerable to memory reclaim or locking
  recursion.
  However, the exact requirements are very dependent on the context of the
  hook caller and the callee.
- The online fsck function should define a structure to hold scan data, a
  lock to coordinate access to the scan data, and a ``struct xfs_hook``
  object.
  The scanner function and the regular filesystem code must acquire
  resources in the same order; see the next section for details.

- The online fsck code must contain a C function to catch the hook action
  code and data structure.
  If the object being updated has already been visited by the scan, then the
  hook information must be applied to the scan data.

- Prior to unlocking inodes to start the scan, online fsck must call
  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
  ``xfs_hooks_add`` to enable the hook.
- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan
  is complete.

The number of hooks should be kept to a minimum to reduce complexity.
Static keys are used to reduce the overhead of filesystem hooks to nearly
zero when online fsck is not running.
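Putting the pieces together, the lifecycle of a hook during a scan might
look like the following sketch.
The scan structure, the handler, and the ``m_example_hooks`` chain are
hypothetical, and the setup/add/del helper signatures are assumptions::

    struct example_scan {
            struct mutex            lock;   /* protects the scan data */
            struct xfs_hook         hook;   /* live update notifier */
            /* ... shadow observations ... */
    };

    static int example_run_scan(struct xfs_mount *mp,
                                struct example_scan *scan)
    {
            int error;

            /* Enable the hook before the scan gives up its inode locks. */
            xfs_hooks_setup(&scan->hook, example_hook_fn);
            error = xfs_hooks_add(&mp->m_example_hooks, &scan->hook);
            if (error)
                    return error;

            error = example_scan_all_inodes(mp, scan);

            /* Disable the hook once the scan data are no longer needed. */
            xfs_hooks_del(&mp->m_example_hooks, &scan->hook);
            return error;
    }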
.. _liveupdate:

Live Updates During a Scan
``````````````````````````

The code paths of the online fsck scanning code and the
:ref:`hooked <fshooks>` filesystem code look like this::

            other program
                  ↓
            inode lock ←────────────────────┐
                  ↓                         │
            AG header lock                  │
                  ↓                         │
            filesystem function             │
                  ↓                         │
            notifier call chain             │    same
                  ↓                         ├─── inode
            scrub hook function             │    lock
                  ↓                         │
            scan data mutex ←──┐    same    │
                  ↓            ├─── scan    │
            update scan data   │    lock    │
                  ↑            │            │
            scan data mutex ←──┘            │
                  ↑                         │
            inode lock ←────────────────────┘
                  ↑
            scrub function
                  ↑
            inode scanner
                  ↑
            xfs_scrub

These rules must be followed to ensure correct interactions between the
checking code and the code making an update to the filesystem (a sketch of a
rule-abiding hook function follows the list):

- Prior to invoking the notifier call chain, the filesystem function being
  hooked must acquire the same lock that the scrub scanning function
  acquires to scan the inode.

- The scanning function and the scrub hook function must coordinate access
  to the scan data by acquiring a lock on the scan data.

- Scrub hook functions must not add the live update information to the scan
  observations unless the inode being updated has already been scanned.
  The scan coordinator has a helper predicate
  (``xchk_iscan_want_live_update``) for this.

- Scrub hook functions must not change the caller's state, including the
  transaction that it is running.
  They must not acquire any resources that might conflict with the
  filesystem function being hooked.

- The hook function can abort the inode scan to avoid breaking the other
  rules.
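A rule-abiding hook function, sketched below, touches only the shadow scan
data, and only under the scan data lock.
The update payload and scan types are hypothetical stand-ins, and the
embedded ``notifier_block`` layout is an assumption::

    static int example_hook_fn(struct notifier_block *nb,
                               unsigned long action, void *data)
    {
            struct example_update *p = data;
            struct example_scan *scan =
                    container_of(nb, struct example_scan, hook.nb);

            /* Ignore updates to files the scan has not visited yet. */
            if (!xchk_iscan_want_live_update(&scan->iscan, p->ino))
                    return NOTIFY_DONE;

            /* Fold the live update into the shadow observations. */
            mutex_lock(&scan->lock);
            example_apply_update(scan, action, p);
            mutex_unlock(&scan->lock);

            return NOTIFY_DONE;
    }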
The inode scan APIs are pretty simple (a usage sketch follows the list):

- ``xchk_iscan_start`` starts a scan.

- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
  returns zero if there is nothing left to scan.

- ``xchk_iscan_want_live_update`` decides if an inode has already been
  visited in the scan.
  This is critical for hook functions to decide if they need to update the
  in-memory scan information.

- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
  scan.

- ``xchk_iscan_teardown`` finishes the scan.
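A typical scan loop built from these APIs might look like this sketch;
argument lists are abbreviated, and ``xchk_iscan_iter`` is assumed to return
1 when it produces an inode and 0 at the end of the scan::

    static int example_scan_all_inodes(struct xfs_scrub *sc,
                                       struct example_scan *scan)
    {
            struct xfs_inode *ip;
            int error;

            xchk_iscan_start(sc, &scan->iscan);

            while ((error = xchk_iscan_iter(&scan->iscan, &ip)) == 1) {
                    error = example_record_inode(scan, ip);

                    /* Live updates now apply to this file. */
                    xchk_iscan_mark_visited(&scan->iscan, ip);
                    xchk_irele(sc, ip);
                    if (error)
                            break;
            }

            xchk_iscan_teardown(&scan->iscan);
            return error < 0 ? error : 0;
    }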
This functionality is also a part of the
`inode scanner
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
series.

.. _quotacheck:

Case Study: Quota Counter Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is useful to compare the mount time quotacheck code to the online repair
quotacheck code.
Mount time quotacheck does not have to contend with concurrent operations,
so it does the following:

1. Make sure the ondisk dquots are in good enough shape that all the incore
   dquots will actually load, and zero the resource usage counters in the
   ondisk buffer.

2. Walk every inode in the filesystem.
   Add each file's resource usage to the incore dquot.

3. Walk each incore dquot.
   If the incore dquot is not being flushed, add the ondisk buffer backing
   the incore dquot to a delayed write (delwri) list.

4. Write the buffer list to disk.

Like most online fsck functions, online quotacheck can't write to regular
filesystem objects until the newly collected metadata reflect all filesystem
state.
Therefore, online quotacheck records file resource usage to a shadow dquot
index implemented with a sparse ``xfarray``, and only writes to the real
dquots once the scan is complete.
Handling transactional updates is tricky because quota resource usage
updates are handled in phases to minimize contention on dquots:

1. The inodes involved are joined and locked to a transaction.

2. For each dquot attached to the file:

   a. The dquot is locked.

   b. A quota reservation is added to the dquot's resource usage.
      The reservation is recorded in the transaction.

   c. The dquot is unlocked.

3. Changes in actual quota usage are tracked in the transaction.

4. At transaction commit time, each dquot is examined again:

   a. The dquot is locked again.

   b. Quota usage changes are logged and unused reservation is given back to
      the dquot.

   c. The dquot is unlocked.
For online quotacheck, hooks are placed in steps 2 and 4.
The step 2 hook creates a shadow version of the transaction dquot context
(``dqtrx``) that operates in a similar manner to the regular code.
The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
Notice that both hooks are called with the inode locked, which is how the
live update coordinates with the inode scanner.
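The step 4 hook might fold a shadow ``dqtrx`` into the shadow dquot index
roughly as follows; the record layout, helper names, and hook payload are
all hypothetical::

    struct example_shadow_dquot {
            uint64_t        bcount;         /* observed data blocks */
            uint64_t        icount;         /* observed inodes */
            uint64_t        rtbcount;       /* observed rt blocks */
    };

    static int example_apply_dqtrx(struct example_qcheck *qc, xfs_dqid_t id,
                                   const struct example_dqtrx *dqtrx)
    {
            struct example_shadow_dquot shadow;
            int error;

            mutex_lock(&qc->lock);
            /* Missing records load as zeroes from the sparse xfarray. */
            error = xfarray_load_sparse(qc->shadow, id, &shadow);
            if (!error) {
                    /* Commit the deltas observed at transaction commit. */
                    shadow.bcount += dqtrx->bcount_delta;
                    shadow.icount += dqtrx->icount_delta;
                    shadow.rtbcount += dqtrx->rtbcount_delta;
                    error = xfarray_store(qc->shadow, id, &shadow);
            }
            mutex_unlock(&qc->lock);
            return error;
    }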
The quotacheck scan looks like this:

1. Set up a coordinated inode scan.

2. For each inode returned by the inode scan iterator:

   a. Grab and lock the inode.

   b. Determine that inode's resource usage (data blocks, inode counts,
      realtime blocks) and add that to the shadow dquots for the user,
      group, and project ids associated with the inode.

   c. Unlock and release the inode.

3. For each dquot in the system:

   a. Grab and lock the dquot.

   b. Check the dquot against the shadow dquots created by the scan and
      updated by the live hooks (see the comparison sketch below).

Live updates are key to being able to walk every quota record without
needing to hold any locks for a long duration.
If repairs are desired, the real and shadow dquots are locked and their
resource counts are set to the values in the shadow dquot.
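Reusing the hypothetical shadow record from the previous sketch, the
comparison in step 3b might look like this; the caller is assumed to hold
the dquot lock from step 3a::

    static int example_compare_dquot(struct example_qcheck *qc,
                                     struct xfs_dquot *dqp)
    {
            struct example_shadow_dquot shadow;
            int error;

            mutex_lock(&qc->lock);
            error = xfarray_load_sparse(qc->shadow, dqp->q_id, &shadow);
            mutex_unlock(&qc->lock);
            if (error)
                    return error;

            /* Any disagreement means the dquot counters are wrong. */
            if (dqp->q_blk.count != shadow.bcount ||
                dqp->q_ino.count != shadow.icount ||
                dqp->q_rtb.count != shadow.rtbcount)
                    example_set_corrupt(qc);
            return 0;
    }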
The proposed patchset is the
`online quotacheck
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
series.

.. _nlinks:

Case Study: File Link Count Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File link count checking also uses live update hooks.
The coordinated inode scanner is used to visit all directories on the
filesystem, and per-file link count records are stored in a sparse
``xfarray`` indexed by inumber.
During the scanning phase, each entry in a directory generates observation
data as follows:

1. If the entry is a dotdot (``'..'``) entry of the root directory, the
   directory's parent link count is bumped because the root directory's
   dotdot entry is self referential.

2. If the entry is a dotdot entry of a subdirectory, the parent's backref
   count is bumped.

3. If the entry is neither a dot nor a dotdot entry, the target file's
   parent count is bumped.

4. If the target is a subdirectory, the parent's child link count is bumped.

A crucial point to understand about how the link count inode scanner
interacts with the live update hooks is that the scan cursor tracks which
*parent* directories have been scanned.
In other words, the live updates ignore any update about ``A → B`` when A
has not been scanned, even if B has been scanned.
Furthermore, a subdirectory A with a dotdot entry pointing back to B is
accounted as a backref counter in the shadow data for A, since child dotdot
entries affect the parent's link count.
Live update hooks are carefully placed in all parts of the filesystem that
create, change, or remove directory entries, since those operations involve
bumplink and droplink.

For any file, the correct link count is the number of parents plus the
number of child subdirectories.
Non-directories never have children of any kind.
The backref information is used to detect inconsistencies in the number of
links pointing to child subdirectories and the number of dotdot entries
pointing back.

After the scan completes, the link count of each file can be checked by
locking both the inode and the shadow data, and comparing the link counts.
A second coordinated inode scan cursor is used for comparisons.
Live updates are key to being able to walk every inode without needing to
hold any locks between inodes.
If repairs are desired, the inode's link count is set to the value in the
shadow information.
If no parents are found, the file must be
:ref:`reparented <orphanage>` to the orphanage to prevent the file from
being lost forever.
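The comparison itself reduces to simple arithmetic, as in this sketch with a
hypothetical shadow record::

    struct example_nlink_rec {
            uint32_t        parents;        /* dirents pointing here */
            uint32_t        backrefs;       /* dotdots pointing back */
            uint32_t        children;       /* subdirectories inside */
    };

    static bool example_nlink_ok(struct xfs_inode *ip,
                                 const struct example_nlink_rec *rec)
    {
            /* Correct link count: parents plus child subdirectories. */
            uint32_t expected = rec->parents + rec->children;

            /* Each child must point back with exactly one dotdot entry. */
            if (S_ISDIR(VFS_I(ip)->i_mode) &&
                rec->backrefs != rec->children)
                    return false;

            return VFS_I(ip)->i_nlink == expected;
    }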
The proposed patchset is the
`file link count repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
series.

.. _rmap_repair:

Case Study: Rebuilding Reverse Mapping Records
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Most repair functions follow the same pattern: lock filesystem resources,
walk the surviving ondisk metadata looking for replacement metadata records,
and use an :ref:`in-memory array <xfarray>` to store the gathered
observations.
The primary advantage of this approach is the simplicity and modularity of
the repair code -- code and data are entirely contained within the scrub
module, do not require hooks in the main filesystem, and are usually the
most efficient in memory use.
A secondary advantage of this repair approach is atomicity -- once the
kernel decides a structure is corrupt, no other threads can access the
metadata until the kernel finishes repairing and revalidating the metadata.

For repairs going on within a shard of the filesystem, these advantages
outweigh the delays inherent in locking the shard while repairing parts of
the shard.
Unfortunately, repairs to the reverse mapping btree cannot use the
"standard" btree repair strategy because it must scan every space mapping of
every fork of every file in the filesystem, and the filesystem cannot stop.
Therefore, rmap repair foregoes atomicity between scrub and repair.
It combines a :ref:`coordinated inode scanner <iscan>`,
:ref:`live update hooks <liveupdate>`, and an
:ref:`in-memory rmap btree <xfbtree>` to complete the scan for reverse
mapping records.

1. Set up an xfbtree to stage rmap records.

2. While holding the locks on the AGI and AGF buffers acquired during the
   scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
   staging extents, and the internal log.
3. Set up an inode scanner.

4. Hook into rmap updates for the AG being repaired so that the live scan
   data can receive updates to the rmap btree from the rest of the
   filesystem during the file scan.

5. For each space mapping found in either fork of each file scanned, decide
   if the mapping matches the AG of interest.
   If so (see the staging sketch after this list):

   a. Create a btree cursor for the in-memory btree.

   b. Use the rmap code to add the record to the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.

6. For each live update received via the hook, decide if the owner has
   already been scanned.
   If so, apply the live update into the scan data:

   a. Create a btree cursor for the in-memory btree.
   b. Replay the operation into the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.
      This is performed with an empty transaction to avoid changing the
      caller's state.

7. When the inode scan finishes, create a new scrub transaction and relock
   the two AG headers.

8. Compute the new btree geometry using the number of rmap records in the
   shadow btree, like all other btree rebuilding functions.

9. Allocate the number of blocks computed in the previous step.

10. Perform the usual btree bulk loading and commit to install the new rmap
    btree.

11. Reap the old rmap btree blocks as discussed in the case study about how
    to :ref:`reap after rmap btree repair <rmap_reap>`.
12. Free the xfbtree now that it is not needed.
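Steps 5 and 6 can share a staging helper, which might look like the
following sketch.
The scan structure and error handling are simplified assumptions;
``xfs_trans_alloc_empty``, the in-memory rmap btree cursor, and the xfbtree
commit helpers are the facilities described above, with assumed signatures::

    static int example_stage_rmap(struct example_rmap_scan *rs,
                                  struct xfs_rmap_irec *rec)
    {
            struct xfs_btree_cur *cur;
            struct xfs_trans *tp;
            int error;

            /* Empty transaction so staging never alters caller state. */
            error = xfs_trans_alloc_empty(rs->sc->mp, &tp);
            if (error)
                    return error;

            /* 5a/6a: cursor for the xfile-backed in-memory btree. */
            cur = xfs_rmapbt_mem_cursor(rs->sc->sa.pag, tp, rs->xfbt);

            /* 5b/6b: reuse the regular rmap code to stage the record. */
            error = xfs_rmap_map_raw(cur, rec);
            xfs_btree_del_cursor(cur, error);

            /* 5c/6c: write the shadow btree changes to the xfile. */
            if (error)
                    xfbtree_trans_cancel(rs->xfbt, tp);
            else
                    error = xfbtree_trans_commit(rs->xfbt, tp);

            xfs_trans_cancel(tp);
            return error;
    }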
The proposed patchset is the
`rmap repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
series.

Staging Repairs with Temporary Files on Disk
--------------------------------------------

XFS stores a substantial amount of metadata in file forks: directories,
extended attributes, symbolic link targets, free space bitmaps and summary
information for the realtime volume, and quota records.
File forks map 64-bit logical file fork space extents to physical storage
space extents, similar to how a memory management unit maps 64-bit virtual
addresses to physical memory addresses.
Therefore, file-based tree structures (such as directories and extended
attributes) use blocks mapped in the file fork offset address space that
point to other blocks mapped within that same address space, and file-based
linear structures (such as bitmaps and quota records) compute array element
offsets in the file fork offset address space.

Because file forks can consume as much space as the entire filesystem,
repairs cannot be staged in memory, even when a paging scheme is available.
Therefore, online repair of file-based metadata creates a temporary file in
the XFS filesystem, writes a new structure at the correct offsets into the
temporary file, and atomically exchanges all file fork mappings (and hence
the fork contents) to commit the repair.
Once the repair is complete, the old fork can be reaped as necessary; if the
system goes down during the reap, the iunlink code will delete the blocks
during log recovery.

**Note**: All space usage and inode indices in the filesystem *must* be
consistent to use a temporary file safely!
This dependency is the reason why online repair can only use pageable kernel
memory to stage ondisk space usage information.

Exchanging metadata file mappings with a temporary file requires the owner
field of the block headers to match the file being repaired and not the
temporary file.
The directory, extended attribute, and symbolic link functions were all
modified to allow callers to specify owner numbers explicitly.

There is a downside to the reaping process -- if the system crashes during
the reap phase and the fork extents are crosslinked, the iunlink processing
will fail because freeing space will find the extra reverse mappings and
abort.

Temporary files created for repair are similar to ``O_TMPFILE`` files
created by userspace.
They are not linked into a directory and the entire file will be reaped when
the last reference to the file is lost.
The key differences are that these files must have no access permission
outside the kernel at all, they must be specially marked to prevent them
from being opened by handle, and they must never be linked into the
directory tree.

**Historical Sidebar**:

In the initial iteration of file metadata repair, the damaged metadata
blocks would be scanned for salvageable data; the extents in the file fork
would be reaped; and then a new structure would be built in its place.
This strategy did not survive the introduction of the atomic repair
requirement expressed earlier in this document.

The second iteration explored building a second structure at a high offset
in the fork from the salvage data, reaping the old extents, and using a
``COLLAPSE_RANGE`` operation to slide the new extents into place.

This had many drawbacks:

- Array structures are linearly addressed, and the regular filesystem
  codebase does not have the concept of a linear offset that could be
  applied to the record offset computation to build an alternate copy.

- Extended attributes are allowed to use the entire attr fork offset address
  space.

- Even if repair could build an alternate copy of a data structure in a
  different part of the fork address space, the atomic repair commit
  requirement means that online repair would have to be able to perform a
  log assisted ``COLLAPSE_RANGE`` operation to ensure that the old structure
  was completely replaced.
- A crash after construction of the secondary tree but before the range
  collapse would leave unreachable blocks in the file fork.
  This would likely confuse things further.

- Reaping blocks after a repair is not a simple operation, and initiating a
  reap operation from a restarted range collapse operation during log
  recovery is daunting.

- Directory entry blocks and quota records record the file fork offset in
  the header area of each block.
  An atomic range collapse operation would have to rewrite this part of each
  block header.
  Rewriting a single field in block headers is not a huge problem, but it's
  something to be aware of.

- Each block in a directory or extended attributes btree index contains
  sibling and child block pointers.
Using a Temporary File
``````````````````````

Online repair code should use the ``xrep_tempfile_create`` function to
create a temporary file inside the filesystem.
This allocates an inode, marks the in-core inode private, and attaches it
to the scrub context.
These files are hidden from userspace, may not be added to the directory
tree, and must be kept private.

Temporary files only use two inode locks: the IOLOCK and the ILOCK.
The MMAPLOCK is not needed here, because there must not be page faults
from userspace for data fork blocks.
The usage patterns of these two locks are the same as for any other XFS
file -- access to file data is controlled via the IOLOCK, and access to
file metadata is controlled via the ILOCK.
Locking helpers are provided so that the temporary file and its lock
state can be cleaned up by the scrub context.
To comply with the nested locking strategy laid out in the
:ref:`inode locking <ilocking>` section, it is recommended that scrub
functions use the xrep_tempfile_ilock*_nowait lock helpers.
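The following sketch shows how a repair function might put these pieces
together.
It is illustrative only: the exact signatures of ``xrep_tempfile_create``
and the ``xrep_tempfile_ilock*_nowait`` helpers named above are
assumptions here, not a statement of the final API.

.. code-block:: c

    /*
     * Sketch only: create and lock a temporary file for a repair.
     * The helper signatures are assumed for illustration.
     */
    STATIC int
    xrep_example_setup_tempfile(
            struct xfs_scrub        *sc)
    {
            int                     error;

            /* Allocate a private, unlinked temporary regular file. */
            error = xrep_tempfile_create(sc, S_IFREG);
            if (error)
                    return error;

            /*
             * Take the temporary file's ILOCK without blocking, per
             * the nested locking strategy for scrub functions.
             */
            if (!xrep_tempfile_ilock_nowait(sc))
                    return -EDEADLOCK;

            return 0;
    }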
Data can be written to a temporary file by two means:

1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
   temporary file from an xfile.

2. The regular directory, symbolic link, and extended attribute functions
   can be used to write to the temporary file.

Once a good copy of a data file has been constructed in a temporary file,
it must be conveyed to the file being repaired, which is the topic of the
next section.

The proposed patches are in the
`repair temporary files
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
series.

Logged File Content Exchanges
-----------------------------

Once repair builds a temporary file with a new data structure written
into it, it must commit the new changes into the existing file.
It is not possible to swap the inumbers of two files, so instead the new
metadata must replace the old.
This suggests the need for the ability to swap extents, but the existing
extent swapping code used by the file defragmenting tool ``xfs_fsr`` is
not sufficient for online repair because:
- When the reverse-mapping btree is enabled, the swap code must keep the
  reverse mapping information up to date with every exchange of mappings.
  Therefore, it can only exchange one mapping per transaction, and each
  transaction is independent.

- Reverse-mapping is critical for the operation of online fsck, so the
  old defragmentation code (which swapped entire extent forks in a single
  operation) is not useful here.

- Defragmentation is assumed to occur between two files with identical
  contents.
  For this use case, an incomplete exchange will not result in a
  user-visible change in file contents, even if the operation is
  interrupted.

- Online repair needs to swap the contents of two files that are by
  definition *not* identical.
  For directory and xattr repairs, the user-visible contents might be the
  same, but the contents of individual blocks may be very different.

- Old blocks in the file may be cross-linked with another structure and
  must not reappear if the system goes down mid-repair.
These problems are overcome by creating a new deferred operation and a
new type of log intent item to track the progress of an operation to
exchange two file ranges.
The new exchange operation type chains together the same transactions
used by the reverse-mapping extent swap code, but records intermediate
progress in the log so that operations can be restarted after a crash.
This new functionality is called the file contents exchange
(xfs_exchrange) code.
The underlying implementation exchanges file fork mappings
(xfs_exchmaps).
The new log item records the progress of the exchange to ensure that once
an exchange begins, it will always run to completion, even if there are
interruptions.
The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag in
the superblock protects these new log item records from being replayed on
old kernels.

The proposed patchset is the
`file contents exchange
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
series.

**Sidebar: Using Log-Incompatible Feature Flags**

   Starting with XFS v5, the superblock contains a
   ``sb_features_log_incompat`` field to indicate that the log contains
   records that might not be readable by all kernels that could mount
   this filesystem.
   In short, log incompat features protect the log contents against
   kernels that will not understand the contents.
   Unlike the other superblock feature bits, log incompat bits are
   ephemeral because an empty (clean) log does not need protection.
   The log cleans itself after its contents have been committed into the
   filesystem, either as part of an unmount or because the system is
   otherwise idle.
   Because upper level code can be working on a transaction at the same
   time that the log cleans itself, it is necessary for upper level code
   to communicate to the log when it is going to use a log incompatible
   feature.

   The log coordinates access to incompatible features through the use of
   one ``struct rw_semaphore`` for each feature.
   The log cleaning code tries to take this rwsem in exclusive mode to
   clear the bit; if the lock attempt fails, the feature bit remains set.
   The code supporting a log incompat feature should create wrapper
   functions to obtain the log feature and call
   ``xfs_add_incompat_log_feature`` to set the feature bits in the
   primary superblock.
   The superblock update is performed transactionally, so the wrapper to
   obtain log assistance must be called just prior to the creation of the
   transaction that uses the functionality.
   For a file operation, this step must happen after taking the IOLOCK
   and the MMAPLOCK, but before allocating the transaction.
   When the transaction is complete, the ``xlog_drop_incompat_feat``
   function is called to release the feature.
   The feature bit will not be cleared from the superblock until the log
   becomes clean.

   Log-assisted extended attribute updates and file content exchanges
   both use log incompat features and provide convenience wrappers around
   the functionality.
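To make the sidebar's wrapper pattern concrete, a feature might guard its
transactions roughly as follows.
Only ``xfs_add_incompat_log_feature`` and ``xlog_drop_incompat_feat`` are
named by this document; the feature flag and the surrounding function in
this sketch are hypothetical.

.. code-block:: c

    /*
     * Sketch of the log incompat wrapper pattern.  The feature flag
     * XFS_SB_FEAT_INCOMPAT_LOG_EXAMPLE is hypothetical.
     */
    STATIC int
    example_logged_update(
            struct xfs_mount        *mp)
    {
            int                     error;

            /*
             * Obtain the log feature before allocating the transaction
             * that relies on it.  For a file operation, this happens
             * after taking the IOLOCK and the MMAPLOCK.
             */
            error = xfs_add_incompat_log_feature(mp,
                            XFS_SB_FEAT_INCOMPAT_LOG_EXAMPLE);
            if (error)
                    return error;

            /* ...allocate, use, and commit the transaction here... */

            /*
             * Release the feature.  The ondisk bit is not cleared
             * until the log cleans itself.
             */
            xlog_drop_incompat_feat(mp->m_log);
            return 0;
    }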
Mechanics of a Logged File Content Exchange
```````````````````````````````````````````

Exchanging contents between file forks is a complex task.
The goal is to exchange all file fork mappings between two file fork
offset ranges.
There are likely to be many extent mappings in each fork, and the edges
of the mappings aren't necessarily aligned.
Furthermore, there may be other updates that need to happen after the
exchange, such as exchanging file sizes, inode flags, or conversion of
fork data to local format.
This is roughly the format of the new deferred exchange-mapping work
item:

.. code-block:: c

    struct xfs_exchmaps_intent {
        /* Inodes participating in the operation. */
        struct xfs_inode    *xmi_ip1;
        struct xfs_inode    *xmi_ip2;

        /* File offset range information. */
        xfs_fileoff_t       xmi_startoff1;
        xfs_fileoff_t       xmi_startoff2;
        xfs_filblks_t       xmi_blockcount;

        /* Set these file sizes after the operation, unless negative. */
        xfs_fsize_t         xmi_isize1;
        xfs_fsize_t         xmi_isize2;

        /* XFS_EXCHMAPS_* log operation flags */
        uint64_t            xmi_flags;
    };

The new log intent item contains enough information to track two logical
fork offset ranges: ``(inode1, startoff1, blockcount)`` and
``(inode2, startoff2, blockcount)``.
Each step of an exchange operation exchanges the largest file range
mapping possible from one file to the other.
After each step in the exchange operation, the two startoff fields are
incremented and the blockcount field is decremented to reflect the
progress made.
The flags field captures behavioral parameters such as exchanging attr
fork mappings instead of the data fork and other work to be done after
the exchange.
The two isize fields are used to exchange the file sizes at the end of
the operation if the file data fork is the target of the operation.
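As an example, an exchange of two entire data forks might prime the work
item as in this sketch; the initializer function is hypothetical, but the
fields are those of the structure above.

.. code-block:: c

    /* Sketch: prime an exchange of two entire data forks. */
    STATIC void
    example_exchmaps_init(
            struct xfs_exchmaps_intent      *xmi,
            struct xfs_inode                *ip1,
            struct xfs_inode                *ip2,
            xfs_filblks_t                   blockcount)
    {
            xmi->xmi_ip1 = ip1;
            xmi->xmi_ip2 = ip2;

            /* Start at offset zero and cover both forks entirely. */
            xmi->xmi_startoff1 = 0;
            xmi->xmi_startoff2 = 0;
            xmi->xmi_blockcount = blockcount;

            /* Exchange the file sizes once the operation completes. */
            xmi->xmi_isize1 = ip2->i_disk_size;
            xmi->xmi_isize2 = ip1->i_disk_size;

            /* Operate on the data forks; no extra post-processing. */
            xmi->xmi_flags = 0;
    }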
When the exchange is initiated, the sequence of operations is as follows:

1. Create a deferred work item for the file mapping exchange.
   At the start, it should contain the entirety of the file block ranges
   to be exchanged.

2. Call ``xfs_defer_finish`` to process the exchange.
   This is encapsulated in ``xrep_tempexch_contents`` for scrub
   operations.
   This will log an extent swap intent item to the transaction for the
   deferred mapping exchange work item.

3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is
   zero,

   a. Read the block maps of both file ranges starting at
      ``xmi_startoff1`` and ``xmi_startoff2``, respectively, and compute
      the longest extent that can be exchanged in a single step.
      This is the minimum of the two ``br_blockcount`` values in the
      mappings.
      Keep advancing through the file forks until at least one of the
      mappings contains written blocks.
      Mutual holes, unwritten extents, and extent mappings to the same
      physical space are not exchanged.

      For the next few steps, this document will refer to the mapping
      that came from file 1 as "map1", and the mapping that came from
      file 2 as "map2".

   b. Create a deferred block mapping update to unmap map1 from file 1.

   c. Create a deferred block mapping update to unmap map2 from file 2.

   d. Create a deferred block mapping update to map map1 into file 2.

   e. Create a deferred block mapping update to map map2 into file 1.

   f. Log the block, quota, and extent count updates for both files.

   g. Extend the ondisk size of either file if necessary.

   h. Log a mapping exchange done log item for the mapping exchange
      intent log item that was read at the start of step 3.

   i. Compute the amount of file range that has just been covered.
      This quantity is
      ``(map1.br_startoff + map1.br_blockcount - xmi_startoff1)``,
      because step 3a could have skipped holes.

   j. Increase the starting offsets of ``xmi_startoff1`` and
      ``xmi_startoff2`` by the number of blocks computed in the previous
      step, and decrease ``xmi_blockcount`` by the same quantity.
      This advances the cursor; a sketch of this arithmetic follows the
      list.

   k. Log a new mapping exchange intent log item reflecting the advanced
      state of the work item.

   l. Return the proper error code (EAGAIN) to the deferred operation
      manager to inform it that there is more work to be done.
      The operation manager completes the deferred work in steps 3b-3e
      before moving back to the start of step 3.

4. Perform any post-processing.
   This will be discussed in more detail in subsequent sections.
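The cursor arithmetic in steps 3i and 3j might look like the following
sketch, assuming ``map1`` is a ``struct xfs_bmbt_irec`` holding the
mapping read from file 1 in step 3a; the function itself is hypothetical.

.. code-block:: c

    /*
     * Sketch of steps 3i and 3j: advance the exchange cursor past the
     * range that was just exchanged, including any holes that step 3a
     * skipped over.
     */
    STATIC void
    example_exchmaps_advance(
            struct xfs_exchmaps_intent      *xmi,
            const struct xfs_bmbt_irec      *map1)
    {
            xfs_filblks_t                   covered;

            /* Step 3i: file range covered by this step. */
            covered = map1->br_startoff + map1->br_blockcount -
                      xmi->xmi_startoff1;

            /* Step 3j: move the cursor forward by that quantity. */
            xmi->xmi_startoff1 += covered;
            xmi->xmi_startoff2 += covered;
            xmi->xmi_blockcount -= covered;
    }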
If the filesystem goes down in the middle of an operation, log recovery
will find the most recent unfinished mapping exchange log intent item and
restart from there.
This is how atomic file mapping exchanges guarantee that an outside
observer will either see the old broken structure or the new one, and
never a mishmash of both.

Preparation for File Content Exchanges
``````````````````````````````````````

There are a few things that need to be taken care of before initiating an
atomic file mapping exchange operation.
First, regular files require the page cache to be flushed to disk before
the operation begins, and directio writes to be quiesced.
Like any filesystem operation, file mapping exchanges must determine the
maximum amount of disk space and quota that can be consumed on behalf of
both files in the operation, and reserve that quantity of resources to
avoid an unrecoverable out of space failure once it starts dirtying
metadata.
The preparation step scans the ranges of both files to estimate:

- Data device blocks needed to handle the repeated updates to the fork
  mappings.

- Change in data and realtime block counts for both files.

- Increase in quota usage for both files, if the two files do not share
  the same set of quota ids.

- The number of extent mappings that will be added to each file.

- Whether or not there are partially written realtime extents.
  User programs must never be able to access a realtime file extent that
  maps to different extents on the realtime volume, which could happen if
  the operation fails to run to completion.

The need for precise estimation increases the run time of the exchange
operation, but it is very important to maintain correct accounting.
The filesystem must not run completely out of free space, nor can the
mapping exchange ever add more extent mappings to a fork than it can
support.
Regular users are required to abide by the quota limits, though metadata
repairs may exceed quota to resolve inconsistent metadata elsewhere.
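A hypothetical container for the preparation step's estimates might look
like the following sketch; the structure and field names are illustrative
and not part of any committed interface.

.. code-block:: c

    /* Sketch: quantities computed by the preparation step. */
    struct example_exchmaps_estimate {
            /* Data device blocks for repeated fork mapping updates. */
            xfs_filblks_t           resblks;

            /*
             * Net change in data and realtime block counts for each
             * file.  If the two files have different quota ids, these
             * deltas also drive the quota reservations.
             */
            int64_t                 ip1_bcount;
            int64_t                 ip2_bcount;
            int64_t                 ip1_rtbcount;
            int64_t                 ip2_rtbcount;

            /* Extent mappings that will be added to each file. */
            xfs_extnum_t            ip1_mappings;
            xfs_extnum_t            ip2_mappings;

            /* Are there partially written realtime extents? */
            bool                    partial_rtext;
    };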
Special Features for Exchanging Metadata File Contents
``````````````````````````````````````````````````````

Extended attributes, symbolic links, and directories can set the fork
format to "local" and treat the fork as a literal area for data storage.
Metadata repairs must take extra steps to support these cases:

- If both forks are in local format and the fork areas are large enough,
  the exchange is performed by copying the incore fork contents, logging
  both forks, and committing.
  The atomic file mapping exchange mechanism is not necessary, since this
  can be done with a single transaction.

- If both forks map blocks, then the regular atomic file mapping exchange
  is used.

- Otherwise, only one fork is in local format.
  The contents of the local format fork are converted to a block to
  perform the exchange.
  The conversion to block format must be done in the same transaction
  that logs the initial mapping exchange intent log item.
  The regular atomic mapping exchange is used to exchange the metadata
  file mappings.
  Special flags are set on the exchange operation so that the transaction
  can be rolled one more time to convert the second file's fork back to
  local format so that the second file will be ready to go as soon as the
  ILOCK is dropped.
Extended attributes and directories stamp the owning inode into every
block, but the buffer verifiers do not actually check the inode number!
Although there is no verification, it is still important to maintain
referential integrity, so prior to performing the mapping exchange,
online repair builds every block in the new data structure with the owner
field of the file being repaired.

After a successful exchange operation, the repair operation must reap the
old fork blocks by processing each fork mapping through the standard
:ref:`file extent reaping <reaping>` mechanism that is done post-repair.
If the filesystem should go down during the reap part of the repair, the
iunlink processing at the end of recovery will free both the temporary
file and whatever blocks were not reaped.
However, this iunlink processing omits the cross-link detection of online
repair, and is not completely foolproof.

Exchanging Temporary File Contents
``````````````````````````````````

To repair a metadata file, online repair proceeds as follows:
1. Create a temporary repair file.

2. Use the staging data to write out new contents into the temporary
   repair file.
   The contents must be written to the same fork that is being repaired.

3. Commit the scrub transaction, since the exchange resource estimation
   step must be completed before transaction reservations are made.

4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction
   with the appropriate resource reservations and locks, and to fill out
   a ``struct xfs_exchmaps_req`` with the details of the exchange
   operation.

5. Call ``xrep_tempexch_contents`` to exchange the contents.

6. Commit the transaction to complete the repair.
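Steps 4 through 6 might look like the following sketch; the
``struct xrep_tempexch`` type and the ``xrep_tempexch_*`` helper
signatures are assumptions for illustration.

.. code-block:: c

    /*
     * Sketch of steps 4-6: exchange the temporary file's repaired fork
     * contents with the file being repaired.
     */
    STATIC int
    example_commit_new_contents(
            struct xfs_scrub        *sc,
            struct xrep_tempexch    *tx)
    {
            int                     error;

            /*
             * Step 4: allocate a transaction with enough reservation
             * for the entire exchange, take the locks, and fill out
             * the exchange request.
             */
            error = xrep_tempexch_trans_alloc(sc, XFS_DATA_FORK, tx);
            if (error)
                    return error;

            /* Step 5: exchange the temporary and ondisk contents. */
            error = xrep_tempexch_contents(sc, tx);
            if (error)
                    return error;

            /* Step 6: commit the transaction to finish the repair. */
            return xfs_trans_commit(sc->tp);
    }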
.. _rtsummary:

Case Study: Repairing the Realtime Summary File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the "realtime" section of an XFS filesystem, free space is tracked via
a bitmap, similar to Unix FFS.
Each bit in the bitmap represents one realtime extent, which is a
multiple of the filesystem block size between 4KiB and 1GiB in size.
The realtime summary file indexes the number of free extents of a given
size to the offset of the block within the realtime free space bitmap
where those free extents begin.
In other words, the summary file helps the allocator find free extents by
length, similar to what the free space by count (cntbt) btree does for
the data section.

The summary file itself is a flat file (with no block headers or
checksums!) partitioned into ``log2(total rt extents)`` sections
containing enough 32-bit counters to match the number of blocks in the rt
bitmap.
Each counter records the number of free extents that start in that bitmap
block and can satisfy a power-of-two allocation request.

To check the summary file against the bitmap:

1. Take the ILOCK of both the realtime bitmap and summary files.

2. For each free space extent recorded in the bitmap:

   a. Compute the position in the summary file that contains a counter
      that represents this free extent.

   b. Read the counter from the xfile.

   c. Increment it, and write it back to the xfile.

3. Compare the contents of the xfile against the ondisk file.
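The counter position computation in step 2a might look like this sketch,
which assumes that ``sb_rbmblocks`` (the number of blocks in the realtime
bitmap) gives the size of each summary section; the function itself is
hypothetical.

.. code-block:: c

    /*
     * Sketch of step 2a: compute the index of the 32-bit summary
     * counter describing free extents of length 2^log2_len that begin
     * in realtime bitmap block rbmblock.
     */
    STATIC xfs_fileoff_t
    example_summary_counter_index(
            struct xfs_mount        *mp,
            unsigned int            log2_len,
            xfs_fileoff_t           rbmblock)
    {
            /*
             * Each of the log2(total rt extents) sections contains one
             * counter per block in the realtime bitmap.
             */
            return (xfs_fileoff_t)log2_len * mp->m_sb.sb_rbmblocks +
                            rbmblock;
    }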
To repair the summary file, write the xfile contents into the temporary
file and use atomic mapping exchange to commit the new contents.
The temporary file is then reaped.

The proposed patchset is the
`realtime summary repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
series.

Case Study: Salvaging Extended Attributes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In XFS, extended attributes are implemented as a namespaced name-value
store.
Values are limited in size to 64KiB, but there is no limit on the number
of names.
The attribute fork is unpartitioned, which means that the root of the
attribute structure is always in logical block zero, but attribute leaf
blocks, dabtree index blocks, and remote value blocks are intermixed.
Attribute leaf blocks contain variable-sized records that associate
user-provided names with the user-provided values.
Values larger than a block are allocated separate extents and written
there.
If the leaf information expands beyond a single block, a
directory/attribute btree (``dabtree``) is created to map hashes of
attribute names to entries for fast lookup.

Salvaging extended attributes is done as follows:
1. Walk the attr fork mappings of the file being repaired to find the
   attribute leaf blocks.
   When one is found,

   a. Walk the attr leaf block to find candidate keys.
      When one is found,

      1. Check the name for problems, and ignore the name if there are
         any.

      2. Retrieve the value.
         If that succeeds, add the name and value to the staging xfarray
         and xfblob; a sketch of this staging step follows the list.

2. If the memory usage of the xfarray and xfblob exceeds a certain amount
   of memory or there are no more attr fork blocks to examine, unlock the
   file and add the staged extended attributes to the temporary file.

3. Use atomic file mapping exchange to exchange the new and old extended
   attribute structures.
   The old attribute blocks are now attached to the temporary file.

4. Reap the temporary file.
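The staging in step 1a.2 might be structured like this sketch, using the
xfarray and xfblob facilities discussed earlier in this document; the
record layout and the helper function are illustrative assumptions.

.. code-block:: c

    /* Sketch: remember one salvaged attribute name and value. */
    struct example_stashed_attr {
            xfblob_cookie           name_cookie;
            xfblob_cookie           value_cookie;
            uint32_t                namelen;
            uint32_t                valuelen;
    };

    STATIC int
    example_stash_attr(
            struct xfarray          *staged_attrs,
            struct xfblob           *staged_blobs,
            const void              *name,
            uint32_t                namelen,
            const void              *value,
            uint32_t                valuelen)
    {
            struct example_stashed_attr     key = {
                    .namelen        = namelen,
                    .valuelen       = valuelen,
            };
            int                     error;

            /* Stash the variable-length name and value in the xfblob... */
            error = xfblob_store(staged_blobs, &key.name_cookie, name,
                            namelen);
            if (error)
                    return error;

            error = xfblob_store(staged_blobs, &key.value_cookie, value,
                            valuelen);
            if (error)
                    return error;

            /* ...and append the fixed-size record to the xfarray. */
            return xfarray_append(staged_attrs, &key);
    }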
The proposed patchset is the
`extended attribute repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
series.

Fixing Directories
------------------

Fixing directories is difficult with currently available filesystem
features, since directory entries are not redundant.
The offline repair tool scans all inodes to find files with nonzero link
count, and then it scans all directories to establish parentage of those
linked files.
Damaged files and directories are zapped, and files with no parent are
moved to the ``/lost+found`` directory.
It does not try to salvage anything.

The best that online repair can do at this time is to read directory data
blocks and salvage any dirents that look plausible, correct link counts,
and move orphans back into the directory tree.
The salvage process is discussed in the case study at the end of this
section.
The :ref:`file link count fsck <nlinks>` code takes care of fixing link
counts and moving orphans to the ``/lost+found`` directory.

Case Study: Salvaging Directories
`````````````````````````````````

Unlike extended attributes, directory blocks are all the same size, so
salvaging directories is straightforward:
1. Find the parent of the directory.
   If the dotdot entry is readable, try to confirm that the alleged parent
   has a child entry pointing back to the directory being repaired.
   Otherwise, walk the filesystem to find it.

2. Walk the first partition of the data fork of the directory to find the
   directory entry data blocks.
   When one is found,

   a. Walk the directory data block to find candidate entries.
      When an entry is found:

      i. Check the name for problems, and ignore the name if there are.

      ii. Retrieve the inumber and grab the inode.
          If that succeeds, add the name, inode number, and file type to
          the staging xfarray and xfblob (the stashed record layout is
          sketched after this list).

3. If the memory usage of the xfarray and xfblob exceeds a certain amount
   of memory or there are no more directory data blocks to examine, unlock
   the directory and add the staged dirents into the temporary directory.
   Truncate the staging files.

4. Use atomic file mapping exchange to exchange the new and old directory
   structures.
   The old directory blocks are now attached to the temporary file.

5. Reap the temporary file.
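For illustration, the fixed-size portion of each stashed dirent might look
like the record below.
The layout is an assumption, not the kernel's exact structure; only the
variable-length name lives in the xfblob, referenced by cookie.

.. code-block:: c

	/*
	 * Sketch of a fixed-size record stashed for each salvaged dirent.
	 * The layout is illustrative; the variable-length name is stored
	 * in the xfblob and referenced by cookie.
	 */
	struct dirent_stash_rec {
		xfblob_cookie		name_cookie;	/* dirent name */
		xfs_ino_t		ino;		/* child inumber */
		uint8_t			namelen;
		uint8_t			ftype;		/* file type */
	};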
**Future Work Question**: Should repair revalidate the dentry cache when
rebuilding a directory?

*Answer*: Yes, it should.

In theory it is necessary to scan all dentry cache entries for a directory
to ensure that one of the following applies:

1. The cached dentry reflects an ondisk dirent in the new directory.

2. The cached dentry no longer has a corresponding ondisk dirent in the new
   directory and the dentry can be purged from the cache.

3. The cached dentry no longer has an ondisk dirent but the dentry cannot
   be purged.
   This is the problem case.

Unfortunately, the current dentry cache design doesn't provide a means to
walk every child dentry of a specific directory, which makes this a hard
problem.
There is no known solution.
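If such a walk existed, the per-dentry decision would mirror the three
cases above.
The sketch below is purely illustrative and cannot be implemented today;
the directory lookup helper is hypothetical.

.. code-block:: c

	/*
	 * Sketch of the per-dentry revalidation that repair would like to
	 * perform.  This cannot be written today because there is no way
	 * to iterate the child dentries of a directory; the lookup helper
	 * is hypothetical.
	 */
	static void
	check_cached_dentry(
		struct dentry		*dentry,
		struct xfs_inode	*dp)
	{
		/* Case 1: the rebuilt directory still has this entry. */
		if (new_dir_has_dirent(dp, &dentry->d_name))
			return;

		/* Case 2: try to purge the stale dentry. */
		d_invalidate(dentry);

		/* Case 3: if it cannot be purged, we are stuck. */
	}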
The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

Parent Pointers
```````````````

A parent pointer is a piece of file metadata that enables a user to locate
the file's parent directory without having to traverse the directory tree
from the root.
Without them, reconstruction of directory trees is hindered in much the
same way that the historic lack of reverse space mapping information once
hindered reconstruction of filesystem space metadata.
The parent pointer feature, however, makes total directory reconstruction
possible.

XFS parent pointers contain the information needed to identify the
corresponding directory entry in the parent directory.
In other words, child files use extended attributes to store pointers to
parents in the form ``(dirent_name) → (parent_inum, parent_gen)``.
The directory checking process can be strengthened to ensure that the
target of each dirent also contains a parent pointer pointing back to the
dirent.
Likewise, each parent pointer can be checked by ensuring that the target of
each parent pointer is a directory and that it contains a dirent matching
the parent pointer.
Both online and offline repair can use this strategy.
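The two directions of that strengthened check can be sketched as a pair of
predicates.
Both helper functions below are hypothetical names; the point is only that
every forward link must be confirmed by a reverse link, and vice versa.

.. code-block:: c

	/*
	 * Sketch of the two directions of the strengthened check.  Both
	 * helpers called here are hypothetical.
	 */
	STATIC int
	check_dirent_and_pptr(
		struct xfs_inode	*dp,	/* parent directory */
		struct xfs_inode	*ip,	/* child file */
		const struct xfs_name	*name)	/* dirent name */
	{
		int			error;

		/* Forward: the dirent target must point back at dp. */
		error = child_has_parent_pointer(ip, dp->i_ino, name);
		if (error)
			return error;

		/* Reverse: dp must contain a dirent matching the pointer. */
		return parent_has_dirent(dp, ip->i_ino, name);
	}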
**Historical Sidebar**:

Directory parent pointers were first proposed as an XFS feature more than a
decade ago by SGI.
Each link from a parent directory to a child file is mirrored with an
extended attribute in the child that could be used to identify the parent
directory.
Unfortunately, this early implementation had major shortcomings and was
never merged into Linux XFS:

1. The XFS codebase of the late 2000s did not have the infrastructure to
   enforce strong referential integrity in the directory tree.
   It did not guarantee that a change in a forward link would always be
   followed up with the corresponding change to the reverse links.

2. Referential integrity was not integrated into offline repair.
   Checking and repairs were performed on mounted filesystems without
   taking any kernel or inode locks to coordinate access.
   It is not clear how this actually worked properly.

3. The extended attribute did not record the name of the directory entry in
   the parent, so the SGI parent pointer implementation cannot be used to
   reconnect the directory tree.

4. Extended attribute forks only support 65,536 extents, which means that
   parent pointer attribute creation is likely to fail at some point before
   the maximum file link count is achieved.
The original parent pointer design was too unstable for something like a
filesystem repair to depend on.
Allison Henderson, Chandan Babu, and Catherine Hoang are working on a
second implementation that solves all shortcomings of the first.
During 2022, Allison introduced log intent items to track physical
manipulations of the extended attribute structures.
This solves the referential integrity problem by making it possible to
commit a dirent update and a parent pointer update in the same transaction.
Chandan increased the maximum extent counts of both data and attribute
forks, thereby ensuring that the extended attribute structure can grow to
handle the maximum hardlink count of any file.

For this second effort, the ondisk parent pointer format as originally
proposed was ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
The format was changed during development to eliminate the requirement
that repair tools ensure that the ``dirent_pos`` field always match when
reconstructing a directory.

There were a few other ways to have solved that problem:

1. The field could be designated advisory, since the other three values are
   sufficient to find the entry in the parent.
   However, this makes indexed key lookup impossible while repairs are
   ongoing.
2. We could allow creating directory entries at specified offsets, which
   solves the referential integrity problem but runs the risk that dirent
   creation will fail due to conflicts with the free space in the
   directory.

   These conflicts could be resolved by appending the directory entry and
   amending the xattr code to support updating an xattr key and reindexing
   the dabtree, though this would have to be performed with the parent
   directory still locked.

3. Same as above, but remove the old parent pointer entry and add a new one
   atomically.

4. Change the ondisk xattr format to
   ``(parent_inum, name) → (parent_gen)``, which would provide the attr
   name uniqueness that we require, without forcing repair code to update
   the dirent position.
   Unfortunately, this requires changes to the xattr code to support attr
   names as long as 263 bytes.

5. Change the ondisk xattr format to
   ``(parent_inum, hash(name)) → (name, parent_gen)``.
   If the hash is sufficiently resistant to collisions (e.g. sha256) then
   this should provide the attr name uniqueness that we require.
   Names shorter than 247 bytes could be stored directly.
6. Change the ondisk xattr format to
   ``(dirent_name) → (parent_ino, parent_gen)``.
   This format doesn't require any of the complicated nested name hashing
   of the previous suggestions.
   However, it was discovered that multiple hardlinks to the same inode
   with the same filename caused performance problems with hashed xattr
   lookups, so the parent inumber is now xor'd into the hash index.

In the end, it was decided that solution #6 was the most compact and the
most performant.
A new hash function was designed for parent pointers.
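The xor trick can be illustrated with a short sketch.
This is not the kernel's actual parent pointer hash function; it merely
shows how mixing the parent inumber into the name hash keeps same-named
hardlinks to the same file from piling up in one dabtree hash bucket.

.. code-block:: c

	/*
	 * Sketch only: mix the parent inumber into the name hash so that
	 * multiple hardlinks to the same file under the same name do not
	 * all land in one hash bucket.  The kernel's actual parent
	 * pointer hash may differ.
	 */
	static inline xfs_dahash_t
	pptr_hash_sketch(
		xfs_ino_t	parent_ino,
		const uint8_t	*name,
		int		namelen)
	{
		return xfs_da_hashname(name, namelen) ^
		       upper_32_bits(parent_ino) ^
		       lower_32_bits(parent_ino);
	}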
Case Study: Repairing Directories with Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and a
:ref:`directory entry live update hook <liveupdate>` as follows:

1. Set up a temporary directory for generating the new directory structure,
   an xfblob for storing entry names, and an xfarray for stashing the fixed
   size fields involved in a directory update:
   ``(child inumber, add vs. remove, name cookie, ftype)``.

2. Set up an inode scanner and hook into the directory entry code to
   receive updates on directory operations.

3. For each parent pointer found in each file scanned, decide if the parent
   pointer references the directory of interest.
   If so:

   a. Stash the parent pointer name and an addname entry for this dirent in
      the xfblob and xfarray, respectively.

   b. When finished scanning that file or the kernel memory consumption
      exceeds a threshold, flush the stashed updates to the temporary
      directory.
4. For each live directory update received via the hook, decide if the
   child has already been scanned.
   If so:

   a. Stash the parent pointer name and an addname or removename entry for
      this dirent update in the xfblob and xfarray for later.
      We cannot write directly to the temporary directory because hook
      functions are not allowed to modify filesystem metadata.
      Instead, we stash updates in the xfarray and rely on the scanner
      thread to apply the stashed updates to the temporary directory (see
      the sketch after this list).

5. When the scan is complete, replay any stashed entries in the xfarray.

6. When the scan is complete, atomically exchange the contents of the
   temporary directory and the directory being repaired.
   The temporary directory now contains the damaged directory structure.

7. Reap the temporary directory.
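A live update hook for this scheme might look like the sketch below.
The notifier-based shape and all helper and structure names are assumptions
for illustration; the essential property is that the hook only stashes the
update and never modifies filesystem metadata itself.

.. code-block:: c

	/*
	 * Sketch of a directory entry live update hook.  The structure
	 * and helper names are illustrative assumptions.  Note that the
	 * hook only stashes the update; the scanner thread applies it to
	 * the temporary directory later.
	 */
	static int
	dir_rebuild_hook(
		struct notifier_block	*nb,
		unsigned long		action,
		void			*data)
	{
		struct dir_update_params *p = data;
		struct dir_rebuild	*rb;
		struct dirent_update_rec rec = {
			.ino	= p->child_ino,
			.remove	= (action == DIR_HOOK_REMOVE),
			.ftype	= p->ftype,
		};

		rb = container_of(nb, struct dir_rebuild, nb);

		/* Ignore children that the scanner has not reached yet. */
		if (!child_already_scanned(rb, p->child_ino))
			return NOTIFY_DONE;

		/*
		 * Stash the update; hooks must not modify filesystem
		 * metadata directly.
		 */
		if (xfblob_store(rb->names, &rec.name_cookie, p->name,
				p->namelen) ||
		    xfarray_append(rb->updates, &rec))
			rb->error = -ENOMEM;

		return NOTIFY_DONE;
	}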
The proposed patchset is the
`parent pointers directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
series.

Case Study: Repairing Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Online reconstruction of a file's parent pointer information works
similarly to directory reconstruction:

1. Set up a temporary file for generating a new extended attribute
   structure, an xfblob for storing parent pointer names, and an xfarray
   for stashing the fixed size fields involved in a parent pointer update:
   ``(parent inumber, parent generation, add vs. remove, name cookie)``.

2. Set up an inode scanner and hook into the directory entry code to
   receive updates on directory operations.
h]h)}(hkStash the dirent name and an addpptr entry for this parent pointer in the xfblob and xfarray, respectively.h]hkStash the dirent name and an addpptr entry for this parent pointer in the xfblob and xfarray, respectively.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjubh)}(hWhen finished scanning the directory or the kernel memory consumption exceeds a threshold, flush the stashed updates to the temporary file. h]h)}(hWhen finished scanning the directory or the kernel memory consumption exceeds a threshold, flush the stashed updates to the temporary file.h]hWhen finished scanning the directory or the kernel memory consumption exceeds a threshold, flush the stashed updates to the temporary file.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM!hjubah}(h]h ]h"]h$]h&]uh1hhjubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhjӊubeh}(h]h ]h"]h$]h&]uh1hhjhhhNhNubh)}(hXFor each live directory update received via the hook, decide if the parent has already been scanned. If so: a. Stash the dirent name and an addpptr or removepptr entry for this dirent update in the xfblob and xfarray for later. We cannot write parent pointers directly to the temporary file because hook functions are not allowed to modify filesystem metadata. Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed parent pointer updates to the temporary file. h](h)}(hkFor each live directory update received via the hook, decide if the parent has already been scanned. If so:h]hkFor each live directory update received via the hook, decide if the parent has already been scanned. If so:}(hj(hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM$hj$ubji)}(hhh]h)}(hXStash the dirent name and an addpptr or removepptr entry for this dirent update in the xfblob and xfarray for later. We cannot write parent pointers directly to the temporary file because hook functions are not allowed to modify filesystem metadata. Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed parent pointer updates to the temporary file. h]h)}(hXStash the dirent name and an addpptr or removepptr entry for this dirent update in the xfblob and xfarray for later. We cannot write parent pointers directly to the temporary file because hook functions are not allowed to modify filesystem metadata. Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed parent pointer updates to the temporary file.h]hXStash the dirent name and an addpptr or removepptr entry for this dirent update in the xfblob and xfarray for later. We cannot write parent pointers directly to the temporary file because hook functions are not allowed to modify filesystem metadata. Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed parent pointer updates to the temporary file.}(hj=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM(hj9ubah}(h]h ]h"]h$]h&]uh1hhj6ubah}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhj$ubeh}(h]h ]h"]h$]h&]uh1hhjhhhNhNubh)}(hFWhen the scan is complete, replay any stashed entries in the xfarray. h]h)}(hEWhen the scan is complete, replay any stashed entries in the xfarray.h]hEWhen the scan is complete, replay any stashed entries in the xfarray.}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM/hj]ubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hGCopy all non-parent pointer extended attributes to the temporary file. 
The proposed patchset is the
`parent pointers repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
series.

Digression: Offline Checking of Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Examining parent pointers in offline repair works differently because
corrupt files are erased long before directory tree connectivity checks
are performed.
Parent pointer checks are therefore a second pass to be added to the
existing connectivity checks:

1. After the set of surviving files has been established (phase 6), walk
   the surviving directories of each AG in the filesystem.
   This is already performed as part of the connectivity checks.
2. For each directory entry found,

   a. If the name has already been stored in the xfblob, then use that
      cookie and skip the next step.

   b. Otherwise, record the name in an xfblob, and remember the xfblob
      cookie.
      Unique mappings are critical for

      1. Deduplicating names to reduce memory usage, and

      2. Creating a stable sort key for the parent pointer indexes so that
         the parent pointer validation described below will work.

   c. Store ``(child_ag_inum, parent_inum, parent_gen, name_hash, name_len,
      name_cookie)`` tuples in a per-AG in-memory slab.
      The ``name_hash`` referenced in this section is the regular directory
      entry name hash, not the specialized one used for parent pointer
      xattrs.
3. For each AG in the filesystem,

   a. Sort the per-AG tuple set in order of ``child_ag_inum``,
      ``parent_inum``, ``name_hash``, and ``name_cookie``.
      Having a single ``name_cookie`` for each ``name`` is critical for
      handling the uncommon case of a directory containing multiple
      hardlinks to the same file where all the names hash to the same
      value.

   b. For each inode in the AG,

      1. Scan the inode for parent pointers.
         For each parent pointer found,

         a. Validate the ondisk parent pointer.
            If validation fails, move on to the next parent pointer in the
            file.

         b. If the name has already been stored in the xfblob, then use
            that cookie and skip the next step.

         c. Record the name in a per-file xfblob, and remember the xfblob
            cookie.

         d. Store ``(parent_inum, parent_gen, name_hash, name_len,
            name_cookie)`` tuples in a per-file slab.

      2. Sort the per-file tuples in order of ``parent_inum``,
         ``name_hash``, and ``name_cookie``.

      3. Position one slab cursor at the start of the inode's records in
         the per-AG tuple slab.
         This should be trivial since the per-AG tuples are in child
         inumber order.

      4. Position a second slab cursor at the start of the per-file tuple
         slab.

      5. Iterate the two cursors in lockstep, comparing the ``parent_ino``,
         ``name_hash``, and ``name_cookie`` fields of the records under
         each cursor (see the sketch after this list):

         a. If the per-AG cursor is at a lower point in the keyspace than
            the per-file cursor, then the per-AG cursor points to a missing
            parent pointer.
            Add the parent pointer to the inode and advance the per-AG
            cursor.

         b. If the per-file cursor is at a lower point in the keyspace than
            the per-AG cursor, then the per-file cursor points to a
            dangling parent pointer.
            Remove the parent pointer from the inode and advance the
            per-file cursor.

         c. Otherwise, both cursors point at the same parent pointer.
            Update the parent_gen component if necessary.
            Advance both cursors.
Add the parent pointer to the inode and advance the per-AG cursor.h]hIf the per-AG cursor is at a lower point in the keyspace than the per-file cursor, then the per-AG cursor points to a missing parent pointer. Add the parent pointer to the inode and advance the per-AG cursor.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjubh)}(hIf the per-file cursor is at a lower point in the keyspace than the per-AG cursor, then the per-file cursor points to a dangling parent pointer. Remove the parent pointer from the inode and advance the per-file cursor. h]h)}(hIf the per-file cursor is at a lower point in the keyspace than the per-AG cursor, then the per-file cursor points to a dangling parent pointer. Remove the parent pointer from the inode and advance the per-file cursor.h]hIf the per-file cursor is at a lower point in the keyspace than the per-AG cursor, then the per-file cursor points to a dangling parent pointer. Remove the parent pointer from the inode and advance the per-file cursor.}(hj hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjubh)}(h~Otherwise, both cursors point at the same parent pointer. Update the parent_gen component if necessary. Advance both cursors. h]h)}(h}Otherwise, both cursors point at the same parent pointer. Update the parent_gen component if necessary. Advance both cursors.h]h}Otherwise, both cursors point at the same parent pointer. Update the parent_gen component if necessary. Advance both cursors.}(hj8hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj4ubah}(h]h ]h"]h$]h&]uh1hhjubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhjubeh}(h]h ]h"]h$]h&]uh1hhjubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjubeh}(h]h ]h"]h$]h&]uh1hhj ubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhjubeh}(h]h ]h"]h$]h&]uh1hhjhhhNhNubh)}(h2Move on to examining link counts, as we do today. h]h)}(h1Move on to examining link counts, as we do today.h]h1Move on to examining link counts, as we do today.}(hjthhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjpubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjhhhhhMFubh)}(hThe proposed patchset is the `offline parent pointers repair `_ series.h](hThe proposed patchset is the }(hjhhhNhNubj)}(h}`offline parent pointers repair `_h]hoffline parent pointers repair}(hjhhhNhNubah}(h]h ]h"]h$]h&]nameoffline parent pointers repairjjYhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsckuh1jhjubh)}(h\ h]h}(h]offline-parent-pointers-repairah ]h"]offline parent pointers repairah$]h&]refurijuh1hjyKhjubh series.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjhhubh)}(hX+Rebuilding directories from parent pointers in offline repair would be very challenging because xfs_repair currently uses two single-pass scans of the filesystem during phases 3 and 4 to decide which files are corrupt enough to be zapped. This scan would have to be converted into a multi-pass scan:h]hX+Rebuilding directories from parent pointers in offline repair would be very challenging because xfs_repair currently uses two single-pass scans of the filesystem during phases 3 and 4 to decide which files are corrupt enough to be zapped. This scan would have to be converted into a multi-pass scan:}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjhhubji)}(hhh](h)}(hThe first pass of the scan zaps corrupt inodes, forks, and attributes much as it does now. Corrupt directories are noted but not zapped. h]h)}(hThe first pass of the scan zaps corrupt inodes, forks, and attributes much as it does now. 
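The lockstep iteration in step 5 above is a classic sorted-merge join.
A minimal userspace toy of the comparison loop follows; the record type,
helper names, and the arrays in ``main`` are inventions of this sketch,
whereas the real repair code iterates xfarray slab cursors:

.. code-block:: c

   /*
    * Toy model of step 5: lockstep iteration of two sorted record
    * streams.  Everything here is illustrative scaffolding and not
    * part of xfs_repair.
    */
   #include <stdio.h>
   #include <stdint.h>

   struct pptr_rec {
           uint64_t parent_ino;    /* alleged parent inode number */
           uint32_t name_hash;     /* hash of the dirent name */
           uint64_t name_cookie;   /* xfblob cookie for the name */
   };

   /* Total order on (parent_ino, name_hash, name_cookie). */
   static int pptr_rec_cmp(const struct pptr_rec *a,
                           const struct pptr_rec *b)
   {
           if (a->parent_ino != b->parent_ino)
                   return a->parent_ino < b->parent_ino ? -1 : 1;
           if (a->name_hash != b->name_hash)
                   return a->name_hash < b->name_hash ? -1 : 1;
           if (a->name_cookie != b->name_cookie)
                   return a->name_cookie < b->name_cookie ? -1 : 1;
           return 0;
   }

   /*
    * Merge the per-AG records (what the dirents said) against the
    * per-file records (what the inode's parent pointers say).
    */
   static void reconcile(const struct pptr_rec *ag, int nag,
                         const struct pptr_rec *file, int nfile)
   {
           int i = 0, j = 0;

           while (i < nag || j < nfile) {
                   int cmp;

                   if (i >= nag)
                           cmp = 1;        /* only per-file records left */
                   else if (j >= nfile)
                           cmp = -1;       /* only per-AG records left */
                   else
                           cmp = pptr_rec_cmp(&ag[i], &file[j]);

                   if (cmp < 0)            /* step 5a: missing pptr */
                           printf("missing pptr to parent %llu\n",
                                  (unsigned long long)ag[i++].parent_ino);
                   else if (cmp > 0)       /* step 5b: dangling pptr */
                           printf("dangling pptr to parent %llu\n",
                                  (unsigned long long)file[j++].parent_ino);
                   else                    /* step 5c: match */
                           i++, j++;       /* refresh parent_gen if needed */
           }
   }

   int main(void)
   {
           struct pptr_rec ag[]   = { {128, 7, 1}, {129, 9, 2} };
           struct pptr_rec file[] = { {129, 9, 2}, {200, 3, 5} };

           reconcile(ag, 2, file, 2);      /* missing 128, dangling 200 */
           return 0;
   }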
Rebuilding directories from parent pointers in offline repair would be
very challenging because xfs_repair currently uses two single-pass scans
of the filesystem during phases 3 and 4 to decide which files are
corrupt enough to be zapped.
This scan would have to be converted into a multi-pass scan:

1. The first pass of the scan zaps corrupt inodes, forks, and attributes
   much as it does now.
   Corrupt directories are noted but not zapped.

2. The next pass records parent pointers pointing to the directories
   noted as being corrupt in the first pass.
   This second pass may have to happen after the phase 4 scan for
   duplicate blocks, if phase 4 is also capable of zapping directories.

3. The third pass resets corrupt directories to an empty shortform
   directory.
   Free space metadata has not been ensured yet, so repair cannot yet
   use the directory building code in libxfs.

4. At the start of phase 6, space metadata have been rebuilt.
   Use the parent pointer information recorded during step 2 to
   reconstruct the dirents and add them to the now-empty directories.

This code has not yet been constructed.

.. _dirtree:

Case Study: Directory Tree Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As mentioned earlier, the filesystem directory tree is supposed to be a
directed acyclic graph structure.
However, each node in this graph is a separate ``xfs_inode`` object with
its own locks, which makes validating the tree qualities difficult.
Fortunately, non-directories are allowed to have multiple parents and
cannot have children, so only directories need to be scanned.
Directories typically constitute 5-10% of the files in a filesystem,
which reduces the amount of work dramatically.
If the directory tree could be frozen, it would be easy to discover
cycles and disconnected regions by running a depth (or breadth) first
search downwards from the root directory and marking a bitmap for each
directory found.
At any point in the walk, trying to set an already set bit means there
is a cycle.
After the scan completes, XORing the marked inode bitmap with the inode
allocation bitmap reveals disconnected inodes.
However, one of online repair's design goals is to avoid locking the
entire filesystem unless it's absolutely necessary.
Directory tree updates can move subtrees across the scanner wavefront on
a live filesystem, so the bitmap algorithm cannot be applied.

Directory parent pointers enable an incremental approach to validation
of the tree structure.
Instead of using one thread to scan the entire filesystem, multiple
threads can walk from individual subdirectories upwards towards the
root.
For this to work, all directory entries and parent pointers must be
internally consistent, each directory entry must have a parent pointer,
and the link counts of all directories must be correct.
Each scanner thread must be able to take the IOLOCK of an alleged parent
directory while holding the IOLOCK of the child directory to prevent
either directory from being moved within the tree.
This is not possible since the VFS does not take the IOLOCK of a child
subdirectory when moving that subdirectory, so instead the scanner
stabilizes the parent -> child relationship by taking the ILOCKs and
installing a dirent update hook to detect changes.
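For illustration only, here is a userspace toy of the frozen-tree
thought experiment described above.
The bitmap is a plain 64-bit word and the directory type is invented;
online fsck cannot actually freeze the tree, which is why the
incremental algorithm below is used instead:

.. code-block:: c

   /*
    * Toy model of the frozen-tree check: walk downward from the root,
    * test-and-set one bit per directory to catch cycles, then XOR
    * against the allocation bitmap to find disconnected directories.
    */
   #include <stdbool.h>
   #include <stdint.h>

   static uint64_t seen_mask;      /* directories reached from the root */

   /* Returns true if the bit was already set, i.e. a cycle. */
   static bool test_and_set_seen(unsigned int ino)
   {
           uint64_t bit = 1ULL << (ino % 64);
           bool was_set = seen_mask & bit;

           seen_mask |= bit;
           return was_set;
   }

   struct tdir {
           unsigned int    ino;
           unsigned int    nr_children;
           struct tdir     **children;     /* subdirectories only */
   };

   /* Depth-first walk; returns false if a cycle was found. */
   static bool walk(struct tdir *dir)
   {
           unsigned int i;

           if (test_and_set_seen(dir->ino))
                   return false;           /* already visited: cycle */
           for (i = 0; i < dir->nr_children; i++)
                   if (!walk(dir->children[i]))
                           return false;
           return true;
   }

   /* Directories allocated but never reached are disconnected. */
   static uint64_t disconnected(uint64_t allocated_mask)
   {
           return seen_mask ^ allocated_mask;
   }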
The scanning process uses a dirent hook to detect changes to the
directories mentioned in the scan data.
The scan works as follows:

1. For each subdirectory in the filesystem,

   a. For each parent pointer of that subdirectory,

      1. Create a path object for that parent pointer, and mark the
         subdirectory inode number in the path object's bitmap.

      2. Record the parent pointer name and inode number in a path
         structure.

      3. If the alleged parent is the subdirectory being scrubbed, the
         path is a cycle.
         Mark the path for deletion and repeat step 1a with the next
         subdirectory parent pointer.

      4. Try to mark the alleged parent inode number in a bitmap in the
         path object.
         If the bit is already set, then there is a cycle in the
         directory tree.
         Mark the path as a cycle and repeat step 1a with the next
         subdirectory parent pointer.

      5. Load the alleged parent.
         If the alleged parent is not a linked directory, abort the scan
         because the parent pointer information is inconsistent.

      6. For each parent pointer of this alleged ancestor directory,

         a. Record the parent pointer name and inode number in the path
            object if no parent has been set for that level.

         b. If an ancestor has more than one parent, mark the path as
            corrupt.
            Repeat step 1a with the next subdirectory parent pointer.

         c. Repeat steps 1a3-1a6 for the ancestor identified in step
            1a6a.
            This repeats until the directory tree root is reached or no
            parents are found.

      7. If the walk terminates at the root directory, mark the path as
         ok.

      8. If the walk terminates without reaching the root, mark the path
         as disconnected.

2. If the directory entry update hook triggers, check all paths already
   found by the scan.
   If the entry matches part of a path, mark that path and the scan
   stale.
   When the scanner thread sees that the scan has been marked stale, it
   deletes all scan data and starts over.
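To make the path states in steps 1a3-1a8 concrete, here is a compressed
sketch of one upward walk.
The types, names, and the single-parent simplification are inventions of
this example; the kernel keeps one path object per parent pointer and
treats multi-parent ancestors (step 1a6b) as corrupt paths:

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Possible fates of one path object. */
   enum path_state {
           PATH_OK,                /* walk reached the root (step 1a7) */
           PATH_DELETE,            /* parent is the scan target (1a3) */
           PATH_CYCLE,             /* revisited a directory (step 1a4) */
           PATH_DISCONNECTED,      /* ran out of parents (step 1a8) */
   };

   struct pdir {
           uint64_t        ino;
           struct pdir     *parent;        /* one alleged parent */
           bool            is_root;
   };

   static uint64_t path_mask;              /* toy per-path bitmap */

   static bool ino_test_and_set(uint64_t ino)
   {
           uint64_t bit = 1ULL << (ino % 64);
           bool was_set = path_mask & bit;

           path_mask |= bit;
           return was_set;
   }

   static enum path_state walk_one_path(struct pdir *target)
   {
           struct pdir *parent = target->parent;

           path_mask = 0;
           ino_test_and_set(target->ino);  /* step 1a1 */

           while (parent) {
                   if (parent->ino == target->ino)
                           return PATH_DELETE;     /* check 1a3 first */
                   if (ino_test_and_set(parent->ino))
                           return PATH_CYCLE;      /* step 1a4 */
                   if (parent->is_root)
                           return PATH_OK;         /* step 1a7 */
                   parent = parent->parent;        /* step 1a6c */
           }
           return PATH_DISCONNECTED;               /* step 1a8 */
   }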
Repairing the directory tree works as follows:

1. Walk each path of the target subdirectory.

   a. Corrupt paths and cycle paths are counted as suspect.

   b. Paths already marked for deletion are counted as bad.

   c. Paths that reached the root are counted as good.

2. If the subdirectory is either the root directory or has zero link
   count, delete all incoming directory entries in the immediate
   parents.
   Repairs are complete.

3. If the subdirectory has exactly one path, set the dotdot entry to the
   parent and exit.

4. If the subdirectory has at least one good path, delete all the other
   incoming directory entries in the immediate parents.

5. If the subdirectory has no good paths and more than one suspect path,
   delete all the other incoming directory entries in the immediate
   parents.

6. If the subdirectory has zero paths, attach it to the lost and found.
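The rules above reduce to a small decision function over the path
tallies.
This is a sketch only; the enum, function, and parameter names are
invented here, not taken from the repair code:

.. code-block:: c

   #include <stdbool.h>

   enum repair_action {
           DROP_ALL_PARENTS,       /* root or zero link count: sever dirents */
           SET_DOTDOT,             /* exactly one path: fix the dotdot entry */
           KEEP_GOOD_PATHS,        /* keep the good paths, prune the rest */
           KEEP_ONE_SUSPECT,       /* no good paths: keep a single suspect */
           ADOPT,                  /* no paths at all: move to lost+found */
           NO_ACTION,
   };

   static enum repair_action
   choose_repair(bool is_root, unsigned int nlink, unsigned int n_good,
                 unsigned int n_suspect, unsigned int n_bad)
   {
           unsigned int n_paths = n_good + n_suspect + n_bad;

           if (is_root || nlink == 0)
                   return DROP_ALL_PARENTS;        /* step 2 */
           if (n_paths == 1)
                   return SET_DOTDOT;              /* step 3 */
           if (n_good > 0)
                   return KEEP_GOOD_PATHS;         /* step 4 */
           if (n_suspect > 1)
                   return KEEP_ONE_SUSPECT;        /* step 5 */
           if (n_paths == 0)
                   return ADOPT;                   /* step 6 */
           return NO_ACTION;
   }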
The proposed patches are in the
`directory tree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
series.

.. _orphanage:

The Orphanage
-------------

Filesystems present files as a directed, and hopefully acyclic, graph.
In other words, a tree.
The root of the filesystem is a directory, and each entry in a directory
points downwards either to more subdirectories or to non-directory
files.
Unfortunately, a disruption in the directory graph pointers results in a
disconnected graph, which makes files impossible to access via regular
path resolution.

Without parent pointers, the directory parent pointer online scrub code
can detect a dotdot entry pointing to a parent directory that doesn't
have a link back to the child directory, and the file link count checker
can detect a file that isn't pointed to by any directory in the
filesystem.
If such a file has a positive link count, the file is an orphan.

With parent pointers, directories can be rebuilt by scanning parent
pointers and parent pointers can be rebuilt by scanning directories.
This should reduce the incidence of files ending up in ``/lost+found``.

When orphans are found, they should be reconnected to the directory
tree.
Offline fsck solves the problem by creating a directory ``/lost+found``
to serve as an orphanage, and linking orphan files into the orphanage by
using the inumber as the name.
Reparenting a file to the orphanage does not reset any of its
permissions or ACLs.
This process is more involved in the kernel than it is in userspace.
The directory and file link count repair setup functions must use the
regular VFS mechanisms to create the orphanage directory with all the
necessary security attributes and dentry cache entries, just like a
regular directory tree modification.

Orphaned files are adopted by the orphanage as follows (see the sketch
below):

1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup
   function to try to ensure that the lost and found directory actually
   exists.
   This also attaches the orphanage directory to the scrub context.

2. If the decision is made to reconnect a file, take the IOLOCK of both
   the orphanage and the file being reattached.
   The ``xrep_orphanage_iolock_two`` function follows the inode locking
   strategy discussed earlier.

3. Use ``xrep_adoption_trans_alloc`` to reserve resources to the repair
   transaction.

4. Call ``xrep_orphanage_compute_name`` to compute the new name in the
   orphanage.

5. If the adoption is going to happen, call ``xrep_adoption_reparent``
   to reparent the orphaned file into the lost and found and invalidate
   the dentry cache.

6. Call ``xrep_adoption_finish`` to commit any filesystem updates,
   release the orphanage ILOCK, and clean the scrub transaction.
   Call ``xrep_adoption_commit`` to commit the updates and the scrub
   transaction.

7. If a runtime error happens, call ``xrep_adoption_cancel`` to release
   all resources.

The proposed patches are in the
`orphanage adoption
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
series.
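The adoption sequence above follows the usual kernel pattern of forward
progress with a single error unwind label.
Only the function names in this sketch come from the steps above; the
signatures, the use of ``struct xfs_scrub`` and ``struct xfs_name``
here, and the error handling style are assumptions of this sketch, not
the actual repair code:

.. code-block:: c

   /* Minimal sketch of steps 2-7; signatures are illustrative only. */
   static int
   xrep_adopt_orphan(
           struct xfs_scrub        *sc)
   {
           struct xfs_name         xname;
           int                     error;

           /* Steps 2-3: lock both inodes, reserve transaction space. */
           error = xrep_orphanage_iolock_two(sc);
           if (error)
                   return error;
           error = xrep_adoption_trans_alloc(sc);
           if (error)
                   goto out_cancel;

           /* Step 4: pick the inumber-based name in the orphanage. */
           error = xrep_orphanage_compute_name(sc, &xname);
           if (error)
                   goto out_cancel;

           /* Step 5: reparent the file and drop stale dentries. */
           error = xrep_adoption_reparent(sc, &xname);
           if (error)
                   goto out_cancel;

           /* Step 6: commit the updates and the scrub transaction. */
           error = xrep_adoption_finish(sc);
           if (error)
                   goto out_cancel;
           return xrep_adoption_commit(sc);

   out_cancel:
           /* Step 7: unwind everything on a runtime error. */
           xrep_adoption_cancel(sc);
           return error;
   }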
6. Userspace Algorithms and Data Structures
===========================================

This section discusses the key algorithms and data structures of the
userspace program, ``xfs_scrub``, that provide the ability to drive
metadata checks and repairs in the kernel, verify file data, and look
for other potential problems.

.. _scrubcheck:

Checking Metadata
-----------------

Recall the :ref:`phases of fsck work <scrubphases>` outlined earlier.
That structure follows naturally from the data dependencies designed
into the filesystem from its beginnings in 1993.
In XFS, there are several groups of metadata dependencies:

a. Filesystem summary counts depend on consistency within the inode
   indices, the allocation group space btrees, and the realtime volume
   space information.

b. Quota resource counts depend on consistency within the quota file
   data forks, inode indices, inode records, and the forks of every
   file on the system.

c. The naming hierarchy depends on consistency within the directory and
   extended attribute structures.
   This includes file link counts.

d. Directories, extended attributes, and file data depend on
   consistency within the file forks that map directory and extended
   attribute data to physical storage media.

e. The file forks depend on consistency within inode records and the
   space metadata indices of the allocation groups and the realtime
   volume.
   This includes quota and realtime metadata files.

f. Inode records depend on consistency within the inode metadata
   indices.

g. Realtime space metadata depend on the inode records and data forks
   of the realtime metadata inodes.

h. The allocation group metadata indices (free space, inodes, reference
   count, and reverse mapping btrees) depend on consistency within the
   AG headers and between all the AG metadata btrees.

i. ``xfs_scrub`` depends on the filesystem being mounted and kernel
   support for online fsck functionality.

Therefore, a metadata dependency graph is a convenient way to schedule
checking operations in the ``xfs_scrub`` program:
h]h)}(hPhase 1 checks that the provided path maps to an XFS filesystem and detect the kernel's scrubbing abilities, which validates group (i).h]hPhase 1 checks that the provided path maps to an XFS filesystem and detect the kernel’s scrubbing abilities, which validates group (i).}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj ubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hJPhase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue. h]h)}(hIPhase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.h]hIPhase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.}(hj'hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj#ubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hgPhase 3 scans inodes in parallel. For each inode, groups (f), (e), and (d) are checked, in that order. h]h)}(hfPhase 3 scans inodes in parallel. For each inode, groups (f), (e), and (d) are checked, in that order.h]hfPhase 3 scans inodes in parallel. For each inode, groups (f), (e), and (d) are checked, in that order.}(hj?hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj;ubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(h^Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6 may run reliably. h]h)}(h]Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6 may run reliably.h]h]Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6 may run reliably.}(hjWhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjSubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(h^Phase 5 starts by checking groups (b) and (c) in parallel before moving on to checking names. h]h)}(h]Phase 5 starts by checking groups (b) and (c) in parallel before moving on to checking names.h]h]Phase 5 starts by checking groups (b) and (c) in parallel before moving on to checking names.}(hjohhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjkubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hPhase 6 depends on groups (i) through (b) to find file data blocks to verify, to read them, and to report which blocks of which files are affected. h]h)}(hPhase 6 depends on groups (i) through (b) to find file data blocks to verify, to read them, and to report which blocks of which files are affected.h]hPhase 6 depends on groups (i) through (b) to find file data blocks to verify, to read them, and to report which blocks of which files are affected.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(h`_ and the `inode scan rebalance `_ series.h](h%The proposed patchsets are the scrub }(hjThhhNhNubj)}(h`performance tweaks `_h]hperformance tweaks}(hj\hhhNhNubah}(h]h ]h"]h$]h&]nameperformance tweaksjjghttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaksuh1jhjTubh)}(hj h]h}(h]performance-tweaksah ]h"]performance tweaksah$]h&]refurijluh1hjyKhjTubh and the }(hjThhhNhNubj)}(h~`inode scan rebalance `_h]hinode scan rebalance}(hj~hhhNhNubah}(h]h ]h"]h$]h&]nameinode scan rebalancejjdhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalanceuh1jhjTubh)}(hg h]h}(h]inode-scan-rebalanceah ]h"]inode scan rebalanceah$]h&]refurijuh1hjyKhjTubh series.}(hjThhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjїhhubh)}(h.. 
.. _scrubrepair:

Scheduling Repairs
------------------

During phase 2, corruptions and inconsistencies reported in any AGI
header or inode btree are repaired immediately, because phase 3 relies
on proper functioning of the inode indices to find inodes to scan.
Failed repairs are rescheduled to phase 4.
Problems reported in any other space metadata are deferred to phase 4.
Optimization opportunities are always deferred to phase 4, no matter
their origin.

During phase 3, corruptions and inconsistencies reported in any part of
a file's metadata are repaired immediately if all space metadata were
validated during phase 2.
Repairs that fail or cannot be repaired immediately are scheduled for
phase 4.

In the original design of ``xfs_scrub``, it was thought that repairs
would be so infrequent that the ``struct xfs_scrub_metadata`` objects
used to communicate with the kernel could also be used as the primary
object to schedule repairs.
With recent increases in the number of optimizations possible for a
given filesystem object, it became much more memory-efficient to track
all eligible repairs for a given filesystem object with a single repair
item.
Each repair item represents a single lockable object -- AGs, metadata
files, individual inodes, or a class of summary information.

Phase 4 is responsible for scheduling a lot of repair work in as quick
a manner as is practical.
The :ref:`data dependencies <scrubcheck>` outlined earlier still apply,
which means that ``xfs_scrub`` must try to complete the repair work
scheduled by phase 2 before trying repair work scheduled by phase 3.
The repair process is as follows:

1. Start a round of repair with a workqueue and enough workers to keep
   the CPUs as busy as the user desires.

   a. For each repair item queued by phase 2,

      i. Ask the kernel to repair everything listed in the repair item
         for a given filesystem object.

      ii. Make a note if the kernel made any progress in reducing the
          number of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all
           metadata associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   b. If any repairs were made, jump back to 1a to retry all the phase 2
      items.

   c. For each repair item queued by phase 3,

      i. Ask the kernel to repair everything listed in the repair item
         for a given filesystem object.

      ii. Make a note if the kernel made any progress in reducing the
          number of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all
           metadata associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   d. If any repairs were made, jump back to 1c to retry all the phase 3
      items.

2. If step 1 made any repair progress of any kind, jump back to step 1
   to start another round of repair.

3. If there are items left to repair, run them all serially one more
   time.
   Complain if the repairs were not successful, since this is the last
   chance to repair anything.
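The control flow above is a fixed-point iteration: keep retrying while
any round makes progress, then make one final serial pass.
A condensed sketch, in which ``struct item_queue`` and the two helper
functions are stand-ins for ``xfs_scrub``'s internals rather than real
interfaces:

.. code-block:: c

   #include <stdbool.h>

   struct item_queue;

   /* One pass over a queue; returns true if anything improved. */
   bool repair_queue_pass(struct item_queue *q);

   /* Final serial pass that complains about persistent failures. */
   void repair_serially(struct item_queue *q);

   static bool repair_round(struct item_queue *ph2, struct item_queue *ph3)
   {
           bool progress = false;

           /* Steps 1a-1b: retry phase 2 items until no improvement. */
           while (repair_queue_pass(ph2))
                   progress = true;

           /* Steps 1c-1d: then do the same for the phase 3 items. */
           while (repair_queue_pass(ph3))
                   progress = true;

           return progress;
   }

   static void phase4_repair(struct item_queue *ph2, struct item_queue *ph3)
   {
           /* Step 2: run whole rounds until one makes no progress. */
           while (repair_round(ph2, ph3))
                   ;

           /* Step 3: last chance -- run the leftovers serially. */
           repair_serially(ph2);
           repair_serially(ph3);
   }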
h]h)}(heIf step 1 made any repair progress of any kind, jump back to step 1 to start another round of repair.h]heIf step 1 made any repair progress of any kind, jump back to step 1 to start another round of repair.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM2hjubah}(h]h ]h"]h$]h&]uh1hhj[hhhhhNubh)}(hIf there are items left to repair, run them all serially one more time. Complain if the repairs were not successful, since this is the last chance to repair anything. h]h)}(hIf there are items left to repair, run them all serially one more time. Complain if the repairs were not successful, since this is the last chance to repair anything.h]hIf there are items left to repair, run them all serially one more time. Complain if the repairs were not successful, since this is the last chance to repair anything.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM5hjubah}(h]h ]h"]h$]h&]uh1hhj[hhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjhhhhhMubh)}(hCorruptions and inconsistencies encountered during phases 5 and 7 are repaired immediately. Corrupt file data blocks reported by phase 6 cannot be recovered by the filesystem.h]hCorruptions and inconsistencies encountered during phases 5 and 7 are repaired immediately. Corrupt file data blocks reported by phase 6 cannot be recovered by the filesystem.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM9hjhhubh)}(hXZThe proposed patchsets are the `repair warning improvements `_, refactoring of the `repair data dependency `_ and `object tracking `_, and the `repair scheduling `_ improvement series.h](hThe proposed patchsets are the }(hjŚhhhNhNubj)}(h`repair warning improvements `_h]hrepair warning improvements}(hj͚hhhNhNubah}(h]h ]h"]h$]h&]namerepair warning improvementsjjkhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warningsuh1jhjŚubh)}(hn h]h}(h]repair-warning-improvementsah ]h"]repair warning improvementsah$]h&]refurijݚuh1hjyKhjŚubh, refactoring of the }(hjŚhhhNhNubj)}(h`repair data dependency `_h]hrepair data dependency}(hjhhhNhNubah}(h]h ]h"]h$]h&]namerepair data dependencyjjehttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-depsuh1jhjŚubh)}(hh h]h}(h]repair-data-dependencyah ]h"]repair data dependencyah$]h&]refurijuh1hjyKhjŚubh and }(hjŚhhhNhNubj)}(hy`object tracking `_h]hobject tracking}(hjhhhNhNubah}(h]h ]h"]h$]h&]nameobject trackingjjdhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-trackinguh1jhjŚubh)}(hg h]h}(h]object-trackingah ]h"]object trackingah$]h&]refurij!uh1hjyKhjŚubh , and the }(hjŚhhhNhNubj)}(h}`repair scheduling `_h]hrepair scheduling}(hj3hhhNhNubah}(h]h ]h"]h$]h&]namerepair schedulingjjfhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-schedulinguh1jhjŚubh)}(hi h]h}(h]repair-schedulingah ]h"]repair schedulingah$]h&]refurijCuh1hjyKhjŚubh improvement series.}(hjŚhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM>hjhhubeh}(h](jjeh ]h"](scheduling repairs scrubrepaireh$]h&]uh1hhjzhhhhhMj}j`jsj}jjsubh)}(hhh](h)}(h/Checking Names for Confusable Unicode Sequencesh]h/Checking Names for Confusable Unicode Sequences}(hjhhhhNhNubah}(h]h ]h"]h$]h&]jjuh1hhjehhhhhMMubh)}(hXxIf ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of phase 4, it moves on to phase 5, which checks for suspicious looking names in the filesystem. These names consist of the filesystem label, names in directory entries, and the names of extended attributes. 
Media Verification of File Data Extents
---------------------------------------

The system administrator can elect to initiate a media scan of all file data
blocks.
This scan runs as phase 6, after validation of all filesystem metadata
(except for the summary counters).
The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space
map to find areas that are allocated to file data fork extents.
Gaps between data fork extents that are smaller than 64k are treated as if
they were data fork extents to reduce the command setup overhead.
When the space map scan accumulates a region larger than 32MB, a media
verification request is sent to the disk as a directio read of the raw block
device.

If the verification read fails, ``xfs_scrub`` retries with single-block reads
to narrow down the failure to the specific region of the media, and records
the error.
When it has finished issuing verification requests, it again uses the space
mapping ioctl to map the recorded media errors back to metadata structures
and report what has been lost.
For media errors in blocks owned by files, parent pointers can be used to
construct file paths from inode numbers for user-friendly reporting.
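The space map walk that drives the media scan can be sketched with the
existing interface.
This is a minimal illustration, assuming the definitions from
``linux/fsmap.h``; the 64k gap and 32MB coalescing policies described above,
and all batching of the reads themselves, are omitted:

.. code-block:: c

	#include <stdlib.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/fsmap.h>

	/* Call fn on each space map record that maps file data. */
	static int
	walk_file_data(int fd, void (*fn)(const struct fsmap *))
	{
		struct fsmap_head	*head;
		unsigned int		i;

		head = calloc(1, fsmap_sizeof(128));
		if (!head)
			return -1;

		/* Query everything: low key is zero, high key is all ones. */
		memset(&head->fmh_keys[1], 0xFF, sizeof(struct fsmap));
		head->fmh_count = 128;

		for (;;) {
			if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0)
				break;
			if (!head->fmh_entries)
				break;
			for (i = 0; i < head->fmh_entries; i++) {
				struct fsmap *rec = &head->fmh_recs[i];

				/* Skip free space, metadata, attr fork data. */
				if (!(rec->fmr_flags & (FMR_OF_SPECIAL_OWNER |
							FMR_OF_EXTENT_MAP |
							FMR_OF_ATTR_FORK)))
					fn(rec);
			}
			if (head->fmh_recs[head->fmh_entries - 1].fmr_flags &
			    FMR_OF_LAST)
				break;
			fsmap_advance(head);
		}

		free(head);
		return 0;
	}

A real scanner would merge the returned extents into large regions, issue
direct reads of the block device, and record any regions whose reads fail.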
7. Conclusion and Future Work
=============================

It is hoped that the reader of this document has followed the designs laid
out in this document and now has some familiarity with how XFS performs
online rebuilding of its metadata indices, and how filesystem users can
interact with that functionality.
Although the scope of this work is daunting, it is hoped that this guide will
make it easier for code readers to understand what has been built, for whom
it has been built, and why.
Please feel free to contact the XFS mailing list with questions.

XFS_IOC_EXCHANGE_RANGE
----------------------

As discussed earlier, a second frontend to the atomic file mapping exchange
mechanism is a new ioctl call that userspace programs can use to commit
updates to files atomically.
This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard.
File Content Exchanges with Regular User Files
``````````````````````````````````````````````

As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
The earliest form of this was the fork swap mechanism, where the entire
contents of data forks could be exchanged between two files by exchanging the
raw bytes in each inode fork's immediate area.
When XFS v5 came along with self-describing metadata, this old mechanism grew
some log support to continue rewriting the owner fields of BMBT blocks during
log recovery.
When the reverse mapping btree was later added to XFS, the only way to
maintain the consistency of the fork mappings with the reverse mapping index
was to develop an iterative mechanism that used deferred bmap and rmap
operations to swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for
the new tracking items, because the atomic file mapping exchange mechanism is
an iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain.

Atomic file content exchanges are much more flexible than the existing
swapext implementations because they can guarantee that the caller never sees
a mix of old and new contents even after a crash, and they can operate on two
arbitrary file fork ranges.
The extra flexibility enables several new use cases:

- **Atomic commit of file writes**: A userspace process opens a file that it
  wants to update.
  Next, it opens a temporary file and calls the file clone operation to
  reflink the first file's contents into the temporary file.
  Writes to the original file should instead be written to the temporary
  file.
  Finally, the process calls the atomic file mapping exchange system call
  (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
  committing all of the updates to the original file, or none of them.
  A sketch of this sequence appears after this list.

.. _exchrange_if_unchanged:

- **Transactional file updates**: The same mechanism as above, but the caller
  only wants the commit to occur if the original file's contents have not
  changed.
  To make this happen, the calling process snapshots the file modification
  and change timestamps of the original file before reflinking its data to
  the temporary file.
  When the program is ready to commit the changes, it passes the timestamps
  into the kernel as arguments to the atomic file mapping exchange system
  call.
  The kernel only commits the changes if the provided timestamps match the
  original file.
  A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.

- **Emulation of atomic block device writes**: Export a block device with a
  logical sector size matching the filesystem block size to force all writes
  to be aligned to the filesystem block size.
  Stage all writes to a temporary file, and when that is complete, call the
  atomic file mapping exchange system call with a flag to indicate that holes
  in the temporary file should be ignored.
  This emulates an atomic device write in software, and can support arbitrary
  scattered writes.
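Here is a sketch of the first use case, assuming the
``XFS_IOC_EXCHANGE_RANGE`` definitions that were merged in Linux 6.10 and
ship in xfsprogs' ``xfs/xfs_fs.h``.
Treat the details, particularly the flag choice, as illustrative rather than
normative:

.. code-block:: c

	#define _GNU_SOURCE	/* O_TMPFILE */
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* FICLONE */
	#include <xfs/xfs_fs.h>		/* XFS_IOC_EXCHANGE_RANGE */

	/*
	 * Atomically commit an update to the file at @name: stage the new
	 * contents in an unlinked temporary file on the same filesystem,
	 * then exchange the mappings.  After a crash, the original file
	 * contains either all of the update or none of it.  Sketch only.
	 */
	static int
	commit_update(int dirfd, const char *name,
		      const void *buf, size_t len, off_t off)
	{
		struct xfs_exchange_range	args = { 0 };
		int				fd, tmpfd, ret = -1;

		fd = openat(dirfd, name, O_RDWR);
		tmpfd = openat(dirfd, ".", O_TMPFILE | O_RDWR, 0600);
		if (fd < 0 || tmpfd < 0)
			goto out;

		/* Share the old contents with the staging file... */
		if (ioctl(tmpfd, FICLONE, fd))
			goto out;

		/* ...redirect the update to the staging file... */
		if (pwrite(tmpfd, buf, len, off) != (ssize_t)len)
			goto out;

		/* ...then exchange the staging mappings into the original. */
		args.file1_fd = tmpfd;
		args.flags = XFS_EXCHANGE_RANGE_TO_EOF;
		ret = ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
	out:
		if (tmpfd >= 0)
			close(tmpfd);
		if (fd >= 0)
			close(fd);
		return ret;
	}

Note that the staging file must live on the same filesystem as the target,
which is why it is created relative to the target's directory.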
Vectorized Scrub
----------------

As it turns out, the :ref:`refactoring <scrubrepair>` of repair items
mentioned earlier was a catalyst for enabling a vectorized scrub system call.
Since 2018, the cost of making a kernel call has increased considerably on
some systems to mitigate the effects of speculative execution attacks.
This incentivizes program authors to make as few system calls as possible to
reduce the number of times an execution path crosses a security boundary.

With vectorized scrub, userspace pushes to the kernel the identity of a
filesystem object, a list of scrub types to run against that object, and a
simple representation of the data dependencies between the selected scrub
types.
The kernel executes as much of the caller's plan as it can until it hits a
dependency that cannot be satisfied due to a corruption, and tells userspace
how much was accomplished.
It is hoped that ``io_uring`` will pick up enough of this functionality that
online fsck can use that instead of adding a separate vectored scrub system
call to XFS.
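The shape of such an interface might look like the sketch below.
None of these names comes from the proposed uapi; they are hypothetical, and
exist only to illustrate the batching idea.
See the patchsets linked below for the real proposal:

.. code-block:: c

	#include <linux/types.h>

	/* Hypothetical vectorized scrub interface; all names are made up. */
	struct scrub_vec {
		__u32	sv_type;	/* scrub type for this subcommand */
		__u32	sv_flags;	/* in: repair?  out: corrupt? */
		__s32	sv_ret;		/* out: errno for this subcommand */
		__u32	sv_reserved;	/* must be zero */
	};

	struct scrub_vec_head {
		__u64	svh_ino;	/* inode to scrub, if applicable */
		__u32	svh_gen;	/* inode generation, if applicable */
		__u32	svh_agno;	/* AG number, if applicable */
		__u32	svh_nr;		/* number of elements in svh_vecs */
		__u32	svh_reserved;	/* must be zero */
		struct scrub_vec svh_vecs[];	/* subcommands and barriers */
	};

A single call could then check an inode's core, data fork, and extended
attributes, with a barrier between them expressing the dependency that the
fork scans are pointless if the inode core is corrupt; the kernel stops at
the barrier and reports per-vector results instead of bouncing userspace
across the syscall boundary after every check.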
The relevant patchsets are the
`kernel vectorized scrub
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
and
`userspace vectorized scrub
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
series.

Quality of Service Targets for Scrub
------------------------------------

One serious shortcoming of the online fsck code is that the amount of time
that it can spend in the kernel holding resource locks is basically
unbounded.
Userspace is allowed to send a fatal signal to the process, which will cause
``xfs_scrub`` to exit when it reaches a good stopping point, but there's no
way for userspace to provide a time budget to the kernel.
Given that the scrub codebase has helpers to detect fatal signals, it
shouldn't be too much work to allow userspace to specify a timeout for a
scrub/repair operation and abort the operation if it exceeds budget.
However, most repair functions have the property that once they begin to
touch ondisk metadata, the operation cannot be cancelled cleanly, after which
a QoS timeout is no longer useful.
Defragmenting Free Space
------------------------

Over the years, many XFS users have requested the creation of a program to
clear a portion of the physical storage underlying a filesystem so that it
becomes a contiguous chunk of free space.
Call this free space defragmenter ``clearspace`` for short.

The first piece the ``clearspace`` program needs is the ability to read the
reverse mapping index from userspace.
This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
The second piece it needs is a new fallocate mode
(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
maps it to a file.
Call this file the "space collector" file.
The third piece is the ability to force an online repair.

To clear all the metadata out of a portion of physical storage, clearspace
uses the new fallocate map-freespace call to map any free space in that
region to the space collector file.
Next, clearspace finds all metadata blocks in that region by way of
``GETFSMAP`` and issues forced repair requests on the data structure.
This often results in the metadata being rebuilt somewhere that is not being
cleared.
After each relocation, clearspace calls the "map free space" function again
to collect any newly freed space in the region being cleared.
To clear all the file data out of a portion of the physical storage,
clearspace uses the FSMAP information to find relevant file data blocks.
Having identified a good target, it uses the ``FICLONERANGE`` call on that
part of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not
being cleared, and uses ``FIDEDUPERANGE`` (or the
:ref:`atomic file content exchanges <exchrange_if_unchanged>` feature) to
change the target file's data extent mapping away from the area being
cleared.
When all other mappings have been moved, clearspace reflinks the space into
the space collector file so that it becomes unavailable.
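The freeze and remap steps can be sketched with the two existing ioctls.
Assumptions in this sketch: offsets and lengths are filesystem-block aligned,
and the caller has already copied the extent's bytes into the relocation
file, since deduplication refuses to remap ranges whose contents differ:

.. code-block:: c

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>	/* FICLONERANGE, FIDEDUPERANGE */

	/*
	 * Freeze @len bytes of @victim_fd at @off by cloning them into
	 * @frozen_fd, then remap the victim's blocks to the identical copy
	 * already staged at offset 0 of @copy_fd.  Sketch only.
	 */
	static int
	move_mapping(int victim_fd, __u64 off, __u64 len,
		     int frozen_fd, int copy_fd)
	{
		struct file_clone_range	clone = {
			.src_fd		= victim_fd,
			.src_offset	= off,
			.src_length	= len,
			.dest_offset	= 0,
		};
		union {
			struct file_dedupe_range r;
			char pad[sizeof(struct file_dedupe_range) +
				 sizeof(struct file_dedupe_range_info)];
		} d;

		/* Pin the old blocks so nobody can overwrite them in place. */
		if (ioctl(frozen_fd, FICLONERANGE, &clone))
			return -1;

		/* Point the victim's mapping at the staged copy instead. */
		memset(&d, 0, sizeof(d));
		d.r.src_offset = 0;
		d.r.src_length = len;
		d.r.dest_count = 1;
		d.r.info[0].dest_fd = victim_fd;
		d.r.info[0].dest_offset = off;
		if (ioctl(copy_fd, FIDEDUPERANGE, &d.r))
			return -1;
		return d.r.info[0].status == FILE_DEDUPE_RANGE_SAME ? 0 : -1;
	}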
There are further optimizations that could apply to the above algorithm.
To clear a piece of physical storage that has a high sharing factor, it is
strongly desirable to retain this sharing factor.
In fact, these extents should be moved first to maximize sharing factor after
the operation completes.
To make this work smoothly, clearspace needs a new ioctl
(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
With the refcount information exposed, clearspace can quickly find the
longest, most shared data extents in the filesystem, and target them first.
**Future Work Question**: How might the filesystem move inode chunks?

*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
that creates a new file with the old contents and then locklessly runs around
the filesystem updating directory entries.
The operation cannot complete if the filesystem goes down.
That problem isn't totally insurmountable: create an inode remapping table
hidden behind a jump label, and a log item that tracks the kernel walking the
filesystem to update directory entries.
The trouble is, the kernel can't do anything about open files, since it
cannot revoke them.

**Future Work Question**: Can static keys be used to minimize the cost of
supporting ``revoke()`` on XFS files?

*Answer*: Yes.
Until the first revocation, the bailout code need not be in the call path at
all.
The relevant patchsets are the
`kernel freespace defrag
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
and
`userspace freespace defrag
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
series.

Shrinking Filesystems
---------------------

Removing the end of the filesystem ought to be a simple matter of evacuating
the data and metadata at the end of the filesystem, and handing the freed
space to the shrink code.
That requires an evacuation of the space at end of the filesystem, which is a
use of free space defragmentation!