.. SPDX-License-Identifier: GPL-2.0

.. _xfs_online_fsck_design:

.. Mapping of heading styles within this document:
   Heading 1 uses "====" above and below
   Heading 2 uses "===="
   Heading 3 uses "----"
   Heading 4 uses "````"
   Heading 5 uses "^^^^"
   Heading 6 uses "~~~~"
   Heading 7 uses "...."

   Sections are manually numbered because apparently that's what everyone
   does in the kernel.

======================
XFS Online Fsck Design
======================

This document captures the design of the online filesystem check feature for
XFS.
The purpose of this document is threefold:

- To help kernel distributors understand exactly what the XFS online fsck
  feature is, and issues about which they should be aware.

- To help people reading the code to familiarize themselves with the relevant
  concepts and design points before they start digging into the code.

- To help developers maintaining the system by capturing the reasons
  supporting higher level decision making.

As the online fsck code is merged, the links in this document to topic
branches will be replaced with links to code.

This document is licensed under the terms of the GNU General Public License,
v2.
The primary author is Darrick J. Wong.

This design document is split into seven parts.
Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process
works and how it is tested to ensure correct functionality.
Part 4 discusses the user interface and the intended usage modes of the new
program.
Parts 5 and 6 show off the high level components and how they fit together,
and then present case studies of how each repair function actually works.
Part 7 sums up what has been discussed so far and speculates about what else
might be built atop online fsck.

.. contents:: Table of Contents
   :local:

1. What is a Filesystem Check?
==============================

A Unix filesystem has four main responsibilities:

- Provide a hierarchy of names through which application programs can
  associate arbitrary blobs of data for any length of time,

- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.

Metadata directly supporting these functions (e.g. files, directories, space
mappings) are sometimes called primary metadata.
Secondary metadata (e.g. reverse mapping and directory parent pointers)
support operations internal to the filesystem, such as internal consistency
checking and reorganization.
Summary metadata, as the name implies, condense information contained in
primary metadata for performance reasons.
The filesystem check (fsck) tool examines all the metadata in a filesystem
to look for errors.
In addition to looking for obvious metadata corruptions, fsck also
cross-references different types of metadata records with each other to look
for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
to correct any problems found.
As a word of caution -- the primary goal of most Linux fsck tools is to
restore the filesystem metadata to a consistent state, not to maximize the
data recovered.
That precedent will not be challenged here.

Filesystems of the 20th century generally lacked any redundancy in the ondisk
format, which means that fsck can only respond to errors by erasing files
until errors are no longer detected.
More recent filesystem designs contain enough redundancy in their metadata
that it is now possible to regenerate data structures when non-catastrophic
errors occur; this capability aids both strategies.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| System administrators avoid data loss by increasing the number of        |
| separate storage systems through the creation of backups; and they avoid |
| downtime by increasing the redundancy of each storage system through the |
| creation of RAID arrays.                                                 |
| fsck tools address only the first problem.                               |
+--------------------------------------------------------------------------+

TLDR; Show Me the Code!
-----------------------

Code is posted to the kernel.org git trees as follows:
`kernel changes
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
`userspace changes
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_,
and `QA test changes
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
Each kernel patchset adding an online repair function will use the same
branch name across the kernel, xfsprogs, and fstests git repos.

Existing Tools
--------------

The online fsck tool described here will be the third tool in the history of
XFS (on Linux) to check and repair filesystems.
Two programs precede it:

The first program, ``xfs_check``, was created as part of the XFS debugger
(``xfs_db``) and can only be used with unmounted filesystems.
It walks all metadata in the filesystem looking for inconsistencies in the
metadata, though it lacks any ability to repair what it finds.
Due to its high memory requirements and inability to repair things, this
program is now deprecated and will not be discussed further.

The second program, ``xfs_repair``, was created to be faster and more robust
than the first program.
Like its predecessor, it can only be used with unmounted filesystems.
It uses extent-based in-memory data structures to reduce memory consumption,
and tries to schedule readahead I/O appropriately to reduce I/O waiting time
while it scans the metadata of the entire filesystem.
The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and the directory tree by erasing things as
needed to eliminate problems.
Space usage metadata are rebuilt from the observed file metadata.
Problem Statement
-----------------

The current XFS tools leave several problems unsolved:

1. **User programs** suddenly **lose access** to the filesystem when
   unexpected shutdowns occur as a result of silent corruptions in the
   metadata.
   These occur **unpredictably** and often without warning.

2. **Users** experience a **total loss of service** during the recovery
   period after an **unexpected shutdown** occurs.

3. **Users** experience a **total loss of service** if the filesystem is
   taken offline to **look for problems** proactively.

4. **Data owners** cannot **check the integrity** of their stored data
   without reading all of it.
   This may expose them to substantial billing costs when a linear media scan
   performed by the storage system administrator might suffice.

5. **System administrators** cannot **schedule** a maintenance window to deal
   with corruptions if they **lack the means** to assess filesystem health
   while the filesystem is online.

6. **Fleet monitoring tools** cannot **automate periodic checks** of
   filesystem health when doing so requires **manual intervention** and
   downtime.

7. **Users** can be tricked into **doing things they do not desire** when
   malicious actors **exploit quirks of Unicode** to place misleading names
   in directories.

Given this definition of the problems to be solved and the actors who would
benefit, the proposed solution is a third fsck tool that acts on a running
filesystem.

This new third program has three components: an in-kernel facility to check
metadata, an in-kernel facility to repair metadata, and a userspace driver
program to drive fsck activity on a live filesystem.
``xfs_scrub`` is the name of the driver program.
The rest of this document presents the goals and use cases of the new fsck
tool, describes its major design points in connection to those goals, and
discusses the similarities and differences with existing tools.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| Throughout this document, the existing offline fsck tool can also be     |
| referred to by its current name "``xfs_repair``".                        |
| The userspace driver program for the new online fsck tool can be         |
| referred to as "``xfs_scrub``".                                          |
| The kernel portion of online fsck that validates metadata is called      |
| "online scrub", and the portion of the kernel that fixes metadata is     |
| called "online repair".                                                  |
+--------------------------------------------------------------------------+
The naming hierarchy is broken up into objects known as directories and files
and the physical space is split into pieces known as allocation groups.
Sharding enables better performance on highly parallel systems and helps to
contain the damage when corruptions occur.
The division of the filesystem into principal objects (allocation groups and
inodes) means that there are ample opportunities to perform targeted checks
and repairs on a subset of the filesystem.

While this is going on, other parts continue processing IO requests.
Even if a piece of filesystem metadata can only be regenerated by scanning
the entire system, the scan can still be done in the background while other
file operations continue.

In summary, online fsck takes advantage of resource sharding and redundant
metadata to enable targeted checking and repair operations while the system
is running.
This capability will be coupled to automatic system management so that
autonomous self-healing of XFS maximizes service availability.

2. Theory of Operation
======================

Because it is necessary for online fsck to lock and scan live metadata
objects, online fsck consists of three separate code components.
The first is the userspace driver program ``xfs_scrub``, which is responsible
for identifying individual metadata items, scheduling work items for them,
reacting to the outcomes appropriately, and reporting results to the system
administrator.
The second and third are in the kernel, which implements functions to check
and repair each type of online fsck work item.
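To make the driver/kernel split concrete, here is a minimal sketch of how a
driver program might submit a single work item to the kernel checking
facility through the metadata scrub ioctl.
It assumes the ``XFS_IOC_SCRUB_METADATA`` ioctl and the
``struct xfs_scrub_metadata`` control structure exported through
``xfs_fs.h``; the helper name and the exact header path are illustrative,
not part of the design.

.. code-block:: c

   #include <stdint.h>
   #include <sys/ioctl.h>
   #include <xfs/xfs.h>    /* scrub ABI: struct xfs_scrub_metadata et al. */

   /*
    * Ask the kernel to check the free space btree of allocation group
    * @agno.  @fd is any open file on the target filesystem.  Returns -1
    * if the scrub call itself failed, 1 if the btree is corrupt, and 0
    * if it checks out clean.
    */
   static int scrub_one_bnobt(int fd, uint32_t agno)
   {
           struct xfs_scrub_metadata sm = {
                   .sm_type = XFS_SCRUB_TYPE_BNOBT,
                   .sm_agno = agno,
           };

           if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
                   return -1;      /* errno says why the check didn't run */

           /* On return, sm_flags carries the kernel's verdict. */
           if (sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
                              XFS_SCRUB_OFLAG_XCORRUPT))
                   return 1;       /* corrupt; schedule a repair */

           return 0;               /* clean */
   }

A real driver repeats this pattern for every metadata object in the
filesystem, one scrub item at a time.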
However, both of these limitations are acceptable tradeoffs to satisfy the different motivations of online fsck, which are to }(hj.hhhNhNubj)}(h**minimize system downtime**h]hminimize system downtime}(hjJhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj.ubh and to }(hj.hhhNhNubj)}(h(**increase predictability of operation**h]h$increase predictability of operation}(hj\hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj.ubh.}(hj.hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhKhjhhubh)}(h.. _scrubphases:h]h}(h]h ]h"]h$]h&]j scrubphasesuh1hhMhjhhhhubeh}(h]jTah ]h"]scopeah$]h&]uh1hhjjhhhhhKubh)}(hhh](h)}(hPhases of Workh]hPhases of Work}(hjhhhNhNubah}(h]h ]h"]h$]h&]jjpuh1hhjhhhhhMubh)}(hXThe userspace driver program ``xfs_scrub`` splits the work of checking and repairing an entire filesystem into seven phases. Each phase concentrates on checking specific types of scrub items and depends on the success of all previous phases. The seven phases are as follows:h](hThe userspace driver program }(hjhhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh splits the work of checking and repairing an entire filesystem into seven phases. Each phase concentrates on checking specific types of scrub items and depends on the success of all previous phases. The seven phases are as follows:}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjhhubji)}(hhh](h)}(hCollect geometry information about the mounted filesystem and computer, discover the online fsck capabilities of the kernel, and open the underlying storage devices. h]h)}(hCollect geometry information about the mounted filesystem and computer, discover the online fsck capabilities of the kernel, and open the underlying storage devices.h]hCollect geometry information about the mounted filesystem and computer, discover the online fsck capabilities of the kernel, and open the underlying storage devices.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hX#Check allocation group metadata, all realtime volume metadata, and all quota files. Each metadata structure is scheduled as a separate scrub item. If corruption is found in the inode header or inode btree and ``xfs_scrub`` is permitted to perform repairs, then those scrub items are repaired to prepare for phase 3. Repairs are implemented by using the information in the scrub item to resubmit the kernel scrub call with the repair flag enabled; this is discussed in the next section. Optimizations and all other repairs are deferred to phase 4. h]h)}(hX"Check allocation group metadata, all realtime volume metadata, and all quota files. Each metadata structure is scheduled as a separate scrub item. If corruption is found in the inode header or inode btree and ``xfs_scrub`` is permitted to perform repairs, then those scrub items are repaired to prepare for phase 3. Repairs are implemented by using the information in the scrub item to resubmit the kernel scrub call with the repair flag enabled; this is discussed in the next section. Optimizations and all other repairs are deferred to phase 4.h](hCheck allocation group metadata, all realtime volume metadata, and all quota files. Each metadata structure is scheduled as a separate scrub item. If corruption is found in the inode header or inode btree and }(hjhhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubhXD is permitted to perform repairs, then those scrub items are repaired to prepare for phase 3. 
Repairs are implemented by using the information in the scrub item to resubmit the kernel scrub call with the repair flag enabled; this is discussed in the next section. Optimizations and all other repairs are deferred to phase 4.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hXyCheck all metadata of every file in the filesystem. Each metadata structure is also scheduled as a separate scrub item. If repairs are needed and ``xfs_scrub`` is permitted to perform repairs, and there were no problems detected during phase 2, then those scrub items are repaired immediately. Optimizations, deferred repairs, and unsuccessful repairs are deferred to phase 4. h]h)}(hXxCheck all metadata of every file in the filesystem. Each metadata structure is also scheduled as a separate scrub item. If repairs are needed and ``xfs_scrub`` is permitted to perform repairs, and there were no problems detected during phase 2, then those scrub items are repaired immediately. Optimizations, deferred repairs, and unsuccessful repairs are deferred to phase 4.h](hCheck all metadata of every file in the filesystem. Each metadata structure is also scheduled as a separate scrub item. If repairs are needed and }(hjhhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjubh is permitted to perform repairs, and there were no problems detected during phase 2, then those scrub items are repaired immediately. Optimizations, deferred repairs, and unsuccessful repairs are deferred to phase 4.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hX All remaining repairs and scheduled optimizations are performed during this phase, if the caller permits them. Before starting repairs, the summary counters are checked and any necessary repairs are performed so that subsequent repairs will not fail the resource reservation step due to wildly incorrect summary counters. Unsuccessful repairs are requeued as long as forward progress on repairs is made somewhere in the filesystem. Free space in the filesystem is trimmed at the end of phase 4 if the filesystem is clean. h]h)}(hX All remaining repairs and scheduled optimizations are performed during this phase, if the caller permits them. Before starting repairs, the summary counters are checked and any necessary repairs are performed so that subsequent repairs will not fail the resource reservation step due to wildly incorrect summary counters. Unsuccessful repairs are requeued as long as forward progress on repairs is made somewhere in the filesystem. Free space in the filesystem is trimmed at the end of phase 4 if the filesystem is clean.h]hX All remaining repairs and scheduled optimizations are performed during this phase, if the caller permits them. Before starting repairs, the summary counters are checked and any necessary repairs are performed so that subsequent repairs will not fail the resource reservation step due to wildly incorrect summary counters. Unsuccessful repairs are requeued as long as forward progress on repairs is made somewhere in the filesystem. Free space in the filesystem is trimmed at the end of phase 4 if the filesystem is clean.}(hj*hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM#hj&ubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hXcBy the start of this phase, all primary and secondary filesystem metadata must be correct. Summary counters such as the free space counts and quota resource counts are checked and corrected. 
   Directory entry names and extended attribute names are checked for
   suspicious entries such as control characters or confusing Unicode
   sequences appearing in names.

6. If the caller asks for a media scan, read all allocated and written data
   file extents in the filesystem.
   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
   If media errors occur, they will be mapped to the owning files and
   reported.

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.
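The fragment below sketches the resubmission mentioned in phase 2: when a
scrub item reports corruption, the same item is submitted again through the
``XFS_IOC_SCRUB_METADATA`` ioctl with the repair flag set.
The ioctl, structure, and flag names are from the XFS userspace API headers
shipped by xfsprogs; the helper function and the abbreviated error handling
are inventions of this sketch, not the actual ``xfs_scrub`` source.

.. code-block:: c

	#include <sys/ioctl.h>
	#include <xfs/xfs.h>

	/*
	 * Check one AG's free space btree; resubmit with the repair flag
	 * if corruption is found.  fd is any open file on the filesystem.
	 */
	static int check_and_repair_bnobt(int fd, __u32 agno)
	{
		struct xfs_scrub_metadata sm = {
			.sm_type = XFS_SCRUB_TYPE_BNOBT,
			.sm_agno = agno,
		};

		/* First pass: check only. */
		if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
			return -1;
		if (!(sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
			return 0;

		/* Resubmit the same scrub item with repairs enabled. */
		sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
		return ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm);
	}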
This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
later in this document.

Steps for Each Scrub Item
-------------------------

The kernel scrub code uses a three-step strategy for checking and repairing
the one aspect of a metadata object represented by a scrub item (outlined in
code after the list):

1. The scrub item of interest is checked for corruptions; opportunities for
   optimization; and for values that are directly controlled by the system
   administrator but look suspicious.
   If the item is not corrupt or does not need optimization, resources are
   released and the positive scan results are returned to userspace.
   If the item is corrupt or could be optimized but the caller does not
   permit this, resources are released and the negative scan results are
   returned to userspace.
   Otherwise, the kernel moves on to the second step.

2. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other
   metadata rather than try to salvage the existing structure.
   If the repair fails, the scan results from the first step are returned to
   userspace.
   Otherwise, the kernel moves on to the third step.

3. In the third step, the kernel runs the same checks over the new metadata
   item to assess the efficacy of the repairs.
   The results of the reassessment are returned to userspace.
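In pseudocode, the three steps form a check/repair/re-check sequence.
Every name in this outline is a hypothetical stand-in for the real entry
points in the kernel's ``fs/xfs/scrub/`` directory; the two helpers are
declared but deliberately left undefined.

.. code-block:: c

	#include <stdbool.h>

	struct scrub_context {
		bool corrupt;		/* set by the check step */
		bool repair_allowed;	/* caller passed the repair flag */
	};

	static int check_metadata(struct scrub_context *sc);	/* steps 1, 3 */
	static int repair_metadata(struct scrub_context *sc);	/* step 2 */

	int scrub_one_item(struct scrub_context *sc)
	{
		int error;

		/* Step 1: look for corruption or chances to optimize. */
		error = check_metadata(sc);
		if (error)
			return error;
		if (!sc->corrupt || !sc->repair_allowed)
			return 0;	/* report the scan results */

		/* Step 2: rebuild the structure from other metadata. */
		error = repair_metadata(sc);
		if (error)
			return 0;	/* return the step-1 results instead */

		/* Step 3: re-check the rebuilt item to assess the repair. */
		return check_metadata(sc);
	}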
Classification of Metadata
--------------------------

Each type of metadata object (and therefore each type of scrub item) is
classified as follows:

Primary Metadata
````````````````

Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or because they
index objects created by the user.
Most filesystem objects fall into this class:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

Scrub obeys the same rules as regular filesystem accesses for resource and
lock acquisition.

Primary metadata objects are the simplest for scrub to process.
The principal filesystem object (either an allocation group or an inode)
that owns the item being scrubbed is locked to guard against concurrent
updates.
The check function examines every record associated with the type for
obvious errors and cross-references healthy records against other metadata
to look for inconsistencies.
Repairs for this class of scrub item are simple, since the repair function
starts by holding all the resources acquired in the previous step.
The repair function scans available metadata as needed to record all the
observations needed to complete the structure.
Next, it stages the observations in a new ondisk structure and commits it
atomically to complete the repair.
Finally, the storage from the old data structure is carefully reaped.
Because ``xfs_scrub`` locks a primary object for the duration of the repair,
this is effectively an offline repair operation performed on a subset of the
filesystem.
This minimizes the complexity of the repair code because it is not necessary
to handle concurrent updates from other threads, nor is it necessary to
access any other part of the filesystem.
As a result, indexed structures can be rebuilt very quickly, and programs
trying to access the damaged structure will be blocked until repairs
complete.
The only infrastructure needed by the repair code is the staging area for
observations and a means to write new structures to disk.
Despite these limitations, the advantage that online repair holds is clear:
targeted work on individual shards of the filesystem avoids total loss of
service.

This mechanism is described in section 2.1 ("Off-Line Algorithm") of
V. Srinivasan and M. J. Carey,
`"Performance of On-Line Index Construction Algorithms"
<https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
*Extending Database Technology*, pp. 293-309, 1992.

Most primary metadata repair functions stage their intermediate results in
an in-memory array prior to formatting the new ondisk structure, which is
very similar to the list-based algorithm discussed in section 2.3
("List-Based Algorithms") of Srinivasan.
However, any data structure builder that maintains a resource lock for the
duration of the repair is *always* an offline algorithm.
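The lock-scan-stage-commit pattern described above can be summarized as
follows.
This is a structural outline only: every name is a hypothetical stand-in
for the real repair code in ``fs/xfs/scrub/``, and the helpers are declared
but not defined.

.. code-block:: c

	struct repair_ctx;

	void lock_owner(struct repair_ctx *rc);
	void unlock_owner(struct repair_ctx *rc);
	int scan_other_metadata(struct repair_ctx *rc);	/* record observations */
	void sort_observations(struct repair_ctx *rc);	/* in-memory array */
	int commit_new_structure(struct repair_ctx *rc);/* atomic switch-over */
	int reap_old_blocks(struct repair_ctx *rc);

	int repair_primary_index(struct repair_ctx *rc)
	{
		int error;

		/* The owning AG or inode stays locked for the whole repair. */
		lock_owner(rc);

		/* Collect every observation needed to rebuild the structure. */
		error = scan_other_metadata(rc);
		if (error)
			goto out_unlock;

		/* Stage, format, and atomically commit the new structure. */
		sort_observations(rc);
		error = commit_new_structure(rc);
		if (error)
			goto out_unlock;

		/* Carefully reap the storage of the old structure. */
		error = reap_old_blocks(rc);
	out_unlock:
		unlock_owner(rc);
		return error;
	}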
.. _secondary_metadata:

Secondary Metadata
``````````````````

Metadata structures in this category reflect records found in primary
metadata, but are only needed for online fsck or for reorganization of the
filesystem.

Secondary metadata include:

- Reverse mapping information

- Directory parent pointers

This class of metadata is difficult for scrub to process because scrub
attaches to the secondary object but needs to check primary metadata, which
runs counter to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild
the metadata.
Check functions can be limited in scope to reduce runtime.
Repairs, however, require a full scan of primary metadata, which can take a
long time to complete.
Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
duration of the repair.

Instead, repair functions set up an in-memory staging structure to store
observations.
Depending on the requirements of the specific repair function, the staging
index will either have the same format as the ondisk structure or a design
specific to that repair function.
The next step is to release all locks and start the filesystem scan.
When the repair scanner needs to record an observation, the staging data are
locked long enough to apply the update.
While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used
to write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure is carefully reaped.
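The ordering of staging, hooking, scanning, and the final atomic commit can
be outlined as follows.
As before, every name in this sketch is a hypothetical stand-in, declared
but not defined; it illustrates only the shape of the algorithm described
in the previous paragraph.

.. code-block:: c

	struct repair_ctx;

	void setup_staging(struct repair_ctx *rc);
	void free_staging(struct repair_ctx *rc);
	void enable_hooks(struct repair_ctx *rc);	/* capture live updates */
	void disable_hooks(struct repair_ctx *rc);
	int scan_primary_metadata(struct repair_ctx *rc);
	void lock_owner(struct repair_ctx *rc);
	void unlock_owner(struct repair_ctx *rc);
	int commit_new_structure(struct repair_ctx *rc);
	int reap_old_blocks(struct repair_ctx *rc);

	int repair_secondary_index(struct repair_ctx *rc)
	{
		int error;

		setup_staging(rc);

		/* Hook the filesystem so that concurrent updates reach the
		 * staging data while the scan runs without long-term locks. */
		enable_hooks(rc);
		error = scan_primary_metadata(rc);
		if (error)
			goto out_teardown;

		/* Re-lock the owner, then write and commit the new ondisk
		 * structure atomically from the staged live data. */
		lock_owner(rc);
		error = commit_new_structure(rc);
		unlock_owner(rc);

	out_teardown:
		disable_hooks(rc);
		free_staging(rc);
		if (!error)
			error = reap_old_blocks(rc);
		return error;
	}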
Introducing concurrency helps online repair avoid various locking problems,
but comes at a high cost to code complexity.
Live filesystem code has to be hooked so that the repair function can
observe updates in progress.
The staging area has to become a fully functional parallel structure so that
updates can be merged from the hooks.
Finally, the hook, the filesystem scan, and the inode locking model must be
sufficiently well integrated that a hook event can decide if a given update
should be applied to the staging structure.

In theory, the scrub implementation could apply these same techniques for
primary metadata, but doing so would make it massively more complex and less
performant.
Programs attempting to access the damaged structures are not blocked from
operation, which may cause application failure or an unplanned filesystem
shutdown.

Inspiration for the secondary metadata repair strategy was drawn from
section 2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without
Side-File") and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan,
`"Algorithms for Creating Indexes for Very Large Tables Without Quiescing
Updates" <https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.

The sidecar index mentioned above bears some resemblance to the side file
method mentioned in Srinivasan and Mohan.
Their method consists of an index builder that extracts relevant record data
to build the new structure as quickly as possible; and an auxiliary
structure that captures all updates that would be committed to the index by
other threads were the new index already online.
After the index building scan finishes, the updates recorded in the side
file are applied to the new index.
To avoid conflicts between the index builder and other writer threads, the
builder maintains a publicly visible cursor that tracks the progress of the
scan through the record space.
To avoid duplication of work between the side file and the index builder,
side file updates are elided when the record ID for the update is greater
than the cursor position within the record ID space; a sketch of this
elision rule follows.
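The sketch below is a self-contained model of the elision rule from
Srinivasan and Mohan, not XFS code: an update behind the published scan
cursor must be captured in the side file, while an update ahead of it will
be observed by the builder's own scan.

.. code-block:: c

	#include <stdint.h>

	struct index_builder {
		uint64_t scan_cursor;	/* highest record ID scanned so far */
	};

	struct side_file_entry {
		uint64_t record_id;
		uint64_t new_value;
	};

	/*
	 * Returns 1 if the update must be captured in the side file, or 0
	 * if it is elided because the index builder's scan has not yet
	 * reached this record and will pick it up later.
	 */
	int side_file_wants_update(const struct index_builder *ib,
				   const struct side_file_entry *update)
	{
		return update->record_id <= ib->scan_cursor;
	}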
To minimize changes to the rest of the codebase, XFS online repair keeps the
replacement index hidden until it's completely ready to go.
In other words, there is no attempt to expose the keyspace of the new index
while repair is running.
The complexity of such an approach would be very high and perhaps more
appropriate to building *new* indices.

**Future Work Question**: Can the full scan and live update code used to
facilitate a repair also be used to implement a comprehensive check?

*Answer*: In theory, yes.
Check would be much stronger if each scrub function employed these live
scans to build a shadow copy of the metadata and then compared the shadow
records to the ondisk records.
However, doing that is a fair amount more work than what the checking
functions do now.
The live scans and hooks were developed much later.
That in turn increases the runtime of those scrub functions.

Summary Information
```````````````````

Metadata structures in this last category summarize the contents of primary
metadata records.
These are often used to speed up resource usage queries, and are many times
smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses.
Graefe, }(hjhhhNhNubj)}(h`"Concurrent Queries and Updates in Summary Views and Their Indexes" `_h]hG“Concurrent Queries and Updates in Summary Views and Their Indexes”}(hjhhhNhNubah}(h]h ]h"]h$]h&]nameC"Concurrent Queries and Updates in Summary Views and Their Indexes"jjChttp://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdfuh1jhjubh)}(hF h]h}(h]Aconcurrent-queries-and-updates-in-summary-views-and-their-indexesah ]h"]C"concurrent queries and updates in summary views and their indexes"ah$]h&]refurijuh1hjyKhjubh, 2011.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj5hhubh)}(hXSince quotas are non-negative integer counts of resource usage, online quotacheck can use the incremental view deltas described in section 2.14 to track pending changes to the block and inode usage counts in each transaction, and commit those changes to a dquot side file when the transaction commits. Delta tracking is necessary for dquots because the index builder scans inodes, whereas the data structure being rebuilt is an index of dquots. Link count checking combines the view deltas and commit step into one because it sets attributes of the objects being scanned instead of writing them to a separate data structure. Each online fsck function will be discussed as case studies later in this document.h]hXSince quotas are non-negative integer counts of resource usage, online quotacheck can use the incremental view deltas described in section 2.14 to track pending changes to the block and inode usage counts in each transaction, and commit those changes to a dquot side file when the transaction commits. Delta tracking is necessary for dquots because the index builder scans inodes, whereas the data structure being rebuilt is an index of dquots. Link count checking combines the view deltas and commit step into one because it sets attributes of the objects being scanned instead of writing them to a separate data structure. Each online fsck function will be discussed as case studies later in this document.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM%hj5hhubeh}(h]jah ]h"]summary informationah$]h&]uh1hhjFhhhhhMubeh}(h]jah ]h"]classification of metadataah$]h&]uh1hhjjhhhhhM]ubh)}(hhh](h)}(hRisk Managementh]hRisk Management}(hjhhhNhNubah}(h]h ]h"]h$]h&]jjEuh1hhjhhhhhM2ubh)}(hDuring the development of online fsck, several risk factors were identified that may make the feature unsuitable for certain distributors and users. Steps can be taken to mitigate or eliminate those risks, though at a cost to functionality.h]hDuring the development of online fsck, several risk factors were identified that may make the feature unsuitable for certain distributors and users. Steps can be taken to mitigate or eliminate those risks, though at a cost to functionality.}(hj,hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM4hjhhubh)}(hhh](h)}(hX**Decreased performance**: Adding metadata indices to the filesystem increases the time cost of persisting changes to disk, and the reverse space mapping and directory parent pointers are no exception. System administrators who require the maximum performance can disable the reverse mapping features at format time, though this choice dramatically reduces the ability of online fsck to find inconsistencies and repair them. h]h)}(hX**Decreased performance**: Adding metadata indices to the filesystem increases the time cost of persisting changes to disk, and the reverse space mapping and directory parent pointers are no exception. 
Risk Management
---------------

During the development of online fsck, several risk factors were identified
that may make the feature unsuitable for certain distributors and users.
Steps can be taken to mitigate or eliminate those risks, though at a cost to
functionality.

- **Decreased performance**: Adding metadata indices to the filesystem
  increases the time cost of persisting changes to disk, and the reverse
  space mapping and directory parent pointers are no exception.
  System administrators who require the maximum performance can disable the
  reverse mapping features at format time, though this choice dramatically
  reduces the ability of online fsck to find inconsistencies and repair
  them.

- **Incorrect repairs**: As with all software, there might be defects in
  the software that result in incorrect repairs being written to the
  filesystem.
  Systematic fuzz testing (detailed in the next section) is employed by the
  authors to find bugs early, but it might not catch everything.
  The kernel build system provides Kconfig options
  (``CONFIG_XFS_ONLINE_SCRUB`` and ``CONFIG_XFS_ONLINE_REPAIR``) to enable
  distributors to choose not to accept this risk.
  The xfsprogs build system has a configure option (``--enable-scrub=no``)
  that disables building of the ``xfs_scrub`` binary, though this is not a
  risk mitigation if the kernel functionality remains enabled.
- **Inability to repair**: Sometimes, a filesystem is too badly damaged to
  be repairable.
  If the keyspaces of several metadata indices overlap in some manner but a
  coherent narrative cannot be formed from records collected, then the
  repair fails.
  To reduce the chance that a repair will fail with a dirty transaction and
  render the filesystem unusable, the online repair functions have been
  designed to stage and validate all new records before committing the new
  structure.

- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
  devices, opening files by handle, ignoring Unix discretionary access
  control, and the ability to perform administrative changes.
  Running this automatically in the background scares people, so the
  systemd background service is configured to run with only the privileges
  required.
  Obviously, this cannot address certain problems like the kernel crashing
  or deadlocking, but it should be sufficient to prevent the scrub process
  from escaping and reconfiguring the system.
  The cron job does not have this protection.
- **Fuzz Kiddiez**: There are many people now who seem to think that
  running automated fuzz testing of ondisk artifacts to find mischievous
  behavior and spraying exploit code onto the public mailing list for
  instant zero-day disclosure is somehow of some social benefit.
  In the view of this author, the benefit is realized only when the fuzz
  operators help to **fix** the flaws, but this opinion apparently is not
  widely shared among security "researchers".
  The XFS maintainers' continuing ability to manage these events presents
  an ongoing risk to the stability of the development process.
  Automated testing should front-load some of the risk while the feature is
  considered EXPERIMENTAL.

Many of these risks are inherent to software programming.
Despite this, it is hoped that this new functionality will prove useful in
reducing unexpected downtime.

3. Testing Plan
===============

As stated before, fsck tools have three main goals:

1. Detect inconsistencies in the metadata;

2. Eliminate those inconsistencies; and
3. Minimize further loss of data.

Demonstrations of correct operation are necessary to build users' confidence
that the software behaves within expectations.
Unfortunately, it was not really feasible to perform regular exhaustive
testing of every aspect of a fsck tool until the introduction of low-cost
virtual machines with high-IOPS storage.
With ample hardware availability in mind, the testing strategy for the
online fsck project involves differential analysis against the existing fsck
tools and systematic testing of every attribute of every type of metadata
object.
Testing can be split into four major categories, as discussed below.

Integrated Testing with fstests
-------------------------------

The primary goal of any free software QA effort is to make testing as
inexpensive and widespread as possible to maximize the scaling advantages of
community.
In other words, testing should maximize the breadth of filesystem
configuration scenarios and hardware setups.
This improves code quality by enabling the authors of online fsck to find
and fix bugs early, and helps developers of new features to find integration
issues earlier in their development effort.

The Linux filesystem community shares a common QA testing suite,
`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
functional and regression testing.
Even before development work began on online fsck, fstests (when run on XFS)
would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test
and scratch filesystems between each test.
This provides a level of assurance that the kernel and the fsck tools stay
in alignment about what constitutes consistent metadata.
During development of the online checking code, fstests was modified to run
``xfs_scrub -n`` between each test to ensure that the new checking code
produces the same results as the two existing fsck tools.

To start development of online repair, fstests was modified to run
``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
This ensures that offline repair does not crash, leave a corrupt filesystem
after it exits, or trigger complaints from the online check.
This also established a baseline for what can and cannot be repaired
offline.
To complete the first phase of development of online repair, fstests was
modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
This enables a comparison of the effectiveness of online repair as compared
to the existing offline repair tools.

General Fuzz Testing of Metadata Blocks
---------------------------------------

XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.

Before development of online fsck even began, a set of fstests was created
to test the rather common fault that entire metadata blocks get corrupted.
This required the creation of fstests library code that can create a
filesystem containing every possible type of metadata object.
Next, individual test cases were created to create a test filesystem,
identify a single block of a specific type of metadata object, trash it with
the existing ``blocktrash`` command in ``xfs_db``, and test the reaction of
a particular metadata validation strategy.

This earlier test suite enabled XFS developers to test the ability of the
in-kernel validation functions, and of the offline fsck tool, to detect and
eliminate inconsistent metadata.
This part of the test suite was extended to cover online fsck in exactly the
same manner.

In other words, for a given fstests filesystem configuration:

* For each metadata object existing on the filesystem:

  * Write garbage to it

  * Test the reactions of:

    1. The kernel verifiers to stop obviously bad metadata

    2. Offline repair (``xfs_repair``) to detect and fix

    3. Online repair (``xfs_scrub``) to detect and fix

Targeted Fuzz Testing of Metadata Records
-----------------------------------------

The testing plan for online fsck includes extending the existing fs testing
infrastructure to provide a much more powerful facility: targeted fuzz
testing of every metadata field of every metadata object in the filesystem.
``xfs_db`` can modify every field of every metadata structure in every block
in the filesystem to simulate the effects of memory corruption and software
bugs.
Given that fstests already contains the ability to create a filesystem
containing every metadata format known to the filesystem, ``xfs_db`` can be
used to perform exhaustive fuzz testing!

For a given fstests filesystem configuration:

* For each metadata object existing on the filesystem...

  * For each record inside that metadata object...

    * For each field inside that record...

      * For each conceivable type of transformation that can be applied to
        a bit field (the transformations are sketched after this list)...

        1. Clear all bits

        2. Set all bits

        3. Toggle the most significant bit

        4. Toggle the middle bit

        5. Toggle the least significant bit

        6. Add a small quantity

        7. Subtract a small quantity

        8. Randomize the contents

      * ...test the reactions of:

        1. The kernel verifiers to stop obviously bad metadata

        2. Offline checking (``xfs_repair -n``)
        3. Offline repair (``xfs_repair``)

        4. Online checking (``xfs_scrub -n``)

        5. Online repair (``xfs_scrub``)

        6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if
           online repair doesn't succeed)
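The eight transformations listed above can be expressed compactly in C.
This self-contained helper is hypothetical and exists only to make the list
concrete; the actual fuzzing is driven through ``xfs_db``, not through code
like this.

.. code-block:: c

	#include <stdint.h>
	#include <stdlib.h>

	/* Apply transformation 'method' (1-8) to a field of 'width' bits.
	 * Assumes 1 <= width <= 64; rand() is used unseeded for brevity. */
	uint64_t fuzz_field(uint64_t value, unsigned int width, int method)
	{
		uint64_t mask = (width >= 64) ? ~0ULL : (1ULL << width) - 1;

		switch (method) {
		case 1: return 0;				/* clear all bits */
		case 2: return mask;				/* set all bits */
		case 3: return value ^ (1ULL << (width - 1));	/* flip MSB */
		case 4: return value ^ (1ULL << (width / 2));	/* flip middle */
		case 5: return value ^ 1;			/* flip LSB */
		case 6: return (value + 3) & mask;		/* add a little */
		case 7: return (value - 3) & mask;		/* subtract a little */
		case 8: return (((uint64_t)rand() << 32) ^ rand()) & mask;
		}
		return value;
	}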
Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) h](h)}(h...test the reactions of:h]h...test the reactions of:}(hju"hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjq"ubji)}(hhh](h)}(h3The kernel verifiers to stop obviously bad metadatah]h)}(hj"h]h3The kernel verifiers to stop obviously bad metadata}(hj"hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj"ubah}(h]h ]h"]h$]h&]uh1hhj"ubh)}(h$Offline checking (``xfs_repair -n``)h]h)}(hj"h](hOffline checking (}(hj"hhhNhNubj)}(h``xfs_repair -n``h]h xfs_repair -n}(hj"hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj"ubh)}(hj"hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj"ubah}(h]h ]h"]h$]h&]uh1hhj"ubh)}(hOffline repair (``xfs_repair``)h]h)}(hj"h](hOffline repair (}(hj"hhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hj"hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj"ubh)}(hj"hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj"ubah}(h]h ]h"]h$]h&]uh1hhj"ubh)}(h"Online checking (``xfs_scrub -n``)h]h)}(hj"h](hOnline checking (}(hj"hhhNhNubj)}(h``xfs_scrub -n``h]h xfs_scrub -n}(hj"hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj"ubh)}(hj"hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj"ubah}(h]h ]h"]h$]h&]uh1hhj"ubh)}(hOnline repair (``xfs_scrub``)h]h)}(hj#h](hOnline repair (}(hj#hhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hj##hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#ubh)}(hj#hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj#ubah}(h]h ]h"]h$]h&]uh1hhj"ubh)}(h[Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) h]h)}(hZBoth repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)h](hBoth repair tools (}(hjE#hhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hjM#hhhNhNubah}(h]h ]h"]h$]h&]uh1jhjE#ubh and then }(hjE#hhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hj_#hhhNhNubah}(h]h ]h"]h$]h&]uh1jhjE#ubh$ if online repair doesn’t succeed)}(hjE#hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjA#ubah}(h]h ]h"]h$]h&]uh1hhj"ubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjq"ubeh}(h]h ]h"]h$]h&]uh1hhjn"ubah}(h]h ]h"]h$]h&]jJj uh1hhhhMhj!ubeh}(h]h ]h"]h$]h&]uh1hhj!ubah}(h]h ]h"]h$]h&]jJj uh1hhhhMhj!ubeh}(h]h ]h"]h$]h&]uh1hhj!ubah}(h]h ]h"]h$]h&]jJj uh1hhhhMhjp!ubeh}(h]h ]h"]h$]h&]uh1hhjm!ubah}(h]h ]h"]h$]h&]jJj uh1hhhhMhj[!ubeh}(h]h ]h"]h$]h&]uh1hhjX!hhhNhNubah}(h]h ]h"]h$]h&]jJj uh1hhhhMhj!hhubh)}(h)This is quite the combinatoric explosion!h]h)This is quite the combinatoric explosion!}(hj#hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj!hhubh)}(hXFortunately, having this much test coverage makes it easy for XFS developers to check the responses of XFS' fsck tools. Since the introduction of the fuzz testing framework, these tests have been used to discover incorrect repair code and missing functionality for entire classes of metadata objects in ``xfs_repair``. The enhanced testing was used to finalize the deprecation of ``xfs_check`` by confirming that ``xfs_repair`` could detect at least as many corruptions as the older tool.h](hX1Fortunately, having this much test coverage makes it easy for XFS developers to check the responses of XFS’ fsck tools. Since the introduction of the fuzz testing framework, these tests have been used to discover incorrect repair code and missing functionality for entire classes of metadata objects in }(hj#hhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hj#hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#ubh?. 
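
To make the transformation list concrete, the following is a minimal C sketch
of what the eight bit field transformations might look like when applied to
fields up to 64 bits wide.
The enum and function names here are illustrative only; they roughly mirror
the set of fuzz actions that ``xfs_db`` provides, but ``xfs_db`` implements
them internally with its own field abstractions.

.. code-block:: c

    #include <stdint.h>
    #include <stdlib.h>

    enum fuzz_action {
        FUZZ_ZEROES,        /* 1. clear all bits */
        FUZZ_ONES,          /* 2. set all bits */
        FUZZ_FIRSTBIT,      /* 3. toggle the most significant bit */
        FUZZ_MIDDLEBIT,     /* 4. toggle the middle bit */
        FUZZ_LASTBIT,       /* 5. toggle the least significant bit */
        FUZZ_ADD,           /* 6. add a small quantity */
        FUZZ_SUB,           /* 7. subtract a small quantity */
        FUZZ_RANDOM,        /* 8. randomize the contents */
    };

    /* Apply one fuzz transformation to a field that is nbits wide. */
    static uint64_t fuzz_field(uint64_t value, unsigned int nbits,
                               enum fuzz_action action)
    {
        uint64_t mask = (nbits >= 64) ? ~0ULL : (1ULL << nbits) - 1;

        switch (action) {
        case FUZZ_ZEROES:
            return 0;
        case FUZZ_ONES:
            return mask;
        case FUZZ_FIRSTBIT:
            return value ^ (1ULL << (nbits - 1));
        case FUZZ_MIDDLEBIT:
            return value ^ (1ULL << (nbits / 2));
        case FUZZ_LASTBIT:
            return value ^ 1;
        case FUZZ_ADD:
            return (value + 7) & mask;
        case FUZZ_SUB:
            return (value - 7) & mask;
        case FUZZ_RANDOM:
            return (((uint64_t)random() << 32) ^ random()) & mask;
        }
        return value;
    }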

Fortunately, having this much test coverage makes it easy for XFS developers
to check the responses of XFS' fsck tools.
Since the introduction of the fuzz testing framework, these tests have been
used to discover incorrect repair code and missing functionality for entire
classes of metadata objects in ``xfs_repair``.
The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
confirming that ``xfs_repair`` could detect at least as many corruptions as
the older tool.

These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
allow the online fsck developers to compare online fsck against offline fsck,
and they enable XFS developers to find deficiencies in the code base.

Proposed patchsets include
`general fuzzer improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
`fuzzing baselines
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
and `improvements in fuzz testing comprehensiveness
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.

Stress Testing
--------------

A unique requirement of online fsck is the ability to operate on a filesystem
concurrently with regular workloads.
Although it is of course impossible to run ``xfs_scrub`` with *zero*
observable impact on the running system, the online repair code should never
introduce inconsistencies into the filesystem metadata, and regular workloads
should never notice resource starvation.
To verify that these conditions are being met, fstests has been enhanced in
the following ways:

* For each scrub item type, create a test to exercise checking that item type
  while running ``fsstress``.
* For each scrub item type, create a test to exercise repairing that item
  type while running ``fsstress``.
* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
  filesystem doesn't cause problems.
* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
  force-repairing the whole filesystem doesn't cause problems.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress``
  while freezing and thawing the filesystem.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress``
  while remounting the filesystem read-only and read-write.
* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)

Success is defined by the ability to run all of these tests without observing
any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
check warnings, or any other sort of mischief.

Proposed patchsets include
`general stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
and the
`evolution of existing per-function stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.

4. User Interface
=================

The primary user of online fsck is the system administrator, just like
offline repair.
Online fsck presents two modes of operation to administrators: a foreground
CLI process for online fsck on demand, and a background service that performs
autonomous checking and repair.

Checking on Demand
------------------

For administrators who want the absolute freshest information about the
metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
a command line.
The program checks every piece of metadata in the filesystem while the
administrator waits for the results to be reported, just like the existing
``xfs_repair`` tool.
Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
option to increase the verbosity of the information reported.

A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
correction capabilities of the hardware to check data file contents.
The media scan is not enabled by default because it may dramatically increase
program runtime and consume a lot of bandwidth on older storage hardware.

The output of a foreground invocation is captured in the system log.

The ``xfs_scrub_all`` program walks the list of mounted filesystems and
initiates ``xfs_scrub`` for each of them in parallel.
It serializes scans for any filesystems that resolve to the same top level
kernel block device to prevent resource overconsumption.

Background Service
------------------

To reduce the workload of system administrators, the ``xfs_scrub`` package
provides a suite of `systemd <https://systemd.io/>`_ timers and services that
run online fsck automatically on weekends by default.
The background service configures scrub to run with as little privilege as
possible, the lowest CPU and IO priority, and in a CPU-constrained single
threaded mode.
This can be tuned by the systemd administrator at any time to suit the
latency and throughput requirements of customer workloads.

The output of the background service is also captured in the system log.
If desired, reports of failures (either due to inconsistencies or mere
runtime errors) can be emailed automatically by setting the ``EMAIL_ADDR``
environment variable in the following service files:

* ``xfs_scrub_fail@.service``
* ``xfs_scrub_media_fail@.service``
* ``xfs_scrub_all_fail.service``

The decision to enable the background scan is left to the system
administrator.
This can be done by enabling either of the following services:

* ``xfs_scrub_all.timer`` on systemd systems
* ``xfs_scrub_all.cron`` on non-systemd systems

This automatic weekly scan is configured out of the box to perform an
additional media scan of all file data once per month.
This is less foolproof than, say, storing file data block checksums, but much
more performant if application software provides its own integrity checking,
redundancy can be provided elsewhere above the filesystem, or the storage
device's integrity guarantees are deemed sufficient.

The systemd unit file definitions have been subjected to a security audit
(as of systemd 249) to ensure that the xfs_scrub processes have as little
access to the rest of the system as possible.
This was performed via ``systemd-analyze security``, after which privileges
were restricted to the minimum required; sandboxing and system call filtering
were enabled to the maximal extent possible; and access to the filesystem
tree was restricted to the minimum needed to start the program and access the
filesystem being scanned.
The service definition files restrict CPU usage to 80% of one CPU core, and
apply as nice of a priority to IO and CPU scheduling as possible.
This measure was taken to minimize delays in the rest of the filesystem.
No such hardening has been performed for the cron job.

Proposed patchset:
`Enabling the xfs_scrub background service
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.

Health Reporting
----------------

XFS caches a summary of each filesystem's health status in memory.
The information is updated whenever ``xfs_scrub`` is run, or whenever
inconsistencies are detected in the filesystem metadata during regular
operations.
System administrators should use the ``health`` command of ``xfs_spaceman``
to download this information into a human-readable format.
If problems have been observed, the administrator can schedule a reduced
service window to run the online repair tool to correct the problem.
Failing that, the administrator can decide to schedule a maintenance window
to run the traditional offline repair tool to correct the problem.

**Future Work Question**: Should the health reporting integrate with the new
inotify fs error notification system?
Would it be helpful for sysadmins to have a daemon to listen for corruption
notifications and initiate a repair?

*Answer*: These questions remain unanswered, but should be a part of the
conversation with early adopters and potential downstream users of XFS.
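
One input to that conversation: the kernel already exposes filesystem error
notifications to userspace through the fanotify ``FAN_FS_ERROR`` facility.
The following is a minimal, hypothetical sketch of a daemon that waits for
such notifications on a filesystem mounted at ``/mnt``; decoding of the
per-event information records and all real error handling are omitted, and
nothing here is part of any proposed patchset.

.. code-block:: c

    #include <sys/fanotify.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t len;

        /* FAN_REPORT_FID is required to receive FAN_FS_ERROR events. */
        int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);

        if (fd < 0)
            return 1;

        /* Watch the entire filesystem backing /mnt for errors. */
        if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                          FAN_FS_ERROR, AT_FDCWD, "/mnt") < 0)
            return 1;

        while ((len = read(fd, buf, sizeof(buf))) > 0) {
            struct fanotify_event_metadata *md;

            md = (struct fanotify_event_metadata *)buf;
            for (; FAN_EVENT_OK(md, len); md = FAN_EVENT_NEXT(md, len)) {
                if (md->mask & FAN_FS_ERROR)
                    printf("fs error reported; consider running xfs_scrub\n");
            }
        }
        return 0;
    }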

Proposed patchsets include
`wiring up health reports to correction returns
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
and `preservation of sickness info during memory reclaim
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.

5. Kernel Algorithms and Data Structures
========================================

This section discusses the key algorithms and data structures of the kernel
code that provide the ability to check and repair metadata while the system
is running.
The first chapters in this section reveal the pieces that provide the
foundation for checking metadata.
The remainder of this section presents the mechanisms through which XFS
regenerates itself.

Self Describing Metadata
------------------------

Starting with XFS version 5 in 2012, XFS updated the format of nearly every
ondisk block header to record a magic number, a checksum, a universally
"unique" identifier (UUID), an owner code, the ondisk address of the block,
and a log sequence number.
When loading a block buffer from disk, the magic number, UUID, owner, and
ondisk address confirm that the retrieved block matches the specific owner of
the current filesystem, and that the information contained in the block is
supposed to be found at the ondisk address.
The first three components enable checking tools to disregard alleged
metadata that doesn't belong to the filesystem, and the fourth component
enables the filesystem to detect lost writes.
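
The following illustrative C fragment shows the kinds of tests a buffer
verifier applies to a v5-style self-describing block header.
The structure layout and helper names are invented for clarity and do not
match the kernel's actual definitions.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    struct block_hdr {              /* simplified v5-style header */
        uint32_t magic;             /* structure type */
        uint8_t  uuid[16];          /* filesystem UUID */
        uint64_t owner;             /* owning AG or inode */
        uint64_t blkno;             /* where this block should live */
        uint64_t lsn;               /* last log sequence number */
        uint32_t crc;               /* checksum of the block */
    };

    /* Hypothetical helpers for this sketch. */
    bool crc_is_valid(const void *block, unsigned int len);
    bool uuid_matches_fs(const uint8_t *uuid);

    static bool verify_block(const struct block_hdr *hdr, const void *block,
                             unsigned int len, uint32_t want_magic,
                             uint64_t want_blkno)
    {
        if (!crc_is_valid(block, len))
            return false;       /* torn or corrupt write */
        if (hdr->magic != want_magic)
            return false;       /* wrong structure type */
        if (!uuid_matches_fs(hdr->uuid))
            return false;       /* block from some other filesystem */
        if (hdr->blkno != want_blkno)
            return false;       /* misplaced write */
        return true;
    }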

Whenever a file system operation modifies a block, the change is submitted to
the log as part of a transaction.
The log then processes these transactions, marking them done once they are
safely persisted to storage.
The logging code maintains the checksum and the log sequence number of the
last transactional update.
Checksums are useful for detecting torn writes and other discrepancies that
can be introduced between the computer and its storage devices.
Sequence number tracking enables log recovery to avoid applying out of date
log updates to the filesystem.

These two features improve overall runtime resiliency by providing a means
for the filesystem to detect obvious corruption when reading metadata blocks
from disk, but these buffer verifiers cannot provide any consistency checking
between metadata structures.

For more information, please see the documentation for
Documentation/filesystems/xfs/xfs-self-describing-metadata.rst

Reverse Mapping
---------------

The original design of XFS (circa 1993) is an improvement upon 1980s Unix
filesystem design.
In those days, storage density was expensive, CPU time was scarce, and
excessive seek time could kill performance.
For performance reasons, filesystem authors were reluctant to add redundancy
to the filesystem, even at the cost of data integrity.
Filesystem designers in the early 21st century chose different strategies to
increase internal redundancy -- either storing nearly identical copies of
metadata, or more space-efficient encoding techniques.

For XFS, a different redundancy strategy was chosen to modernize the design:
a secondary space usage index that maps allocated disk extents back to their
owners.
By adding a new index, the filesystem retains most of its ability to scale
well to heavily threaded workloads involving large datasets, since the
primary file metadata (the directory tree, the file block map, and the
allocation groups) remain unchanged.
Like any system that improves redundancy, the reverse-mapping feature
increases overhead costs for space mapping activities.
However, it has two critical advantages: first, the reverse index is key to
enabling online fsck and other requested functionality such as free space
defragmentation, better media failure reporting, and filesystem shrinking.
Second, the different ondisk storage format of the reverse mapping btree
defeats device-level deduplication because the filesystem requires real
redundancy.

+------------------------------------------------------------------------+
| **Sidebar**:                                                           |
+------------------------------------------------------------------------+
| A criticism of adding the secondary index is that it does nothing to   |
| improve the robustness of user data storage itself.                    |
| This is a valid point, but adding a new index for file data block      |
| checksums increases write amplification by turning data overwrites    |
| into copy-writes, which age the filesystem prematurely.                |
| In keeping with thirty years of precedent, users who want file data    |
| integrity can supply as powerful a solution as they require.           |
| As for metadata, the complexity of adding a new secondary index of     |
| space usage is much less than adding volume management and storage     |
| device mirroring to XFS itself.                                        |
| Perfection of RAID and volume management is best left to existing      |
| layers in the kernel.                                                  |
+------------------------------------------------------------------------+

The information captured in a reverse space mapping record is as follows:

.. code-block:: c

    struct xfs_rmap_irec {
        xfs_agblock_t    rm_startblock;   /* extent start block */
        xfs_extlen_t     rm_blockcount;   /* extent length */
        uint64_t         rm_owner;        /* extent owner */
        uint64_t         rm_offset;       /* offset within the owner */
        unsigned int     rm_flags;        /* state flags */
    };

The first two fields capture the location and size of the physical space, in
units of filesystem blocks.
The owner field tells scrub which metadata structure or file inode has been
assigned this space.
For space allocated to files, the offset field tells scrub where the space
was mapped within the file fork.
Finally, the flags field provides extra information about the space usage --
is this an attribute fork extent?  A file mapping btree extent?  Or an
unwritten data extent?

Online filesystem checking judges the consistency of each primary metadata
record by comparing its information against all other space indices.
The reverse mapping index plays a key role in the consistency checking
process because it contains a centralized alternate copy of all space
allocation information.
Program runtime and ease of resource acquisition are the only real limits to
what online checking can consult.
For example, a file data extent mapping can be checked against the following
(a sketch of this cross-check follows the list):

* The absence of an entry in the free space information.
* The absence of an entry in the inode index.
* The absence of an entry in the reference count data if the file is not
  marked as having shared extents.
* The correspondence of an entry in the reverse mapping information.
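
To illustrate, here is a highly simplified sketch of the cross-referencing of
one file data extent.
The helper names are hypothetical stand-ins for the kernel's actual btree
query functions, and all locking, AG iteration, and error handling are
omitted.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t xfs_agblock_t;     /* simplified for this sketch */
    typedef uint32_t xfs_extlen_t;

    struct scrub_ctx;                   /* stand-in for the scrub context */

    /* Hypothetical query helpers; the kernel has analogous functions. */
    bool bnobt_has_extent(struct scrub_ctx *sc, xfs_agblock_t bno,
                          xfs_extlen_t len);
    bool inobt_has_inodes_at(struct scrub_ctx *sc, xfs_agblock_t bno,
                             xfs_extlen_t len);
    bool refcountbt_is_shared(struct scrub_ctx *sc, xfs_agblock_t bno,
                              xfs_extlen_t len);
    bool rmapbt_has_mapping(struct scrub_ctx *sc, xfs_agblock_t bno,
                            xfs_extlen_t len, uint64_t owner,
                            uint64_t offset);
    int mark_corrupt(struct scrub_ctx *sc);

    /* Cross-reference one file data extent against the space indices. */
    int xchk_file_extent(struct scrub_ctx *sc, xfs_agblock_t bno,
                         xfs_extlen_t len, uint64_t ino, uint64_t off,
                         bool is_reflinked)
    {
        /* No part of the extent may also be listed as free space. */
        if (bnobt_has_extent(sc, bno, len))
            return mark_corrupt(sc);
        /* The extent must not overlap the inode chunks. */
        if (inobt_has_inodes_at(sc, bno, len))
            return mark_corrupt(sc);
        /* Only files marked as reflinked may have refcount records. */
        if (!is_reflinked && refcountbt_is_shared(sc, bno, len))
            return mark_corrupt(sc);
        /* The rmap index must record exactly this owner and offset. */
        if (!rmapbt_has_mapping(sc, bno, len, ino, off))
            return mark_corrupt(sc);
        return 0;
    }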

There are several observations to make about reverse mapping indices:

1. Reverse mappings can provide a positive affirmation of correctness if any
   of the above primary metadata are in doubt.
   The checking code for most primary metadata follows a path similar to the
   one outlined above.

2. Proving the consistency of secondary metadata with the primary metadata
   is difficult because that requires a full scan of all primary space
   metadata, which is very time intensive.
   For example, checking a reverse mapping record for a file extent mapping
   btree block requires locking the file and searching the entire btree to
   confirm the block.
   Instead, scrub relies on rigorous cross-referencing during the primary
   space mapping structure checks.

3. Consistency scans must use non-blocking lock acquisition primitives if
   the required locking order is not the same order used by regular
   filesystem operations.
   For example, if the filesystem normally takes a file ILOCK before taking
   the AGF buffer lock but scrub wants to take a file ILOCK while holding an
   AGF buffer lock, scrub cannot block on that second acquisition.
   This means that forward progress during this part of a scan of the
   reverse mapping data cannot be guaranteed if system load is heavy.

In summary, reverse mappings play a key role in reconstruction of primary
metadata.
The details of how these records are staged, written to disk, and committed
into the filesystem are covered in subsequent sections.

Checking and Cross-Referencing
------------------------------

The first step of checking a metadata structure is to examine every record
contained within the structure and its relationship with the rest of the
system.
XFS contains multiple layers of checking to try to prevent inconsistent
metadata from wreaking havoc on the system.
Each of these layers contributes information that helps the kernel to make
three decisions about the health of a metadata structure (a sketch of how
userspace observes these decisions follows the list):

- Is a part of this structure obviously corrupt
  (``XFS_SCRUB_OFLAG_CORRUPT``)?
- Is this structure inconsistent with the rest of the system
  (``XFS_SCRUB_OFLAG_XCORRUPT``)?
- Is there so much damage around the filesystem that cross-referencing is
  not possible (``XFS_SCRUB_OFLAG_XFAIL``)?
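
The structure, ioctl, and flag names below are taken from the XFS scrub
ioctl interface; the exact header to include may vary (xfsprogs installs
``xfs/xfs.h``), and the error handling here is deliberately minimal.
This hypothetical snippet sketches how a program might invoke a single scrub
request and interpret the outcome flags.

.. code-block:: c

    #include <xfs/xfs.h>        /* struct xfs_scrub_metadata et al. */
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        struct xfs_scrub_metadata sm = {
            .sm_type = XFS_SCRUB_TYPE_BNOBT,    /* free space by block */
            .sm_agno = 0,                       /* allocation group 0 */
        };
        int fd = open("/mnt", O_RDONLY);

        if (fd < 0 || ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
            return 1;

        if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
            printf("structure is obviously corrupt\n");
        if (sm.sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)
            printf("structure disagrees with cross-references\n");
        if (sm.sm_flags & XFS_SCRUB_OFLAG_XFAIL)
            printf("cross-referencing could not be attempted\n");
        return 0;
    }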

Validation of Userspace-Controlled Record Attributes
````````````````````````````````````````````````````

Various pieces of filesystem metadata are directly controlled by userspace,
so validation work cannot be more precise than checking that a value is
within the possible range.
These fields include:

- Filesystem labels
- File timestamps
- File permissions
- File size
- File flags
- Names present in directory entries, extended attribute keys, and
  filesystem labels
- Extended attribute key namespaces
- Extended attribute values
- File data block contents
- Quota limits
- Quota timer expiration (if resource usage exceeds the soft limit)

Cross-Referencing Space Metadata
````````````````````````````````

After internal block checks, the next higher level of checking is
cross-referencing records between metadata structures.
For regular runtime code, the cost of these checks is considered to be
prohibitively expensive, but as scrub is dedicated to rooting out
inconsistencies, it must pursue all avenues of inquiry.
The exact set of cross-referencing is highly dependent on the context of the
data structure being checked.

The XFS btree code has keyspace scanning functions that online fsck uses to
cross reference one structure with another.
Specifically, scrub can scan the key space of an index to determine if that
keyspace is fully, sparsely, or not at all mapped to records.
For the reverse mapping btree, it is possible to mask parts of the key for
the purposes of performing a keyspace scan so that scrub can decide if the
rmap btree contains records mapping a certain extent of physical space
without the sparseness of the rest of the rmap keyspace getting in the way.

Btree blocks undergo the following checks before cross-referencing (a sketch
of the last check follows the list):

- Does the type of data stored in the block match what scrub is expecting?
- Does the block belong to the owning structure that asked for the read?
- Do the records fit within the block?
- Are the records contained inside the block free of obvious corruptions?
- Are the name hashes in the correct order?
- Do node pointers within the btree point to valid block addresses for the
  type of btree?
- Do child pointers point towards the leaves?
- Do sibling pointers point across the same level?
- For each node block record, does the record key accurately reflect the
  contents of the child block?
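
The last of these checks can be pictured as comparing each node record's key
with the low key of the child block that the record points to, one level
down.
A hypothetical sketch, with invented types and accessors standing in for the
kernel's actual btree code:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    struct btree_key { uint64_t k; };
    struct btree_block;

    /* Hypothetical accessors for this sketch. */
    struct btree_key node_key(const struct btree_block *node, int i);
    struct btree_key child_low_key(const struct btree_block *child);
    int block_level(const struct btree_block *blk);

    /*
     * The i'th key in a node block must match the lowest key stored in
     * the i'th child, and that child must sit exactly one level closer
     * to the leaves than the node itself.
     */
    static bool check_node_record(const struct btree_block *node, int i,
                                  const struct btree_block *child)
    {
        if (block_level(child) != block_level(node) - 1)
            return false;   /* child pointers must point down one level */
        return node_key(node, i).k == child_low_key(child).k;
    }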

Space allocation records are cross-referenced as follows:

1. Any space mentioned by any metadata structure is cross-referenced as
   follows:

   - Does the reverse mapping index list only the appropriate owner as the
     owner of each block?
   - Are none of the blocks claimed as free space?
   - If these aren't file data blocks, are none of the blocks claimed as
     space shared by different owners?

2. Btree blocks are cross-referenced as follows:

   - Everything in class 1 above.
   - If there's a parent node block, do the keys listed for this block match
     the keyspace of this block?
   - Do the sibling pointers point to valid blocks?  Of the same level?
   - Do the child pointers point to valid blocks?  Of the next level down?

3. Free space btree records are cross-referenced as follows:

   - Everything in class 1 and 2 above.
   - Does the reverse mapping index list no owners of this space?
   - Is this space not claimed by the inode index for inodes?
   - Is it not mentioned by the reference count index?
   - Is there a matching record in the other free space btree?
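
Many of the cross-references above reduce to the keyspace scan primitive
described earlier: ask an index whether a given range of keys is not at all,
sparsely, or fully mapped to records.
The enum below is an illustrative stand-in for the kernel's actual
definitions; it shows how a caller might interpret the three outcomes:

.. code-block:: c

    #include <stdbool.h>

    /* Illustrative three-way result of a btree keyspace scan. */
    enum recpacking {
        RECPACKING_EMPTY,   /* no records map any part of the range */
        RECPACKING_SPARSE,  /* some, but not all, of the range is mapped */
        RECPACKING_FULL,    /* every key in the range is mapped */
    };

    /* e.g. free space btrees: the extent must not be free at all. */
    static bool xref_is_not_free(enum recpacking res)
    {
        return res == RECPACKING_EMPTY;
    }

    /* e.g. rmap btree: the extent must be completely owned. */
    static bool xref_is_fully_owned(enum recpacking res)
    {
        return res == RECPACKING_FULL;
    }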
repair.}(hj4hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM!hjd0hhubeh}(h]jah ]h"] cross-referencing space metadataah$]h&]uh1hhj,hhhhhMubh)}(hhh](h)}(hChecking Extended Attributesh]hChecking Extended Attributes}(hj5hhhNhNubah}(h]h ]h"]h$]h&]jjuh1hhj}5hhhhhM1ubh)}(hXcExtended attributes implement a key-value store that enable fragments of data to be attached to any file. Both the kernel and userspace can access the keys and values, subject to namespace and privilege restrictions. Most typically these fragments are metadata about the file -- origins, security contexts, user-supplied labels, indexing information, etc.h]hXcExtended attributes implement a key-value store that enable fragments of data to be attached to any file. Both the kernel and userspace can access the keys and values, subject to namespace and privilege restrictions. Most typically these fragments are metadata about the file -- origins, security contexts, user-supplied labels, indexing information, etc.}(hj5hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM3hj}5hhubh)}(hXNames can be as long as 255 bytes and can exist in several different namespaces. Values can be as large as 64KB. A file's extended attributes are stored in blocks mapped by the attr fork. The mappings point to leaf blocks, remote value blocks, or dabtree blocks. Block 0 in the attribute fork is always the top of the structure, but otherwise each of the three types of blocks can be found at any offset in the attr fork. Leaf blocks contain attribute key records that point to the name and the value. Names are always stored elsewhere in the same leaf block. Values that are less than 3/4 the size of a filesystem block are also stored elsewhere in the same leaf block. Remote value blocks contain values that are too large to fit inside a leaf. If the leaf information exceeds a single filesystem block, a dabtree (also rooted at block 0) is created to map hashes of the attribute names to leaf blocks in the attr fork.h]hXNames can be as long as 255 bytes and can exist in several different namespaces. Values can be as large as 64KB. A file’s extended attributes are stored in blocks mapped by the attr fork. The mappings point to leaf blocks, remote value blocks, or dabtree blocks. Block 0 in the attribute fork is always the top of the structure, but otherwise each of the three types of blocks can be found at any offset in the attr fork. Leaf blocks contain attribute key records that point to the name and the value. Names are always stored elsewhere in the same leaf block. Values that are less than 3/4 the size of a filesystem block are also stored elsewhere in the same leaf block. Remote value blocks contain values that are too large to fit inside a leaf. If the leaf information exceeds a single filesystem block, a dabtree (also rooted at block 0) is created to map hashes of the attribute names to leaf blocks in the attr fork.}(hj5hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM:hj}5hhubh)}(hChecking an extended attribute structure is not so straightforward due to the lack of separation between attr blocks and index blocks. Scrub must read each block mapped by the attr fork and ignore the non-leaf blocks:h]hChecking an extended attribute structure is not so straightforward due to the lack of separation between attr blocks and index blocks. 
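As an illustration of this style of cross-referencing, the following is a
minimal sketch of checking one free space record against the reverse
mapping index.  The helper names and cursor plumbing here are assumptions
made for the example, not the exact in-kernel scrub API:

.. code-block:: c

	/* Sketch only: helper names are illustrative, not the real API. */
	static void
	xchk_xref_free_extent(
		struct xfs_scrub	*sc,
		xfs_agblock_t		agbno,
		xfs_extlen_t		len)
	{
		bool			has_owner = true;
		int			error;

		/* Free space must have no owners in the rmap index. */
		error = xchk_rmap_has_owner(sc->sa.rmap_cur, agbno, len,
				&has_owner);
		if (error)
			return;
		if (has_owner)
			xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
	}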
Proposed patchsets are the series to find gaps in
`refcount btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
`inode btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_,
and
`rmap btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_
records; to find
`mergeable records
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
and to
`improve cross referencing with rmap
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
before starting a repair.

Checking Extended Attributes
````````````````````````````

Extended attributes implement a key-value store that enables fragments of
data to be attached to any file.  Both the kernel and userspace can access
the keys and values, subject to namespace and privilege restrictions.
Most typically these fragments are metadata about the file -- origins,
security contexts, user-supplied labels, indexing information, etc.

Names can be as long as 255 bytes and can exist in several different
namespaces.  Values can be as large as 64KB.  A file's extended attributes
are stored in blocks mapped by the attr fork.  The mappings point to leaf
blocks, remote value blocks, or dabtree blocks.  Block 0 in the attribute
fork is always the top of the structure, but otherwise each of the three
types of blocks can be found at any offset in the attr fork.  Leaf blocks
contain attribute key records that point to the name and the value.  Names
are always stored elsewhere in the same leaf block.  Values that are less
than 3/4 the size of a filesystem block are also stored elsewhere in the
same leaf block.  Remote value blocks contain values that are too large to
fit inside a leaf.  If the leaf information exceeds a single filesystem
block, a dabtree (also rooted at block 0) is created to map hashes of the
attribute names to leaf blocks in the attr fork.

Checking an extended attribute structure is not so straightforward due to
the lack of separation between attr blocks and index blocks.  Scrub must
read each block mapped by the attr fork and ignore the non-leaf blocks:

1. Walk the dabtree in the attr fork (if present) to ensure that there are
   no irregularities in the blocks or dabtree mappings that do not point to
   attr leaf blocks.

2. Walk the blocks of the attr fork looking for leaf blocks.  For each
   entry inside a leaf:

   a. Validate that the name does not contain invalid characters, as shown
      in the sketch after this list.

   b. Read the attr value.  This performs a named lookup of the attr name
      to ensure the correctness of the dabtree.  If the value is stored in
      a remote block, this also validates the integrity of the remote
      value block.
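The name check in step 2a might look like the following sketch.  The
function name is hypothetical; ``XATTR_NAME_MAX`` is the regular kernel
limit of 255 bytes mentioned above:

.. code-block:: c

	#include <linux/limits.h>

	/* Illustrative only; not the actual scrub helper. */
	static bool
	xchk_attr_name_valid(
		const unsigned char	*name,
		unsigned int		namelen)
	{
		unsigned int		i;

		if (namelen == 0 || namelen > XATTR_NAME_MAX)
			return false;

		/* Embedded null bytes would corrupt name lookups. */
		for (i = 0; i < namelen; i++)
			if (name[i] == 0)
				return false;

		return true;
	}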
Checking and Cross-Referencing Directories
``````````````````````````````````````````

Checking operations involving :ref:`parents <dirparent>` and
:ref:`file link counts <nlinks>` are discussed in more detail in later
sections.

Checking Directory/Attribute Btrees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As stated in previous sections, the directory/attribute btree (dabtree)
index maps user-provided names to improve lookup times by avoiding linear
scans.  Internally, it maps a 32-bit hash of the name to a block offset
within the appropriate file fork.

The internal structure of a dabtree closely resembles the btrees that
record fixed-size metadata records -- each dabtree block contains a magic
number, a checksum, sibling pointers, a UUID, a tree level, and a log
sequence number.  The format of leaf and node records is the same -- each
entry points to the next level down in the hierarchy, with dabtree node
records pointing to dabtree leaf blocks, and dabtree leaf records pointing
to non-dabtree blocks elsewhere in the fork.

Checking and cross-referencing the dabtree is very similar to what is done
for space btrees:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?  (See the sketch after this
  list.)

- Do node pointers within the dabtree point to valid fork offsets for
  dabtree blocks?

- Do leaf pointers within the dabtree point to valid fork offsets for
  directory or attr leaf blocks?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each dabtree node record, does the record key accurately reflect the
  contents of the child dabtree block?

- For each dabtree leaf record, does the record key accurately reflect the
  contents of the directory or attr block?
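The name hash ordering check can be sketched as follows.  The record layout
shown is a toy stand-in for the real ondisk ``struct xfs_da_node_entry``;
only the ordering logic is the point:

.. code-block:: c

	/* Toy stand-in for the ondisk dabtree entry layout. */
	struct dabtree_entry {
		uint32_t	hashval;	/* hash of the name */
		uint32_t	before;		/* fork offset of child */
	};

	/* Illustrative check: hashes must be in non-decreasing order. */
	static bool
	dabtree_hashes_in_order(
		const struct dabtree_entry	*entries,
		unsigned int			nr)
	{
		unsigned int			i;

		for (i = 1; i < nr; i++)
			if (entries[i - 1].hashval > entries[i].hashval)
				return false;

		return true;
	}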
Cross-Referencing Summary Counters
``````````````````````````````````

XFS maintains three classes of summary counters: available resources, quota
resource usage, and file link counts.

In theory, the amount of available resources (data blocks, inodes, realtime
extents) can be found by walking the entire filesystem.  This would make for
very slow reporting, so a transactional filesystem can maintain summaries of
this information in the superblock.  Cross-referencing these values against
the filesystem metadata should be a simple matter of walking the free space
and inode metadata in each AG and the realtime bitmap, but there are
complications that will be discussed in :ref:`more detail <fscounters>`
later.
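In outline, the comparison against the superblock counters looks like this
sketch.  The ``percpu_counter_sum`` call and the ``m_fdblocks`` counter are
real kernel constructs; the helper names and the (deliberately naive)
comparison logic are assumptions:

.. code-block:: c

	/* Sketch: compare a computed free block count to the incore one. */
	static void
	xchk_fdblocks_compare(
		struct xfs_scrub	*sc,
		uint64_t		computed_fdblocks)
	{
		struct xfs_mount	*mp = sc->mp;
		int64_t			incore;

		/* Sum the per-CPU free block counter. */
		incore = percpu_counter_sum(&mp->m_fdblocks);

		/*
		 * A mismatch implies either a corrupt counter or counters
		 * that moved while the scan ran; real code must recheck
		 * before deciding.
		 */
		if (incore != computed_fdblocks)
			xchk_set_corrupt(sc);	/* hypothetical helper */
	}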
:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
checking are sufficiently complicated to warrant separate sections.

Post-Repair Reverification
``````````````````````````

After performing a repair, the checking code is run a second time to
validate the new structure, and the results of the health assessment are
recorded internally and returned to the calling process.  This step is
critical for enabling system administrators to monitor the status of the
filesystem and the progress of any repairs.  For developers, it is a useful
means to judge the efficacy of error detection and correction in the online
and offline checking tools.

Eventual Consistency vs. Online Fsck
------------------------------------

Complex operations can make modifications to multiple per-AG data
structures with a chain of transactions.  These chains, once committed to
the log, are restarted during log recovery if the system crashes while
processing the chain.  Because the AG header buffers are unlocked between
transactions within a chain, online checking must coordinate with chained
operations that are in progress to avoid incorrectly detecting
inconsistencies due to pending chains.  Furthermore, online repair must not
run when operations are pending because the metadata are temporarily
inconsistent with each other, and rebuilding is not possible.

Only online fsck has this requirement of total consistency of AG metadata,
and online fsck should be relatively rare as compared to filesystem change
operations.  Online fsck coordinates with transaction chains as follows,
with a sketch of the loop after the list:

- For each AG, maintain a count of intent items targeting that AG.  The
  count should be bumped whenever a new item is added to the chain.  The
  count should be dropped when the filesystem has locked the AG header
  buffers and finished the work.

- When online fsck wants to examine an AG, it should lock the AG header
  buffers to quiesce all transaction chains that want to modify that AG.
  If the count is zero, proceed with the checking operation.  If it is
  nonzero, cycle the buffer locks to allow the chain to make forward
  progress.
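Under the assumption of a per-AG atomic counter and hypothetical lock
helpers, the coordination loop might look like this; the real mechanism is
the intent drain described later in this section:

.. code-block:: c

	/* Sketch only: field and helper names are illustrative. */
	static int
	xchk_ag_lock_quiesced(
		struct xfs_scrub	*sc,
		struct xfs_perag	*pag)
	{
		while (!fatal_signal_pending(current)) {
			/* Lock out new chains targeting this AG. */
			xchk_ag_lock_headers(sc, pag);

			/* No pending intents?  The AG is consistent. */
			if (atomic_read(&pag->pag_nr_intents) == 0)
				return 0;

			/* Cycle the locks so the chain can finish. */
			xchk_ag_unlock_headers(sc, pag);
			cond_resched();
		}

		return -EINTR;
	}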
This may lead to online fsck taking a long time to complete, but regular
filesystem updates take precedence over background checking activity.
Details about the discovery of this situation are presented in the
:ref:`next section <chain_coordination>`, and details about the solution
are presented :ref:`after that <intent_drains>`.

.. _chain_coordination:

Discovery of the Problem
````````````````````````

Midway through the development of online scrubbing, the fsstress tests
uncovered a misinteraction between online fsck and compound transaction
chains created by other writer threads that resulted in false reports of
metadata inconsistency.  The root cause of these reports is the eventual
consistency model introduced by the expansion of deferred work items and
compound transaction chains when reverse mapping and reflink were
introduced.

Originally, transaction chains were added to XFS to avoid deadlocks when
unmapping space from files.  Deadlock avoidance rules require that AGs only
be locked in increasing order, which makes it impossible (say) to use a
single transaction to free a space extent in AG 7 and then try to free a
now superfluous block mapping btree block in AG 3.  To avoid these kinds of
deadlocks, XFS creates Extent Freeing Intent (EFI) log items to commit to
freeing some space in one transaction while deferring the actual metadata
updates to a fresh transaction.  The transaction sequence looks like this:
1. The first transaction contains a physical update to the file's block
   mapping structures to remove the mapping from the btree blocks.  It then
   attaches to the in-memory transaction an action item to schedule
   deferred freeing of space.  Concretely, each transaction maintains a
   list of ``struct xfs_defer_pending`` objects, each of which maintains a
   list of ``struct xfs_extent_free_item`` objects.  Returning to the
   example above, the action item tracks the freeing of both the unmapped
   space from AG 7 and the block mapping btree (BMBT) block from AG 3.
   Deferred frees recorded in this manner are committed in the log by
   creating an EFI log item from the ``struct xfs_extent_free_item`` object
   and attaching the log item to the transaction.  When the log is
   persisted to disk, the EFI item is written into the ondisk transaction
   record.  EFIs can list up to 16 extents to free, all sorted in AG order.

2. The second transaction contains a physical update to the free space
   btrees of AG 3 to release the former BMBT block and a second physical
   update to the free space btrees of AG 7 to release the unmapped file
   space.  Observe that the physical updates are resequenced in the correct
   order when possible.  Attached to the transaction is an extent free done
   (EFD) log item.  The EFD contains a pointer to the EFI logged in
   transaction #1 so that log recovery can tell if the EFI needs to be
   replayed.

If the system goes down after transaction #1 is written back to the
filesystem but before #2 is committed, a scan of the filesystem metadata
would show inconsistent filesystem metadata because there would not appear
to be any owner of the unmapped space.  Happily, log recovery corrects this
inconsistency for us -- when recovery finds an intent log item but does not
find a corresponding intent done item, it will reconstruct the incore state
of the intent item and finish it.  In the example above, the log must
replay both frees described in the recovered EFI to complete the recovery
phase.
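Trimmed-down versions of the EFI and EFD payloads make the pairing
explicit.  These are simplified, illustrative layouts modeled on the real
``struct xfs_efi_log_format`` and ``struct xfs_efd_log_format``:

.. code-block:: c

	struct efi_extent {
		uint64_t	ext_start;	/* first block to free */
		uint32_t	ext_len;	/* length in blocks */
	};

	/* Intent: "this space will be freed", up to 16 extents. */
	struct efi_log_format {
		uint64_t		efi_id;		/* unique intent id */
		uint32_t		efi_nextents;
		struct efi_extent	efi_extents[];	/* sorted in AG order */
	};

	/* Done: points back at the intent so recovery can pair them. */
	struct efd_log_format {
		uint64_t		efd_efi_id;	/* matches efi_id */
		uint32_t		efd_nextents;	/* extents freed */
	};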
There are subtleties to XFS' transaction chaining strategy to consider:

- Log items must be added to a transaction in the correct order to prevent
  conflicts with principal objects that are not held by the transaction.
  In other words, all per-AG metadata updates for an unmapped block must be
  completed before the last update to free the extent, and extents should
  not be reallocated until that last update commits to the log.

- AG header buffers are released between each transaction in a chain.
  This means that other threads can observe an AG in an intermediate state,
  but as long as the first subtlety is handled, this should not affect the
  correctness of filesystem operations.

- Unmounting the filesystem flushes all pending work to disk, which means
  that offline fsck never sees the temporary inconsistencies caused by
  deferred work item processing.

In this manner, XFS employs a form of eventual consistency to avoid
deadlocks and increase parallelism.
]h"]h$]h&]uh1hhj;hhhhhNubh)}(h.A reverse mapping update for the freelist fix h]h)}(h-A reverse mapping update for the freelist fixh]h-A reverse mapping update for the freelist fix}(hjs<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMehjo<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h/An update to the reference counting informationh]h)}(hj<h]h/An update to the reference counting information}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMghj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h0A reverse mapping update for the refcount updateh]h)}(hj<h]h0A reverse mapping update for the refcount update}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhhj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h"Fixing the freelist (a third time)h]h)}(hj<h]h"Fixing the freelist (a third time)}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMihj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h.A reverse mapping update for the freelist fix h]h)}(h-A reverse mapping update for the freelist fixh]h-A reverse mapping update for the freelist fix}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMjhj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(hCFreeing any space that was unmapped and not owned by any other fileh]h)}(hj<h]hCFreeing any space that was unmapped and not owned by any other file}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMlhj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h#Fixing the freelist (a fourth time)h]h)}(hj<h]h#Fixing the freelist (a fourth time)}(hj<hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMmhj<ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h.A reverse mapping update for the freelist fix h]h)}(h-A reverse mapping update for the freelist fixh]h-A reverse mapping update for the freelist fix}(hj=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMnhj=ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h1Freeing the space used by the block mapping btreeh]h)}(hj,=h]h1Freeing the space used by the block mapping btree}(hj.=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMphj*=ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h"Fixing the freelist (a fifth time)h]h)}(hjC=h]h"Fixing the freelist (a fifth time)}(hjE=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMqhjA=ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubh)}(h.A reverse mapping update for the freelist fix h]h)}(h-A reverse mapping update for the freelist fixh]h-A reverse mapping update for the freelist fix}(hj\=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMrhjX=ubah}(h]h ]h"]h$]h&]uh1hhj;hhhhhNubeh}(h]h ]h"]h$]h&]jJj uh1hhhhM]hj:hhubh)}(hX%Free list fixups are not usually needed more than once per AG per transaction chain, but it is theoretically possible if space is very tight. For copy-on-write updates this is even worse, because this must be done once to remove the space from a staging area and again to map it into the file!h]hX%Free list fixups are not usually needed more than once per AG per transaction chain, but it is theoretically possible if space is very tight. For copy-on-write updates this is even worse, because this must be done once to remove the space from a staging area and again to map it into the file!}(hjv=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMthj:hhubh)}(hXTo deal with this explosion in a calm manner, XFS expands its use of deferred work items to cover most reverse mapping updates and all refcount updates. This reduces the worst case size of transaction reservations by breaking the work into a long chain of small updates, which increases the degree of eventual consistency in the system. 
To deal with this explosion in a calm manner, XFS expands its use of
deferred work items to cover most reverse mapping updates and all refcount
updates.  This reduces the worst case size of transaction reservations by
breaking the work into a long chain of small updates, which increases the
degree of eventual consistency in the system.  Again, this generally isn't
a problem because XFS orders its deferred work items carefully to avoid
resource reuse conflicts between unsuspecting threads.

However, online fsck changes the rules -- remember that although physical
updates to per-AG structures are coordinated by locking the buffers for AG
headers, buffer locks are dropped between transactions.  Once scrub
acquires resources and takes locks for a data structure, it must do all the
validation work without releasing the lock.  If the main lock for a space
btree is an AG header buffer lock, scrub may have interrupted another
thread that is midway through finishing a chain.  For example, if a thread
performing a copy-on-write has completed a reverse mapping update but not
the corresponding refcount update, the two AG btrees will appear
inconsistent to scrub and an observation of corruption will be recorded.
This observation will not be correct.  If a repair is attempted in this
state, the results will be catastrophic!

Several other solutions to this problem were evaluated upon discovery of
this flaw and rejected:

1. Add a higher level lock to allocation groups and require writer threads
   to acquire the higher level lock in AG order before making any changes.
   This would be very difficult to implement in practice because it is
   difficult to determine which locks need to be obtained, and in what
   order, without simulating the entire operation.  Performing a dry run of
   a file operation to discover necessary locks would make the filesystem
   very slow.

2. Make the deferred work coordinator code aware of consecutive intent
   items targeting the same AG and have it hold the AG header buffers
   locked across the transaction roll between updates.  This would
   introduce a lot of complexity into the coordinator since it is only
   loosely coupled with the actual deferred work items.  It would also fail
   to solve the problem because deferred work items can generate new
   deferred subtasks, but all subtasks must be complete before work can
   start on a new sibling task.

3. Teach online fsck to walk all transactions waiting for whichever lock(s)
   protect the data structure being scrubbed to look for pending
   operations.  The checking and repair operations must factor these
   pending operations into the evaluations being performed.  This solution
   is a nonstarter because it is *extremely* invasive to the main
   filesystem.
.. _intent_drains:

Intent Drains
`````````````

Online fsck uses an atomic intent item counter and lock cycling to
coordinate with transaction chains.  There are two key properties to the
drain mechanism.  First, the counter is incremented when a deferred work
item is *queued* to a transaction, and it is decremented after the
associated intent done log item is *committed* to another transaction.  The
second property is that deferred work can be added to a transaction without
holding an AG header lock, but per-AG work items cannot be marked done
without locking that AG header buffer to log the physical updates and the
intent done log item.  The first property enables scrub to yield to running
transaction chains, which is an explicit deprioritization of online fsck to
benefit file operations.  The second property of the drain is key to the
correct coordination of scrub, since scrub will always be able to decide if
a conflict is possible.

For regular filesystem code, the drain works as follows, with a sketch
after the list:

1. Call the appropriate subsystem function to add a deferred work item to a
   transaction.

2. The function calls ``xfs_defer_drain_bump`` to increase the counter.

3. When the deferred item manager wants to finish the deferred work item,
   it calls ``->finish_item`` to complete it.

4. The ``->finish_item`` implementation logs some changes and calls
   ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any
   threads waiting on the drain.

5. The subtransaction commits, which unlocks the resource associated with
   the intent item.
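A minimal model of the producer side follows, using the counter operations
named above.  The two-field structure reflects the counter-plus-waitqueue
design described in this section; the function bodies are simplified
sketches, not the exact kernel code:

.. code-block:: c

	struct xfs_defer_drain {
		atomic_t		dr_count;	/* pending intents */
		struct wait_queue_head	dr_waiters;	/* scrubbers asleep */
	};

	static inline void
	xfs_defer_drain_bump(struct xfs_defer_drain *dr)
	{
		atomic_inc(&dr->dr_count);
	}

	static inline void
	xfs_defer_drain_drop(struct xfs_defer_drain *dr)
	{
		/* Wake scrub only when the last intent is finished. */
		if (atomic_dec_and_test(&dr->dr_count) &&
		    wq_has_sleeper(&dr->dr_waiters))
			wake_up(&dr->dr_waiters);
	}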
For scrub, the drain works as follows, again with a sketch after the list:

1. Lock the resource(s) associated with the metadata being scrubbed.  For
   example, a scan of the refcount btree would lock the AGI and AGF header
   buffers.

2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there
   are no chains in progress and the operation may proceed.

3. Otherwise, release the resources grabbed in step 1.

4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``),
   then go back to step 1 unless a signal has been caught.
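The consumer side can be sketched as a lock-cycling loop; the lock helpers
here are hypothetical stand-ins for the real resource acquisition code:

.. code-block:: c

	/* Sketch of steps 1-4 above; names are illustrative. */
	static int
	xchk_drain_and_lock(
		struct xfs_scrub	*sc,
		struct xfs_defer_drain	*dr)
	{
		int	error = 0;

		while (!error) {
			/* Step 1: take the AG header locks. */
			xchk_lock_ag_headers(sc);	/* hypothetical */

			/* Step 2: no pending chains?  Proceed. */
			if (atomic_read(&dr->dr_count) == 0)
				return 0;

			/* Step 3: back off so chains can finish. */
			xchk_unlock_ag_headers(sc);	/* hypothetical */

			/* Step 4: sleep until the drain empties. */
			error = wait_event_killable(dr->dr_waiters,
					atomic_read(&dr->dr_count) == 0);
		}

		return error;
	}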
To avoid polling in step 4, the drain provides a waitqueue for scrub
threads to be woken up whenever the intent count drops to zero.

The proposed patchset is the
`scrub intent drain series
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.

.. _jump_labels:

Static Keys (aka Jump Label Patching)
`````````````````````````````````````

Online fsck for XFS separates the regular filesystem from the checking and
repair code as much as possible.  However, there are a few parts of online
fsck (such as the intent drains, and later, live update hooks) where it is
useful for the online fsck code to know what's going on in the rest of the
filesystem.  Since it is not expected that online fsck will be constantly
running in the background, it is very important to minimize the runtime
overhead imposed by these hooks when online fsck is compiled into the
kernel but not actively running on behalf of userspace.  Taking locks in
the hot path of a writer thread to access a data structure only to find
that no further action is necessary is expensive -- on the author's
computer, this has an overhead of 40-50ns per access.  Fortunately, the
kernel supports dynamic code patching, which enables XFS to replace a
static branch to hook code with ``nop`` sleds when online fsck isn't
running.  This sled has an overhead of however long it takes the
instruction decoder to skip past the sled, which seems to be on the order
of less than 1ns and does not access memory outside of instruction
fetching.
When online fsck enables the static key, the sled is replaced with an
unconditional branch to call the hook code.  The switchover is quite
expensive (~22000ns) but is paid entirely by the program that invoked
online fsck, and can be amortized if multiple threads enter online fsck at
the same time, or if multiple filesystems are being checked at the same
time.  Changing the branch direction requires taking the CPU hotplug lock,
and since CPU initialization requires memory allocation, online fsck must
be careful not to change a static key while holding any locks or resources
that could be accessed in the memory reclaim paths.  To minimize contention
on the CPU hotplug lock, care should be taken not to enable or disable
static keys unnecessarily.

Because static keys are intended to minimize hook overhead for regular
filesystem operations when xfs_scrub is not running, the intended usage
patterns are as follows, with a sketch of the pattern after the list:

- The hooked part of XFS should declare a static-scoped static key that
  defaults to false.  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of
  this.  The static key itself should be declared as a ``static`` variable.

- When deciding to invoke code that's only used by scrub, the regular
  filesystem should call the ``static_branch_unlikely`` predicate to avoid
  the scrub-only hook code if the static key is not enabled.

- The regular filesystem should export helper functions that call
  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
  static key.  Wrapper functions make it easy to compile out the relevant
  code if the kernel distributor turns off online fsck at build time.

- Scrub functions wanting to turn on scrub-only XFS functionality should
  call ``xchk_fsgates_enable`` from the setup function to enable a specific
  hook.  This must be done before obtaining any resources that are used by
  memory reclaim.  Callers had better be sure they really need the
  functionality gated by the static key; the ``TRY_HARDER`` flag is useful
  here.
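The first three points combine into a pattern like the following.
``DEFINE_STATIC_KEY_FALSE``, ``static_branch_unlikely``,
``static_branch_inc``, and ``static_branch_dec`` are the regular kernel
jump label API; the hook itself is hypothetical:

.. code-block:: c

	#include <linux/jump_label.h>

	static DEFINE_STATIC_KEY_FALSE(xfs_example_hooks_switch);

	void xfs_example_hook_slowpath(struct xfs_mount *mp);

	/* Hot path: compiles to a nop sled until scrub flips the key. */
	static inline void
	xfs_example_hook(struct xfs_mount *mp)
	{
		if (static_branch_unlikely(&xfs_example_hooks_switch))
			xfs_example_hook_slowpath(mp);
	}

	/* Wrappers exported to scrub; easy to compile out entirely. */
	void
	xfs_example_hooks_enable(void)
	{
		static_branch_inc(&xfs_example_hooks_switch);
	}

	void
	xfs_example_hooks_disable(void)
	{
		static_branch_dec(&xfs_example_hooks_switch);
	}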
Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
handle locking AGI and AGF buffers for all scrubber functions.  If it
detects a conflict between scrub and the running transactions, it will try
to wait for intents to complete.  If the caller of the helper has not
enabled the static key, the helper will return -EDEADLOCK, which should
result in the scrub being restarted with the ``TRY_HARDER`` flag set.  The
scrub setup function should detect that flag, enable the static key, and
try the scrub again.  Scrub teardown disables all static keys obtained by
``xchk_fsgates_enable``.

For more information, please see the kernel documentation in
Documentation/staging/static-keys.rst.

.. _xfile:

Pageable Kernel Memory
----------------------

Some online checking functions work by scanning the filesystem to build a
shadow copy of an ondisk metadata structure in memory and comparing the two
copies.  For online repair to rebuild a metadata structure, it must compute
the record set that will be stored in the new structure before it can
persist that new structure to disk.  Ideally, repairs complete with a
single atomic commit that introduces a new data structure.  To meet these
goals, the kernel needs to collect a large amount of information in a place
that doesn't require the correct operation of the filesystem.

Kernel memory isn't suitable because:
h]h)}(hmAllocating a contiguous region of memory to create a C array is very difficult, especially on 32-bit systems.h]hmAllocating a contiguous region of memory to create a C array is very difficult, especially on 32-bit systems.}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMChjBubah}(h]h ]h"]h$]h&]uh1hhjAhhhhhNubh)}(hLinked lists of records introduce double pointer overhead which is very high and eliminate the possibility of indexed lookups. h]h)}(h~Linked lists of records introduce double pointer overhead which is very high and eliminate the possibility of indexed lookups.:h]h~Linked lists of records introduce double pointer overhead which is very high and eliminate the possibility of indexed lookups.}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMFhjBubah}(h]h ]h"]h$]h&]uh1hhjAhhhhhNubh)}(hIKernel memory is pinned, which can drive the system into OOM conditions. h]h)}(hHKernel memory is pinned, which can drive the system into OOM conditions.h]hHKernel memory is pinned, which can drive the system into OOM conditions.}(hj4BhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMIhj0Bubah}(h]h ]h"]h$]h&]uh1hhjAhhhhhNubh)}(hJThe system might not have sufficient memory to stage all the information. h]h)}(hIThe system might not have sufficient memory to stage all the information.h]hIThe system might not have sufficient memory to stage all the information.}(hjLBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMKhjHBubah}(h]h ]h"]h$]h&]uh1hhjAhhhhhNubeh}(h]h ]h"]h$]h&]jJj uh1hhhhMChjAhhubh)}(hXdAt any given time, online fsck does not need to keep the entire record set in memory, which means that individual records can be paged out if necessary. Continued development of online fsck demonstrated that the ability to perform indexed data storage would also be very useful. Fortunately, the Linux kernel already has a facility for byte-addressable and pageable storage: tmpfs. In-kernel graphics drivers (most notably i915) take advantage of tmpfs files to store intermediate data that doesn't need to be in memory at all times, so that usage precedent is already established. Hence, the ``xfile`` was born!h](hXSAt any given time, online fsck does not need to keep the entire record set in memory, which means that individual records can be paged out if necessary. Continued development of online fsck demonstrated that the ability to perform indexed data storage would also be very useful. Fortunately, the Linux kernel already has a facility for byte-addressable and pageable storage: tmpfs. In-kernel graphics drivers (most notably i915) take advantage of tmpfs files to store intermediate data that doesn’t need to be in memory at all times, so that usage precedent is already established. 
Hence, the }(hjfBhhhNhNubj)}(h ``xfile``h]hxfile}(hjnBhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjfBubh was born!}(hjfBhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMMhjAhhubj)}(hhh]j)}(hhh](j)}(hhh]h}(h]h ]h"]h$]h&]colwidthKJuh1jhjBubj)}(hhh](j)}(hhh]j)}(hhh]h)}(h**Historical Sidebar**:h](j)}(h**Historical Sidebar**h]hHistorical Sidebar}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjBubh:}(hjBhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMYhjBubah}(h]h ]h"]h$]h&]uh1jhjBubah}(h]h ]h"]h$]h&]uh1jhjBubj)}(hhh]j)}(hhh](h)}(hThe first edition of online repair inserted records into a new btree as it found them, which failed because filesystem could shut down with a built data structure, which would be live after recovery finished.h]hThe first edition of online repair inserted records into a new btree as it found them, which failed because filesystem could shut down with a built data structure, which would be live after recovery finished.}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM[hjBubh)}(hThe second edition solved the half-rebuilt structure problem by storing everything in memory, but frequently ran the system out of memory.h]hThe second edition solved the half-rebuilt structure problem by storing everything in memory, but frequently ran the system out of memory.}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM_hjBubh)}(hyThe third edition solved the OOM problem by using linked lists, but the memory overhead of the list pointers was extreme.h]hyThe third edition solved the OOM problem by using linked lists, but the memory overhead of the list pointers was extreme.}(hjBhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMbhjBubeh}(h]h ]h"]h$]h&]uh1jhjBubah}(h]h ]h"]h$]h&]uh1jhjBubeh}(h]h ]h"]h$]h&]uh1jhjBubeh}(h]h ]h"]h$]h&]colsKuh1jhjBubah}(h]h ]h"]h$]h&]uh1jhjAhhhhhNubh)}(hhh](h)}(hxfile Access Modelsh]hxfile Access Models}(hjChhhNhNubah}(h]h ]h"]h$]h&]jj0uh1hhjChhhhhMgubh)}(hBA survey of the intended uses of xfiles suggested these use cases:h]hBA survey of the intended uses of xfiles suggested these use cases:}(hj'ChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMihjChhubji)}(hhh](h)}(hbArrays of fixed-sized records (space management btrees, directory and extended attribute entries) h]h)}(haArrays of fixed-sized records (space management btrees, directory and extended attribute entries)h]haArrays of fixed-sized records (space management btrees, directory and extended attribute entries)}(hjSparse arrays of fixed-sized records (quotas and link counts) h]h)}(h=Sparse arrays of fixed-sized records (quotas and link counts)h]h=Sparse arrays of fixed-sized records (quotas and link counts)}(hjTChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMnhjPCubah}(h]h ]h"]h$]h&]uh1hhj5ChhhhhNubh)}(hcLarge binary objects (BLOBs) of variable sizes (directory and extended attribute names and values) h]h)}(hbLarge binary objects (BLOBs) of variable sizes (directory and extended attribute names and values)h]hbLarge binary objects (BLOBs) of variable sizes (directory and extended attribute names and values)}(hjlChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMphjhCubah}(h]h ]h"]h$]h&]uh1hhj5ChhhhhNubh)}(h2Staging btrees in memory (reverse mapping btrees) h]h)}(h1Staging btrees in memory (reverse mapping btrees)h]h1Staging btrees in memory (reverse mapping btrees)}(hjChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMshjCubah}(h]h ]h"]h$]h&]uh1hhj5ChhhhhNubh)}(h/Arbitrary contents (realtime space management) h]h)}(h.Arbitrary contents (realtime space management)h]h.Arbitrary contents (realtime space management)}(hjChhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMuhjCubah}(h]h ]h"]h$]h&]uh1hhj5ChhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjChhhhhMkubh)}(hXYTo support the 
first four use cases, high level data structures wrap the xfile to share functionality between online fsck functions. The rest of this section discusses the interfaces that the xfile presents to four of those five higher level data structures. The fifth use case is discussed in the :ref:`realtime summary ` case study.h](hX*To support the first four use cases, high level data structures wrap the xfile to share functionality between online fsck functions. The rest of this section discusses the interfaces that the xfile presents to four of those five higher level data structures. The fifth use case is discussed in the }(hjChhhNhNubh)}(h#:ref:`realtime summary `h]j)}(hjCh]hrealtime summary}(hjChhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjCubah}(h]h ]h"]h$]h&]refdocj refdomainjCreftyperef refexplicitrefwarnj rtsummaryuh1hhhhMwhjCubh case study.}(hjChhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMwhjChhubh)}(hXXFS is very record-based, which suggests that the ability to load and store complete records is important. To support these cases, a pair of ``xfile_load`` and ``xfile_store`` functions are provided to read and persist objects into an xfile that treat any error as an out of memory error. For online repair, squashing error conditions in this manner is an acceptable behavior because the only reaction is to abort the operation back to userspace.h](hXFS is very record-based, which suggests that the ability to load and store complete records is important. To support these cases, a pair of }(hjChhhNhNubj)}(h``xfile_load``h]h xfile_load}(hjChhhNhNubah}(h]h ]h"]h$]h&]uh1jhjCubh and }(hjChhhNhNubj)}(h``xfile_store``h]h xfile_store}(hjDhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjCubhX functions are provided to read and persist objects into an xfile that treat any error as an out of memory error. For online repair, squashing error conditions in this manner is an acceptable behavior because the only reaction is to abort the operation back to userspace.}(hjChhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM~hjChhubh)}(hXHowever, no discussion of file access idioms is complete without answering the question, "But what about mmap?" It is convenient to access storage directly with pointers, just like userspace code does with regular memory. Online fsck must not drive the system into OOM conditions, which means that xfiles must be responsive to memory reclamation. tmpfs can only push a pagecache folio to the swap cache if the folio is neither pinned nor locked, which means the xfile must not pin too many folios.h]hXHowever, no discussion of file access idioms is complete without answering the question, “But what about mmap?” It is convenient to access storage directly with pointers, just like userspace code does with regular memory. Online fsck must not drive the system into OOM conditions, which means that xfiles must be responsive to memory reclamation. tmpfs can only push a pagecache folio to the swap cache if the folio is neither pinned nor locked, which means the xfile must not pin too many folios.}(hjDhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjChhubh)}(hX Short term direct access to xfile contents is done by locking the pagecache folio and mapping it into kernel address space. Object load and store uses this mechanism. Folio locks are not supposed to be held for long periods of time, so long term direct access to xfile contents is done by bumping the folio refcount, mapping it into kernel address space, and dropping the folio lock. 
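As a concrete illustration of the load/store idiom described above, here is a minimal sketch that stages a hypothetical fixed-size record in an xfile; the ``xfile_load`` and ``xfile_store`` signatures, and the assumption that a never-written range loads back as zeroes, are taken from this section's description rather than from any authoritative API:

.. code-block:: c

	struct xexample_rec {
		u64	key;
		u64	count;
	};

	static int
	xexample_bump_count(struct xfile *xf, u64 key)
	{
		struct xexample_rec	rec;
		loff_t			pos = key * sizeof(rec);
		int			error;

		/* Assumed: a never-written range loads back as all zeroes. */
		error = xfile_load(xf, &rec, sizeof(rec), pos);
		if (error)
			return error;	/* treated as an out of memory error */

		rec.key = key;
		rec.count++;

		/* Persist the record; tmpfs may page it out under pressure. */
		return xfile_store(xf, &rec, sizeof(rec), pos);
	}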
These long term users *must* be responsive to memory reclaim by hooking into the shrinker infrastructure to know when to release folios.h](hXShort term direct access to xfile contents is done by locking the pagecache folio and mapping it into kernel address space. Object load and store uses this mechanism. Folio locks are not supposed to be held for long periods of time, so long term direct access to xfile contents is done by bumping the folio refcount, mapping it into kernel address space, and dropping the folio lock. These long term users }(hj(DhhhNhNubj7)}(h*must*h]hmust}(hj0DhhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj(Dubhl be responsive to memory reclaim by hooking into the shrinker infrastructure to know when to release folios.}(hj(DhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjChhubh)}(hX'The ``xfile_get_folio`` and ``xfile_put_folio`` functions are provided to retrieve the (locked) folio that backs part of an xfile and to release it. The only code to use these folio lease functions are the xfarray :ref:`sorting` algorithms and the :ref:`in-memory btrees`.h](hThe }(hjHDhhhNhNubj)}(h``xfile_get_folio``h]hxfile_get_folio}(hjPDhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjHDubh and }(hjHDhhhNhNubj)}(h``xfile_put_folio``h]hxfile_put_folio}(hjbDhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjHDubh functions are provided to retrieve the (locked) folio that backs part of an xfile and to release it. The only code to use these folio lease functions are the xfarray }(hjHDhhhNhNubh)}(h:ref:`sorting`h]j)}(hjvDh]hsorting}(hjxDhhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjtDubah}(h]h ]h"]h$]h&]refdocj refdomainjDreftyperef refexplicitrefwarnj xfarray_sortuh1hhhhMhjHDubh algorithms and the }(hjHDhhhNhNubh)}(h :ref:`in-memory btrees`h]j)}(hjDh]hin-memory btrees}(hjDhhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjDubah}(h]h ]h"]h$]h&]refdocj refdomainjDreftyperef refexplicitrefwarnjxfbtreeuh1hhhhMhjHDubh.}(hjHDhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjChhubeh}(h]j6ah ]h"]xfile access modelsah$]h&]uh1hhjAhhhhhMgubh)}(hhh](h)}(hxfile Access Coordinationh]hxfile Access Coordination}(hjDhhhNhNubah}(h]h ]h"]h$]h&]jjRuh1hhjDhhhhhMubh)}(hX For security reasons, xfiles must be owned privately by the kernel. They are marked ``S_PRIVATE`` to prevent interference from the security system, must never be mapped into process file descriptor tables, and their pages must never be mapped into userspace processes.h](hTFor security reasons, xfiles must be owned privately by the kernel. They are marked }(hjDhhhNhNubj)}(h ``S_PRIVATE``h]h S_PRIVATE}(hjDhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjDubh to prevent interference from the security system, must never be mapped into process file descriptor tables, and their pages must never be mapped into userspace processes.}(hjDhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjDhhubh)}(hXTo avoid locking recursion issues with the VFS, all accesses to the shmfs file are performed by manipulating the page cache directly. xfile writers call the ``->write_begin`` and ``->write_end`` functions of the xfile's address space to grab writable pages, copy the caller's buffer into the page, and release the pages. xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly before copying the contents into the caller's buffer. In other words, xfiles ignore the VFS read and write code paths to avoid having to create a dummy ``struct kiocb`` and to avoid taking inode and freeze locks. 
tmpfs cannot be frozen, and xfiles must not be exposed to userspace.h](hTo avoid locking recursion issues with the VFS, all accesses to the shmfs file are performed by manipulating the page cache directly. xfile writers call the }(hjDhhhNhNubj)}(h``->write_begin``h]h ->write_begin}(hjEhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjDubh and }(hjDhhhNhNubj)}(h``->write_end``h]h ->write_end}(hjEhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjDubh functions of the xfile’s address space to grab writable pages, copy the caller’s buffer into the page, and release the pages. xfile readers call }(hjDhhhNhNubj)}(h``shmem_read_mapping_page_gfp``h]hshmem_read_mapping_page_gfp}(hj&EhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjDubh to grab pages directly before copying the contents into the caller’s buffer. In other words, xfiles ignore the VFS read and write code paths to avoid having to create a dummy }(hjDhhhNhNubj)}(h``struct kiocb``h]h struct kiocb}(hj8EhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjDubhq and to avoid taking inode and freeze locks. tmpfs cannot be frozen, and xfiles must not be exposed to userspace.}(hjDhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjDhhubh)}(hX5If an xfile is shared between threads to stage repairs, the caller must provide its own locks to coordinate access. For example, if a scrub function stores scan results in an xfile and needs other threads to provide updates to the scanned data, the scrub function must provide a lock for all threads to share.h]hX5If an xfile is shared between threads to stage repairs, the caller must provide its own locks to coordinate access. For example, if a scrub function stores scan results in an xfile and needs other threads to provide updates to the scanned data, the scrub function must provide a lock for all threads to share.}(hjPEhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjDhhubh)}(h .. _xfarray:h]h}(h]h ]h"]h$]h&]jxfarrayuh1hhMhjDhhhhubeh}(h]jXah ]h"]xfile access coordinationah$]h&]uh1hhjAhhhhhMubh)}(hhh](h)}(hArrays of Fixed-Sized Recordsh]hArrays of Fixed-Sized Records}(hjsEhhhNhNubah}(h]h ]h"]h$]h&]jjtuh1hhjpEhhhhhMubh)}(hX.In XFS, each type of indexed space metadata (free space, inodes, reference counts, file fork space, and reverse mappings) consists of a set of fixed-size records indexed with a classic B+ tree. Directories have a set of fixed-size dirent records that point to the names, and extended attributes have a set of fixed-size attribute keys that point to names and values. Quota counters and file link counters index records with numbers. During a repair, scrub needs to stage new records during the gathering step and retrieve them during the btree building step.h]hX.In XFS, each type of indexed space metadata (free space, inodes, reference counts, file fork space, and reverse mappings) consists of a set of fixed-size records indexed with a classic B+ tree. Directories have a set of fixed-size dirent records that point to the names, and extended attributes have a set of fixed-size attribute keys that point to names and values. Quota counters and file link counters index records with numbers. During a repair, scrub needs to stage new records during the gathering step and retrieve them during the btree building step.}(hjEhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjpEhhubh)}(hXAlthough this requirement can be satisfied by calling the read and write methods of the xfile directly, it is simpler for callers for there to be a higher level abstraction to take care of computing array offsets, to provide iterator functions, and to deal with sparse records and sorting. 
The ``xfarray`` abstraction presents a linear array for fixed-size records atop the byte-accessible xfile.

.. _xfarray_access_patterns:

Array Access Patterns
^^^^^^^^^^^^^^^^^^^^^

Array access patterns in online fsck tend to fall into three categories. Iteration of records is assumed to be necessary for all cases and will be covered in the next section.

The first type of caller handles records that are indexed by position. Gaps may exist between records, and a record may be updated multiple times during the collection step. In other words, these callers want a sparse linearly addressed table file. The typical use cases are quota records and file link count records. Access to array elements is performed programmatically via the ``xfarray_load`` and ``xfarray_store`` functions, which wrap the similarly-named xfile functions to provide loading and storing of array elements at arbitrary array indices. Gaps are defined to be null records, and null records are defined to be a sequence of all zero bytes. Null records are detected by calling ``xfarray_element_is_null``. They are created either by calling ``xfarray_unset`` to null out an existing record or by never storing anything to an array index.

The second type of caller handles records that are not indexed by position and do not require multiple updates to a record. The typical use case here is rebuilding space btrees and key/value btrees.
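Returning to the first, positional pattern for a moment: a minimal sketch using hypothetical per-inode link count records, with the ``xfarray_*`` signatures assumed from the description above (the append-based second pattern continues just below):

.. code-block:: c

	struct xexample_nlink {
		u64	nlink;
	};

	static int
	xexample_get_nlink(struct xfarray *array, xfarray_idx_t inum, u64 *nlink)
	{
		struct xexample_nlink	rec;
		int			error;

		/* Assumed: loads from gaps return the all-zeroes null record. */
		error = xfarray_load(array, inum, &rec);
		if (error)
			return error;

		/* A null record means no link count was ever recorded. */
		if (xfarray_element_is_null(array, &rec))
			*nlink = 0;
		else
			*nlink = rec.nlink;
		return 0;
	}

	static int
	xexample_set_nlink(struct xfarray *array, xfarray_idx_t inum, u64 nlink)
	{
		struct xexample_nlink	rec = { .nlink = nlink };

		/* Unsetting nulls out the record, recreating a gap. */
		if (nlink == 0)
			return xfarray_unset(array, inum);

		/* Records can be stored at arbitrary indices and updated later. */
		return xfarray_store(array, inum, &rec);
	}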
These callers can add records to the array without caring about array indices via the ``xfarray_append`` function, which stores a record at the end of the array. For callers that require records to be presentable in a specific order (e.g. rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted records; this function will be covered later.h](hXThe second type of caller handles records that are not indexed by position and do not require multiple updates to a record. The typical use case here is rebuilding space btrees and key/value btrees. These callers can add records to the array without caring about array indices via the }(hj/FhhhNhNubj)}(h``xfarray_append``h]hxfarray_append}(hj7FhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj/Fubh function, which stores a record at the end of the array. For callers that require records to be presentable in a specific order (e.g. rebuilding btree data), the }(hj/FhhhNhNubj)}(h``xfarray_sort``h]h xfarray_sort}(hjIFhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj/FubhN function can arrange the sorted records; this function will be covered later.}(hj/FhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjEhhubh)}(hXThe third type of caller is a bag, which is useful for counting records. The typical use case here is constructing space extent reference counts from reverse mapping information. Records can be put in the bag in any order, they can be removed from the bag at any time, and uniqueness of records is left to callers. The ``xfarray_store_anywhere`` function is used to insert a record in any null record slot in the bag; and the ``xfarray_unset`` function removes a record from the bag.h](hX?The third type of caller is a bag, which is useful for counting records. The typical use case here is constructing space extent reference counts from reverse mapping information. Records can be put in the bag in any order, they can be removed from the bag at any time, and uniqueness of records is left to callers. The }(hjaFhhhNhNubj)}(h``xfarray_store_anywhere``h]hxfarray_store_anywhere}(hjiFhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjaFubhQ function is used to insert a record in any null record slot in the bag; and the }(hjaFhhhNhNubj)}(h``xfarray_unset``h]h xfarray_unset}(hj{FhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjaFubh( function removes a record from the bag.}(hjaFhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjEhhubh)}(hThe proposed patchset is the `big in-memory array `_.h](hThe proposed patchset is the }(hjFhhhNhNubj)}(hn`big in-memory array `_h]hbig in-memory array}(hjFhhhNhNubah}(h]h ]h"]h$]h&]namebig in-memory arrayjjUhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-arrayuh1jhjFubh)}(hX h]h}(h]big-in-memory-arrayah ]h"]big in-memory arrayah$]h&]refurijFuh1hjyKhjFubh.}(hjFhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjEhhubeh}(h](jjEeh ]h"](array access patternsxfarray_access_patternseh$]h&]uh1hhjpEhhhhhMj}jFjEsj}jEjEsubh)}(hhh](h)}(hIterating Array Elementsh]hIterating Array Elements}(hjFhhhNhNubah}(h]h ]h"]h$]h&]jjuh1hhjFhhhhhMubh)}(hMost users of the xfarray require the ability to iterate the records stored in the array. Callers can probe every possible array index with the following:h]hMost users of the xfarray require the ability to iterate the records stored in the array. 
Callers can probe every possible array index with the following:

.. code-block:: c

	xfarray_idx_t i;
	foreach_xfarray_idx(array, i) {
	    xfarray_load(array, i, &rec);
	    /* do something with rec */
	}

All users of this idiom must be prepared to handle null records or must already know that there aren't any.

For xfarray users that want to iterate a sparse array, the ``xfarray_iter`` function ignores indices in the xfarray that have never been written to by calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas of the array that are not populated with memory pages. Once it finds a page, it will skip the zeroed areas of the page.

.. code-block:: c

	xfarray_idx_t i = XFARRAY_CURSOR_INIT;
	while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
	    /* do something with rec */
	}

.. _xfarray_sort:

Sorting Array Elements
^^^^^^^^^^^^^^^^^^^^^^

During the fourth demonstration of online repair, a community reviewer remarked that for performance reasons, online repair ought to load batches of records into btree record blocks instead of inserting records into a new btree one at a time. The btree insertion code in XFS is responsible for maintaining correct ordering of the records, so naturally the xfarray must also support sorting the record set prior to bulk loading.

Case Study: Sorting xfarrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The sorting algorithm used in the xfarray is actually a combination of adaptive quicksort and a heapsort subalgorithm in the spirit of `Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and `pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux kernel.
To sort records in a reasonably short amount of time, ``xfarray`` takes advantage of the binary subpartitioning offered by quicksort, but it also uses heapsort to hedge against performance collapse if the chosen quicksort pivots are poor. Both algorithms are (in general) O(n * lg(n)), but there is a wide performance gulf between the two implementations.h](hThe sorting algorithm used in the xfarray is actually a combination of adaptive quicksort and a heapsort subalgorithm in the spirit of }(hjGhhhNhNubj)}(h:`Sedgewick `_h]h Sedgewick}(hjGhhhNhNubah}(h]h ]h"]h$]h&]name Sedgewickjj+https://algs4.cs.princeton.edu/23quicksort/uh1jhjGubh)}(h. h]h}(h] sedgewickah ]h"] sedgewickah$]h&]refurijGuh1hjyKhjGubh and }(hjGhhhNhNubj)}(h,`pdqsort `_h]hpdqsort}(hjGhhhNhNubah}(h]h ]h"]h$]h&]namepdqsortjjhttps://github.com/orlp/pdqsortuh1jhjGubh)}(h" h]h}(h]pdqsortah ]h"]pdqsortah$]h&]refurijGuh1hjyKhjGubhb, with customizations for the Linux kernel. To sort records in a reasonably short amount of time, }(hjGhhhNhNubj)}(h ``xfarray``h]hxfarray}(hjGhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjGubhX" takes advantage of the binary subpartitioning offered by quicksort, but it also uses heapsort to hedge against performance collapse if the chosen quicksort pivots are poor. Both algorithms are (in general) O(n * lg(n)), but there is a wide performance gulf between the two implementations.}(hjGhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM+hjGhhubh)}(hThe Linux kernel already contains a reasonably fast implementation of heapsort. It only operates on regular C arrays, which limits the scope of its usefulness. There are two key places where the xfarray uses it:h]hThe Linux kernel already contains a reasonably fast implementation of heapsort. It only operates on regular C arrays, which limits the scope of its usefulness. There are two key places where the xfarray uses it:}(hjHhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM7hjGhhubh)}(hhh](h)}(h9Sorting any record subset backed by a single xfile page. h]h)}(h8Sorting any record subset backed by a single xfile page.h]h8Sorting any record subset backed by a single xfile page.}(hjHhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM;hjHubah}(h]h ]h"]h$]h&]uh1hhjHhhhhhNubh)}(hLoading a small number of xfarray records from potentially disparate parts of the xfarray into a memory buffer, and sorting the buffer. h]h)}(hLoading a small number of xfarray records from potentially disparate parts of the xfarray into a memory buffer, and sorting the buffer.h]hLoading a small number of xfarray records from potentially disparate parts of the xfarray into a memory buffer, and sorting the buffer.}(hj/HhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM=hj+Hubah}(h]h ]h"]h$]h&]uh1hhjHhhhhhNubeh}(h]h ]h"]h$]h&]jJj uh1hhhhM;hjGhhubh)}(hIn other words, ``xfarray`` uses heapsort to constrain the nested recursion of quicksort, thereby mitigating quicksort's worst runtime behavior.h](hIn other words, }(hjIHhhhNhNubj)}(h ``xfarray``h]hxfarray}(hjQHhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjIHubhw uses heapsort to constrain the nested recursion of quicksort, thereby mitigating quicksort’s worst runtime behavior.}(hjIHhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM@hjGhhubh)}(hXChoosing a quicksort pivot is a tricky business. A good pivot splits the set to sort in half, leading to the divide and conquer behavior that is crucial to O(n * lg(n)) performance. A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`) runtime. 
The xfarray sort routine tries to avoid picking a bad pivot by sampling nine records into a memory buffer and using the kernel heapsort to identify the median of the nine.h](hChoosing a quicksort pivot is a tricky business. A good pivot splits the set to sort in half, leading to the divide and conquer behavior that is crucial to O(n * lg(n)) performance. A poor pivot barely splits the subset at all, leading to O(n }(hjiHhhhNhNubh superscript)}(h:sup:`2`h]h2}(hjsHhhhNhNubah}(h]h ]h"]h$]h&]uh1jqHhjiHubh) runtime. The xfarray sort routine tries to avoid picking a bad pivot by sampling nine records into a memory buffer and using the kernel heapsort to identify the median of the nine.}(hjiHhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMChjGhhubh)}(hXMost modern quicksort implementations employ Tukey's "ninther" to select a pivot from a classic C array. Typical ninther implementations pick three unique triads of records, sort each of the triads, and then sort the middle value of each triad to determine the ninther value. As stated previously, however, xfile accesses are not entirely cheap. It turned out to be much more performant to read the nine elements into a memory buffer, run the kernel's in-memory heapsort on the buffer, and choose the 4th element of that buffer as the pivot. Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for low-effort robust (resistant) location in large samples`, in *Contributions to Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press, 1978), pp. 251–257.h](hXWMost modern quicksort implementations employ Tukey’s “ninther” to select a pivot from a classic C array. Typical ninther implementations pick three unique triads of records, sort each of the triads, and then sort the middle value of each triad to determine the ninther value. As stated previously, however, xfile accesses are not entirely cheap. It turned out to be much more performant to read the nine elements into a memory buffer, run the kernel’s in-memory heapsort on the buffer, and choose the 4th element of that buffer as the pivot. Tukey’s ninthers are described in J. W. Tukey, }(hjHhhhNhNubhtitle_reference)}(hV`The ninther, a technique for low-effort robust (resistant) location in large samples`h]hTThe ninther, a technique for low-effort robust (resistant) location in large samples}(hjHhhhNhNubah}(h]h ]h"]h$]h&]uh1jHhjHubh, in }(hjHhhhNhNubj7)}(h9*Contributions to Survey Sampling and Applied Statistics*h]h7Contributions to Survey Sampling and Applied Statistics}(hjHhhhNhNubah}(h]h ]h"]h$]h&]uh1j6hjHubh<, edited by H. David, (Academic Press, 1978), pp. 251–257.}(hjHhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMLhjGhhubh)}(hXThe partitioning of quicksort is fairly textbook -- rearrange the record subset around the pivot, then set up the current and next stack frames to sort with the larger and the smaller halves of the pivot, respectively. This keeps the stack space requirements to log2(record count).h]hXThe partitioning of quicksort is fairly textbook -- rearrange the record subset around the pivot, then set up the current and next stack frames to sort with the larger and the smaller halves of the pivot, respectively. This keeps the stack space requirements to log2(record count).}(hjHhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMZhjGhhubh)}(hXCAs a final performance optimization, the hi and lo scanning phase of quicksort keeps examined xfile pages mapped in the kernel for as long as possible to reduce map/unmap cycles. 
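To make the pivot selection above concrete, here is a sketch of the median-of-nine sampling, assuming a fixed example record type (the real code is generic over record size) and a subset of at least nine records:

.. code-block:: c

	#include <linux/sort.h>
	#include <linux/string.h>

	struct xexample_rec {
		u64	key;
	};

	static int
	xexample_cmp(const void *a, const void *b)
	{
		const struct xexample_rec *ra = a;
		const struct xexample_rec *rb = b;

		if (ra->key < rb->key)
			return -1;
		return ra->key > rb->key;
	}

	/* Pick a quicksort pivot as the median of nine sampled records. */
	static int
	xexample_pick_pivot(struct xfarray *array, xfarray_idx_t lo,
			xfarray_idx_t hi, struct xexample_rec *pivot)
	{
		struct xexample_rec	samples[9];
		xfarray_idx_t		step = (hi - lo) / 8;
		int			i, error;

		/* Load nine roughly evenly spaced records into a buffer. */
		for (i = 0; i < 9; i++) {
			error = xfarray_load(array, lo + i * step, &samples[i]);
			if (error)
				return error;
		}

		/* The kernel's sort() is a heapsort, run on the tiny buffer... */
		sort(samples, 9, sizeof(struct xexample_rec), xexample_cmp, NULL);

		/* ...and the middle element becomes the pivot. */
		memcpy(pivot, &samples[4], sizeof(*pivot));
		return 0;
	}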
Surprisingly, this mapping optimization reduces overall sort runtime by nearly half again after accounting for the application of heapsort directly onto xfile pages.

.. _xfblob:

Blob Storage
````````````

Extended attributes and directories add an additional requirement for staging records: arbitrary byte sequences of finite length. Each directory entry record needs to store the entry name, and each extended attribute needs to store both the attribute name and value. The names, keys, and values can consume a large amount of memory, so the ``xfblob`` abstraction was created to simplify management of these blobs atop an xfile.

Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve and persist objects. The store function returns a magic cookie for every object that it persists. Later, callers provide this cookie to ``xfblob_load`` to recall the object. The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate`` function frees them all because compaction is not needed.

The details of repairing directories and extended attributes will be discussed in a subsequent section about atomic file content exchanges.
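A minimal sketch of the cookie idiom just described, staging a hypothetical directory entry name; the ``xfblob_*`` signatures are assumptions based on the text:

.. code-block:: c

	struct xexample_dirent {
		xfblob_cookie	name_cookie;
		u8		namelen;
		/* other staged dirent fields */
	};

	static int
	xexample_stage_name(struct xfblob *blobs, struct xexample_dirent *entry,
			const unsigned char *name, u8 namelen)
	{
		entry->namelen = namelen;

		/* Store the name bytes; the cookie recalls them later. */
		return xfblob_store(blobs, &entry->name_cookie, name, namelen);
	}

	static int
	xexample_recall_name(struct xfblob *blobs,
			const struct xexample_dirent *entry, unsigned char *name)
	{
		/* Load back exactly the bytes stored under this cookie. */
		return xfblob_load(blobs, entry->name_cookie, name,
				entry->namelen);
	}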
However, it should be noted that these repair functions only use blob storage to cache a small number of entries before adding them to a temporary ondisk file, which is why compaction is not required.

The proposed patchset is at the start of the `extended attribute repair <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.

.. _xfbtree:

In-Memory B+Trees
`````````````````

The chapter about :ref:`secondary metadata <secondary_metadata>` mentioned that checking and repairing of secondary metadata commonly requires coordination between a live metadata scan of the filesystem and writer threads that are updating that metadata. Keeping the scan data up to date requires the ability to propagate metadata updates from the filesystem into the data being collected by the scan. This *can* be done by appending concurrent updates into a separate log file and applying them before writing the new metadata to disk, but this leads to unbounded memory consumption if the rest of the system is very busy. Another option is to skip the side-log and commit live updates from the filesystem directly into the scan data, which trades more overhead for a lower maximum memory requirement. In both cases, the data structure holding the scan results must support indexed access to perform well.
Given that indexed lookups of scan data are required for both strategies, online fsck employs the second strategy of committing live updates directly into scan data. Because xfarrays are not indexed and do not enforce record ordering, they are not suitable for this task. Conveniently, however, XFS has a library to create and maintain ordered reverse mapping records: the existing rmap btree code! If only there were a means to create one in memory.

Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a regular file, which means that the kernel can create byte or block addressable virtual address spaces at will. The XFS buffer cache specializes in abstracting IO to block-oriented address spaces, which means that adaptation of the buffer cache to interface with xfiles enables reuse of the entire btree library. Btrees built atop an xfile are collectively known as ``xfbtrees``. The next few sections describe how they actually work.

The proposed patchset is the `in-memory btree <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_ series.

Using xfiles as a Buffer Cache Target
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Two modifications are necessary to support xfiles as a buffer cache target. The first is to make it possible for the ``struct xfs_buftarg`` structure to host the ``struct xfs_buf`` rhashtable, because normally those are held by a per-AG structure. The second change is to modify the buffer ``ioapply`` function to "read" cached pages from the xfile and "write" cached pages back to the xfile. Multiple access to individual buffers is controlled by the ``xfs_buf`` lock, since the xfile does not provide any locking on its own.
With this adaptation in place, users of the xfile-backed buffer cache use exactly the same APIs as users of the disk-backed buffer cache. The separation between xfile and buffer cache implies higher memory usage since they do not share pages, but this property could some day enable transactional updates to an in-memory btree. Today, however, it simply eliminates the need for new code.h](huTwo modifications are necessary to support xfiles as a buffer cache target. The first is to make it possible for the }(hjJhhhNhNubj)}(h``struct xfs_buftarg``h]hstruct xfs_buftarg}(hjJhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjJubh structure to host the }(hjJhhhNhNubj)}(h``struct xfs_buf``h]hstruct xfs_buf}(hjJhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjJubhn rhashtable, because normally those are held by a per-AG structure. The second change is to modify the buffer }(hjJhhhNhNubj)}(h ``ioapply``h]hioapply}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjJubh function to “read” cached pages from the xfile and “write” cached pages back to the xfile. Multiple access to individual buffers is controlled by the }(hjJhhhNhNubj)}(h ``xfs_buf``h]hxfs_buf}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjJubhX lock, since the xfile does not provide any locking on its own. With this adaptation in place, users of the xfile-backed buffer cache use exactly the same APIs as users of the disk-backed buffer cache. The separation between xfile and buffer cache implies higher memory usage since they do not share pages, but this property could some day enable transactional updates to an in-memory btree. Today, however, it simply eliminates the need for new code.}(hjJhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjJhhubeh}(h]jwah ]h"]%using xfiles as a buffer cache targetah$]h&]uh1hhjIhhhhhMubh)}(hhh](h)}(h Space Management with an xfbtreeh]h Space Management with an xfbtree}(hj5KhhhNhNubah}(h]h ]h"]h$]h&]jjuh1hhj2KhhhhhMubh)}(hXiSpace management for an xfile is very simple -- each btree block is one memory page in size. These blocks use the same header format as an on-disk btree, but the in-memory block verifiers ignore the checksums, assuming that xfile memory is no more corruption-prone than regular DRAM. Reusing existing code here is more important than absolute memory efficiency.h]hXiSpace management for an xfile is very simple -- each btree block is one memory page in size. These blocks use the same header format as an on-disk btree, but the in-memory block verifiers ignore the checksums, assuming that xfile memory is no more corruption-prone than regular DRAM. Reusing existing code here is more important than absolute memory efficiency.}(hjCKhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj2Khhubh)}(hThe very first block of an xfile backing an xfbtree contains a header block. The header describes the owner, height, and the block number of the root xfbtree block.h]hThe very first block of an xfile backing an xfbtree contains a header block. The header describes the owner, height, and the block number of the root xfbtree block.}(hjQKhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj2Khhubh)}(hXtTo allocate a btree block, use ``xfile_seek_data`` to find a gap in the file. If there are no gaps, create one by extending the length of the xfile. Preallocate space for the block with ``xfile_prealloc``, and hand back the location. 
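An illustrative sketch of this allocation scheme follows; the loop structure and the exact ``xfile_seek_data`` and ``xfile_prealloc`` signatures are assumptions, and the matching free path (via ``xfile_discard``) is described next in the text:

.. code-block:: c

	static int
	xexample_xfbtree_alloc_block(struct xfile *xf, loff_t *isizep,
			loff_t *posp)
	{
		loff_t		pos = 0;
		loff_t		data;
		int		error;

		/* Scan one btree block (page) at a time looking for a hole. */
		while (pos < *isizep) {
			data = xfile_seek_data(xf, pos);
			if (data < 0 || data > pos)
				break;	/* pos is not backed by data: reuse it */
			pos += PAGE_SIZE;
		}

		/* No gaps found: extend the xfile by one block instead. */
		if (pos >= *isizep)
			*isizep = pos + PAGE_SIZE;

		/* Make sure memory is actually allocated for the new block. */
		error = xfile_prealloc(xf, pos, PAGE_SIZE);
		if (error)
			return error;

		*posp = pos;
		return 0;
	}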
To free an xfbtree block, use ``xfile_discard`` (which internally uses ``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.h](hTo allocate a btree block, use }(hj_KhhhNhNubj)}(h``xfile_seek_data``h]hxfile_seek_data}(hjgKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_Kubh to find a gap in the file. If there are no gaps, create one by extending the length of the xfile. Preallocate space for the block with }(hj_KhhhNhNubj)}(h``xfile_prealloc``h]hxfile_prealloc}(hjyKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_Kubh<, and hand back the location. To free an xfbtree block, use }(hj_KhhhNhNubj)}(h``xfile_discard``h]h xfile_discard}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_Kubh (which internally uses }(hj_KhhhNhNubj)}(h``FALLOC_FL_PUNCH_HOLE``h]hFALLOC_FL_PUNCH_HOLE}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_Kubh+) to remove the memory page from the xfile.}(hj_KhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhj2Khhubeh}(h]jah ]h"] space management with an xfbtreeah$]h&]uh1hhjIhhhhhMubh)}(hhh](h)}(hPopulating an xfbtreeh]hPopulating an xfbtree}(hjKhhhNhNubah}(h]h ]h"]h$]h&]jjuh1hhjKhhhhhMubh)}(hRAn online fsck function that wants to create an xfbtree should proceed as follows:h]hRAn online fsck function that wants to create an xfbtree should proceed as follows:}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjKhhubji)}(hhh](h)}(h*Call ``xfile_create`` to create an xfile. h]h)}(h)Call ``xfile_create`` to create an xfile.h](hCall }(hjKhhhNhNubj)}(h``xfile_create``h]h xfile_create}(hjKhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjKubh to create an xfile.}(hjKhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjKubah}(h]h ]h"]h$]h&]uh1hhjKhhhhhNubh)}(hcCall ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure pointing to the xfile. h]h)}(hbCall ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure pointing to the xfile.h](hCall }(hj LhhhNhNubj)}(h``xfs_alloc_memory_buftarg``h]hxfs_alloc_memory_buftarg}(hjLhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj LubhA to create a buffer cache target structure pointing to the xfile.}(hj LhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjLubah}(h]h ]h"]h$]h&]uh1hhjKhhhhhNubh)}(hXPass the buffer cache target, buffer ops, and other information to ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an initial root block to the xfile. Each btree type should define a wrapper that passes necessary arguments to the creation function. For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of all the necessary details for callers. h]h)}(hXPass the buffer cache target, buffer ops, and other information to ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an initial root block to the xfile. Each btree type should define a wrapper that passes necessary arguments to the creation function. For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of all the necessary details for callers.h](hCPass the buffer cache target, buffer ops, and other information to }(hj6LhhhNhNubj)}(h``xfbtree_init``h]h xfbtree_init}(hj>LhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj6Lubh to initialize the passed in }(hj6LhhhNhNubj)}(h``struct xfbtree``h]hstruct xfbtree}(hjPLhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj6Lubh and write an initial root block to the xfile. Each btree type should define a wrapper that passes necessary arguments to the creation function. 
For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of all the necessary details for callers.

4. Pass the xfbtree object to the btree cursor creation function for the btree type. Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this for callers.

5. Pass the btree cursor to the regular btree functions to make queries against and to update the in-memory btree. For example, a btree cursor for an rmap xfbtree can be passed to the ``xfs_rmap_*`` functions just like any other btree cursor. See the :ref:`next section <xfbtree_commit>` for information on dealing with xfbtree updates that are logged to a transaction.

6. When finished, delete the btree cursor, destroy the xfbtree object, free the buffer target, and then destroy the xfile to release all resources. A sketch of this whole sequence appears below.

.. _xfbtree_commit:

Committing Logged xfbtree Buffers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although it is a clever hack to reuse the rmap btree code to handle the staging structure, the ephemeral nature of the in-memory btree block storage presents some challenges of its own.
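Here is the sketch promised above, tying together the creation steps for an in-memory rmap btree before turning to the logging problem; every function signature below is an assumption based on the names in the text:

.. code-block:: c

	static int
	xexample_rmap_xfbtree(struct xfs_mount *mp, xfs_agnumber_t agno)
	{
		struct xfbtree		xfbt;
		struct xfs_buftarg	*btp;
		struct xfs_btree_cur	*cur;
		struct xfile		*xf;
		int			error;

		/* Step 1: create the xfile (signature assumed). */
		error = xfile_create("rmap staging", 0, &xf);
		if (error)
			return error;

		/* Step 2: wrap it in a buffer cache target (signature assumed). */
		error = xfs_alloc_memory_buftarg(mp, xf, &btp);
		if (error)
			goto out_xfile;

		/* Step 3: write the initial root block via the rmapbt wrapper. */
		error = xfs_rmapbt_mem_create(mp, agno, btp, &xfbt);
		if (error)
			goto out_btp;

		/* Step 4: create a btree cursor for the in-memory btree. */
		cur = xfs_rmapbt_mem_cursor(mp, NULL, agno, &xfbt);

		/* Step 5: regular xfs_rmap_* queries and updates go here. */

		/* Step 6 (and error unwind): cursor, xfbtree, buftarg, xfile. */
		xfs_btree_del_cursor(cur, error);
		xfbtree_destroy(&xfbt);
	out_btp:
		xfs_free_buftarg(btp);
	out_xfile:
		xfile_destroy(xf);
		return error;
	}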
The XFS transaction manager must not commit buffer log items for buffers backed by an xfile because the log format does not understand updates for devices other than the data device. An ephemeral xfbtree probably will not exist by the time the AIL checkpoints log transactions back into the filesystem, and certainly won't exist during log recovery. For these reasons, any code updating an xfbtree in transaction context must remove the buffer log items from the transaction and write the updates into the backing xfile before committing or cancelling the transaction.h]hXAlthough it is a clever hack to reuse the rmap btree code to handle the staging structure, the ephemeral nature of the in-memory btree block storage presents some challenges of its own. The XFS transaction manager must not commit buffer log items for buffers backed by an xfile because the log format does not understand updates for devices other than the data device. An ephemeral xfbtree probably will not exist by the time the AIL checkpoints log transactions back into the filesystem, and certainly won’t exist during log recovery. For these reasons, any code updating an xfbtree in transaction context must remove the buffer log items from the transaction and write the updates into the backing xfile before committing or cancelling the transaction.}(hj9MhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj(Mhhubh)}(hlThe ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement this functionality as follows:h](hThe }(hjGMhhhNhNubj)}(h``xfbtree_trans_commit``h]hxfbtree_trans_commit}(hjOMhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjGMubh and }(hjGMhhhNhNubj)}(h``xfbtree_trans_cancel``h]hxfbtree_trans_cancel}(hjaMhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjGMubh3 functions implement this functionality as follows:}(hjGMhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj(Mhhubji)}(hhh](h)}(h:Find each buffer log item whose buffer targets the xfile. h]h)}(h9Find each buffer log item whose buffer targets the xfile.h]h9Find each buffer log item whose buffer targets the xfile.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj|Mubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubh)}(h1Record the dirty/ordered status of the log item. h]h)}(h0Record the dirty/ordered status of the log item.h]h0Record the dirty/ordered status of the log item.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjMubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubh)}(h%Detach the log item from the buffer. h]h)}(h$Detach the log item from the buffer.h]h$Detach the log item from the buffer.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjMubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubh)}(h+Queue the buffer to a special delwri list. h]h)}(h*Queue the buffer to a special delwri list.h]h*Queue the buffer to a special delwri list.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjMubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubh)}(hiClear the transaction dirty flag if the only dirty log items were the ones that were detached in step 3. h]h)}(hhClear the transaction dirty flag if the only dirty log items were the ones that were detached in step 3.h]hhClear the transaction dirty flag if the only dirty log items were the ones that were detached in step 3.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjMubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubh)}(h_Submit the delwri list to commit the changes to the xfile, if the updates are being committed. 
h]h)}(h^Submit the delwri list to commit the changes to the xfile, if the updates are being committed.h]h^Submit the delwri list to commit the changes to the xfile, if the updates are being committed.}(hjMhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjMubah}(h]h ]h"]h$]h&]uh1hhjyMhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhj(MhhhhhM ubh)}(hwAfter removing xfile logged buffers from the transaction in this manner, the transaction can be committed or cancelled.h]hwAfter removing xfile logged buffers from the transaction in this manner, the transaction can be committed or cancelled.}(hjNhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj(Mhhubeh}(h](jj Meh ]h"](!committing logged xfbtree buffersxfbtree_commiteh$]h&]uh1hhjIhhhhhMj}j%NjMsj}j MjMsubeh}(h](jXjIeh ]h"](in-memory b+treesxfbtreeeh$]h&]uh1hhjAhhhhhMj}j/NjIsj}jIjIsubeh}(h](jjAeh ]h"](pageable kernel memoryxfileeh$]h&]uh1hhjv*hhhhhM4j}j9NjAsj}jAjAsubh)}(hhh](h)}(hBulk Loading of Ondisk B+Treesh]hBulk Loading of Ondisk B+Trees}(hjANhhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj>NhhhhhM ubh)}(hXAs mentioned previously, early iterations of online repair built new btree structures by creating a new btree and adding observations individually. Loading a btree one record at a time had a slight advantage of not requiring the incore records to be sorted prior to commit, but was very slow and leaked blocks if the system went down during a repair. Loading records one at a time also meant that repair could not control the loading factor of the blocks in the new btree.h]hXAs mentioned previously, early iterations of online repair built new btree structures by creating a new btree and adding observations individually. Loading a btree one record at a time had a slight advantage of not requiring the incore records to be sorted prior to commit, but was very slow and leaked blocks if the system went down during a repair. Loading records one at a time also meant that repair could not control the loading factor of the blocks in the new btree.}(hjONhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj>Nhhubh)}(hX"Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for rebuilding a btree index from a collection of records -- bulk btree loading. This was implemented rather inefficiently code-wise, since ``xfs_repair`` had separate copy-pasted implementations for each btree type.h](hFortunately, the venerable }(hj]NhhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hjeNhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj]Nubh tool had a more efficient means for rebuilding a btree index from a collection of records -- bulk btree loading. This was implemented rather inefficiently code-wise, since }(hj]NhhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hjwNhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj]Nubh> had separate copy-pasted implementations for each btree type.}(hj]NhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM( hj>Nhhubh)}(hTo prepare for online fsck, each of the four bulk loaders were studied, notes were taken, and the four were refactored into a single generic btree bulk loading mechanism. Those notes in turn have been refreshed and are presented below.h]hTo prepare for online fsck, each of the four bulk loaders were studied, notes were taken, and the four were refactored into a single generic btree bulk loading mechanism. 
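Before turning to bulk loading in detail, the six xfbtree commit steps listed in the previous section can be condensed into a sketch of ``xfbtree_trans_commit``. The delwri queue and submit calls and the log item fields are real kernel interfaces; the xfile test and the detach helper are illustrative::

    LIST_HEAD(buffer_list);
    struct xfs_log_item     *lip, *n;
    struct xfs_buf          *bp;
    bool                    other_dirty_items = false;

    list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
            /* step 1: find buffer log items whose buffer targets the xfile */
            if (!log_item_is_xfile_buffer(lip, xfbt)) {     /* illustrative */
                    if (test_bit(XFS_LI_DIRTY, &lip->li_flags))
                            other_dirty_items = true;
                    continue;
            }

            bp = buf_log_item_to_buffer(lip);               /* illustrative */
            /* steps 2 and 3: record the dirty/ordered state, then detach */
            detach_buf_log_item(lip, bp);                   /* illustrative */
            /* step 4: queue the buffer to a special delwri list */
            xfs_buf_delwri_queue(bp, &buffer_list);
    }

    /* step 5: clear the dirty flag if only the detached items were dirty */
    if (!other_dirty_items)
            tp->t_flags &= ~XFS_TRANS_DIRTY;

    /* step 6: write the queued buffers into the backing xfile */
    error = xfs_buf_delwri_submit(&buffer_list);

``xfbtree_trans_cancel`` would follow the same steps, except that the final submission in step 6 applies only when the updates are being committed.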
Those notes in turn have been refreshed and are presented below.}(hjNhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM- hj>Nhhubh)}(hhh](h)}(hGeometry Computationh]hGeometry Computation}(hjNhhhNhNubah}(h]h ]h"]h$]h&]jj0 uh1hhjNhhhhhM3 ubh)}(hXRThe zeroth step of bulk loading is to assemble the entire record set that will be stored in the new btree, and sort the records. Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the btree from the record set, the type of btree, and any load factor preferences. This information is required for resource reservation.h](hThe zeroth step of bulk loading is to assemble the entire record set that will be stored in the new btree, and sort the records. Next, call }(hjNhhhNhNubj)}(h$``xfs_btree_bload_compute_geometry``h]h xfs_btree_bload_compute_geometry}(hjNhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjNubh to compute the shape of the btree from the record set, the type of btree, and any load factor preferences. This information is required for resource reservation.}(hjNhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM5 hjNhhubh)}(hFirst, the geometry computation computes the minimum and maximum records that will fit in a leaf block from the size of a btree block and the size of the block header. Roughly speaking, the maximum number of records is::h]hFirst, the geometry computation computes the minimum and maximum records that will fit in a leaf block from the size of a btree block and the size of the block header. Roughly speaking, the maximum number of records is:}(hjNhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM; hjNhhubj+)}(h2maxrecs = (block_size - header_size) / record_sizeh]h2maxrecs = (block_size - header_size) / record_size}hjNsbah}(h]h ]h"]h$]h&]hhuh1j+hhhM@ hjNhhubh)}(hThe XFS design specifies that btree blocks should be merged when possible, which means the minimum number of records is half of maxrecs::h]hThe XFS design specifies that btree blocks should be merged when possible, which means the minimum number of records is half of maxrecs:}(hjNhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMB hjNhhubj+)}(hminrecs = maxrecs / 2h]hminrecs = maxrecs / 2}hjNsbah}(h]h ]h"]h$]h&]hhuh1j+hhhME hjNhhubh)}(hX The next variable to determine is the desired loading factor. This must be at least minrecs and no more than maxrecs. Choosing minrecs is undesirable because it wastes half the block. Choosing maxrecs is also undesirable because adding a single record to each newly rebuilt leaf block will cause a tree split, which causes a noticeable drop in performance immediately afterwards. The default loading factor was chosen to be 75% of maxrecs, which provides a reasonably compact structure without any immediate split penalties::h]hX The next variable to determine is the desired loading factor. This must be at least minrecs and no more than maxrecs. Choosing minrecs is undesirable because it wastes half the block. Choosing maxrecs is also undesirable because adding a single record to each newly rebuilt leaf block will cause a tree split, which causes a noticeable drop in performance immediately afterwards. 
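(Looking ahead: the complete computation developed over the rest of this section can be condensed into a self-contained C sketch. Every name below is illustrative; the real helper is ``xfs_btree_bload_compute_geometry``.) ::

    struct sketch_geom {
            unsigned int            nlevels;        /* height of the new btree */
            unsigned long long      nblocks;        /* total blocks needed */
    };

    static void
    sketch_compute_geometry(
            unsigned long long      nr_records,
            unsigned int            leaf_maxrecs,
            unsigned int            node_maxrecs,
            int                     space_is_tight,
            struct sketch_geom      *geo)
    {
            /* default_load_factor = (maxrecs + minrecs) / 2 = 75% of maxrecs */
            unsigned int            leaf_load = space_is_tight ? leaf_maxrecs :
                                    (leaf_maxrecs + leaf_maxrecs / 2) / 2;
            unsigned int            node_load = space_is_tight ? node_maxrecs :
                                    (node_maxrecs + node_maxrecs / 2) / 2;
            unsigned long long      nblocks;

            /* leaf_blocks = ceil(record_count / leaf_load_factor) */
            nblocks = (nr_records + leaf_load - 1) / leaf_load;
            geo->nlevels = 1;
            geo->nblocks = nblocks;

            /* add node levels until the current level needs only one block */
            while (nblocks > 1) {
                    nblocks = (nblocks + node_load - 1) / node_load;
                    geo->nblocks += nblocks;
                    geo->nlevels++;
            }
    }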
The default loading factor was chosen to be 75% of maxrecs, which provides a reasonably compact structure without any immediate split penalties:}(hjOhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMG hjNhhubj+)}(h-default_load_factor = (maxrecs + minrecs) / 2h]h-default_load_factor = (maxrecs + minrecs) / 2}hjOsbah}(h]h ]h"]h$]h&]hhuh1j+hhhMP hjNhhubh)}(hcIf space is tight, the loading factor will be set to maxrecs to try to avoid running out of space::h]hbIf space is tight, the loading factor will be set to maxrecs to try to avoid running out of space:}(hj"OhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMR hjNhhubj+)}(h?leaf_load_factor = enough space ? default_load_factor : maxrecsh]h?leaf_load_factor = enough space ? default_load_factor : maxrecs}hj0Osbah}(h]h ]h"]h$]h&]hhuh1j+hhhMU hjNhhubh)}(hwLoad factor is computed for btree node blocks using the combined size of the btree key and pointer as the record size::h]hvLoad factor is computed for btree node blocks using the combined size of the btree key and pointer as the record size:}(hj>OhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMW hjNhhubj+)}(hmaxrecs = (block_size - header_size) / (key_size + ptr_size) minrecs = maxrecs / 2 node_load_factor = enough space ? default_load_factor : maxrecsh]hmaxrecs = (block_size - header_size) / (key_size + ptr_size) minrecs = maxrecs / 2 node_load_factor = enough space ? default_load_factor : maxrecs}hjLOsbah}(h]h ]h"]h$]h&]hhuh1j+hhhMZ hjNhhubh)}(haOnce that's done, the number of leaf blocks required to store the record set can be computed as::h]hbOnce that’s done, the number of leaf blocks required to store the record set can be computed as:}(hjZOhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM^ hjNhhubj+)}(h3leaf_blocks = ceil(record_count / leaf_load_factor)h]h3leaf_blocks = ceil(record_count / leaf_load_factor)}hjhOsbah}(h]h ]h"]h$]h&]hhuh1j+hhhMa hjNhhubh)}(h]The number of node blocks needed to point to the next level down in the tree is computed as::h]h\The number of node blocks needed to point to the next level down in the tree is computed as:}(hjvOhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMc hjNhhubj+)}(hin_blocks = (n == 0 ? leaf_blocks : node_blocks[n]) node_blocks[n + 1] = ceil(n_blocks / node_load_factor)h]hin_blocks = (n == 0 ? leaf_blocks : node_blocks[n]) node_blocks[n + 1] = ceil(n_blocks / node_load_factor)}hjOsbah}(h]h ]h"]h$]h&]hhuh1j+hhhMf hjNhhubh)}(hThe entire computation is performed recursively until the current level only needs one block. The resulting geometry is as follows:h]hThe entire computation is performed recursively until the current level only needs one block. The resulting geometry is as follows:}(hjOhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMi hjNhhubh)}(hhh](h)}(hFor AG-rooted btrees, this level is the root level, so the height of the new tree is ``level + 1`` and the space needed is the summation of the number of blocks on each level. 
h]h)}(hFor AG-rooted btrees, this level is the root level, so the height of the new tree is ``level + 1`` and the space needed is the summation of the number of blocks on each level.h](hUFor AG-rooted btrees, this level is the root level, so the height of the new tree is }(hjOhhhNhNubj)}(h ``level + 1``h]h level + 1}(hjOhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjOubhM and the space needed is the summation of the number of blocks on each level.}(hjOhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMm hjOubah}(h]h ]h"]h$]h&]uh1hhjOhhhhhNubh)}(hFor inode-rooted btrees where the records in the top level do not fit in the inode fork area, the height is ``level + 2``, the space needed is the summation of the number of blocks on each level, and the inode fork points to the root block. h]h)}(hFor inode-rooted btrees where the records in the top level do not fit in the inode fork area, the height is ``level + 2``, the space needed is the summation of the number of blocks on each level, and the inode fork points to the root block.h](hlFor inode-rooted btrees where the records in the top level do not fit in the inode fork area, the height is }(hjOhhhNhNubj)}(h ``level + 2``h]h level + 2}(hjOhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjOubhw, the space needed is the summation of the number of blocks on each level, and the inode fork points to the root block.}(hjOhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMq hjOubah}(h]h ]h"]h$]h&]uh1hhjOhhhhhNubh)}(hXFor inode-rooted btrees where the records in the top level can be stored in the inode fork area, then the root block can be stored in the inode, the height is ``level + 1``, and the space needed is one less than the summation of the number of blocks on each level. This only becomes relevant when non-bmap btrees gain the ability to root in an inode, which is a future patchset and only included here for completeness. h]h)}(hXFor inode-rooted btrees where the records in the top level can be stored in the inode fork area, then the root block can be stored in the inode, the height is ``level + 1``, and the space needed is one less than the summation of the number of blocks on each level. This only becomes relevant when non-bmap btrees gain the ability to root in an inode, which is a future patchset and only included here for completeness.h](hFor inode-rooted btrees where the records in the top level can be stored in the inode fork area, then the root block can be stored in the inode, the height is }(hjOhhhNhNubj)}(h ``level + 1``h]h level + 1}(hjPhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjOubh, and the space needed is one less than the summation of the number of blocks on each level. This only becomes relevant when non-bmap btrees gain the ability to root in an inode, which is a future patchset and only included here for completeness.}(hjOhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMv hjOubah}(h]h ]h"]h$]h&]uh1hhjOhhhhhNubeh}(h]h ]h"]h$]h&]jJjKuh1hhhhMm hjNhhubh)}(h .. 
_newbt:h]h}(h]h ]h"]h$]h&]jnewbtuh1hhM} hjNhhhhubeh}(h]j6 ah ]h"]geometry computationah$]h&]uh1hhj>NhhhhhM3 ubh)}(hhh](h)}(hReserving New B+Tree Blocksh]hReserving New B+Tree Blocks}(hj` for more details.h](hqEFIs have a role to play during the commit and reaping phases; please see the next section and the section about }(hjPhhhNhNubh)}(h:ref:`reaping`h]j)}(hjPh]hreaping}(hjPhhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjPubah}(h]h ]h"]h$]h&]refdocj refdomainjPreftyperef refexplicitrefwarnjreapinguh1hhhhM hjPubh for more details.}(hjPhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj9Phhubh)}(hX)Proposed patchsets are the `bitmap rework `_ and the `preparation for bulk loading btrees `_.h](hProposed patchsets are the }(hjPhhhNhNubj)}(hs`bitmap rework `_h]h bitmap rework}(hjPhhhNhNubah}(h]h ]h"]h$]h&]name bitmap reworkjj`https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-reworkuh1jhjPubh)}(hc h]h}(h] bitmap-reworkah ]h"] bitmap reworkah$]h&]refurijPuh1hjyKhjPubh and the }(hjPhhhNhNubj)}(h`preparation for bulk loading btrees `_h]h#preparation for bulk loading btrees}(hjPhhhNhNubah}(h]h ]h"]h$]h&]name#preparation for bulk loading btreesjjhhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loadinguh1jhjPubh)}(hk h]h}(h]#preparation-for-bulk-loading-btreesah ]h"]#preparation for bulk loading btreesah$]h&]refurijPuh1hjyKhjPubh.}(hjPhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj9Phhubeh}(h](jX j1Peh ]h"](reserving new b+tree blocksnewbteh$]h&]uh1hhj>NhhhhhM j}jQj'Psj}j1Pj'Psubh)}(hhh](h)}(hWriting the New Treeh]hWriting the New Tree}(hjQhhhNhNubah}(h]h ]h"]h$]h&]jjt uh1hhjQhhhhhM ubh)}(hThis part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims a block from the reserved list, writes the new btree block header, fills the rest of the block with records, and adds the new leaf block to a list of written blocks::h](h1This part is pretty simple -- the btree builder (}(hj%QhhhNhNubj)}(h``xfs_btree_bulkload``h]hxfs_btree_bulkload}(hj-QhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj%Qubh) claims a block from the reserved list, writes the new btree block header, fills the rest of the block with records, and adds the new leaf block to a list of written blocks:}(hj%QhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubj+)}(h;┌────┐ │leaf│ │RRR │ └────┘h]h;┌────┐ │leaf│ │RRR │ └────┘}hjEQsbah}(h]h ]h"]h$]h&]hhuh1j+hhhM hjQhhubh)}(hGSibling pointers are set every time a new block is added to the level::h]hFSibling pointers are set every time a new block is added to the level:}(hjSQhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubj+)}(h┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ └────┘ └────┘ └────┘h]h┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ └────┘ └────┘ └────┘}hjaQsbah}(h]h ]h"]h$]h&]hhuh1j+hhhM hjQhhubh)}(hWhen it finishes writing the record leaf blocks, it moves on to the node blocks To fill a node block, it walks each block in the next level down in the tree to compute the relevant keys and write them into the parent node::h]hWhen it finishes writing the record leaf blocks, it moves on to the node blocks To fill a node block, it walks each block in the next level down in the tree to compute the relevant keys and write them into the parent node:}(hjoQhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubj+)}(hX ┌────┐ ┌────┐ │node│──────→│node│ │PP │←──────│PP │ └────┘ └────┘ ↙ ↘ ↙ ↘ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ 
└────┘ └────┘ └────┘h]hX ┌────┐ ┌────┐ │node│──────→│node│ │PP │←──────│PP │ └────┘ └────┘ ↙ ↘ ↙ ↘ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ └────┘ └────┘ └────┘}hj}Qsbah}(h]h ]h"]h$]h&]hhuh1j+hhhM hjQhhubh)}(hFWhen it reaches the root level, it is ready to commit the new btree!::h]hEWhen it reaches the root level, it is ready to commit the new btree!:}(hjQhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubj+)}(hXs ┌─────────┐ │ root │ │ PP │ └─────────┘ ↙ ↘ ┌────┐ ┌────┐ │node│──────→│node│ │PP │←──────│PP │ └────┘ └────┘ ↙ ↘ ↙ ↘ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ └────┘ └────┘ └────┘h]hXs ┌─────────┐ │ root │ │ PP │ └─────────┘ ↙ ↘ ┌────┐ ┌────┐ │node│──────→│node│ │PP │←──────│PP │ └────┘ └────┘ ↙ ↘ ↙ ↘ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │leaf│→│leaf│→│leaf│→│leaf│ │RRR │←│RRR │←│RRR │←│RRR │ └────┘ └────┘ └────┘ └────┘}hjQsbah}(h]h ]h"]h$]h&]hhuh1j+hhhM hjQhhubh)}(hXThe first step to commit the new btree is to persist the btree blocks to disk synchronously. This is a little complicated because a new btree block could have been freed in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to remove the (stale) buffer from the AIL list before it can write the new blocks to disk. Blocks are queued for IO using a delwri list and written in one large batch with ``xfs_buf_delwri_submit``.h](hThe first step to commit the new btree is to persist the btree blocks to disk synchronously. This is a little complicated because a new btree block could have been freed in the recent past, so the builder must use }(hjQhhhNhNubj)}(h``xfs_buf_delwri_queue_here``h]hxfs_buf_delwri_queue_here}(hjQhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh to remove the (stale) buffer from the AIL list before it can write the new blocks to disk. Blocks are queued for IO using a delwri list and written in one large batch with }(hjQhhhNhNubj)}(h``xfs_buf_delwri_submit``h]hxfs_buf_delwri_submit}(hjQhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjQubh.}(hjQhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubh)}(hX/Once the new blocks have been persisted to disk, control returns to the individual repair function that called the bulk loader. The repair function must log the location of the new root in a transaction, clean up the space reservations that were made for the new btree, and reap the old metadata blocks:h]hX/Once the new blocks have been persisted to disk, control returns to the individual repair function that called the bulk loader. The repair function must log the location of the new root in a transaction, clean up the space reservations that were made for the new btree, and reap the old metadata blocks:}(hjQhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubji)}(hhh](h)}(h+Commit the location of the new btree root. h]h)}(h*Commit the location of the new btree root.h]h*Commit the location of the new btree root.}(hjQhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjQubah}(h]h ]h"]h$]h&]uh1hhjQhhhhhNubh)}(hXFor each incore reservation: a. Log Extent Freeing Done (EFD) items for all the space that was consumed by the btree builder. The new EFDs must point to the EFIs attached to the reservation to prevent log recovery from freeing the new blocks. b. For unclaimed portions of incore reservations, create a regular deferred extent free work item to be free the unused space later in the transaction chain. c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the reservation of the committing transaction. 
If the btree loading code suspects this might be about to happen, it must call ``xrep_defer_finish`` to clear out the deferred work and obtain a fresh transaction. h](h)}(hFor each incore reservation:h]hFor each incore reservation:}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjRubji)}(hhh](h)}(hLog Extent Freeing Done (EFD) items for all the space that was consumed by the btree builder. The new EFDs must point to the EFIs attached to the reservation to prevent log recovery from freeing the new blocks. h]h)}(hLog Extent Freeing Done (EFD) items for all the space that was consumed by the btree builder. The new EFDs must point to the EFIs attached to the reservation to prevent log recovery from freeing the new blocks.h]hLog Extent Freeing Done (EFD) items for all the space that was consumed by the btree builder. The new EFDs must point to the EFIs attached to the reservation to prevent log recovery from freeing the new blocks.}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjRubah}(h]h ]h"]h$]h&]uh1hhjRubh)}(hFor unclaimed portions of incore reservations, create a regular deferred extent free work item to free the unused space later in the transaction chain. h]h)}(hFor unclaimed portions of incore reservations, create a regular deferred extent free work item to free the unused space later in the transaction chain.h]hFor unclaimed portions of incore reservations, create a regular deferred extent free work item to free the unused space later in the transaction chain.}(hj3RhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj/Rubah}(h]h ]h"]h$]h&]uh1hhjRubh)}(hXThe EFDs and EFIs logged in steps 2a and 2b must not overrun the reservation of the committing transaction. If the btree loading code suspects this might be about to happen, it must call ``xrep_defer_finish`` to clear out the deferred work and obtain a fresh transaction. h]h)}(hXThe EFDs and EFIs logged in steps 2a and 2b must not overrun the reservation of the committing transaction. If the btree loading code suspects this might be about to happen, it must call ``xrep_defer_finish`` to clear out the deferred work and obtain a fresh transaction.h](hThe EFDs and EFIs logged in steps 2a and 2b must not overrun the reservation of the committing transaction. If the btree loading code suspects this might be about to happen, it must call }(hjKRhhhNhNubj)}(h``xrep_defer_finish``h]hxrep_defer_finish}(hjSRhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjKRubh? to clear out the deferred work and obtain a fresh transaction.}(hjKRhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjGRubah}(h]h ]h"]h$]h&]uh1hhjRubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhjRubeh}(h]h ]h"]h$]h&]uh1hhjQhhhNhNubh)}(haClear out the deferred work a second time to finish the commit and clean the repair transaction. h]h)}(h`Clear out the deferred work a second time to finish the commit and clean the repair transaction.h]h`Clear out the deferred work a second time to finish the commit and clean the repair transaction.}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj}Rubah}(h]h ]h"]h$]h&]uh1hhjQhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjQhhhhhM ubh)}(hXThe transaction rolling in steps 2c and 3 represents a weakness in the repair algorithm, because a log flush and a crash before the end of the reap step can result in space leaking. Online repair functions minimize the chances of this occurring by using very large transactions, which each can accommodate many thousands of block freeing instructions. 
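The reservation walk in steps 2 and 3 can be sketched as a loop. In this fragment, only ``xrep_defer_finish`` is a function named by this document; the reservation fields and the remaining helpers are illustrative::

    list_for_each_entry(resv, &newbt->resv_list, list) {
            /* step 2a: log EFDs for the space the btree builder consumed */
            if (resv->used > 0)
                    log_efd_for_consumed_space(tp, resv);   /* illustrative */

            /* step 2b: defer freeing of any unclaimed reservation */
            if (resv->used < resv->len)
                    defer_free_unused_space(tp, resv);      /* illustrative */

            /* step 2c: roll before the transaction reservation overruns */
            if (transaction_reservation_low(tp)) {          /* illustrative */
                    error = xrep_defer_finish(sc);
                    if (error)
                            return error;
            }
    }

    /* step 3: finish the deferred work and clean the repair transaction */
    return xrep_defer_finish(sc);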
Repair moves on to reaping the old blocks, which will be presented in a subsequent :ref:`section` after a few case studies of bulk loading.h](hXThe transaction rolling in steps 2c and 3 represent a weakness in the repair algorithm, because a log flush and a crash before the end of the reap step can result in space leaking. Online repair functions minimize the chances of this occurring by using very large transactions, which each can accommodate many thousands of block freeing instructions. Repair moves on to reaping the old blocks, which will be presented in a subsequent }(hjRhhhNhNubh)}(h:ref:`section`h]j)}(hjRh]hsection}(hjRhhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjRubah}(h]h ]h"]h$]h&]refdocj refdomainjRreftyperef refexplicitrefwarnjreapinguh1hhhhM hjRubh* after a few case studies of bulk loading.}(hjRhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjQhhubh)}(hhh](h)}(h&Case Study: Rebuilding the Inode Indexh]h&Case Study: Rebuilding the Inode Index}(hjRhhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhjRhhhhhM ubh)}(h;The high level process to rebuild the inode index btree is:h]h;The high level process to rebuild the inode index btree is:}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjRhhubji)}(hhh](h)}(hWalk the reverse mapping records to generate ``struct xfs_inobt_rec`` records from the inode chunk information and a bitmap of the old inode btree blocks. h]h)}(hWalk the reverse mapping records to generate ``struct xfs_inobt_rec`` records from the inode chunk information and a bitmap of the old inode btree blocks.h](h-Walk the reverse mapping records to generate }(hjRhhhNhNubj)}(h``struct xfs_inobt_rec``h]hstruct xfs_inobt_rec}(hjRhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjRubhU records from the inode chunk information and a bitmap of the old inode btree blocks.}(hjRhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjRubah}(h]h ]h"]h$]h&]uh1hhjRhhhhhNubh)}(h1Append the records to an xfarray in inode order. h]h)}(h0Append the records to an xfarray in inode order.h]h0Append the records to an xfarray in inode order.}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjSubah}(h]h ]h"]h$]h&]uh1hhjRhhhhhNubh)}(hUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the inode btree. If the free space inode btree is enabled, call it again to estimate the geometry of the finobt. h]h)}(hUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the inode btree. If the free space inode btree is enabled, call it again to estimate the geometry of the finobt.h](hUse the }(hj5ShhhNhNubj)}(h$``xfs_btree_bload_compute_geometry``h]h xfs_btree_bload_compute_geometry}(hj=ShhhNhNubah}(h]h ]h"]h$]h&]uh1jhj5Subh function to compute the number of blocks needed for the inode btree. If the free space inode btree is enabled, call it again to estimate the geometry of the finobt.}(hj5ShhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj1Subah}(h]h ]h"]h$]h&]uh1hhjRhhhhhNubh)}(h=Allocate the number of blocks computed in the previous step. h]h)}(hCommit the location of the new btree root block(s) to the AGI.h]h>Commit the location of the new btree root block(s) to the AGI.}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjSubah}(h]h ]h"]h$]h&]uh1hhjRhhhhhNubh)}(h>Reap the old btree blocks using the bitmap created in step 1. 
h]h)}(h=Reap the old btree blocks using the bitmap created in step 1.h]h=Reap the old btree blocks using the bitmap created in step 1.}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjSubah}(h]h ]h"]h$]h&]uh1hhjRhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjRhhhhhM ubh)}(hDetails are as follows.h]hDetails are as follows.}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM! hjRhhubh)}(hX7The inode btree maps inumbers to the ondisk location of the associated inode records, which means that the inode btrees can be rebuilt from the reverse mapping information. Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` marks the location of the old inode btree blocks. Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the location of at least one inode cluster buffer. A cluster is the smallest number of ondisk inodes that can be allocated or freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.h](hThe inode btree maps inumbers to the ondisk location of the associated inode records, which means that the inode btrees can be rebuilt from the reverse mapping information. Reverse mapping records with an owner of }(hjShhhNhNubj)}(h``XFS_RMAP_OWN_INOBT``h]hXFS_RMAP_OWN_INOBT}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1jhjSubh` marks the location of the old inode btree blocks. Each reverse mapping record with an owner of }(hjShhhNhNubj)}(h``XFS_RMAP_OWN_INODES``h]hXFS_RMAP_OWN_INODES}(hjShhhNhNubah}(h]h ]h"]h$]h&]uh1jhjSubh marks the location of at least one inode cluster buffer. A cluster is the smallest number of ondisk inodes that can be allocated or freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.}(hjShhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM# hjRhhubh)}(hXFor the space represented by each inode cluster, ensure that there are no records in the free space btrees nor any records in the reference count btree. If there are, the space metadata inconsistencies are reason enough to abort the operation. Otherwise, read each cluster buffer to check that its contents appear to be ondisk inodes and to decide if the file is allocated (``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``). Accumulate the results of successive inode cluster buffer reads until there is enough information to fill a single inode chunk record, which is 64 consecutive numbers in the inumber keyspace. If the chunk is sparse, the chunk record may include holes.h](hXvFor the space represented by each inode cluster, ensure that there are no records in the free space btrees nor any records in the reference count btree. If there are, the space metadata inconsistencies are reason enough to abort the operation. Otherwise, read each cluster buffer to check that its contents appear to be ondisk inodes and to decide if the file is allocated (}(hjThhhNhNubj)}(h``xfs_dinode.i_mode != 0``h]hxfs_dinode.i_mode != 0}(hjThhhNhNubah}(h]h ]h"]h$]h&]uh1jhjTubh ) or free (}(hjThhhNhNubj)}(h``xfs_dinode.i_mode == 0``h]hxfs_dinode.i_mode == 0}(hj-ThhhNhNubah}(h]h ]h"]h$]h&]uh1jhjTubh). Accumulate the results of successive inode cluster buffer reads until there is enough information to fill a single inode chunk record, which is 64 consecutive numbers in the inumber keyspace. If the chunk is sparse, the chunk record may include holes.}(hjThhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM- hjRhhubh)}(hX3Once the repair function accumulates one chunk's worth of data, it calls ``xfarray_append`` to add the inode btree record to the xfarray. 
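A sketch of that accumulation, using the incore form of the inode chunk record; ``xfarray_append`` and ``XFS_INODES_PER_CHUNK`` are real interfaces, while the chunk-assembly helpers are illustrative::

    struct xfs_inobt_rec_incore     irec = { };

    irec.ir_startino = chunk_agino;         /* 64-inumber aligned start */
    irec.ir_count = XFS_INODES_PER_CHUNK;   /* 64 consecutive inumbers */

    /* fold each cluster buffer's allocated/free state into the record */
    for_each_cluster_in_chunk(cluster, chunk_agino) {       /* illustrative */
            /* a zero i_mode means the ondisk inode is free */
            accumulate_cluster_free_mask(&irec, cluster);   /* illustrative */
    }

    error = xfarray_append(ichunk_records, &irec);
    if (error)
            return error;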
This xfarray is walked twice during the btree creation step -- once to populate the inode btree with all inode chunk records, and a second time to populate the free inode btree with records for chunks that have free non-sparse inodes. The number of records for the inode btree is the number of xfarray records, but the record count for the free inode btree has to be computed as inode chunk records are stored in the xfarray.h](hKOnce the repair function accumulates one chunk’s worth of data, it calls }(hjEThhhNhNubj)}(h``xfarray_append``h]hxfarray_append}(hjMThhhNhNubah}(h]h ]h"]h$]h&]uh1jhjETubhX to add the inode btree record to the xfarray. This xfarray is walked twice during the btree creation step -- once to populate the inode btree with all inode chunk records, and a second time to populate the free inode btree with records for chunks that have free non-sparse inodes. The number of records for the inode btree is the number of xfarray records, but the record count for the free inode btree has to be computed as inode chunk records are stored in the xfarray.}(hjEThhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM9 hjRhhubh)}(hThe proposed patchset is the `AG btree repair `_ series.h](hThe proposed patchset is the }(hjeThhhNhNubj)}(hq`AG btree repair `_h]hAG btree repair}(hjmThhhNhNubah}(h]h ]h"]h$]h&]nameAG btree repairjj\https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btreesuh1jhjeTubh)}(h_ h]h}(h]ag-btree-repairah ]h"]ag btree repairah$]h&]refurij}Tuh1hjyKhjeTubh series.}(hjeThhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMB hjRhhubeh}(h]j ah ]h"]&case study: rebuilding the inode indexah$]h&]uh1hhjQhhhhhM ubh)}(hhh](h)}(h1Case Study: Rebuilding the Space Reference Countsh]h1Case Study: Rebuilding the Space Reference Counts}(hjThhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhjThhhhhMH ubh)}(hXReverse mapping records are used to rebuild the reference count information. Reference counts are required for correct operation of copy on write for shared file data. Imagine the reverse mapping entries as rectangles representing extents of physical blocks, and that the rectangles can be laid down to allow them to overlap each other. From the diagram below, it is apparent that a reference count record must start or end wherever the height of the stack changes. In other words, the record emission stimulus is level-triggered::h]hXReverse mapping records are used to rebuild the reference count information. Reference counts are required for correct operation of copy on write for shared file data. Imagine the reverse mapping entries as rectangles representing extents of physical blocks, and that the rectangles can be laid down to allow them to overlap each other. From the diagram below, it is apparent that a reference count record must start or end wherever the height of the stack changes. In other words, the record emission stimulus is level-triggered:}(hjThhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMJ hjThhubj+)}(hX █ ███ ██ █████ ████ ███ ██████ ██ ████ ███████████ ████ █████████ ████████████████████████████████ ███████████ ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ 2 1 23 21 3 43 234 2123 1 01 2 3 0h]hX █ ███ ██ █████ ████ ███ ██████ ██ ████ ███████████ ████ █████████ ████████████████████████████████ ███████████ ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ 2 1 23 21 3 43 234 2123 1 01 2 3 0}hjTsbah}(h]h ]h"]h$]h&]hhuh1j+hhhMT hjThhubh)}(hXPThe ondisk reference count btree does not store the refcount == 0 cases because the free space btree already records which blocks are free. 
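As a small worked example of the level-triggered rule (the extents are invented for illustration), suppose three reverse mappings cover blocks [0,4), [2,8), and [6,8)::

    mappings:  A = [0,4)   B = [2,8)   C = [6,8)

    block:     0  1  2  3  4  5  6  7
    height:    1  1  2  2  1  1  2  2

The stack height changes at blocks 2, 4, and 6, so the emitted refcount records are [2,4) = 2 and [6,8) = 2. The single-owner ranges [0,2) and [4,6), and the free space on either side, are not recorded, for the reasons given in the surrounding text.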
Extents being used to stage copy-on-write operations should be the only records with refcount == 1. Single-owner file blocks aren't recorded in either the free space or the reference count btrees.h]hXRThe ondisk reference count btree does not store the refcount == 0 cases because the free space btree already records which blocks are free. Extents being used to stage copy-on-write operations should be the only records with refcount == 1. Single-owner file blocks aren’t recorded in either the free space or the reference count btrees.}(hjThhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM[ hjThhubh)}(h?The high level process to rebuild the reference count btree is:h]h?The high level process to rebuild the reference count btree is:}(hjThhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMb hjThhubji)}(hhh](h)}(hXWalk the reverse mapping records to generate ``struct xfs_refcount_irec`` records for any space having more than one reverse mapping and add them to the xfarray. Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray because these are extents allocated to stage a copy on write operation and are tracked in the refcount btree. Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old refcount btree blocks. h](h)}(hXWWalk the reverse mapping records to generate ``struct xfs_refcount_irec`` records for any space having more than one reverse mapping and add them to the xfarray. Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray because these are extents allocated to stage a copy on write operation and are tracked in the refcount btree.h](h-Walk the reverse mapping records to generate }(hjThhhNhNubj)}(h``struct xfs_refcount_irec``h]hstruct xfs_refcount_irec}(hjThhhNhNubah}(h]h ]h"]h$]h&]uh1jhjTubhn records for any space having more than one reverse mapping and add them to the xfarray. Any records owned by }(hjThhhNhNubj)}(h``XFS_RMAP_OWN_COW``h]hXFS_RMAP_OWN_COW}(hjUhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjTubh are also added to the xfarray because these are extents allocated to stage a copy on write operation and are tracked in the refcount btree.}(hjThhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMd hjTubh)}(h_Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old refcount btree blocks.h](hUse any records owned by }(hjUhhhNhNubj)}(h``XFS_RMAP_OWN_REFC``h]hXFS_RMAP_OWN_REFC}(hj&UhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjUubh1 to create a bitmap of old refcount btree blocks.}(hjUhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMk hjTubeh}(h]h ]h"]h$]h&]uh1hhjThhhhhNubh)}(hSort the records in physical extent order, putting the CoW staging extents at the end of the xfarray. This matches the sorting order of records in the refcount btree. h]h)}(hSort the records in physical extent order, putting the CoW staging extents at the end of the xfarray. This matches the sorting order of records in the refcount btree.h]hSort the records in physical extent order, putting the CoW staging extents at the end of the xfarray. This matches the sorting order of records in the refcount btree.}(hjHUhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMn hjDUubah}(h]h ]h"]h$]h&]uh1hhjThhhhhNubh)}(hoUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the new tree. 
h]h)}(hnUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the new tree.h](hUse the }(hj`UhhhNhNubj)}(h$``xfs_btree_bload_compute_geometry``h]h xfs_btree_bload_compute_geometry}(hjhUhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj`UubhB function to compute the number of blocks needed for the new tree.}(hj`UhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMr hj\Uubah}(h]h ]h"]h$]h&]uh1hhjThhhhhNubh)}(h=Allocate the number of blocks computed in the previous step. h]h)}(hReap the old btree blocks using the bitmap created in step 1. h]h)}(h=Reap the old btree blocks using the bitmap created in step 1.h]h=Reap the old btree blocks using the bitmap created in step 1.}(hjUhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM| hjUubah}(h]h ]h"]h$]h&]uh1hhjThhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjThhhhhMd ubh)}(hDetails are as follows; the same algorithm is used by ``xfs_repair`` to generate refcount information from reverse mapping records.h](h6Details are as follows; the same algorithm is used by }(hjUhhhNhNubj)}(h``xfs_repair``h]h xfs_repair}(hjVhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjUubh? to generate refcount information from reverse mapping records.}(hjUhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM~ hjThhubh)}(hhh]h)}(hX\Until the reverse mapping btree runs out of records: - Retrieve the next record from the btree and put it in a bag. - Collect all records with the same starting block from the btree and put them in the bag. - While the bag isn't empty: - Among the mappings in the bag, compute the lowest block number where the reference count changes. This position will be either the starting block number of the next unprocessed reverse mapping or the next block after the shortest mapping in the bag. - Remove all mappings from the bag that end at this position. - Collect all reverse mappings that start at this position from the btree and put them in the bag. - If the size of the bag changed and is greater than one, create a new refcount record associating the block number range that we just walked to the size of the bag. h](h)}(h4Until the reverse mapping btree runs out of records:h]h4Until the reverse mapping btree runs out of records:}(hj%VhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj!Vubh)}(hhh](h)}(h=Retrieve the next record from the btree and put it in a bag. h]h)}(h` section. Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and removed via ``xfarray_unset``. Bag members are examined through ``xfarray_iter`` loops.h](hLThe bag-like structure in this case is a type 2 xfarray as discussed in the }(hjVhhhNhNubh)}(h7:ref:`xfarray access patterns`h]j)}(hjWh]hxfarray access patterns}(hjWhhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhjWubah}(h]h ]h"]h$]h&]refdocj refdomainjWreftyperef refexplicitrefwarnjxfarray_access_patternsuh1hhhhM hjVubh6 section. Reverse mappings are added to the bag using }(hjVhhhNhNubj)}(h``xfarray_store_anywhere``h]hxfarray_store_anywhere}(hj%WhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjVubh and removed via }(hjVhhhNhNubj)}(h``xfarray_unset``h]h xfarray_unset}(hj7WhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjVubh#. 
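A condensed sketch of the sweep follows. The xfarray calls are the ones named in this section; the cursor and bag helpers wrapping them are illustrative, and a real implementation would also skip emission when the height did not actually change::

    while (rmaps_remain(rmap_cur)) {
            /* prime the bag with the mappings at the next start block */
            rec_start = next_rmap_startblock(rmap_cur);
            add_rmaps_starting_at(bag, rmap_cur, rec_start);
                                    /* wraps xfarray_store_anywhere */

            while (!bag_is_empty(bag)) {
                    /* lowest block where the reference count changes */
                    cbno = min(next_rmap_startblock(rmap_cur),
                               shortest_mapping_end(bag));

                    /* close out the range just walked, if it was shared */
                    if (bag_size(bag) > 1)
                            emit_refcount_record(out, rec_start, cbno,
                                            bag_size(bag));

                    remove_rmaps_ending_at(bag, cbno);  /* xfarray_unset */
                    add_rmaps_starting_at(bag, rmap_cur, cbno);
                    rec_start = cbno;
            }
    }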
Bag members are examined through }(hjVhhhNhNubj)}(h``xfarray_iter``h]h xfarray_iter}(hjIWhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjVubh loops.}(hjVhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjThhubh)}(hThe proposed patchset is the `AG btree repair `_ series.h](hThe proposed patchset is the }(hjaWhhhNhNubj)}(hq`AG btree repair `_h]hAG btree repair}(hjiWhhhNhNubah}(h]h ]h"]h$]h&]nameAG btree repairjj\https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btreesuh1jhjaWubh)}(h_ h]h}(h]id4ah ]h"]h$]ag btree repairah&]refurijyWuh1hjyKhjaWubh series.}(hjaWhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjThhubeh}(h]j ah ]h"]1case study: rebuilding the space reference countsah$]h&]uh1hhjQhhhhhMH ubh)}(hhh](h)}(h0Case Study: Rebuilding File Fork Mapping Indicesh]h0Case Study: Rebuilding File Fork Mapping Indices}(hjWhhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhjWhhhhhM ubh)}(hDThe high level process to rebuild a data/attr fork mapping btree is:h]hDThe high level process to rebuild a data/attr fork mapping btree is:}(hjWhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjWhhubji)}(hhh](h)}(hWalk the reverse mapping records to generate ``struct xfs_bmbt_rec`` records from the reverse mapping records for that inode and fork. Append these records to an xfarray. Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` records. h]h)}(hWalk the reverse mapping records to generate ``struct xfs_bmbt_rec`` records from the reverse mapping records for that inode and fork. Append these records to an xfarray. Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` records.h](h-Walk the reverse mapping records to generate }(hjWhhhNhNubj)}(h``struct xfs_bmbt_rec``h]hstruct xfs_bmbt_rec}(hjWhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjWubh records from the reverse mapping records for that inode and fork. Append these records to an xfarray. Compute the bitmap of the old bmap btree blocks from the }(hjWhhhNhNubj)}(h``BMBT_BLOCK``h]h BMBT_BLOCK}(hjWhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjWubh records.}(hjWhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjWubah}(h]h ]h"]h$]h&]uh1hhjWhhhhhNubh)}(hoUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the new tree. h]h)}(hnUse the ``xfs_btree_bload_compute_geometry`` function to compute the number of blocks needed for the new tree.h](hUse the }(hjWhhhNhNubj)}(h$``xfs_btree_bload_compute_geometry``h]h xfs_btree_bload_compute_geometry}(hjXhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjWubhB function to compute the number of blocks needed for the new tree.}(hjWhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjWubah}(h]h ]h"]h$]h&]uh1hhjWhhhhhNubh)}(h'Sort the records in file offset order. h]h)}(h&Sort the records in file offset order.h]h&Sort the records in file offset order.}(hj$XhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj Xubah}(h]h ]h"]h$]h&]uh1hhjWhhhhhNubh)}(hIf the extent records would fit in the inode fork immediate area, commit the records to that immediate area and skip to step 8. h]h)}(hIf the extent records would fit in the inode fork immediate area, commit the records to that immediate area and skip to step 8.h]hIf the extent records would fit in the inode fork immediate area, commit the records to that immediate area and skip to step 8.}(hjReap the old btree blocks using the bitmap created in step 1. 
h]h)}(h=Reap the old btree blocks using the bitmap created in step 1.h]h=Reap the old btree blocks using the bitmap created in step 1.}(hjXhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjXubah}(h]h ]h"]h$]h&]uh1hhjWhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjWhhhhhM ubh)}(hXThere are some complications here: First, it's possible to move the fork offset to adjust the sizes of the immediate areas if the data and attr forks are not both in BMBT format. Second, if there are sufficiently few fork mappings, it may be possible to use EXTENTS format instead of BMBT, which may require a conversion. Third, the incore extent map must be reloaded carefully to avoid disturbing any delayed allocation extents.h]hXThere are some complications here: First, it’s possible to move the fork offset to adjust the sizes of the immediate areas if the data and attr forks are not both in BMBT format. Second, if there are sufficiently few fork mappings, it may be possible to use EXTENTS format instead of BMBT, which may require a conversion. Third, the incore extent map must be reloaded carefully to avoid disturbing any delayed allocation extents.}(hjXhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjWhhubh)}(hThe proposed patchset is the `file mapping repair `_ series.h](hThe proposed patchset is the }(hjXhhhNhNubj)}(hy`file mapping repair `_h]hfile mapping repair}(hjXhhhNhNubah}(h]h ]h"]h$]h&]namefile mapping repairjj`https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappingsuh1jhjXubh)}(hc h]h}(h]file-mapping-repairah ]h"]file mapping repairah$]h&]refurijXuh1hjyKhjXubh series.}(hjXhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjWhhubh)}(h .. _reaping:h]h}(h]h ]h"]h$]h&]jreapinguh1hhM hjWhhhhubeh}(h]j ah ]h"]0case study: rebuilding file fork mapping indicesah$]h&]uh1hhjQhhhhhM ubeh}(h]jz ah ]h"]writing the new treeah$]h&]uh1hhj>NhhhhhM ubeh}(h]j ah ]h"]bulk loading of ondisk b+treesah$]h&]uh1hhjv*hhhhhM ubh)}(hhh](h)}(hReaping Old Metadata Blocksh]hReaping Old Metadata Blocks}(hj)YhhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj&YhhhhhM ubh)}(hXWhenever online fsck builds a new data structure to replace one that is suspect, there is a question of how to find and dispose of the blocks that belonged to the old structure. The laziest method of course is not to deal with them at all, but this slowly leads to service degradations as space leaks out of the filesystem. Hopefully, someone will schedule a rebuild of the free space information to plug all those leaks. Offline repair rebuilds all space metadata after recording the usage of the files and directories that it decides not to clear, hence it can build new structures in the discovered free space and avoid the question of reaping.h]hXWhenever online fsck builds a new data structure to replace one that is suspect, there is a question of how to find and dispose of the blocks that belonged to the old structure. The laziest method of course is not to deal with them at all, but this slowly leads to service degradations as space leaks out of the filesystem. Hopefully, someone will schedule a rebuild of the free space information to plug all those leaks. Offline repair rebuilds all space metadata after recording the usage of the files and directories that it decides not to clear, hence it can build new structures in the discovered free space and avoid the question of reaping.}(hj7YhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubh)}(hXAs part of a repair, online fsck relies heavily on the reverse mapping records to find space that is owned by the corresponding rmap owner yet truly free. 
Cross referencing rmap records with other rmap records is necessary because there may be other data structures that also think they own some of those blocks (e.g. crosslinked trees). Permitting the block allocator to hand them out again will not push the system towards consistency.h]hXAs part of a repair, online fsck relies heavily on the reverse mapping records to find space that is owned by the corresponding rmap owner yet truly free. Cross referencing rmap records with other rmap records is necessary because there may be other data structures that also think they own some of those blocks (e.g. crosslinked trees). Permitting the block allocator to hand them out again will not push the system towards consistency.}(hjEYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubh)}(h_For space metadata, the process of finding extents to dispose of generally follows this format:h]h_For space metadata, the process of finding extents to dispose of generally follows this format:}(hjSYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubji)}(hhh](h)}(hCreate a bitmap of space used by data structures that must be preserved. The space reservations used to create the new metadata can be used here if the same rmap owner code is used to denote all of the objects being rebuilt. h]h)}(hCreate a bitmap of space used by data structures that must be preserved. The space reservations used to create the new metadata can be used here if the same rmap owner code is used to denote all of the objects being rebuilt.h]hCreate a bitmap of space used by data structures that must be preserved. The space reservations used to create the new metadata can be used here if the same rmap owner code is used to denote all of the objects being rebuilt.}(hjhYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdYubah}(h]h ]h"]h$]h&]uh1hhjaYhhhhhNubh)}(hSurvey the reverse mapping data to create a bitmap of space owned by the same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved. h]h)}(hSurvey the reverse mapping data to create a bitmap of space owned by the same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.h](hNSurvey the reverse mapping data to create a bitmap of space owned by the same }(hjYhhhNhNubj)}(h``XFS_RMAP_OWN_*``h]hXFS_RMAP_OWN_*}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjYubh1 number for the metadata that is being preserved.}(hjYhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj|Yubah}(h]h ]h"]h$]h&]uh1hhjaYhhhhhNubh)}(hUse the bitmap disunion operator to subtract (1) from (2). The remaining set bits represent candidate extents that could be freed. The process moves on to step 4 below. h]h)}(hUse the bitmap disunion operator to subtract (1) from (2). The remaining set bits represent candidate extents that could be freed. The process moves on to step 4 below.h]hUse the bitmap disunion operator to subtract (1) from (2). The remaining set bits represent candidate extents that could be freed. The process moves on to step 4 below.}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjYubah}(h]h ]h"]h$]h&]uh1hhjaYhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhj&YhhhhhM ubh)}(hXDRepairs for file-based metadata such as extended attributes, directories, symbolic links, quota files and realtime bitmaps are performed by building a new structure attached to a temporary file and exchanging all mappings in the file forks. 
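The three-step format above amounts to a bitmap subtraction. A minimal sketch, with the bitmap type and helpers as illustrative stand-ins for the kernel's scrub bitmap code::

    struct xbitmap  preserved;      /* step 1: space that must be kept */
    struct xbitmap  owned;          /* step 2: same XFS_RMAP_OWN_* owner */

    /* ... populate both bitmaps as described above ... */

    /* step 3: owned &= ~preserved */
    error = xbitmap_disunion(&owned, &preserved);
    if (error)
            return error;

    /* every extent still set in "owned" is a candidate for disposal */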
Afterward, the mappings in the old file fork are the candidate blocks for disposal.h]hXDRepairs for file-based metadata such as extended attributes, directories, symbolic links, quota files and realtime bitmaps are performed by building a new structure attached to a temporary file and exchanging all mappings in the file forks. Afterward, the mappings in the old file fork are the candidate blocks for disposal.}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubh)}(h7The process for disposing of old extents is as follows:h]h7The process for disposing of old extents is as follows:}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubji)}(hhh](h)}(hXBFor each candidate extent, count the number of reverse mapping records for the first block in that extent that do not have the same rmap owner for the data structure being repaired. - If zero, the block has a single owner and can be freed. - If not, the block is part of a crosslinked structure and must not be freed. h](h)}(hFor each candidate extent, count the number of reverse mapping records for the first block in that extent that do not have the same rmap owner for the data structure being repaired.h]hFor each candidate extent, count the number of reverse mapping records for the first block in that extent that do not have the same rmap owner for the data structure being repaired.}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjYubh)}(hhh](h)}(h8If zero, the block has a single owner and can be freed. h]h)}(h7If zero, the block has a single owner and can be freed.h]h7If zero, the block has a single owner and can be freed.}(hjYhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjYubah}(h]h ]h"]h$]h&]uh1hhjYubh)}(hLIf not, the block is part of a crosslinked structure and must not be freed. h]h)}(hKIf not, the block is part of a crosslinked structure and must not be freed.h]hKIf not, the block is part of a crosslinked structure and must not be freed.}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjZubah}(h]h ]h"]h$]h&]uh1hhjYubeh}(h]h ]h"]h$]h&]jJjKuh1hhhhM hjYubeh}(h]h ]h"]h$]h&]uh1hhjYhhhNhNubh)}(hStarting with the next block in the extent, figure out how many more blocks have the same zero/nonzero other owner status as that first block. h]h)}(hStarting with the next block in the extent, figure out how many more blocks have the same zero/nonzero other owner status as that first block.h]hStarting with the next block in the extent, figure out how many more blocks have the same zero/nonzero other owner status as that first block.}(hj8ZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj4Zubah}(h]h ]h"]h$]h&]uh1hhjYhhhhhNubh)}(hIf the region is crosslinked, delete the reverse mapping entry for the structure being repaired and move on to the next region. h]h)}(hIf the region is crosslinked, delete the reverse mapping entry for the structure being repaired and move on to the next region.h]hIf the region is crosslinked, delete the reverse mapping entry for the structure being repaired and move on to the next region.}(hjPZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjLZubah}(h]h ]h"]h$]h&]uh1hhjYhhhhhNubh)}(htIf the region is to be freed, mark any corresponding buffers in the buffer cache as stale to prevent log writeback. h]h)}(hsIf the region is to be freed, mark any corresponding buffers in the buffer cache as stale to prevent log writeback.h]hsIf the region is to be freed, mark any corresponding buffers in the buffer cache as stale to prevent log writeback.}(hjhZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdZubah}(h]h ]h"]h$]h&]uh1hhjYhhhhhNubh)}(hFree the region and move on. 
h]h)}(hFree the region and move on.h]hFree the region and move on.}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj|Zubah}(h]h ]h"]h$]h&]uh1hhjYhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkstartKuh1jhhj&YhhhhhM ubh)}(hHowever, there is one complication to this procedure. Transactions are of finite size, so the reaping process must be careful to roll the transactions to avoid overruns. Overruns come from two sources:h]hHowever, there is one complication to this procedure. Transactions are of finite size, so the reaping process must be careful to roll the transactions to avoid overruns. Overruns come from two sources:}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubji)}(hhh](h)}(h:EFIs logged on behalf of space that is no longer occupied h]h)}(h9EFIs logged on behalf of space that is no longer occupiedh]h9EFIs logged on behalf of space that is no longer occupied}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjZubah}(h]h ]h"]h$]h&]uh1hhjZhhhhhNubh)}(h#Log items for buffer invalidations h]h)}(h"Log items for buffer invalidationsh]h"Log items for buffer invalidations}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjZubah}(h]h ]h"]h$]h&]uh1hhjZhhhhhNubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhj&YhhhhhM ubh)}(hThis is also a window in which a crash during the reaping process can leak blocks. As stated earlier, online repair functions use very large transactions to minimize the chances of this occurring.h]hThis is also a window in which a crash during the reaping process can leak blocks. As stated earlier, online repair functions use very large transactions to minimize the chances of this occurring.}(hjZhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubh)}(hThe proposed patchset is the `preparation for bulk loading btrees `_ series.h](hThe proposed patchset is the }(hjZhhhNhNubj)}(h`preparation for bulk loading btrees `_h]h#preparation for bulk loading btrees}(hjZhhhNhNubah}(h]h ]h"]h$]h&]name#preparation for bulk loading btreesjjhhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loadinguh1jhjZubh)}(hk h]h}(h]id5ah ]h"]h$]#preparation for bulk loading btreesah&]refurij[uh1hjyKhjZubh series.}(hjZhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj&Yhhubh)}(hhh](h)}(h0Case Study: Reaping After a Regular Btree Repairh]h0Case Study: Reaping After a Regular Btree Repair}(hj#[hhhNhNubah}(h]h ]h"]h$]h&]jj0 uh1hhj [hhhhhM$ ubh)}(hX:Old reference count and inode btrees are the easiest to reap because they have rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees. Creating a list of extents to reap the old btree blocks is quite simple, conceptually:h](hvOld reference count and inode btrees are the easiest to reap because they have rmap records with special owner codes: }(hj1[hhhNhNubj)}(h``XFS_RMAP_OWN_REFC``h]hXFS_RMAP_OWN_REFC}(hj9[hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj1[ubh for the refcount btree, and }(hj1[hhhNhNubj)}(h``XFS_RMAP_OWN_INOBT``h]hXFS_RMAP_OWN_INOBT}(hjK[hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj1[ubh| for the inode and free inode btrees. Creating a list of extents to reap the old btree blocks is quite simple, conceptually:}(hj1[hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM& hj [hhubji)}(hhh](h)}(hJLock the relevant AGI/AGF header buffers to prevent allocation and frees. 
h]h)}(hILock the relevant AGI/AGF header buffers to prevent allocation and frees.h]hILock the relevant AGI/AGF header buffers to prevent allocation and frees.}(hjj[hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM, hjf[ubah}(h]h ]h"]h$]h&]uh1hhjc[hhhhhNubh)}(hFor each reverse mapping record with an rmap owner corresponding to the metadata structure being rebuilt, set the corresponding range in a bitmap. h]h)}(hFor each reverse mapping record with an rmap owner corresponding to the metadata structure being rebuilt, set the corresponding range in a bitmap.h]hFor each reverse mapping record with an rmap owner corresponding to the metadata structure being rebuilt, set the corresponding range in a bitmap.}(hj[hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM. hj~[ubah}(h]h ]h"]h$]h&]uh1hhjc[hhhhhNubh)}(h~Walk the current data structures that have the same rmap owner. For each block visited, clear that range in the above bitmap. h]h)}(h}Walk the current data structures that have the same rmap owner. For each block visited, clear that range in the above bitmap.h]h}Walk the current data structures that have the same rmap owner. For each block visited, clear that range in the above bitmap.}(hj[hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM1 hj[ubah}(h]h ]h"]h$]h&]uh1hhjc[hhhhhNubh)}(hEach set bit in the bitmap represents a block that could be a block from the old data structures and hence is a candidate for reaping. In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)`` are the blocks that might be freeable. h]h)}(hEach set bit in the bitmap represents a block that could be a block from the old data structures and hence is a candidate for reaping. In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)`` are the blocks that might be freeable.h](hEach set bit in the bitmap represents a block that could be a block from the old data structures and hence is a candidate for reaping. In other words, }(hj[hhhNhNubj)}(h7``(rmap_records_owned_by & ~blocks_reachable_by_walk)``h]h3(rmap_records_owned_by & ~blocks_reachable_by_walk)}(hj[hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj[ubh' are the blocks that might be freeable.}(hj[hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM4 hj[ubah}(h]h ]h"]h$]h&]uh1hhjc[hhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhj [hhhhhM, ubh)}(hIf it is possible to maintain the AGF lock throughout the repair (which is the common case), then step 2 can be performed at the same time as the reverse mapping record walk that creates the records for the new btree.h]hIf it is possible to maintain the AGF lock throughout the repair (which is the common case), then step 2 can be performed at the same time as the reverse mapping record walk that creates the records for the new btree.}(hj[hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM9 hj [hhubeh}(h]j6 ah ]h"]0case study: reaping after a regular btree repairah$]h&]uh1hhj&YhhhhhM$ ubh)}(hhh](h)}(h-Case Study: Rebuilding the Free Space Indicesh]h-Case Study: Rebuilding the Free Space Indices}(hj[hhhNhNubah}(h]h ]h"]h$]h&]jjR uh1hhj[hhhhhM> ubh)}(hCommit the locations of the new btree root blocks to the AGF. h]h)}(h=Commit the locations of the new btree root blocks to the AGF.h]h=Commit the locations of the new btree root blocks to the AGF.}(hj\hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMQ hj\ubah}(h]h ]h"]h$]h&]uh1hhj\hhhhhNubh)}(hReap the old btree blocks by looking for space that is not recorded by the reverse mapping btree, the new free space btrees, or the AGFL. 
h]h)}(hReap the old btree blocks by looking for space that is not recorded by the reverse mapping btree, the new free space btrees, or the AGFL.h]hReap the old btree blocks by looking for space that is not recorded by the reverse mapping btree, the new free space btrees, or the AGFL.}(hj\hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMS hj\ubah}(h]h ]h"]h$]h&]uh1hhj\hhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhj[hhhhhMB ubh)}(hXRepairing the free space btrees has three key complications over a regular btree repair:h]hXRepairing the free space btrees has three key complications over a regular btree repair:}(hj\hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMV hj[hhubh)}(hFirst, free space is not explicitly tracked in the reverse mapping records. Hence, the new free space records must be inferred from gaps in the physical space component of the keyspace of the reverse mapping btree.h]hFirst, free space is not explicitly tracked in the reverse mapping records. Hence, the new free space records must be inferred from gaps in the physical space component of the keyspace of the reverse mapping btree.}(hj]hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMY hj[hhubh)}(hXxSecond, free space repairs cannot use the common btree reservation code because new blocks are reserved out of the free space btrees. This is impossible when repairing the free space btrees themselves. However, repair holds the AGF buffer lock for the duration of the free space index reconstruction, so it can use the collected free space information to supply the blocks for the new free space btrees. It is not necessary to back each reserved extent with an EFI because the new free space btrees are constructed in what the ondisk filesystem thinks is unowned space. However, if reserving blocks for the new btrees from the collected free space information changes the number of free space records, repair must re-estimate the new free space btree geometry with the new record count until the reservation is sufficient. As part of committing the new btrees, repair must ensure that reverse mappings are created for the reserved blocks and that unused reserved blocks are inserted into the free space btrees. Deferred rmap and freeing operations are used to ensure that this transition is atomic, similar to the other btree repair functions.h]hXxSecond, free space repairs cannot use the common btree reservation code because new blocks are reserved out of the free space btrees. This is impossible when repairing the free space btrees themselves. However, repair holds the AGF buffer lock for the duration of the free space index reconstruction, so it can use the collected free space information to supply the blocks for the new free space btrees. It is not necessary to back each reserved extent with an EFI because the new free space btrees are constructed in what the ondisk filesystem thinks is unowned space. However, if reserving blocks for the new btrees from the collected free space information changes the number of free space records, repair must re-estimate the new free space btree geometry with the new record count until the reservation is sufficient. As part of committing the new btrees, repair must ensure that reverse mappings are created for the reserved blocks and that unused reserved blocks are inserted into the free space btrees. 
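The first complication amounts to walking the reverse mapping records in physical-address order and emitting a free extent for every gap between them. A sketch of that inference, using a hypothetical record-stashing callback::

    /*
     * Sketch: each gap between rmap records (and after the last record)
     * becomes a candidate free space record.  xrep_abt_stash() is a
     * hypothetical helper that saves the new record for bulk loading.
     */
    struct xrep_abt {
            struct xfs_scrub        *sc;
            xfs_agblock_t           next_agbno; /* first unaccounted block */
    };

    static int
    xrep_abt_walk_rmap(
            struct xrep_abt                 *ra,
            const struct xfs_rmap_irec      *rec)
    {
            int                             error;

            /* The space before this record must have been free. */
            if (rec->rm_startblock > ra->next_agbno) {
                    error = xrep_abt_stash(ra, ra->next_agbno,
                                    rec->rm_startblock - ra->next_agbno);
                    if (error)
                            return error;
            }

            /* Records can overlap, so only ever move the cursor forward. */
            ra->next_agbno = max(ra->next_agbno,
                                 rec->rm_startblock + rec->rm_blockcount);
            return 0;
    }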
Deferred rmap and freeing operations are used to ensure that this transition is atomic, similar to the other btree repair functions.}(hj]hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM] hj[hhubh)}(hXThird, finding the blocks to reap after the repair is not overly straightforward. Blocks for the free space btrees and the reverse mapping btrees are supplied by the AGFL. Blocks put onto the AGFL have reverse mapping records with the owner ``XFS_RMAP_OWN_AG``. This ownership is retained when blocks move from the AGFL into the free space btrees or the reverse mapping btrees. When repair walks reverse mapping records to synthesize free space records, it creates a bitmap (``ag_owner_bitmap``) of all the space claimed by ``XFS_RMAP_OWN_AG`` records. The repair context maintains a second bitmap corresponding to the rmap btree blocks and the AGFL blocks (``rmap_agfl_bitmap``). When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap & ~rmap_agfl_bitmap)`` computes the extents that are used by the old free space btrees. These blocks can then be reaped using the methods outlined above.h](hThird, finding the blocks to reap after the repair is not overly straightforward. Blocks for the free space btrees and the reverse mapping btrees are supplied by the AGFL. Blocks put onto the AGFL have reverse mapping records with the owner }(hj#]hhhNhNubj)}(h``XFS_RMAP_OWN_AG``h]hXFS_RMAP_OWN_AG}(hj+]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#]ubh. This ownership is retained when blocks move from the AGFL into the free space btrees or the reverse mapping btrees. When repair walks reverse mapping records to synthesize free space records, it creates a bitmap (}(hj#]hhhNhNubj)}(h``ag_owner_bitmap``h]hag_owner_bitmap}(hj=]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#]ubh) of all the space claimed by }(hj#]hhhNhNubj)}(h``XFS_RMAP_OWN_AG``h]hXFS_RMAP_OWN_AG}(hjO]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#]ubhs records. The repair context maintains a second bitmap corresponding to the rmap btree blocks and the AGFL blocks (}(hj#]hhhNhNubj)}(h``rmap_agfl_bitmap``h]hrmap_agfl_bitmap}(hja]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#]ubh<). When the walk is complete, the bitmap disunion operation }(hj#]hhhNhNubj)}(h)``(ag_owner_bitmap & ~rmap_agfl_bitmap)``h]h%(ag_owner_bitmap & ~rmap_agfl_bitmap)}(hjs]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj#]ubh computes the extents that are used by the old free space btrees. These blocks can then be reaped using the methods outlined above.}(hj#]hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMp hj[hhubh)}(hThe proposed patchset is the `AG btree repair `_ series.h](hThe proposed patchset is the }(hj]hhhNhNubj)}(hq`AG btree repair `_h]hAG btree repair}(hj]hhhNhNubah}(h]h ]h"]h$]h&]nameAG btree repairjj\https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btreesuh1jhj]ubh)}(h_ h]h}(h]id6ah ]h"]h$]ag btree repairah&]refurij]uh1hjyKhj]ubh series.}(hj]hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj[hhubh)}(h.. _rmap_reap:h]h}(h]h ]h"]h$]h&]j rmap-reapuh1hhM hj[hhhhubeh}(h]jX ah ]h"]-case study: rebuilding the free space indicesah$]h&]uh1hhj&YhhhhhM> ubh)}(hhh](h)}(h:Case Study: Reaping After Repairing Reverse Mapping Btreesh]h:Case Study: Reaping After Repairing Reverse Mapping Btrees}(hj]hhhNhNubah}(h]h ]h"]h$]h&]jjt uh1hhj]hhhhhM ubh)}(hXOld reverse mapping btrees are less difficult to reap after a repair. As mentioned in the previous section, blocks on the AGFL, the two free space btree blocks, and the reverse mapping btree blocks all have reverse mapping records with ``XFS_RMAP_OWN_AG`` as the owner. 
The full process of gathering reverse mapping records and building a new btree is described in the case study of :ref:`live rebuilds of rmap data `, but a crucial point from that discussion is that the new rmap btree will not contain any records for the old rmap btree, nor will the old btree blocks be tracked in the free space btrees. The list of candidate reaping blocks is computed by setting the bits corresponding to the gaps in the new rmap btree records, and then clearing the bits corresponding to extents in the free space btrees and the current AGFL blocks. The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the methods outlined above.h](hOld reverse mapping btrees are less difficult to reap after a repair. As mentioned in the previous section, blocks on the AGFL, the two free space btree blocks, and the reverse mapping btree blocks all have reverse mapping records with }(hj]hhhNhNubj)}(h``XFS_RMAP_OWN_AG``h]hXFS_RMAP_OWN_AG}(hj]hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj]ubh as the owner. The full process of gathering reverse mapping records and building a new btree is described in the case study of }(hj]hhhNhNubh)}(h/:ref:`live rebuilds of rmap data `h]j)}(hj]h]hlive rebuilds of rmap data}(hj]hhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhj]ubah}(h]h ]h"]h$]h&]refdocj refdomainj^reftyperef refexplicitrefwarnj rmap_repairuh1hhhhM hj]ubhX, but a crucial point from that discussion is that the new rmap btree will not contain any records for the old rmap btree, nor will the old btree blocks be tracked in the free space btrees. The list of candidate reaping blocks is computed by setting the bits corresponding to the gaps in the new rmap btree records, and then clearing the bits corresponding to extents in the free space btrees and the current AGFL blocks. The result }(hj]hhhNhNubj)}(h/``(new_rmapbt_gaps & ~(agfl | bnobt_records))``h]h+(new_rmapbt_gaps & ~(agfl | bnobt_records))}(hj^hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj]ubh- are reaped using the methods outlined above.}(hj]hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj]hhubh)}(hyThe rest of the process of rebuilding the reverse mapping btree is discussed in a separate :ref:`case study`.h](hZThe rest of the process of rebuilding the reverse mapping btree is discussed in a separate }(hj4^hhhNhNubh)}(h:ref:`case study`h]j)}(hj>^h]h case study}(hj@^hhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhj<^ubah}(h]h ]h"]h$]h&]refdocj refdomainjJ^reftyperef refexplicitrefwarnj rmap_repairuh1hhhhM hj4^ubh.}(hj4^hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj]hhubh)}(hThe proposed patchset is the `AG btree repair `_ series.h](hThe proposed patchset is the }(hjf^hhhNhNubj)}(hq`AG btree repair `_h]hAG btree repair}(hjn^hhhNhNubah}(h]h ]h"]h$]h&]nameAG btree repairjj\https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btreesuh1jhjf^ubh)}(h_ h]h}(h]id7ah ]h"]h$]ag btree repairah&]refurij~^uh1hjyKhjf^ubh series.}(hjf^hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj]hhubeh}(h](jz j]eh ]h"](:case study: reaping after repairing reverse mapping btrees rmap_reapeh$]h&]uh1hhj&YhhhhhM j}j^j]sj}j]j]subh)}(hhh](h)}(hCase Study: Rebuilding the AGFLh]hCase Study: Rebuilding the AGFL}(hj^hhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj^hhhhhM ubh)}(hCThe allocation group free block list (AGFL) is repaired as follows:h]hCThe allocation group free block list (AGFL) is repaired as follows:}(hj^hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj^hhubji)}(hhh](h)}(hhCreate a bitmap for all the space that the reverse mapping data claims is owned by ``XFS_RMAP_OWN_AG``. 
h]h)}(hgCreate a bitmap for all the space that the reverse mapping data claims is owned by ``XFS_RMAP_OWN_AG``.h](hSCreate a bitmap for all the space that the reverse mapping data claims is owned by }(hj^hhhNhNubj)}(h``XFS_RMAP_OWN_AG``h]hXFS_RMAP_OWN_AG}(hj^hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj^ubh.}(hj^hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj^ubah}(h]h ]h"]h$]h&]uh1hhj^hhhhhNubh)}(hISubtract the space used by the two free space btrees and the rmap btree. h]h)}(hHSubtract the space used by the two free space btrees and the rmap btree.h]hHSubtract the space used by the two free space btrees and the rmap btree.}(hj^hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj^ubah}(h]h ]h"]h$]h&]uh1hhj^hhhhhNubh)}(hSubtract any space that the reverse mapping data claims is owned by any other owner, to avoid re-adding crosslinked blocks to the AGFL. h]h)}(hSubtract any space that the reverse mapping data claims is owned by any other owner, to avoid re-adding crosslinked blocks to the AGFL.h]hSubtract any space that the reverse mapping data claims is owned by any other owner, to avoid re-adding crosslinked blocks to the AGFL.}(hj_hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj_ubah}(h]h ]h"]h$]h&]uh1hhj^hhhhhNubh)}(h1Once the AGFL is full, reap any blocks leftover. h]h)}(h0Once the AGFL is full, reap any blocks leftover.h]h0Once the AGFL is full, reap any blocks leftover.}(hj _hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj_ubah}(h]h ]h"]h$]h&]uh1hhj^hhhhhNubh)}(hAThe next operation to fix the freelist will right-size the list. h]h)}(h@The next operation to fix the freelist will right-size the list.h]h@The next operation to fix the freelist will right-size the list.}(hj8_hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj4_ubah}(h]h ]h"]h$]h&]uh1hhj^hhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhj^hhhhhM ubh)}(hSee `fs/xfs/scrub/agheader_repair.c `_ for more details.h](hSee }(hjR_hhhNhNubj)}(h`fs/xfs/scrub/agheader_repair.c `_h]hfs/xfs/scrub/agheader_repair.c}(hjZ_hhhNhNubah}(h]h ]h"]h$]h&]namefs/xfs/scrub/agheader_repair.cjjfhttps://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.cuh1jhjR_ubh)}(hi h]h}(h]fs-xfs-scrub-agheader-repair-cah ]h"]fs/xfs/scrub/agheader_repair.cah$]h&]refurijj_uh1hjyKhjR_ubh for more details.}(hjR_hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj^hhubeh}(h]j ah ]h"]case study: rebuilding the agflah$]h&]uh1hhj&YhhhhhM ubeh}(h](j jYeh ]h"](reaping old metadata blocksreapingeh$]h&]uh1hhjv*hhhhhM j}j_jYsj}jYjYsubh)}(hhh](h)}(hInode Record Repairsh]hInode Record Repairs}(hj_hhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj_hhhhhM ubh)}(hXmInode records must be handled carefully, because they have both ondisk records ("dinodes") and an in-memory ("cached") representation. There is a very high potential for cache coherency issues if online fsck is not careful to access the ondisk metadata *only* when the ondisk metadata is so badly damaged that the filesystem cannot load the in-memory representation. When online fsck wants to open a damaged file for scrubbing, it must use specialized resource acquisition functions that return either the in-memory representation *or* a lock on whichever object is necessary to prevent any update to the ondisk location.h](hXInode records must be handled carefully, because they have both ondisk records (“dinodes”) and an in-memory (“cached”) representation. 
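Returning to the AGFL rebuild described above, the steps might be condensed into a sketch like the following, again with hypothetical bitmap helpers::

    /* Sketch of the AGFL repair steps; helper names are hypothetical. */
    static int
    xrep_agfl_sketch(
            struct xfs_scrub        *sc)
    {
            struct xbitmap          candidates;
            unsigned int            nr = 0;
            xfs_agblock_t           agbno;
            int                     error;

            xbitmap_init(&candidates);

            /* Space that the rmap data says is owned by XFS_RMAP_OWN_AG... */
            error = xbitmap_set_from_rmap_owner(sc, XFS_RMAP_OWN_AG,
                            &candidates);
            if (error)
                    goto out;

            /*
             * ...minus the free space btree blocks, the rmap btree
             * blocks, and anything claimed by other owners (steps 2-3).
             */
            error = xbitmap_clear_ag_btrees_and_crosslinks(sc, &candidates);
            if (error)
                    goto out;

            /* Fill the AGFL until it is full... */
            for_each_xbitmap_block(&candidates, agbno) {
                    if (nr == xfs_agfl_size(sc->mp))
                            break;
                    xrep_agfl_append(sc, agbno);
                    nr++;
            }

            /* ...and reap whatever candidate blocks are left over. */
            error = xrep_reap_remaining(sc, &candidates);
    out:
            xbitmap_destroy(&candidates);
            return error;
    }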
There is a very high potential for cache coherency issues if online fsck is not careful to access the ondisk metadata }(hj_hhhNhNubj7)}(h*only*h]honly}(hj_hhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj_ubhX when the ondisk metadata is so badly damaged that the filesystem cannot load the in-memory representation. When online fsck wants to open a damaged file for scrubbing, it must use specialized resource acquisition functions that return either the in-memory representation }(hj_hhhNhNubj7)}(h*or*h]hor}(hj_hhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj_ubhV a lock on whichever object is necessary to prevent any update to the ondisk location.}(hj_hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj_hhubh)}(hX4The only repairs that should be made to the ondisk inode buffers are whatever is necessary to get the in-core structure loaded. This means fixing whatever is caught by the inode cluster buffer and inode fork verifiers, and retrying the ``iget`` operation. If the second ``iget`` fails, the repair has failed.h](hThe only repairs that should be made to the ondisk inode buffers are whatever is necessary to get the in-core structure loaded. This means fixing whatever is caught by the inode cluster buffer and inode fork verifiers, and retrying the }(hj_hhhNhNubj)}(h``iget``h]higet}(hj_hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_ubh operation. If the second }(hj_hhhNhNubj)}(h``iget``h]higet}(hj_hhhNhNubah}(h]h ]h"]h$]h&]uh1jhj_ubh fails, the repair has failed.}(hj_hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj_hhubh)}(hXOnce the in-memory representation is loaded, repair can lock the inode and can subject it to comprehensive checks, repairs, and optimizations. Most inode attributes are easy to check and constrain, or are user-controlled arbitrary bit patterns; these are both easy to fix. Dealing with the data and attr fork extent counts and the file block counts is more complicated, because computing the correct value requires traversing the forks, or if that fails, leaving the fields invalid and waiting for the fork fsck functions to run.h]hXOnce the in-memory representation is loaded, repair can lock the inode and can subject it to comprehensive checks, repairs, and optimizations. Most inode attributes are easy to check and constrain, or are user-controlled arbitrary bit patterns; these are both easy to fix. Dealing with the data and attr fork extent counts and the file block counts is more complicated, because computing the correct value requires traversing the forks, or if that fails, leaving the fields invalid and waiting for the fork fsck functions to run.}(hj`hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj_hhubh)}(hThe proposed patchset is the `inode `_ repair series.h](hThe proposed patchset is the }(hj`hhhNhNubj)}(hd`inode `_h]hinode}(hj`hhhNhNubah}(h]h ]h"]h$]h&]nameinodejjYhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodesuh1jhj`ubh)}(h\ h]h}(h]inodeah ]h"]inodeah$]h&]refurij.`uh1hjyKhj`ubh repair series.}(hj`hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj_hhubeh}(h]j ah ]h"]inode record repairsah$]h&]uh1hhjv*hhhhhM ubh)}(hhh](h)}(hQuota Record Repairsh]hQuota Record Repairs}(hjP`hhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhjM`hhhhhM ubh)}(hSimilar to inodes, quota records ("dquots") also have both ondisk records and an in-memory representation, and hence are subject to the same cache coherency issues. Somewhat confusingly, both are known as dquots in the XFS codebase.h]hSimilar to inodes, quota records (“dquots”) also have both ondisk records and an in-memory representation, and hence are subject to the same cache coherency issues. 
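Returning to the inode record repair flow just described, the load-then-retry logic might look like this sketch, where ``xrep_fix_ondisk_inode`` is a hypothetical stand-in for the dinode fixups::

    /*
     * Sketch: if the inode cluster buffer or inode fork verifiers
     * reject the ondisk inode, repair just enough to load the incore
     * inode, then retry.  A second iget failure fails the repair.
     */
    static int
    xrep_inode_iget_retry(
            struct xfs_scrub        *sc,
            xfs_ino_t               ino,
            struct xfs_inode        **ipp)
    {
            int                     error;

            error = xfs_iget(sc->mp, sc->tp, ino, 0, 0, ipp);
            if (error != -EFSCORRUPTED && error != -EFSBADCRC)
                    return error;

            /* Fix whatever the verifiers caught... */
            error = xrep_fix_ondisk_inode(sc, ino);  /* hypothetical */
            if (error)
                    return error;

            /* ...and try exactly once more. */
            return xfs_iget(sc->mp, sc->tp, ino, 0, 0, ipp);
    }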
Somewhat confusingly, both are known as dquots in the XFS codebase.}(hj^`hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjM`hhubh)}(hXThe only repairs that should be made to the ondisk quota record buffers are whatever is necessary to get the in-core structure loaded. Once the in-memory representation is loaded, the only attributes needing checking are obviously bad limits and timer values.h]hXThe only repairs that should be made to the ondisk quota record buffers are whatever is necessary to get the in-core structure loaded. Once the in-memory representation is loaded, the only attributes needing checking are obviously bad limits and timer values.}(hjl`hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjM`hhubh)}(h~Quota usage counters are checked, repaired, and discussed separately in the section about :ref:`live quotacheck `.h](hZQuota usage counters are checked, repaired, and discussed separately in the section about }(hjz`hhhNhNubh)}(h#:ref:`live quotacheck `h]j)}(hj`h]hlive quotacheck}(hj`hhhNhNubah}(h]h ](jstdstd-refeh"]h$]h&]uh1jhj`ubah}(h]h ]h"]h$]h&]refdocj refdomainj`reftyperef refexplicitrefwarnj quotacheckuh1hhhhM hjz`ubh.}(hjz`hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjM`hhubh)}(hThe proposed patchset is the `quota `_ repair series.h](hThe proposed patchset is the }(hj`hhhNhNubj)}(hc`quota `_h]hquota}(hj`hhhNhNubah}(h]h ]h"]h$]h&]namequotajjXhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotauh1jhj`ubh)}(h[ h]h}(h]quotaah ]h"]quotaah$]h&]refurij`uh1hjyKhj`ubh repair series.}(hj`hhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjM`hhubh)}(h.. _fscounters:h]h}(h]h ]h"]h$]h&]j fscountersuh1hhM hjM`hhhhubeh}(h]j ah ]h"]quota record repairsah$]h&]uh1hhjv*hhhhhM ubh)}(hhh](h)}(h Freezing to Fix Summary Countersh]h Freezing to Fix Summary Counters}(hj`hhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj`hhhhhM ubh)}(hX Filesystem summary counters track availability of filesystem resources such as free blocks, free inodes, and allocated inodes. This information could be compiled by walking the free space and inode indexes, but this is a slow process, so XFS maintains a copy in the ondisk superblock that should reflect the ondisk metadata, at least when the filesystem has been unmounted cleanly. For performance reasons, XFS also maintains incore copies of those counters, which are key to enabling resource reservations for active transactions. Writer threads reserve the worst-case quantities of resources from the incore counter and give back whatever they don't use at commit time. It is therefore only necessary to serialize on the superblock when the superblock is being committed to disk.h]hXFilesystem summary counters track availability of filesystem resources such as free blocks, free inodes, and allocated inodes. This information could be compiled by walking the free space and inode indexes, but this is a slow process, so XFS maintains a copy in the ondisk superblock that should reflect the ondisk metadata, at least when the filesystem has been unmounted cleanly. For performance reasons, XFS also maintains incore copies of those counters, which are key to enabling resource reservations for active transactions. Writer threads reserve the worst-case quantities of resources from the incore counter and give back whatever they don’t use at commit time. 
It is therefore only necessary to serialize on the superblock when the superblock is being committed to disk.}(hj`hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj`hhubh)}(hXThe lazy superblock counter feature introduced in XFS v5 took this even further by training log recovery to recompute the summary counters from the AG headers, which eliminated the need for most transactions even to touch the superblock. The only time XFS commits the summary counters is at filesystem unmount. To reduce contention even further, the incore counter is implemented as a percpu counter, which means that each CPU is allocated a batch of blocks from a global incore counter and can satisfy small allocations from the local batch.h]hXThe lazy superblock counter feature introduced in XFS v5 took this even further by training log recovery to recompute the summary counters from the AG headers, which eliminated the need for most transactions even to touch the superblock. The only time XFS commits the summary counters is at filesystem unmount. To reduce contention even further, the incore counter is implemented as a percpu counter, which means that each CPU is allocated a batch of blocks from a global incore counter and can satisfy small allocations from the local batch.}(hj ahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj`hhubh)}(hXThe high-performance nature of the summary counters makes it difficult for online fsck to check them, since there is no way to quiesce a percpu counter while the system is running. Although online fsck can read the filesystem metadata to compute the correct values of the summary counters, there's no way to hold the value of a percpu counter stable, so it's quite possible that the counter will be out of date by the time the walk is complete. Earlier versions of online scrub would return to userspace with an incomplete scan flag, but this is not a satisfying outcome for a system administrator. For repairs, the in-memory counters must be stabilized while walking the filesystem metadata to get an accurate reading and install it in the percpu counter.h]hXThe high-performance nature of the summary counters makes it difficult for online fsck to check them, since there is no way to quiesce a percpu counter while the system is running. Although online fsck can read the filesystem metadata to compute the correct values of the summary counters, there’s no way to hold the value of a percpu counter stable, so it’s quite possible that the counter will be out of date by the time the walk is complete. Earlier versions of online scrub would return to userspace with an incomplete scan flag, but this is not a satisfying outcome for a system administrator. For repairs, the in-memory counters must be stabilized while walking the filesystem metadata to get an accurate reading and install it in the percpu counter.}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj`hhubh)}(hXTo satisfy this requirement, online fsck must prevent other programs in the system from initiating new writes to the filesystem, it must disable background garbage collection threads, and it must wait for existing writer programs to exit the kernel. Once that has been established, scrub can walk the AG free space indexes, the inode btrees, and the realtime bitmap to compute the correct value of all four summary counters. 
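A sketch of this special freeze, assuming hypothetical freeze, thaw, and calculation helpers; the list below enumerates how it differs from a regular VFS freeze::

    /*
     * Sketch: stop all writer threads, recompute the summary counters
     * from the AG headers and the realtime bitmap, and install the
     * results in the percpu counters.  The xchk_fscounters_* helpers
     * are hypothetical.
     */
    static int
    xrep_fscounters_sketch(
            struct xfs_scrub        *sc)
    {
            struct xfs_mount        *mp = sc->mp;
            uint64_t                icount, ifree, fdblocks, frextents;
            int                     error;

            error = xchk_fscounters_freeze(sc);
            if (error)
                    return error;

            /* No writers are running, so these results cannot go stale. */
            error = xchk_fscounters_calc(sc, &icount, &ifree, &fdblocks,
                            &frextents);
            if (error)
                    goto out_thaw;

            percpu_counter_set(&mp->m_icount, icount);
            percpu_counter_set(&mp->m_ifree, ifree);
            percpu_counter_set(&mp->m_fdblocks, fdblocks);
            percpu_counter_set(&mp->m_frextents, frextents);

    out_thaw:
            xchk_fscounters_thaw(sc);
            return error;
    }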
This is very similar to a filesystem freeze, though not all of the pieces are necessary:h]hXTo satisfy this requirement, online fsck must prevent other programs in the system from initiating new writes to the filesystem, it must disable background garbage collection threads, and it must wait for existing writer programs to exit the kernel. Once that has been established, scrub can walk the AG free space indexes, the inode btrees, and the realtime bitmap to compute the correct value of all four summary counters. This is very similar to a filesystem freeze, though not all of the pieces are necessary:}(hj)ahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj`hhubh)}(hhh](h)}(hThe final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to prevent other threads from thawing the filesystem, or other scrub threads from initiating another fscounters freeze. h]h)}(hThe final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to prevent other threads from thawing the filesystem, or other scrub threads from initiating another fscounters freeze.h](h.The final freeze state is set one higher than }(hj>ahhhNhNubj)}(h``SB_FREEZE_COMPLETE``h]hSB_FREEZE_COMPLETE}(hjFahhhNhNubah}(h]h ]h"]h$]h&]uh1jhj>aubhx to prevent other threads from thawing the filesystem, or other scrub threads from initiating another fscounters freeze.}(hj>ahhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj:aubah}(h]h ]h"]h$]h&]uh1hhj7ahhhhhNubh)}(hIt does not quiesce the log. h]h)}(hIt does not quiesce the log.h]hIt does not quiesce the log.}(hjhahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM" hjdaubah}(h]h ]h"]h$]h&]uh1hhj7ahhhhhNubeh}(h]h ]h"]h$]h&]jJjKuh1hhhhM hj`hhubh)}(hWith this code in place, it is now possible to pause the filesystem for just long enough to check and correct the summary counters.h]hWith this code in place, it is now possible to pause the filesystem for just long enough to check and correct the summary counters.}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM$ hj`hhubj)}(hhh]j)}(hhh](j)}(hhh]h}(h]h ]h"]h$]h&]colwidthKJuh1jhjaubj)}(hhh](j)}(hhh]j)}(hhh]h)}(h**Historical Sidebar**:h](j)}(h**Historical Sidebar**h]hHistorical Sidebar}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1jhjaubh:}(hjahhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM( hjaubah}(h]h ]h"]h$]h&]uh1jhjaubah}(h]h ]h"]h$]h&]uh1jhjaubj)}(hhh]j)}(hhh](h)}(hX The initial implementation used the actual VFS filesystem freeze mechanism to quiesce filesystem activity. With the filesystem frozen, it is possible to resolve the counter values with exact precision, but there are many problems with calling the VFS methods directly:h]hX The initial implementation used the actual VFS filesystem freeze mechanism to quiesce filesystem activity. With the filesystem frozen, it is possible to resolve the counter values with exact precision, but there are many problems with calling the VFS methods directly:}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM* hjaubh)}(hhh](h)}(h~Other programs can unfreeze the filesystem without our knowledge. This leads to incorrect scan results and incorrect repairs. h]h)}(h}Other programs can unfreeze the filesystem without our knowledge. This leads to incorrect scan results and incorrect repairs.h]h}Other programs can unfreeze the filesystem without our knowledge. This leads to incorrect scan results and incorrect repairs.}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM0 hjaubah}(h]h ]h"]h$]h&]uh1hhjaubh)}(hXlAdding an extra lock to prevent others from thawing the filesystem required the addition of a ``->freeze_super`` function to wrap ``freeze_fs()``. 
This in turn caused other subtle problems because it turns out that the VFS ``freeze_super`` and ``thaw_super`` functions can drop the last reference to the VFS superblock, and any subsequent access becomes a UAF bug! This can happen if the filesystem is unmounted while the underlying block device has frozen the filesystem. This problem could be solved by grabbing extra references to the superblock, but it felt suboptimal given the other inadequacies of this approach. h]h)}(hXkAdding an extra lock to prevent others from thawing the filesystem required the addition of a ``->freeze_super`` function to wrap ``freeze_fs()``. This in turn caused other subtle problems because it turns out that the VFS ``freeze_super`` and ``thaw_super`` functions can drop the last reference to the VFS superblock, and any subsequent access becomes a UAF bug! This can happen if the filesystem is unmounted while the underlying block device has frozen the filesystem. This problem could be solved by grabbing extra references to the superblock, but it felt suboptimal given the other inadequacies of this approach.h](h^Adding an extra lock to prevent others from thawing the filesystem required the addition of a }(hjbhhhNhNubj)}(h``->freeze_super``h]h->freeze_super}(hj bhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjbubh function to wrap }(hjbhhhNhNubj)}(h``freeze_fs()``h]h freeze_fs()}(hjbhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjbubhN. This in turn caused other subtle problems because it turns out that the VFS }(hjbhhhNhNubj)}(h``freeze_super``h]h freeze_super}(hj0bhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjbubh and }(hjbhhhNhNubj)}(h``thaw_super``h]h thaw_super}(hjBbhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjbubhXi functions can drop the last reference to the VFS superblock, and any subsequent access becomes a UAF bug! This can happen if the filesystem is unmounted while the underlying block device has frozen the filesystem. This problem could be solved by grabbing extra references to the superblock, but it felt suboptimal given the other inadequacies of this approach.}(hjbhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM3 hjbubah}(h]h ]h"]h$]h&]uh1hhjaubh)}(hThe log need not be quiesced to check the summary counters, but a VFS freeze initiates one anyway. This adds unnecessary runtime to live fscounter fsck operations. h]h)}(hThe log need not be quiesced to check the summary counters, but a VFS freeze initiates one anyway. This adds unnecessary runtime to live fscounter fsck operations.h]hThe log need not be quiesced to check the summary counters, but a VFS freeze initiates one anyway. This adds unnecessary runtime to live fscounter fsck operations.}(hjdbhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM@ hj`bubah}(h]h ]h"]h$]h&]uh1hhjaubh)}(hpQuiescing the log means that XFS flushes the (possibly incorrect) counters to disk as part of cleaning the log. h]h)}(hoQuiescing the log means that XFS flushes the (possibly incorrect) counters to disk as part of cleaning the log.h]hoQuiescing the log means that XFS flushes the (possibly incorrect) counters to disk as part of cleaning the log.}(hj|bhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMD hjxbubah}(h]h ]h"]h$]h&]uh1hhjaubh)}(hA bug in the VFS meant that freeze could complete even when sync_filesystem fails to flush the filesystem and returns an error. This bug was fixed in Linux 5.17.h]h)}(hA bug in the VFS meant that freeze could complete even when sync_filesystem fails to flush the filesystem and returns an error. 
This bug was fixed in Linux 5.17.h]hA bug in the VFS meant that freeze could complete even when sync_filesystem fails to flush the filesystem and returns an error. This bug was fixed in Linux 5.17.}(hjbhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMG hjbubah}(h]h ]h"]h$]h&]uh1hhjaubeh}(h]h ]h"]h$]h&]jJjKuh1hhhhM0 hjaubeh}(h]h ]h"]h$]h&]uh1jhjaubah}(h]h ]h"]h$]h&]uh1jhjaubeh}(h]h ]h"]h$]h&]uh1jhjaubeh}(h]h ]h"]h$]h&]colsKuh1jhjaubah}(h]h ]h"]h$]h&]uh1jhj`hhhNhNubh)}(hThe proposed patchset is the `summary counter cleanup `_ series.h](hThe proposed patchset is the }(hjbhhhNhNubj)}(hz`summary counter cleanup `_h]hsummary counter cleanup}(hjbhhhNhNubah}(h]h ]h"]h$]h&]namesummary counter cleanupjj]https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscountersuh1jhjbubh)}(h` h]h}(h]summary-counter-cleanupah ]h"]summary counter cleanupah$]h&]refurijbuh1hjyKhjbubh series.}(hjbhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhML hj`hhubeh}(h](j j`eh ]h"]( freezing to fix summary counters fscounterseh$]h&]uh1hhjv*hhhhhM j}jcj`sj}j`j`subh)}(hhh](h)}(hFull Filesystem Scansh]hFull Filesystem Scans}(hj chhhNhNubah}(h]h ]h"]h$]h&]jj* uh1hhjchhhhhMR ubh)}(hXCertain types of metadata can only be checked by walking every file in the entire filesystem to record observations and comparing the observations against what's recorded on disk. Like every other type of online repair, repairs are made by writing those observations to disk in a replacement structure and committing it atomically. However, it is not practical to shut down the entire filesystem to examine hundreds of billions of files because the downtime would be excessive. Therefore, online fsck must build the infrastructure to manage a live scan of all the files in the filesystem. There are two questions that need to be solved to perform a live walk:h]hXCertain types of metadata can only be checked by walking every file in the entire filesystem to record observations and comparing the observations against what’s recorded on disk. Like every other type of online repair, repairs are made by writing those observations to disk in a replacement structure and committing it atomically. However, it is not practical to shut down the entire filesystem to examine hundreds of billions of files because the downtime would be excessive. Therefore, online fsck must build the infrastructure to manage a live scan of all the files in the filesystem. There are two questions that need to be solved to perform a live walk:}(hjchhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMT hjchhubh)}(2hhh](h)}(h`_ in *Lions' Commentary on UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson, `"Implementation of the File System" `_, from *The UNIX Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp. 1913-4.h](h_In the original Unix filesystems of the 1970s, each directory entry contained an index number (}(hj{chhhNhNubj7)}(h *inumber*h]hinumber}(hjchhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj{cubh3) which was used as an index into an ondisk array (}(hj{chhhNhNubj7)}(h*itable*h]hitable}(hjchhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj{cubh) of fixed-size records (}(hj{chhhNhNubj7)}(h*inodes*h]hinodes}(hjchhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj{cubhe) describing a file’s attributes and its data block mapping. This system is described by J. 
Lions, }(hj{chhhNhNubj)}(hB`"inode (5659)" `_h]h“inode (5659)”}(hjchhhNhNubah}(h]h ]h"]h$]h&]name"inode (5659)"jj.http://www.lemis.com/grog/Documentation/Lions/uh1jhj{cubh)}(h1 h]h}(h] inode-5659ah ]h"]"inode (5659)"ah$]h&]refurijcuh1hjyKhj{cubh in }(hj{chhhNhNubj7)}(h(*Lions' Commentary on UNIX, 6th Edition*h]h(Lions’ Commentary on UNIX, 6th Edition}(hjchhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj{cubh, (Dept. of Computer Science, the University of New South Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson, }(hj{chhhNhNubj)}(hc`"Implementation of the File System" `_h]h'“Implementation of the File System”}(hjchhhNhNubah}(h]h ]h"]h$]h&]name#"Implementation of the File System"jj:https://archive.org/details/bstj57-6-1905/page/n8/mode/1upuh1jhj{cubh)}(h= h]h}(h]!implementation-of-the-file-systemah ]h"]#"implementation of the file system"ah$]h&]refurijcuh1hjyKhj{cubh, from }(hj{chhhNhNubj7)}(h*The UNIX Time-Sharing System*h]hThe UNIX Time-Sharing System}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1j6hj{cubh=, (The Bell System Technical Journal, July 1978), pp. 1913-4.}(hj{chhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMi hjjchhubh)}(hXXFS retains most of this design, except now inumbers are search keys over all the space in the data section filesystem. They form a continuous keyspace that can be expressed as a 64-bit integer, though the inodes themselves are sparsely distributed within the keyspace. Scans proceed in a linear fashion across the inumber keyspace, starting from ``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``. Naturally, a scan through a keyspace requires a scan cursor object to track the scan progress. Because this keyspace is sparse, this cursor contains two parts. The first part of this scan cursor object tracks the inode that will be examined next; call this the examination cursor. Somewhat less obviously, the scan cursor object must also track which parts of the keyspace have already been visited, which is critical for deciding if a concurrent filesystem update needs to be incorporated into the scan data. Call this the visited inode cursor.h](hX[XFS retains most of this design, except now inumbers are search keys over all the space in the data section filesystem. They form a continuous keyspace that can be expressed as a 64-bit integer, though the inodes themselves are sparsely distributed within the keyspace. Scans proceed in a linear fashion across the inumber keyspace, starting from }(hj'dhhhNhNubj)}(h``0x0``h]h0x0}(hj/dhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj'dubh and ending at }(hj'dhhhNhNubj)}(h``0xFFFFFFFFFFFFFFFF``h]h0xFFFFFFFFFFFFFFFF}(hjAdhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj'dubhX#. Naturally, a scan through a keyspace requires a scan cursor object to track the scan progress. Because this keyspace is sparse, this cursor contains two parts. The first part of this scan cursor object tracks the inode that will be examined next; call this the examination cursor. Somewhat less obviously, the scan cursor object must also track which parts of the keyspace have already been visited, which is critical for deciding if a concurrent filesystem update needs to be incorporated into the scan data. 
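To make the two-part cursor concrete, here is a hypothetical rendering of the scan cursor object; the in-kernel structure differs in detail::

    /*
     * Sketch of a coordinated inode scan cursor.  The examination
     * cursor names the inode to examine next; the visited cursor
     * records how much of the sparse inumber keyspace has already
     * been covered.
     */
    struct xchk_iscan_sketch {
            struct mutex    lock;           /* serializes cursor updates */
            xfs_ino_t       cursor_ino;     /* examination cursor */
            xfs_ino_t       visited_ino;    /* keyspace visited so far */
    };

    /*
     * A concurrent update is relevant to the scan only if the scan has
     * already visited that part of the keyspace; otherwise the scan
     * will pick up the new state by itself later.
     */
    static inline bool
    xchk_iscan_want_live_update(
            struct xchk_iscan_sketch        *iscan,
            xfs_ino_t                       ino)
    {
            return ino <= iscan->visited_ino;
    }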
Call this the visited inode cursor.}(hj'dhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMv hjjchhubh)}(hVAdvancing the scan cursor is a multi-step process encapsulated in ``xchk_iscan_iter``:h](hBAdvancing the scan cursor is a multi-step process encapsulated in }(hjYdhhhNhNubj)}(h``xchk_iscan_iter``h]hxchk_iscan_iter}(hjadhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjYdubh:}(hjYdhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjjchhubji)}(hhh](h)}(hLock the AGI buffer of the AG containing the inode pointed to by the visited inode cursor. This guarantees that inodes in this AG cannot be allocated or freed while advancing the cursor. h]h)}(hLock the AGI buffer of the AG containing the inode pointed to by the visited inode cursor. This guarantees that inodes in this AG cannot be allocated or freed while advancing the cursor.h]hLock the AGI buffer of the AG containing the inode pointed to by the visited inode cursor. This guarantees that inodes in this AG cannot be allocated or freed while advancing the cursor.}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj|dubah}(h]h ]h"]h$]h&]uh1hhjydhhhhhNubh)}(hUse the per-AG inode btree to look up the next inumber after the one that was just visited, since it may not be keyspace adjacent. h]h)}(hUse the per-AG inode btree to look up the next inumber after the one that was just visited, since it may not be keyspace adjacent.h]hUse the per-AG inode btree to look up the next inumber after the one that was just visited, since it may not be keyspace adjacent.}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdubah}(h]h ]h"]h$]h&]uh1hhjydhhhhhNubh)}(hXIf there are no more inodes left in this AG: a. Move the examination cursor to the point of the inumber keyspace that corresponds to the start of the next AG. b. Adjust the visited inode cursor to indicate that it has "visited" the last possible inode in the current AG's inode keyspace. XFS inumbers are segmented, so the cursor needs to be marked as having visited the entire keyspace up to just before the start of the next AG's inode keyspace. c. Unlock the AGI and return to step 1 if there are unexamined AGs in the filesystem. d. If there are no more AGs to examine, set both cursors to the end of the inumber keyspace. The scan is now complete. h](h)}(h,If there are no more inodes left in this AG:h]h,If there are no more inodes left in this AG:}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdubji)}(hhh](h)}(hoMove the examination cursor to the point of the inumber keyspace that corresponds to the start of the next AG. h]h)}(hnMove the examination cursor to the point of the inumber keyspace that corresponds to the start of the next AG.h]hnMove the examination cursor to the point of the inumber keyspace that corresponds to the start of the next AG.}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdubah}(h]h ]h"]h$]h&]uh1hhjdubh)}(hXAdjust the visited inode cursor to indicate that it has "visited" the last possible inode in the current AG's inode keyspace. XFS inumbers are segmented, so the cursor needs to be marked as having visited the entire keyspace up to just before the start of the next AG's inode keyspace. h]h)}(hXAdjust the visited inode cursor to indicate that it has "visited" the last possible inode in the current AG's inode keyspace. XFS inumbers are segmented, so the cursor needs to be marked as having visited the entire keyspace up to just before the start of the next AG's inode keyspace.h]hX%Adjust the visited inode cursor to indicate that it has “visited” the last possible inode in the current AG’s inode keyspace. 
XFS inumbers are segmented, so the cursor needs to be marked as having visited the entire keyspace up to just before the start of the next AG’s inode keyspace.}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdubah}(h]h ]h"]h$]h&]uh1hhjdubh)}(hSUnlock the AGI and return to step 1 if there are unexamined AGs in the filesystem. h]h)}(hRUnlock the AGI and return to step 1 if there are unexamined AGs in the filesystem.h]hRUnlock the AGI and return to step 1 if there are unexamined AGs in the filesystem.}(hjdhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjdubah}(h]h ]h"]h$]h&]uh1hhjdubh)}(htIf there are no more AGs to examine, set both cursors to the end of the inumber keyspace. The scan is now complete. h]h)}(hsIf there are no more AGs to examine, set both cursors to the end of the inumber keyspace. The scan is now complete.h]hsIf there are no more AGs to examine, set both cursors to the end of the inumber keyspace. The scan is now complete.}(hj ehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj eubah}(h]h ]h"]h$]h&]uh1hhjdubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhjdubeh}(h]h ]h"]h$]h&]uh1hhjydhhhNhNubh)}(hXOtherwise, there is at least one more inode to scan in this AG: a. Move the examination cursor ahead to the next inode marked as allocated by the inode btree. b. Adjust the visited inode cursor to point to the inode just prior to where the examination cursor is now. Because the scanner holds the AGI buffer lock, no inodes could have been created in the part of the inode keyspace that the visited inode cursor just advanced. h](h)}(h?Otherwise, there is at least one more inode to scan in this AG:h]h?Otherwise, there is at least one more inode to scan in this AG:}(hj1ehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj-eubji)}(hhh](h)}(h\Move the examination cursor ahead to the next inode marked as allocated by the inode btree. h]h)}(h[Move the examination cursor ahead to the next inode marked as allocated by the inode btree.h]h[Move the examination cursor ahead to the next inode marked as allocated by the inode btree.}(hjFehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjBeubah}(h]h ]h"]h$]h&]uh1hhj?eubh)}(hX Adjust the visited inode cursor to point to the inode just prior to where the examination cursor is now. Because the scanner holds the AGI buffer lock, no inodes could have been created in the part of the inode keyspace that the visited inode cursor just advanced. h]h)}(hXAdjust the visited inode cursor to point to the inode just prior to where the examination cursor is now. Because the scanner holds the AGI buffer lock, no inodes could have been created in the part of the inode keyspace that the visited inode cursor just advanced.h]hXAdjust the visited inode cursor to point to the inode just prior to where the examination cursor is now. Because the scanner holds the AGI buffer lock, no inodes could have been created in the part of the inode keyspace that the visited inode cursor just advanced.}(hj^ehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjZeubah}(h]h ]h"]h$]h&]uh1hhj?eubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhj-eubeh}(h]h ]h"]h$]h&]uh1hhjydhhhNhNubh)}(hX[Get the incore inode for the inumber of the examination cursor. By maintaining the AGI buffer lock until this point, the scanner knows that it was safe to advance the examination cursor across the entire keyspace, and that it has stabilized this next inode so that it cannot disappear from the filesystem until the scan releases the incore inode. h]h)}(hXZGet the incore inode for the inumber of the examination cursor. 
By maintaining the AGI buffer lock until this point, the scanner knows that it was safe to advance the examination cursor across the entire keyspace, and that it has stabilized this next inode so that it cannot disappear from the filesystem until the scan releases the incore inode.h]hXZGet the incore inode for the inumber of the examination cursor. By maintaining the AGI buffer lock until this point, the scanner knows that it was safe to advance the examination cursor across the entire keyspace, and that it has stabilized this next inode so that it cannot disappear from the filesystem until the scan releases the incore inode.}(hjehhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj~eubah}(h]h ]h"]h$]h&]uh1hhjydhhhhhNubh)}(h=Drop the AGI lock and return the incore inode to the caller. h]h)}(h`_ series. The first user of the new functionality is the `online quotacheck `_ series.h](hThe proposed patches are the }(hjfhhhNhNubj)}(hj`inode scanner `_h]h inode scanner}(hjfhhhNhNubah}(h]h ]h"]h$]h&]name inode scannerjjWhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscanuh1jhjfubh)}(hZ h]h}(h] inode-scannerah ]h"] inode scannerah$]h&]refurijfuh1hjyKhjfubh8 series. The first user of the new functionality is the }(hjfhhhNhNubj)}(ht`online quotacheck `_h]honline quotacheck}(hjghhhNhNubah}(h]h ]h"]h$]h&]nameonline quotacheckjj]https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheckuh1jhjfubh)}(h` h]h}(h]online-quotacheckah ]h"]online quotacheckah$]h&]refurijguh1hjyKhjfubh series.}(hjfhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjjchhubeh}(h](jO jiceh ]h"](coordinated inode scansiscaneh$]h&]uh1hhjchhhhhMg j}j2gj_csj}jicj_csubh)}(hhh](h)}(hInode Managementh]hInode Management}(hj:ghhhNhNubah}(h]h ]h"]h$]h&]jjk uh1hhj7ghhhhhM ubh)}(hXIn regular filesystem code, references to allocated XFS incore inodes are always obtained (``xfs_iget``) outside of transaction context because the creation of the incore context for an existing file does not require metadata updates. However, it is important to note that references to incore inodes obtained as part of file creation must be performed in transaction context because the filesystem must ensure the atomicity of the ondisk inode btree index updates and the initialization of the actual ondisk inode.h](h[In regular filesystem code, references to allocated XFS incore inodes are always obtained (}(hjHghhhNhNubj)}(h ``xfs_iget``h]hxfs_iget}(hjPghhhNhNubah}(h]h ]h"]h$]h&]uh1jhjHgubhX) outside of transaction context because the creation of the incore context for an existing file does not require metadata updates. However, it is important to note that references to incore inodes obtained as part of file creation must be performed in transaction context because the filesystem must ensure the atomicity of the ondisk inode btree index updates and the initialization of the actual ondisk inode.}(hjHghhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj7ghhubh)}(hReferences to incore inodes are always released (``xfs_irele``) outside of transaction context because there are a handful of activities that might require ondisk updates:h](h1References to incore inodes are always released (}(hjhghhhNhNubj)}(h ``xfs_irele``h]h xfs_irele}(hjpghhhNhNubah}(h]h ]h"]h$]h&]uh1jhjhgubhm) outside of transaction context because there are a handful of activities that might require ondisk updates:}(hjhghhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj7ghhubh)}(hhh](h)}(hSThe VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode release. 
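Returning to the coordinated scan, a checking function might drive ``xchk_iscan_iter`` with a loop like this sketch, assuming (this is an assumption, not a statement of the kernel's exact convention) that the iterator returns 1 with a stabilized inode, 0 at the end of the inumber keyspace, and a negative errno on failure::

    /* Sketch: visit every allocated inode in the filesystem. */
    static int
    xchk_scan_everything(
            struct xfs_scrub        *sc,
            struct xchk_iscan       *iscan)
    {
            struct xfs_inode        *ip;
            int                     error;

            while ((error = xchk_iscan_iter(iscan, &ip)) == 1) {
                    error = xchk_gather_observations(sc, ip); /* hypothetical */
                    xchk_irele(sc, ip);
                    if (error)
                            break;
            }
            return error < 0 ? error : 0;
    }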
h]h)}(hRThe VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode release.h](h6The VFS may decide to kick off writeback as part of a }(hjghhhNhNubj)}(h ``DONTCACHE``h]h DONTCACHE}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1jhjgubh inode release.}(hjghhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hjgubah}(h]h ]h"]h$]h&]uh1hhjghhhhhNubh)}(h2Speculative preallocations need to be unreserved. h]h)}(h1Speculative preallocations need to be unreserved.h]h1Speculative preallocations need to be unreserved.}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjgubah}(h]h ]h"]h$]h&]uh1hhjghhhhhNubh)}(hAn unlinked file may have lost its last reference, in which case the entire file must be inactivated, which involves releasing all of its resources in the ondisk metadata and freeing the inode. h]h)}(hAn unlinked file may have lost its last reference, in which case the entire file must be inactivated, which involves releasing all of its resources in the ondisk metadata and freeing the inode.h]hAn unlinked file may have lost its last reference, in which case the entire file must be inactivated, which involves releasing all of its resources in the ondisk metadata and freeing the inode.}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hjgubah}(h]h ]h"]h$]h&]uh1hhjghhhhhNubeh}(h]h ]h"]h$]h&]jJjKuh1hhhhM hj7ghhubh)}(hXThese activities are collectively called inode inactivation. Inactivation has two parts -- the VFS part, which initiates writeback on all dirty file pages, and the XFS part, which cleans up XFS-specific information and frees the inode if it was unlinked. If the inode is unlinked (or unconnected after a file handle operation), the kernel drops the inode into the inactivation machinery immediately.h]hXThese activities are collectively called inode inactivation. Inactivation has two parts -- the VFS part, which initiates writeback on all dirty file pages, and the XFS part, which cleans up XFS-specific information and frees the inode if it was unlinked. If the inode is unlinked (or unconnected after a file handle operation), the kernel drops the inode into the inactivation machinery immediately.}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj7ghhubh)}(hbDuring normal operation, resource acquisition for an update follows this order to avoid deadlocks:h]hbDuring normal operation, resource acquisition for an update follows this order to avoid deadlocks:}(hjghhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM hj7ghhubji)}(hhh](h)}(hInode reference (``iget``). h]h)}(hInode reference (``iget``).h](hInode reference (}(hjhhhhNhNubj)}(h``iget``h]higet}(hjhhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjhubh).}(hjhhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj hubah}(h]h ]h"]h$]h&]uh1hhjhhhhhhNubh)}(hFFilesystem freeze protection, if repairing (``mnt_want_write_file``). 
h]h)}(hEFilesystem freeze protection, if repairing (``mnt_want_write_file``).h](h,Filesystem freeze protection, if repairing (}(hj8hhhhNhNubj)}(h``mnt_want_write_file``h]hmnt_want_write_file}(hj@hhhhNhNubah}(h]h ]h"]h$]h&]uh1jhj8hubh).}(hj8hhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM hj4hubah}(h]h ]h"]h$]h&]uh1hhjhhhhhhNubh)}(h`_ and `dir iget usage `_.h](h"Proposed patchsets include fixing }(hjQjhhhNhNubj)}(hr`scrub iget usage `_h]hscrub iget usage}(hjYjhhhNhNubah}(h]h ]h"]h$]h&]namescrub iget usagejj\https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixesuh1jhjQjubh)}(h_ h]h}(h]scrub-iget-usageah ]h"]scrub iget usageah$]h&]refurijijuh1hjyKhjQjubh and }(hjQjhhhNhNubj)}(ht`dir iget usage `_h]hdir iget usage}(hj{jhhhNhNubah}(h]h ]h"]h$]h&]namedir iget usagejj`https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixesuh1jhjQjubh)}(hc h]h}(h]dir-iget-usageah ]h"]dir iget usageah$]h&]refurijjuh1hjyKhjQjubh.}(hjQjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM8 hjihhubh)}(h .. _ilocking:h]h}(h]h ]h"]h$]h&]jilockinguh1hhM> hjihhhhubeh}(h]j ah ]h"]iget and irele during a scrubah$]h&]uh1hhj7ghhhhhM ubh)}(hhh](h)}(hLocking Inodesh]hLocking Inodes}(hjjhhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhjjhhhhhMA ubh)}(hXIn regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks in a well-known order: parent → child when updating the directory tree, and in numerical order of the addresses of their ``struct inode`` object otherwise. For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page faults. If two MMAPLOCKs must be acquired, they are acquired in numerical order of the addresses of their ``struct address_space`` objects. Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be acquired before transactions are allocated. If two ILOCKs must be acquired, they are acquired in inumber order.h](hIn regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks in a well-known order: parent → child when updating the directory tree, and in numerical order of the addresses of their }(hjjhhhNhNubj)}(h``struct inode``h]h struct inode}(hjjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjjubh object otherwise. For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page faults. If two MMAPLOCKs must be acquired, they are acquired in numerical order of the addresses of their }(hjjhhhNhNubj)}(h``struct address_space``h]hstruct address_space}(hjjhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjjubh objects. Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be acquired before transactions are allocated. If two ILOCKs must be acquired, they are acquired in inumber order.}(hjjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMC hjjhhubh)}(hXInode lock acquisition must be done carefully during a coordinated inode scan. Online fsck cannot abide these conventions, because for a directory tree scanner, the scrub process holds the IOLOCK of the file being scanned and it needs to take the IOLOCK of the file at the other end of the directory link. If the directory tree is corrupt because it contains a cycle, ``xfs_scrub`` cannot use the regular inode locking functions and avoid becoming trapped in an ABBA deadlock.h](hXpInode lock acquisition must be done carefully during a coordinated inode scan. 
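Consistent with the ordering rules just listed and with the lock ordering described for regular filesystem code, a repair setup path might acquire its resources in roughly this order. Error unwinding is elided, and the shape of the function is illustrative only::

    /* Sketch: resource acquisition order for a repair. */
    static int
    xrep_setup_sketch(
            struct xfs_scrub        *sc,
            xfs_ino_t               ino)
    {
            int                     error;

            /* 1: inode reference. */
            error = xfs_iget(sc->mp, NULL, ino, 0, 0, &sc->ip);
            if (error)
                    return error;

            /* 2: freeze protection, since this is a repair. */
            error = mnt_want_write_file(sc->file);
            if (error)
                    return error;

            /* IOLOCK and MMAPLOCK before allocating a transaction... */
            xfs_ilock(sc->ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);

            error = xfs_trans_alloc_empty(sc->mp, &sc->tp);
            if (error)
                    return error;

            /* ...and the ILOCK last, for the metadata updates. */
            xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
            return 0;
    }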
Online fsck cannot abide these conventions, because for a directory tree scanner, the scrub process holds the IOLOCK of the file being scanned and it needs to take the IOLOCK of the file at the other end of the directory link. If the directory tree is corrupt because it contains a cycle, }(hjjhhhNhNubj)}(h ``xfs_scrub``h]h xfs_scrub}(hjkhhhNhNubah}(h]h ]h"]h$]h&]uh1jhjjubh_ cannot use the regular inode locking functions and avoid becoming trapped in an ABBA deadlock.}(hjjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMN hjjhhubh)}(hX[Solving both of these problems is straightforward -- any time online fsck needs to take a second lock of the same class, it uses trylock to avoid an ABBA deadlock. If the trylock fails, scrub drops all inode locks and uses trylock loops to (re)acquire all necessary resources. Trylock loops enable scrub to check for pending fatal signals, which is how scrub avoids deadlocking the filesystem or becoming an unresponsive process. However, trylock loops mean that online fsck must be prepared to measure the resource being scrubbed before and after the lock cycle to detect changes and react accordingly.h]hX[Solving both of these problems is straightforward -- any time online fsck needs to take a second lock of the same class, it uses trylock to avoid an ABBA deadlock. If the trylock fails, scrub drops all inode locks and uses trylock loops to (re)acquire all necessary resources. Trylock loops enable scrub to check for pending fatal signals, which is how scrub avoids deadlocking the filesystem or becoming an unresponsive process. However, trylock loops mean that online fsck must be prepared to measure the resource being scrubbed before and after the lock cycle to detect changes and react accordingly.}(hjkhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMV hjjhhubh)}(h.. _dirparent:h]h}(h]h ]h"]h$]h&]j dirparentuh1hhMa hjjhhhhubeh}(h](j jjeh ]h"](locking inodesilockingeh$]h&]uh1hhj7ghhhhhMA j}j6kjjsj}jjjjsubh)}(hhh](h)}(h&Case Study: Finding a Directory Parenth]h&Case Study: Finding a Directory Parent}(hj>khhhNhNubah}(h]h ]h"]h$]h&]jj uh1hhj;khhhhhMd ubh)}(hXConsider the directory parent pointer repair code as an example. Online fsck must verify that the dotdot dirent of a directory points up to a parent directory, and that the parent directory contains exactly one dirent pointing down to the child directory. Fully validating this relationship (and repairing it if possible) requires a walk of every directory on the filesystem while holding the child locked, and while updates to the directory tree are being made. 
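The trylock strategy might look like the following sketch; ``xfs_ilock_nowait`` and ``xchk_should_terminate`` exist in the kernel, though the loop shown here is illustrative::

    /*
     * Sketch: take a second ILOCK of the same class without risking
     * an ABBA deadlock.  On failure, drop everything, check for a
     * pending fatal signal, and try again.
     */
    static int
    xchk_ilock_two(
            struct xfs_scrub        *sc,
            struct xfs_inode        *child,
            struct xfs_inode        *parent)
    {
            int                     error = 0;

            while (true) {
                    xfs_ilock(child, XFS_ILOCK_EXCL);
                    if (xfs_ilock_nowait(parent, XFS_ILOCK_EXCL))
                            return 0;       /* both locks acquired */

                    xfs_iunlock(child, XFS_ILOCK_EXCL);
                    if (xchk_should_terminate(sc, &error))
                            return error;   /* fatal signal pending */
                    delay(1);               /* brief backoff */
            }
    }

After a lock cycle like this, the caller must re-measure the resource being scrubbed, because it may have changed while it was unlocked.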
.. _dirparent:

Case Study: Finding a Directory Parent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider the directory parent pointer repair code as an example.
Online fsck must verify that the dotdot dirent of a directory points up to a
parent directory, and that the parent directory contains exactly one dirent
pointing down to the child directory.
Fully validating this relationship (and repairing it if possible) requires a
walk of every directory on the filesystem while holding the child locked,
and while updates to the directory tree are being made.
The coordinated inode scan provides a way to walk the filesystem without the
possibility of missing an inode.
The child directory is kept locked to prevent updates to the dotdot dirent,
but if the scanner fails to lock a parent, it can drop and relock both the
child and the prospective parent.
If the dotdot entry changes while the directory is unlocked, then a move or
rename operation must have changed the child's parentage, and the scan can
exit early.
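A sketch of that early-exit check might look like this; the helper is
hypothetical, though ``xfs_dir_lookup`` and ``xfs_name_dotdot`` are the
regular directory code::

    /*
     * Sketch: after a lock cycle, decide whether a rename raced with the
     * scan by reloading the child's dotdot entry.
     */
    static int example_recheck_dotdot(struct xfs_scrub *sc,
                                      xfs_ino_t old_parent_ino)
    {
            xfs_ino_t parent_ino;
            int error;

            error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot,
                            &parent_ino, NULL);
            if (error)
                    return error;

            /* Someone moved the child; the scan can exit early. */
            if (parent_ino != old_parent_ino)
                    return -ECANCELED;
            return 0;
    }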
The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

.. _fshooks:

Filesystem Hooks
````````````````

The second piece of support that online fsck functions need during a full
filesystem scan is the ability to stay informed about updates being made by
other threads in the filesystem, since comparisons against the past are
useless in a dynamic environment.
Two pieces of Linux kernel infrastructure enable online fsck to monitor
regular filesystem operations: filesystem hooks and
:ref:`static keys<jump_labels>`.

Filesystem hooks convey information about an ongoing filesystem operation to
a downstream consumer.
In this case, the downstream consumer is always an online fsck function.
Because multiple fsck functions can run in parallel, online fsck uses the
Linux notifier call chain facility to dispatch updates to any number of
interested fsck processes.
Call chains are a dynamic list, which means that they can be configured at
run time.
Because these hooks are private to the XFS module, the information passed
along contains exactly what the checking function needs to update its
observations.

The current implementation of XFS hooks uses SRCU notifier chains to reduce
the impact to highly threaded workloads.
Regular blocking notifier chains use a rwsem and seem to have a much lower
overhead for single-threaded applications.
However, it may turn out that the combination of blocking chains and static
keys is a more performant combination; more study is needed here.

The following pieces are necessary to hook a certain point in the filesystem
(a lifecycle sketch appears at the end of this list's discussion):

- A ``struct xfs_hooks`` object must be embedded in a convenient place such
  as a well-known incore filesystem object.

- Each hook must define an action code and a structure containing more
  context about the action.

- Hook providers should provide appropriate wrapper functions and structs
  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of
  type checking to ensure correct usage.

- A callsite in the regular filesystem code must be chosen to call
  ``xfs_hooks_call`` with the action code and data structure.
  This place should be adjacent to (and not earlier than) the place where
  the filesystem update is committed to the transaction.
  In general, when the filesystem calls a hook chain, it should be able to
  handle sleeping and should not be vulnerable to memory reclaim or locking
  recursion.
  However, the exact requirements are very dependent on the context of the
  hook caller and the callee.
- The online fsck function should define a structure to hold scan data, a
  lock to coordinate access to the scan data, and a ``struct xfs_hook``
  object.
  The scanner function and the regular filesystem code must acquire
  resources in the same order; see the next section for details.

- The online fsck code must contain a C function to catch the hook action
  code and data structure.
  If the object being updated has already been visited by the scan, then the
  hook information must be applied to the scan data.

- Prior to unlocking inodes to start the scan, online fsck must call
  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
  ``xfs_hooks_add`` to enable the hook.
- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan
  is complete.

The number of hooks should be kept to a minimum to reduce complexity.
Static keys are used to reduce the overhead of filesystem hooks to nearly
zero when online fsck is not running.
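Putting the pieces together, the lifecycle of a hook during a scan might
look like the following sketch.
The scan structure, the handler, and the ``m_example_hooks`` chain are
hypothetical, and the setup/add/del helper signatures are assumptions::

    struct example_scan {
            struct mutex            lock;   /* protects the scan data */
            struct xfs_hook         hook;   /* live update notifier */
            /* ... shadow observations ... */
    };

    static int example_run_scan(struct xfs_mount *mp,
                                struct example_scan *scan)
    {
            int error;

            /* Enable the hook before the scan gives up its inode locks. */
            xfs_hooks_setup(&scan->hook, example_hook_fn);
            error = xfs_hooks_add(&mp->m_example_hooks, &scan->hook);
            if (error)
                    return error;

            error = example_scan_all_inodes(mp, scan);

            /* Disable the hook once the scan data are no longer needed. */
            xfs_hooks_del(&mp->m_example_hooks, &scan->hook);
            return error;
    }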
.. _liveupdate:

Live Updates During a Scan
``````````````````````````

The code paths of the online fsck scanning code and the
:ref:`hooked <fshooks>` filesystem code look like this::

            other program
                  ↓
            inode lock ←────────────────────┐
                  ↓                         │
            AG header lock                  │
                  ↓                         │
            filesystem function             │
                  ↓                         │
            notifier call chain             │    same
                  ↓                         ├─── inode
            scrub hook function             │    lock
                  ↓                         │
            scan data mutex ←──┐    same    │
                  ↓            ├─── scan    │
            update scan data   │    lock    │
                  ↑            │            │
            scan data mutex ←──┘            │
                  ↑                         │
            inode lock ←────────────────────┘
                  ↑
            scrub function
                  ↑
            inode scanner
                  ↑
            xfs_scrub

These rules must be followed to ensure correct interactions between the
checking code and the code making an update to the filesystem (a sketch of a
rule-abiding hook function follows the list):

- Prior to invoking the notifier call chain, the filesystem function being
  hooked must acquire the same lock that the scrub scanning function
  acquires to scan the inode.

- The scanning function and the scrub hook function must coordinate access
  to the scan data by acquiring a lock on the scan data.

- Scrub hook functions must not add the live update information to the scan
  observations unless the inode being updated has already been scanned.
  The scan coordinator has a helper predicate
  (``xchk_iscan_want_live_update``) for this.

- Scrub hook functions must not change the caller's state, including the
  transaction that it is running.
  They must not acquire any resources that might conflict with the
  filesystem function being hooked.

- The hook function can abort the inode scan to avoid breaking the other
  rules.
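A rule-abiding hook function, sketched below, touches only the shadow scan
data, and only under the scan data lock.
The update payload and scan types are hypothetical stand-ins, and the
embedded ``notifier_block`` layout is an assumption::

    static int example_hook_fn(struct notifier_block *nb,
                               unsigned long action, void *data)
    {
            struct example_update *p = data;
            struct example_scan *scan =
                    container_of(nb, struct example_scan, hook.nb);

            /* Ignore updates to files the scan has not visited yet. */
            if (!xchk_iscan_want_live_update(&scan->iscan, p->ino))
                    return NOTIFY_DONE;

            /* Fold the live update into the shadow observations. */
            mutex_lock(&scan->lock);
            example_apply_update(scan, action, p);
            mutex_unlock(&scan->lock);

            return NOTIFY_DONE;
    }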
The inode scan APIs are pretty simple (a usage sketch follows the list):

- ``xchk_iscan_start`` starts a scan.

- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
  returns zero if there is nothing left to scan.

- ``xchk_iscan_want_live_update`` decides if an inode has already been
  visited in the scan.
  This is critical for hook functions to decide if they need to update the
  in-memory scan information.

- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
  scan.

- ``xchk_iscan_teardown`` finishes the scan.
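A typical scan loop built from these APIs might look like this sketch;
argument lists are abbreviated, and ``xchk_iscan_iter`` is assumed to return
1 when it produces an inode and 0 at the end of the scan::

    static int example_scan_all_inodes(struct xfs_scrub *sc,
                                       struct example_scan *scan)
    {
            struct xfs_inode *ip;
            int error;

            xchk_iscan_start(sc, &scan->iscan);

            while ((error = xchk_iscan_iter(&scan->iscan, &ip)) == 1) {
                    error = example_record_inode(scan, ip);

                    /* Live updates now apply to this file. */
                    xchk_iscan_mark_visited(&scan->iscan, ip);
                    xchk_irele(sc, ip);
                    if (error)
                            break;
            }

            xchk_iscan_teardown(&scan->iscan);
            return error < 0 ? error : 0;
    }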
This functionality is also a part of the
`inode scanner
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
series.

.. _quotacheck:

Case Study: Quota Counter Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is useful to compare the mount time quotacheck code to the online repair
quotacheck code.
Mount time quotacheck does not have to contend with concurrent operations,
so it does the following:

1. Make sure the ondisk dquots are in good enough shape that all the incore
   dquots will actually load, and zero the resource usage counters in the
   ondisk buffer.

2. Walk every inode in the filesystem.
   Add each file's resource usage to the incore dquot.

3. Walk each incore dquot.
   If the incore dquot is not being flushed, add the ondisk buffer backing
   the incore dquot to a delayed write (delwri) list.

4. Write the buffer list to disk.

Like most online fsck functions, online quotacheck can't write to regular
filesystem objects until the newly collected metadata reflect all filesystem
state.
Therefore, online quotacheck records file resource usage to a shadow dquot
index implemented with a sparse ``xfarray``, and only writes to the real
dquots once the scan is complete.
Handling transactional updates is tricky because quota resource usage
updates are handled in phases to minimize contention on dquots:

1. The inodes involved are joined and locked to a transaction.

2. For each dquot attached to the file:

   a. The dquot is locked.

   b. A quota reservation is added to the dquot's resource usage.
      The reservation is recorded in the transaction.

   c. The dquot is unlocked.

3. Changes in actual quota usage are tracked in the transaction.

4. At transaction commit time, each dquot is examined again:

   a. The dquot is locked again.

   b. Quota usage changes are logged and unused reservation is given back to
      the dquot.

   c. The dquot is unlocked.
For online quotacheck, hooks are placed in steps 2 and 4.
The step 2 hook creates a shadow version of the transaction dquot context
(``dqtrx``) that operates in a similar manner to the regular code.
The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
Notice that both hooks are called with the inode locked, which is how the
live update coordinates with the inode scanner.
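The step 4 hook might fold a shadow ``dqtrx`` into the shadow dquot index
roughly as follows; the record layout, helper names, and hook payload are
all hypothetical::

    struct example_shadow_dquot {
            uint64_t        bcount;         /* observed data blocks */
            uint64_t        icount;         /* observed inodes */
            uint64_t        rtbcount;       /* observed rt blocks */
    };

    static int example_apply_dqtrx(struct example_qcheck *qc, xfs_dqid_t id,
                                   const struct example_dqtrx *dqtrx)
    {
            struct example_shadow_dquot shadow;
            int error;

            mutex_lock(&qc->lock);
            /* Missing records load as zeroes from the sparse xfarray. */
            error = xfarray_load_sparse(qc->shadow, id, &shadow);
            if (!error) {
                    /* Commit the deltas observed at transaction commit. */
                    shadow.bcount += dqtrx->bcount_delta;
                    shadow.icount += dqtrx->icount_delta;
                    shadow.rtbcount += dqtrx->rtbcount_delta;
                    error = xfarray_store(qc->shadow, id, &shadow);
            }
            mutex_unlock(&qc->lock);
            return error;
    }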
The quotacheck scan looks like this:

1. Set up a coordinated inode scan.

2. For each inode returned by the inode scan iterator:

   a. Grab and lock the inode.

   b. Determine that inode's resource usage (data blocks, inode counts,
      realtime blocks) and add that to the shadow dquots for the user,
      group, and project ids associated with the inode.

   c. Unlock and release the inode.

3. For each dquot in the system:

   a. Grab and lock the dquot.

   b. Check the dquot against the shadow dquots created by the scan and
      updated by the live hooks (see the comparison sketch below).

Live updates are key to being able to walk every quota record without
needing to hold any locks for a long duration.
If repairs are desired, the real and shadow dquots are locked and their
resource counts are set to the values in the shadow dquot.
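Reusing the hypothetical shadow record from the previous sketch, the
comparison in step 3b might look like this; the caller is assumed to hold
the dquot lock from step 3a::

    static int example_compare_dquot(struct example_qcheck *qc,
                                     struct xfs_dquot *dqp)
    {
            struct example_shadow_dquot shadow;
            int error;

            mutex_lock(&qc->lock);
            error = xfarray_load_sparse(qc->shadow, dqp->q_id, &shadow);
            mutex_unlock(&qc->lock);
            if (error)
                    return error;

            /* Any disagreement means the dquot counters are wrong. */
            if (dqp->q_blk.count != shadow.bcount ||
                dqp->q_ino.count != shadow.icount ||
                dqp->q_rtb.count != shadow.rtbcount)
                    example_set_corrupt(qc);
            return 0;
    }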
The proposed patchset is the
`online quotacheck
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
series.

.. _nlinks:

Case Study: File Link Count Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File link count checking also uses live update hooks.
The coordinated inode scanner is used to visit all directories on the
filesystem, and per-file link count records are stored in a sparse
``xfarray`` indexed by inumber.
During the scanning phase, each entry in a directory generates observation
data as follows:

1. If the entry is a dotdot (``'..'``) entry of the root directory, the
   directory's parent link count is bumped because the root directory's
   dotdot entry is self referential.

2. If the entry is a dotdot entry of a subdirectory, the parent's backref
   count is bumped.

3. If the entry is neither a dot nor a dotdot entry, the target file's
   parent count is bumped.

4. If the target is a subdirectory, the parent's child link count is bumped.

A crucial point to understand about how the link count inode scanner
interacts with the live update hooks is that the scan cursor tracks which
*parent* directories have been scanned.
In other words, the live updates ignore any update about ``A → B`` when A
has not been scanned, even if B has been scanned.
Furthermore, a subdirectory A with a dotdot entry pointing back to B is
accounted as a backref counter in the shadow data for A, since child dotdot
entries affect the parent's link count.
Live update hooks are carefully placed in all parts of the filesystem that
create, change, or remove directory entries, since those operations involve
bumplink and droplink.

For any file, the correct link count is the number of parents plus the
number of child subdirectories.
Non-directories never have children of any kind.
The backref information is used to detect inconsistencies in the number of
links pointing to child subdirectories and the number of dotdot entries
pointing back.

After the scan completes, the link count of each file can be checked by
locking both the inode and the shadow data, and comparing the link counts.
A second coordinated inode scan cursor is used for comparisons.
Live updates are key to being able to walk every inode without needing to
hold any locks between inodes.
If repairs are desired, the inode's link count is set to the value in the
shadow information.
If no parents are found, the file must be
:ref:`reparented <orphanage>` to the orphanage to prevent the file from
being lost forever.
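The comparison itself reduces to simple arithmetic, as in this sketch with a
hypothetical shadow record::

    struct example_nlink_rec {
            uint32_t        parents;        /* dirents pointing here */
            uint32_t        backrefs;       /* dotdots pointing back */
            uint32_t        children;       /* subdirectories inside */
    };

    static bool example_nlink_ok(struct xfs_inode *ip,
                                 const struct example_nlink_rec *rec)
    {
            /* Correct link count: parents plus child subdirectories. */
            uint32_t expected = rec->parents + rec->children;

            /* Each child must point back with exactly one dotdot entry. */
            if (S_ISDIR(VFS_I(ip)->i_mode) &&
                rec->backrefs != rec->children)
                    return false;

            return VFS_I(ip)->i_nlink == expected;
    }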
The proposed patchset is the
`file link count repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
series.

.. _rmap_repair:

Case Study: Rebuilding Reverse Mapping Records
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Most repair functions follow the same pattern: lock filesystem resources,
walk the surviving ondisk metadata looking for replacement metadata records,
and use an :ref:`in-memory array <xfarray>` to store the gathered
observations.
The primary advantage of this approach is the simplicity and modularity of
the repair code -- code and data are entirely contained within the scrub
module, do not require hooks in the main filesystem, and are usually the
most efficient in memory use.
A secondary advantage of this repair approach is atomicity -- once the
kernel decides a structure is corrupt, no other threads can access the
metadata until the kernel finishes repairing and revalidating the metadata.

For repairs going on within a shard of the filesystem, these advantages
outweigh the delays inherent in locking the shard while repairing parts of
the shard.
Unfortunately, repairs to the reverse mapping btree cannot use the
"standard" btree repair strategy because it must scan every space mapping of
every fork of every file in the filesystem, and the filesystem cannot stop.
Therefore, rmap repair foregoes atomicity between scrub and repair.
It combines a :ref:`coordinated inode scanner <iscan>`,
:ref:`live update hooks <liveupdate>`, and an
:ref:`in-memory rmap btree <xfbtree>` to complete the scan for reverse
mapping records.

1. Set up an xfbtree to stage rmap records.

2. While holding the locks on the AGI and AGF buffers acquired during the
   scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
   staging extents, and the internal log.
3. Set up an inode scanner.

4. Hook into rmap updates for the AG being repaired so that the live scan
   data can receive updates to the rmap btree from the rest of the
   filesystem during the file scan.

5. For each space mapping found in either fork of each file scanned, decide
   if the mapping matches the AG of interest.
   If so (see the staging sketch after this list):

   a. Create a btree cursor for the in-memory btree.

   b. Use the rmap code to add the record to the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.

6. For each live update received via the hook, decide if the owner has
   already been scanned.
   If so, apply the live update into the scan data:

   a. Create a btree cursor for the in-memory btree.
   b. Replay the operation into the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.
      This is performed with an empty transaction to avoid changing the
      caller's state.

7. When the inode scan finishes, create a new scrub transaction and relock
   the two AG headers.

8. Compute the new btree geometry using the number of rmap records in the
   shadow btree, like all other btree rebuilding functions.

9. Allocate the number of blocks computed in the previous step.

10. Perform the usual btree bulk loading and commit to install the new rmap
    btree.

11. Reap the old rmap btree blocks as discussed in the case study about how
    to :ref:`reap after rmap btree repair <rmap_reap>`.
12. Free the xfbtree now that it is not needed.
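Steps 5 and 6 can share a staging helper, which might look like the
following sketch.
The scan structure and error handling are simplified assumptions;
``xfs_trans_alloc_empty``, the in-memory rmap btree cursor, and the xfbtree
commit helpers are the facilities described above, with assumed signatures::

    static int example_stage_rmap(struct example_rmap_scan *rs,
                                  struct xfs_rmap_irec *rec)
    {
            struct xfs_btree_cur *cur;
            struct xfs_trans *tp;
            int error;

            /* Empty transaction so staging never alters caller state. */
            error = xfs_trans_alloc_empty(rs->sc->mp, &tp);
            if (error)
                    return error;

            /* 5a/6a: cursor for the xfile-backed in-memory btree. */
            cur = xfs_rmapbt_mem_cursor(rs->sc->sa.pag, tp, rs->xfbt);

            /* 5b/6b: reuse the regular rmap code to stage the record. */
            error = xfs_rmap_map_raw(cur, rec);
            xfs_btree_del_cursor(cur, error);

            /* 5c/6c: write the shadow btree changes to the xfile. */
            if (error)
                    xfbtree_trans_cancel(rs->xfbt, tp);
            else
                    error = xfbtree_trans_commit(rs->xfbt, tp);

            xfs_trans_cancel(tp);
            return error;
    }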
The proposed patchset is the
`rmap repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
series.

Staging Repairs with Temporary Files on Disk
--------------------------------------------

XFS stores a substantial amount of metadata in file forks: directories,
extended attributes, symbolic link targets, free space bitmaps and summary
information for the realtime volume, and quota records.
File forks map 64-bit logical file fork space extents to physical storage
space extents, similar to how a memory management unit maps 64-bit virtual
addresses to physical memory addresses.
Therefore, file-based tree structures (such as directories and extended
attributes) use blocks mapped in the file fork offset address space that
point to other blocks mapped within that same address space, and file-based
linear structures (such as bitmaps and quota records) compute array element
offsets in the file fork offset address space.

Because file forks can consume as much space as the entire filesystem,
repairs cannot be staged in memory, even when a paging scheme is available.
Therefore, online repair of file-based metadata creates a temporary file in
the XFS filesystem, writes a new structure at the correct offsets into the
temporary file, and atomically exchanges all file fork mappings (and hence
the fork contents) to commit the repair.
Once the repair is complete, the old fork can be reaped as necessary; if the
system goes down during the reap, the iunlink code will delete the blocks
during log recovery.

**Note**: All space usage and inode indices in the filesystem *must* be
consistent to use a temporary file safely!
This dependency is the reason why online repair can only use pageable kernel
memory to stage ondisk space usage information.

Exchanging metadata file mappings with a temporary file requires the owner
field of the block headers to match the file being repaired and not the
temporary file.
The directory, extended attribute, and symbolic link functions were all
modified to allow callers to specify owner numbers explicitly.

There is a downside to the reaping process -- if the system crashes during
the reap phase and the fork extents are crosslinked, the iunlink processing
will fail because freeing space will find the extra reverse mappings and
abort.

Temporary files created for repair are similar to ``O_TMPFILE`` files
created by userspace.
They are not linked into a directory and the entire file will be reaped when
the last reference to the file is lost.
The key differences are that these files must have no access permission
outside the kernel at all, they must be specially marked to prevent them
from being opened by handle, and they must never be linked into the
directory tree.

**Historical Sidebar**:

In the initial iteration of file metadata repair, the damaged metadata
blocks would be scanned for salvageable data; the extents in the file fork
would be reaped; and then a new structure would be built in its place.
This strategy did not survive the introduction of the atomic repair
requirement expressed earlier in this document.

The second iteration explored building a second structure at a high offset
in the fork from the salvage data, reaping the old extents, and using a
``COLLAPSE_RANGE`` operation to slide the new extents into place.

This had many drawbacks:

- Array structures are linearly addressed, and the regular filesystem
  codebase does not have the concept of a linear offset that could be
  applied to the record offset computation to build an alternate copy.

- Extended attributes are allowed to use the entire attr fork offset address
  space.

- Even if repair could build an alternate copy of a data structure in a
  different part of the fork address space, the atomic repair commit
  requirement means that online repair would have to be able to perform a
  log assisted ``COLLAPSE_RANGE`` operation to ensure that the old structure
  was completely replaced.
- A crash after construction of the secondary tree but before the range
  collapse would leave unreachable blocks in the file fork.
  This would likely confuse things further.

- Reaping blocks after a repair is not a simple operation, and initiating a
  reap operation from a restarted range collapse operation during log
  recovery is daunting.

- Directory entry blocks and quota records record the file fork offset in
  the header area of each block.
  An atomic range collapse operation would have to rewrite this part of each
  block header.
  Rewriting a single field in block headers is not a huge problem, but it's
  something to be aware of.

- Each block in a directory or extended attributes btree index contains
  sibling and child block pointers.
Using a Temporary File
``````````````````````

Online repair code should use the ``xrep_tempfile_create`` function to
create a temporary file inside the filesystem.
This allocates an inode, marks the in-core inode private, and attaches it
to the scrub context.
These files are hidden from userspace, may not be added to the directory
tree, and must be kept private.

Temporary files only use two inode locks: the IOLOCK and the ILOCK.
The MMAPLOCK is not needed here, because there must not be page faults
from userspace for data fork blocks.
The usage patterns of these two locks are the same as for any other XFS
file -- access to file data is controlled via the IOLOCK, and access to
file metadata is controlled via the ILOCK.
Locking helpers are provided so that the temporary file and its lock
state can be cleaned up by the scrub context.
To comply with the nested locking strategy laid out in the
:ref:`inode locking <ilocking>` section, it is recommended that scrub
functions use the xrep_tempfile_ilock*_nowait lock helpers.
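The following sketch shows how a repair function might put these pieces
together.
It is illustrative only: the exact signatures of ``xrep_tempfile_create``
and the ``xrep_tempfile_ilock*_nowait`` helpers named above are
assumptions here, not a statement of the final API.

.. code-block:: c

    /*
     * Sketch only: create and lock a temporary file for a repair.
     * The helper signatures are assumed for illustration.
     */
    STATIC int
    xrep_example_setup_tempfile(
            struct xfs_scrub        *sc)
    {
            int                     error;

            /* Allocate a private, unlinked temporary regular file. */
            error = xrep_tempfile_create(sc, S_IFREG);
            if (error)
                    return error;

            /*
             * Take the temporary file's ILOCK without blocking, per
             * the nested locking strategy for scrub functions.
             */
            if (!xrep_tempfile_ilock_nowait(sc))
                    return -EDEADLOCK;

            return 0;
    }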
Data can be written to a temporary file by two means:

1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
   temporary file from an xfile.

2. The regular directory, symbolic link, and extended attribute functions
   can be used to write to the temporary file.

Once a good copy of a data file has been constructed in a temporary file,
it must be conveyed to the file being repaired, which is the topic of the
next section.

The proposed patches are in the
`repair temporary files
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
series.

Logged File Content Exchanges
-----------------------------

Once repair builds a temporary file with a new data structure written
into it, it must commit the new changes into the existing file.
It is not possible to swap the inumbers of two files, so instead the new
metadata must replace the old.
This suggests the need for the ability to swap extents, but the existing
extent swapping code used by the file defragmenting tool ``xfs_fsr`` is
not sufficient for online repair because:
- When the reverse-mapping btree is enabled, the swap code must keep the
  reverse mapping information up to date with every exchange of mappings.
  Therefore, it can only exchange one mapping per transaction, and each
  transaction is independent.

- Reverse-mapping is critical for the operation of online fsck, so the
  old defragmentation code (which swapped entire extent forks in a single
  operation) is not useful here.

- Defragmentation is assumed to occur between two files with identical
  contents.
  For this use case, an incomplete exchange will not result in a
  user-visible change in file contents, even if the operation is
  interrupted.

- Online repair needs to swap the contents of two files that are by
  definition *not* identical.
  For directory and xattr repairs, the user-visible contents might be the
  same, but the contents of individual blocks may be very different.

- Old blocks in the file may be cross-linked with another structure and
  must not reappear if the system goes down mid-repair.
These problems are overcome by creating a new deferred operation and a
new type of log intent item to track the progress of an operation to
exchange two file ranges.
The new exchange operation type chains together the same transactions
used by the reverse-mapping extent swap code, but records intermediate
progress in the log so that operations can be restarted after a crash.
This new functionality is called the file contents exchange
(xfs_exchrange) code.
The underlying implementation exchanges file fork mappings
(xfs_exchmaps).
The new log item records the progress of the exchange to ensure that once
an exchange begins, it will always run to completion, even if there are
interruptions.
The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag in
the superblock protects these new log item records from being replayed on
old kernels.

The proposed patchset is the
`file contents exchange
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
series.

**Sidebar: Using Log-Incompatible Feature Flags**

   Starting with XFS v5, the superblock contains a
   ``sb_features_log_incompat`` field to indicate that the log contains
   records that might not be readable by all kernels that could mount
   this filesystem.
   In short, log incompat features protect the log contents against
   kernels that will not understand the contents.
   Unlike the other superblock feature bits, log incompat bits are
   ephemeral because an empty (clean) log does not need protection.
   The log cleans itself after its contents have been committed into the
   filesystem, either as part of an unmount or because the system is
   otherwise idle.
   Because upper level code can be working on a transaction at the same
   time that the log cleans itself, it is necessary for upper level code
   to communicate to the log when it is going to use a log incompatible
   feature.

   The log coordinates access to incompatible features through the use of
   one ``struct rw_semaphore`` for each feature.
   The log cleaning code tries to take this rwsem in exclusive mode to
   clear the bit; if the lock attempt fails, the feature bit remains set.
   The code supporting a log incompat feature should create wrapper
   functions to obtain the log feature and call
   ``xfs_add_incompat_log_feature`` to set the feature bits in the
   primary superblock.
   The superblock update is performed transactionally, so the wrapper to
   obtain log assistance must be called just prior to the creation of the
   transaction that uses the functionality.
   For a file operation, this step must happen after taking the IOLOCK
   and the MMAPLOCK, but before allocating the transaction.
   When the transaction is complete, the ``xlog_drop_incompat_feat``
   function is called to release the feature.
   The feature bit will not be cleared from the superblock until the log
   becomes clean.

   Log-assisted extended attribute updates and file content exchanges
   both use log incompat features and provide convenience wrappers around
   the functionality.
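To make the sidebar's wrapper pattern concrete, a feature might guard its
transactions roughly as follows.
Only ``xfs_add_incompat_log_feature`` and ``xlog_drop_incompat_feat`` are
named by this document; the feature flag and the surrounding function in
this sketch are hypothetical.

.. code-block:: c

    /*
     * Sketch of the log incompat wrapper pattern.  The feature flag
     * XFS_SB_FEAT_INCOMPAT_LOG_EXAMPLE is hypothetical.
     */
    STATIC int
    example_logged_update(
            struct xfs_mount        *mp)
    {
            int                     error;

            /*
             * Obtain the log feature before allocating the transaction
             * that relies on it.  For a file operation, this happens
             * after taking the IOLOCK and the MMAPLOCK.
             */
            error = xfs_add_incompat_log_feature(mp,
                            XFS_SB_FEAT_INCOMPAT_LOG_EXAMPLE);
            if (error)
                    return error;

            /* ...allocate, use, and commit the transaction here... */

            /*
             * Release the feature.  The ondisk bit is not cleared
             * until the log cleans itself.
             */
            xlog_drop_incompat_feat(mp->m_log);
            return 0;
    }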
Mechanics of a Logged File Content Exchange
```````````````````````````````````````````

Exchanging contents between file forks is a complex task.
The goal is to exchange all file fork mappings between two file fork
offset ranges.
There are likely to be many extent mappings in each fork, and the edges
of the mappings aren't necessarily aligned.
Furthermore, there may be other updates that need to happen after the
exchange, such as exchanging file sizes, inode flags, or conversion of
fork data to local format.
This is roughly the format of the new deferred exchange-mapping work
item:

.. code-block:: c

    struct xfs_exchmaps_intent {
        /* Inodes participating in the operation. */
        struct xfs_inode    *xmi_ip1;
        struct xfs_inode    *xmi_ip2;

        /* File offset range information. */
        xfs_fileoff_t       xmi_startoff1;
        xfs_fileoff_t       xmi_startoff2;
        xfs_filblks_t       xmi_blockcount;

        /* Set these file sizes after the operation, unless negative. */
        xfs_fsize_t         xmi_isize1;
        xfs_fsize_t         xmi_isize2;

        /* XFS_EXCHMAPS_* log operation flags */
        uint64_t            xmi_flags;
    };

The new log intent item contains enough information to track two logical
fork offset ranges: ``(inode1, startoff1, blockcount)`` and
``(inode2, startoff2, blockcount)``.
Each step of an exchange operation exchanges the largest file range
mapping possible from one file to the other.
After each step in the exchange operation, the two startoff fields are
incremented and the blockcount field is decremented to reflect the
progress made.
The flags field captures behavioral parameters such as exchanging attr
fork mappings instead of the data fork and other work to be done after
the exchange.
The two isize fields are used to exchange the file sizes at the end of
the operation if the file data fork is the target of the operation.
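As an example, an exchange of two entire data forks might prime the work
item as in this sketch; the initializer function is hypothetical, but the
fields are those of the structure above.

.. code-block:: c

    /* Sketch: prime an exchange of two entire data forks. */
    STATIC void
    example_exchmaps_init(
            struct xfs_exchmaps_intent      *xmi,
            struct xfs_inode                *ip1,
            struct xfs_inode                *ip2,
            xfs_filblks_t                   blockcount)
    {
            xmi->xmi_ip1 = ip1;
            xmi->xmi_ip2 = ip2;

            /* Start at offset zero and cover both forks entirely. */
            xmi->xmi_startoff1 = 0;
            xmi->xmi_startoff2 = 0;
            xmi->xmi_blockcount = blockcount;

            /* Exchange the file sizes once the operation completes. */
            xmi->xmi_isize1 = ip2->i_disk_size;
            xmi->xmi_isize2 = ip1->i_disk_size;

            /* Operate on the data forks; no extra post-processing. */
            xmi->xmi_flags = 0;
    }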
When the exchange is initiated, the sequence of operations is as follows:

1. Create a deferred work item for the file mapping exchange.
   At the start, it should contain the entirety of the file block ranges
   to be exchanged.

2. Call ``xfs_defer_finish`` to process the exchange.
   This is encapsulated in ``xrep_tempexch_contents`` for scrub
   operations.
   This will log an extent swap intent item to the transaction for the
   deferred mapping exchange work item.

3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is
   zero,

   a. Read the block maps of both file ranges starting at
      ``xmi_startoff1`` and ``xmi_startoff2``, respectively, and compute
      the longest extent that can be exchanged in a single step.
      This is the minimum of the two ``br_blockcount`` values in the
      mappings.
      Keep advancing through the file forks until at least one of the
      mappings contains written blocks.
      Mutual holes, unwritten extents, and extent mappings to the same
      physical space are not exchanged.

      For the next few steps, this document will refer to the mapping
      that came from file 1 as "map1", and the mapping that came from
      file 2 as "map2".

   b. Create a deferred block mapping update to unmap map1 from file 1.

   c. Create a deferred block mapping update to unmap map2 from file 2.

   d. Create a deferred block mapping update to map map1 into file 2.

   e. Create a deferred block mapping update to map map2 into file 1.

   f. Log the block, quota, and extent count updates for both files.

   g. Extend the ondisk size of either file if necessary.

   h. Log a mapping exchange done log item for the mapping exchange
      intent log item that was read at the start of step 3.

   i. Compute the amount of file range that has just been covered.
      This quantity is
      ``(map1.br_startoff + map1.br_blockcount - xmi_startoff1)``,
      because step 3a could have skipped holes.

   j. Increase the starting offsets of ``xmi_startoff1`` and
      ``xmi_startoff2`` by the number of blocks computed in the previous
      step, and decrease ``xmi_blockcount`` by the same quantity.
      This advances the cursor; a sketch of this arithmetic follows the
      list.

   k. Log a new mapping exchange intent log item reflecting the advanced
      state of the work item.

   l. Return the proper error code (EAGAIN) to the deferred operation
      manager to inform it that there is more work to be done.
      The operation manager completes the deferred work in steps 3b-3e
      before moving back to the start of step 3.

4. Perform any post-processing.
   This will be discussed in more detail in subsequent sections.
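The cursor arithmetic in steps 3i and 3j might look like the following
sketch, assuming ``map1`` is a ``struct xfs_bmbt_irec`` holding the
mapping read from file 1 in step 3a; the function itself is hypothetical.

.. code-block:: c

    /*
     * Sketch of steps 3i and 3j: advance the exchange cursor past the
     * range that was just exchanged, including any holes that step 3a
     * skipped over.
     */
    STATIC void
    example_exchmaps_advance(
            struct xfs_exchmaps_intent      *xmi,
            const struct xfs_bmbt_irec      *map1)
    {
            xfs_filblks_t                   covered;

            /* Step 3i: file range covered by this step. */
            covered = map1->br_startoff + map1->br_blockcount -
                      xmi->xmi_startoff1;

            /* Step 3j: move the cursor forward by that quantity. */
            xmi->xmi_startoff1 += covered;
            xmi->xmi_startoff2 += covered;
            xmi->xmi_blockcount -= covered;
    }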
If the filesystem goes down in the middle of an operation, log recovery
will find the most recent unfinished mapping exchange log intent item and
restart from there.
This is how atomic file mapping exchanges guarantee that an outside
observer will either see the old broken structure or the new one, and
never a mishmash of both.

Preparation for File Content Exchanges
``````````````````````````````````````

There are a few things that need to be taken care of before initiating an
atomic file mapping exchange operation.
First, regular files require the page cache to be flushed to disk before
the operation begins, and directio writes to be quiesced.
Like any filesystem operation, file mapping exchanges must determine the
maximum amount of disk space and quota that can be consumed on behalf of
both files in the operation, and reserve that quantity of resources to
avoid an unrecoverable out of space failure once it starts dirtying
metadata.
The preparation step scans the ranges of both files to estimate:

- Data device blocks needed to handle the repeated updates to the fork
  mappings.

- Change in data and realtime block counts for both files.

- Increase in quota usage for both files, if the two files do not share
  the same set of quota ids.

- The number of extent mappings that will be added to each file.

- Whether or not there are partially written realtime extents.
  User programs must never be able to access a realtime file extent that
  maps to different extents on the realtime volume, which could happen if
  the operation fails to run to completion.

The need for precise estimation increases the run time of the exchange
operation, but it is very important to maintain correct accounting.
The filesystem must not run completely out of free space, nor can the
mapping exchange ever add more extent mappings to a fork than it can
support.
Regular users are required to abide by the quota limits, though metadata
repairs may exceed quota to resolve inconsistent metadata elsewhere.
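A hypothetical container for the preparation step's estimates might look
like the following sketch; the structure and field names are illustrative
and not part of any committed interface.

.. code-block:: c

    /* Sketch: quantities computed by the preparation step. */
    struct example_exchmaps_estimate {
            /* Data device blocks for repeated fork mapping updates. */
            xfs_filblks_t           resblks;

            /*
             * Net change in data and realtime block counts for each
             * file.  If the two files have different quota ids, these
             * deltas also drive the quota reservations.
             */
            int64_t                 ip1_bcount;
            int64_t                 ip2_bcount;
            int64_t                 ip1_rtbcount;
            int64_t                 ip2_rtbcount;

            /* Extent mappings that will be added to each file. */
            xfs_extnum_t            ip1_mappings;
            xfs_extnum_t            ip2_mappings;

            /* Are there partially written realtime extents? */
            bool                    partial_rtext;
    };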
Special Features for Exchanging Metadata File Contents
``````````````````````````````````````````````````````

Extended attributes, symbolic links, and directories can set the fork
format to "local" and treat the fork as a literal area for data storage.
Metadata repairs must take extra steps to support these cases:

- If both forks are in local format and the fork areas are large enough,
  the exchange is performed by copying the incore fork contents, logging
  both forks, and committing.
  The atomic file mapping exchange mechanism is not necessary, since this
  can be done with a single transaction.

- If both forks map blocks, then the regular atomic file mapping exchange
  is used.

- Otherwise, only one fork is in local format.
  The contents of the local format fork are converted to a block to
  perform the exchange.
  The conversion to block format must be done in the same transaction
  that logs the initial mapping exchange intent log item.
  The regular atomic mapping exchange is used to exchange the metadata
  file mappings.
  Special flags are set on the exchange operation so that the transaction
  can be rolled one more time to convert the second file's fork back to
  local format so that the second file will be ready to go as soon as the
  ILOCK is dropped.
Extended attributes and directories stamp the owning inode into every
block, but the buffer verifiers do not actually check the inode number!
Although there is no verification, it is still important to maintain
referential integrity, so prior to performing the mapping exchange,
online repair builds every block in the new data structure with the owner
field of the file being repaired.

After a successful exchange operation, the repair operation must reap the
old fork blocks by processing each fork mapping through the standard
:ref:`file extent reaping <reaping>` mechanism that is done post-repair.
If the filesystem should go down during the reap part of the repair, the
iunlink processing at the end of recovery will free both the temporary
file and whatever blocks were not reaped.
However, this iunlink processing omits the cross-link detection of online
repair, and is not completely foolproof.

Exchanging Temporary File Contents
``````````````````````````````````

To repair a metadata file, online repair proceeds as follows:
1. Create a temporary repair file.

2. Use the staging data to write out new contents into the temporary
   repair file.
   The contents must be written to the same fork that is being repaired.

3. Commit the scrub transaction, since the exchange resource estimation
   step must be completed before transaction reservations are made.

4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction
   with the appropriate resource reservations and locks, and to fill out
   a ``struct xfs_exchmaps_req`` with the details of the exchange
   operation.

5. Call ``xrep_tempexch_contents`` to exchange the contents.

6. Commit the transaction to complete the repair.
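Steps 4 through 6 might look like the following sketch; the
``struct xrep_tempexch`` type and the ``xrep_tempexch_*`` helper
signatures are assumptions for illustration.

.. code-block:: c

    /*
     * Sketch of steps 4-6: exchange the temporary file's repaired fork
     * contents with the file being repaired.
     */
    STATIC int
    example_commit_new_contents(
            struct xfs_scrub        *sc,
            struct xrep_tempexch    *tx)
    {
            int                     error;

            /*
             * Step 4: allocate a transaction with enough reservation
             * for the entire exchange, take the locks, and fill out
             * the exchange request.
             */
            error = xrep_tempexch_trans_alloc(sc, XFS_DATA_FORK, tx);
            if (error)
                    return error;

            /* Step 5: exchange the temporary and ondisk contents. */
            error = xrep_tempexch_contents(sc, tx);
            if (error)
                    return error;

            /* Step 6: commit the transaction to finish the repair. */
            return xfs_trans_commit(sc->tp);
    }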
.. _rtsummary:

Case Study: Repairing the Realtime Summary File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the "realtime" section of an XFS filesystem, free space is tracked via
a bitmap, similar to Unix FFS.
Each bit in the bitmap represents one realtime extent, which is a
multiple of the filesystem block size between 4KiB and 1GiB in size.
The realtime summary file indexes the number of free extents of a given
size to the offset of the block within the realtime free space bitmap
where those free extents begin.
In other words, the summary file helps the allocator find free extents by
length, similar to what the free space by count (cntbt) btree does for
the data section.

The summary file itself is a flat file (with no block headers or
checksums!) partitioned into ``log2(total rt extents)`` sections
containing enough 32-bit counters to match the number of blocks in the rt
bitmap.
Each counter records the number of free extents that start in that bitmap
block and can satisfy a power-of-two allocation request.

To check the summary file against the bitmap:

1. Take the ILOCK of both the realtime bitmap and summary files.

2. For each free space extent recorded in the bitmap:

   a. Compute the position in the summary file that contains a counter
      that represents this free extent.

   b. Read the counter from the xfile.

   c. Increment it, and write it back to the xfile.

3. Compare the contents of the xfile against the ondisk file.
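The counter position computation in step 2a might look like this sketch,
which assumes that ``sb_rbmblocks`` (the number of blocks in the realtime
bitmap) gives the size of each summary section; the function itself is
hypothetical.

.. code-block:: c

    /*
     * Sketch of step 2a: compute the index of the 32-bit summary
     * counter describing free extents of length 2^log2_len that begin
     * in realtime bitmap block rbmblock.
     */
    STATIC xfs_fileoff_t
    example_summary_counter_index(
            struct xfs_mount        *mp,
            unsigned int            log2_len,
            xfs_fileoff_t           rbmblock)
    {
            /*
             * Each of the log2(total rt extents) sections contains one
             * counter per block in the realtime bitmap.
             */
            return (xfs_fileoff_t)log2_len * mp->m_sb.sb_rbmblocks +
                            rbmblock;
    }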
To repair the summary file, write the xfile contents into the temporary
file and use atomic mapping exchange to commit the new contents.
The temporary file is then reaped.

The proposed patchset is the
`realtime summary repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
series.

Case Study: Salvaging Extended Attributes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In XFS, extended attributes are implemented as a namespaced name-value
store.
Values are limited in size to 64KiB, but there is no limit on the number
of names.
The attribute fork is unpartitioned, which means that the root of the
attribute structure is always in logical block zero, but attribute leaf
blocks, dabtree index blocks, and remote value blocks are intermixed.
Attribute leaf blocks contain variable-sized records that associate
user-provided names with the user-provided values.
Values larger than a block are allocated separate extents and written
there.
If the leaf information expands beyond a single block, a
directory/attribute btree (``dabtree``) is created to map hashes of
attribute names to entries for fast lookup.

Salvaging extended attributes is done as follows:
1. Walk the attr fork mappings of the file being repaired to find the
   attribute leaf blocks.
   When one is found,

   a. Walk the attr leaf block to find candidate keys.
      When one is found,

      1. Check the name for problems, and ignore the name if there are
         any.

      2. Retrieve the value.
         If that succeeds, add the name and value to the staging xfarray
         and xfblob; a sketch of this staging step follows the list.

2. If the memory usage of the xfarray and xfblob exceeds a certain amount
   of memory or there are no more attr fork blocks to examine, unlock the
   file and add the staged extended attributes to the temporary file.

3. Use atomic file mapping exchange to exchange the new and old extended
   attribute structures.
   The old attribute blocks are now attached to the temporary file.

4. Reap the temporary file.
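The staging in step 1a.2 might be structured like this sketch, using the
xfarray and xfblob facilities discussed earlier in this document; the
record layout and the helper function are illustrative assumptions.

.. code-block:: c

    /* Sketch: remember one salvaged attribute name and value. */
    struct example_stashed_attr {
            xfblob_cookie           name_cookie;
            xfblob_cookie           value_cookie;
            uint32_t                namelen;
            uint32_t                valuelen;
    };

    STATIC int
    example_stash_attr(
            struct xfarray          *staged_attrs,
            struct xfblob           *staged_blobs,
            const void              *name,
            uint32_t                namelen,
            const void              *value,
            uint32_t                valuelen)
    {
            struct example_stashed_attr     key = {
                    .namelen        = namelen,
                    .valuelen       = valuelen,
            };
            int                     error;

            /* Stash the variable-length name and value in the xfblob... */
            error = xfblob_store(staged_blobs, &key.name_cookie, name,
                            namelen);
            if (error)
                    return error;

            error = xfblob_store(staged_blobs, &key.value_cookie, value,
                            valuelen);
            if (error)
                    return error;

            /* ...and append the fixed-size record to the xfarray. */
            return xfarray_append(staged_attrs, &key);
    }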
The proposed patchset is the
`extended attribute repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
series.

Fixing Directories
------------------

Fixing directories is difficult with currently available filesystem
features, since directory entries are not redundant.
The offline repair tool scans all inodes to find files with nonzero link
count, and then it scans all directories to establish parentage of those
linked files.
Damaged files and directories are zapped, and files with no parent are
moved to the ``/lost+found`` directory.
It does not try to salvage anything.

The best that online repair can do at this time is to read directory data
blocks and salvage any dirents that look plausible, correct link counts,
and move orphans back into the directory tree.
The salvage process is discussed in the case study at the end of this
section.
The :ref:`file link count fsck <nlinks>` code takes care of fixing link
counts and moving orphans to the ``/lost+found`` directory.

Case Study: Salvaging Directories
`````````````````````````````````

Unlike extended attributes, directory blocks are all the same size, so
salvaging directories is straightforward:
1. Find the parent of the directory.
   If the dotdot entry is readable, try to confirm that the alleged parent
   has a child entry pointing back to the directory being repaired.
   Otherwise, walk the filesystem to find it.

2. Walk the first partition of the data fork of the directory to find the
   directory entry data blocks.
   When one is found,

   a. Walk the directory data block to find candidate entries.
      When an entry is found:

      i. Check the name for problems, and ignore the name if there are.

      ii. Retrieve the inumber and grab the inode.
          If that succeeds, add the name, inode number, and file type to
          the staging xfarray and xfblob (the stashed record layout is
          sketched after this list).

3. If the memory usage of the xfarray and xfblob exceeds a certain amount
   of memory or there are no more directory data blocks to examine, unlock
   the directory and add the staged dirents into the temporary directory.
   Truncate the staging files.

4. Use atomic file mapping exchange to exchange the new and old directory
   structures.
   The old directory blocks are now attached to the temporary file.

5. Reap the temporary file.
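For illustration, the fixed-size portion of each stashed dirent might look
like the record below.
The layout is an assumption, not the kernel's exact structure; only the
variable-length name lives in the xfblob, referenced by cookie.

.. code-block:: c

	/*
	 * Sketch of a fixed-size record stashed for each salvaged dirent.
	 * The layout is illustrative; the variable-length name is stored
	 * in the xfblob and referenced by cookie.
	 */
	struct dirent_stash_rec {
		xfblob_cookie		name_cookie;	/* dirent name */
		xfs_ino_t		ino;		/* child inumber */
		uint8_t			namelen;
		uint8_t			ftype;		/* file type */
	};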
**Future Work Question**: Should repair revalidate the dentry cache when
rebuilding a directory?

*Answer*: Yes, it should.

In theory it is necessary to scan all dentry cache entries for a directory
to ensure that one of the following applies:

1. The cached dentry reflects an ondisk dirent in the new directory.

2. The cached dentry no longer has a corresponding ondisk dirent in the new
   directory and the dentry can be purged from the cache.

3. The cached dentry no longer has an ondisk dirent but the dentry cannot
   be purged.
   This is the problem case.

Unfortunately, the current dentry cache design doesn't provide a means to
walk every child dentry of a specific directory, which makes this a hard
problem.
There is no known solution.
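If such a walk existed, the per-dentry decision would mirror the three
cases above.
The sketch below is purely illustrative and cannot be implemented today;
the directory lookup helper is hypothetical.

.. code-block:: c

	/*
	 * Sketch of the per-dentry revalidation that repair would like to
	 * perform.  This cannot be written today because there is no way
	 * to iterate the child dentries of a directory; the lookup helper
	 * is hypothetical.
	 */
	static void
	check_cached_dentry(
		struct dentry		*dentry,
		struct xfs_inode	*dp)
	{
		/* Case 1: the rebuilt directory still has this entry. */
		if (new_dir_has_dirent(dp, &dentry->d_name))
			return;

		/* Case 2: try to purge the stale dentry. */
		d_invalidate(dentry);

		/* Case 3: if it cannot be purged, we are stuck. */
	}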
The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

Parent Pointers
```````````````

A parent pointer is a piece of file metadata that enables a user to locate
the file's parent directory without having to traverse the directory tree
from the root.
Without them, reconstruction of directory trees is hindered in much the
same way that the historic lack of reverse space mapping information once
hindered reconstruction of filesystem space metadata.
The parent pointer feature, however, makes total directory reconstruction
possible.

XFS parent pointers contain the information needed to identify the
corresponding directory entry in the parent directory.
In other words, child files use extended attributes to store pointers to
parents in the form ``(dirent_name) → (parent_inum, parent_gen)``.
The directory checking process can be strengthened to ensure that the
target of each dirent also contains a parent pointer pointing back to the
dirent.
Likewise, each parent pointer can be checked by ensuring that the target of
each parent pointer is a directory and that it contains a dirent matching
the parent pointer.
Both online and offline repair can use this strategy.
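The two directions of that strengthened check can be sketched as a pair of
predicates.
Both helper functions below are hypothetical names; the point is only that
every forward link must be confirmed by a reverse link, and vice versa.

.. code-block:: c

	/*
	 * Sketch of the two directions of the strengthened check.  Both
	 * helpers called here are hypothetical.
	 */
	STATIC int
	check_dirent_and_pptr(
		struct xfs_inode	*dp,	/* parent directory */
		struct xfs_inode	*ip,	/* child file */
		const struct xfs_name	*name)	/* dirent name */
	{
		int			error;

		/* Forward: the dirent target must point back at dp. */
		error = child_has_parent_pointer(ip, dp->i_ino, name);
		if (error)
			return error;

		/* Reverse: dp must contain a dirent matching the pointer. */
		return parent_has_dirent(dp, ip->i_ino, name);
	}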
**Historical Sidebar**:

Directory parent pointers were first proposed as an XFS feature more than a
decade ago by SGI.
Each link from a parent directory to a child file is mirrored with an
extended attribute in the child that could be used to identify the parent
directory.
Unfortunately, this early implementation had major shortcomings and was
never merged into Linux XFS:

1. The XFS codebase of the late 2000s did not have the infrastructure to
   enforce strong referential integrity in the directory tree.
   It did not guarantee that a change in a forward link would always be
   followed up with the corresponding change to the reverse links.

2. Referential integrity was not integrated into offline repair.
   Checking and repairs were performed on mounted filesystems without
   taking any kernel or inode locks to coordinate access.
   It is not clear how this actually worked properly.

3. The extended attribute did not record the name of the directory entry in
   the parent, so the SGI parent pointer implementation cannot be used to
   reconnect the directory tree.

4. Extended attribute forks only support 65,536 extents, which means that
   parent pointer attribute creation is likely to fail at some point before
   the maximum file link count is achieved.
The original parent pointer design was too unstable for something like a
filesystem repair to depend on.
Allison Henderson, Chandan Babu, and Catherine Hoang are working on a
second implementation that solves all shortcomings of the first.
During 2022, Allison introduced log intent items to track physical
manipulations of the extended attribute structures.
This solves the referential integrity problem by making it possible to
commit a dirent update and a parent pointer update in the same transaction.
Chandan increased the maximum extent counts of both data and attribute
forks, thereby ensuring that the extended attribute structure can grow to
handle the maximum hardlink count of any file.

For this second effort, the ondisk parent pointer format as originally
proposed was ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
The format was changed during development to eliminate the requirement
that repair tools ensure that the ``dirent_pos`` field always match when
reconstructing a directory.

There were a few other ways to have solved that problem:

1. The field could be designated advisory, since the other three values are
   sufficient to find the entry in the parent.
   However, this makes indexed key lookup impossible while repairs are
   ongoing.
2. We could allow creating directory entries at specified offsets, which
   solves the referential integrity problem but runs the risk that dirent
   creation will fail due to conflicts with the free space in the
   directory.

   These conflicts could be resolved by appending the directory entry and
   amending the xattr code to support updating an xattr key and reindexing
   the dabtree, though this would have to be performed with the parent
   directory still locked.

3. Same as above, but remove the old parent pointer entry and add a new one
   atomically.

4. Change the ondisk xattr format to
   ``(parent_inum, name) → (parent_gen)``, which would provide the attr
   name uniqueness that we require, without forcing repair code to update
   the dirent position.
   Unfortunately, this requires changes to the xattr code to support attr
   names as long as 263 bytes.

5. Change the ondisk xattr format to
   ``(parent_inum, hash(name)) → (name, parent_gen)``.
   If the hash is sufficiently resistant to collisions (e.g. sha256) then
   this should provide the attr name uniqueness that we require.
   Names shorter than 247 bytes could be stored directly.
6. Change the ondisk xattr format to
   ``(dirent_name) → (parent_ino, parent_gen)``.
   This format doesn't require any of the complicated nested name hashing
   of the previous suggestions.
   However, it was discovered that multiple hardlinks to the same inode
   with the same filename caused performance problems with hashed xattr
   lookups, so the parent inumber is now xor'd into the hash index.

In the end, it was decided that solution #6 was the most compact and the
most performant.
A new hash function was designed for parent pointers.
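The xor trick can be illustrated with a short sketch.
This is not the kernel's actual parent pointer hash function; it merely
shows how mixing the parent inumber into the name hash keeps same-named
hardlinks to the same file from piling up in one dabtree hash bucket.

.. code-block:: c

	/*
	 * Sketch only: mix the parent inumber into the name hash so that
	 * multiple hardlinks to the same file under the same name do not
	 * all land in one hash bucket.  The kernel's actual parent
	 * pointer hash may differ.
	 */
	static inline xfs_dahash_t
	pptr_hash_sketch(
		xfs_ino_t	parent_ino,
		const uint8_t	*name,
		int		namelen)
	{
		return xfs_da_hashname(name, namelen) ^
		       upper_32_bits(parent_ino) ^
		       lower_32_bits(parent_ino);
	}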
Case Study: Repairing Directories with Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and a
:ref:`directory entry live update hook <liveupdate>` as follows:

1. Set up a temporary directory for generating the new directory structure,
   an xfblob for storing entry names, and an xfarray for stashing the fixed
   size fields involved in a directory update:
   ``(child inumber, add vs. remove, name cookie, ftype)``.

2. Set up an inode scanner and hook into the directory entry code to
   receive updates on directory operations.

3. For each parent pointer found in each file scanned, decide if the parent
   pointer references the directory of interest.
   If so:

   a. Stash the parent pointer name and an addname entry for this dirent in
      the xfblob and xfarray, respectively.

   b. When finished scanning that file or the kernel memory consumption
      exceeds a threshold, flush the stashed updates to the temporary
      directory.
4. For each live directory update received via the hook, decide if the
   child has already been scanned.
   If so:

   a. Stash the parent pointer name and an addname or removename entry for
      this dirent update in the xfblob and xfarray for later.
      We cannot write directly to the temporary directory because hook
      functions are not allowed to modify filesystem metadata.
      Instead, we stash updates in the xfarray and rely on the scanner
      thread to apply the stashed updates to the temporary directory (see
      the sketch after this list).

5. When the scan is complete, replay any stashed entries in the xfarray.

6. When the scan is complete, atomically exchange the contents of the
   temporary directory and the directory being repaired.
   The temporary directory now contains the damaged directory structure.

7. Reap the temporary directory.
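A live update hook for this scheme might look like the sketch below.
The notifier-based shape and all helper and structure names are assumptions
for illustration; the essential property is that the hook only stashes the
update and never modifies filesystem metadata itself.

.. code-block:: c

	/*
	 * Sketch of a directory entry live update hook.  The structure
	 * and helper names are illustrative assumptions.  Note that the
	 * hook only stashes the update; the scanner thread applies it to
	 * the temporary directory later.
	 */
	static int
	dir_rebuild_hook(
		struct notifier_block	*nb,
		unsigned long		action,
		void			*data)
	{
		struct dir_update_params *p = data;
		struct dir_rebuild	*rb;
		struct dirent_update_rec rec = {
			.ino	= p->child_ino,
			.remove	= (action == DIR_HOOK_REMOVE),
			.ftype	= p->ftype,
		};

		rb = container_of(nb, struct dir_rebuild, nb);

		/* Ignore children that the scanner has not reached yet. */
		if (!child_already_scanned(rb, p->child_ino))
			return NOTIFY_DONE;

		/*
		 * Stash the update; hooks must not modify filesystem
		 * metadata directly.
		 */
		if (xfblob_store(rb->names, &rec.name_cookie, p->name,
				p->namelen) ||
		    xfarray_append(rb->updates, &rec))
			rb->error = -ENOMEM;

		return NOTIFY_DONE;
	}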
The proposed patchset is the
`parent pointers directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
series.

Case Study: Repairing Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Online reconstruction of a file's parent pointer information works
similarly to directory reconstruction:

1. Set up a temporary file for generating a new extended attribute
   structure, an xfblob for storing parent pointer names, and an xfarray
   for stashing the fixed size fields involved in a parent pointer update:
   ``(parent inumber, parent generation, add vs. remove, name cookie)``.

2. Set up an inode scanner and hook into the directory entry code to
   receive updates on directory operations.
h]h)}(hkStash the dirent name and an addpptr entry for this parent pointer in the xfblob and xfarray, respectively.h]hkStash the dirent name and an addpptr entry for this parent pointer in the xfblob and xfarray, respectively.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjubh)}(hWhen finished scanning the directory or the kernel memory consumption exceeds a threshold, flush the stashed updates to the temporary file. h]h)}(hWhen finished scanning the directory or the kernel memory consumption exceeds a threshold, flush the stashed updates to the temporary file.h]hWhen finished scanning the directory or the kernel memory consumption exceeds a threshold, flush the stashed updates to the temporary file.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM!hjubah}(h]h ]h"]h$]h&]uh1hhjubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhjӊubeh}(h]h ]h"]h$]h&]uh1hhjhhhNhNubh)}(hXFor each live directory update received via the hook, decide if the parent has already been scanned. If so: a. Stash the dirent name and an addpptr or removepptr entry for this dirent update in the xfblob and xfarray for later. We cannot write parent pointers directly to the temporary file because hook functions are not allowed to modify filesystem metadata. Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed parent pointer updates to the temporary file. h](h)}(hkFor each live directory update received via the hook, decide if the parent has already been scanned. If so:h]hkFor each live directory update received via the hook, decide if the parent has already been scanned. If so:}(hj(hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM$hj$ubji)}(hhh]h)}(hXStash the dirent name and an addpptr or removepptr entry for this dirent update in the xfblob and xfarray for later. We cannot write parent pointers directly to the temporary file because hook functions are not allowed to modify filesystem metadata. Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed parent pointer updates to the temporary file. h]h)}(hXStash the dirent name and an addpptr or removepptr entry for this dirent update in the xfblob and xfarray for later. We cannot write parent pointers directly to the temporary file because hook functions are not allowed to modify filesystem metadata. Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed parent pointer updates to the temporary file.h]hXStash the dirent name and an addpptr or removepptr entry for this dirent update in the xfblob and xfarray for later. We cannot write parent pointers directly to the temporary file because hook functions are not allowed to modify filesystem metadata. Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed parent pointer updates to the temporary file.}(hj=hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM(hj9ubah}(h]h ]h"]h$]h&]uh1hhj6ubah}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhj$ubeh}(h]h ]h"]h$]h&]uh1hhjhhhNhNubh)}(hFWhen the scan is complete, replay any stashed entries in the xfarray. h]h)}(hEWhen the scan is complete, replay any stashed entries in the xfarray.h]hEWhen the scan is complete, replay any stashed entries in the xfarray.}(hjahhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM/hj]ubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hGCopy all non-parent pointer extended attributes to the temporary file. 
The proposed patchset is the
`parent pointers repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
series.

Digression: Offline Checking of Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Examining parent pointers in offline repair works differently because
corrupt files are erased long before directory tree connectivity checks
are performed.
Parent pointer checks are therefore a second pass to be added to the
existing connectivity checks:

1. After the set of surviving files has been established (phase 6), walk
   the surviving directories of each AG in the filesystem.
   This is already performed as part of the connectivity checks.
2. For each directory entry found,

   a. If the name has already been stored in the xfblob, then use that
      cookie and skip the next step.

   b. Otherwise, record the name in an xfblob, and remember the xfblob
      cookie.
      Unique mappings are critical for

      1. Deduplicating names to reduce memory usage, and

      2. Creating a stable sort key for the parent pointer indexes so that
         the parent pointer validation described below will work.

   c. Store ``(child_ag_inum, parent_inum, parent_gen, name_hash, name_len,
      name_cookie)`` tuples in a per-AG in-memory slab.
      The ``name_hash`` referenced in this section is the regular directory
      entry name hash, not the specialized one used for parent pointer
      xattrs.
3. For each AG in the filesystem,

   a. Sort the per-AG tuple set in order of ``child_ag_inum``,
      ``parent_inum``, ``name_hash``, and ``name_cookie``.
      Having a single ``name_cookie`` for each ``name`` is critical for
      handling the uncommon case of a directory containing multiple
      hardlinks to the same file where all the names hash to the same
      value.

   b. For each inode in the AG,

      1. Scan the inode for parent pointers.
         For each parent pointer found,

         a. Validate the ondisk parent pointer.
            If validation fails, move on to the next parent pointer in the
            file.

         b. If the name has already been stored in the xfblob, then use
            that cookie and skip the next step.

         c. Record the name in a per-file xfblob, and remember the xfblob
            cookie.

         d. Store ``(parent_inum, parent_gen, name_hash, name_len,
            name_cookie)`` tuples in a per-file slab.

      2. Sort the per-file tuples in order of ``parent_inum``,
         ``name_hash``, and ``name_cookie``.

      3. Position one slab cursor at the start of the inode's records in
         the per-AG tuple slab.
         This should be trivial since the per-AG tuples are in child
         inumber order.

      4. Position a second slab cursor at the start of the per-file tuple
         slab.

      5. Iterate the two cursors in lockstep, comparing the ``parent_ino``,
         ``name_hash``, and ``name_cookie`` fields of the records under
         each cursor (see the sketch after this list):

         a. If the per-AG cursor is at a lower point in the keyspace than
            the per-file cursor, then the per-AG cursor points to a missing
            parent pointer.
            Add the parent pointer to the inode and advance the per-AG
            cursor.

         b. If the per-file cursor is at a lower point in the keyspace than
            the per-AG cursor, then the per-file cursor points to a
            dangling parent pointer.
            Remove the parent pointer from the inode and advance the
            per-file cursor.

         c. Otherwise, both cursors point at the same parent pointer.
            Update the parent_gen component if necessary.
            Advance both cursors.
Add the parent pointer to the inode and advance the per-AG cursor.h]hIf the per-AG cursor is at a lower point in the keyspace than the per-file cursor, then the per-AG cursor points to a missing parent pointer. Add the parent pointer to the inode and advance the per-AG cursor.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjubh)}(hIf the per-file cursor is at a lower point in the keyspace than the per-AG cursor, then the per-file cursor points to a dangling parent pointer. Remove the parent pointer from the inode and advance the per-file cursor. h]h)}(hIf the per-file cursor is at a lower point in the keyspace than the per-AG cursor, then the per-file cursor points to a dangling parent pointer. Remove the parent pointer from the inode and advance the per-file cursor.h]hIf the per-file cursor is at a lower point in the keyspace than the per-AG cursor, then the per-file cursor points to a dangling parent pointer. Remove the parent pointer from the inode and advance the per-file cursor.}(hj hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjubh)}(h~Otherwise, both cursors point at the same parent pointer. Update the parent_gen component if necessary. Advance both cursors. h]h)}(h}Otherwise, both cursors point at the same parent pointer. Update the parent_gen component if necessary. Advance both cursors.h]h}Otherwise, both cursors point at the same parent pointer. Update the parent_gen component if necessary. Advance both cursors.}(hj8hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj4ubah}(h]h ]h"]h$]h&]uh1hhjubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhjubeh}(h]h ]h"]h$]h&]uh1hhjubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjubeh}(h]h ]h"]h$]h&]uh1hhj ubeh}(h]h ]h"]h$]h&]jgj6jihjjjkuh1jhhjubeh}(h]h ]h"]h$]h&]uh1hhjhhhNhNubh)}(h2Move on to examining link counts, as we do today. h]h)}(h1Move on to examining link counts, as we do today.h]h1Move on to examining link counts, as we do today.}(hjthhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjpubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjhhhhhMFubh)}(hThe proposed patchset is the `offline parent pointers repair `_ series.h](hThe proposed patchset is the }(hjhhhNhNubj)}(h}`offline parent pointers repair `_h]hoffline parent pointers repair}(hjhhhNhNubah}(h]h ]h"]h$]h&]nameoffline parent pointers repairjjYhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsckuh1jhjubh)}(h\ h]h}(h]offline-parent-pointers-repairah ]h"]offline parent pointers repairah$]h&]refurijuh1hjyKhjubh series.}(hjhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjhhubh)}(hX+Rebuilding directories from parent pointers in offline repair would be very challenging because xfs_repair currently uses two single-pass scans of the filesystem during phases 3 and 4 to decide which files are corrupt enough to be zapped. This scan would have to be converted into a multi-pass scan:h]hX+Rebuilding directories from parent pointers in offline repair would be very challenging because xfs_repair currently uses two single-pass scans of the filesystem during phases 3 and 4 to decide which files are corrupt enough to be zapped. This scan would have to be converted into a multi-pass scan:}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjhhubji)}(hhh](h)}(hThe first pass of the scan zaps corrupt inodes, forks, and attributes much as it does now. Corrupt directories are noted but not zapped. h]h)}(hThe first pass of the scan zaps corrupt inodes, forks, and attributes much as it does now. 
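The lockstep iteration in step 5 above is a classic sorted-merge join.
A minimal userspace toy of the comparison loop follows; the record type,
helper names, and the arrays in ``main`` are inventions of this sketch,
whereas the real repair code iterates xfarray slab cursors:

.. code-block:: c

   /*
    * Toy model of step 5: lockstep iteration of two sorted record
    * streams.  Everything here is illustrative scaffolding and not
    * part of xfs_repair.
    */
   #include <stdio.h>
   #include <stdint.h>

   struct pptr_rec {
           uint64_t parent_ino;    /* alleged parent inode number */
           uint32_t name_hash;     /* hash of the dirent name */
           uint64_t name_cookie;   /* xfblob cookie for the name */
   };

   /* Total order on (parent_ino, name_hash, name_cookie). */
   static int pptr_rec_cmp(const struct pptr_rec *a,
                           const struct pptr_rec *b)
   {
           if (a->parent_ino != b->parent_ino)
                   return a->parent_ino < b->parent_ino ? -1 : 1;
           if (a->name_hash != b->name_hash)
                   return a->name_hash < b->name_hash ? -1 : 1;
           if (a->name_cookie != b->name_cookie)
                   return a->name_cookie < b->name_cookie ? -1 : 1;
           return 0;
   }

   /*
    * Merge the per-AG records (what the dirents said) against the
    * per-file records (what the inode's parent pointers say).
    */
   static void reconcile(const struct pptr_rec *ag, int nag,
                         const struct pptr_rec *file, int nfile)
   {
           int i = 0, j = 0;

           while (i < nag || j < nfile) {
                   int cmp;

                   if (i >= nag)
                           cmp = 1;        /* only per-file records left */
                   else if (j >= nfile)
                           cmp = -1;       /* only per-AG records left */
                   else
                           cmp = pptr_rec_cmp(&ag[i], &file[j]);

                   if (cmp < 0)            /* step 5a: missing pptr */
                           printf("missing pptr to parent %llu\n",
                                  (unsigned long long)ag[i++].parent_ino);
                   else if (cmp > 0)       /* step 5b: dangling pptr */
                           printf("dangling pptr to parent %llu\n",
                                  (unsigned long long)file[j++].parent_ino);
                   else                    /* step 5c: match */
                           i++, j++;       /* refresh parent_gen if needed */
           }
   }

   int main(void)
   {
           struct pptr_rec ag[]   = { {128, 7, 1}, {129, 9, 2} };
           struct pptr_rec file[] = { {129, 9, 2}, {200, 3, 5} };

           reconcile(ag, 2, file, 2);      /* missing 128, dangling 200 */
           return 0;
   }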
Rebuilding directories from parent pointers in offline repair would be
very challenging because xfs_repair currently uses two single-pass scans
of the filesystem during phases 3 and 4 to decide which files are
corrupt enough to be zapped.
This scan would have to be converted into a multi-pass scan:

1. The first pass of the scan zaps corrupt inodes, forks, and attributes
   much as it does now.
   Corrupt directories are noted but not zapped.

2. The next pass records parent pointers pointing to the directories
   noted as being corrupt in the first pass.
   This second pass may have to happen after the phase 4 scan for
   duplicate blocks, if phase 4 is also capable of zapping directories.

3. The third pass resets corrupt directories to an empty shortform
   directory.
   Free space metadata has not been ensured yet, so repair cannot yet
   use the directory building code in libxfs.

4. At the start of phase 6, space metadata have been rebuilt.
   Use the parent pointer information recorded during step 2 to
   reconstruct the dirents and add them to the now-empty directories.

This code has not yet been constructed.

.. _dirtree:

Case Study: Directory Tree Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As mentioned earlier, the filesystem directory tree is supposed to be a
directed acyclic graph structure.
However, each node in this graph is a separate ``xfs_inode`` object with
its own locks, which makes validating the tree qualities difficult.
Fortunately, non-directories are allowed to have multiple parents and
cannot have children, so only directories need to be scanned.
Directories typically constitute 5-10% of the files in a filesystem,
which reduces the amount of work dramatically.
If the directory tree could be frozen, it would be easy to discover
cycles and disconnected regions by running a depth (or breadth) first
search downwards from the root directory and marking a bitmap for each
directory found.
At any point in the walk, trying to set an already set bit means there
is a cycle.
After the scan completes, XORing the marked inode bitmap with the inode
allocation bitmap reveals disconnected inodes.
However, one of online repair's design goals is to avoid locking the
entire filesystem unless it's absolutely necessary.
Directory tree updates can move subtrees across the scanner wavefront on
a live filesystem, so the bitmap algorithm cannot be applied.

Directory parent pointers enable an incremental approach to validation
of the tree structure.
Instead of using one thread to scan the entire filesystem, multiple
threads can walk from individual subdirectories upwards towards the
root.
For this to work, all directory entries and parent pointers must be
internally consistent, each directory entry must have a parent pointer,
and the link counts of all directories must be correct.
Each scanner thread must be able to take the IOLOCK of an alleged parent
directory while holding the IOLOCK of the child directory to prevent
either directory from being moved within the tree.
This is not possible since the VFS does not take the IOLOCK of a child
subdirectory when moving that subdirectory, so instead the scanner
stabilizes the parent -> child relationship by taking the ILOCKs and
installing a dirent update hook to detect changes.
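For illustration only, here is a userspace toy of the frozen-tree
thought experiment described above.
The bitmap is a plain 64-bit word and the directory type is invented;
online fsck cannot actually freeze the tree, which is why the
incremental algorithm below is used instead:

.. code-block:: c

   /*
    * Toy model of the frozen-tree check: walk downward from the root,
    * test-and-set one bit per directory to catch cycles, then XOR
    * against the allocation bitmap to find disconnected directories.
    */
   #include <stdbool.h>
   #include <stdint.h>

   static uint64_t seen_mask;      /* directories reached from the root */

   /* Returns true if the bit was already set, i.e. a cycle. */
   static bool test_and_set_seen(unsigned int ino)
   {
           uint64_t bit = 1ULL << (ino % 64);
           bool was_set = seen_mask & bit;

           seen_mask |= bit;
           return was_set;
   }

   struct tdir {
           unsigned int    ino;
           unsigned int    nr_children;
           struct tdir     **children;     /* subdirectories only */
   };

   /* Depth-first walk; returns false if a cycle was found. */
   static bool walk(struct tdir *dir)
   {
           unsigned int i;

           if (test_and_set_seen(dir->ino))
                   return false;           /* already visited: cycle */
           for (i = 0; i < dir->nr_children; i++)
                   if (!walk(dir->children[i]))
                           return false;
           return true;
   }

   /* Directories allocated but never reached are disconnected. */
   static uint64_t disconnected(uint64_t allocated_mask)
   {
           return seen_mask ^ allocated_mask;
   }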
The scanning process uses a dirent hook to detect changes to the
directories mentioned in the scan data.
The scan works as follows:

1. For each subdirectory in the filesystem,

   a. For each parent pointer of that subdirectory,

      1. Create a path object for that parent pointer, and mark the
         subdirectory inode number in the path object's bitmap.

      2. Record the parent pointer name and inode number in a path
         structure.

      3. If the alleged parent is the subdirectory being scrubbed, the
         path is a cycle.
         Mark the path for deletion and repeat step 1a with the next
         subdirectory parent pointer.

      4. Try to mark the alleged parent inode number in a bitmap in the
         path object.
         If the bit is already set, then there is a cycle in the
         directory tree.
         Mark the path as a cycle and repeat step 1a with the next
         subdirectory parent pointer.

      5. Load the alleged parent.
         If the alleged parent is not a linked directory, abort the scan
         because the parent pointer information is inconsistent.

      6. For each parent pointer of this alleged ancestor directory,

         a. Record the parent pointer name and inode number in the path
            object if no parent has been set for that level.

         b. If an ancestor has more than one parent, mark the path as
            corrupt.
            Repeat step 1a with the next subdirectory parent pointer.

         c. Repeat steps 1a3-1a6 for the ancestor identified in step
            1a6a.
            This repeats until the directory tree root is reached or no
            parents are found.

      7. If the walk terminates at the root directory, mark the path as
         ok.

      8. If the walk terminates without reaching the root, mark the path
         as disconnected.

2. If the directory entry update hook triggers, check all paths already
   found by the scan.
   If the entry matches part of a path, mark that path and the scan
   stale.
   When the scanner thread sees that the scan has been marked stale, it
   deletes all scan data and starts over.
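To make the path states in steps 1a3-1a8 concrete, here is a compressed
sketch of one upward walk.
The types, names, and the single-parent simplification are inventions of
this example; the kernel keeps one path object per parent pointer and
treats multi-parent ancestors (step 1a6b) as corrupt paths:

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Possible fates of one path object. */
   enum path_state {
           PATH_OK,                /* walk reached the root (step 1a7) */
           PATH_DELETE,            /* parent is the scan target (1a3) */
           PATH_CYCLE,             /* revisited a directory (step 1a4) */
           PATH_DISCONNECTED,      /* ran out of parents (step 1a8) */
   };

   struct pdir {
           uint64_t        ino;
           struct pdir     *parent;        /* one alleged parent */
           bool            is_root;
   };

   static uint64_t path_mask;              /* toy per-path bitmap */

   static bool ino_test_and_set(uint64_t ino)
   {
           uint64_t bit = 1ULL << (ino % 64);
           bool was_set = path_mask & bit;

           path_mask |= bit;
           return was_set;
   }

   static enum path_state walk_one_path(struct pdir *target)
   {
           struct pdir *parent = target->parent;

           path_mask = 0;
           ino_test_and_set(target->ino);  /* step 1a1 */

           while (parent) {
                   if (parent->ino == target->ino)
                           return PATH_DELETE;     /* check 1a3 first */
                   if (ino_test_and_set(parent->ino))
                           return PATH_CYCLE;      /* step 1a4 */
                   if (parent->is_root)
                           return PATH_OK;         /* step 1a7 */
                   parent = parent->parent;        /* step 1a6c */
           }
           return PATH_DISCONNECTED;               /* step 1a8 */
   }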
Repairing the directory tree works as follows:

1. Walk each path of the target subdirectory.

   a. Corrupt paths and cycle paths are counted as suspect.

   b. Paths already marked for deletion are counted as bad.

   c. Paths that reached the root are counted as good.

2. If the subdirectory is either the root directory or has zero link
   count, delete all incoming directory entries in the immediate
   parents.
   Repairs are complete.

3. If the subdirectory has exactly one path, set the dotdot entry to the
   parent and exit.

4. If the subdirectory has at least one good path, delete all the other
   incoming directory entries in the immediate parents.

5. If the subdirectory has no good paths and more than one suspect path,
   delete all the other incoming directory entries in the immediate
   parents.

6. If the subdirectory has zero paths, attach it to the lost and found.
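The rules above reduce to a small decision function over the path
tallies.
This is a sketch only; the enum, function, and parameter names are
invented here, not taken from the repair code:

.. code-block:: c

   #include <stdbool.h>

   enum repair_action {
           DROP_ALL_PARENTS,       /* root or zero link count: sever dirents */
           SET_DOTDOT,             /* exactly one path: fix the dotdot entry */
           KEEP_GOOD_PATHS,        /* keep the good paths, prune the rest */
           KEEP_ONE_SUSPECT,       /* no good paths: keep a single suspect */
           ADOPT,                  /* no paths at all: move to lost+found */
           NO_ACTION,
   };

   static enum repair_action
   choose_repair(bool is_root, unsigned int nlink, unsigned int n_good,
                 unsigned int n_suspect, unsigned int n_bad)
   {
           unsigned int n_paths = n_good + n_suspect + n_bad;

           if (is_root || nlink == 0)
                   return DROP_ALL_PARENTS;        /* step 2 */
           if (n_paths == 1)
                   return SET_DOTDOT;              /* step 3 */
           if (n_good > 0)
                   return KEEP_GOOD_PATHS;         /* step 4 */
           if (n_suspect > 1)
                   return KEEP_ONE_SUSPECT;        /* step 5 */
           if (n_paths == 0)
                   return ADOPT;                   /* step 6 */
           return NO_ACTION;
   }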
The proposed patches are in the
`directory tree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
series.

.. _orphanage:

The Orphanage
-------------

Filesystems present files as a directed, and hopefully acyclic, graph.
In other words, a tree.
The root of the filesystem is a directory, and each entry in a directory
points downwards either to more subdirectories or to non-directory
files.
Unfortunately, a disruption in the directory graph pointers results in a
disconnected graph, which makes files impossible to access via regular
path resolution.

Without parent pointers, the directory parent pointer online scrub code
can detect a dotdot entry pointing to a parent directory that doesn't
have a link back to the child directory, and the file link count checker
can detect a file that isn't pointed to by any directory in the
filesystem.
If such a file has a positive link count, the file is an orphan.

With parent pointers, directories can be rebuilt by scanning parent
pointers and parent pointers can be rebuilt by scanning directories.
This should reduce the incidence of files ending up in ``/lost+found``.

When orphans are found, they should be reconnected to the directory
tree.
Offline fsck solves the problem by creating a directory ``/lost+found``
to serve as an orphanage, and linking orphan files into the orphanage by
using the inumber as the name.
Reparenting a file to the orphanage does not reset any of its
permissions or ACLs.
This process is more involved in the kernel than it is in userspace.
The directory and file link count repair setup functions must use the
regular VFS mechanisms to create the orphanage directory with all the
necessary security attributes and dentry cache entries, just like a
regular directory tree modification.

Orphaned files are adopted by the orphanage as follows (see the sketch
below):

1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup
   function to try to ensure that the lost and found directory actually
   exists.
   This also attaches the orphanage directory to the scrub context.

2. If the decision is made to reconnect a file, take the IOLOCK of both
   the orphanage and the file being reattached.
   The ``xrep_orphanage_iolock_two`` function follows the inode locking
   strategy discussed earlier.

3. Use ``xrep_adoption_trans_alloc`` to reserve resources to the repair
   transaction.

4. Call ``xrep_orphanage_compute_name`` to compute the new name in the
   orphanage.

5. If the adoption is going to happen, call ``xrep_adoption_reparent``
   to reparent the orphaned file into the lost and found and invalidate
   the dentry cache.

6. Call ``xrep_adoption_finish`` to commit any filesystem updates,
   release the orphanage ILOCK, and clean the scrub transaction.
   Call ``xrep_adoption_commit`` to commit the updates and the scrub
   transaction.

7. If a runtime error happens, call ``xrep_adoption_cancel`` to release
   all resources.

The proposed patches are in the
`orphanage adoption
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
series.
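The adoption sequence above follows the usual kernel pattern of forward
progress with a single error unwind label.
Only the function names in this sketch come from the steps above; the
signatures, the use of ``struct xfs_scrub`` and ``struct xfs_name``
here, and the error handling style are assumptions of this sketch, not
the actual repair code:

.. code-block:: c

   /* Minimal sketch of steps 2-7; signatures are illustrative only. */
   static int
   xrep_adopt_orphan(
           struct xfs_scrub        *sc)
   {
           struct xfs_name         xname;
           int                     error;

           /* Steps 2-3: lock both inodes, reserve transaction space. */
           error = xrep_orphanage_iolock_two(sc);
           if (error)
                   return error;
           error = xrep_adoption_trans_alloc(sc);
           if (error)
                   goto out_cancel;

           /* Step 4: pick the inumber-based name in the orphanage. */
           error = xrep_orphanage_compute_name(sc, &xname);
           if (error)
                   goto out_cancel;

           /* Step 5: reparent the file and drop stale dentries. */
           error = xrep_adoption_reparent(sc, &xname);
           if (error)
                   goto out_cancel;

           /* Step 6: commit the updates and the scrub transaction. */
           error = xrep_adoption_finish(sc);
           if (error)
                   goto out_cancel;
           return xrep_adoption_commit(sc);

   out_cancel:
           /* Step 7: unwind everything on a runtime error. */
           xrep_adoption_cancel(sc);
           return error;
   }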
6. Userspace Algorithms and Data Structures
===========================================

This section discusses the key algorithms and data structures of the
userspace program, ``xfs_scrub``, that provide the ability to drive
metadata checks and repairs in the kernel, verify file data, and look
for other potential problems.

.. _scrubcheck:

Checking Metadata
-----------------

Recall the :ref:`phases of fsck work <scrubphases>` outlined earlier.
That structure follows naturally from the data dependencies designed
into the filesystem from its beginnings in 1993.
In XFS, there are several groups of metadata dependencies:

a. Filesystem summary counts depend on consistency within the inode
   indices, the allocation group space btrees, and the realtime volume
   space information.

b. Quota resource counts depend on consistency within the quota file
   data forks, inode indices, inode records, and the forks of every
   file on the system.

c. The naming hierarchy depends on consistency within the directory and
   extended attribute structures.
   This includes file link counts.

d. Directories, extended attributes, and file data depend on
   consistency within the file forks that map directory and extended
   attribute data to physical storage media.

e. The file forks depend on consistency within inode records and the
   space metadata indices of the allocation groups and the realtime
   volume.
   This includes quota and realtime metadata files.

f. Inode records depend on consistency within the inode metadata
   indices.

g. Realtime space metadata depend on the inode records and data forks
   of the realtime metadata inodes.

h. The allocation group metadata indices (free space, inodes, reference
   count, and reverse mapping btrees) depend on consistency within the
   AG headers and between all the AG metadata btrees.

i. ``xfs_scrub`` depends on the filesystem being mounted and kernel
   support for online fsck functionality.

Therefore, a metadata dependency graph is a convenient way to schedule
checking operations in the ``xfs_scrub`` program:
h]h)}(hPhase 1 checks that the provided path maps to an XFS filesystem and detect the kernel's scrubbing abilities, which validates group (i).h]hPhase 1 checks that the provided path maps to an XFS filesystem and detect the kernel’s scrubbing abilities, which validates group (i).}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj ubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hJPhase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue. h]h)}(hIPhase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.h]hIPhase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.}(hj'hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj#ubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hgPhase 3 scans inodes in parallel. For each inode, groups (f), (e), and (d) are checked, in that order. h]h)}(hfPhase 3 scans inodes in parallel. For each inode, groups (f), (e), and (d) are checked, in that order.h]hfPhase 3 scans inodes in parallel. For each inode, groups (f), (e), and (d) are checked, in that order.}(hj?hhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhj;ubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(h^Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6 may run reliably. h]h)}(h]Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6 may run reliably.h]h]Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6 may run reliably.}(hjWhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjSubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(h^Phase 5 starts by checking groups (b) and (c) in parallel before moving on to checking names. h]h)}(h]Phase 5 starts by checking groups (b) and (c) in parallel before moving on to checking names.h]h]Phase 5 starts by checking groups (b) and (c) in parallel before moving on to checking names.}(hjohhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjkubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(hPhase 6 depends on groups (i) through (b) to find file data blocks to verify, to read them, and to report which blocks of which files are affected. h]h)}(hPhase 6 depends on groups (i) through (b) to find file data blocks to verify, to read them, and to report which blocks of which files are affected.h]hPhase 6 depends on groups (i) through (b) to find file data blocks to verify, to read them, and to report which blocks of which files are affected.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjubah}(h]h ]h"]h$]h&]uh1hhjhhhhhNubh)}(h`_ and the `inode scan rebalance `_ series.h](h%The proposed patchsets are the scrub }(hjThhhNhNubj)}(h`performance tweaks `_h]hperformance tweaks}(hj\hhhNhNubah}(h]h ]h"]h$]h&]nameperformance tweaksjjghttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaksuh1jhjTubh)}(hj h]h}(h]performance-tweaksah ]h"]performance tweaksah$]h&]refurijluh1hjyKhjTubh and the }(hjThhhNhNubj)}(h~`inode scan rebalance `_h]hinode scan rebalance}(hj~hhhNhNubah}(h]h ]h"]h$]h&]nameinode scan rebalancejjdhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalanceuh1jhjTubh)}(hg h]h}(h]inode-scan-rebalanceah ]h"]inode scan rebalanceah$]h&]refurijuh1hjyKhjTubh series.}(hjThhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhMhjїhhubh)}(h.. 
.. _scrubrepair:

Scheduling Repairs
------------------

During phase 2, corruptions and inconsistencies reported in any AGI
header or inode btree are repaired immediately, because phase 3 relies
on proper functioning of the inode indices to find inodes to scan.
Failed repairs are rescheduled to phase 4.
Problems reported in any other space metadata are deferred to phase 4.
Optimization opportunities are always deferred to phase 4, no matter
their origin.

During phase 3, corruptions and inconsistencies reported in any part of
a file's metadata are repaired immediately if all space metadata were
validated during phase 2.
Repairs that fail or cannot be repaired immediately are scheduled for
phase 4.

In the original design of ``xfs_scrub``, it was thought that repairs
would be so infrequent that the ``struct xfs_scrub_metadata`` objects
used to communicate with the kernel could also be used as the primary
object to schedule repairs.
With recent increases in the number of optimizations possible for a
given filesystem object, it became much more memory-efficient to track
all eligible repairs for a given filesystem object with a single repair
item.
Each repair item represents a single lockable object -- AGs, metadata
files, individual inodes, or a class of summary information.

Phase 4 is responsible for scheduling a lot of repair work in as quick
a manner as is practical.
The :ref:`data dependencies <scrubcheck>` outlined earlier still apply,
which means that ``xfs_scrub`` must try to complete the repair work
scheduled by phase 2 before trying repair work scheduled by phase 3.
The repair process is as follows:

1. Start a round of repair with a workqueue and enough workers to keep
   the CPUs as busy as the user desires.

   a. For each repair item queued by phase 2,

      i. Ask the kernel to repair everything listed in the repair item
         for a given filesystem object.

      ii. Make a note if the kernel made any progress in reducing the
          number of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all
           metadata associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   b. If any repairs were made, jump back to 1a to retry all the phase 2
      items.

   c. For each repair item queued by phase 3,

      i. Ask the kernel to repair everything listed in the repair item
         for a given filesystem object.

      ii. Make a note if the kernel made any progress in reducing the
          number of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all
           metadata associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   d. If any repairs were made, jump back to 1c to retry all the phase 3
      items.

2. If step 1 made any repair progress of any kind, jump back to step 1
   to start another round of repair.

3. If there are items left to repair, run them all serially one more
   time.
   Complain if the repairs were not successful, since this is the last
   chance to repair anything.
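The control flow above is a fixed-point iteration: keep retrying while
any round makes progress, then make one final serial pass.
A condensed sketch, in which ``struct item_queue`` and the two helper
functions are stand-ins for ``xfs_scrub``'s internals rather than real
interfaces:

.. code-block:: c

   #include <stdbool.h>

   struct item_queue;

   /* One pass over a queue; returns true if anything improved. */
   bool repair_queue_pass(struct item_queue *q);

   /* Final serial pass that complains about persistent failures. */
   void repair_serially(struct item_queue *q);

   static bool repair_round(struct item_queue *ph2, struct item_queue *ph3)
   {
           bool progress = false;

           /* Steps 1a-1b: retry phase 2 items until no improvement. */
           while (repair_queue_pass(ph2))
                   progress = true;

           /* Steps 1c-1d: then do the same for the phase 3 items. */
           while (repair_queue_pass(ph3))
                   progress = true;

           return progress;
   }

   static void phase4_repair(struct item_queue *ph2, struct item_queue *ph3)
   {
           /* Step 2: run whole rounds until one makes no progress. */
           while (repair_round(ph2, ph3))
                   ;

           /* Step 3: last chance -- run the leftovers serially. */
           repair_serially(ph2);
           repair_serially(ph3);
   }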
h]h)}(heIf step 1 made any repair progress of any kind, jump back to step 1 to start another round of repair.h]heIf step 1 made any repair progress of any kind, jump back to step 1 to start another round of repair.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM2hjubah}(h]h ]h"]h$]h&]uh1hhj[hhhhhNubh)}(hIf there are items left to repair, run them all serially one more time. Complain if the repairs were not successful, since this is the last chance to repair anything. h]h)}(hIf there are items left to repair, run them all serially one more time. Complain if the repairs were not successful, since this is the last chance to repair anything.h]hIf there are items left to repair, run them all serially one more time. Complain if the repairs were not successful, since this is the last chance to repair anything.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM5hjubah}(h]h ]h"]h$]h&]uh1hhj[hhhhhNubeh}(h]h ]h"]h$]h&]jgjhjihjjjkuh1jhhjhhhhhMubh)}(hCorruptions and inconsistencies encountered during phases 5 and 7 are repaired immediately. Corrupt file data blocks reported by phase 6 cannot be recovered by the filesystem.h]hCorruptions and inconsistencies encountered during phases 5 and 7 are repaired immediately. Corrupt file data blocks reported by phase 6 cannot be recovered by the filesystem.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhM9hjhhubh)}(hXZThe proposed patchsets are the `repair warning improvements `_, refactoring of the `repair data dependency `_ and `object tracking `_, and the `repair scheduling `_ improvement series.h](hThe proposed patchsets are the }(hjŚhhhNhNubj)}(h`repair warning improvements `_h]hrepair warning improvements}(hj͚hhhNhNubah}(h]h ]h"]h$]h&]namerepair warning improvementsjjkhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warningsuh1jhjŚubh)}(hn h]h}(h]repair-warning-improvementsah ]h"]repair warning improvementsah$]h&]refurijݚuh1hjyKhjŚubh, refactoring of the }(hjŚhhhNhNubj)}(h`repair data dependency `_h]hrepair data dependency}(hjhhhNhNubah}(h]h ]h"]h$]h&]namerepair data dependencyjjehttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-depsuh1jhjŚubh)}(hh h]h}(h]repair-data-dependencyah ]h"]repair data dependencyah$]h&]refurijuh1hjyKhjŚubh and }(hjŚhhhNhNubj)}(hy`object tracking `_h]hobject tracking}(hjhhhNhNubah}(h]h ]h"]h$]h&]nameobject trackingjjdhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-trackinguh1jhjŚubh)}(hg h]h}(h]object-trackingah ]h"]object trackingah$]h&]refurij!uh1hjyKhjŚubh , and the }(hjŚhhhNhNubj)}(h}`repair scheduling `_h]hrepair scheduling}(hj3hhhNhNubah}(h]h ]h"]h$]h&]namerepair schedulingjjfhttps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-schedulinguh1jhjŚubh)}(hi h]h}(h]repair-schedulingah ]h"]repair schedulingah$]h&]refurijCuh1hjyKhjŚubh improvement series.}(hjŚhhhNhNubeh}(h]h ]h"]h$]h&]uh1hhhhM>hjhhubeh}(h](jjeh ]h"](scheduling repairs scrubrepaireh$]h&]uh1hhjzhhhhhMj}j`jsj}jjsubh)}(hhh](h)}(h/Checking Names for Confusable Unicode Sequencesh]h/Checking Names for Confusable Unicode Sequences}(hjhhhhNhNubah}(h]h ]h"]h$]h&]jjuh1hhjehhhhhMMubh)}(hXxIf ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of phase 4, it moves on to phase 5, which checks for suspicious looking names in the filesystem. These names consist of the filesystem label, names in directory entries, and the names of extended attributes. 
Media Verification of File Data Extents
---------------------------------------

The system administrator can elect to initiate a media scan of all file data
blocks.
This scan runs as phase 6, after validation of all filesystem metadata
(except for the summary counters).
The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space
map to find areas that are allocated to file data fork extents.
Gaps between data fork extents that are smaller than 64k are treated as if
they were data fork extents to reduce the command setup overhead.
When the space map scan accumulates a region larger than 32MB, a media
verification request is sent to the disk as a directio read of the raw block
device.

If the verification read fails, ``xfs_scrub`` retries with single-block reads
to narrow down the failure to the specific region of the media, and records
the error.
When it has finished issuing verification requests, it again uses the space
mapping ioctl to map the recorded media errors back to metadata structures
and report what has been lost.
For media errors in blocks owned by files, parent pointers can be used to
construct file paths from inode numbers for user-friendly reporting.
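The space map walk that drives the media scan can be sketched with the
existing interface.
This is a minimal illustration, assuming the definitions from
``linux/fsmap.h``; the 64k gap and 32MB coalescing policies described above,
and all batching of the reads themselves, are omitted:

.. code-block:: c

	#include <stdlib.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/fsmap.h>

	/* Call fn on each space map record that maps file data. */
	static int
	walk_file_data(int fd, void (*fn)(const struct fsmap *))
	{
		struct fsmap_head	*head;
		unsigned int		i;

		head = calloc(1, fsmap_sizeof(128));
		if (!head)
			return -1;

		/* Query everything: low key is zero, high key is all ones. */
		memset(&head->fmh_keys[1], 0xFF, sizeof(struct fsmap));
		head->fmh_count = 128;

		for (;;) {
			if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0)
				break;
			if (!head->fmh_entries)
				break;
			for (i = 0; i < head->fmh_entries; i++) {
				struct fsmap *rec = &head->fmh_recs[i];

				/* Skip free space, metadata, attr fork data. */
				if (!(rec->fmr_flags & (FMR_OF_SPECIAL_OWNER |
							FMR_OF_EXTENT_MAP |
							FMR_OF_ATTR_FORK)))
					fn(rec);
			}
			if (head->fmh_recs[head->fmh_entries - 1].fmr_flags &
			    FMR_OF_LAST)
				break;
			fsmap_advance(head);
		}

		free(head);
		return 0;
	}

A real scanner would merge the returned extents into large regions, issue
direct reads of the block device, and record any regions whose reads fail.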
7. Conclusion and Future Work
=============================

It is hoped that the reader of this document has followed the designs laid
out in this document and now has some familiarity with how XFS performs
online rebuilding of its metadata indices, and how filesystem users can
interact with that functionality.
Although the scope of this work is daunting, it is hoped that this guide will
make it easier for code readers to understand what has been built, for whom
it has been built, and why.
Please feel free to contact the XFS mailing list with questions.

XFS_IOC_EXCHANGE_RANGE
----------------------

As discussed earlier, a second frontend to the atomic file mapping exchange
mechanism is a new ioctl call that userspace programs can use to commit
updates to files atomically.
This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard.
File Content Exchanges with Regular User Files
``````````````````````````````````````````````

As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
The earliest form of this was the fork swap mechanism, where the entire
contents of data forks could be exchanged between two files by exchanging the
raw bytes in each inode fork's immediate area.
When XFS v5 came along with self-describing metadata, this old mechanism grew
some log support to continue rewriting the owner fields of BMBT blocks during
log recovery.
When the reverse mapping btree was later added to XFS, the only way to
maintain the consistency of the fork mappings with the reverse mapping index
was to develop an iterative mechanism that used deferred bmap and rmap
operations to swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for
the new tracking items, because the atomic file mapping exchange mechanism is
an iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain.

Atomic file content exchanges are much more flexible than the existing
swapext implementations because they can guarantee that the caller never sees
a mix of old and new contents even after a crash, and they can operate on two
arbitrary file fork ranges.
The extra flexibility enables several new use cases:

- **Atomic commit of file writes**: A userspace process opens a file that it
  wants to update.
  Next, it opens a temporary file and calls the file clone operation to
  reflink the first file's contents into the temporary file.
  Writes to the original file should instead be written to the temporary
  file.
  Finally, the process calls the atomic file mapping exchange system call
  (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
  committing all of the updates to the original file, or none of them.
  A sketch of this sequence appears after this list.

.. _exchrange_if_unchanged:

- **Transactional file updates**: The same mechanism as above, but the caller
  only wants the commit to occur if the original file's contents have not
  changed.
  To make this happen, the calling process snapshots the file modification
  and change timestamps of the original file before reflinking its data to
  the temporary file.
  When the program is ready to commit the changes, it passes the timestamps
  into the kernel as arguments to the atomic file mapping exchange system
  call.
  The kernel only commits the changes if the provided timestamps match the
  original file.
  A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.

- **Emulation of atomic block device writes**: Export a block device with a
  logical sector size matching the filesystem block size to force all writes
  to be aligned to the filesystem block size.
  Stage all writes to a temporary file, and when that is complete, call the
  atomic file mapping exchange system call with a flag to indicate that holes
  in the temporary file should be ignored.
  This emulates an atomic device write in software, and can support arbitrary
  scattered writes.
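Here is a sketch of the first use case, assuming the
``XFS_IOC_EXCHANGE_RANGE`` definitions that were merged in Linux 6.10 and
ship in xfsprogs' ``xfs/xfs_fs.h``.
Treat the details, particularly the flag choice, as illustrative rather than
normative:

.. code-block:: c

	#define _GNU_SOURCE	/* O_TMPFILE */
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* FICLONE */
	#include <xfs/xfs_fs.h>		/* XFS_IOC_EXCHANGE_RANGE */

	/*
	 * Atomically commit an update to the file at @name: stage the new
	 * contents in an unlinked temporary file on the same filesystem,
	 * then exchange the mappings.  After a crash, the original file
	 * contains either all of the update or none of it.  Sketch only.
	 */
	static int
	commit_update(int dirfd, const char *name,
		      const void *buf, size_t len, off_t off)
	{
		struct xfs_exchange_range	args = { 0 };
		int				fd, tmpfd, ret = -1;

		fd = openat(dirfd, name, O_RDWR);
		tmpfd = openat(dirfd, ".", O_TMPFILE | O_RDWR, 0600);
		if (fd < 0 || tmpfd < 0)
			goto out;

		/* Share the old contents with the staging file... */
		if (ioctl(tmpfd, FICLONE, fd))
			goto out;

		/* ...redirect the update to the staging file... */
		if (pwrite(tmpfd, buf, len, off) != (ssize_t)len)
			goto out;

		/* ...then exchange the staging mappings into the original. */
		args.file1_fd = tmpfd;
		args.flags = XFS_EXCHANGE_RANGE_TO_EOF;
		ret = ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
	out:
		if (tmpfd >= 0)
			close(tmpfd);
		if (fd >= 0)
			close(fd);
		return ret;
	}

Note that the staging file must live on the same filesystem as the target,
which is why it is created relative to the target's directory.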
Vectorized Scrub
----------------

As it turns out, the :ref:`refactoring <scrubrepair>` of repair items
mentioned earlier was a catalyst for enabling a vectorized scrub system call.
Since 2018, the cost of making a kernel call has increased considerably on
some systems to mitigate the effects of speculative execution attacks.
This incentivizes program authors to make as few system calls as possible to
reduce the number of times an execution path crosses a security boundary.

With vectorized scrub, userspace pushes to the kernel the identity of a
filesystem object, a list of scrub types to run against that object, and a
simple representation of the data dependencies between the selected scrub
types.
The kernel executes as much of the caller's plan as it can until it hits a
dependency that cannot be satisfied due to a corruption, and tells userspace
how much was accomplished.
It is hoped that ``io_uring`` will pick up enough of this functionality that
online fsck can use that instead of adding a separate vectored scrub system
call to XFS.
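The shape of such an interface might look like the sketch below.
None of these names comes from the proposed uapi; they are hypothetical, and
exist only to illustrate the batching idea.
See the patchsets linked below for the real proposal:

.. code-block:: c

	#include <linux/types.h>

	/* Hypothetical vectorized scrub interface; all names are made up. */
	struct scrub_vec {
		__u32	sv_type;	/* scrub type for this subcommand */
		__u32	sv_flags;	/* in: repair?  out: corrupt? */
		__s32	sv_ret;		/* out: errno for this subcommand */
		__u32	sv_reserved;	/* must be zero */
	};

	struct scrub_vec_head {
		__u64	svh_ino;	/* inode to scrub, if applicable */
		__u32	svh_gen;	/* inode generation, if applicable */
		__u32	svh_agno;	/* AG number, if applicable */
		__u32	svh_nr;		/* number of elements in svh_vecs */
		__u32	svh_reserved;	/* must be zero */
		struct scrub_vec svh_vecs[];	/* subcommands and barriers */
	};

A single call could then check an inode's core, data fork, and extended
attributes, with a barrier between them expressing the dependency that the
fork scans are pointless if the inode core is corrupt; the kernel stops at
the barrier and reports per-vector results instead of bouncing userspace
across the syscall boundary after every check.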
The relevant patchsets are the
`kernel vectorized scrub
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
and
`userspace vectorized scrub
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
series.

Quality of Service Targets for Scrub
------------------------------------

One serious shortcoming of the online fsck code is that the amount of time
that it can spend in the kernel holding resource locks is basically
unbounded.
Userspace is allowed to send a fatal signal to the process, which will cause
``xfs_scrub`` to exit when it reaches a good stopping point, but there's no
way for userspace to provide a time budget to the kernel.
Given that the scrub codebase has helpers to detect fatal signals, it
shouldn't be too much work to allow userspace to specify a timeout for a
scrub/repair operation and abort the operation if it exceeds budget.
However, most repair functions have the property that once they begin to
touch ondisk metadata, the operation cannot be cancelled cleanly, after which
a QoS timeout is no longer useful.
Defragmenting Free Space
------------------------

Over the years, many XFS users have requested the creation of a program to
clear a portion of the physical storage underlying a filesystem so that it
becomes a contiguous chunk of free space.
Call this free space defragmenter ``clearspace`` for short.

The first piece the ``clearspace`` program needs is the ability to read the
reverse mapping index from userspace.
This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
The second piece it needs is a new fallocate mode
(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
maps it to a file.
Call this file the "space collector" file.
The third piece is the ability to force an online repair.

To clear all the metadata out of a portion of physical storage, clearspace
uses the new fallocate map-freespace call to map any free space in that
region to the space collector file.
Next, clearspace finds all metadata blocks in that region by way of
``GETFSMAP`` and issues forced repair requests on the data structure.
This often results in the metadata being rebuilt somewhere that is not being
cleared.
After each relocation, clearspace calls the "map free space" function again
to collect any newly freed space in the region being cleared.
To clear all the file data out of a portion of the physical storage,
clearspace uses the FSMAP information to find relevant file data blocks.
Having identified a good target, it uses the ``FICLONERANGE`` call on that
part of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not
being cleared, and uses ``FIDEDUPERANGE`` (or the
:ref:`atomic file content exchanges <exchrange_if_unchanged>` feature) to
change the target file's data extent mapping away from the area being
cleared.
When all other mappings have been moved, clearspace reflinks the space into
the space collector file so that it becomes unavailable.
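The freeze and remap steps can be sketched with the two existing ioctls.
Assumptions in this sketch: offsets and lengths are filesystem-block aligned,
and the caller has already copied the extent's bytes into the relocation
file, since deduplication refuses to remap ranges whose contents differ:

.. code-block:: c

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>	/* FICLONERANGE, FIDEDUPERANGE */

	/*
	 * Freeze @len bytes of @victim_fd at @off by cloning them into
	 * @frozen_fd, then remap the victim's blocks to the identical copy
	 * already staged at offset 0 of @copy_fd.  Sketch only.
	 */
	static int
	move_mapping(int victim_fd, __u64 off, __u64 len,
		     int frozen_fd, int copy_fd)
	{
		struct file_clone_range	clone = {
			.src_fd		= victim_fd,
			.src_offset	= off,
			.src_length	= len,
			.dest_offset	= 0,
		};
		union {
			struct file_dedupe_range r;
			char pad[sizeof(struct file_dedupe_range) +
				 sizeof(struct file_dedupe_range_info)];
		} d;

		/* Pin the old blocks so nobody can overwrite them in place. */
		if (ioctl(frozen_fd, FICLONERANGE, &clone))
			return -1;

		/* Point the victim's mapping at the staged copy instead. */
		memset(&d, 0, sizeof(d));
		d.r.src_offset = 0;
		d.r.src_length = len;
		d.r.dest_count = 1;
		d.r.info[0].dest_fd = victim_fd;
		d.r.info[0].dest_offset = off;
		if (ioctl(copy_fd, FIDEDUPERANGE, &d.r))
			return -1;
		return d.r.info[0].status == FILE_DEDUPE_RANGE_SAME ? 0 : -1;
	}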
There are further optimizations that could apply to the above algorithm.
To clear a piece of physical storage that has a high sharing factor, it is
strongly desirable to retain this sharing factor.
In fact, these extents should be moved first to maximize sharing factor after
the operation completes.
To make this work smoothly, clearspace needs a new ioctl
(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
With the refcount information exposed, clearspace can quickly find the
longest, most shared data extents in the filesystem, and target them first.
**Future Work Question**: How might the filesystem move inode chunks?

*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
that creates a new file with the old contents and then locklessly runs around
the filesystem updating directory entries.
The operation cannot complete if the filesystem goes down.
That problem isn't totally insurmountable: create an inode remapping table
hidden behind a jump label, and a log item that tracks the kernel walking the
filesystem to update directory entries.
The trouble is, the kernel can't do anything about open files, since it
cannot revoke them.

**Future Work Question**: Can static keys be used to minimize the cost of
supporting ``revoke()`` on XFS files?

*Answer*: Yes.
Until the first revocation, the bailout code need not be in the call path at
all.
The relevant patchsets are the
`kernel freespace defrag
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
and
`userspace freespace defrag
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
series.

Shrinking Filesystems
---------------------

Removing the end of the filesystem ought to be a simple matter of evacuating
the data and metadata at the end of the filesystem, and handing the freed
space to the shrink code.
That requires an evacuation of the space at end of the filesystem, which is a
use of free space defragmentation!