€•_µŒsphinx.addnodes”Œdocument”“”)”}”(Œ rawsource”Œ”Œchildren”]”(Œ translations”Œ LanguagesNode”“”)”}”(hhh]”(hŒ pending_xref”“”)”}”(hhh]”Œdocutils.nodes”ŒText”“”ŒChinese (Simplified)”…””}”Œparent”hsbaŒ attributes”}”(Œids”]”Œclasses”]”Œnames”]”Œdupnames”]”Œbackrefs”]”Œ refdomain”Œstd”Œreftype”Œdoc”Œ reftarget”Œ@/translations/zh_CN/filesystems/xfs/xfs-self-describing-metadata”Œmodname”NŒ classname”NŒ refexplicit”ˆuŒtagname”hhh ubh)”}”(hhh]”hŒChinese (Traditional)”…””}”hh2sbah}”(h]”h ]”h"]”h$]”h&]”Œ refdomain”h)Œreftype”h+Œ reftarget”Œ@/translations/zh_TW/filesystems/xfs/xfs-self-describing-metadata”Œmodname”NŒ classname”NŒ refexplicit”ˆuh1hhh ubh)”}”(hhh]”hŒItalian”…””}”hhFsbah}”(h]”h ]”h"]”h$]”h&]”Œ refdomain”h)Œreftype”h+Œ reftarget”Œ@/translations/it_IT/filesystems/xfs/xfs-self-describing-metadata”Œmodname”NŒ classname”NŒ refexplicit”ˆuh1hhh ubh)”}”(hhh]”hŒJapanese”…””}”hhZsbah}”(h]”h ]”h"]”h$]”h&]”Œ refdomain”h)Œreftype”h+Œ reftarget”Œ@/translations/ja_JP/filesystems/xfs/xfs-self-describing-metadata”Œmodname”NŒ classname”NŒ refexplicit”ˆuh1hhh ubh)”}”(hhh]”hŒKorean”…””}”hhnsbah}”(h]”h ]”h"]”h$]”h&]”Œ refdomain”h)Œreftype”h+Œ reftarget”Œ@/translations/ko_KR/filesystems/xfs/xfs-self-describing-metadata”Œmodname”NŒ classname”NŒ refexplicit”ˆuh1hhh ubh)”}”(hhh]”hŒSpanish”…””}”hh‚sbah}”(h]”h ]”h"]”h$]”h&]”Œ refdomain”h)Œreftype”h+Œ reftarget”Œ@/translations/sp_SP/filesystems/xfs/xfs-self-describing-metadata”Œmodname”NŒ classname”NŒ refexplicit”ˆuh1hhh ubeh}”(h]”h ]”h"]”h$]”h&]”Œcurrent_language”ŒEnglish”uh1h hhŒ _document”hŒsource”NŒline”NubhŒcomment”“”)”}”(hŒ SPDX-License-Identifier: GPL-2.0”h]”hŒ SPDX-License-Identifier: GPL-2.0”…””}”hh£sbah}”(h]”h ]”h"]”h$]”h&]”Œ xml:space”Œpreserve”uh1h¡hhhžhhŸŒZ/var/lib/git/docbuild/linux/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst”h KubhŒtarget”“”)”}”(hŒ!.. 
_xfs_self_describing_metadata:”h]”h}”(h]”h ]”h"]”h$]”h&]”Œrefid”Œxfs-self-describing-metadata”uh1h´h KhhhžhhŸh³ubhŒsection”“”)”}”(hhh]”(hŒtitle”“”)”}”(hŒXFS Self Describing Metadata”h]”hŒXFS Self Describing Metadata”…””}”(hhÉhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hÇhhÄhžhhŸh³h KubhÃ)”}”(hhh]”(hÈ)”}”(hŒ Introduction”h]”hŒ Introduction”…””}”(hhÚhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hÇhh×hžhhŸh³h K ubhŒ paragraph”“”)”}”(hXmThe largest scalability problem facing XFS is not one of algorithmic scalability, but of verification of the filesystem structure. Scalability of the structures and indexes on disk and the algorithms for iterating them are adequate for supporting PB scale filesystems with billions of inodes; however, it is this very scalability that causes the verification problem.”h]”hXmThe largest scalability problem facing XFS is not one of algorithmic scalability, but of verification of the filesystem structure. Scalability of the structures and indexes on disk and the algorithms for iterating them are adequate for supporting PB scale filesystems with billions of inodes; however, it is this very scalability that causes the verification problem.”…””}”(hhêhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K hh×hžhubhé)”}”(hXÆAlmost all metadata on XFS is dynamically allocated. The only fixed location metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all other metadata structures need to be discovered by walking the filesystem structure in different ways. 
While this is already done by userspace tools for validating and repairing the structure, there are limits to what they can verify, and this in turn limits the supportable size of an XFS filesystem.”…””}”(hhøhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h Khh×hžhubhé)”}”(hXåFor example, it is entirely possible to manually use xfs_db and a bit of scripting to analyse the structure of a 100TB filesystem when trying to determine the root cause of a corruption problem, but it is still mainly a manual task of verifying that things like single bit errors or misplaced writes weren't the ultimate cause of a corruption event. It may take a few hours to a few days to perform such forensic analysis, so at this scale root cause analysis is entirely possible.”h]”hXçFor example, it is entirely possible to manually use xfs_db and a bit of scripting to analyse the structure of a 100TB filesystem when trying to determine the root cause of a corruption problem, but it is still mainly a manual task of verifying that things like single bit errors or misplaced writes weren’t the ultimate cause of a corruption event. It may take a few hours to a few days to perform such forensic analysis, so at this scale root cause analysis is entirely possible.”…””}”(hjhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h Khh×hžhubhé)”}”(hXÒHowever, if we scale the filesystem up to 1PB, we now have 10x as much metadata to analyse and so that analysis blows out towards weeks/months of forensic work. Most of the analysis work is slow and tedious, so as the amount of analysis goes up, the more likely it is that the cause will be lost in the noise. Hence the primary concern for supporting PB scale filesystems is minimising the time and effort required for basic forensic analysis of the filesystem structure.”h]”hXÒHowever, if we scale the filesystem up to 1PB, we now have 10x as much metadata to analyse and so that analysis blows out towards weeks/months of forensic work. 
Most of the analysis work is slow and tedious, so as the amount of analysis goes up, the more likely it is that the cause will be lost in the noise. Hence the primary concern for supporting PB scale filesystems is minimising the time and effort required for basic forensic analysis of the filesystem structure.”…””}”(hjhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K hh×hžhubeh}”(h]”Œ introduction”ah ]”h"]”Œ introduction”ah$]”h&]”uh1hÂhhÄhžhhŸh³h K ubhÃ)”}”(hhh]”(hÈ)”}”(hŒSelf Describing Metadata”h]”hŒSelf Describing Metadata”…””}”(hj-hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hÇhj*hžhhŸh³h K)ubhé)”}”(hXhOne of the problems with the current metadata format is that apart from the magic number in the metadata block, we have no other way of identifying what it is supposed to be. We can't even identify if it is in the right place. Put simply, you can't look at a single metadata block in isolation and say "yes, it is supposed to be there and the contents are valid".”h]”hXpOne of the problems with the current metadata format is that apart from the magic number in the metadata block, we have no other way of identifying what it is supposed to be. We can’t even identify if it is in the right place. Put simply, you can’t look at a single metadata block in isolation and say “yes, it is supposed to be there and the contents are validâ€.”…””}”(hj;hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K+hj*hžhubhé)”}”(hXûHence most of the time spent on forensic analysis is spent doing basic verification of metadata values, looking for values that are in range (and hence not detected by automated verification checks) but are not correct. Finding and understanding how things like cross linked block lists (e.g. 
sibling pointers in a btree end up with loops in them) are the key to understanding what went wrong, but it is impossible to tell what order the blocks were linked into each other or written to disk after the fact.”h]”hXûHence most of the time spent on forensic analysis is spent doing basic verification of metadata values, looking for values that are in range (and hence not detected by automated verification checks) but are not correct. Finding and understanding how things like cross linked block lists (e.g. sibling pointers in a btree end up with loops in them) are the key to understanding what went wrong, but it is impossible to tell what order the blocks were linked into each other or written to disk after the fact.”…””}”(hjIhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K1hj*hžhubhé)”}”(hXQHence we need to record more information into the metadata to allow us to quickly determine if the metadata is intact and can be ignored for the purpose of analysis. We can't protect against every possible type of error, but we can ensure that common types of errors are easily detectable. Hence the concept of self describing metadata.”h]”hXSHence we need to record more information into the metadata to allow us to quickly determine if the metadata is intact and can be ignored for the purpose of analysis. We can’t protect against every possible type of error, but we can ensure that common types of errors are easily detectable. Hence the concept of self describing metadata.”…””}”(hjWhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K9hj*hžhubhé)”}”(hXThe first, fundamental requirement of self describing metadata is that the metadata object contains some form of unique identifier in a well known location. This allows us to identify the expected contents of the block and hence parse and verify the metadata object. 
If we can't independently identify the type of metadata in the object, then the metadata doesn't describe itself very well at all!”h]”hX‘The first, fundamental requirement of self describing metadata is that the metadata object contains some form of unique identifier in a well known location. This allows us to identify the expected contents of the block and hence parse and verify the metadata object. If we can’t independently identify the type of metadata in the object, then the metadata doesn’t describe itself very well at all!”…””}”(hjehžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K?hj*hžhubhé)”}”(hXdLuckily, almost all XFS metadata has magic numbers embedded already - only the AGFL, remote symlinks and remote attribute blocks do not contain identifying magic numbers. Hence we can change the on-disk format of all these objects to add more identifying information and detect this simply by changing the magic numbers in the metadata objects. That is, if it has the current magic number, the metadata isn't self identifying. If it contains a new magic number, it is self identifying and we can do much more expansive automated verification of the metadata object at runtime, during forensic analysis or repair.”h]”hXfLuckily, almost all XFS metadata has magic numbers embedded already - only the AGFL, remote symlinks and remote attribute blocks do not contain identifying magic numbers. Hence we can change the on-disk format of all these objects to add more identifying information and detect this simply by changing the magic numbers in the metadata objects. That is, if it has the current magic number, the metadata isn’t self identifying. 
If it contains a new magic number, it is self identifying and we can do much more expansive automated verification of the metadata object at runtime, during forensic analysis or repair.”…””}”(hjshžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h KFhj*hžhubhé)”}”(hXËAs a primary concern, self describing metadata needs some form of overall integrity checking. We cannot trust the metadata if we cannot verify that it has not been changed as a result of external influences. Hence we need some form of integrity check, and this is done by adding CRC32c validation to the metadata block. If we can verify the block contains the metadata it was intended to contain, a large amount of the manual verification work can be skipped.”h]”hXËAs a primary concern, self describing metadata needs some form of overall integrity checking. We cannot trust the metadata if we cannot verify that it has not been changed as a result of external influences. Hence we need some form of integrity check, and this is done by adding CRC32c validation to the metadata block. If we can verify the block contains the metadata it was intended to contain, a large amount of the manual verification work can be skipped.”…””}”(hjhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h KOhj*hžhubhé)”}”(hXmCRC32c was selected as metadata cannot be more than 64k in length in XFS and hence a 32 bit CRC is more than sufficient to detect multi-bit errors in metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is fast. So while CRC32c is not the strongest of possible integrity checks that could be used, it is more than sufficient for our needs and has relatively little overhead. 
Adding support for larger integrity fields and/or algorithms doesn't really provide any extra value over CRC32c, but it does add a lot of complexity and so there is no provision for changing the integrity checking mechanism.”h]”hXmCRC32c was selected as metadata cannot be more than 64k in length in XFS and hence a 32 bit CRC is more than sufficient to detect multi-bit errors in metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is fast. So while CRC32c is not the strongest of possible integrity checks that could be used, it is more than sufficient for our needs and has relatively little overhead. Adding support for larger integrity fields and/or algorithms doesn’t really provide any extra value over CRC32c, but it does add a lot of complexity and so there is no provision for changing the integrity checking mechanism.”…””}”(hjhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h KVhj*hžhubhé)”}”(hXSelf describing metadata needs to contain enough information so that the metadata block can be verified as being in the correct place without needing to look at any other metadata. This means it needs to contain location information. Just adding a block number to the metadata is not sufficient to protect against mis-directed writes - a write might be misdirected to the wrong LUN and so be written to the "correct block" of the wrong filesystem. Hence location information must contain a filesystem identifier as well as a block number.”h]”hXSelf describing metadata needs to contain enough information so that the metadata block can be verified as being in the correct place without needing to look at any other metadata. This means it needs to contain location information. Just adding a block number to the metadata is not sufficient to protect against mis-directed writes - a write might be misdirected to the wrong LUN and so be written to the “correct block†of the wrong filesystem. 
Hence location information must contain a filesystem identifier as well as a block number.”…””}”(hjhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K`hj*hžhubhé)”}”(hX9Another key information point in forensic analysis is knowing who the metadata block belongs to. We already know the type, the location, that it is valid and/or corrupted, and how long ago that it was last modified. Knowing the owner of the block is important as it allows us to find other related metadata to determine the scope of the corruption. For example, if we have an extent btree object, we don't know what inode it belongs to and hence have to walk the entire filesystem to find the owner of the block. Worse, the corruption could mean that no owner can be found (i.e. it's an orphan block), and so without an owner field in the metadata we have no idea of the scope of the corruption. If we have an owner field in the metadata object, we can immediately do top down validation to determine the scope of the problem.”h]”hX=Another key information point in forensic analysis is knowing who the metadata block belongs to. We already know the type, the location, that it is valid and/or corrupted, and how long ago that it was last modified. Knowing the owner of the block is important as it allows us to find other related metadata to determine the scope of the corruption. For example, if we have an extent btree object, we don’t know what inode it belongs to and hence have to walk the entire filesystem to find the owner of the block. Worse, the corruption could mean that no owner can be found (i.e. it’s an orphan block), and so without an owner field in the metadata we have no idea of the scope of the corruption. If we have an owner field in the metadata object, we can immediately do top down validation to determine the scope of the problem.”…””}”(hj«hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h Khhj*hžhubhé)”}”(hX°Different types of metadata have different owner identifiers. 
For example, directory, attribute and extent tree blocks are all owned by an inode, while freespace btree blocks are owned by an allocation group. Hence the size and contents of the owner field are determined by the type of metadata object we are looking at. The owner information can also identify misplaced writes (e.g. freespace btree block written to the wrong AG).”h]”hX°Different types of metadata have different owner identifiers. For example, directory, attribute and extent tree blocks are all owned by an inode, while freespace btree blocks are owned by an allocation group. Hence the size and contents of the owner field are determined by the type of metadata object we are looking at. The owner information can also identify misplaced writes (e.g. freespace btree block written to the wrong AG).”…””}”(hj¹hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h Kthj*hžhubhé)”}”(hXSelf describing metadata also needs to contain some indication of when it was written to the filesystem. One of the key information points when doing forensic analysis is how recently the block was modified. Correlation of a set of corrupted metadata blocks based on modification times is important as it can indicate whether the corruptions are related, whether there have been multiple corruption events that led to the eventual failure, and even whether there are corruptions present that the run-time verification is not detecting.”h]”hXSelf describing metadata also needs to contain some indication of when it was written to the filesystem. One of the key information points when doing forensic analysis is how recently the block was modified. 
Correlation of a set of corrupted metadata blocks based on modification times is important as it can indicate whether the corruptions are related, whether there have been multiple corruption events that led to the eventual failure, and even whether there are corruptions present that the run-time verification is not detecting.”…””}”(hjÕhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K{hj*hžhubhé)”}”(hXÌFor example, we can determine whether a metadata object is supposed to be free space or still allocated if it is still referenced by its owner by looking at when the free space btree block that contains the block was last written compared to when the metadata object itself was last written. If the free space block is more recent than the object and the object's owner, then there is a very good chance that the block should have been removed from the owner.”h]”hXÎFor example, we can determine whether a metadata object is supposed to be free space or still allocated if it is still referenced by its owner by looking at when the free space btree block that contains the block was last written compared to when the metadata object itself was last written. If the free space block is more recent than the object and the object’s owner, then there is a very good chance that the block should have been removed from the owner.”…””}”(hjãhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h Kƒhj*hžhubhé)”}”(hX{To provide this "written timestamp", each metadata block gets the Log Sequence Number (LSN) of the most recent transaction it was modified on written into it. This number will always increase over the life of the filesystem, and the only thing that resets it is running xfs_repair on the filesystem. 
Further, by use of the LSN we can tell if the corrupted metadata all belonged to the same log checkpoint and hence have some idea of how much modification occurred between the first and last instance of corrupt metadata on disk and, further, how much modification occurred between the corruption being written and when it was detected.”h]”hXTo provide this “written timestampâ€, each metadata block gets the Log Sequence Number (LSN) of the most recent transaction it was modified on written into it. This number will always increase over the life of the filesystem, and the only thing that resets it is running xfs_repair on the filesystem. Further, by use of the LSN we can tell if the corrupted metadata all belonged to the same log checkpoint and hence have some idea of how much modification occurred between the first and last instance of corrupt metadata on disk and, further, how much modification occurred between the corruption being written and when it was detected.”…””}”(hjãhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h KŠhj*hžhubeh}”(h]”Œself-describing-metadata”ah ]”h"]”Œself describing metadata”ah$]”h&]”uh1hÂhhÄhžhhŸh³h K)ubhÃ)”}”(hhh]”(hÈ)”}”(hŒRuntime Validation”h]”hŒRuntime Validation”…””}”(hjühžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hÇhjùhžhhŸh³h K•ubhé)”}”(hŒLValidation of self-describing metadata takes place at runtime in two places:”h]”hŒLValidation of self-describing metadata takes place at runtime in two places:”…””}”(hj hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K—hjùhžhubhŒ block_quote”“”)”}”(hŒ[- immediately after a successful read from disk - immediately prior to write IO submission ”h]”hŒ bullet_list”“”)”}”(hhh]”(hŒ list_item”“”)”}”(hŒ-immediately after a successful read from disk”h]”hé)”}”(hj'h]”hŒ-immediately after a successful read from disk”…””}”(hj)hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K™hj%ubah}”(h]”h ]”h"]”h$]”h&]”uh1j#hj ubj$)”}”(hŒ)immediately prior to write IO submission ”h]”hé)”}”(hŒ(immediately prior to write IO 
submission”h]”hŒ(immediately prior to write IO submission”…””}”(hj@hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h Kšhj<ubah}”(h]”h ]”h"]”h$]”h&]”uh1j#hj ubeh}”(h]”h ]”h"]”h$]”h&]”Œbullet”Œ-”uh1jhŸh³h K™hjubah}”(h]”h ]”h"]”h$]”h&]”uh1jhŸh³h K™hjùhžhubhé)”}”(hXwThe verification is completely stateless - it is done independently of the modification process, and seeks only to check that the metadata is what it says it is and that the metadata fields are within bounds and internally consistent. As such, we cannot catch all types of corruption that can occur within a block as there may be certain limitations that operational state enforces on the metadata, or there may be corruption of interblock relationships (e.g. corrupted sibling pointer lists). Hence we still need stateful checking in the main code body, but in general most of the per-field validation is handled by the verifiers.”h]”hXwThe verification is completely stateless - it is done independently of the modification process, and seeks only to check that the metadata is what it says it is and that the metadata fields are within bounds and internally consistent. As such, we cannot catch all types of corruption that can occur within a block as there may be certain limitations that operational state enforces on the metadata, or there may be corruption of interblock relationships (e.g. corrupted sibling pointer lists). Hence we still need stateful checking in the main code body, but in general most of the per-field validation is handled by the verifiers.”…””}”(hjbhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h Kœhjùhžhubhé)”}”(hXQFor read verification, the caller needs to specify the expected type of metadata that it should see, and the IO completion process verifies that the metadata object matches what was expected. If the verification process fails, then it marks the object being read as EFSCORRUPTED. 
The caller needs to catch this error (same as for IO errors), and if it needs to take special action due to a verification error it can do so by catching the EFSCORRUPTED error value. If we need more discrimination of error type at higher levels, we can define new error numbers for different errors as necessary.”h]”hXQFor read verification, the caller needs to specify the expected type of metadata that it should see, and the IO completion process verifies that the metadata object matches what was expected. If the verification process fails, then it marks the object being read as EFSCORRUPTED. The caller needs to catch this error (same as for IO errors), and if it needs to take special action due to a verification error it can do so by catching the EFSCORRUPTED error value. If we need more discrimination of error type at higher levels, we can define new error numbers for different errors as necessary.”…””}”(hjphžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K¦hjùhžhubhé)”}”(hXÕThe first step in read verification is checking the magic number and determining whether CRC validation is necessary. If it is, the CRC32c is calculated and compared against the value stored in the object itself. Once this is validated, further checks are made against the location information, followed by extensive object specific metadata validation. If any of these checks fail, then the buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.”h]”hXÕThe first step in read verification is checking the magic number and determining whether CRC validation is necessary. If it is, the CRC32c is calculated and compared against the value stored in the object itself. Once this is validated, further checks are made against the location information, followed by extensive object specific metadata validation. 
If any of these checks fail, then the buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.”…””}”(hj~hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K¯hjùhžhubhé)”}”(hX±Write verification is the opposite of the read verification - first the object is extensively verified and if it is OK we then update the LSN from the last modification made to the object. After this, we calculate the CRC and insert it into the object. Once this is done the write IO is allowed to continue. If any error occurs during this process, the buffer is again marked with an EFSCORRUPTED error for the higher layers to catch.”h]”hX±Write verification is the opposite of the read verification - first the object is extensively verified and if it is OK we then update the LSN from the last modification made to the object. After this, we calculate the CRC and insert it into the object. Once this is done the write IO is allowed to continue. If any error occurs during this process, the buffer is again marked with an EFSCORRUPTED error for the higher layers to catch.”…””}”(hjŒhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h K¶hjùhžhubeh}”(h]”Œruntime-validation”ah ]”h"]”Œruntime validation”ah$]”h&]”uh1hÂhhÄhžhhŸh³h K•ubhÃ)”}”(hhh]”(hÈ)”}”(hŒ Structures”h]”hŒ Structures”…””}”(hj¥hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hÇhj¢hžhhŸh³h K¾ubhé)”}”(hŒHA typical on-disk structure needs to contain the following information::”h]”hŒGA typical on-disk structure needs to contain the following information:”…””}”(hj³hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h KÀhj¢hžhubhŒ literal_block”“”)”}”(hXstruct xfs_ondisk_hdr { __be32 magic; /* magic number */ __be32 crc; /* CRC, not logged */ uuid_t uuid; /* filesystem identifier */ __be64 owner; /* parent object */ __be64 blkno; /* location on disk */ __be64 lsn; /* last modification in log, not logged */ };”h]”hXstruct xfs_ondisk_hdr { __be32 magic; /* magic number */ __be32 crc; /* CRC, not logged */ uuid_t uuid; /* filesystem identifier */ __be64 owner; /* 
parent object */ __be64 blkno; /* location on disk */ __be64 lsn; /* last modification in log, not logged */ };”…””}”hjÃsbah}”(h]”h ]”h"]”h$]”h&]”h±h²uh1jÁhŸh³h KÂhj¢hžhubhé)”}”(hXDepending on the metadata, this information may be part of a header structure separate to the metadata contents, or may be distributed through an existing structure. The latter occurs with metadata that already contains some of this information, such as the superblock and AG headers.”h]”hXDepending on the metadata, this information may be part of a header structure separate to the metadata contents, or may be distributed through an existing structure. The latter occurs with metadata that already contains some of this information, such as the superblock and AG headers.”…””}”(hjÑhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h KËhj¢hžhubhé)”}”(hŒ„Other metadata may have different formats for the information, but the same level of information is generally provided. For example:”h]”hŒ„Other metadata may have different formats for the information, but the same level of information is generally provided. For example:”…””}”(hjßhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h KÐhj¢hžhubj)”}”(hXÕ- short btree blocks have a 32 bit owner (ag number) and a 32 bit block number for location. The two of these combined provide the same information as @owner and @blkno in the above structure, but using 8 bytes less space on disk. - directory/attribute node blocks have a 16 bit magic number, and the header that contains the magic number has other information in it as well. Hence the additional metadata headers change the overall format of the metadata. ”h]”j)”}”(hhh]”(j$)”}”(hŒäshort btree blocks have a 32 bit owner (ag number) and a 32 bit block number for location. The two of these combined provide the same information as @owner and @blkno in the above structure, but using 8 bytes less space on disk. ”h]”hé)”}”(hŒãshort btree blocks have a 32 bit owner (ag number) and a 32 bit block number for location. 
The two of these combined provide the same information as @owner and @blkno in the above structure, but using 8 bytes less space on disk.”h]”hŒãshort btree blocks have a 32 bit owner (ag number) and a 32 bit block number for location. The two of these combined provide the same information as @owner and @blkno in the above structure, but using 8 bytes less space on disk.”…””}”(hjøhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h KÓhjôubah}”(h]”h ]”h"]”h$]”h&]”uh1j#hjñubj$)”}”(hŒàdirectory/attribute node blocks have a 16 bit magic number, and the header that contains the magic number has other information in it as well. Hence the additional metadata headers change the overall format of the metadata. ”h]”hé)”}”(hŒßdirectory/attribute node blocks have a 16 bit magic number, and the header that contains the magic number has other information in it as well. Hence the additional metadata headers change the overall format of the metadata.”h]”hŒßdirectory/attribute node blocks have a 16 bit magic number, and the header that contains the magic number has other information in it as well. 
Hence the additional metadata headers change the overall format of the metadata.”…””}”(hjhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h KØhj ubah}”(h]”h ]”h"]”h$]”h&]”uh1j#hjñubeh}”(h]”h ]”h"]”h$]”h&]”jZj[uh1jhŸh³h KÓhjíubah}”(h]”h ]”h"]”h$]”h&]”uh1jhŸh³h KÓhj¢hžhubhé)”}”(hŒ9A typical buffer read verifier is structured as follows::”h]”hŒ8A typical buffer read verifier is structured as follows:”…””}”(hj0hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h KÝhj¢hžhubjÂ)”}”(hX#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) static void xfs_foo_read_verify( struct xfs_buf *bp) { struct xfs_mount *mp = bp->b_mount; if ((xfs_sb_version_hascrc(&mp->m_sb) && !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF)) || !xfs_foo_verify(bp)) { XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); xfs_buf_ioerror(bp, EFSCORRUPTED); } }”h]”hX#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) static void xfs_foo_read_verify( struct xfs_buf *bp) { struct xfs_mount *mp = bp->b_mount; if ((xfs_sb_version_hascrc(&mp->m_sb) && !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF)) || !xfs_foo_verify(bp)) { XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); xfs_buf_ioerror(bp, EFSCORRUPTED); } }”…””}”hj>sbah}”(h]”h ]”h"]”h$]”h&]”h±h²uh1jÁhŸh³h Kßhj¢hžhubhé)”}”(hŒàThe code ensures that the CRC is only checked if the filesystem has CRCs enabled by checking the superblock for the feature bit, and then if the CRC verifies OK (or is not needed) it verifies the actual contents of the block.”h]”hŒàThe code ensures that the CRC is only checked if the filesystem has CRCs enabled by checking the superblock for the feature bit, and then if the CRC verifies OK (or is not needed) it verifies the actual contents of the block.”…””}”(hjLhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h Kðhj¢hžhubhé)”}”(hŒÎThe verifier function will take a couple of different forms, depending on whether the magic number can be used to determine the format of 
the block. Where it can't, the code is structured as follows::”h]”hŒÏThe verifier function will take a couple of different forms, depending on whether the magic number can be used to determine the format of the block. Where it can’t, the code is structured as follows:”…””}”(hjZhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h Kôhj¢hžhubjÂ)”}”(hXstatic bool
xfs_foo_verify(
	struct xfs_buf	*bp)
{
	struct xfs_mount	*mp = bp->b_mount;
	struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		return false;

	if (xfs_sb_version_hascrc(&mp->m_sb)) {
		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			return false;
		if (bp->b_bn != be64_to_cpu(hdr->blkno))
			return false;
		if (hdr->owner == 0)
			return false;
	}

	/* object specific verification checks here */

	return true;
}”h]”hXstatic bool
xfs_foo_verify(
	struct xfs_buf	*bp)
{
	struct xfs_mount	*mp = bp->b_mount;
	struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		return false;

	if (xfs_sb_version_hascrc(&mp->m_sb)) {
		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			return false;
		if (bp->b_bn != be64_to_cpu(hdr->blkno))
			return false;
		if (hdr->owner == 0)
			return false;
	}

	/* object specific verification checks here */

	return true;
}”…””}”hjhsbah}”(h]”h ]”h"]”h$]”h&]”h±h²uh1jÁhŸh³h Køhj¢hžhubhé)”}”(hŒ]If there are different magic numbers for the different formats, the verifier will look like::”h]”hŒ\If there are different magic numbers for the different formats, the verifier will look like:”…””}”(hjvhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h Mhj¢hžhubjÂ)”}”(hX¤static bool
xfs_foo_verify(
	struct xfs_buf	*bp)
{
	struct xfs_mount	*mp = bp->b_mount;
	struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			return false;
		if (bp->b_bn != be64_to_cpu(hdr->blkno))
			return false;
		if (hdr->owner == 0)
			return false;
	} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		return false;

	/* object
specific verification checks here */

	return true;
}”h]”hX¤static bool
xfs_foo_verify(
	struct xfs_buf	*bp)
{
	struct xfs_mount	*mp = bp->b_mount;
	struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
		if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			return false;
		if (bp->b_bn != be64_to_cpu(hdr->blkno))
			return false;
		if (hdr->owner == 0)
			return false;
	} else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		return false;

	/* object specific verification checks here */

	return true;
}”…””}”hj„sbah}”(h]”h ]”h"]”h$]”h&]”h±h²uh1jÁhŸh³h Mhj¢hžhubhé)”}”(hŒ“Write verifiers are very similar to the read verifiers; they just do things in the opposite order. A typical write verifier::”h]”hŒ’Write verifiers are very similar to the read verifiers; they just do things in the opposite order. A typical write verifier:”…””}”(hj’hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h M)hj¢hžhubjÂ)”}”(hX©static void
xfs_foo_write_verify(
	struct xfs_buf	*bp)
{
	struct xfs_mount	*mp = bp->b_mount;
	struct xfs_buf_log_item	*bip = bp->b_fspriv;

	if (!xfs_foo_verify(bp)) {
		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW,
				     mp, bp->b_addr);
		xfs_buf_ioerror(bp, EFSCORRUPTED);
		return;
	}

	if (!xfs_sb_version_hascrc(&mp->m_sb))
		return;

	if (bip) {
		struct xfs_ondisk_hdr	*hdr = bp->b_addr;

		hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
	}
	xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
}”h]”hX©static void
xfs_foo_write_verify(
	struct xfs_buf	*bp)
{
	struct xfs_mount	*mp = bp->b_mount;
	struct xfs_buf_log_item	*bip = bp->b_fspriv;

	if (!xfs_foo_verify(bp)) {
		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW,
				     mp, bp->b_addr);
		xfs_buf_ioerror(bp, EFSCORRUPTED);
		return;
	}

	if (!xfs_sb_version_hascrc(&mp->m_sb))
		return;

	if (bip) {
		struct xfs_ondisk_hdr	*hdr = bp->b_addr;

		hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
	}
	xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
}”…””}”hj sbah}”(h]”h 
]”h"]”h$]”h&]”h±h²uh1jÁhŸh³h M,hj¢hžhubhé)”}”(hXbThis will verify the internal structure of the metadata before we go any further, detecting corruptions that have occurred as the metadata has been modified in memory. If the metadata verifies OK, and CRCs are enabled, we then update the LSN field (when it was last modified) and calculate the CRC on the metadata. Once this is done, we can issue the IO.”h]”hXbThis will verify the internal structure of the metadata before we go any further, detecting corruptions that have occurred as the metadata has been modified in memory. If the metadata verifies OK, and CRCs are enabled, we then update the LSN field (when it was last modified) and calculate the CRC on the metadata. Once this is done, we can issue the IO.”…””}”(hj®hžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h MDhj¢hžhubeh}”(h]”Œ structures”ah ]”h"]”Œ structures”ah$]”h&]”uh1hÂhhÄhžhhŸh³h K¾ubhÃ)”}”(hhh]”(hÈ)”}”(hŒInodes and Dquots”h]”hŒInodes and Dquots”…””}”(hjÇhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hÇhjÄhžhhŸh³h MKubhé)”}”(hXInodes and dquots are special snowflakes. They have per-object CRC and self-identifiers, but they are packed so that there are multiple objects per buffer. Hence we do not use per-buffer verifiers to do the work of per-object verification and CRC calculations. The per-buffer verifiers simply perform basic identification of the buffer - that they contain inodes or dquots, and that there are magic numbers in all the expected spots. All further CRC and verification checks are done when each inode is read from or written back to the buffer.”h]”hXInodes and dquots are special snowflakes. They have per-object CRC and self-identifiers, but they are packed so that there are multiple objects per buffer. Hence we do not use per-buffer verifiers to do the work of per-object verification and CRC calculations. 
The per-buffer verifiers simply perform basic identification of the buffer - that they contain inodes or dquots, and that there are magic numbers in all the expected spots. All further CRC and verification checks are done when each inode is read from or written back to the buffer.”…””}”(hjÕhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h MMhjÄhžhubhé)”}”(hXïThe structure of the verifiers and the identifier checks is very similar to the buffer code described above. The only difference is where they are called. For example, inode read verification is done in xfs_inode_from_disk() when the inode is first read out of the buffer and the struct xfs_inode is instantiated. The inode is already extensively verified during writeback in xfs_iflush_int(), so the only addition here is to add the LSN and CRC to the inode as it is copied back into the buffer.”h]”hXïThe structure of the verifiers and the identifier checks is very similar to the buffer code described above. The only difference is where they are called. For example, inode read verification is done in xfs_inode_from_disk() when the inode is first read out of the buffer and the struct xfs_inode is instantiated. The inode is already extensively verified during writeback in xfs_iflush_int(), so the only addition here is to add the LSN and CRC to the inode as it is copied back into the buffer.”…””}”(hjãhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h MVhjÄhžhubhé)”}”(hX4XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of the unlinked list modifications check or update CRCs, neither during unlink nor log recovery. So, it's gone unnoticed until now. This won't matter immediately - repair will probably complain about it - but it needs to be fixed.”h]”hX:XXX: inode unlinked list modification doesn’t recalculate the inode CRC! None of the unlinked list modifications check or update CRCs, neither during unlink nor log recovery. So, it’s gone unnoticed until now.
This won’t matter immediately - repair will probably complain about it - but it needs to be fixed.”…””}”(hjñhžhhŸNh Nubah}”(h]”h ]”h"]”h$]”h&]”uh1hèhŸh³h M^hjÄhžhubeh}”(h]”Œinodes-and-dquots”ah ]”h"]”Œinodes and dquots”ah$]”h&]”uh1hÂhhÄhžhhŸh³h MKubeh}”(h]”(hÁŒid1”eh ]”h"]”(Œxfs self describing metadata”Œxfs_self_describing_metadata”eh$]”h&]”uh1hÂhhhžhhŸh³h KŒexpect_referenced_by_name”}”j h¶sŒexpect_referenced_by_id”}”hÁh¶subeh}”(h]”h ]”h"]”h$]”h&]”Œsource”h³uh1hŒcurrent_source”NŒ current_line”NŒsettings”Œdocutils.frontend”ŒValues”“”)”}”(hÇNŒ generator”NŒ datestamp”NŒ source_link”NŒ source_url”NŒ toc_backlinks”Œentry”Œfootnote_backlinks”KŒ sectnum_xform”KŒstrip_comments”NŒstrip_elements_with_classes”NŒ strip_classes”NŒ report_level”KŒ halt_level”KŒexit_status_level”KŒdebug”NŒwarning_stream”NŒ traceback”ˆŒinput_encoding”Œ utf-8-sig”Œinput_encoding_error_handler”Œstrict”Œoutput_encoding”Œutf-8”Œoutput_encoding_error_handler”j7Œerror_encoding”Œutf-8”Œerror_encoding_error_handler”Œbackslashreplace”Œ language_code”Œen”Œrecord_dependencies”NŒconfig”NŒ id_prefix”hŒauto_id_prefix”Œid”Œ dump_settings”NŒdump_internals”NŒdump_transforms”NŒdump_pseudo_xml”NŒexpose_internals”NŒstrict_visitor”NŒ_disable_config”NŒ_source”h³Œ _destination”NŒ _config_files”]”Œ7/var/lib/git/docbuild/linux/Documentation/docutils.conf”aŒfile_insertion_enabled”ˆŒ raw_enabled”KŒline_length_limit”M'Œpep_references”NŒ pep_base_url”Œhttps://peps.python.org/”Œpep_file_url_template”Œpep-%04d”Œrfc_references”NŒ rfc_base_url”Œ&https://datatracker.ietf.org/doc/html/”Œ tab_width”KŒtrim_footnote_reference_space”‰Œsyntax_highlight”Œlong”Œ smart_quotes”ˆŒsmartquotes_locales”]”Œcharacter_level_inline_markup”‰Œdoctitle_xform”‰Œ docinfo_xform”KŒsectsubtitle_xform”‰Œ image_loading”Œlink”Œembed_stylesheet”‰Œcloak_email_addresses”ˆŒsection_self_link”‰Œenv”NubŒreporter”NŒindirect_targets”]”Œsubstitution_defs”}”Œsubstitution_names”}”Œrefnames”}”Œrefids”}”hÁ]”h¶asŒnameids”}”(j hÁj j j'j$jöjójŸjœjÁj¾jjuŒ nametypes”}”(j ˆj 
‰j'‰jö‰jŸ‰jÁ‰j‰uh}”(hÁhÄj hÄj$h×jój*jœjùj¾j¢jjÄuŒ footnote_refs”}”Œ citation_refs”}”Œ autofootnotes”]”Œautofootnote_refs”]”Œsymbol_footnotes”]”Œsymbol_footnote_refs”]”Œ footnotes”]”Œ citations”]”Œautofootnote_start”KŒsymbol_footnote_start”KŒ id_counter”Œ collections”ŒCounter”“”}”jEKs…”R”Œparse_messages”]”Œtransform_messages”]”hŒsystem_message”“”)”}”(hhh]”hé)”}”(hhh]”hŒBHyperlink target "xfs-self-describing-metadata" is not referenced.”…””}”hj¡sbah}”(h]”h ]”h"]”h$]”h&]”uh1hèhjžubah}”(h]”h ]”h"]”h$]”h&]”Œlevel”KŒtype”ŒINFO”Œsource”h³Œline”Kuh1jœubaŒ transformer”NŒ include_log”]”Œ decoration”Nhžhub.