author Darrick J. Wong <djwong@kernel.org> 2021-03-25 19:47:00 -0700
committer Darrick J. Wong <djwong@kernel.org> 2023-12-31 09:33:41 -0800
commit 49fce9723e080094789564be00371270a0914974
tree 1d079728364724fd80e96fc2a602d44e2a710319
parent 50bb034a87cee6efeb05256642a959f943c0a5f2
design: document changes for the realtime refcount btree (refs: realtime-reflink_2023-12-31, realtime-reflink)
Update the ondisk format documentation to reflect the realtime refcount btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-rw-r--r-- design/XFS_Filesystem_Structure/internal_inodes.asciidoc 5
-rw-r--r-- design/XFS_Filesystem_Structure/journaling_log.asciidoc 9
-rw-r--r-- design/XFS_Filesystem_Structure/magic.asciidoc 1
-rw-r--r-- design/XFS_Filesystem_Structure/realtime.asciidoc 5
-rw-r--r-- design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc 173
5 files changed, 190 insertions, 3 deletions
diff --git a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
index 7da0cdf..0fc758c 100644
--- a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
+++ b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
@@ -31,6 +31,7 @@ of those inodes have been deallocated and may be reused by future features.
| xref:Real-Time_Bitmap_Inode[Realtime Bitmap] | /realtime/bitmap
| xref:Real-Time_Summary_Inode[Realtime Summary] | /realtime/summary
| xref:Real_time_Reverse_Mapping_Btree[Realtime Reverse Mapping B+tree] | /realtime/*.rmap
+| xref:Real_time_Refcount_Btree[Realtime Reference Count B+tree] | /realtime/*.refcount
|=====
Metadata files are flagged by the +XFS_DIFLAG2_METADATA+ flag in the
@@ -325,4 +326,6 @@ xref:Real-Time_Bitmap_Inode[Bitmap Inode] and the
xref:Real-Time_Summary_Inode[Summary Inode].
Each realtime group can allocate one inode to managing a
-xref:Real_time_Reverse_Mapping_Btree[reverse-index of space] usage.
+xref:Real_time_Reverse_Mapping_Btree[reverse-index of space] usage, and
+a second one to manage xref:Real_time_Refcount_Btree[reference counts] of space
+usage.
diff --git a/design/XFS_Filesystem_Structure/journaling_log.asciidoc b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
index 52513b1..2c7b738 100644
--- a/design/XFS_Filesystem_Structure/journaling_log.asciidoc
+++ b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
@@ -569,6 +569,15 @@ reverse mapping operation we want. The upper three bytes are flag bits.
| +XFS_REFCOUNT_EXTENT_FREE_COW+ | Unreserve an extent for staging copy on write.
|=====
+.Reference count update log intent flags
+[options="header"]
+|=====
+| Value | Description
+| +XFS_REFCOUNT_EXTENT_REALTIME+ | Extent describes a range of blocks on the
+realtime volume. The range must be aligned to the realtime extent size,
+because extents cannot be partially shared.
+|=====
+
The ``reference count update intent'' operation comes first; it tells the log
that XFS wants to update some reference counts. This record is crucial for
correct log recovery because it enables us to spread a complex metadata update
diff --git a/design/XFS_Filesystem_Structure/magic.asciidoc b/design/XFS_Filesystem_Structure/magic.asciidoc
index c83f59a..19ee52b 100644
--- a/design/XFS_Filesystem_Structure/magic.asciidoc
+++ b/design/XFS_Filesystem_Structure/magic.asciidoc
@@ -49,6 +49,7 @@ relevant chapters. Magic numbers tend to have consistent locations:
| +XFS_REFC_CRC_MAGIC+ | 0x52334643 | R3FC | xref:Reference_Count_Btree[Reference Count B+tree], v5 only
| +XFS_MD_MAGIC+ | 0x5846534d | XFSM | xref:Metadata_Dumps[Metadata Dumps]
| +XFS_RTSB_MAGIC+ | 0x58524750 | XRGP | xref:Realtime_Groups[Realtime Groups]
+| +XFS_RTREFC_CRC_MAGIC+ | 0x52434e54 | RCNT | xref:Real_time_Refcount_Btree[Real-Time Reference Count B+tree], v5 only
|=====
The magic numbers for log items are at offset zero in each log item, but items
diff --git a/design/XFS_Filesystem_Structure/realtime.asciidoc b/design/XFS_Filesystem_Structure/realtime.asciidoc
index 77d947b..5d0b47a 100644
--- a/design/XFS_Filesystem_Structure/realtime.asciidoc
+++ b/design/XFS_Filesystem_Structure/realtime.asciidoc
@@ -14,8 +14,7 @@ By placing the real time device (and the journal) on separate high-performance
storage devices, it is possible to reduce most of the unpredictability in I/O
response times that come from metadata operations.
-None of the XFS per-AG B+trees are involved with real time files. It is not
-possible for real time files to share data blocks.
+None of the XFS per-AG B+trees are involved with real time files.
[[Real-Time_Bitmap_Inode]]
== Free Space Bitmap Inode
@@ -233,3 +232,5 @@ meta_uuid = c52adb8a-48a6-4325-b251-d4dcb30889ea
----
include::rtrmapbt.asciidoc[]
+
+include::rtrefcountbt.asciidoc[]
diff --git a/design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc b/design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc
new file mode 100644
index 0000000..617badb
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc
@@ -0,0 +1,173 @@
+[[Real_time_Refcount_Btree]]
+=== Reference Count B+tree
+
+If the reflink and real-time storage device features are enabled, each
+real-time group has its own reference count B+tree.
+
+As mentioned in the chapter about xref:Reflink_Deduplication[sharing data
+blocks], this data structure is necessary to track how many times each extent
+in the realtime volume has been mapped. This is how the copy-on-write code
+determines what to do when a realtime file is written.
+
+This B+tree is only present if the +XFS_SB_FEAT_RO_COMPAT_REFLINK+ feature is
+enabled and a real time device is present. The feature requires a version 5
+filesystem.
+
+The rtgroup reference count B+tree is rooted in an inode's data fork; the inode
+number can be found by resolving the path +/realtime/$rgno.refcount+ in the
+metadata directory tree. The B+tree blocks themselves are stored
+in the regular filesystem. The structures used for an inode's B+tree root are:
+
+[source, c]
+----
+struct xfs_rtrefcount_root {
+ __be16 bb_level;
+ __be16 bb_numrecs;
+};
+----
+
+* If the B+tree contains only a single level, the ondisk data fork area begins
+with a +xfs_rtrefcount_root+ header followed by an array of +xfs_refcount_rec+
+leaf records.
+
+* Otherwise, the ondisk data fork area begins with the +xfs_rtrefcount_root+
+header and is followed first by an array of +xfs_refcount_key+ values and then
+an array of +xfs_rtrefcount_ptr_t+ values. The size of both arrays is
+specified by the header's +bb_numrecs+ value.
+
+* The root node in the inode can only contain up to 28 leaf records or
+key/pointer pairs for a standard 512 byte inode before a new level of nodes is
+added between the root and the leaves. +di_forkoff+ should always be zero,
+because there are no extended attributes.
+
+Each record in an rtgroup reference count B+tree has the same structure as a
+record in an AG reference count B+tree:
+
+[source, c]
+----
+struct xfs_refcount_rec {
+ __be32 rc_startblock;
+ __be32 rc_blockcount;
+ __be32 rc_refcount;
+};
+----
+
+*rc_startblock*::
+rtgroup block number of this record. Note that reference count records are
+tracked in units of realtime blocks, not realtime extents.
+However, records must be aligned to the realtime extent size in accordance with
+the existing realtime extent handling strategy. The high bit
+(+XFS_REFC_COW_FLAG+) is set for all records referring to an extent that is
+being used to stage a copy on write operation. This reduces recovery time
+during mount operations. The reference count of these staging extents must
+be 1.
+
+*rc_blockcount*::
+The length of this extent, in filesystem blocks.
+
+*rc_refcount*::
+Number of times this extent has been shared.
+
+The key has the following structure:
+
+[source, c]
+----
+struct xfs_refcount_key {
+ __be32 rc_startblock;
+};
+----
+
+* All block numbers are 32-bit rtgroup device block numbers, though the
+key should be aligned to the realtime extent size.
+
+* The +bb_magic+ value is ``RCNT'' (0x52434e54).
+
+* The +struct xfs_btree_lblock+ header is used for intermediate B+tree nodes as
+well as the leaves.
+
+==== xfs_db rtrefcountbt Example
+
+This example shows a real-time reference count B+tree from a freshly
+populated filesystem. One directory tree has been reflinked:
+
+----
+xfs_db> path -m /realtime/0.refcount
+xfs_db> p
+core.magic = 0x494e
+core.mode = 0100000
+core.version = 3
+core.format = 6 (refcount)
+...
+v3.inumber = 134
+v3.uuid = 23d157a4-8ca7-4fca-8782-637dc6746105
+v3.reflink = 0
+v3.cowextsz = 0
+v3.dax = 0
+v3.bigtime = 1
+v3.nrext64 = 1
+v3.metadata = 1
+u3.rtrefcbt.level = 1
+u3.rtrefcbt.numrecs = 2
+u3.rtrefcbt.keys[1-2] = [startblock,cowflag]
+1:[4,0]
+2:[344,0]
+u3.rtrefcbt.ptrs[1-2] = 1:8 2:9
+----
+
+Notice that this is a two-level refcount btree; we must continue towards the
+leaf level.
+
+----
+xfs_db> addr u3.rtrefcbt.ptrs[2]
+xfs_db> p
+magic = 0x52434e54
+level = 0
+numrecs = 170
+leftsib = 8
+rightsib = null
+bno = 72
+lsn = 0
+uuid = 23d157a4-8ca7-4fca-8782-637dc6746105
+owner = 134
+crc = 0x21e04c3 (correct)
+recs[1-170] = [startblock,blockcount,refcount,cowflag]
+1:[344,1,2,0]
+2:[346,1,2,0]
+3:[348,1,2,0]
+4:[350,1,2,0]
+5:[352,1,2,0]
+6:[354,1,2,0]
+...
+----
+
+This indicates that realtime block 354 is shared. Let's use the realtime
+reverse mapping information to find which files are sharing these blocks:
+
+----
+xfs_db> fsmap -r 354 354
+0: 0/1 len 682 owner 10015 offset 0 bmbt 0 attrfork 0 extflag 0
+1: 0/354 len 1 owner 10014 offset 353 bmbt 0 attrfork 0 extflag 0
+----
+
+It looks as though inodes 10,014 and 10,015 share this block. Let us confirm
+this by navigating to those inodes and dumping the data fork mappings:
+
+----
+xfs_db> inode 10015
+xfs_db> p core.realtime
+core.realtime = 1
+xfs_db> bmap
+data offset 0 startblock 1 (0/1) count 682 flag 0
+xfs_db> inode 10014
+xfs_db> p core.realtime
+core.realtime = 1
+xfs_db> bmap 350 10
+data offset 351 startblock 352 (0/352) count 1 flag 0
+data offset 353 startblock 354 (0/354) count 1 flag 0
+data offset 355 startblock 356 (0/356) count 1 flag 0
+data offset 357 startblock 358 (0/358) count 1 flag 0
+data offset 359 startblock 360 (0/360) count 1 flag 0
+----
+
+Notice that both inodes have their realtime flags set, and both of them map
+a data fork extent to the same realtime block 354.
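
The record layout documented in the new rtrefcountbt.asciidoc can be exercised with a small decoder. The following is a minimal sketch, not the kernel or xfsprogs implementation: `REFC_COW_FLAG`, `struct refc_irec`, `get_be32`, `decode_refc_rec` and `refc_rec_ok` are hypothetical names introduced only for illustration. It assumes that the copy-on-write staging flag is the high bit of `rc_startblock` as the text describes, and it reads "aligned to the realtime extent size" as applying to both the start block and the length, which the documentation does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the COW staging flag is the high bit of rc_startblock. */
#define REFC_COW_FLAG (UINT32_C(1) << 31)

/* Hypothetical in-memory form of one decoded refcount record. */
struct refc_irec {
	uint32_t startblock;	/* rtgroup block number, flag bit removed */
	uint32_t blockcount;	/* extent length in blocks */
	uint32_t refcount;	/* reference count from the record */
	int	 is_cow;	/* nonzero if this is a CoW staging extent */
};

/* Read a big-endian 32-bit value from an unaligned byte buffer. */
uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode one 12-byte ondisk record (three __be32 fields). */
struct refc_irec decode_refc_rec(const unsigned char *raw)
{
	struct refc_irec r;
	uint32_t start;

	assert(raw != NULL);
	start = get_be32(raw);
	r.is_cow = (start & REFC_COW_FLAG) != 0;
	r.startblock = start & ~REFC_COW_FLAG;
	r.blockcount = get_be32(raw + 4);
	r.refcount = get_be32(raw + 8);
	return r;
}

/*
 * Sanity-check a decoded record against the documented rules:
 * staging extents must have a reference count of 1, and records
 * must be aligned to the realtime extent size (rextsize, in blocks).
 * Returns 1 if the record looks plausible, 0 otherwise.
 */
int refc_rec_ok(const struct refc_irec *r, uint32_t rextsize)
{
	if (rextsize == 0 || r->blockcount == 0)
		return 0;
	if (r->is_cow && r->refcount != 1)
		return 0;
	if (r->startblock % rextsize != 0 || r->blockcount % rextsize != 0)
		return 0;
	return 1;
}
```

Feeding the bytes `00 00 01 62 00 00 00 01 00 00 00 02` to `decode_refc_rec` yields start block 354, length 1, reference count 2 and no staging flag, which matches record 6 in the xfs_db leaf dump above. Applying `get_be32` to the four ASCII bytes of "RCNT" gives 0x52434e54, the value listed for +XFS_RTREFC_CRC_MAGIC+ in magic.asciidoc.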