kernel/git/djwong/xfs-linux.git - Darrick J. Wong : XFS Kernel Dev.

Age	Commit message (Collapse)	Author	Files	Lines
2024-05-03	xfs: simplify iext overflow checking and upgradeHEAD korg/for-next_2024-05-21 master	Christoph Hellwig	10	-87/+41
	Currently the calls to xfs_iext_count_may_overflow and xfs_iext_count_upgrade are always paired. Merge them into a single function to simplify the callers and the actual check and upgrade logic itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03	xfs: remove a racy if_bytes check in xfs_reflink_end_cow_extent	Christoph Hellwig	1	-6/+0
	Accessing if_bytes without the ilock is racy. Remove the initial if_bytes == 0 check in xfs_reflink_end_cow_extent and let ext_iext_lookup_extent fail for this case after we've taken the ilock. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03	xfs: upgrade the extent counters in xfs_reflink_end_cow_extent later	Christoph Hellwig	1	-8/+8
	Defer the extent counter size upgrade until we know we're going to modify the extent mapping. This also defers dirtying the transaction and will allow us safely back out later in the function in later changes. Fixes: 4f86bb4b66c9 ("xfs: Conditionally upgrade existing inodes to use large extent counters") Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03	xfs: xfs_quota_unreserve_blkres can't fail	Christoph Hellwig	8	-37/+20
	Unreserving quotas can't fail due to quota limits, and we'll notice a shut down file system a bit later in all the callers anyway. Return void and remove the error checking and propagation in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03	xfs: consolidate the xfs_quota_reserve_blkres definitions	Christoph Hellwig	1	-12/+6
	xfs_trans_reserve_quota_nblks is already stubbed out if quota support is disabled, no need for an extra xfs_quota_reserve_blkres stub. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03	xfs: clean up buffer allocation in xlog_do_recovery_pass	Christoph Hellwig	1	-7/+6
	Merge the initial xlog_alloc_buffer calls, and pass the variable designating the length that is initialized to 1 above instead of passing the open coded 1 directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03	xfs: fix log recovery buffer allocation for the legacy h_size fixup	Christoph Hellwig	1	-6/+14
	Commit a70f9fe52daa ("xfs: detect and handle invalid iclog size set by mkfs") added a fixup for incorrect h_size values used for the initial umount record in old xfsprogs versions. Later commit 0c771b99d6c9 ("xfs: clean up calculation of LR header blocks") cleaned up the log reover buffer calculation, but stoped using the fixed up h_size value to size the log recovery buffer, which can lead to an out of bounds access when the incorrect h_size does not come from the old mkfs tool, but a fuzzer. Fix this by open coding xlog_logrec_hblks and taking the fixed h_size into account for this calculation. Fixes: 0c771b99d6c9 ("xfs: clean up calculation of LR header blocks") Reported-by: Sam Sun <samsun1006219@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-03	Merge tag 'xfs-cleanups-6.10_2024-05-02' of ↵	Chandan Babu R	7	-55/+69
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeF xfs: last round of cleanups for 6.10 Here are the reviewed cleanups at the head of the fsverity series. Apparently there's other work that could use some of these things, so let's try to get it in for 6.10. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'xfs-cleanups-6.10_2024-05-02' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: widen flags argument to the xfs_iflags_* helpers xfs: minor cleanups of xfs_attr3_rmt_blocks xfs: create a helper to compute the blockcount of a max sized remote value xfs: turn XFS_ATTR3_RMT_BUF_SPACE into a function xfs: use unsigned ints for non-negative quantities in xfs_attr_remote.c
2024-05-02	xfs: widen flags argument to the xfs_iflags_* helpersxfs-cleanups-6.10_2024-05-02 xfs-cleanups-6.10	Darrick J. Wong	2	-10/+8
	xfs_inode.i_flags is an unsigned long, so make these helpers take that as the flags argument instead of unsigned short. This is needed for the next patch. While we're at it, remove the iflags variable from xfs_iget_cache_miss because we no longer need it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
2024-05-02	xfs: minor cleanups of xfs_attr3_rmt_blocks	Darrick J. Wong	1	-8/+8
	Clean up the type signature of this function since we don't have negative attr lengths or block counts. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-05-02	xfs: create a helper to compute the blockcount of a max sized remote value	Darrick J. Wong	3	-3/+9
	Create a helper function to compute the number of fsblocks needed to store a maximally-sized extended attribute value. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-05-02	xfs: turn XFS_ATTR3_RMT_BUF_SPACE into a function	Darrick J. Wong	2	-6/+17
	Turn this into a properly typechecked function, and actually use the correct blocksize for extended attributes. The function cannot be static inline because xfsprogs userspace uses it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-05-02	xfs: use unsigned ints for non-negative quantities in xfs_attr_remote.c	Darrick J. Wong	2	-32/+31
	In the next few patches we're going to refactor the attr remote code so that we can support headerless remote xattr values for storing merkle tree blocks. For now, let's change the code to use unsigned int to describe quantities of bytes and blocks that cannot be negative. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-30	xfs: do not allocate the entire delalloc extent in xfs_bmapi_writekorg/for-next_2024-05-02	Christoph Hellwig	1	-2/+3
	While trying to convert the entire delalloc extent is a good decision for regular writeback as it leads to larger contigous on-disk extents, but for other callers of xfs_bmapi_write is is rather questionable as it forced them to loop creating new transactions just in case there is no large enough contiguous extent to cover the whole delalloc reservation. Change xfs_bmapi_write to only allocate the passed in range instead, whіle the writeback path through xfs_bmapi_convert_delalloc and xfs_bmapi_allocate still always converts the full extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-30	xfs: fix xfs_bmap_add_extent_delay_real for partial conversions	Christoph Hellwig	1	-5/+10
	xfs_bmap_add_extent_delay_real takes parts or all of a delalloc extent and converts them to a real extent. It is written to deal with any potential overlap of the to be converted range with the delalloc extent, but it turns out that currently only converting the entire extents, or a part starting at the beginning is actually exercised, as the only caller always tries to convert the entire delalloc extent, and either succeeds or at least progresses partially from the start. If it only converts a tiny part of a delalloc extent, the indirect block calculation for the new delalloc extent (da_new) might be equivalent to that of the existing delalloc extent (da_old). If this extent conversion now requires allocating an indirect block that gets accounted into da_new, leading to the assert that da_new must be smaller or equal to da_new unless we split the extent to trigger. Except for the assert that case is actually handled by just trying to allocate more space, as that already handled for the split case (which currently can't be reached at all), so just reusing it should be fine. Except that without dipping into the reserved block pool that would make it a bit too easy to trigger a fs shutdown due to ENOSPC. So in addition to adjusting the assert, also dip into the reserved block pool. Note that I could only reproduce the assert with a change to only convert the actually asked range instead of the full delalloc extent from xfs_bmapi_write. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-30	xfs: remove the xfs_iext_peek_prev_extent call in xfs_bmapi_allocate	Christoph Hellwig	1	-5/+0
	Both callers of xfs_bmapi_allocate already initialize bma->prev, don't redo that in xfs_bmapi_allocate. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-30	xfs: pass the actual offset and len to allocate to xfs_bmapi_allocate	Christoph Hellwig	1	-14/+18
	xfs_bmapi_allocate currently overwrites offset and len when converting delayed allocations, and duplicates the length cap done for non-delalloc allocations. Move all that logic into the callers to avoid duplication and to make the calling conventions more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-30	xfs: don't open code XFS_FILBLKS_MIN in xfs_bmapi_write	Christoph Hellwig	1	-6/+3
	XFS_FILBLKS_MIN uses min_t and thus does the comparison using the correct xfs_filblks_t type. Use it in xfs_bmapi_write and slightly adjust the comment document th potential pitfall to take account of this Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-30	xfs: lift a xfs_valid_startblock into xfs_bmapi_allocate	Christoph Hellwig	1	-6/+5
	xfs_bmapi_convert_delalloc has a xfs_valid_startblock check on the block allocated by xfs_bmapi_allocate. Lift it into xfs_bmapi_allocate as we should assert the same for xfs_bmapi_write. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-30	xfs: remove the unusued tmp_logflags variable in xfs_bmapi_allocate	Christoph Hellwig	1	-3/+0
	tmp_logflags is initialized to 0 and then ORed into bma->logflags, which isn't actually doing anything. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-30	xfs: fix error returns from xfs_bmapi_write	Christoph Hellwig	10	-74/+57
	xfs_bmapi_write can return 0 without actually returning a mapping in mval in two different cases: 1) when there is absolutely no space available to do an allocation 2) when converting delalloc space, and the allocation is so small that it only covers parts of the delalloc extent before the range requested by the caller Callers at best can handle one of these cases, but in many cases can't cope with either one. Switch xfs_bmapi_write to always return a mapping or return an error code instead. For case 1) above ENOSPC is the obvious choice which is very much what the callers expect anyway. For case 2) there is no really good error code, so pick a funky one from the SysV streams portfolio. This fixes the reproducer here: https://lore.kernel.org/linux-xfs/CAEJPjCvT3Uag-pMTYuigEjWZHn1sGMZ0GCjVVCv29tNHK76Cgg@mail.gmail.com0/ which uses reserved blocks to create file systems that are gravely out of space and thus cause at least xfs_file_alloc_space to hang and trigger the lack of ENOSPC handling in xfs_dquot_disk_alloc. Note that this patch does not actually make any caller but xfs_alloc_file_space deal intelligently with case 2) above. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: 刘通 <lyutoon@gmail.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-29	xfs: convert delayed extents to unwritten when zeroing post eof blocks	Zhang Yi	1	-0/+29
	Current clone operation could be non-atomic if the destination of a file is beyond EOF, user could get a file with corrupted (zeroed) data on crash. The problem is about preallocations. If you write some data into a file: [A...B) and XFS decides to preallocate some post-eof blocks, then it can create a delayed allocation reservation: [A.........D) The writeback path tries to convert delayed extents to real ones by allocating blocks. If there aren't enough contiguous free space, we can end up with two extents, the first real and the second still delalloc: [A....C)[C.D) After that, both the in-memory and the on-disk file sizes are still B. If we clone into the range [E...F) from another file: [A....C)[C.D) [E...F) then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc extent, its pagecache will be zeroed and both the in-memory and on-disk size will be updated to D after flushing but before cloning. This is wrong, because the user can see the size change and read the zeroes while the clone operation is ongoing. We need to keep the in-memory and on-disk size before the clone operation starts, so instead of writing zeroes through the page cache for delayed ranges beyond EOF, we convert these ranges to unwritten and invalidate any cached data over that range beyond EOF. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-29	xfs: make xfs_bmapi_convert_delalloc() to allocate the target offset	Zhang Yi	2	-42/+46
	Since xfs_bmapi_convert_delalloc() only attempts to allocate the entire delalloc extent and require multiple invocations to allocate the target offset. So xfs_convert_blocks() add a loop to do this job and we call it in the write back path, but xfs_convert_blocks() isn't a common helper. Let's do it in xfs_bmapi_convert_delalloc() and drop xfs_convert_blocks(), preparing for the post EOF delalloc blocks converting in the buffered write begin path. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-29	xfs: make the seq argument to xfs_bmapi_convert_delalloc() optional	Zhang Yi	1	-2/+4
	Allow callers to pass a NULLL seq argument if they don't care about the fork sequence number. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-29	xfs: match lock mode in xfs_buffered_write_iomap_begin()	Zhang Yi	1	-5/+5
	Commit 1aa91d9c9933 ("xfs: Add async buffered write support") replace xfs_ilock(XFS_ILOCK_EXCL) with xfs_ilock_for_iomap() when locking the writing inode, and a new variable lockmode is used to indicate the lock mode. Although the lockmode should always be XFS_ILOCK_EXCL, it's still better to use this variable instead of useing XFS_ILOCK_EXCL directly when unlocking the inode. Fixes: 1aa91d9c9933 ("xfs: Add async buffered write support") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-26	xfs: refactor dir format helpers	Christoph Hellwig	6	-150/+105
	Add a new enum and a xfs_dir2_format helper that returns it to allow the code to switch on the format of a directory in a single operation and switch all helpers of xfs_dir2_isblock and xfs_dir2_isleaf to it. This also removes the explicit xfs_iread_extents call in a few of the call sites given that xfs_bmap_last_offset already takes care of it underneath. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-26	xfs: factor out a xfs_dir_replace_args helper	Christoph Hellwig	3	-41/+28
	Add a helper to switch between the different directory formats for removing a directory entry. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-26	xfs: factor out a xfs_dir_removename_args helper	Christoph Hellwig	3	-42/+27
	Add a helper to switch between the different directory formats for removing a directory entry. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-26	xfs: factor out a xfs_dir_createname_args helper	Christoph Hellwig	3	-43/+30
	Add a helper to switch between the different directory formats for creating a directory entry and to handle the XFS_DA_OP_JUSTCHECK flag based on the passed in ino number field. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-26	xfs: factor out a xfs_dir_lookup_args helper	Christoph Hellwig	3	-60/+43
	Add a helper to switch between the different directory formats for lookup and to handle the -EEXIST return for a successful lookup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-24	xfs: Remove unused function xrep_dir_self_parent	Jiapeng Chong	1	-21/+0
	The function are defined in the dir_repair.c file, but not called elsewhere, so delete the unused function. fs/xfs/scrub/dir_repair.c:186:1: warning: unused function 'xrep_dir_self_parent'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8867 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-24	Merge tag 'repair-fixes-6.10_2024-04-23' of ↵	Chandan Babu R	12	-76/+49
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeC xfs: minor fixes to online repair Here are some miscellaneous bug fixes for the online repair code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-fixes-6.10_2024-04-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: invalidate dentries for a file before moving it to the orphanage xfs: exchange-range for repairs is no longer dynamic xfs: fix iunlock calls in xrep_adoption_trans_alloc xfs: drop the scrub file's iolock when transaction allocation fails
2024-04-24	Merge tag 'reduce-scrub-iget-overhead-6.10_2024-04-23' of ↵	Chandan Babu R	4	-11/+66
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeC xfs: reduce iget overhead in scrub This patchset looks to reduce iget overhead in two ways: First, a previous patch conditionally set DONTCACHE on inodes during xchk_irele on the grounds that we knew better at irele time if an inode should be dropped. Unfortunately, over time that patch morphed into a call to d_mark_dontcache, which resulted in inodes being dropped even if they were referenced by the dcache. This actually caused more recycle overhead than if we'd simply called xfs_iget to set DONTCACHE only on misses. The second patch reduces the cost of untrusted iget for a vectored scrub call by having the scrubv code maintain a separate refcount to the inode so that the cache will always hit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'reduce-scrub-iget-overhead-6.10_2024-04-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: only iget the file once when doing vectored scrub-by-handle xfs: use dontcache for grabbing inodes during scrub
2024-04-24	Merge tag 'vectorized-scrub-6.10_2024-04-23' of ↵	Chandan Babu R	10	-59/+366
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeC xfs: vectorize scrub kernel calls Create a vectorized version of the metadata scrub and repair ioctl, and adapt xfs_scrub to use that. This mitigates the impact of system call overhead on xfs_scrub runtime. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'vectorized-scrub-6.10_2024-04-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: introduce vectored scrub mode xfs: move xfs_ioc_scrub_metadata to scrub.c xfs: reduce the rate of cond_resched calls inside scrub
2024-04-24	Merge tag 'scrub-directory-tree-6.10_2024-04-23' of ↵	Chandan Babu R	21	-4/+2337
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeC xfs: detect and correct directory tree problems Historically, checking the tree-ness of the directory tree structure has not been complete. Cycles of subdirectories break the tree properties, as do subdirectories with multiple parents. It's easy enough for DFS to detect problems as long as one of the participants is reachable from the root, but this technique cannot find unconnected cycles. Directory parent pointers change that, because we can discover all of these problems from a simple walk from a subdirectory towards the root. For each child we start with, if the walk terminates without reaching the root, we know the path is disconnected and ought to be attached to the lost and found. If we find ourselves, we know this is a cycle and can delete an incoming edge. If we find multiple paths to the root, we know to delete an incoming edge. Even better, once we've finished walking paths, we've identified the good ones and know which other path(s) to remove. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'scrub-directory-tree-6.10_2024-04-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: fix corruptions in the directory tree xfs: report directory tree corruption in the health information xfs: invalidate dirloop scrub path data when concurrent updates happen xfs: teach online scrub to find directory tree structure problems
2024-04-24	Merge tag 'repair-pptrs-6.10_2024-04-23' of ↵	Chandan Babu R	25	-123/+2742
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeC xfs: online repair for parent pointers This series implements online repair for directory parent pointer metadata. The checking half is fairly straightforward -- for each outgoing directory link (forward or backwards), grab the inode at the other end, and confirm that there's a corresponding link. If we can't grab an inode or lock it, we'll save that link for a slower loop that cycles all the locks, confirms the continued existence of the link, and rechecks the link if it's actually still there. Repairs are a bit more involved -- for directories, we walk the entire filesystem to rebuild the dirents from parent pointer information. Parent pointer repairs do the same walk but rebuild the pptrs from the dirent information, but with the added twist that it duplicates all the xattrs so that it can use the atomic extent swapping code to commit the repairs atomically. This introduces an added twist to the xattr repair code -- we use dirent hooks to detect a colliding update to the pptr data while we're not holding the ILOCKs; if one is detected, we restart the xattr salvaging process but this time hold all the ILOCKs until the end of the scan. For offline repair, the phase6 directory connectivity scan generates an index of all the expected parent pointers in the filesystem. Then it walks each file and compares the parent pointers attached to that file against the index generated, and resyncs the results as necessary. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-pptrs-6.10_2024-04-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: inode repair should ensure there's an attr fork to store parent pointers xfs: repair link count of nondirectories after rebuilding parent pointers xfs: adapt the orphanage code to handle parent pointers xfs: actually rebuild the parent pointer xattrs xfs: add a per-leaf block callback to xchk_xattr_walk xfs: split xfs_bmap_add_attrfork into two pieces xfs: remove pointless unlocked assertion xfs: implement live updates for parent pointer repairs xfs: repair directory parent pointers by scanning for dirents xfs: replay unlocked parent pointer updates that accrue during xattr repair xfs: implement live updates for directory repairs xfs: repair directories by scanning directory parent pointers xfs: add raw parent pointer apis to support repair xfs: salvage parent pointers when rebuilding xattr structures xfs: make the reserved block permission flag explicit in xfs_attr_set xfs: remove some boilerplate from xfs_attr_set
2024-04-24	Merge tag 'scrub-pptrs-6.10_2024-04-23' of ↵	Chandan Babu R	13	-11/+1307
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeC xfs: scrubbing for parent pointers Teach online fsck to use parent pointers to assist in checking directories, parent pointers, extended attributes, and link counts. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'scrub-pptrs-6.10_2024-04-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: check parent pointer xattrs when scrubbing xfs: walk directory parent pointers to determine backref count xfs: deferred scrub of parent pointers xfs: scrub parent pointers xfs: deferred scrub of dirents xfs: check dirents have parent pointers xfs: revert commit 44af6c7e59b12
2024-04-24	Merge tag 'pptrs-6.10_2024-04-23' of ↵	Chandan Babu R	42	-832/+2807
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeC xfs: Parent Pointers This is the latest parent pointer attributes for xfs. The goal of this patch set is to add a parent pointer attribute to each inode. The attribute name containing the parent inode, generation, and directory offset, while the attribute value contains the file name. This feature will enable future optimizations for online scrub, shrink, nfs handles, verity, or any other feature that could make use of quickly deriving an inodes path from the mount point. Directory parent pointers are stored as namespaced extended attributes of a file. Because parent pointers are an indivisible tuple of (dirent_name, parent_ino, parent_gen) we cannot use the usual attr name lookup functions to find a parent pointer. This is solvable by introducing a new lookup mode that checks both the name and the value of the xattr. Therefore, introduce this new name-value lookup mode that's gated on the XFS_ATTR_PARENT namespace. This requires the introduction of new opcodes for the extended attribute update log intent items, which actually means that parent pointers (itself an INCOMPAT feature) does not depend on the LOGGED_XATTRS log incompat feature bit. To reduce collisions on the dirent names of parent pointers, introduce a new attr hash mode that is the dir2 namehash of the dirent name xor'd with the parent inode number. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'pptrs-6.10_2024-04-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: enable parent pointers xfs: drop compatibility minimum log size computations for reflink xfs: fix unit conversion error in xfs_log_calc_max_attrsetm_res xfs: add a incompat feature bit for parent pointers xfs: don't remove the attr fork when parent pointers are enabled xfs: add parent pointer ioctls xfs: split out handle management helpers a bit xfs: move handle ioctl code to xfs_handle.c xfs: pass the attr value to put_listent when possible xfs: don't return XFS_ATTR_PARENT attributes via listxattr xfs: Add parent pointers to xfs_cross_rename xfs: Add parent pointers to rename xfs: remove parent pointers in unlink xfs: add parent attributes to symlink xfs: add parent attributes to link xfs: parent pointer attribute creation xfs: create a hashname function for parent pointers xfs: extend transaction reservations for parent attributes xfs: add parent pointer validator functions xfs: Expose init_xattrs in xfs_create_tmpfile xfs: record inode generation in xattr update log intent items xfs: create attr log item opcodes and formats for parent pointers xfs: refactor xfs_is_using_logged_xattrs checks in attr item recovery xfs: allow xattr matching on name and value for parent pointers xfs: define parent pointer ondisk extended attribute format xfs: add parent pointer support to attribute code xfs: create a separate hashname function for extended attributes xfs: move xfs_attr_defer_add to xfs_attr_item.c xfs: check the flags earlier in xfs_attr_match xfs: rearrange xfs_attr_match parameters
2024-04-24	Merge tag 'improve-attr-validation-6.10_2024-04-23' of ↵	Chandan Babu R	11	-72/+299
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeC xfs: improve extended attribute validation Prior to introducing parent pointer extended attributes, let's spend some time cleaning up the attr code and strengthening the validation that it performs on attrs coming in from the disk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'improve-attr-validation-6.10_2024-04-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: enforce one namespace per attribute xfs: refactor name/value iovec validation in xlog_recover_attri_commit_pass2 xfs: refactor name/length checks in xfs_attri_validate xfs: use local variables for name and value length in _attri_commit_pass2 xfs: always set args->value in xfs_attri_item_recover xfs: validate recovered name buffers when recovering xattr items xfs: use helpers to extract xattr op from opflags xfs: restructure xfs_attr_complete_op a bit xfs: check shortform attr entry flags specifically xfs: fix missing check for invalid attr flags xfs: check opcode and iovec count match in xlog_recover_attri_commit_pass2 xfs: use an XFS_OPSTATE_ flag for detecting if logged xattrs are available xfs: require XFS_SB_FEAT_INCOMPAT_LOG_XATTRS for attr log intent item recovery xfs: attr fork iext must be loaded before calling xfs_attr_is_leaf
2024-04-24	Merge tag 'shrink-dirattr-args-6.10_2024-04-23' of ↵	Chandan Babu R	11	-65/+80
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeC xfs: shrink struct xfs_da_args Let's clean out some unused flags and fields from struct xfs_da_args. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'shrink-dirattr-args-6.10_2024-04-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: rearrange xfs_da_args a bit to use less space xfs: make attr removal an explicit operation xfs: remove xfs_da_args.attr_flags xfs: remove XFS_DA_OP_NOTIME xfs: remove XFS_DA_OP_REMOVE
2024-04-23	xfs: invalidate dentries for a file before moving it to the orphanagerepair-fixes-6.10_2024-04-23 repair-fixes-6.10	Darrick J. Wong	2	-29/+20
	Invalidate the cached dentries that point to the file that we're moving to lost+found before we actually move it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: exchange-range for repairs is no longer dynamic	Darrick J. Wong	10	-45/+25
	The atomic file exchange-range functionality is now a permanent filesystem feature instead of a dynamic log-incompat feature. It cannot be turned on at runtime, so we no longer need the XCHK_FSGATES flags and whatnot that supported it. Remove the flag and the enable function, and move the xfs_has_exchange_range checks to the start of the repair functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: fix iunlock calls in xrep_adoption_trans_alloc	Darrick J. Wong	1	-1/+1
	If the transaction allocation in xrep_adoption_trans_alloc fails, we should drop only the locks that we took. In this case this is ILOCK_EXCL of both the orphanage and the file being repaired. Dropping any IOLOCK here is incorrect. Found by fuzzing u3.sfdir3.list[1].name = zeroes in xfs/1546. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: only iget the file once when doing vectored scrub-by-handlereduce-scrub-iget-overhead-6.10_2024-04-23 reduce-scrub-iget-overhead-6.10	Darrick J. Wong	1	-0/+45
	If a program wants us to perform a scrub on a file handle and the fd passed to ioctl() is not the file referenced in the handle, iget the file once and pass it into the scrub code. This amortizes the untrusted iget lookup over /all/ the scrubbers mentioned in the scrubv call. When running fstests in "rebuild all metadata after each test" mode, I observed a 10% reduction in runtime on account of avoiding repeated inobt lookups. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: introduce vectored scrub modevectorized-scrub-6.10_2024-04-23 vectorized-scrub-6.10	Darrick J. Wong	5	-1/+264
	Introduce a variant on XFS_SCRUB_METADATA that allows for a vectored mode. The caller specifies the principal metadata object that they want to scrub (allocation group, inode, etc.) once, followed by an array of scrub types they want called on that object. The kernel runs the scrub operations and writes the output flags and errno code to the corresponding array element. A new pseudo scrub type BARRIER is introduced to force the kernel to return to userspace if any corruptions have been found when scrubbing the previous scrub types in the array. This enables userspace to schedule, for example, the sequence: 1. data fork 2. barrier 3. directory If the data fork scrub is clean, then the kernel will perform the directory scrub. If not, the barrier in 2 will exit back to userspace. The alternative would have been an interface where userspace passes a pointer to an empty buffer, and the kernel formats that with xfs_scrub_vecs that tell userspace what it scrubbed and what the outcome was. With that the kernel would have to communicate that the buffer needed to have been at least X size, even though for our cases XFS_SCRUB_TYPE_NR + 2 would always be enough. Compared to that, this design keeps all the dependency policy and ordering logic in userspace where it already resides instead of duplicating it in the kernel. The downside of that is that it needs the barrier logic. When running fstests in "rebuild all metadata after each test" mode, I observed a 10% reduction in runtime due to fewer transitions across the system call boundary. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: drop the scrub file's iolock when transaction allocation fails	Darrick J. Wong	1	-1/+3
	If the transaction allocation in the !orphanage_available case of xrep_nlinks_repair_inode fails, we need to drop the IOLOCK of the file being scrubbed before exiting. Found by fuzzing u3.sfdir3.list[1].name = zeroes in xfs/1546. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: use dontcache for grabbing inodes during scrub	Darrick J. Wong	3	-11/+21
	Back when I wrote commit a03297a0ca9f2, I had thought that we'd be doing users a favor by only marking inodes dontcache at the end of a scrub operation, and only if there's only one reference to that inode. This was more or less true back when I_DONTCACHE was an XFS iflag and the only thing it did was change the outcome of xfs_fs_drop_inode to 1. Note: If there are dentries pointing to the inode when scrub finishes, the inode will have positive i_count and stay around in cache until dentry reclaim. But now we have d_mark_dontcache, which cause the inode and the dentries attached to it all to be marked I_DONTCACHE, which means that we drop the dentries ASAP, which drops the inode ASAP. This is bad if scrub found problems with the inode, because now they can be scheduled for inactivation, which can cause inodegc to trip on it and shut down the filesystem. Even if the inode isn't bad, this is still suboptimal because phases 3-7 each initiate inode scans. Dropping the inode immediately during phase 3 is silly because phase 5 will reload it and drop it immediately, etc. It's fine to mark the inodes dontcache, but if there have been accesses to the file that set up dentries, we should keep them. I validated this by setting up ftrace to capture xfs_iget_recycle* tracepoints and ran xfs/285 for 30 seconds. With current djwong-wtf I saw ~30,000 recycle events. I then dropped the d_mark_dontcache calls and set XFS_IGET_DONTCACHE, and the recycle events dropped to ~5,000 per 30 seconds. Therefore, grab the inode with XFS_IGET_DONTCACHE, which only has the effect of setting I_DONTCACHE for cache misses. Remove the d_mark_dontcache call that can happen in xchk_irele. Fixes: a03297a0ca9f2 ("xfs: manage inode DONTCACHE status at irele time") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: fix corruptions in the directory treescrub-directory-tree-6.10_2024-04-23 scrub-directory-tree-6.10	Darrick J. Wong	11	-8/+927
	Repair corruptions in the directory tree itself. Cycles are broken by removing an incoming parent->child link. Multiply-owned directories are fixed by pruning the extra parent -> child links Disconnected subtrees are reconnected to the lost and found. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: move xfs_ioc_scrub_metadata to scrub.c	Darrick J. Wong	3	-27/+28
	Move the scrub ioctl handler to scrub.c to keep the code together and to reduce unnecessary code when CONFIG_XFS_ONLINE_SCRUB=n. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: report directory tree corruption in the health information	Darrick J. Wong	4	-1/+6
	Report directories that are the source of corruption in the directory tree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: reduce the rate of cond_resched calls inside scrub	Darrick J. Wong	6	-31/+74
	We really don't want to call cond_resched every single time we go through a loop in scrub -- there may be billions of records, and probing into the scheduler itself has overhead. Reduce this overhead by only calling cond_resched 10x per second; and add a counter so that we only check jiffies once every 1000 records or so. Surprisingly, this reduces scrub-only fstests runtime by about 2%. I used the bmapinflate xfs_db command to produce a billion-extent file and this stupid gadget reduced the scrub runtime by about 4%. From a stupid microbenchmark of calling these things 1 billion times, I estimate that cond_resched costs about 5.5ns per call; jiffes costs about 0.3ns per read; and fatal_signal_pending costs about 0.4ns per call. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: inode repair should ensure there's an attr fork to store parent pointersrepair-pptrs-6.10_2024-04-23 repair-pptrs-6.10	Darrick J. Wong	1	-0/+41
	The runtime parent pointer update code expects that any file being moved around the directory tree already has an attr fork. However, if we had to rebuild an inode core record, there's a chance that we zeroed forkoff as part of the inode to pass the iget verifiers. Therefore, if we performed any repairs on an inode core, ensure that the inode has a nonzero forkoff before unlocking the inode. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: invalidate dirloop scrub path data when concurrent updates happen	Darrick J. Wong	3	-1/+244
	Add a dirent update hook so that we can detect directory tree updates that affect any of the paths found by this scrubber and force it to rescan. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: repair link count of nondirectories after rebuilding parent pointers	Darrick J. Wong	1	-0/+107
	Since the parent pointer scrubber does not exhaustively search the filesystem for missing parent pointers, it doesn't have a good way to determine that there are pointers missing from an otherwise uncorrupt xattr structure. Instead, for nondirectories it employs a heuristic of comparing the file link count to the number of parent pointers found. However, we don't want this heuristic flagging a false corruption after a repair has actually scanned the entire filesystem to rebuild the parent pointers. Therefore, reset the file link count in this one case because we actually know the correct link count. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: teach online scrub to find directory tree structure problems	Darrick J. Wong	12	-2/+1168
	Create a new scrubber that detects corruptions within the directory tree structure itself. It can detect directories with multiple parents; loops within the directory tree; and directory loops not accessible from the root. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: adapt the orphanage code to handle parent pointers	Darrick J. Wong	3	-0/+43
	Adapt the orphanage's adoption code to update the child file's parent pointers as part of the reparenting process. Also ensure that the child has an attr fork to receive the parent pointer update, since the runtime code assumes one exists. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: actually rebuild the parent pointer xattrs	Darrick J. Wong	7	-23/+701
	Once we've assembled all the parent pointers for a file, we need to commit the new dataset atomically to that file. Parent pointer records are embedded in the xattr structure, which means that we must write a new extended attribute structure, again, atomically. Therefore, we must copy the non-parent-pointer attributes from the file being repaired into the temporary file's extended attributes and then call the atomic extent swap mechanism to exchange the blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: add a per-leaf block callback to xchk_xattr_walk	Darrick J. Wong	6	-8/+20
	Add a second callback function to xchk_xattr_walk so that we can do something in between attr leaf blocks. This will be used by the next patch to see if we should flush cached parent pointer updates to constrain memory usage. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: split xfs_bmap_add_attrfork into two pieces	Darrick J. Wong	3	-28/+50
	Split this function into two pieces -- one to make the actual changes to the inode core to add the attr fork, and another one to deal with getting the transaction and locking the inodes. The next couple of patches will need this to be split into two. One patch implements committing new parent pointer recordsets to damaged files. If one file has an attr fork and the other does not, we have to create the missing attr fork before the atomic swap transaction, and can use the behavior encoded in the current xfs_bmap_add_attrfork. The second patch adapts /lost+found adoptions to handle parent pointers correctly. The adoption process will add a parent pointer to a child that is being moved to /lost+found, but this requires that the attr fork already exists. We don't know if we're actually going to commit the adoption until we've already reserved a transaction and taken the ILOCKs, which means that we must have a way to bypass the start of the current xfs_bmap_add_attrfork. Therefore, create xfs_attr_add_fork as the helper that creates a transaction and takes locks; and make xfs_bmap_add_attrfork the function that updates the inode core and allocates the incore attr fork. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: remove pointless unlocked assertion	Darrick J. Wong	1	-2/+0
	Remove this assertion about the inode not having an attr fork from xfs_bmap_add_attrfork because the function handles that case just fine. Weirder still, the function actually /requires/ the caller not to hold the ILOCK, which means that its accesses are not stabilized. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: implement live updates for parent pointer repairs	Darrick J. Wong	2	-5/+100
	While we're scanning the filesystem for dirents that we can turn into parent pointers, we cannot hold the IOLOCK or ILOCK of the file being repaired. Therefore, we need to set up a dirent hook so that we can keep the temporary file's parent pionters up to date with the rest of the filesystem. Hence we add the ability to remove pptrs from the temporary file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: repair directory parent pointers by scanning for dirents	Darrick J. Wong	2	-3/+447
	If parent pointers are enabled on the filesystem, we can repair the entire dataset by walking the directories of the filesystem looking for dirents that we can turn into parent pointers. Once we have a full incore dataset, we'll figure out what to do with it, but that's for a subsequent patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: replay unlocked parent pointer updates that accrue during xattr repair	Darrick J. Wong	2	-2/+509
	There are a few places where the extended attribute repair code drops the ILOCK to apply stashed xattrs to the temporary file. Although setxattr and removexattr are still locked out because we retain our hold on the IOLOCK, this doesn't prevent renames from updating parent pointers, because the VFS doesn't take i_rwsem on children that are being moved. Therefore, set up a dirent hook to capture parent pointer updates for this file, and replay(?) the updates. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: implement live updates for directory repairs	Darrick J. Wong	4	-22/+218
	While we're scanning the filesystem for parent pointers that we can turn into dirents, we cannot hold the IOLOCK or ILOCK of the directory being repaired. Therefore, we need to set up a dirent hook so that we can keep the temporary directory up to date with the rest of the filesystem. Hence we add the ability to remove entries from the temporary dir. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: repair directories by scanning directory parent pointers	Darrick J. Wong	1	-6/+341
	For filesystems with parent pointers, scan the entire filesystem looking for parent pointers that target the directory we're rebuilding instead of trying to salvage whatever we can from the directory data blocks. This will be more robust than salvaging, but there's more code to come. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: add raw parent pointer apis to support repair	Darrick J. Wong	4	-2/+72
	Add a couple of utility functions to set or remove parent pointers from a file. These functions will be used by repair code, hence they skip the xattr logging that regular parent pointer updates use. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: check parent pointer xattrs when scrubbingscrub-pptrs-6.10_2024-04-23 scrub-pptrs-6.10	Darrick J. Wong	1	-0/+16
	Check parent pointer xattrs as part of scrubbing xattrs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: salvage parent pointers when rebuilding xattr structures	Darrick J. Wong	2	-9/+65
	When we're salvaging extended attributes, make sure we validate the ones that claim to be parent pointers before adding them to the salvage pile. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: walk directory parent pointers to determine backref count	Darrick J. Wong	6	-1/+177
	If the filesystem has parent pointers enabled, walk the parent pointers of subdirectories to determine the true backref count. In theory each subdir should have a single parent reachable via dotdot, but in the case of (corrupt) subdirs with multiple parents, we need to keep the link counts high enough that the directory loop detector will be able to correct the multiple parents problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: make the reserved block permission flag explicit in xfs_attr_set	Darrick J. Wong	4	-6/+6
	Make the use of reserved blocks an explicit parameter to xfs_attr_set. Userspace setting XFS_ATTR_ROOT attrs should continue to be able to use it, but for online repairs we can back out and therefore do not care. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: remove some boilerplate from xfs_attr_set	Darrick J. Wong	3	-23/+38
	In preparation for online/offline repair wanting to use xfs_attr_set, move some of the boilerplate out of this function into the callers. Repair can initialize the da_args completely, and the userspace flag handling/twisting goes away once we move it to xfs_attr_change. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: deferred scrub of parent pointers	Darrick J. Wong	3	-8/+264
	If the trylock-based dirent check fails, retain those parent pointers and check them at the end. This may involve dropping the locks on the file being scanned, so yay. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: scrub parent pointers	Darrick J. Wong	1	-0/+371
	Actually check parent pointers now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: deferred scrub of dirents	Darrick J. Wong	4	-3/+346
	If the trylock-based parent pointer check fails, retain those dirents and check them at the end. This may involve dropping the locks on the file being scanned, so yay. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: check dirents have parent pointers	Darrick J. Wong	3	-1/+138
	If the fs has parent pointers, we need to check that each child dirent points to a file that has a parent pointer pointing back at us. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: enable parent pointerspptrs-6.10_2024-04-23 pptrs-6.10	Darrick J. Wong	1	-1/+2
	Add parent pointers to the list of supported features. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: revert commit 44af6c7e59b12	Darrick J. Wong	1	-8/+5
	In my haste to fix what I thought was a performance problem in the attr scrub code, I neglected to notice that the xfs_attr_get_ilocked also had the effect of checking that attributes can actually be looked up through the attr dabtree. Fix this. Fixes: 44af6c7e59b12 ("xfs: don't load local xattr values during scrub") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: drop compatibility minimum log size computations for reflink	Darrick J. Wong	1	-0/+14
	Let's also drop the oversized minimum log computations for reflink and rmap that were the result of bugs introduced many years ago. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: fix unit conversion error in xfs_log_calc_max_attrsetm_res	Darrick J. Wong	1	-0/+32
	Dave and I were discussing some recent test regressions as a result of me turning on nrext64=1 on realtime filesystems, when we noticed that the minimum log size of a 32M filesystem jumped from 954 blocks to 4287 blocks. Digging through xfs_log_calc_max_attrsetm_res, Dave noticed that @size contains the maximum estimated amount of space needed for a local format xattr, in bytes, but we feed this quantity to XFS_NEXTENTADD_SPACE_RES, which requires units of blocks. This has resulted in an overestimation of the minimum log size over the years. We should nominally correct this, but there's a backwards compatibility problem -- if we enable it now, the minimum log size will decrease. If a corrected mkfs formats a filesystem with this new smaller log size, a user will encounter mount failures on an uncorrected kernel due to the larger minimum log size computations there. Therefore, turn this on for parent pointers because it wasn't merged at all upstream when this issue was discovered. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: add a incompat feature bit for parent pointers	Allison Henderson	4	-0/+10
	Create an incompat feature bit and a fs geometry flag so that we can enable the feature in the ondisk superblock and advertise its existence to userspace. Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-04-23	xfs: don't remove the attr fork when parent pointers are enabled	Allison Henderson	1	-2/+4
	When an inode is removed, it may also cause the attribute fork to be removed if it is the last attribute. This transaction gets flushed to the log, but if the system goes down before we could inactivate the symlink, the log recovery tries to inactivate this inode (since it is on the unlinked list) but the verifier trips over the remote value and leaks it. Hence we ended up with a file in this odd state on a "clean" mount. The "obvious" fix is to prohibit erasure of the attr fork to avoid tripping over the verifiers when pptrs are enabled. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: add parent pointer ioctls	Darrick J. Wong	11	-2/+522
	This patch adds a pair of new file ioctls to retrieve the parent pointer of a given inode. They both return the same results, but one operates on the file descriptor passed to ioctl() whereas the other allows the caller to specify a file handle for which the caller wants results. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: split out handle management helpers a bit	Darrick J. Wong	2	-32/+70
	Split out the functions that generate file/fs handles and map them back into dentries in preparation for the GETPARENTS ioctl next. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: move handle ioctl code to xfs_handle.c	Darrick J. Wong	6	-619/+649
	Move the handle managemnet code (and the attrmulti code that uses it) to xfs_handle.c. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: pass the attr value to put_listent when possible	Allison Henderson	5	-3/+13
	Pass the attr value to put_listent when we have local xattrs or shortform xattrs. This will enable the GETPARENTS ioctl to use xfs_attr_list as its backend. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: don't return XFS_ATTR_PARENT attributes via listxattr	Allison Henderson	2	-0/+7
	Parent pointers are internal filesystem metadata. They're not intended to be directly visible to userspace, so filter them out of xfs_xattr_put_listent so that they don't appear in listxattr. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Inspired-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: change this to XFS_ATTR_PRIVATE_NSP_MASK per fsverity patchset] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: Add parent pointers to xfs_cross_rename	Allison Henderson	1	-8/+25
	Cross renames are handled separately from standard renames, and need different handling to update the parent attributes correctly. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: Add parent pointers to rename	Allison Henderson	7	-11/+142
	This patch removes the old parent pointer attribute during the rename operation, and re-adds the updated parent pointer. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: adjust to new ondisk format] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: remove parent pointers in unlink	Allison Henderson	5	-8/+60
	This patch removes the parent pointer attribute during unlink Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: adjust to new ondisk format, minor rebase fixes] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: add parent attributes to symlink	Allison Henderson	4	-8/+45
	This patch modifies xfs_symlink to add a parent pointer to the inode. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: minor rebase fixups] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: add parent attributes to link	Allison Henderson	5	-10/+54
	This patch modifies xfs_link to add a parent pointer to the inode. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: minor rebase fixes] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: parent pointer attribute creation	Allison Henderson	9	-12/+242
	Add parent pointer attribute during xfs_create, and subroutines to initialize attributes. Note that the xfs_attr_intent object contains a pointer to the caller's xfs_da_args object, so the latter must persist until transaction commit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: shorten names, adjust to new format, set init_xattrs for parent pointers] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: create a hashname function for parent pointers	Darrick J. Wong	4	-0/+59
	Although directory entry and parent pointer recordsets look very similar (name -> ino), there's one major difference between them: a file can be hardlinked from multiple parent directories with the same filename. This is common in shared container environments where a base directory tree might be hardlink-copied multiple times. IOWs the same 'ls' program might be hardlinked to multiple /srv/*/bin/ls paths. We don't want parent pointer operations to bog down on hash collisions between the same dirent name, so create a special hash function that mixes in the parent directory inode number. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: extend transaction reservations for parent attributes	Allison Henderson	1	-52/+274
	We need to add, remove or modify parent pointer attributes during create/link/unlink/rename operations atomically with the dirents in the parent directories being modified. This means they need to be modified in the same transaction as the parent directories, and so we need to add the required space for the attribute modifications to the transaction reservations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: fix indenting errors, adjust for new log format] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: add parent pointer validator functions	Allison Henderson	5	-0/+123
	The attr name of a parent pointer is a string, and the attr value of a parent pointer is (more or less) a file handle. So we need to modify attr_namecheck to verify the parent pointer name, and add a xfs_parent_valuecheck function to sanitize the handle. At the same time, we need to validate attr values during log recovery if the xattr is really a parent pointer. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: move functions to xfs_parent.c, adjust for new disk format] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: Expose init_xattrs in xfs_create_tmpfile	Allison Henderson	3	-4/+5
	Tmp files are used as part of rename operations and will need attr forks initialized for parent pointers. Expose the init_xattrs parameter to the calling function to initialize the fork. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: record inode generation in xattr update log intent items	Darrick J. Wong	2	-7/+28
	For parent pointer updates, record the i_generation of the file that is being updated so that we don't accidentally jump generations. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: create attr log item opcodes and formats for parent pointers	Darrick J. Wong	6	-26/+284
	Make the necessary alterations to the extended attribute log intent item ondisk format so that we can log parent pointer operations. This requires the creation of new opcodes specific to parent pointers, and a new four-argument replace operation to handle renames. At this point this part of the patchset has changed so much from what Allison original wrote that I no longer think her SoB applies. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: refactor xfs_is_using_logged_xattrs checks in attr item recovery	Darrick J. Wong	1	-3/+4
	Move this feature check down to the per-op checks so that we can ensure that we never see parent pointer attr items on non-pptr filesystems, and that logged xattrs are turned on for non-pptr attr items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: allow xattr matching on name and value for parent pointers	Darrick J. Wong	1	-6/+46
	If a file is hardlinked with the same name but from multiple parents, the parent pointers will all have the same dirent name (== attr name) but with different parent_ino/parent_gen values. To disambiguate, we need to be able to match on both the attr name and the attr value. This is in contrast to regular xattrs, which are matchtg edit d only on name. Therefore, plumb in the ability to match shortform and local attrs on name and value in the XFS_ATTR_PARENT namespace. Parent pointer attr values are never large enough to be stored in a remote attr, so we need can reject these cases as corruption. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: define parent pointer ondisk extended attribute format	Allison Henderson	2	-0/+14
	We need to define the parent pointer attribute format before we start adding support for it into all the code that needs to use it. The EA format we will use encodes the following information: name={dirent name} value={parent inumber, parent inode generation} hash=xfs_dir2_hashname(dirent name) ^ (parent_inumber) The inode/gen gives all the information we need to reliably identify the parent without requiring child->parent lock ordering, and allows userspace to do pathname component level reconstruction without the kernel ever needing to verify the parent itself as part of ioctl calls. By using the name-value lookup mode in the extended attribute code to match parent pointers using both the xattr name and value, we can identify the exact parent pointer EA we need to modify/remove in rename/unlink operations without searching the entire EA space. By storing the dirent name, we have enough information to be able to validate and reconstruct damaged directory trees. Earlier iterations of this patchset encoded the directory offset in the parent pointer key, but this format required repair to keep that in sync across directory rebuilds, which is unnecessary complexity. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: add parent pointer support to attribute code	Allison Henderson	3	-3/+10
	Add the new parent attribute type. XFS_ATTR_PARENT is used only for parent pointer entries; it uses reserved blocks like XFS_ATTR_ROOT. Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: create a separate hashname function for extended attributes	Darrick J. Wong	6	-9/+54
	Create a separate function to compute name hashvalues for extended attributes. When we get to parent pointers we'll be altering the rules so that metadump obfuscation doesn't turn heinous. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: move xfs_attr_defer_add to xfs_attr_item.c	Darrick J. Wong	3	-34/+41
	Move the code that adds the incore xfs_attr_item deferred work data to a transaction live with the ATTRI log item code. This means that the upper level extended attribute code no longer has to know about the inner workings of the ATTRI log items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: check the flags earlier in xfs_attr_match	Christoph Hellwig	1	-9/+10
	Checking the flags match is much cheaper than a memcmp, so do it early on in xfs_attr_match, and also add a little helper to calculate the match mask right under the comment explaining the logic for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-04-23	xfs: rearrange xfs_attr_match parameters	Darrick J. Wong	1	-11/+12
	Rearrange the parameters to this function so that they match the order of attr listent: attr_flags -> name -> namelen -> value -> valuelen. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: enforce one namespace per attributeimprove-attr-validation-6.10_2024-04-23 improve-attr-validation-6.10	Darrick J. Wong	7	-18/+41
	Create a standardized helper function to enforce one namespace bit per extended attribute, and refactor all the open-coded hweight logic. This function is not a static inline to avoid porting hassles in userspace. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: refactor name/value iovec validation in xlog_recover_attri_commit_pass2	Darrick J. Wong	1	-18/+46
	Hoist the code that checks the attr name and value iovecs into separate helpers so that we can add more callsites for the new parent pointer attr intent items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: refactor name/length checks in xfs_attri_validate	Darrick J. Wong	1	-8/+15
	Move the name and length checks into the attr op switch statement so that we can perform more specific checks of the value length. Over the next few patches we're going to add new attr op flags with different validation requirements. While we're at it, remove the incorrect comment. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: use local variables for name and value length in _attri_commit_pass2	Darrick J. Wong	1	-11/+14
	We're about to start using tagged unions in the xattr log format, so create a bunch of local variables in the recovery function so we only have to decode the log item fields once. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: always set args->value in xfs_attri_item_recover	Darrick J. Wong	1	-2/+2
	Always set args->value to the recovered value buffer. This reduces the amount of code in the switch statement, and hence the amount of thinking that I have to do. We validated the recovered buffers, supposedly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: validate recovered name buffers when recovering xattr items	Darrick J. Wong	1	-11/+47
	Strengthen the xattri log item recovery code by checking that we actually have the required name and newname buffers for whatever operation we're replaying. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: use helpers to extract xattr op from opflags	Darrick J. Wong	2	-6/+15
	Create helper functions to extract the xattr op from the ondisk xattri log item and the incore attr intent item. These will get more use in the patches that follow. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: restructure xfs_attr_complete_op a bit	Darrick J. Wong	1	-5/+4
	Eliminate the local variable from this function so that we can streamline things a bit later when we add the PPTR_REPLACE op code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: check shortform attr entry flags specifically	Darrick J. Wong	1	-0/+9
	While reviewing flag checking in the attr scrub functions, we noticed that the shortform attr scanner didn't catch entries that have the LOCAL or INCOMPLETE bits set. Neither of these flags can ever be set on a shortform attr, so we need to check this narrower set of valid flags. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: fix missing check for invalid attr flags	Darrick J. Wong	2	-4/+14
	The xattr scrubber doesn't check for undefined flags in shortform attr entries. Therefore, define a mask XFS_ATTR_ONDISK_MASK that has all possible XFS_ATTR_* flags in it, and use that to check for unknown bits in xchk_xattr_actor. Refactor the check in the dabtree scanner function to use the new mask as well. The redundant checks need to be in place because the dabtree check examines the hash mappings and therefore needs to decode the attr leaf entries to compute the namehash. This happens before the walk of the xattr entries themselves. Fixes: ae0506eba78fd ("xfs: check used space of shortform xattr structures") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: check opcode and iovec count match in xlog_recover_attri_commit_pass2	Darrick J. Wong	1	-0/+27
	Check that the number of recovered log iovecs is what is expected for the xattri opcode is expecting. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: rearrange xfs_da_args a bit to use less spaceshrink-dirattr-args-6.10_2024-04-23 shrink-dirattr-args-6.10	Darrick J. Wong	1	-9/+11
	A few notes about struct xfs_da_args: The XFS_ATTR_* flags only go up as far as XFS_ATTR_INCOMPLETE, which means that attr_filter could be a u8 field. I've reduced the number of XFS_DA_OP_* flags down to the point where op_flags would also fit into a u8. filetype has 7 bytes of slack after it, which is wasteful. namelen will never be greater than MAXNAMELEN, which is 256. This field could be reduced to a short. Rearrange the fields in xfs_da_args to waste less space. This reduces the structure size from 136 bytes to 128. Later when we add extra fields to support parent pointer replacement, this will only bloat the structure to 144 bytes, instead of 168. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: use an XFS_OPSTATE_ flag for detecting if logged xattrs are available	Darrick J. Wong	4	-3/+24
	Per reviewer request, use an OPSTATE flag (+ helpers) to decide if logged xattrs are enabled, instead of querying the xfs_sb. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: make attr removal an explicit operation	Darrick J. Wong	6	-25/+34
	Parent pointers match attrs on name+value, unlike everything else which matches on only the name. Therefore, we cannot keep using the heuristic that !value means remove. Make this an explicit operation code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: require XFS_SB_FEAT_INCOMPAT_LOG_XATTRS for attr log intent item recovery	Darrick J. Wong	1	-2/+3
	The XFS_SB_FEAT_INCOMPAT_LOG_XATTRS feature bit protects a filesystem from old kernels that do not know how to recover extended attribute log intent items. Make this check mandatory instead of a debugging assert. Fixes: fd920008784ea ("xfs: Set up infrastructure for log attribute replay") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: attr fork iext must be loaded before calling xfs_attr_is_leaf	Darrick J. Wong	3	-6/+60
	Christoph noticed that the xfs_attr_is_leaf in xfs_attr_get_ilocked can access the incore extent tree of the attr fork, but nothing in the xfs_attr_get path guarantees that the incore tree is actually loaded. Most of the time it is, but seeing as xfs_attr_is_leaf ignores the return value of xfs_iext_get_extent I guess we've been making choices based on random stack contents and nobody's complained? Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: remove xfs_da_args.attr_flags	Darrick J. Wong	10	-28/+39
	This field only ever contains XATTR_{CREATE,REPLACE}, and it only goes as deep as xfs_attr_set. Remove the field from the structure and replace it with an enum specifying exactly what kind of change we want to make to the xattr structure. Upsert is the name that we'll give to the flags==0 operation, because we're either updating an existing value or inserting it, and the caller doesn't care. Note: The "UPSERTR" name created here is to make userspace porting easier. It will be removed in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: remove XFS_DA_OP_NOTIME	Darrick J. Wong	3	-8/+4
	The only user of this flag sets it prior to an xfs_attr_get_ilocked call, which doesn't update anything. Get rid of the flag. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-23	xfs: remove XFS_DA_OP_REMOVE	Darrick J. Wong	2	-5/+2
	Nobody checks this flag, so get rid of it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-22	xfs: reinstate delalloc for RT inodes (if sb_rextsize == 1)korg/for-next_2024-04-23	Christoph Hellwig	4	-4/+5
	Commit aff3a9edb708 ("xfs: Use preallocation for inodes with extsz hints") disabled delayed allocation for all inodes with extent size hints due a data exposure problem. It turns out we fixed this data exposure problem since by always creating unwritten extents for delalloc conversions due to more data exposure problems, but the writeback path doesn't actually support extent size hints when converting delalloc these days, which probably isn't a problem given that people using the hints know what they get. However due to the way how xfs_get_extsz_hint is implemented, it always claims an extent size hint for RT inodes even if the RT extent size is a single FSB. Due to that the above commit effectively disabled delalloc support for RT inodes. Switch xfs_get_extsz_hint to return 0 for this case and work around that in a few places to reinstate delalloc support for RT inodes on file systems with an sb_rextsize of 1. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: stop the steal (of data blocks for RT indirect blocks)	Christoph Hellwig	1	-1/+6
	When xfs_bmap_del_extent_delay has to split an indirect block it tries to steal blocks from the the part that gets unmapped to increase the indirect block reservation that now needs to cover for two extents instead of one. This works perfectly fine on the data device, where the data and indirect blocks come from the same pool. It has no chance of working when the inode sits on the RT device. To support re-enabling delalloc for inodes on the RT device, make this behavior conditional on not being for rt extents. Note that split of delalloc extents should only happen on writeback failure, as for other kinds of hole punching we first write back all data and thus convert the delalloc reservations covering the hole to a real allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: rework splitting of indirect block reservations	Christoph Hellwig	1	-22/+16
	Move the check if we have enough indirect blocks and the stealing of the deleted extent blocks out of xfs_bmap_split_indlen and into the caller to prepare for handling delayed allocation of RT extents that can't easily be stolen. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: look at m_frextents in xfs_iomap_prealloc_size for RT allocations	Christoph Hellwig	1	-12/+31
	Add a check for files on the RT subvolume and use m_frextents instead of m_fdblocks to adjust the preallocation size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: support RT inodes in xfs_mod_delalloc	Christoph Hellwig	7	-13/+56
	To prepare for re-enabling delalloc on RT devices, track the data blocks (which use the RT device when the inode sits on it) and the indirect blocks (which don't) separately to xfs_mod_delalloc, and add a new percpu counter to also track the RT delalloc blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: cleanup fdblock/frextent accounting in xfs_bmap_del_extent_delay	Christoph Hellwig	1	-10/+10
	The code to account fdblocks and frextents in xfs_bmap_del_extent_delay is a bit weird in that it accounts frextents before the iext tree manipulations and fdblocks after it. Given that the iext tree manipulations cannot fail currently that's not really a problem, but still odd. Move the frextent manipulation to the end, and use a fdblocks variable to account of the unconditional indirect blocks and the data blocks only freed for !RT. This prepares for following updates in the area and already makes the code more readable. Also remove the !isrt assert given that this code clearly handles rt extents correctly, and we'll soon reinstate delalloc support for RT inodes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: reinstate RT support in xfs_bmapi_reserve_delalloc	Christoph Hellwig	1	-8/+14
	Allocate data blocks for RT inodes using xfs_dec_frextents. While at it optimize the data device case by doing only a single xfs_dec_fdblocks call for the extent itself and the indirect blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: split xfs_mod_freecounter	Christoph Hellwig	14	-124/+97
	xfs_mod_freecounter has two entirely separate code paths for adding or subtracting from the free counters. Only the subtract case looks at the rsvd flag and can return an error. Split xfs_mod_freecounter into separate helpers for subtracting or adding the freecounter, and remove all the impossible to reach error handling for the addition case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: block deltas in xfs_trans_unreserve_and_mod_sb must be positive	Christoph Hellwig	1	-14/+24
	And to make that more clear, rearrange the code a bit and add asserts and a comment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: move RT inode locking out of __xfs_bunmapi	Christoph Hellwig	2	-7/+11
	__xfs_bunmapi is a bit of an odd place to lock the rtbitmap and rtsummary inodes given that it is very high level code. While this only looks ugly right now, it will become a problem when supporting delayed allocations for RT inodes as __xfs_bunmapi might end up deleting only delalloc extents and thus never unlock the rt inodes. Move the locking into xfs_bmap_del_extent_real just before the call to xfs_rtfree_blocks instead and use a new flag in the transaction to ensure that the locking happens only once. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: free RT extents after updating the bmap btree	Christoph Hellwig	1	-17/+9
	Currently xfs_bmap_del_extent_real frees RT extents before updating the bmap btree, while it frees regular blocks after performing the bmap btree update for convoluted historic reasons. Switch to free the RT blocks in the same place as the regular data blocks instead to simply the code and fix a very theoretical bug. A short history of this code researched by Dave Chiner below: The truncate for data device extents was originally a two-phase operation. First it removed the bmapbt record, but because this can free BMBT extents, it can use up all the free space tree reservation space. So the transaction gets rolled to commit the BMBT change and the xfs_bmap_finish() call that frees the data extent runs with a new transaction reservation that allows different free space btrees to be logged without overrun. However, on crash, this could lose the free space because there was nothing to tell recovery about the extents removed from the BMBT, hence EFIs were introduced. They tie the extent free operation to the bmapbt record removal commit for recovery of the second phase of the extent removal process. Then RT extents came along. RT extent freeing does not require a free space btree reservation because the free space metadata is static and transaction size is bound. Hence we don't need to care if the BMBT record removal modifies the per-ag free space trees and we don't need a two-phase extent remove transaction. The only thing we have to care about is not losing space on crash. Hence instead of recording the extent for freeing in the bmap list for xfs_bmap_finish() to process in a new transaction, it simply freed the rtextent directly. So the original code (from 1994) simply replaced the "free AG extent later" queueing with a direct free. This code was originally at the start of xfs_dmap_del_extent(), but the xfs_bmap_add_free() got moved to the end of the function via the "do_fx" flag (the current code logic) in 1997 (commit c4fac74eaa58 in the historic xfs-import tree) because there was a shutdown occurring because of a case where splitting the extent record failed because the BMBT split and the filesystem didn't have enough space for the split to be done. (FWIW, I'm not sure this can happen anymore.) The commit backed out the BMBT change on ENOSPC error, and in doing so I think this actually breaks RT free space tracking. However, it then returns an ENOSPC error, and we have a dirty transaction in the RT case so this will shut down the filesysetm when the transaction is cancelled. Hence the corrupted "bmbt now points at freed rt dev space" condition never make it to disk, but it's still the wrong way to handle the issue. IOWs, this proposed change fixes that "shutdown at ENOSPC on rt devices" situation that was introduced by the above commit back in 1997. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: refactor realtime inode locking	Christoph Hellwig	7	-23/+87
	Create helper functions to deal with locking realtime metadata inodes. This enables us to maintain correct locking order once we start adding the realtime rmap and refcount btree inodes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: make XFS_TRANS_LOWMODE match the other XFS_TRANS_ definitions	Christoph Hellwig	1	-2/+1
	Commit bb7b1c9c5dd3 ("xfs: tag transactions that contain intent done items") switched the XFS_TRANS_ definitions to be bit based, and using comments above the definitions. As XFS_TRANS_LOWMODE was last and has a big fat comment it was missed. Switch it to the same style. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: compile out v4 support if disabled	Christoph Hellwig	2	-18/+44
	Add a few strategic IS_ENABLED statements to let the compiler eliminate unused code when CONFIG_XFS_SUPPORT_V4 is disabled. This saves multiple kilobytes of .text in my .config: $ size xfs.o.* text data bss dec hex filename 1363633 294836 592 1659061 1950b5 xfs.o.new 1371453 294868 592 1666913 196f61 xfs.o.old Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: remove the unused xfs_extent_busy_enomem trace event	Christoph Hellwig	1	-1/+0
	Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: unwind xfs_extent_busy_clear	Christoph Hellwig	1	-34/+25
	The current structure of xfs_extent_busy_clear that locks the first busy extent in each AG and unlocks when switching to a new AG makes sparse unhappy as the lock critical section tracking can't cope with taking the lock conditionally and inside a loop. Rewrite xfs_extent_busy_clear so that it has an outer loop only advancing when moving to a new AG, and an inner loop that consumes busy extents for the given AG to make life easier for sparse and to also make this logic more obvious to humans. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: move more logic into xfs_extent_busy_clear_one	Christoph Hellwig	1	-11/+12
	Move the handling of discarded entries into xfs_extent_busy_clear_one to reuse the length check and tidy up the logic in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: Remove unused function is_rt_data_fork	Jiapeng Chong	1	-8/+0
	The function are defined in the rmap_repair.c file, but not called elsewhere, so delete the unused function. fs/xfs/scrub/rmap_repair.c:436:1: warning: unused function 'is_rt_data_fork'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8425 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: small cleanup in xrep_update_qflags()	Dan Carpenter	1	-1/+1
	The "mp" pointer is the same as "sc->mp" so this change doesn't affect runtime at all. However, it's nicer to use same name for both the lock and the unlock. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: Fix typo in comment	Thorsten Blum	1	-1/+1
	s/somethign/something/ Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: fix sparse warnings about unused interval tree functions	Dave Chinner	1	-10/+12
	Sparse throws warnings about the interval tree functions that are defined and then not used in the scrub bitmap code: fs/xfs/scrub/bitmap.c:57:1: warning: unused function 'xbitmap64_tree_iter_next' [-Wunused-function] INTERVAL_TREE_DEFINE(struct xbitmap64_node, bn_rbnode, uint64_t, ^ ./include/linux/interval_tree_generic.h:151:33: note: expanded from macro 'INTERVAL_TREE_DEFINE' ITSTATIC ITSTRUCT * \ ^ <scratch space>:3:1: note: expanded from here xbitmap64_tree_iter_next ^ fs/xfs/scrub/bitmap.c:331:1: warning: unused function 'xbitmap32_tree_iter_next' [-Wunused-function] INTERVAL_TREE_DEFINE(struct xbitmap32_node, bn_rbnode, uint32_t, ^ ./include/linux/interval_tree_generic.h:151:33: note: expanded from macro 'INTERVAL_TREE_DEFINE' ITSTATIC ITSTRUCT * \ ^ <scratch space>:59:1: note: expanded from here xbitmap32_tree_iter_next Fix these by marking the functions created by the interval tree creation macro as __maybe_unused to suppress this warning. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-22	xfs: silence sparse warning when checking version number	Dave Chinner	1	-2/+1
	Scrub checks the superblock version number against the known good feature bits that can be set in the version mask. It calculates the version mask to compare like so: vernum_mask = cpu_to_be16(~XFS_SB_VERSION_OKBITS \| XFS_SB_VERSION_NUMBITS \| XFS_SB_VERSION_ALIGNBIT \| XFS_SB_VERSION_DALIGNBIT \| XFS_SB_VERSION_SHAREDBIT \| XFS_SB_VERSION_LOGV2BIT \| XFS_SB_VERSION_SECTORBIT \| XFS_SB_VERSION_EXTFLGBIT \| XFS_SB_VERSION_DIRV2BIT); This generates a sparse warning: fs/xfs/scrub/agheader.c:168:23: warning: cast truncates bits from constant value (ffff3f8f becomes 3f8f) This is because '~XFS_SB_VERSION_OKBITS' is considered a 32 bit constant, even though it's value is always under 16 bits. This is a kinda silly thing to do, because: /* * Supported feature bit list is just all bits in the versionnum field because * we've used them all up and understand them all. Except, of course, for the * shared superblock bit, which nobody knows what it does and so is unsupported. */ #define XFS_SB_VERSION_OKBITS \ ((XFS_SB_VERSION_NUMBITS \| XFS_SB_VERSION_ALLFBITS) & \ ~XFS_SB_VERSION_SHAREDBIT) #define XFS_SB_VERSION_NUMBITS 0x000f #define XFS_SB_VERSION_ALLFBITS 0xfff0 #define XFS_SB_VERSION_SHAREDBIT 0x0200 XFS_SB_VERSION_OKBITS has a value of 0xfdff, and so ~XFS_SB_VERSION_OKBITS == XFS_SB_VERSION_SHAREDBIT. The calculated mask already sets XFS_SB_VERSION_SHAREDBIT, so starting with ~XFS_SB_VERSION_OKBITS is completely redundant.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-20	xfs: fix CIL sparse lock context warnings	Dave Chinner	2	-2/+3
	Sparse reports: fs/xfs/xfs_log_cil.c:1127:1: warning: context imbalance in 'xlog_cil_push_work' - different lock contexts for basic block fs/xfs/xfs_log_cil.c:1380:1: warning: context imbalance in 'xlog_cil_push_background' - wrong count at exit fs/xfs/xfs_log_cil.c:1623:9: warning: context imbalance in 'xlog_cil_commit' - unexpected unlock xlog_cil_push_background() has a locking annotations for an rw_sem. Sparse does not track lock contexts for rw_sems, so the annotation generates false warnings. Remove the annotation. xlog_wait_on_iclog() drops the log->l_ic_loglock. The function has a sparse annotation, but the prototype in xfs_log_priv.h does not. Hence the warning from xlog_cil_push_work() which calls xlog_wait_on_iclog(). Add the missing annotation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-04-16	Merge tag 'retain-ilock-during-dir-ops-6.10_2024-04-15' of ↵	Chandan Babu R	13	-54/+156
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: retain ILOCK during directory updates This series changes the directory update code to retain the ILOCK on all files involved in a rename until the end of the operation. The upcoming parent pointers patchset applies parent pointers in a separate chained update from the actual directory update, which is why it is now necessary to keep the ILOCK instead of dropping it after the first transaction in the chain. As a side effect, we no longer need to hold the IOLOCK during an rmapbt scan of inodes to serialize the scan with ongoing directory updates. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'retain-ilock-during-dir-ops-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: unlock new repair tempfiles after creation xfs: don't pick up IOLOCK during rmapbt repair scan xfs: Hold inode locks in xfs_rename xfs: Hold inode locks in xfs_trans_alloc_dir xfs: Hold inode locks in xfs_ialloc xfs: Increase XFS_QM_TRANS_MAXDQS to 5 xfs: Increase XFS_DEFER_OPS_NR_INODES to 5
2024-04-16	Merge tag 'online-fsck-design-6.10_2024-04-15' of ↵	Chandan Babu R	1	-88/+266
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: design documentation for online fsck, part 2 This series updates the design documentation for online fsck to reflect the final design of the parent pointers feature as well as the implementation of online fsck for the new metadata. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'online-fsck-design-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: docs: describe xfs directory tree online fsck docs: update offline parent pointer repair strategy docs: update online directory and parent pointer repair sections docs: update the parent pointers documentation to the final version
2024-04-16	Merge tag 'discard-relax-locks-6.10_2024-04-15' of ↵	Chandan Babu R	1	-60/+93
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: less heavy locks during fstrim Congratulations! You have made it to the final patchset of the main online fsck feature! This patchset fixes some stalling behavior that I observed when running FITRIM against large flash-based filesystems with very heavily fragmented free space data. In summary -- the current fstrim implementation optimizes for trimming the largest free extents first, and holds the AGF lock for the duration of the operation. This is great if fstrim is being run as a foreground process by a sysadmin. For xfs_scrub, however, this isn't so good -- we don't really want to block on one huge kernel call while reporting no progress information. We don't want to hold the AGF so long that background processes stall. These problems are easily fixable by issuing smaller FITRIM calls, but there's still the problem of walking the entire cntbt. To solve that second problem, we introduce a new sub-AG FITRIM implementation. To solve the first problem, make it relax the AGF periodically. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'discard-relax-locks-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: fix performance problems when fstrimming a subset of a fragmented AG
2024-04-16	Merge tag 'inode-repair-improvements-6.10_2024-04-15' of ↵	Chandan Babu R	12	-99/+187
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: inode-related repair fixes While doing QA of the online fsck code, I made a few observations: First, nobody was checking that the di_onlink field is actually zero; Second, that allocating a temporary file for repairs can fail (and thus bring down the entire fs) if the inode cluster is corrupt; and Third, that file link counts do not pin at ~0U to prevent integer overflows. Fourth, the x{chk,rep}_metadata_inode_fork functions should be subclassing the main scrub context, not modifying the parent's setup willy-nilly. This scattered patchset fixes those three problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'inode-repair-improvements-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype xfs: pin inodes that would otherwise overflow link count xfs: try to avoid allocating from sick inode clusters xfs: check unused nlink fields in the ondisk inode
2024-04-16	Merge tag 'repair-iunlink-6.10_2024-04-15' of ↵	Chandan Babu R	6	-47/+1179
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: online fsck of iunlink buckets This series enhances the AGI scrub code to check the unlinked inode bucket lists for errors, and fixes them if necessary. Now that iunlink pointer updates are virtual log items, we can batch updates pretty efficiently in the logging code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-iunlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: repair AGI unlinked inode bucket lists xfs: hoist AGI repair context to a heap object xfs: check AGI unlinked inode buckets
2024-04-16	Merge tag 'repair-symlink-6.10_2024-04-15' of ↵	Chandan Babu R	12	-15/+609
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: online repair of symbolic links The patches in this set adds the ability to repair the target buffer of a symbolic link, using the same salvage, rebuild, and swap strategy used everywhere else. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-symlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: online repair of symbolic links xfs: pass the owner to xfs_symlink_write_target xfs: expose xfs_bmap_local_to_extents for online repair
2024-04-16	Merge tag 'repair-orphanage-6.10_2024-04-15' of ↵	Chandan Babu R	16	-38/+1139
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: move orphan files to lost and found Orphaned files are defined to be files with nonzero ondisk link count but no observable parent directory. This series enables online repair to reparent orphaned files into the filesystem directory tree, and wires up this reparenting ability into the directory, file link count, and parent pointer repair functions. This is how we fix files with positive link count that are not reachable through the directory tree. This patch will also create the orphanage directory (lost+found) if it is not present. In contrast to xfs_repair, we follow e2fsck in creating the lost+found without group or other-owner access to avoid accidental disclosure of files that were previously hidden by an 0700 directory. That's silly security, but people have been known to do it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-orphanage-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: ensure dentry consistency when the orphanage adopts a file xfs: move files to orphanage instead of letting nlinks drop to zero xfs: move orphan files to the orphanage
2024-04-16	Merge tag 'repair-dirs-6.10_2024-04-15' of ↵	Chandan Babu R	21	-4/+2437
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: online repair of directories This series employs atomic extent swapping to enable safe reconstruction of directory data. For now, XFS does not support reverse directory links (aka parent pointers), so we can only salvage the dirents of a directory and construct a new structure. Directory repair therefore consists of five main parts: First, we walk the existing directory to salvage as many entries as we can, by adding them as new directory entries to the repair temp dir. Second, we validate the parent pointer found in the directory. If one was not found, we scan the entire filesystem looking for a potential parent. Third, we use atomic extent swaps to exchange the entire data fork between the two directories. Fourth, we reap the old directory blocks as carefully as we can. To wrap up the directory repair code, we need to add to the regular filesystem the ability to free all the data fork blocks in a directory. This does not change anything with normal directories, since they must still unlink and shrink one entry at a time. However, this will facilitate freeing of partially-inactivated temporary directories during log recovery. The second half of this patchset implements repairs for the dotdot entries of directories. For now there is only rudimentary support for this, because there are no directory parent pointers, so the best we can do is scanning the filesystem and the VFS dcache for answers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-dirs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: ask the dentry cache if it knows the parent of a directory xfs: online repair of parent pointers xfs: scan the filesystem to repair a directory dotdot entry xfs: online repair of directories xfs: inactivate directory data blocks
2024-04-16	Merge tag 'repair-unlinked-inode-state-6.10_2024-04-15' of ↵	Chandan Babu R	5	-13/+100
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: online repair of inode unlinked state This series adds some logic to the inode scrubbers so that they can detect and deal with consistency errors between the link count and the per-inode unlinked list state. The helpers needed to do this are presented here because they are a prequisite for rebuildng directories, since we need to get a rebuilt non-empty directory off the unlinked list. Note that this patchset does not provide comprehensive reconstruction of the AGI unlinked list; that is coming in a subsequent patchset. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-unlinked-inode-state-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: update the unlinked list when repairing link counts xfs: ensure unlinked list state is consistent with nlink during scrub
2024-04-16	Merge tag 'repair-xattrs-6.10_2024-04-15' of ↵	Chandan Babu R	30	-83/+2284
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: online repair of extended attributes This series employs atomic extent swapping to enable safe reconstruction of extended attribute data attached to a file. Because xattrs do not have any redundant information to draw off of, we can at best salvage as much data as we can and build a new structure. Rebuilding an extended attribute structure consists of these three steps: First, we walk the existing attributes to salvage as many of them as we can, by adding them as new attributes attached to the repair tempfile. We need to add a new xfile-based data structure to hold blobs of arbitrary length to stage the xattr names and values. Second, we write the salvaged attributes to a temporary file, and use atomic extent swaps to exchange the entire attribute fork between the two files. Finally, we reap the old xattr blocks (which are now in the temporary file) as carefully as we can. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-xattrs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: create an xattr iteration function for scrub xfs: flag empty xattr leaf blocks for optimization xfs: scrub should set preen if attr leaf has holes xfs: repair extended attributes xfs: use atomic extent swapping to fix user file fork data xfs: create a blob array data structure xfs: enable discarding of folios backing an xfile
2024-04-16	Merge tag 'dirattr-validate-owners-6.10_2024-04-15' of ↵	Chandan Babu R	23	-148/+492
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: set and validate dir/attr block owners There are a couple of significant changes that need to be made to the directory and xattr code before we can support online repairs of those data structures. The first change is because online repair is designed to use libxfs to create a replacement dir/xattr structure in a temporary file, and use atomic extent swapping to commit the corrected structure. To avoid the performance hit of walking every block of the new structure to rewrite the owner number before the swap, we instead change libxfs to allow callers of the dir and xattr code the ability to set an explicit owner number to be written into the header fields of any new blocks that are created. For regular operation this will be the directory inode number. The second change is to update the dir/xattr code to actually check the owner number in each block that is read off the disk, since we don't currently do that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'dirattr-validate-owners-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: validate explicit directory free block owners xfs: validate explicit directory block buffer owners xfs: validate explicit directory data buffer owners xfs: validate directory leaf buffer owners xfs: validate dabtree node buffer owners xfs: validate attr remote value buffer owners xfs: validate attr leaf buffer owners xfs: reduce indenting in xfs_attr_node_list xfs: use the xfs_da_args owner field to set new dir/attr block owner xfs: add an explicit owner field to xfs_da_args
2024-04-16	Merge tag 'repair-rtsummary-6.10_2024-04-15' of ↵	Chandan Babu R	12	-19/+715
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: online repair of realtime summaries We now have all the infrastructure we need to repair file metadata. We'll begin with the realtime summary file, because it is the least complex data structure. To support this we need to add three more pieces to the temporary file code from the previous patchset -- preallocating space in the temp file, formatting metadata into that space and writing the blocks to disk, and swapping the fork mappings atomically. After that, the actual reconstruction of the realtime summary information is pretty simple, since we can simply write the incore copy computed by the rtsummary scrubber to the temporary file, swap the contents, and reap the old blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-rtsummary-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: online repair of realtime summaries xfs: teach the tempfile to set up atomic file content exchanges xfs: support preallocating and copying content into temporary files
2024-04-16	Merge tag 'repair-tempfiles-6.10_2024-04-15' of ↵	Chandan Babu R	14	-26/+843
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: create temporary files for online repair As mentioned earlier, the repair strategy for file-based metadata is to build a new copy in a temporary file and swap the file fork mappings with the metadata inode. We've built the atomic extent swap facility, so now we need to build a facility for handling private temporary files. The first step is to teach the filesystem to ignore the temporary files. We'll mark them as PRIVATE in the VFS so that the kernel security modules will leave it alone. The second step is to add the online repair code the ability to create a temporary file and reap extents from the temporary file after the extent swap. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'repair-tempfiles-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: add the ability to reap entire inode forks xfs: refactor live buffer invalidation for repairs xfs: create temporary files and directories for online repair xfs: hide private inodes from bulkstat and handle functions
2024-04-16	Merge tag 'atomic-file-updates-6.10_2024-04-15' of ↵	Chandan Babu R	30	-188/+3616
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: atomic file content exchanges This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange ranges of bytes between two files atomically. This new functionality enables data storage programs to stage and commit file updates such that reader programs will see either the old contents or the new contents in their entirety, with no chance of torn writes. A successful call completion guarantees that the new contents will be seen even if the system fails. The ability to exchange file fork mappings between files in this manner is critical to supporting online filesystem repair, which is built upon the strategy of constructing a clean copy of a damaged structure and committing the new structure into the metadata file atomically. The ioctls exist to facilitate testing of the new functionality and to enable future application program designs. User programs will be able to update files atomically by opening an O_TMPFILE, reflinking the source file to it, making whatever updates they want to make, and exchange the relevant ranges of the temp file with the original file. If the updates are aligned with the file block size, a new (since v2) flag provides for exchanging only the written areas. Note that application software must quiesce writes to the file while it stages an atomic update. This will be addressed by a subsequent series. This mechanism solves the clunkiness of two existing atomic file update mechanisms: for O_TRUNC + rewrite, this eliminates the brief period where other programs can see an empty file. For create tempfile + rename, the need to copy file attributes and extended attributes for each file update is eliminated. However, this method introduces its own awkwardness -- any program initiating an exchange now needs to have a way to signal to other programs that the file contents have changed. For file access mediated via read and write, fanotify or inotify are probably sufficient. For mmaped files, that may not be fast enough. Here is the proposed manual page: IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2) NAME ioctl_xfs_exchange_range - exchange the contents of parts of two files SYNOPSIS #include <sys/ioctl.h> #include <xfs/xfs_fs.h> int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct xfs_ex‐ change_range arg); DESCRIPTION Given a range of bytes in a first file file1_fd and a second range of bytes in a second file file2_fd, this ioctl(2) ex‐ changes the contents of the two ranges. Exchanges are atomic with regards to concurrent file opera‐ tions. Implementations must guarantee that readers see either the old contents or the new contents in their entirety, even if the system fails. The system call parameters are conveyed in structures of the following form: struct xfs_exchange_range { __s32 file1_fd; __u32 pad; __u64 file1_offset; __u64 file2_offset; __u64 length; __u64 flags; }; The field pad must be zero. The fields file1_fd, file1_offset, and length define the first range of bytes to be exchanged. The fields file2_fd, file2_offset, and length define the second range of bytes to be exchanged. Both files must be from the same filesystem mount. If the two file descriptors represent the same file, the byte ranges must not overlap. Most disk-based filesystems require that the starts of both ranges must be aligned to the file block size. If this is the case, the ends of the ranges must also be so aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set. The field flags control the behavior of the exchange operation. XFS_EXCHANGE_RANGE_TO_EOF Ignore the length parameter. All bytes in file1_fd from file1_offset to EOF are moved to file2_fd, and file2's size is set to (file2_offset+(file1_length- file1_offset)). Meanwhile, all bytes in file2 from file2_offset to EOF are moved to file1 and file1's size is set to (file1_offset+(file2_length- file2_offset)). XFS_EXCHANGE_RANGE_DSYNC Ensure that all modified in-core data in both file ranges and all metadata updates pertaining to the exchange operation are flushed to persistent storage before the call returns. Opening either file de‐ scriptor with O_SYNC or O_DSYNC will have the same effect. XFS_EXCHANGE_RANGE_FILE1_WRITTEN Only exchange sub-ranges of file1_fd that are known to contain data written by application software. Each sub-range may be expanded (both upwards and downwards) to align with the file allocation unit. For files on the data device, this is one filesystem block. For files on the realtime device, this is the realtime extent size. This facility can be used to implement fast atomic scatter-gather writes of any complexity for software-defined storage targets if all writes are aligned to the file allocation unit. XFS_EXCHANGE_RANGE_DRY_RUN Check the parameters and the feasibility of the op‐ eration, but do not change anything. RETURN VALUE On error, -1 is returned, and errno is set to indicate the er‐ ror. ERRORS Error codes can be one of, but are not limited to, the follow‐ ing: EBADF file1_fd is not open for reading and writing or is open for append-only writes; or file2_fd is not open for reading and writing or is open for append-only writes. EINVAL The parameters are not correct for these files. This error can also appear if either file descriptor repre‐ sents a device, FIFO, or socket. Disk filesystems gen‐ erally require the offset and length arguments to be aligned to the fundamental block sizes of both files. EIO An I/O error occurred. EISDIR One of the files is a directory. ENOMEM The kernel was unable to allocate sufficient memory to perform the operation. ENOSPC There is not enough free space in the filesystem ex‐ change the contents safely. EOPNOTSUPP The filesystem does not support exchanging bytes between the two files. EPERM file1_fd or file2_fd are immutable. ETXTBSY One of the files is a swap file. EUCLEAN The filesystem is corrupt. EXDEV file1_fd and file2_fd are not on the same mounted filesystem. CONFORMING TO This API is XFS-specific. USE CASES Several use cases are imagined for this system call. In all cases, application software must coordinate updates to the file because the exchange is performed unconditionally. The first is a data storage program that wants to commit non- contiguous updates to a file atomically and coordinates write access to that file. This can be done by creating a temporary file, calling FICLONE(2) to share the contents, and staging the updates into the temporary file. The FULL_FILES flag is recom‐ mended for this purpose. The temporary file can be deleted or punched out afterwards. An example program might look like this: int fd = open("/some/file", O_RDWR); int temp_fd = open("/some", O_TMPFILE \| O_RDWR); ioctl(temp_fd, FICLONE, fd); / append 1MB of records / lseek(temp_fd, 0, SEEK_END); write(temp_fd, data1, 1000000); / update record index / pwrite(temp_fd, data1, 600, 98765); pwrite(temp_fd, data2, 320, 54321); pwrite(temp_fd, data2, 15, 0); / commit the entire update / struct xfs_exchange_range args = { .file1_fd = temp_fd, .flags = XFS_EXCHANGE_RANGE_TO_EOF, }; ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args); The second is a software-defined storage host (e.g. a disk jukebox) which implements an atomic scatter-gather write com‐ mand. Provided the exported disk's logical block size matches the file's allocation unit size, this can be done by creating a temporary file and writing the data at the appropriate offsets. It is recommended that the temporary file be truncated to the size of the regular file before any writes are staged to the temporary file to avoid issues with zeroing during EOF exten‐ sion. Use this call with the FILE1_WRITTEN flag to exchange only the file allocation units involved in the emulated de‐ vice's write command. The temporary file should be truncated or punched out completely before being reused to stage another write. An example program might look like this: int fd = open("/some/file", O_RDWR); int temp_fd = open("/some", O_TMPFILE \| O_RDWR); struct stat sb; int blksz; fstat(fd, &sb); blksz = sb.st_blksize; / land scatter gather writes between 100fsb and 500fsb / pwrite(temp_fd, data1, blksz 2, blksz * 100); pwrite(temp_fd, data2, blksz * 20, blksz * 480); pwrite(temp_fd, data3, blksz * 7, blksz * 257); /* commit the entire update / struct xfs_exchange_range args = { .file1_fd = temp_fd, .file1_offset = blksz 100, .file2_offset = blksz * 100, .length = blksz * 400, .flags = XFS_EXCHANGE_RANGE_FILE1_WRITTEN \| XFS_EXCHANGE_RANGE_FILE1_DSYNC, }; ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args); NOTES Some filesystems may limit the amount of data or the number of extents that can be exchanged in a single call. SEE ALSO ioctl(2) XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2) The reference implementation in XFS creates a new log incompat feature and log intent items to track high level progress of swapping ranges of two files and finish interrupted work if the system goes down. Sample code can be found in the corresponding changes to xfs_io to exercise the use case mentioned above. Note that this function is /not/ the O_DIRECT atomic untorn file writes concept that has also been floating around for years. It is also not the RWF_ATOMIC patchset that has been shared. This RFC is constructed entirely in software, which means that there are no limitations other than the general filesystem limits. As a side note, the original motivation behind the kernel functionality is online repair of file-based metadata. The atomic file content exchange is implemented as an atomic exchange of file fork mappings, which means that we can implement online reconstruction of extended attributes and directories by building a new one in another inode and exchanging the contents. Subsequent patchsets adapt the online filesystem repair code to use atomic file exchanges. This enables repair functions to construct a clean copy of a directory, xattr information, symbolic links, realtime bitmaps, and realtime summary information in a temporary inode. If this completes successfully, the new contents can be committed atomically into the inode being repaired. This is essential to avoid making corruption problems worse if the system goes down in the middle of running repair. For userspace, this series also includes the userspace pieces needed to test the new functionality, and a sample implementation of atomic file updates. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: enable logged file mapping exchange feature docs: update swapext -> exchmaps language xfs: capture inode generation numbers in the ondisk exchmaps log item xfs: support non-power-of-two rtextsize with exchange-range xfs: make file range exchange support realtime files xfs: condense symbolic links after a mapping exchange operation xfs: condense directories after a mapping exchange operation xfs: condense extended attributes after a mapping exchange operation xfs: add error injection to test file mapping exchange recovery xfs: bind together the front and back ends of the file range exchange code xfs: create deferred log items for file mapping exchanges xfs: introduce a file mapping exchange log intent item xfs: create a incompat flag for atomic file mapping exchanges xfs: introduce new file range exchange ioctl vfs: export remap and write check helpers
2024-04-16	Merge tag 'file-exchange-refactorings-6.10_2024-04-15' of ↵	Chandan Babu R	10	-94/+122
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: refactorings for atomic file content exchanges This series applies various cleanups and refactorings to file IO handling code ahead of the main series to implement atomic file content exchanges. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'file-exchange-refactorings-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: constify xfs_bmap_is_written_extent xfs: refactor non-power-of-two alignment checks xfs: hoist multi-fsb allocation unit detection to a helper xfs: create a new helper to return a file's allocation unit xfs: declare xfs_file.c symbols in xfs_file.h xfs: move xfs_iops.c declarations out of xfs_inode.h xfs: move inode lease breaking functions to xfs_inode.c
2024-04-16	Merge tag 'log-incompat-permissions-6.10_2024-04-15' of ↵	Chandan Babu R	22	-121/+160
	https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA xfs: improve log incompat feature handling This patchset improves the performance of log incompat feature bit handling by making a few changes to how the filesystem handles them. First, we now only clear the bits during a clean unmount to reduce calls to the (expensive) upgrade function to once per bit per mount. Second, we now only allow incompat feature upgrades for sysadmins or if the sysadmin explicitly allows it via mount option. Currently the only log incompat user is logged xattrs, which requires CONFIG_XFS_DEBUG=y, so there should be no user visible impact to this change. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'log-incompat-permissions-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: only clear log incompat flags at clean unmount xfs: fix error bailout in xrep_abt_build_new_trees xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode xfs: pass xfs_buf lookup flags to xfs_*read_agi
2024-04-15	xfs: unlock new repair tempfiles after creationretain-ilock-during-dir-ops-6.10_2024-04-15 retain-ilock-during-dir-ops-6.10	Darrick J. Wong	1	-0/+2
	After creation, drop the ILOCK on temporary files that have been created to stage a repair. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: don't pick up IOLOCK during rmapbt repair scan	Darrick J. Wong	1	-15/+1
	Now that we've fixed the directory operations to hold the ILOCK until they're finished with rmapbt updates for directory shape changes, we no longer need to take this lock when scanning directories for rmapbt records. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: Hold inode locks in xfs_rename	Allison Henderson	1	-12/+33
	Modify xfs_rename to hold all inode locks across a rename operation We will need this later when we add parent pointers Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: Hold inode locks in xfs_trans_alloc_dir	Allison Henderson	2	-4/+19
	Modify xfs_trans_alloc_dir to hold locks after return. Caller will be responsible for manual unlock. We will need this later to hold locks across parent pointer operations Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: Hold inode locks in xfs_ialloc	Allison Henderson	3	-6/+16
	Modify xfs_ialloc to hold locks after return. Caller will be responsible for manual unlock. We will need this later to hold locks across parent pointer operations Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> [djwong: hold the parent ilocked across transaction rolls too] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	docs: describe xfs directory tree online fsckonline-fsck-design-6.10_2024-04-15 online-fsck-design-6.10	Darrick J. Wong	1	-0/+124
	I've added a scrubber that checks the directory tree structure and fixes them; describe this in the design documentation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: Increase XFS_QM_TRANS_MAXDQS to 5	Allison Henderson	4	-6/+53
	With parent pointers enabled, a rename operation can update up to 5 inodes: src_dp, target_dp, src_ip, target_ip and wip. This causes their dquots to a be attached to the transaction chain, so we need to increase XFS_QM_TRANS_MAXDQS. This patch also add a helper function xfs_dqlockn to lock an arbitrary number of dquots. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	docs: update offline parent pointer repair strategy	Darrick J. Wong	1	-21/+60
	Now update how xfs_repair checks and repairs parent pointer info. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: Increase XFS_DEFER_OPS_NR_INODES to 5	Allison Henderson	4	-11/+32
	Renames that generate parent pointer updates can join up to 5 inodes locked in sorted order. So we need to increase the number of defer ops inodes and relock them in the same way. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> [djwong: have one sorting function] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: fix performance problems when fstrimming a subset of a fragmented AGdiscard-relax-locks-6.10_2024-04-15 discard-relax-locks-6.10	Darrick J. Wong	1	-60/+93
	On a 10TB filesystem where the free space in each AG is heavily fragmented, I noticed some very high runtimes on a FITRIM call for the entire filesystem. xfs_scrub likes to report progress information on each phase of the scrub, which means that a strace for the entire filesystem: ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839> shows that scrub is uncommunicative for the entire duration. Reducing the size of the FITRIM requests to a single AG at a time produces lower times for each individual call, but even this isn't quite acceptable, because the time between progress reports are still very high: Strace for the first 4x 1TB AGs looks like (2): ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033> ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323> ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226> ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744> I then had the idea to limit the length parameter of each call to a smallish amount (~11GB) so that we could report progress relatively quickly, but much to my surprise, each FITRIM call still took ~68 seconds! Unfortunately, the by-length fstrim implementation handles this poorly because it walks the entire free space by length index (cntbt), which is a very inefficient way to walk a subset of the blocks of an AG. Therefore, create a second implementation that will walk the bnobt and perform the trims in block number order. This implementation avoids the worst problems of the original code, though it lacks the desirable attribute of freeing the biggest chunks first. On the other hand, this second implementation will be much easier to constrain the system call latency, and makes it much easier to report fstrim progress to anyone who's running xfs_scrub. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com
2024-04-15	xfs: create subordinate scrub contexts for xchk_metadata_inode_subtypeinode-repair-improvements-6.10_2024-04-15 inode-repair-improvements-6.10	Darrick J. Wong	4	-73/+91
	When a file-based metadata structure is being scrubbed in xchk_metadata_inode_subtype, we should create an entirely new scrub context so that each scrubber doesn't trip over another's buffers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	docs: update online directory and parent pointer repair sections	Darrick J. Wong	1	-26/+29
	Update the case studies of online directory and parent pointer reconstruction to reflect what they actually do in the final version. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	docs: update the parent pointers documentation to the final version	Darrick J. Wong	1	-41/+53
	Now that we've decided on the ondisk format of parent pointers, update the documentation to reflect that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: pin inodes that would otherwise overflow link count	Darrick J. Wong	5	-26/+36
	The VFS inc_nlink function does not explicitly check for integer overflows in the i_nlink field. Instead, it checks the link count against s_max_links in the vfs_{link,create,rename} functions. XFS sets the maximum link count to 2.1 billion, so integer overflows should not be a problem. However. It's possible that online repair could find that a file has more than four billion links, particularly if the link count got corrupted while creating hardlinks to the file. The di_nlinkv2 field is not large enough to store a value larger than 2^32, so we ought to define a magic pin value of ~0U which means that the inode never gets deleted. This will prevent a UAF error if the repair finds this situation and users begin deleting links to the file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: try to avoid allocating from sick inode clusters	Darrick J. Wong	1	-0/+40
	I noticed that xfs/413 and xfs/375 occasionally failed while fuzzing core.mode of an inode. The root cause of these problems is that the field we fuzzed (core.mode or core.magic, typically) causes the entire inode cluster buffer verification to fail, which affects several inodes at once. The repair process tries to create either a /lost+found or a temporary repair file, but regrettably it picks the same inode cluster that we just corrupted, with the result that repair triggers the demise of the filesystem. Try avoid this by making the inode allocation path detect when the perag health status indicates that someone has found bad inode cluster buffers, and try to read the inode cluster buffer. If the cluster buffer fails the verifiers, try another AG. This isn't foolproof and can result in premature ENOSPC, but that might be better than shutting down. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: check unused nlink fields in the ondisk inode	Darrick J. Wong	2	-0/+20
	v2/v3 inodes use di_nlink and not di_onlink; and v1 inodes use di_onlink and not di_nlink. Whichever field is not in use, make sure its contents are zero, and teach xfs_scrub to fix that if it is. This clears a bunch of missing scrub failure errors in xfs/385 for core.onlink. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: repair AGI unlinked inode bucket listsrepair-iunlink-6.10_2024-04-15 repair-iunlink-6.10	Darrick J. Wong	3	-4/+1074
	Teach the AGI repair code to rebuild the unlinked buckets and lists. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: online repair of symbolic linksrepair-symlink-6.10_2024-04-15 repair-symlink-6.10	Darrick J. Wong	7	-2/+587
	If a symbolic link target looks bad, try to sift through the rubble to find as much of the target buffer that we can, and stage a new target (short or remote format as needed) in a temporary file and use the atomic extent swapping mechanism to commit the results. In the worst case, we replace the target with an overly long filename that cannot possibly resolve. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: hoist AGI repair context to a heap object	Darrick J. Wong	1	-42/+63
	Save ~460 bytes of stack space by moving all the repair context to a heap object. We're going to add even more context data in the next patch, which is why we really need to do this now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: check AGI unlinked inode buckets	Darrick J. Wong	3	-1/+42
	Look for corruptions in the AGI unlinked bucket chains. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: ensure dentry consistency when the orphanage adopts a filerepair-orphanage-6.10_2024-04-15 repair-orphanage-6.10	Darrick J. Wong	2	-0/+133
	When the orphanage adopts a file, that file becomes a child of the orphanage. The dentry cache may have entries for the orphanage directory and the name we've chosen, so (1) make sure we abort if the dcache has a positive entry because something's not right; and (2) invalidate and purge negative dentries if the adoption goes through. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: pass the owner to xfs_symlink_write_target	Darrick J. Wong	3	-6/+6
	Require callers of xfs_symlink_write_target to pass the owner number explicitly. This sets us up for online repair to be able to write a remote symlink target to sc->tempip with sc->ip's inumber in the block heaader. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: move files to orphanage instead of letting nlinks drop to zero	Darrick J. Wong	7	-19/+163
	If we encounter an inode with a nonzero link count but zero observed links, move it to the orphanage. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: expose xfs_bmap_local_to_extents for online repair	Darrick J. Wong	4	-7/+16
	Allow online repair to call xfs_bmap_local_to_extents and add a void * argument at the end so that online repair can pass its own context. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: ask the dentry cache if it knows the parent of a directoryrepair-dirs-6.10_2024-04-15 repair-dirs-6.10	Darrick J. Wong	5	-1/+81
	It's possible that the dentry cache can tell us the parent of a directory. Therefore, when repairing directory dot dot entries, query the dcache as a last resort before scanning the entire filesystem. A reviewer asks: "How high is the chance that we actually have a valid dcache entry for a file in a corrupted directory?" There's a decent chance of this actually working. Say you have a 1000-block directory foo, and block 980 gets corrupted. Let's further suppose that block 0 has a correct entry for ".." and "bar". If someone accesses /mnt/foo/bar, that will cause the dcache to create a dentry from /mnt to /mnt/foo whose d_parent points back to /mnt. If you then want to rebuild the directory, XFS can obtain the parent from the dcache without needing to wander into parent pointers or scan the filesystem to find /mnt's connection to foo. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: move orphan files to the orphanage	Darrick J. Wong	11	-20/+844
	When we're repairing a directory structure or fixing the dotdot entry of a subdirectory, it's possible that we won't ever find a parent for the subdirectory. When this is the case, move it to the orphanage, aka /lost+found. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: online repair of parent pointers	Darrick J. Wong	6	-1/+238
	Teach the online repair code to fix parent pointers for directories. For now, this means correcting the dotdot entry of an existing directory that is otherwise consistent. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: scan the filesystem to repair a directory dotdot entry	Darrick J. Wong	7	-24/+528
	Teach the online directory repair code to scan the filesystem so that we can set the dotdot entry when we're rebuilding a directory. This involves dropping ILOCK on the directory that we're repairing, which means that the VFS can sneak in and tell us to update dotdot at any time. Deal with these races by using a dirent hook to absorb dotdot updates, and be careful not to check the scan results until after we've retaken the ILOCK. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: update the unlinked list when repairing link countsrepair-unlinked-inode-state-6.10_2024-04-15 repair-unlinked-inode-state-6.10	Darrick J. Wong	1	-9/+33
	When we're repairing the link counts of a file, we must ensure either that the file has zero link count and is on the unlinked list; or that it has nonzero link count and is not on the unlinked list. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: online repair of directories	Darrick J. Wong	15	-2/+1563
	If a directory looks like it's in bad shape, try to sift through the rubble to find whatever directory entries we can, scan the directory tree for the parent (if needed), stage the new directory contents in a temporary file and use the atomic extent swapping mechanism to commit the results in bulk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: inactivate directory data blocks	Darrick J. Wong	1	-0/+51
	Teach inode inactivation to delete all the incore buffers backing a directory. In normal runtime this should never happen because the VFS forbids rmdir on a non-empty directory. In the next patch, online directory repair stands up a new directory, exchanges it with the broken directory, and then drops the private temporary directory. If we cancel the repair just prior to exchanging the directory contents, the new directory will need to be torn down. Note: If we commit the repair, reaping will take care of all the ondisk space allocations and incore buffers for the old corrupt directory. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: create an xattr iteration function for scrubrepair-xattrs-6.10_2024-04-15 repair-xattrs-6.10	Darrick J. Wong	5	-78/+414
	Create a streamlined function to walk a file's xattrs, without all the cursor management stuff in the regular listxattr. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: ensure unlinked list state is consistent with nlink during scrub	Darrick J. Wong	4	-4/+67
	Now that we have the means to tell if an inode is on an unlinked inode list or not, we can check that an inode with zero link count is on the unlinked list; and an inode that has nonzero link count is not on that list. Make repair clean things up too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: flag empty xattr leaf blocks for optimization	Darrick J. Wong	2	-0/+13
	Empty xattr leaf blocks at offset zero are a waste of space but otherwise harmless. If we encounter one, flag it as an opportunity for optimization. If we encounter empty attr leaf blocks anywhere else in the attr fork, that's corruption. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: scrub should set preen if attr leaf has holes	Darrick J. Wong	4	-0/+20
	If an attr block indicates that it could use compaction, set the preen flag to have the attr fork rebuilt, since the attr fork rebuilder can take care of that for us. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15	xfs: repair extended attributes	Darrick J. Wong	19	-4/+1436
	If the extended attributes look bad, try to sift through the rubble to find whatever keys/values we can, stage a new attribute structure in a temporary file and use the atomic extent swapping mechanism to commit the results in bulk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>