path: root/fs
Age | Commit message | Author | Files | Lines
2021-11-03 | Merge branch 'per_signal_struct_coredumps-for-v5.16' of ↵ | Linus Torvalds | 5 | -91/+23
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull per signal_struct coredumps from Eric Biederman:
"Current coredumps are mixed up with the exit code, the signal handling code, and the ptrace code, making coredumps much more complicated than necessary and difficult to follow.
This series of changes starts with ptrace_stop and cleans it up, making it easier to follow what is happening in ptrace_stop. Then cleans up the exec interactions with coredumps. Then cleans up the coredump interactions with exit. Finally the coredump interactions with the signal handling code are cleaned up.
The first and last changes are bug fixes for minor bugs. I believe the fact that vfork followed by execve can kill the process that called vfork if exec fails is sufficient justification to change the userspace visible behavior.
In previous discussions some of these changes were organized differently and individually appeared to make the code base worse. As currently written I believe they all stand on their own as cleanups and bug fixes. Which means that even if the worst should happen and the last change needs to be reverted for some unimaginable reason, the code base will still be improved.
If the worst does not happen there are more cleanups that can be made. Signals that generate coredumps can easily become eligible for short circuit delivery in complete_signal. The entire rendezvous for generating a coredump can move into get_signal. The function force_sig_info_to_task can be written in a way that does not modify the signal handling state of the target task (because coredumps are eligible for short circuit delivery).
Many of these future cleanups can be done another way but nothing so cleanly as if coredumps become per signal_struct" * 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: coredump: Limit coredumps to a single thread group coredump: Don't perform any cleanups before dumping core exit: Factor coredump_exit_mm out of exit_mm exec: Check for a pending fatal signal instead of core_state ptrace: Remove the unnecessary arguments from arch_ptrace_stop signal: Remove the bogus sigkill_pending in ptrace_stop
2021-11-03 | Merge tag 'jfs-5.16' of git://github.com/kleikamp/linux-shaggy | Linus Torvalds | 1 | -29/+22
Pull jfs fix from David Kleikamp: "Just one JFS patch" * tag 'jfs-5.16' of git://github.com/kleikamp/linux-shaggy: JFS: fix memleak in jfs_mount
2021-11-02 | Merge tag 'xfs-5.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux | Linus Torvalds | 88 | -900/+1649
Pull xfs updates from Darrick Wong: "This cycle we've worked on fixing bugs and improving XFS' memory footprint. The most notable fixes include: fixing a corruption warning (and free space accounting skew) if copy on write fails; fixing slab cache misuse if SLOB is enabled, which apparently was broken for years without anybody noticing; and fixing a potential race with online shrinkfs. Otherwise, the bulk of the changes here involve setting up separate slab caches for frequently used items such as btree cursors and log intent items, and compacting the structures to reduce memory usage of those items substantially. This also sets us up to support larger btrees in future kernels. We also switch parts of online fsck to allocate scrub context information from the heap instead of using stack space. Summary: - Bug fixes and cleanups for kernel memory allocation usage, this time without touching the mm code. - Refactor the log recovery mechanism that preserves held resources across a transaction roll so that it uses the exact same mechanism that we use for that during regular runtime. - Fix bugs and tighten checking around btree heights. - Remove more old typedefs. - Fix perag reference leaks when racing with growfs. - Remove unused fields from xfs_btree_cur. - Allocate various scrub structures on the heap to reduce stack usage. - Pack xfs_btree_cur fields and rearrange to support arbitrary heights. - Compute maximum possible heights for each btree height, and use that to set up slab caches for each btree type. - Finally remove kmem_zone_t, since these have always been struct kmem_cache on Linux. - Compact the structures used to coordinate work intent items. - Set up slab caches for each work intent item type. - Rename the "bmap_add_free" function to "free_extent_later", which more accurately describes what it does. - Fix corruption warning on unmount when a CoW preallocation covers a data fork delalloc reservation but then the CoW fails. 
- Add some more minor code improvements" * tag 'xfs-5.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits) xfs: use swap() to make code cleaner xfs: Remove duplicated include in xfs_super xfs: punch out data fork delalloc blocks on COW writeback failure xfs: remove unused parameter from refcount code xfs: reduce the size of struct xfs_extent_free_item xfs: rename xfs_bmap_add_free to xfs_free_extent_later xfs: create slab caches for frequently-used deferred items xfs: compact deferred intent item structures xfs: rename _zone variables to _cache xfs: remove kmem_zone typedef xfs: use separate btree cursor cache for each btree type xfs: compute absolute maximum nlevels for each btree type xfs: kill XFS_BTREE_MAXLEVELS xfs: compute the maximum height of the rmap btree when reflink enabled xfs: clean up xfs_btree_{calc_size,compute_maxlevels} xfs: compute maximum AG btree height for critical reservation calculation xfs: rename m_ag_maxlevels to m_allocbt_maxlevels xfs: dynamically allocate cursors based on maxlevels xfs: encode the max btree height in the cursor xfs: refactor btree cursor allocation function ...
2021-11-02 | Merge tag 'afs-next-20211102' of ↵ | Linus Torvalds | 4 | -28/+27
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS updates from David Howells: - Split the readpage handler for symlinks from the one for files. The symlink readpage isn't given a file pointer, so the handling has to be special-cased. This has been posted as part of a patchset to foliate netfs, afs, etc.[1] but I've moved it to this one as it's not actually doing foliation but is more of a pre-cleanup. - Fix file creation to set the mtime from the client's clock to keep make happy if the server's clock isn't quite in sync.[2] Link: https://lore.kernel.org/r/163005742570.2472992.7800423440314043178.stgit@warthog.procyon.org.uk/ [1] Link: http://lists.infradead.org/pipermail/linux-afs/2021-October/004395.html [2] * tag 'afs-next-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Set mtime from the client for yfs create operations afs: Sort out symlink reading
2021-11-02 | Merge tag 'gfs2-v5.15-rc5-fixes' of ↵ | Linus Torvalds | 11 | -136/+186
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Fix a locking order inversion between the inode and iopen glocks in gfs2_inode_lookup. - Implement proper queuing of glock holders for glocks that require instantiation (like reading an inode or bitmap blocks from disk). Before, multiple glock holders could race with each other and half-initialized objects could be exposed; the GL_SKIP flag further exacerbated this problem. - Fix a rare deadlock between inode lookup / creation and remote delete work. - Fix a rare scheduling-while-atomic bug in dlm during glock hash table walks. - Various other minor fixes and cleanups. * tag 'gfs2-v5.15-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (21 commits) gfs2: Fix unused value warning in do_gfs2_set_flags() gfs2: check context in gfs2_glock_put gfs2: Fix glock_hash_walk bugs gfs2: Cancel remote delete work asynchronously gfs2: set glock object after nq gfs2: remove RDF_UPTODATE flag gfs2: Eliminate GIF_INVALID flag gfs2: fix GL_SKIP node_scope problems gfs2: split glock instantiation off from do_promote gfs2: further simplify do_promote gfs2: re-factor function do_promote gfs2: Remove 'first' trace_gfs2_promote argument gfs2: change go_lock to go_instantiate gfs2: dump glocks from gfs2_consist_OBJ_i gfs2: dequeue iopen holder in gfs2_inode_lookup error gfs2: Save ip from gfs2_glock_nq_init gfs2: Allow append and immutable bits to coexist gfs2: Switch some BUG_ON to GLOCK_BUG_ON for debug gfs2: move GL_SKIP check from glops to do_promote gfs2: Add GL_SKIP holder flag to dump_holder ...
2021-11-02 | Merge tag 'gfs2-v5.15-rc5-mmap-fault' of ↵ | Linus Torvalds | 17 | -190/+544
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher: "Functions gfs2_file_read_iter and gfs2_file_write_iter are both accessing the user buffer to write to or read from while holding the inode glock. In the most basic deadlock scenario, that buffer will not be resident and it will be mapped to the same file. Accessing the buffer will trigger a page fault, and gfs2 will deadlock trying to take the same inode glock again while trying to handle that fault. Fix that and similar, more complex scenarios by disabling page faults while accessing user buffers. To make this work, introduce a small amount of new infrastructure and fix some bugs that didn't trigger so far, with page faults enabled" * tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix mmap + page fault deadlocks for direct I/O iov_iter: Introduce nofault flag to disable page faults gup: Introduce FOLL_NOFAULT flag to disable page faults iomap: Add done_before argument to iomap_dio_rw iomap: Support partial direct I/O on user copy failures iomap: Fix iomap_dio_rw return value for user copies gfs2: Fix mmap + page fault deadlocks for buffered I/O gfs2: Eliminate ip->i_gh gfs2: Move the inode glock locking to gfs2_file_buffered_write gfs2: Introduce flag for glock holder auto-demotion gfs2: Clean up function may_grant gfs2: Add wrapper for iomap_file_buffered_write iov_iter: Introduce fault_in_iov_iter_writeable iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} powerpc/kvm: Fix kvm_use_magic_page iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
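The "most basic deadlock scenario" described above can be sketched from userspace. This is a minimal illustration, not a gfs2 test case: the function name and temporary file path are made up for the example, and on a filesystem without the bug (or with the fix) the read simply succeeds. The point is only the shape of the access: read() is handed a buffer that is a not-yet-resident shared mapping of the same file.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: read page 0 of a file into a buffer that is a shared
 * mapping of page 1 of the same file. The mapping has not been touched, so
 * copying into it inside the read path triggers a page fault on that file.
 * Returns 0 if the read completed and the data arrived, -1 otherwise.
 */
int self_mapped_read_demo(void)
{
    char path[] = "/tmp/self-mapped-read-XXXXXX";  /* assumed scratch location */
    long page = sysconf(_SC_PAGESIZE);
    char *src = NULL;
    char *buf = MAP_FAILED;
    int rc = -1;
    int fd;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* the file lives until fd is closed */

    /* Two pages: page 0 holds the data, page 1 backs the read buffer. */
    if (ftruncate(fd, 2 * page) != 0)
        goto out;

    src = malloc((size_t)page);
    if (src == NULL)
        goto out;
    memset(src, 'A', (size_t)page);
    if (pwrite(fd, src, (size_t)page, 0) != page)
        goto out;

    buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, page);
    if (buf == MAP_FAILED)
        goto out;

    /* The destination buffer is backed by the file being read. */
    if (pread(fd, buf, (size_t)page, 0) == page &&
        buf[0] == 'A' && buf[page - 1] == 'A')
        rc = 0;

out:
    if (buf != MAP_FAILED)
        munmap(buf, (size_t)page);
    free(src);
    close(fd);
    return rc;
}
```

Calling self_mapped_read_demo() from a small main() on the filesystem under test is enough to exercise the pattern; a hang rather than a return value would be the symptom the pull request describes.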
2021-11-02 | afs: Set mtime from the client for yfs create operations | Marc Dionne | 1 | -19/+13
For operations that create vnodes on the server such as CreateFile, MakeDir or Symlink, the server will store its own current time as the mtime if the client doesn't pass in a time in the accompanying StoreStatus structure. If the server and client clocks are not well synchronized, the client may see timestamps in the future or inconsistent dependency checks with "make" for files that are not modified after creation: make[2]: Warning: File 'arch/x86/kernel/apic/modules.order' has modification time 0.14 s in the future make[2]: warning: Clock skew detected. Your build may be incomplete. This is already handled correctly for non yfs operations; also set the mtime for the corresponding yfs operations. Changes: v3: Replace S_IRWXUGO with 0777, per checkpatch v2: [dhowells] Merge the two xdr_encode_YFSStoreStatus*() functions together Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: http://lists.infradead.org/pipermail/linux-afs/2021-October/004395.html
2021-11-02 | afs: Sort out symlink reading | David Howells | 3 | -9/+14
afs_readpage() doesn't get a file pointer when called for a symlink, so separate it from regular file pointer handling. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Link: https://lore.kernel.org/r/162687508008.276387.6418924257569297305.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/162981152280.1901565.2264055504466731917.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/163005742570.2472992.7800423440314043178.stgit@warthog.procyon.org.uk/ # v2
2021-11-01 | Merge tag 'audit-pr-20211101' of ↵ | Linus Torvalds | 1 | -0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Add some additional audit logging to capture the openat2() syscall open_how struct info. Previous variations of the open()/openat() syscalls allowed audit admins to inspect the syscall args to get the information contained in the new open_how struct used in openat2()" * tag 'audit-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: return early if the filter rule has a lower priority audit: add OPENAT2 record to list "how" info audit: add support for the openat2 syscall audit: replace magic audit syscall class numbers with macros lsm_audit: avoid overloading the "key" audit field audit: Convert to SPDX identifier audit: rename struct node to struct audit_node to prevent future name collisions
2021-11-01 | Merge tag 'selinux-pr-20211101' of ↵ | Linus Torvalds | 5 | -11/+99
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: - Add LSM/SELinux/Smack controls and auditing for io-uring. As usual, the individual commit descriptions have more detail, but we were basically missing two things which we're adding here: + establishment of a proper audit context so that auditing of io-uring ops works similarly to how it does for syscalls (with some io-uring additions because io-uring ops are *not* syscalls) + additional LSM hooks to enable access control points for some of the more unusual io-uring features, e.g. credential overrides. The additional audit callouts and LSM hooks were done in conjunction with the io-uring folks, based on conversations and RFC patches earlier in the year. - Fixup the binder credential handling so that the proper credentials are used in the LSM hooks; the commit description and the code comment which is removed in these patches are helpful to understand the background and why this is the proper fix. - Enable SELinux genfscon policy support for securityfs, allowing improved SELinux filesystem labeling for other subsystems which make use of securityfs, e.g. IMA. 
* tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: security: Return xattr name from security_dentry_init_security() selinux: fix a sock regression in selinux_ip_postroute_compat() binder: use cred instead of task for getsecid binder: use cred instead of task for selinux checks binder: use euid from cred instead of using task LSM: Avoid warnings about potentially unused hook variables selinux: fix all of the W=1 build warnings selinux: make better use of the nf_hook_state passed to the NF hooks selinux: fix race condition when computing ocontext SIDs selinux: remove unneeded ipv6 hook wrappers selinux: remove the SELinux lockdown implementation selinux: enable genfscon labeling for securityfs Smack: Brutalist io_uring support selinux: add support for the io_uring access controls lsm,io_uring: add LSM hooks to io_uring io_uring: convert io_uring to the secure anon inode interface fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure() audit: add filtering for io_uring records audit,io_uring,io-wq: add some basic audit support to io_uring audit: prepare audit_context for use in calling contexts beyond syscalls
2021-11-01 | Merge tag 'trace-v5.16' of ↵ | Linus Torvalds | 1 | -1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- kprobes: Restructured stack unwinder to show properly on x86 when a stack dump happens from a kretprobe callback.
- Fix to bootconfig parsing.
- Have tracefs allow owner and group permissions by default (only denying others). There's been pressure to allow non root to tracefs in a controlled fashion, and using groups is probably the safest.
- Bootconfig memory management updates.
- Bootconfig clean up to have the tools directory be less dependent on changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function tracer instead of having its own trampoline (this change will happen on an arch by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) to be batched together in one synchronization.
- Allow histogram triggers to add variables that can perform calculations against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace trampoline to allow for proper parameter prototypes and prevent warnings from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over if branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
* tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits) tracing/histogram: Fix semicolon.cocci warnings tracing/histogram: Fix documentation inline emphasis warning tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together tracing: Show size of requested perf buffer bootconfig: Initialize ret in xbc_parse_tree() ftrace: do CPU checking after preemption disabled ftrace: disable preemption when recursion locked tracing/histogram: Document expression arithmetic and constants tracing/histogram: Optimize division by a power of 2 tracing/histogram: Covert expr to const if both operands are constants tracing/histogram: Simplify handling of .sym-offset in expressions tracing: Fix operator precedence for hist triggers expression tracing: Add division and multiplication support for hist triggers tracing: Add support for creating hist trigger variables from literal selftests/ftrace: Stop tracing while reading the trace file by default MAINTAINERS: Update KPROBES and TRACING entries test_kprobes: Move it from kernel/ to lib/ docs, kprobes: Remove invalid URL and add new reference samples/kretprobes: Fix return value if register_kretprobe() failed lib/bootconfig: Fix the xbc_get_info kerneldoc ...
2021-11-01 | Merge tag 'kspp-misc-fixes-5.16-rc1' of ↵ | Linus Torvalds | 3 | -8/+7
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull hardening fixes and cleanups from Gustavo A. R. Silva: "Various hardening fixes and cleanups that I've been collecting during the last development cycle: Fix -Wcast-function-type error: - firewire: Remove function callback casts (Oscar Carter) Fix application of sizeof operator: - firmware/psci: fix application of sizeof to pointer (jing yangyang) Replace open coded instances with size_t saturating arithmetic helpers: - assoc_array: Avoid open coded arithmetic in allocator arguments (Len Baker) - writeback: prefer struct_size over open coded arithmetic (Len Baker) - aio: Prefer struct_size over open coded arithmetic (Len Baker) - dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic (Len Baker) Flexible array transformation: - KVM: PPC: Replace zero-length array with flexible array member (Len Baker) Use 2-factor argument multiplication form: - nouveau/svm: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva) - xfs: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva)" * tag 'kspp-misc-fixes-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: firewire: Remove function callback casts nouveau/svm: Use kvcalloc() instead of kvzalloc() firmware/psci: fix application of sizeof to pointer dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic KVM: PPC: Replace zero-length array with flexible array member aio: Prefer struct_size over open coded arithmetic writeback: prefer struct_size over open coded arithmetic xfs: Use kvcalloc() instead of kvzalloc() assoc_array: Avoid open coded arithmetic in allocator arguments
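The "prefer struct_size over open coded arithmetic" changes listed above replace expressions like sizeof(*p) + n * sizeof(p->items[0]) with a helper that saturates instead of wrapping. The following is a userspace sketch of that idea only: flex_size(), struct packet and packet_alloc() are hypothetical names for illustration and are not the kernel's struct_size() implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical structure with a trailing flexible array member. */
struct packet {
    size_t   count;
    uint32_t items[];
};

/*
 * Illustrative stand-in for a saturating size helper: computes
 * header + elem * n and returns SIZE_MAX if that would overflow,
 * instead of silently wrapping to a small value.
 */
static size_t flex_size(size_t header, size_t elem, size_t n)
{
    if (elem != 0 && n > (SIZE_MAX - header) / elem)
        return SIZE_MAX;
    return header + elem * n;
}

/* Allocate a packet with room for n items; fails cleanly on overflow. */
struct packet *packet_alloc(size_t n)
{
    size_t bytes = flex_size(sizeof(struct packet), sizeof(uint32_t), n);
    struct packet *p;

    if (bytes == SIZE_MAX)
        return NULL;            /* a wrapped size would under-allocate */

    p = calloc(1, bytes);
    if (p != NULL)
        p->count = n;
    return p;
}
```

With open-coded arithmetic, a huge n can wrap the multiplication and produce a small allocation that later writes overrun; the saturating form turns that case into an allocation failure.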
2021-11-01 | Merge tag 'overflow-v5.16-rc1' of ↵ | Linus Torvalds | 2 | -8/+6
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull overflow updates from Kees Cook: "The end goal of the current buffer overflow detection work[0] is to gain full compile-time and run-time coverage of all detectable buffer overflows seen via array indexing or memcpy(), memmove(), and memset(). The str*() family of functions already have full coverage. While much of the work for these changes have been on-going for many releases (i.e. 0-element and 1-element array replacements, as well as avoiding false positives and fixing discovered overflows[1]), this series contains the foundational elements of several related buffer overflow detection improvements by providing new common helpers and FORTIFY_SOURCE changes needed to gain the introspection required for compiler visibility into array sizes. Also included are a handful of already Acked instances using the helpers (or related clean-ups), with many more waiting at the ready to be taken via subsystem-specific trees[2]. The new helpers are: - struct_group() for gaining struct member range introspection - memset_after() and memset_startat() for clearing to the end of structures - DECLARE_FLEX_ARRAY() for using flex arrays in unions or alone in structs Also included is the beginning of the refactoring of FORTIFY_SOURCE to support memcpy() introspection, fix missing and regressed coverage under GCC, and to prepare to fix the currently broken Clang support. Finishing this work is part of the larger series[0], but depends on all the false positives and buffer overflow bug fixes to have landed already and those that depend on this series to land. As part of the FORTIFY_SOURCE refactoring, a set of both a compile-time and run-time tests are added for FORTIFY_SOURCE and the mem*()-family functions respectively. The compile time tests have found a legitimate (though corner-case) bug[6] already. 
Please note that the appearance of "panic" and "BUG" in the FORTIFY_SOURCE refactoring is the result of relocating existing code, and no new use of those code-paths is expected nor desired.
Finally, there are two tree-wide conversions for 0-element arrays and flexible array unions to gain sane compiler introspection coverage that result in no known object code differences.
After this series (and the changes that have now landed via netdev and usb), we are very close to finally being able to build with -Warray-bounds and -Wzero-length-bounds. However, due to corner cases in GCC[3] and Clang[4], I have not included the last two patches that turn on these options, as I don't want to introduce any known warnings to the build. Hopefully these can be solved soon"
Link: https://lore.kernel.org/lkml/20210818060533.3569517-1-keescook@chromium.org/ [0]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=FORTIFY_SOURCE [1]
Link: https://lore.kernel.org/lkml/202108220107.3E26FE6C9C@keescook/ [2]
Link: https://lore.kernel.org/lkml/3ab153ec-2798-da4c-f7b1-81b0ac8b0c5b@roeck-us.net/ [3]
Link: https://bugs.llvm.org/show_bug.cgi?id=51682 [4]
Link: https://lore.kernel.org/lkml/202109051257.29B29745C0@keescook/ [5]
Link: https://lore.kernel.org/lkml/20211020200039.170424-1-keescook@chromium.org/ [6]
* tag 'overflow-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits) fortify: strlen: Avoid shadowing previous locals compiler-gcc.h: Define __SANITIZE_ADDRESS__ under hwaddress sanitizer treewide: Replace 0-element memcpy() destinations with flexible arrays treewide: Replace open-coded flex arrays in unions stddef: Introduce DECLARE_FLEX_ARRAY() helper btrfs: Use memset_startat() to clear end of struct string.h: Introduce memset_startat() for wiping trailing members and padding xfrm: Use memset_after() to clear padding string.h: Introduce memset_after() for wiping trailing members/padding lib: Introduce 
CONFIG_MEMCPY_KUNIT_TEST fortify: Add compile-time FORTIFY_SOURCE tests fortify: Allow strlen() and strnlen() to pass compile-time known lengths fortify: Prepare to improve strnlen() and strlen() warnings fortify: Fix dropped strcpy() compile-time write overflow check fortify: Explicitly disable Clang support fortify: Move remaining fortify helpers into fortify-string.h lib/string: Move helper functions out of string.c compiler_types.h: Remove __compiletime_object_size() cm4000_cs: Use struct_group() to zero struct cm4000_dev region can: flexcan: Use struct_group() to zero struct flexcan_regs regions ...
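The memset_after() helper named in the overflow pull above clears everything that follows a given member, up to the end of the structure. A rough userspace sketch of that idea follows; demo_memset_after(), struct session and memset_after_demo() are hypothetical names for this example, the macro is not the kernel's definition, and it relies on the GCC/Clang __typeof__ extension.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical structure: the leading members must survive a reset. */
struct session {
    int  id;
    char name[8];
    int  retries;   /* cleared on reset */
    int  flags;     /* cleared on reset */
};

/*
 * Illustrative stand-in for the idea behind memset_after(): fill every byte
 * that follows `member`, up to the end of the object, padding included.
 * The range is derived from the type, so it tracks layout changes.
 */
#define demo_memset_after(obj, v, member)                                  \
    do {                                                                   \
        size_t off_ = offsetof(__typeof__(*(obj)), member) +               \
                      sizeof((obj)->member);                               \
        memset((unsigned char *)(obj) + off_, (v), sizeof(*(obj)) - off_); \
    } while (0)

/* Returns 1 if only the members after `name` were cleared. */
int memset_after_demo(void)
{
    struct session s = { .id = 7, .name = "build", .retries = 3, .flags = 0x10 };

    demo_memset_after(&s, 0, name);

    return s.id == 7 && strcmp(s.name, "build") == 0 &&
           s.retries == 0 && s.flags == 0;
}
```

Compared with a hand-written memset(&s.retries, 0, ...) that spans several members, the bounds here come from the structure definition itself, which is the property that lets a compiler reason about the write.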
2021-11-01 | Merge tag 'x86_cc_for_v5.16_rc1' of ↵ | Linus Torvalds | 1 | -3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull generic confidential computing updates from Borislav Petkov: "Add an interface called cc_platform_has() which is supposed to be used by confidential computing solutions to query different aspects of the system. The intent behind it is to unify testing of such aspects instead of having each confidential computing solution add its own set of tests to code paths in the kernel, leading to an unwieldy mess" * tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide: Replace the use of mem_encrypt_active() with cc_platform_has() x86/sev: Replace occurrences of sev_es_active() with cc_platform_has() x86/sev: Replace occurrences of sev_active() with cc_platform_has() x86/sme: Replace occurrences of sme_active() with cc_platform_has() powerpc/pseries/svm: Add a powerpc version of cc_platform_has() x86/sev: Add an x86 version of cc_platform_has() arch/cc: Introduce a function to check for confidential computing features x86/ioremap: Selectively build arch override encryption functions
2021-11-01 | Merge tag 'sched-core-2021-11-01' of ↵ | Linus Torvalds | 4 | -20/+24
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can leak the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kinds of unwinders by enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group.
- Improve asymmetric packing logic.
- Extend scheduler statistics to RT and DL scheduling classes and add statistics for bandwidth burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities.
- Prevent a potential deadlock when initial priority is assigned to a newly created kthread. A recent change to plug a race between cpuset and __sched_setscheduler() introduced a new lock dependency which is now triggered. Break the lock dependency chain by moving the priority assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled systems.
- Provide proper interfaces for live patching so it does not have to fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various scheduler options and delaying mmdrop) - The usual small tweaks and improvements all over the place * tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits) sched/fair: Cleanup newidle_balance sched/fair: Remove sysctl_sched_migration_cost condition sched/fair: Wait before decaying max_newidle_lb_cost sched/fair: Skip update_blocked_averages if we are defering load balance sched/fair: Account update_blocked_averages in newidle_balance cost x86: Fix __get_wchan() for !STACKTRACE sched,x86: Fix L2 cache mask sched/core: Remove rq_relock() sched: Improve wake_up_all_idle_cpus() take #2 irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support. sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ sched: Add cluster scheduler level for x86 sched: Add cluster scheduler level in core and related Kconfig for ARM64 topology: Represent clusters of CPUs within a die sched: Disable -Wunused-but-set-variable sched: Add wrapper for get_wchan() to keep task blocked x86: Fix get_wchan() to support the ORC unwinder proc: Use task_is_running() for wchan in /proc/$pid/stat ...
2021-11-01 | Merge tag 'for-5.16-tag' of ↵ | Linus Torvalds | 51 | -2903/+4439
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The updates this time are more under the hood and enhancing existing features (subpage with compression and zoned namespaces). Performance related: - misc small inode logging improvements (+3% throughput, -11% latency on sample dbench workload) - more efficient directory logging: bulk item insertion, less tree searches and locking - speed up bulk insertion of items into a b-tree, which is used when logging directories, when running delayed items for directories (fsync and transaction commits) and when running the slow path (full sync) of an fsync (bulk creation run time -4%, deletion -12%) Core: - continued subpage support - make defragmentation work - make compression write work - zoned mode - support ZNS (zoned namespaces), zone capacity is number of usable blocks in each zone - add dedicated block group (zoned) for relocation, to prevent out of order writes in some cases - greedy block group reclaim, pick the ones with least usable space first - preparatory work for send protocol updates - error handling improvements - cleanups and refactoring Fixes: - lockdep warnings - in show_devname callback, on seeding device - device delete on loop device due to conversions to workqueues - fix deadlock between chunk allocation and chunk btree modifications - fix tracking of missing device count and status" * tag 'for-5.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (140 commits) btrfs: remove root argument from check_item_in_log() btrfs: remove root argument from add_link() btrfs: remove root argument from btrfs_unlink_inode() btrfs: remove root argument from drop_one_dir_item() btrfs: clear MISSING device status bit in btrfs_close_one_device btrfs: call btrfs_check_rw_degradable only if there is a missing device btrfs: send: prepare for v2 protocol btrfs: fix comment about sector sizes supported in 64K systems btrfs: update device path inode time instead 
of bd_inode fs: export an inode_update_time helper btrfs: fix deadlock when defragging transparent huge pages btrfs: sysfs: convert scnprintf and snprintf to sysfs_emit btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE btrfs: update comments for chunk allocation -ENOSPC cases btrfs: fix deadlock between chunk allocation and chunk btree modifications btrfs: zoned: use greedy gc for auto reclaim btrfs: check-integrity: stop storing the block device name in btrfsic_dev_state btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls btrfs: add a btrfs_get_dev_args_from_path helper btrfs: handle device lookup with btrfs_dev_lookup_args ...
2021-11-01 | btrfs: fix lzo_decompress_bio() kmap leakage | Linus Torvalds | 1 | -1/+2
Commit ccaa66c8dd27 reinstated the kmap/kunmap that had been dropped in commit 8c945d32e604 ("btrfs: compression: drop kmap/kunmap from lzo"). However, it seems to have done so incorrectly due to the change not reverting cleanly, and lzo_decompress_bio() ended up not having a matching "kunmap()" to the "kmap()" that was put back. Also, any assert that the page pointer is not NULL should be before the kmap() of said pointer, since otherwise you'd just oops in the kmap() before the assert would even trigger. I noticed this when trying to verify my btrfs merge, and things not adding up. I'm doing this fixup before re-doing my merge, because this commit needs to also be backported to 5.15 (after verification from the btrfs people). Fixes: ccaa66c8dd27 ("Revert 'btrfs: compression: drop kmap/kunmap from lzo'") Cc: David Sterba <dsterba@suse.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
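The two rules stated in the commit message above (validate the page pointer before mapping it, and pair every kmap() with a kunmap()) can be shown with a small userspace mock. Everything here is hypothetical scaffolding: fake_kmap(), fake_kunmap(), struct fake_page and the counter only imitate the shape of the kernel calls, and this is not the btrfs code.

```c
#include <stddef.h>
#include <string.h>

/* Mock page and mapping counter, standing in for struct page and kmap state. */
struct fake_page {
    char data[16];
};

static int live_mappings;

static void *fake_kmap(struct fake_page *page)
{
    live_mappings++;
    return page->data;
}

static void fake_kunmap(struct fake_page *page)
{
    (void)page;
    live_mappings--;
}

/* Copy len bytes out of a page; returns 0 on success, -1 on bad input. */
int copy_from_page(struct fake_page *page, char *dst, size_t len)
{
    char *addr;

    /* Check first: mapping a NULL page would fault before any later check. */
    if (page == NULL || len > sizeof(page->data))
        return -1;

    addr = fake_kmap(page);
    memcpy(dst, addr, len);
    fake_kunmap(page);          /* the matching unmap on the same path */
    return 0;
}

/* Returns the number of mappings still outstanding after a few calls. */
int kmap_balance_demo(void)
{
    struct fake_page page = { .data = "lzo segment" };
    char out[16];

    if (copy_from_page(&page, out, sizeof(out)) != 0)
        return -1;
    if (copy_from_page(NULL, out, sizeof(out)) != -1)
        return -1;
    return live_mappings;
}
```

A leak of the kind described would show up in this mock as a non-zero counter after the calls return.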
2021-11-01Merge tag 'exfat-for-5.16-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fix from Namjae Jeon: "Fix ->i_blocks truncation issue caused by wrong 32bit mask" * tag 'exfat-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: fix incorrect loading of i_blocks for large files
2021-11-01Merge tag 'erofs-for-5.16-rc1' of ↵Linus Torvalds16-299/+959
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "There are some new features available for this cycle. Firstly, EROFS LZMA algorithm support, specifically called MicroLZMA, is available as an option for embedded devices, LiveCDs and/or as the secondary auxiliary compression algorithm besides the primary algorithm in one file. In order to better support the LZMA fixed-sized output compression, especially for 4KiB pcluster size (which has lowest memory pressure thus useful for memory-sensitive scenarios), Lasse introduced a new LZMA header/container format called MicroLZMA to minimize the original LZMA1 header (for example, we don't need to waste 4-byte dictionary size and another 8-byte uncompressed size, which can be calculated by fs directly, for each pcluster) and enable EROFS fixed-sized output compression. Note that MicroLZMA can also be later used by other things in addition to EROFS too where wasting minimal amount of space for headers is important and it can be only compiled by enabling XZ_DEC_MICROLZMA. MicroLZMA has been supported by the latest upstream XZ embedded [1] & XZ utils [2], apply the latest related XZ embedded upstream patches by the XZ author Lasse here. Secondly, multiple device is also supported in this cycle, which is designed for multi-layer container images. By working together with inter-layer data deduplication and compression, we can achieve the next high-performance container image solution. Our team will announce the new Nydus container image service [3] implementation with new RAFS v6 (EROFS-compatible) format in Open Source Summit 2021 China [4] soon. Besides, the secondary compression head support and readmore decompression strategy are also included in this cycle. There are also some minor bugfixes and cleanups, as always. 
Summary: - support multiple devices for multi-layer container images; - support the secondary compression head; - support readmore decompression strategy; - support new LZMA algorithm (specifically called MicroLZMA); - some bugfixes & cleanups" * tag 'erofs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: don't trigger WARN() when decompression fails erofs: get rid of ->lru usage erofs: lzma compression support erofs: rename some generic methods in decompressor lib/xz, lib/decompress_unxz.c: Fix spelling in comments lib/xz: Add MicroLZMA decoder lib/xz: Move s->lzma.len = 0 initialization to lzma_reset() lib/xz: Validate the value before assigning it to an enum variable lib/xz: Avoid overlapping memcpy() with invalid input with in-place decompression erofs: introduce readmore decompression strategy erofs: introduce the secondary compression head erofs: get compression algorithms directly on mapping erofs: add multiple device support erofs: decouple basic mount options from fs_context erofs: remove the fast path of per-CPU buffer decompression
2021-11-01Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds8-40/+87
Pull fscrypt updates from Eric Biggers: "Some cleanups for fs/crypto/: - Allow 256-bit master keys with AES-256-XTS - Improve documentation and comments - Remove unneeded field fscrypt_operations::max_namelen" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: improve a few comments fscrypt: allow 256-bit master keys with AES-256-XTS fscrypt: improve documentation for inline encryption fscrypt: clean up comments in bio.c fscrypt: remove fscrypt_operations::max_namelen
2021-11-01Merge tag 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds5-57/+26
Pull block inode sync updates from Jens Axboe: "This contains improvements to how bdev inode syncing is handled, unifying the API" * tag 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-block: block: simplify the block device syncing code ntfs3: use sync_blockdev_nowait fat: use sync_blockdev_nowait btrfs: use sync_blockdev xen-blkback: use sync_blockdev block: remove __sync_blockdev fs: remove __sync_filesystem
2021-11-01Merge tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds10-21/+21
Pull kiocb->ki_complete() cleanup from Jens Axboe: "This removes the res2 argument from kiocb->ki_complete(). Only the USB gadget code used it, everybody else passes 0. The USB guys checked the user gadget code they could find, and everybody just uses res as expected for the async interface" * tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block: fs: get rid of the res2 iocb->ki_complete argument usb: remove res2 argument from gadget code completions
2021-11-01Merge tag 'for-5.16/passthrough-flag-2021-10-29' of ↵Linus Torvalds3-120/+44
git://git.kernel.dk/linux-block Pull QUEUE_FLAG_SCSI_PASSTHROUGH removal from Jens Axboe: "This contains a series leading to the removal of the QUEUE_FLAG_SCSI_PASSTHROUGH queue flag" * tag 'for-5.16/passthrough-flag-2021-10-29' of git://git.kernel.dk/linux-block: block: remove blk_{get,put}_request block: remove QUEUE_FLAG_SCSI_PASSTHROUGH block: remove the initialize_rq_fn blk_mq_ops method scsi: add a scsi_alloc_request helper bsg-lib: initialize the bsg_job in bsg_transport_sg_io_fn nfsd/blocklayout: use ->get_unique_id instead of sending SCSI commands sd: implement ->get_unique_id block: add a ->get_unique_id method
2021-11-01Merge tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds24-59/+42
Pull bdev size cleanups from Jens Axboe: "Clean up the bdev size handling with new bdev_nr_bytes() helper" * tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-block: (34 commits) partitions/ibm: use bdev_nr_sectors instead of open coding it partitions/efi: use bdev_nr_bytes instead of open coding it block/ioctl: use bdev_nr_sectors and bdev_nr_bytes block: cache inode size in bdev udf: use sb_bdev_nr_blocks reiserfs: use sb_bdev_nr_blocks ntfs: use sb_bdev_nr_blocks jfs: use sb_bdev_nr_blocks ext4: use sb_bdev_nr_blocks block: add a sb_bdev_nr_blocks helper block: use bdev_nr_bytes instead of open coding it in blkdev_fallocate squashfs: use bdev_nr_bytes instead of open coding it reiserfs: use bdev_nr_bytes instead of open coding it pstore/blk: use bdev_nr_bytes instead of open coding it ntfs3: use bdev_nr_bytes instead of open coding it nilfs2: use bdev_nr_bytes instead of open coding it nfs/blocklayout: use bdev_nr_bytes instead of open coding it jfs: use bdev_nr_bytes instead of open coding it hfsplus: use bdev_nr_sectors instead of open coding it hfs: use bdev_nr_sectors instead of open coding it ...
2021-11-01Merge tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds3-848/+983
Pull io_uring updates from Jens Axboe: "Light on new features - basically just the hybrid mode support. Outside of that it's just fixes, cleanups, and performance improvements. In detail: - Add ring related information to the fdinfo output (Hao) - Hybrid async mode (Hao) - Support for batched issue on block (me) - sqe error trace improvement (me) - IOPOLL efficiency improvements (Pavel) - submit state cleanups and improvements (Pavel) - Completion side improvements (Pavel) - Drain improvements (Pavel) - Buffer selection cleanups (Pavel) - Fixed file node improvements (Pavel) - io-wq setup cancelation fix (Pavel) - Various other performance improvements and cleanups (Pavel) - Misc fixes (Arnd, Bixuan, Changcheng, Hao, me, Noah)" * tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-block: (97 commits) io-wq: remove worker to owner tw dependency io_uring: harder fdinfo sq/cq ring iterating io_uring: don't assign write hint in the read path io_uring: clusterise ki_flags access in rw_prep io_uring: kill unused param from io_file_supports_nowait io_uring: clean up timeout async_data allocation io_uring: don't try io-wq polling if not supported io_uring: check if opcode needs poll first on arming io_uring: clean iowq submit work cancellation io_uring: clean io_wq_submit_work()'s main loop io-wq: use helper for worker refcounting io_uring: implement async hybrid mode for pollable requests io_uring: Use ERR_CAST() instead of ERR_PTR(PTR_ERR()) io_uring: split logic of force_nonblock io_uring: warning about unused-but-set parameter io_uring: inform block layer of how many requests we are submitting io_uring: simplify io_file_supports_nowait() io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags io_uring: arm poll for non-nowait files fs/io_uring: Prioritise checking faster conditions first in io_write ...
2021-11-01Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds18-61/+68
Pull block updates from Jens Axboe: - mq-deadline accounting improvements (Bart) - blk-wbt timer fix (Andrea) - Untangle the block layer includes (Christoph) - Rework the poll support to be bio based, which will enable adding support for polling for bio based drivers (Christoph) - Block layer core support for multi-actuator drives (Damien) - blk-crypto improvements (Eric) - Batched tag allocation support (me) - Request completion batching support (me) - Plugging improvements (me) - Shared tag set improvements (John) - Concurrent queue quiesce support (Ming) - Cache bdev in ->private_data for block devices (Pavel) - bdev dio improvements (Pavel) - Block device invalidation and block size improvements (Xie) - Various cleanups, fixes, and improvements (Christoph, Jackie, Masahira, Tejun, Yu, Pavel, Zheng, me) * tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits) blk-mq-debugfs: Show active requests per queue for shared tags block: improve readability of blk_mq_end_request_batch() virtio-blk: Use blk_validate_block_size() to validate block size loop: Use blk_validate_block_size() to validate block size nbd: Use blk_validate_block_size() to validate block size block: Add a helper to validate the block size block: re-flow blk_mq_rq_ctx_init() block: prefetch request to be initialized block: pass in blk_mq_tags to blk_mq_rq_ctx_init() block: add rq_flags to struct blk_mq_alloc_data block: add async version of bio_set_polled block: kill DIO_MULTI_BIO block: kill unused polling bits in __blkdev_direct_IO() block: avoid extra iter advance with async iocb block: Add independent access ranges support blk-mq: don't issue request directly in case that current is to be blocked sbitmap: silence data race warning blk-cgroup: synchronize blkg creation against policy deactivation block: refactor bio_iov_bvec_set() block: add single bio async direct IO helper ...
2021-11-01Merge tag 'locks-v5.16' of ↵Linus Torvalds6-156/+27
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "Most of this is just follow-on cleanup work of documentation and comments from the mandatory locking removal in v5.15. The only real functional change is that LOCK_MAND flock() support is also being removed, as it has basically been non-functional since the v2.5 days" * tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: remove leftover comments from mandatory locking removal locks: remove changelog comments docs: fs: locks.rst: update comment about mandatory file locking Documentation: remove reference to now removed mandatory-locking doc locks: remove LOCK_MAND flock lock support
2021-11-01Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds4-13/+15
Pull memory folios from Matthew Wilcox: "Add memory folios, a new type to represent either order-0 pages or the head page of a compound page. This should be enough infrastructure to support filesystems converting from pages to folios. The point of all this churn is to allow filesystems and the page cache to manage memory in larger chunks than PAGE_SIZE. The original plan was to use compound pages like THP does, but I ran into problems with some functions expecting only a head page while others expect the precise page containing a particular byte. The folio type allows a function to declare that it's expecting only a head page. Almost incidentally, this allows us to remove various calls to VM_BUG_ON(PageTail(page)) and compound_head(). This converts just parts of the core MM and the page cache. For 5.17, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.18, multi-page folios should be ready. The multi-page folios offer some improvement to some workloads. The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%. I haven't heard of any performance losses as a result of this series. Nobody has done any serious performance tuning; I imagine that tweaking the readahead algorithm could provide some more interesting wins. There are also other places where we could choose to create large folios and currently do not, such as writes that are larger than PAGE_SIZE. I'd like to thank all my reviewers who've offered review/ack tags: Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil Babka, William Kucharski, Yu Zhao and Zi Yan. 
I'd also like to thank those who gave feedback I incorporated but haven't offered up review tags for this part of the series: Nick Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard, Hugh Dickins, and probably a few others who I forget" * tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits) mm/writeback: Add folio_write_one mm/filemap: Add FGP_STABLE mm/filemap: Add filemap_get_folio mm/filemap: Convert mapping_get_entry to return a folio mm/filemap: Add filemap_add_folio() mm/filemap: Add filemap_alloc_folio mm/page_alloc: Add folio allocation functions mm/lru: Add folio_add_lru() mm/lru: Convert __pagevec_lru_add_fn to take a folio mm: Add folio_evictable() mm/workingset: Convert workingset_refault() to take a folio mm/filemap: Add readahead_folio() mm/filemap: Add folio_mkwrite_check_truncate() mm/filemap: Add i_blocks_per_folio() mm/writeback: Add folio_redirty_for_writepage() mm/writeback: Add folio_account_redirty() mm/writeback: Add folio_clear_dirty_for_io() mm/writeback: Add folio_cancel_dirty() mm/writeback: Add folio_account_cleaned() mm/writeback: Add filemap_dirty_folio() ...
2021-11-01exfat: fix incorrect loading of i_blocks for large filesSungjong Seo1-1/+1
When calculating i_blocks, the size was mistakenly masked with a 32-bit variable, so i_blocks for files larger than 4 GiB had incorrect values. Mask with a 64-bit variable instead of a 32-bit one. Fixes: 5f2aa075070c ("exfat: add inode operations") Cc: stable@vger.kernel.org # v5.7+ Reported-by: Ganapathi Kamath <hgkamath@hotmail.com> Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
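The truncation is easy to reproduce outside the kernel. The sketch below is not the exfat code: blocks_mask32() and blocks_mask64() are hypothetical helpers that round a file size up to a cluster boundary and convert it to a block count, once with a mask computed in 32 bits and once with a mask widened to 64 bits first.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Buggy variant: ~(cluster_size - 1) is evaluated as a 32-bit value and then
 * zero-extended, so the AND clears the upper 32 bits of the rounded size.
 */
static uint64_t blocks_mask32(uint64_t size, uint32_t cluster_size,
			      unsigned int blkbits)
{
	assert(cluster_size != 0 && (cluster_size & (cluster_size - 1)) == 0);
	return ((size + (cluster_size - 1)) & ~(cluster_size - 1)) >> blkbits;
}

/* Fixed variant: widen to 64 bits before building the mask. */
static uint64_t blocks_mask64(uint64_t size, uint32_t cluster_size,
			      unsigned int blkbits)
{
	assert(cluster_size != 0 && (cluster_size & (cluster_size - 1)) == 0);
	return ((size + (cluster_size - 1)) & ~((uint64_t)cluster_size - 1)) >> blkbits;
}
```

For a 5 GiB file with 32 KiB clusters and 512-byte blocks, the 64-bit mask yields 10485760 blocks, while the 32-bit mask yields 2097152 because everything above 4 GiB is masked away. For files below 4 GiB both variants agree.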
2021-10-31erofs: don't trigger WARN() when decompression failsGao Xiang1-1/+0
syzbot reported a WARNING [1] due to corrupted compressed data. As Dmitry said, "If this is not a kernel bug, then the code should not use WARN. WARN if for kernel bugs and is recognized as such by all testing systems and humans." [1] https://lore.kernel.org/r/000000000000b3586105cf0ff45e@google.com Link: https://lore.kernel.org/r/20211025074311.130395-1-hsiangkao@linux.alibaba.com Cc: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Reported-by: syzbot+d8aaffc3719597e8cfb4@syzkaller.appspotmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-30xfs: use swap() to make code cleanerChangcheng Deng1-8/+2
Use swap() in order to make code cleaner. Issue found by coccinelle. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
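For readers unfamiliar with the idiom, the kind of open-coded exchange that coccinelle flags can be replaced by a single macro call. The sketch below uses a simplified stand-in for the kernel's swap() macro (it relies on the GNU C __typeof__ extension); order_pair() is a hypothetical example and not the XFS code touched by this commit.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's swap(); __typeof__ is a GNU C extension. */
#define swap(a, b) \
	do { __typeof__(a) tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Hypothetical helper: make sure *lo <= *hi, exchanging the values if needed. */
static void order_pair(int *lo, int *hi)
{
	assert(lo != NULL && hi != NULL);
	if (*lo > *hi)
		swap(*lo, *hi);	/* replaces the three-line temp-variable dance */
}
```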
2021-10-30xfs: Remove duplicated include in xfs_superWan Jiabing1-1/+0
Fix the following checkincludes.pl warning: ./fs/xfs/xfs_super.c: xfs_btree.h is included more than once. The header is already included at line 15, so remove the duplicate include. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-29Merge tag 'for-5.15-rc7-tag' of ↵Linus Torvalds5-33/+72
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Last minute fixes for crash on 32bit architectures when compression is in use. It's a regression introduced in 5.15-rc and I'd really like not to let this into the final release; fixes via stable trees would add unnecessary delay. The problem is on 32bit architectures with highmem enabled, the pages for compression may need to be kmapped, while the patches removed that as we don't use GFP_HIGHMEM allocations anymore. The pages that don't come from local allocation still may be from highmem. Even though this only affects 32bit, there are enough such ARM machines in use that it's not a marginal issue. I did full reverts of the patches one by one instead of a huge one. There's one exception for the "lzo" revert as there was an intermediate patch touching the same code to make it compatible with subpage. I can't revert that one too, so the revert in lzo.c is manual. Qu Wenruo has worked on that with me and verified the changes" * tag 'for-5.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: compression: drop kmap/kunmap from lzo" Revert "btrfs: compression: drop kmap/kunmap from zlib" Revert "btrfs: compression: drop kmap/kunmap from zstd" Revert "btrfs: compression: drop kmap/kunmap from generic helpers"
2021-10-29io-wq: remove worker to owner tw dependencyPavel Begunkov1-9/+37
INFO: task iou-wrk-6609:6612 blocked for more than 143 seconds. Not tainted 5.15.0-rc5-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:iou-wrk-6609 state:D stack:27944 pid: 6612 ppid: 6526 flags:0x00004006 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xb44/0x5960 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x176/0x280 kernel/sched/completion.c:138 io_worker_exit fs/io-wq.c:183 [inline] io_wqe_worker+0x66d/0xc40 fs/io-wq.c:597 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 An io-wq worker may submit a task_work to the master task and, upon io_worker_exit(), wait for the tw to get executed. The problem appears when the master task is waiting in coredump.c: 468 freezer_do_not_count(); 469 wait_for_completion(&core_state->startup); 470 freezer_count(); Apparently, having a dependency on child threads gets everything stuck. Work around it by cancelling the task_work callback that causes it before waiting in io_worker_exit(). P.S. A better option is probably to not submit a tw that elevates the refcount in the first place, but let's leave this exercise for the future. Cc: stable@vger.kernel.org Reported-and-tested-by: syzbot+27d62ee6f256b186883e@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/142a716f4ed936feae868959059154362bfa8c19.1635509451.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-29io_uring: harder fdinfo sq/cq ring iteratingJens Axboe1-22/+29
The ring iteration is racy, which isn't necessarily a problem except it can cause us to iterate the whole thing. That isn't desired or ideal, and it can lead to excessive runtimes of reading fdinfo. Cap the iteration at tail - head OR the ring size. While in there, clean up the ring masking and just dump the raw values along with the masks. That provides more useful debug info. Fixes: 83f84356bc8f ("io_uring: add more uring info to fdinfo for debug") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
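The capping rule described here ("tail - head OR the ring size") can be sketched as a small pure function. This is an illustration under assumptions, not the fdinfo code itself: entries_to_show() and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many ring entries to print. head and tail are
 * free-running 32-bit counters, so tail - head is computed with unsigned
 * wraparound. If a racy read makes the difference nonsensical (for example
 * head appears to be ahead of tail), the result is clamped to ring_entries,
 * which bounds the iteration.
 */
static uint32_t entries_to_show(uint32_t head, uint32_t tail,
				uint32_t ring_entries)
{
	uint32_t pending = tail - head;

	assert(ring_entries > 0);
	return pending < ring_entries ? pending : ring_entries;
}
```

With an 8-entry ring, head = 10 and tail = 13 give 3 entries; counters that have wrapped (head = 0xFFFFFFFE, tail = 1) also give 3; an inconsistent snapshot such as head = 100 and tail = 5 is capped at 8 rather than iterating billions of times.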
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from lzo"David Sterba1-11/+25
This reverts commit 8c945d32e60427cbc0859cf7045bbe6196bb03d8. The kmaps in compression code are still needed and their removal causes crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. The revert does not apply cleanly due to changes in a6e66e6f8c1b ("btrfs: rework lzo_decompress_bio() to make it subpage compatible") that reworked the page iteration so the revert is done to be equivalent to the original code. Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Tested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from zlib"David Sterba1-11/+25
This reverts commit 696ab562e6df9fbafd6052d8ce4aafcb2ed16069. The kmaps in compression code are still needed and their removal causes crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from zstd"David Sterba1-9/+18
This reverts commit bbaf9715f3f5b5ff0de71da91fcc34ee9c198ed8. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Example stacktrace with ZSTD on a 32bit ARM machine: Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = c4159ed3 [00000000] *pgd=00000000 Internal error: Oops: 5 [#1] PREEMPT SMP ARM Modules linked in: CPU: 0 PID: 210 Comm: kworker/u2:3 Not tainted 5.14.0-rc79+ #12 Hardware name: Allwinner sun4i/sun5i Families Workqueue: btrfs-delalloc btrfs_work_helper PC is at mmiocpy+0x48/0x330 LR is at ZSTD_compressStream_generic+0x15c/0x28c (mmiocpy) from [<c0629648>] (ZSTD_compressStream_generic+0x15c/0x28c) (ZSTD_compressStream_generic) from [<c06297dc>] (ZSTD_compressStream+0x64/0xa0) (ZSTD_compressStream) from [<c049444c>] (zstd_compress_pages+0x170/0x488) (zstd_compress_pages) from [<c0496798>] (btrfs_compress_pages+0x124/0x12c) (btrfs_compress_pages) from [<c043c068>] (compress_file_range+0x3c0/0x834) (compress_file_range) from [<c043c4ec>] (async_cow_start+0x10/0x28) (async_cow_start) from [<c0475c3c>] (btrfs_work_helper+0x100/0x230) (btrfs_work_helper) from [<c014ef68>] (process_one_work+0x1b4/0x418) (process_one_work) from [<c014f210>] (worker_thread+0x44/0x524) (worker_thread) from [<c0156aa4>] (kthread+0x180/0x1b0) (kthread) from [<c0100150>] Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from check_item_in_log()Filipe Manana1-2/+2
The root argument passed to check_item_in_log() always matches the root of the given directory, so it can be eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from add_link()Filipe Manana1-2/+3
The root argument for tree-log.c:add_link() always matches the root of the given directory and the given inode, so it can be eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from btrfs_unlink_inode()Filipe Manana3-22/+18
The root argument passed to btrfs_unlink_inode() and its callee, __btrfs_unlink_inode(), always matches the root of the given directory and the given inode. So remove the argument and make __btrfs_unlink_inode() use the root of the directory. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from drop_one_dir_item()Filipe Manana1-4/+4
The root argument for drop_one_dir_item() always matches the root of the given directory inode, since each log tree is associated to one and only one subvolume/root, so remove the argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: clear MISSING device status bit in btrfs_close_one_deviceLi Zhang1-1/+3
Reported bug: https://github.com/kdave/btrfs-progs/issues/389 There's a problem with scrub reporting aborted status but returning error code 0, on a filesystem with a missing and re-added device. Roughly these steps: - mkfs -d raid1 dev1 dev2 - fill with data - unmount - make dev1 disappear - mount -o degraded - copy more data - make dev1 appear again Running scrub afterwards reports that the command was aborted, but the system log message says the exit code was 0. It seems that the cause of the error is decrementing fs_devices->missing_devices but not clearing device->dev_state. Every time the filesystem is unmounted, close_ctree() is called and eventually invokes btrfs_close_one_device() to close the device, which decrements fs_devices->missing_devices but does not clear the device's BTRFS_DEV_STATE_MISSING bit. Worse, this bug causes an integer underflow: fs_devices->missing_devices is decremented on every umount, so once its value reaches 0 the next decrement wraps around.
With added debugging: loop1: detected capacity change from 0 to 20971520 BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311) loop2: detected capacity change from 0 to 20971520 BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313) BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 0 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 18446744073709551615 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 18446744073709551615 If fs_devices->missing_devices is 0, next time it would be 18446744073709551615 After apply this patch, the fs_devices->missing_devices seems to be right: $ truncate -s 10g test1 $ truncate -s 10g test2 $ losetup /dev/loop1 
test1 $ losetup /dev/loop2 test2 $ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f $ losetup -d /dev/loop2 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ dmesg loop1: detected capacity change from 0 to 20971520 loop2: detected capacity change from 0 to 20971520 BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863) BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863) BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): checking UUID tree BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device 
loop1): before clear_missing.0000000000000000 /dev/loop2 1 CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Li Zhang <zhanglikernel@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
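The counter sequence in the first log (1, then 0, then 18446744073709551615) can be reproduced with a small userspace model. This is a sketch under assumptions, not the btrfs code: the struct and function names are hypothetical, and the model assumes the missing counter is only incremented when the device is not already flagged as missing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-device MISSING state bit. */
struct model_device {
	bool missing;
};

/* Hypothetical model of the per-filesystem counter of missing devices. */
struct model_fs_devices {
	uint64_t missing_devices;
};

/* Degraded mount: count the device as missing only if not already flagged. */
static void model_mount(struct model_fs_devices *fs, struct model_device *dev)
{
	if (!dev->missing) {
		dev->missing = true;
		fs->missing_devices++;
	}
}

/* Close on umount: the fixed variant also clears the flag. */
static void model_close(struct model_fs_devices *fs, struct model_device *dev,
			bool clear_bit_on_close)
{
	if (dev->missing) {
		if (clear_bit_on_close)
			dev->missing = false;
		fs->missing_devices--;	/* wraps around if already 0 */
	}
}

/* Run n mount/umount cycles; return the counter observed on the n-th mount. */
static uint64_t missing_seen_on_mount(int n, bool clear_bit_on_close)
{
	struct model_fs_devices fs = { 0 };
	struct model_device dev = { false };
	uint64_t seen = 0;

	assert(n >= 1);
	for (int i = 0; i < n; i++) {
		model_mount(&fs, &dev);
		seen = fs.missing_devices;
		model_close(&fs, &dev, clear_bit_on_close);
	}
	return seen;
}
```

Without clearing the flag, the third mount observes UINT64_MAX, matching the debug output before the patch; with the flag cleared on close, every mount observes 1, matching the output after the patch.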
2021-10-29btrfs: call btrfs_check_rw_degradable only if there is a missing deviceAnand Jain1-1/+2
In open_ctree() in btrfs_check_rw_degradable() [1], we check each block group individually if at least the minimum number of devices is available for that profile. If all the devices are available, then we don't have to check degradable. [1] open_ctree() :: 3559 if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) { Also before calling btrfs_check_rw_degradable() in open_ctree() at the line number shown below [2] we call btrfs_read_chunk_tree() and down to add_missing_dev() to record number of missing devices. [2] open_ctree() :: 3454 ret = btrfs_read_chunk_tree(fs_info); btrfs_read_chunk_tree() read_one_chunk() / read_one_dev() add_missing_dev() So, check if there is any missing device before btrfs_check_rw_degradable() in open_ctree(). Also, with this the mount command could save ~16ms.[3] in the most common case, that is no device is missing. [3] 1) * 16934.96 us | btrfs_check_rw_degradable [btrfs](); CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: send: prepare for v2 protocolDavid Sterba3-1/+32
This is preparatory work for the send protocol update to version 2 and higher. We have many pending protocol update requests but still don't have the basic protocol revision in place, so the first thing that must happen is the actual versioning support. The protocol version is u32 and is a new member in the send ioctl struct. Validity of the version field is backed by a new flag bit. Old kernels would fail when a higher version is requested. Protocol version 0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION, which is also exported in sysfs. The version is still unchanged and will be increased once we have new incompatible commands or stream updates. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
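The version selection rules described above can be sketched as a small pure function. The names `MODEL_SEND_FLAG_VERSION`, `MODEL_SEND_STREAM_VERSION` and `pick_send_version()` are hypothetical, the flag value and the -1 error return are placeholders, and falling back to version 1 when the flag bit is absent is an assumption of this sketch rather than something the message states.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SEND_FLAG_VERSION   0x8u /* hypothetical flag bit: "version field is valid" */
#define MODEL_SEND_STREAM_VERSION 1u   /* hypothetical highest supported version */

/* Returns the protocol version to use, or -1 if the request cannot be honoured. */
static int pick_send_version(uint64_t flags, uint32_t requested)
{
	assert(MODEL_SEND_STREAM_VERSION >= 1u);

	/* Assumption: without the flag bit the version field is ignored. */
	if (!(flags & MODEL_SEND_FLAG_VERSION))
		return 1;
	/* Version 0 means "use the highest version you support". */
	if (requested == 0)
		return (int)MODEL_SEND_STREAM_VERSION;
	/* A version newer than what is implemented must be rejected. */
	if (requested > MODEL_SEND_STREAM_VERSION)
		return -1;
	return (int)requested;
}
```

Gating the field behind a flag bit lets a caller detect an old kernel, since a kernel that does not know the bit rejects the request instead of silently ignoring the version.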
2021-10-28ocfs2: fix race between searching chunks and release journal_head from ↵Gautham Ananthakrishna1-9/+13
buffer_head

Encountered a race between ocfs2_test_bg_bit_allocatable() and jbd2_journal_put_journal_head() resulting in the below vmcore.

PID: 106879 TASK: ffff880244ba9c00 CPU: 2 COMMAND: "loop3"
Call trace:
  panic
  oops_end
  no_context
  __bad_area_nosemaphore
  bad_area_nosemaphore
  __do_page_fault
  do_page_fault
  page_fault
  [exception RIP: ocfs2_block_group_find_clear_bits+316]
  ocfs2_block_group_find_clear_bits [ocfs2]
  ocfs2_cluster_group_search [ocfs2]
  ocfs2_search_chain [ocfs2]
  ocfs2_claim_suballoc_bits [ocfs2]
  __ocfs2_claim_clusters [ocfs2]
  ocfs2_claim_clusters [ocfs2]
  ocfs2_local_alloc_slide_window [ocfs2]
  ocfs2_reserve_local_alloc_bits [ocfs2]
  ocfs2_reserve_clusters_with_limit [ocfs2]
  ocfs2_reserve_clusters [ocfs2]
  ocfs2_lock_refcount_allocators [ocfs2]
  ocfs2_make_clusters_writable [ocfs2]
  ocfs2_replace_cow [ocfs2]
  ocfs2_refcount_cow [ocfs2]
  ocfs2_file_write_iter [ocfs2]
  lo_rw_aio
  loop_queue_work
  kthread_worker_fn
  kthread
  ret_from_fork

When ocfs2_test_bg_bit_allocatable() called bh2jh(bg_bh), bg_bh->b_private was NULL because jbd2_journal_put_journal_head() had raced and released the journal head from the buffer head. Taking the bit lock for the 'BH_JournalHead' bit fixes this race.

Link: https://lkml.kernel.org/r/1634820718-6043-1-git-send-email-gautham.ananthakrishna@oracle.com
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: <rajesh.sivaramasubramaniom@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-27Revert "btrfs: compression: drop kmap/kunmap from generic helpers"David Sterba2-2/+4
This reverts commit 4c2bf276b56d8d27ddbafcdf056ef3fc60ae50b0. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26io_uring: don't assign write hint in the read pathJens Axboe1-1/+1
Move this out of the generic read/write prep path, and place it in the write specific kiocb setup instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26btrfs: fix comment about sector sizes supported in 64K systemsAnand Jain1-2/+1
Commit 95ea0486b20e ("btrfs: allow read-write for 4K sectorsize on 64K page size systems") added write support for 4K sectorsize on 64K page size systems. Fix the now stale comments. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: update device path inode time instead of bd_inodeJosef Bacik1-8/+13
Christoph pointed out that I'm updating bdev->bd_inode for the device time when we remove block devices from a btrfs file system, however this isn't actually exposed to anything. The inode we want to update is the one that's associated with the path to the device, usually on devtmpfs, so that blkid notices the difference. We still don't want to do the blkdev_open, so use kern_path() to get the path to the given device and do the update time on that inode. Fixes: 8f96a5bfa150 ("btrfs: update the bdev time directly when closing") Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26fs: export an inode_update_time helperJosef Bacik1-3/+4
If you already have an inode and need to update the time on the inode there is no way to do this properly. Export this helper to allow file systems to update time on the inode so the appropriate handler is called, either ->update_time or generic_update_time. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: fix deadlock when defragging transparent huge pagesOmar Sandoval1-0/+14
Attempting to defragment a Btrfs file containing a transparent huge page immediately deadlocks with the following stack trace:

  #0 context_switch (kernel/sched/core.c:4940:2)
  #1 __schedule (kernel/sched/core.c:6287:8)
  #2 schedule (kernel/sched/core.c:6366:3)
  #3 io_schedule (kernel/sched/core.c:8389:2)
  #4 wait_on_page_bit_common (mm/filemap.c:1356:4)
  #5 __lock_page (mm/filemap.c:1648:2)
  #6 lock_page (./include/linux/pagemap.h:625:3)
  #7 pagecache_get_page (mm/filemap.c:1910:4)
  #8 find_or_create_page (./include/linux/pagemap.h:420:9)
  #9 defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9)
  #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14)
  #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9)
  #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9)
  #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9)
  #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10)
  #15 vfs_ioctl (fs/ioctl.c:51:10)
  #16 __do_sys_ioctl (fs/ioctl.c:874:11)
  #17 __se_sys_ioctl (fs/ioctl.c:860:1)
  #18 __x64_sys_ioctl (fs/ioctl.c:860:1)
  #19 do_syscall_x64 (arch/x86/entry/common.c:50:14)
  #20 do_syscall_64 (arch/x86/entry/common.c:80:7)
  #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113)

A huge page is represented by a compound page, which consists of a struct page for each PAGE_SIZE page within the huge page. The first struct page is the "head page", and the remaining are "tail pages".

Defragmentation attempts to lock each page in the range. However, lock_page() on a tail page actually locks the corresponding head page. So, if defragmentation tries to lock more than one struct page in a compound page, it tries to lock the same head page twice and deadlocks with itself.

Ideally, we should be able to defragment transparent huge pages. However, THP for filesystems is currently read-only, so a lot of code is not ready to use huge pages for I/O. For now, let's just return ETXTBSY.
This can be reproduced with the following on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y:

  $ cat create_thp_file.c
  #include <fcntl.h>
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  static const char zeroes[1024 * 1024];
  static const size_t FILE_SIZE = 2 * 1024 * 1024;

  int main(int argc, char **argv)
  {
  	if (argc != 2) {
  		fprintf(stderr, "usage: %s PATH\n", argv[0]);
  		return EXIT_FAILURE;
  	}
  	int fd = creat(argv[1], 0777);
  	if (fd == -1) {
  		perror("creat");
  		return EXIT_FAILURE;
  	}
  	/* Fill the file with FILE_SIZE bytes of zeroes. */
  	size_t written = 0;
  	while (written < FILE_SIZE) {
  		ssize_t ret = write(fd, zeroes,
  				    sizeof(zeroes) < FILE_SIZE - written ?
  				    sizeof(zeroes) : FILE_SIZE - written);
  		if (ret < 0) {
  			perror("write");
  			return EXIT_FAILURE;
  		}
  		written += ret;
  	}
  	close(fd);
  	fd = open(argv[1], O_RDONLY);
  	if (fd == -1) {
  		perror("open");
  		return EXIT_FAILURE;
  	}

  	/*
  	 * Reserve some address space so that we can align the file mapping to
  	 * the huge page size.
  	 */
  	void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE,
  				     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  	if (placeholder_map == MAP_FAILED) {
  		perror("mmap (placeholder)");
  		return EXIT_FAILURE;
  	}

  	void *aligned_address =
  		(void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1));

  	void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC,
  			 MAP_SHARED | MAP_FIXED, fd, 0);
  	if (map == MAP_FAILED) {
  		perror("mmap");
  		return EXIT_FAILURE;
  	}
  	if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) {
  		perror("madvise");
  		return EXIT_FAILURE;
  	}

  	char *line = NULL;
  	size_t line_capacity = 0;
  	FILE *smaps_file = fopen("/proc/self/smaps", "r");
  	if (!smaps_file) {
  		perror("fopen");
  		return EXIT_FAILURE;
  	}
  	for (;;) {
  		/* Touch every 4K page so the mapping stays populated. */
  		for (size_t off = 0; off < FILE_SIZE; off += 4096)
  			((volatile char *)map)[off];

  		/* Exit once smaps reports a PMD-mapped range for our mapping. */
  		ssize_t ret;
  		bool this_mapping = false;
  		while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) {
  			unsigned long start, end, huge;
  			if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
  				this_mapping = (start <= (uintptr_t)map &&
  						(uintptr_t)map < end);
  			} else if (this_mapping &&
  				   sscanf(line, "FilePmdMapped: %lu", &huge) == 1 &&
  				   huge > 0) {
  				return EXIT_SUCCESS;
  			}
  		}

  		sleep(6);
  		rewind(smaps_file);
  		fflush(smaps_file);
  	}
  }
  $ ./create_thp_file huge
  $ btrfs fi defrag -czstd ./huge

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
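The self-deadlock described in this commit, and the "refuse compound pages" approach, can be sketched with a userspace model. Everything here is hypothetical: `struct model_page`, `model_lock_range_naive()` and `model_defrag_prepare()` are illustration-only names, locks are plain booleans, and the errno constant is the one spelled ETXTBSY in `<errno.h>`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page; a tail page refers to its head page by index. */
struct model_page {
	size_t head_index; /* index of the head page; own index for a normal page */
	bool compound;     /* part of a huge (compound) page */
	bool locked;
};

/*
 * Models locking every page in a range when a lock on a tail page is
 * redirected to its head. Returns the index at which the caller would
 * block on a lock it already holds, or -1 if the whole range was locked.
 */
static long model_lock_range_naive(struct model_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct model_page *head = &pages[pages[i].head_index];

		if (head->locked)
			return (long)i; /* second lock of the same head: self-deadlock */
		head->locked = true;
	}
	return -1;
}

/* Models refusing compound pages up front instead of locking them. */
static int model_defrag_prepare(struct model_page *pages, size_t n)
{
	assert(pages || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (pages[i].compound) {
			/* Drop the locks taken so far before bailing out. */
			for (size_t j = 0; j < i; j++)
				pages[j].locked = false;
			return -ETXTBSY;
		}
		pages[i].locked = true;
	}
	return 0;
}
```

With three struct pages that all belong to one huge page, the naive walk stops at index 1, because the head page was already locked when index 0 was processed.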
2021-10-26btrfs: sysfs: convert scnprintf and snprintf to sysfs_emitAnand Jain1-49/+44
Commit 2efc459d06f1 ("sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs out") merged in 5.10 introduced two new functions sysfs_emit() and sysfs_emit_at() which are aware of the PAGE_SIZE limit of the output buffer. Use the above two new functions instead of scnprintf() and snprintf() in various sysfs show(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
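What "aware of the PAGE_SIZE limit" means in practice can be sketched in userspace. `model_sysfs_emit()` and `MODEL_PAGE_SIZE` are hypothetical names; this is a model of the idea built on vsnprintf(), not the kernel implementation, and the 4K page size is an assumption.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define MODEL_PAGE_SIZE 4096 /* assumption: 4K pages */

/*
 * Userspace model of a PAGE_SIZE-aware emit helper: the caller passes only
 * the buffer, and the helper itself enforces the one-page bound.
 * Returns the number of characters written, excluding the trailing NUL.
 */
static int model_sysfs_emit(char *buf, const char *fmt, ...)
{
	va_list args;
	int len;

	assert(buf && fmt);
	va_start(args, fmt);
	len = vsnprintf(buf, MODEL_PAGE_SIZE, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;
	/* vsnprintf reports the untruncated length; clamp to what was stored. */
	if (len >= MODEL_PAGE_SIZE)
		len = MODEL_PAGE_SIZE - 1;
	return len;
}
```

The design point is that a show() callback no longer has to pass a size argument, so it cannot pass a wrong one.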
2021-10-26btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZEQu Wenruo2-3/+7
It's a common practice to avoid using sizeof(struct btrfs_super_block) (3531) and to use BTRFS_SUPER_INFO_SIZE (4096) instead. The problem is that sizeof(struct btrfs_super_block) has not matched BTRFS_SUPER_INFO_SIZE since the very beginning. Furthermore, for all call sites except selftests, we always allocate BTRFS_SUPER_INFO_SIZE space for the super block, so there isn't any real reason to use the smaller value, and it doesn't really save any space. So let's get rid of this confusing behavior and unify the two values. This modification also adds a new static_assert() to verify the size, and moves the BTRFS_SUPER_INFO_* macros to the definition of btrfs_super_block for the static_assert(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
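The technique of padding a structure to its on-disk size and pinning that size with a compile-time assertion can be sketched as follows. `struct model_super_block` and `MODEL_SUPER_INFO_SIZE` are hypothetical; the field list is drastically shortened and is not the real layout, and `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SUPER_INFO_SIZE 4096

/* Hypothetical, heavily shortened layout; the real struct has many more fields. */
struct model_super_block {
	uint8_t  csum[32];
	uint8_t  fsid[16];
	uint64_t bytenr;
	uint64_t flags;
	/* Explicit padding brings the struct up to the full on-disk size. */
	uint8_t  padding[MODEL_SUPER_INFO_SIZE - 32 - 16 - 8 - 8];
} __attribute__((packed));

/* Fails the build, not the mount, if a field is added without shrinking the padding. */
static_assert(sizeof(struct model_super_block) == MODEL_SUPER_INFO_SIZE,
	      "super block struct must match the on-disk size");
```

Once the two sizes are tied together like this, callers can use either sizeof() or the macro without risking a short read or write.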
2021-10-26btrfs: update comments for chunk allocation -ENOSPC casesFilipe Manana1-3/+18
Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that describe which cases can lead to a failure to allocate metadata and system space despite having previously reserved space. This adds one more reason that I previously forgot to mention. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: fix deadlock between chunk allocation and chunk btree modificationsFilipe Manana4-56/+111
When a task is doing some modification to the chunk btree and it is not in the context of a chunk allocation or a chunk removal, it can deadlock with another task that is currently allocating a new data or metadata chunk.

These contexts are the following:

* When relocating a system chunk, when we need to COW the extent buffers that belong to the chunk btree;

* When adding a new device (ioctl), where we need to add a new device item to the chunk btree;

* When removing a device (ioctl), where we need to remove a device item from the chunk btree;

* When resizing a device (ioctl), where we need to update a device item in the chunk btree and may need to relocate a system chunk that lies beyond the new device size when shrinking a device.

The problem happens due to a sequence of steps like the following:

1) Task A starts a data or metadata chunk allocation and it locks the chunk mutex;

2) Task B is relocating a system chunk, and when it needs to COW an extent buffer of the chunk btree, it has locked both that extent buffer as well as its parent extent buffer;

3) Since there is not enough available system space, either because none of the existing system block groups have enough free space or because the only one with enough free space is in RO mode due to the relocation, task B triggers a new system chunk allocation. It blocks when trying to acquire the chunk mutex, currently held by task A;

4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert the new chunk item into the chunk btree and update the existing device items there. But in order to do that, it has to lock the extent buffer that task B locked at step 2, or its parent extent buffer, but task B is waiting on the chunk mutex, which is currently locked by task A, therefore resulting in a deadlock.

One example report when the deadlock happens with system chunk relocation:

INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
Not tainted 5.15.0-rc3+ #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u9:5 state:D stack:25936 pid: 546 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993 __down_read_common kernel/locking/rwsem.c:1214 [inline] __down_read kernel/locking/rwsem.c:1223 [inline] down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590 __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47 btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline] btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191 btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline] btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728 btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794 btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504 do_chunk_alloc fs/btrfs/block-group.c:3408 [inline] btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653 flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670 btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953 process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297 worker_thread+0x90/0xed0 kernel/workqueue.c:2444 kthread+0x3e5/0x4d0 kernel/kthread.c:319 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 INFO: task syz-executor:9107 blocked for more than 143 seconds. Not tainted 5.15.0-rc3+ #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
task:syz-executor state:D stack:23200 pid: 9107 ppid: 7792 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425 __mutex_lock_common kernel/locking/mutex.c:669 [inline] __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729 btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631 find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline] find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335 btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415 btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813 __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415 btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570 btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768 relocate_tree_block fs/btrfs/relocation.c:2694 [inline] relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757 relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673 btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070 btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181 __btrfs_balance fs/btrfs/volumes.c:3911 [inline] btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301 btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137 btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae So fix this by making sure that whenever we try to modify the chunk btree and we are neither in a chunk allocation context nor in a chunk remove context, we reserve system space before modifying the chunk btree. 
Reported-by: Hao Sun <sunhao.th@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/ Fixes: 79bd37120b1495 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array") CC: stable@vger.kernel.org # 5.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: use greedy gc for auto reclaimJohannes Thumshirn1-0/+22
Currently auto reclaim of unusable zones reclaims the block-groups in the order they have been added to the reclaim list. Change this to a greedy algorithm by sorting the list so we have the block-groups with the least amount of valid bytes reclaimed first. Note: we can't splice the block groups from reclaim_bgs to let the sort happen outside of the lock. The block groups can be still in use by other parts eg. via bg_list and we must hold unused_bgs_lock while processing them. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> [ write note and comment why we can't splice the list ] Signed-off-by: David Sterba <dsterba@suse.com>
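The greedy ordering can be sketched in userspace as a sort by valid bytes, smallest first. This is a simplified model: `struct model_block_group` and `sort_reclaim_candidates()` are hypothetical names, an array with qsort() stands in for the kernel's list, and no locking is modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a reclaim candidate. */
struct model_block_group {
	uint64_t start;
	uint64_t used; /* valid bytes that must be copied out on reclaim */
};

/* Orders candidates so the one with the least valid data comes first. */
static int cmp_used_ascending(const void *a, const void *b)
{
	const struct model_block_group *x = a;
	const struct model_block_group *y = b;

	/* Compare instead of subtracting: a u64 difference may not fit an int. */
	return (x->used > y->used) - (x->used < y->used);
}

static void sort_reclaim_candidates(struct model_block_group *bgs, size_t n)
{
	assert(bgs || n == 0);
	qsort(bgs, n, sizeof(*bgs), cmp_used_ascending);
}
```

Reclaiming the emptiest block group first frees a whole zone for the least amount of copying, which is the point of a greedy policy.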
2021-10-26btrfs: check-integrity: stop storing the block device name in btrfsic_dev_stateChristoph Hellwig1-91/+110
Just use the %pg format specifier in all the debug printks previously using it. Note that both bdevname and the %pg specifier never print a pathname, so the kbasename call wasn't needed to start with. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ adjust messages and indentation ] Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: use btrfs_get_dev_args_from_path in dev removal ioctlsJosef Bacik3-36/+48
For device removal and replace we call btrfs_find_device_by_devspec, which if we give it a device path and nothing else will call btrfs_get_dev_args_from_path, which opens the block device and reads the super block and then looks up our device based on that. However at this point we're holding the sb write "lock", so reading the block device pulls in the dependency of ->open_mutex, which produces the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc2+ #405 Not tainted ------------------------------------------------------ losetup/11576 is trying to acquire lock: ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0 but task is already holding lock: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x25/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x161/0x390 path_openat+0x3cc/0xa20 do_filp_open+0x96/0x120 do_sys_openat2+0x7b/0x130 __x64_sys_openat+0x46/0x70 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 blkdev_get_by_dev.part.0+0x56/0x3c0 blkdev_get_by_path+0x98/0xa0 btrfs_get_bdev_and_sb+0x1b/0xb0 btrfs_find_device_by_devspec+0x12b/0x1c0 btrfs_rm_device+0x127/0x610 btrfs_ioctl+0x2a31/0x2e70 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#12){.+.+}-{0:0}: lo_write_bvec+0xc2/0x240 [loop] loop_process_work+0x238/0xd00 [loop] process_one_work+0x26b/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: process_one_work+0x245/0x560 worker_thread+0x55/0x3c0 
kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 flush_workqueue+0x91/0x5e0 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/11576: #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] stack backtrace: CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0xcf/0xf0 ? stack_trace_save+0x3b/0x50 __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 ? flush_workqueue+0x67/0x5e0 ? lockdep_init_map_type+0x47/0x220 flush_workqueue+0x91/0x5e0 ? flush_workqueue+0x67/0x5e0 ? verify_cpu+0xf0/0x100 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] ? blkdev_ioctl+0x8d/0x2a0 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f31b02404cb Instead what we want to do is populate our device lookup args before we grab any locks, and then pass these args into btrfs_rm_device(). From there we can find the device and do the appropriate removal. Suggested-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: add a btrfs_get_dev_args_from_path helperJosef Bacik2-32/+68
We are going to want to populate our device lookup args outside of any locks and then do the actual device lookup later, so add a helper to do this work and make btrfs_find_device_by_devspec() use this helper for now. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: handle device lookup with btrfs_dev_lookup_argsJosef Bacik5-65/+112
We have a lot of device lookup functions that all do something slightly different. Clean this up by adding a struct to hold the different lookup criteria, and then pass this around to btrfs_find_device() so it can do the proper matching based on the lookup criteria. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
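The idea of one lookup function driven by a criteria struct can be sketched as below. The types and matching rules are hypothetical and simplified for illustration (`struct model_lookup_args`, `model_find_device()`); the kernel's real struct and its matching order may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_UUID_SIZE 16

struct model_device {
	uint64_t devid;
	uint8_t uuid[MODEL_UUID_SIZE];
	bool missing;
};

/* Hypothetical lookup criteria; a NULL uuid means "don't care". */
struct model_lookup_args {
	uint64_t devid;
	const uint8_t *uuid;
	bool missing; /* when set, match any device flagged as missing */
};

static bool model_args_match(const struct model_lookup_args *args,
			     const struct model_device *dev)
{
	if (args->missing)
		return dev->missing;
	if (dev->devid != args->devid)
		return false;
	if (args->uuid && memcmp(dev->uuid, args->uuid, MODEL_UUID_SIZE) != 0)
		return false;
	return true;
}

/* Single entry point: every caller describes what it wants in args. */
static const struct model_device *
model_find_device(const struct model_device *devs, size_t n,
		  const struct model_lookup_args *args)
{
	assert(args);
	for (size_t i = 0; i < n; i++)
		if (model_args_match(args, &devs[i]))
			return &devs[i];
	return NULL;
}
```

A caller that only knows the device id fills in `devid` and leaves the rest zeroed, while a caller searching for a missing device sets only `missing`.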
2021-10-26btrfs: do not call close_fs_devices in btrfs_rm_deviceJosef Bacik1-1/+9
There's a subtle case where if we're removing the seed device from a file system we need to free its private copy of the fs_devices. However we do not need to call close_fs_devices(), because at this point there are no devices left to close as we've closed the last one. The only thing that close_fs_devices() does is decrement ->opened, which should be 1. We want to avoid calling close_fs_devices() here because it has a lockdep_assert_held(&uuid_mutex), and we are going to stop holding the uuid_mutex in this path. So simply decrement the ->opened counter like we should, and then clean up like normal. Also add a comment explaining what we're doing here as I initially removed this code erroneously. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: add comments for device counts in struct btrfs_fs_devicesAnand Jain1-0/+19
A bug was checking the wrong device count before deleting the struct btrfs_fs_devices in btrfs_rm_device(). To avoid future confusion and for easy reference, add a comment about the various device counts that we have in the struct btrfs_fs_devices. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: use num_device to check for the last surviving seed deviceAnand Jain1-1/+1
For both sprout and seed fsids,
  btrfs_fs_devices::num_devices provides device count including missing
  btrfs_fs_devices::open_devices provides device count excluding missing

We create a dummy struct btrfs_device for the missing device, so num_devices != open_devices when there is a missing device.

In btrfs_rm_device() we wrongly check for %cur_devices->open_devices before freeing the seed fs_devices. Instead we should check for %cur_devices->num_devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
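The difference between the two counters can be made concrete with a small model. `struct model_fs_devices`, `model_remove_device()` and `model_can_free_seed()` are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two counters discussed above. */
struct model_fs_devices {
	unsigned int num_devices;  /* all devices, including missing ones */
	unsigned int open_devices; /* only devices that are actually opened */
};

/* Removes one device from the counts; a missing device was never opened. */
static void model_remove_device(struct model_fs_devices *fsd, bool was_missing)
{
	assert(fsd->num_devices > 0);
	fsd->num_devices--;
	if (!was_missing) {
		assert(fsd->open_devices > 0);
		fsd->open_devices--;
	}
}

/* The seed fs_devices may only go away once no device at all is left. */
static bool model_can_free_seed(const struct model_fs_devices *fsd)
{
	/* open_devices == 0 is not enough: a missing device may still be listed. */
	return fsd->num_devices == 0;
}
```

With one present and one missing device, removing the present one leaves `open_devices` at 0 but `num_devices` at 1, which is exactly the state where checking the wrong counter would free the structure too early.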
2021-10-26btrfs: fix lost error handling when replaying directory deletesFilipe Manana1-1/+3
At replay_dir_deletes(), if find_dir_range() returns an error we break out of the main while loop and then assign a value of 0 (success) to the 'ret' variable, resulting in completely ignoring that an error happened. Fix that by jumping to the 'out' label when find_dir_range() returns an error (negative value). CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
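The lost-error pattern and its fix can be sketched in isolation. The functions below are hypothetical stand-ins: `range_fn` models a call that returns 0 when a range was found, a positive value when there are no more ranges, and a negative errno on failure, which is how the message describes find_dir_range().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in: 0 = range found, >0 = no more ranges, <0 = error. */
typedef int (*range_fn)(int iteration);

static int fails_on_third_call(int iteration)
{
	return iteration < 2 ? 0 : -EIO;
}

static int ends_on_third_call(int iteration)
{
	return iteration < 2 ? 0 : 1;
}

/* Buggy shape: error and "no more ranges" leave the loop the same way. */
static int replay_buggy(range_fn find_range)
{
	int ret;
	int i = 0;

	assert(find_range);
	while (1) {
		ret = find_range(i++);
		if (ret != 0)
			break;
	}
	ret = 0; /* bug: overwrites a negative error code */
	return ret;
}

/* Fixed shape: a negative value skips the "ret = 0" assignment. */
static int replay_fixed(range_fn find_range)
{
	int ret;
	int i = 0;

	assert(find_range);
	while (1) {
		ret = find_range(i++);
		if (ret < 0)
			goto out;
		else if (ret > 0)
			break;
	}
	ret = 0;
out:
	return ret;
}
```

The buggy variant reports success for `fails_on_third_call`, while the fixed variant propagates the error and still returns 0 for a normal end of iteration.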
2021-10-26btrfs: remove btrfs_bio::logical memberQu Wenruo3-11/+8
The member btrfs_bio::logical is only initialized by two call sites:

- btrfs_repair_one_sector()
  No corresponding site to utilize it.

- btrfs_submit_direct()
  The corresponding site to utilize it is btrfs_check_read_dio_bio().

However for btrfs_check_read_dio_bio(), we can grab the file_offset from btrfs_dio_private::file_offset directly.

Thus it turns out we don't really need that btrfs_bio::logical member at all.

For btrfs_bio, the logical bytenr can be fetched from its bio->bi_iter.bi_sector directly.

So let's just remove the member to save 8 bytes for structure btrfs_bio.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rename btrfs_dio_private::logical_offset to file_offsetQu Wenruo2-7/+12
The naming of "logical_offset" can be confused with logical bytenr of the dio range. In fact it's file offset, and the naming "file_offset" is already widely used in all other sites. Just do the rename to avoid confusion. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: use bvec_kmap_local in btrfs_csum_one_bioChristoph Hellwig1-4/+4
Using local kmaps slightly reduces the chances of stray writes, and the bvec interface cleans up the code a little bit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: reduce btrfs_update_block_group alloc argument to boolAnand Jain3-5/+5
btrfs_update_block_group() accounts for the number of bytes allocated or freed. Argument @alloc specifies whether the call is for alloc or free. Convert the argument @alloc type from int to bool. Reviewed-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: make btrfs_ref::real_root optionalNikolay Borisov1-14/+9
Now that real_root is only used in the ref-verify code, gate it behind the CONFIG_BTRFS_FS_REF_VERIFY ifdef. This shrinks the size of pending delayed refs by 8 bytes per ref, of which we can have many at any one time depending on the intensity of the workload. Also change the comment about the member as it no longer deals with qgroups. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: pull up qgroup checks from delayed-ref core to init timeNikolay Borisov5-17/+11
Instead of checking whether qgroup processing for a delayed ref has to happen in the core of delayed ref, simply pull the check up to the init time of the respective delayed ref structures. This eliminates the final use of real_root in the delayed-ref core, paving the way to making this member optional. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_refNikolay Borisov6-22/+39
In order to make 'real_root' used only in ref-verify, it's required to have the necessary context to perform the same checks that this member is used for. So add 'mod_root', which will contain the root on behalf of which a delayed ref was created, and a 'skip_qgroup' parameter, which will contain a callsite-specific override of skip_qgroup. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rely on owning_root field in btrfs_add_delayed_tree_ref to detect ↵Nikolay Borisov1-1/+1
CHUNK_ROOT The real_root field is going to be used only by ref-verify tool so limit its use outside of it. Blocks belonging to the chunk root will always have it as an owner so the check is equivalent. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rename root fields in delayed refs structsNikolay Borisov4-19/+20
Both data and metadata delayed ref structures have fields named root/ref_root respectively. Those are somewhat cryptic and don't really convey the real meaning. In fact those roots are really the original owners of the respective block (i.e in case of a snapshot a data delayed ref will contain the original root that owns the given block). Rename those fields accordingly and adjust comments. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: do not infinite loop in data reclaim if we abortedJosef Bacik1-4/+24
Error injection stressing uncovered a busy loop in our data reclaim loop. There are two cases here, one where we loop creating block groups until space_info->full is set, or in the main loop we will skip erroring out any tickets if space_info->full == 0. Unfortunately if we aborted the transaction then we will never allocate chunks or reclaim any space and thus never get ->full, and you'll see stack traces like this: watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u4:4:139] CPU: 0 PID: 139 Comm: kworker/u4:4 Tainted: G W 5.13.0-rc1+ #328 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: events_unbound btrfs_async_reclaim_data_space RIP: 0010:btrfs_join_transaction+0x12/0x20 RSP: 0018:ffffb2b780b77de0 EFLAGS: 00000246 RAX: ffffb2b781863d58 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000801 RSI: ffff987952b57400 RDI: ffff987940aa3000 RBP: ffff987954d55000 R08: 0000000000000001 R09: ffff98795539e8f0 R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffffff R13: ffff987952b574c8 R14: ffff987952b57400 R15: 0000000000000008 FS: 0000000000000000(0000) GS:ffff9879bbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0703da4000 CR3: 0000000113398004 CR4: 0000000000370ef0 Call Trace: flush_space+0x4a8/0x660 btrfs_async_reclaim_data_space+0x55/0x130 process_one_work+0x1e9/0x380 worker_thread+0x53/0x3e0 ? process_one_work+0x380/0x380 kthread+0x118/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x1f/0x30 Fix this by checking to see if we have a btrfs fs error in either of the reclaim loops, and if so fail the tickets and bail. In addition to this, fix maybe_fail_all_tickets() to not try to grant tickets if we've aborted, simply fail everything. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
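Why the loop spins after an abort, and how an error check ends it, can be sketched with a bounded userspace model. All names (`struct model_space_info`, `model_flush()`, `model_reclaim()`) are hypothetical, and the `max_iter` bound exists only so the busy loop is observable in a test instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the reclaim loop looks at. */
struct model_space_info {
	bool full;            /* set once no more chunks can be allocated */
	bool fs_error;        /* models the filesystem error state after an abort */
	unsigned int tickets; /* pending reservation tickets */
};

/* One flush attempt: after an abort nothing is allocated, so full is never set. */
static void model_flush(struct model_space_info *si, unsigned int *chunks_left)
{
	if (si->fs_error)
		return;
	if (*chunks_left)
		(*chunks_left)--;
	if (*chunks_left == 0)
		si->full = true;
}

/* Returns the number of loop iterations performed. */
static unsigned int model_reclaim(struct model_space_info *si,
				  unsigned int chunks_left,
				  bool check_error, unsigned int max_iter)
{
	unsigned int iter = 0;

	assert(si);
	while (!si->full && iter < max_iter) {
		if (check_error && si->fs_error) {
			si->tickets = 0; /* fail every ticket and bail */
			break;
		}
		model_flush(si, &chunks_left);
		iter++;
	}
	return iter;
}
```

Without the error check the model runs until the artificial bound, which corresponds to the soft lockup in the report; with the check it stops immediately and fails the tickets.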
2021-10-26btrfs: add a BTRFS_FS_ERROR helperJosef Bacik9-19/+19
We have a few flags that are inconsistently used to describe the fs in different states of failure. As of 5963ffcaf383 ("btrfs: always abort the transaction if we abort a trans handle") we will always set BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED and ERROR to see if things have gone wrong. Add a helper to check BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to use the helper. The TRANS_ABORTED bit check was added in af7227338135 ("Btrfs: clean up resources during umount after trans is aborted") but is not actually specific. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: change error handling for btrfs_delete_*_in_logJosef Bacik3-47/+25
Currently we will abort the transaction if we get a random error (like -EIO) while trying to remove the directory entries from the root log during rename. However since these are simply log tree related errors, we can mark the trans as needing a full commit. Then if the error was truly catastrophic we'll hit it during the normal commit and abort as appropriate. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: change handle_fs_error in recover_log_trees to abortsJosef Bacik1-10/+9
During inspection of the return path for replay I noticed that we don't actually abort the transaction if we get a failure during replay. This isn't a problem necessarily, as we properly return the error and will fail to mount. However we still leave this dangling transaction that could conceivably be committed without thinking there was an error. We were using btrfs_handle_fs_error() here, but that pre-dates the transaction abort code. Simply replace the btrfs_handle_fs_error() calls with transaction aborts, so we still know where exactly things went wrong, and add a few in some other un-handled error cases. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: use kmemdup() to replace kmalloc + memcpyKai Song1-3/+1
Fix memdup.cocci warning: fs/btrfs/zoned.c:1198:23-30: WARNING opportunity for kmemdup Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Kai Song <songkai01@inspur.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
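As a rough userspace illustration of the pattern this Coccinelle warning targets: kmemdup() itself only exists inside the kernel, so `sketch_memdup()` below is a hypothetical stand-in built on malloc(). The point is only that the separate allocate and copy steps collapse into one helper call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for kmemdup(): allocate len bytes and
 * copy src into them. Returns NULL if the allocation fails.
 */
void *sketch_memdup(const void *src, size_t len)
{
	void *dst;

	assert(src != NULL);
	dst = malloc(len);
	if (dst)
		memcpy(dst, src, len);
	return dst;
}
```

A caller that previously paired an allocation with a memcpy() of the same length would call the helper once and check the result for NULL.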
2021-10-26btrfs: subpage: only allow compression if the range is fully page alignedQu Wenruo1-4/+44
For compressed write, we use a mechanism called async COW, which unlike regular run_delalloc_cow() or cow_file_range() will also unlock the first page. This mechanism allows us to continue handling next ranges, without waiting for the time consuming compression.

But this has a problem for subpage case, as we could have the following delalloc range for a page:

  0               32K             64K
  |       |///////|       |///////|
          \- A            \- B

In the above case, if we pass both ranges to cow_file_range_async(), both range A and range B will try to unlock the full page [0, 64K). Whichever range finishes later will then try to do other page operations like end_page_writeback() on an unlocked page, triggering VM layer BUG_ON().

To make subpage compression work at least partially, here we add another restriction for it, only allow compression if the delalloc range is fully page aligned. By that, async extent is always ensured to unlock the first page exclusively, just like it used to be for regular sectorsize.

In theory, we only need to make sure the delalloc range fully covers its first page, but the tail page will be locked anyway, blocking later writeback until the compression finishes. Thus here we choose to make sure the range is fully page aligned before doing the compression.

In the future, we could optimize the situation by properly increasing subpage::writers number for the locked page, but that also means we need to change how we run delalloc range of page. (Instead of running each delalloc range we hit, we need to find and lock all delalloc ranges covering the page, then run each of them).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: avoid potential deadlock with compression and delallocQu Wenruo3-15/+40
[BUG]
With experimental subpage compression enabled, a simple fsstress can lead to self deadlock on page 720896:

  mkfs.btrfs -f -s 4k $dev > /dev/null
  mount $dev -o compress $mnt
  $fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156

[CAUSE]
If we have a file layout that looks like below:

  0       32K     64K     96K     128K
  |//|            |///////////////|
     4K

Then we run delalloc range for the inode, it will:

- Call find_lock_delalloc_range() with @delalloc_start = 0
  Then we got a delalloc range [0, 4K).
  This range will be COWed.

- Call find_lock_delalloc_range() again with @delalloc_start = 4K
  Since find_lock_delalloc_range() never cares whether the range is still inside page range [0, 64K), it will return range [64K, 128K).
  This range meets the condition for subpage compression, and will go through the async COW path.
  And async COW path will return @page_started.
  But that @page_started is now for range [64K, 128K), not for range [0, 64K).

- writepage_delalloc() returned 1 for page [0, 64K)
  Thus page [0, 64K) will not be unlocked, nor will its page dirty status be cleared.

Next time when we try to lock page [0, 64K) we will deadlock, as there is no one to release page [0, 64K).

This problem will never happen for regular page size as one page only contains one sector. After the first find_lock_delalloc_range() call, the @delalloc_end will go beyond @page_end no matter if we found a delalloc range or not.

Thus this bug only happens for subpage, as now we need multiple runs to exhaust the delalloc range of a page.

[FIX]
Fix the problem by ensuring the delalloc range we ran at least started inside @locked_page. So that we will never get incorrect @page_started.

And to prevent such problem from happening again:

- Make find_lock_delalloc_range() return false if the found range is beyond @end value passed in.
  Since @end will be utilized now, add an ASSERT() to ensure we pass correct @end into find_lock_delalloc_range().
  This also means, for selftests we need to populate @end before calling find_lock_delalloc_range().

- New ASSERT() in find_lock_delalloc_range()
  Now we will make sure the @start/@end passed in at least covers part of the page.

- New ASSERT() in run_delalloc_range()
  To make sure the range at least starts inside @locked_page.

- Use @delalloc_start as proper cursor, while @delalloc_end is always reset to @page_end.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: handle page locking in btrfs_page_end_writer_lock with no writersQu Wenruo1-0/+10
There are several call sites of extent_clear_unlock_delalloc() which get @locked_page = NULL. So that extent_clear_unlock_delalloc() will try to call process_one_page() to unlock every page even if the first page is not locked by btrfs_page_start_writer_lock(). This will trigger an ASSERT() in btrfs_subpage_end_and_test_writer() as previously we require every page passed to btrfs_subpage_end_and_test_writer() to be locked by btrfs_page_start_writer_lock(). But compression path doesn't go that way. Thankfully it's not hard to distinguish a page locked by lock_page() from one locked by btrfs_page_start_writer_lock(). So do the check in btrfs_subpage_end_and_test_writer() so now it can handle both cases well. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rework page locking in __extent_writepage()Qu Wenruo3-1/+59
Pages passed to __extent_writepage() are always locked, but they may be locked by different functions. There are two types of locked page for __extent_writepage(): - Page locked by plain lock_page() It should not have any subpage::writers count. Can be unlocked by unlock_page(). This is the most common locked page for __extent_writepage() called inside extent_write_cache_pages() or extent_write_full_page(). Rarer cases include the @locked_page from extent_write_locked_range(). - Page locked by lock_delalloc_pages() There is only one caller, all pages except @locked_page for extent_write_locked_range(). In this case, we have to call subpage helper to handle the case. So here we introduce a helper, btrfs_page_unlock_writer(), to allow __extent_writepage() to unlock different locked pages. And since for all other callers of __extent_writepage() their pages are ensured to be locked by lock_page(), also add an extra check for epd::extent_locked to unlock such pages directly. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: make lzo_compress_pages() compatibleQu Wenruo1-136/+134
There are several problems in lzo_compress_pages() preventing it from being subpage compatible:

- No page offset is calculated when reading from inode pages
  For subpage case, we could have @start which is not aligned to PAGE_SIZE.
  Thus the address we read data from must take the offset in page into consideration.

- The padding for segment header is bound to PAGE_SIZE
  This means, for subpage case we can skip several corners where on x86 machines we need to add padding zeros.

The rework will:

- Update the comment to replace "page" with "sector"

- Introduce a new helper, copy_compressed_data_to_page(), to do the copy
  So that we don't need to bother page switching for both input and output.
  Now in lzo_compress_pages() we only care about page switching for input, while in copy_compressed_data_to_page() we only care about the page switching for output.

- Only one main cursor
  For lzo_compress_pages() we use @cur_in as main cursor. It will be the file offset we are currently at.
  All other helper variables will be only declared inside the loop.
  For copy_compressed_data_to_page() it's similar, we will have @cur_out as the main cursor, which records how many bytes are in the output.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: factor uncompressed async extent submission code into a new helperQu Wenruo1-24/+52
Introduce a new helper, submit_uncompressed_range(), for async cow cases where we fall back to COW.

There are some new updates introduced to the helper:

- Proper locked_page detection
  It's possible that the async_extent range doesn't cover the locked page. In that case we shouldn't unlock the locked page.
  In the new helper, we will ensure that we only unlock the locked page when:
  * The locked page covers part of the async_extent range
  * The locked page is not unlocked by cow_file_range() nor extent_write_locked_range()
  This also means extra comments are added focusing on the page locking.

- Add extra comments on a rarely used parameter
  We use @unlock_page = 0 for cow_file_range() at only two call sites, both doing the same thing, including the new helper.
  It's definitely worth some comments.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: make extent_write_locked_range() compatibleQu Wenruo1-4/+11
There are two sites in extent_write_locked_range() that are not subpage compatible yet:

- How @nr_pages are calculated
  For subpage we can have the following range with 64K page size:

  0       32K     64K     96K     128K
  |       |///////|///////|       |

  In that case, 96K - 32K == 64K, so it looks like one page is enough, but the range spans two pages, not one.
  Fix it by doing proper round_up() and round_down() to calculate @nr_pages.
  Also add some extra ASSERT()s to ensure the range passed in is already aligned.

- How the page end is calculated
  Currently we just use cur + PAGE_SIZE - 1 to calculate the page end.
  Which can't handle the above range layout, and will trigger ASSERT() in btrfs_writepage_endio_finish_ordered(), as the range is no longer covered by the page range.
  Fix it by taking page end into consideration.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
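The page-count arithmetic in this commit can be sketched in plain C. This is a minimal userspace illustration, not the kernel code: the 64K page size is fixed by a hypothetical `SKETCH_PAGE_SIZE` constant, the `sketch_*` helpers are made up for the example, and the range end is taken as inclusive.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 64K page size, matching the layout described in the commit. */
#define SKETCH_PAGE_SIZE 65536ULL

/* Round x down or up to a multiple of align (align must be a power of two). */
uint64_t sketch_round_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

uint64_t sketch_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Number of pages touched by the inclusive byte range [start, end]. */
uint64_t sketch_nr_pages(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return (sketch_round_up(end + 1, SKETCH_PAGE_SIZE) -
		sketch_round_down(start, SKETCH_PAGE_SIZE)) / SKETCH_PAGE_SIZE;
}

/* Length-only count: misses a range that crosses a page boundary. */
uint64_t sketch_nr_pages_naive(uint64_t start, uint64_t end)
{
	return (end - start + SKETCH_PAGE_SIZE) / SKETCH_PAGE_SIZE;
}

/* End of the current step: stop at the page end or the range end, whichever is first. */
uint64_t sketch_cur_end(uint64_t cur, uint64_t end)
{
	uint64_t page_end = sketch_round_down(cur, SKETCH_PAGE_SIZE) +
			    SKETCH_PAGE_SIZE - 1;

	return page_end < end ? page_end : end;
}
```

For the range [32K, 96K), the rounded version reports two pages while the length-only version reports one, and the first step ends at 64K - 1 rather than at cur + PAGE_SIZE - 1.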
2021-10-26btrfs: subpage: make end_compressed_bio_writeback() compatibleQu Wenruo1-1/+3
In end_compressed_writeback() we just clear the full page writeback. For subpage case, if there are two delalloc ranges in the same page, the 2nd range will trigger a BUG_ON() as the page writeback is already cleared by previous range. Fix it by using btrfs_page_clamp_clear_writeback() helper. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: make btrfs_submit_compressed_write() compatibleQu Wenruo1-1/+2
There is a WARN_ON() checking if @start is aligned to PAGE_SIZE, not sectorsize, which will cause false alert for subpage. Fix it to check against sectorsize. Furthermore: - Use ASSERT() to do the check So that in the future we may skip the check for production build - Also check alignment for @len Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
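A small userspace sketch of the check described here, under the assumption that the sector size is a power of two. The `sketch_*` names are hypothetical, and plain assert() stands in for the kernel's ASSERT().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x is a multiple of align; align must be a power of two. */
bool sketch_is_aligned(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x & (align - 1)) == 0;
}

/* Check both the start and the length against the sector size, not the page size. */
bool sketch_range_sector_aligned(uint64_t start, uint64_t len,
				 uint32_t sectorsize)
{
	return sketch_is_aligned(start, sectorsize) &&
	       sketch_is_aligned(len, sectorsize);
}
```

With a 4K sector size and a 64K page size, a range starting at 4K with length 8K passes the sector check but would fail a page-size check, which is the false alarm the commit removes.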
2021-10-26btrfs: subpage: make compress_file_range() compatibleQu Wenruo1-1/+1
In function compress_file_range(), when the compression is finished, the function just rounds up @total_in to PAGE_SIZE. This is fine for regular sectorsize == PAGE_SIZE case, but not for subpage. Just change the ALIGN(, PAGE_SIZE) to round_up(, sectorsize) so that both regular sectorsize and subpage sectorsize will be happy. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: cleanup for extent_write_locked_range()Qu Wenruo3-21/+32
There are several cleanups for extent_write_locked_range(), most of them are pure cleanups, but with some preparation for future subpage support.

- Add a proper comment for which call sites are suitable
  Unlike regular synchronized extent write back, if async COW or zoned COW happens, we have all pages in the range still locked.
  Thus for those (only) two call sites, we need this function to submit page content into bios and submit them.

- Remove @mode parameter
  Both existing call sites pass WB_SYNC_ALL. No need for @mode parameter.

- Better error handling
  Currently if we hit an error during the page iteration loop, we overwrite @ret, so only the last error is recorded.
  Here we add @found_error and @first_error variables to record whether we hit any error, and the first error we hit. So the first error won't get lost.

- Don't reuse @start as the cursor
  We reuse the parameter @start as the cursor to iterate the range, not a big problem, but since we're here, introduce a proper @cur as the cursor.

- Remove impossible branch
  Since all pages are still locked after the ordered extent is inserted, there is no way that a page can get its dirty bit cleared.
  Remove the branch where page is not dirty and replace it with an ASSERT().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
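The "better error handling" item boils down to a common loop pattern, sketched below in userspace C. The function and the array of per-page results are hypothetical; only the @found_error / @first_error idea comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Walk every per-page result, keep iterating after a failure, and return
 * the first negative result seen (0 if every step succeeded).
 */
int sketch_first_error(const int *results, size_t n)
{
	bool found_error = false;
	int first_error = 0;

	assert(results != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int ret = results[i];

		/* Record only the first failure; later ones must not overwrite it. */
		if (ret < 0 && !found_error) {
			found_error = true;
			first_error = ret;
		}
	}
	return found_error ? first_error : 0;
}
```

Overwriting a single return variable on every iteration would instead report whichever failure happened last, or even success if the final step worked.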
2021-10-26btrfs: refactor submit_compressed_extents()Qu Wenruo1-141/+131
We have a big chunk of code inside a while() loop, with tons of strange jumps for error handling. It's definitely not up to today's code standard.

Move the code into a new function, submit_one_async_extent().

Since we're here, also do the following changes:

- Comment style change
  To follow the current scheme

- Don't fall back to non-compressed write when hitting ENOSPC
  If we hit ENOSPC for compressed write, how could we reserve more space for non-compressed write?
  Thus we go error path directly.
  This removes the retry: label.

- Add more comments for the super long parameter list
  Explain what each parameter is for, so we don't need to check the prototype.

- Move the error handling to submit_one_async_extent()
  Thus no strange code like:

	out_free:
		...
		goto again;

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: remove unused function btrfs_bio_fits_in_stripe()Qu Wenruo2-44/+0
As the last caller in compression.c has been removed, we don't need that function anymore. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: determine stripe boundary at bio allocation time in ↵Qu Wenruo2-85/+59
btrfs_submit_compressed_write Currently btrfs_submit_compressed_write() will check btrfs_bio_fits_in_stripe() each time a new page is going to be added. Even if compressed extent is small, we don't really need to do that for every page. Align the behavior to extent_io.c, by determining the stripe boundary when allocating a bio. Unlike extent_io.c, in compression.c we don't need to bother with things like different bio flags, thus no need to re-use bio_ctrl. Here we just manually introduce new local variable, next_stripe_start, and use that value returned from alloc_compressed_bio() to calculate the stripe boundary. Then each time we add some page range into the bio, we check if we reached the boundary. And if reached, submit it. Also, since we have @cur_disk_bytenr to determine whether we're the last bio, we don't need an explicit last_bio: tag for error handling any more. And since we use @cur_disk_bytenr to wait, there is no need for pending_bios, also remove it to save some memory of compressed_bio. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: determine stripe boundary at bio allocation time in ↵Qu Wenruo1-71/+97
btrfs_submit_compressed_read Currently btrfs_submit_compressed_read() will check btrfs_bio_fits_in_stripe() each time a new page is going to be added. Even if compressed extent is small, we don't really need to do that for every page. This patch will align the behavior to extent_io.c, by determining the stripe boundary when allocating a bio. Unlike extent_io.c, in compression.c we don't need to bother with things like different bio flags, thus no need to re-use bio_ctrl. Here we just manually introduce new local variable, next_stripe_start, and teach alloc_compressed_bio() to calculate the stripe boundary. Then each time we add some page range into the bio, we check if we reached the boundary. And if reached, submit it. Also, since we have @cur_disk_byte to determine whether we're the last bio, we don't need an explicit last_bio: tag for error handling any more. And since @cur_disk_byte tracks which range has been added to the bio, we can also use it to calculate the wait condition, so there is no need for @pending_bios. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: introduce alloc_compressed_bio() for compressionQu Wenruo1-32/+58
Just aggregate the bio allocation code into one helper, so that we can replace 4 call sites. There is one special note for zoned write. Currently btrfs_submit_compressed_write() will only allocate the first bio using ZONE_APPEND. If we have to submit current bio due to stripe boundary, the new bio allocated will not use ZONE_APPEND. In theory this should be a bug, but considering zoned mode currently only supports SINGLE profile, which doesn't have any stripe boundary limit, it should never be a problem and we have assertions in place. This function will provide a good entry point for any work which needs to be done at bio allocation time, like determining the stripe boundary. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: introduce submit_compressed_bio() for compressionQu Wenruo1-26/+19
The new helper, submit_compressed_bio(), will aggregate the following work:

- Increase compressed_bio::pending_bios
- Remap the endio function
- Map and submit the bio

This slightly reorders calls to btrfs_csum_one_bio or btrfs_lookup_bio_sums but none of them does anything regarding IO submission so this is effectively no change. We mainly care about order of

- atomic_inc
- btrfs_bio_wq_end_io
- btrfs_map_bio

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: handle errors properly inside btrfs_submit_compressed_write()Qu Wenruo1-36/+62
Just like btrfs_submit_compressed_read(), there are quite a few BUG_ON()s inside btrfs_submit_compressed_write() for the bio submission path.

Fix them using the same method:

- For last bio, just endio the bio
  As in that case, one of the endio functions of all these submitted bios will be able to free the compressed_bio

- For half-submitted bio, wait and finish the compressed_bio manually
  In this case, as long as all other bios finish, we're the only one referring to the compressed bio, and can manually finish it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: handle errors properly inside btrfs_submit_compressed_read()Qu Wenruo1-50/+83
There are quite a few BUG_ON()s inside btrfs_submit_compressed_read(); namely, all errors inside the for() loop rely on BUG_ON() to handle -ENOMEM.

Handle these errors properly by:

- Wait for submitted bios to finish first
  Using wait_var_event() APIs to wait without introducing extra memory overhead inside compressed_bio.
  This allows us to wait for any submitted bio to finish, while still keeping the compressed_bio from being freed.

- Introduce finish_compressed_bio_read() to finish the compressed_bio

- Properly end the bio and finish compressed_bio when error happens

Now in btrfs_submit_compressed_read() even when the bio submission failed, we can properly handle the error without triggering BUG_ON().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: add bitmap for PageChecked flagQu Wenruo7-27/+85
Although in btrfs we have very limited usage of PageChecked flag, it is still a page flag that is not yet subpage compatible. Fix it by introducing btrfs_subpage::checked_offset to do the conversion.

For most call sites, especially for free-space cache, COW fixup and btrfs_invalidatepage(), they all work in full page mode anyway. For other call sites, they work in subpage compatible mode.

Some call sites need extra modification:

- btrfs_drop_pages()
  Needs extra parameter to get the real range we need to clear checked flag.
  Also since btrfs_drop_pages() will accept pages beyond the dirtied range, update btrfs_subpage_clamp_range() to handle such case by setting @len to 0 if the page is beyond target range.

- btrfs_invalidatepage()
  We need to call subpage helper before calling __btrfs_releasepage(), or it will trigger ASSERT() as page->private will be cleared.

- btrfs_verify_data_csum()
  In theory we don't need the io_bio->csum check anymore, but it won't hurt. Just change the comment.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
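The clamping behaviour mentioned for btrfs_drop_pages() can be pictured with a small userspace sketch. It is only an illustration of the idea, not the kernel's btrfs_subpage_clamp_range(): the function name, the explicit page_start and page_size parameters, and the choice to leave *start untouched for a non-overlapping page are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clamp the byte range [*start, *start + *len) to the page that begins at
 * page_start. If the page does not overlap the range at all, *len becomes 0
 * so callers can treat it as "nothing to do".
 */
void sketch_clamp_range(uint64_t page_start, uint64_t page_size,
			uint64_t *start, uint64_t *len)
{
	uint64_t range_start = *start;
	uint64_t range_end = *start + *len;		/* exclusive */
	uint64_t page_end = page_start + page_size;	/* exclusive */

	assert(page_size != 0);
	if (page_start >= range_end || page_end <= range_start) {
		*len = 0;
		return;
	}
	*start = range_start > page_start ? range_start : page_start;
	*len = (range_end < page_end ? range_end : page_end) - *start;
}
```

A page lying past the end of the target range ends up with a zero length, which is the case the commit message says the clamp helper now has to tolerate.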
2021-10-26btrfs: introduce compressed_bio::pending_sectors to trace compressed bioQu Wenruo2-35/+47
For btrfs_submit_compressed_read() and btrfs_submit_compressed_write(), we have a pretty weird dance around compressed_bio::pending_bios:

  btrfs_submit_compressed_read/write()
  {
	cb = kmalloc()
	refcount_set(&cb->pending_bios, 0);
	bio = btrfs_alloc_bio();

	/* NOTE here, we haven't yet submitted any bio */
	refcount_set(&cb->pending_bios, 1);

	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
		if (submit) {
			/* Here we submit bio, but we always have one
			 * extra pending_bios */
			refcount_inc(&cb->pending_bios);
			ret = btrfs_map_bio();
		}
	}

	/* Submit the last bio */
	ret = btrfs_map_bio();
  }

There are two reasons why we do this:

- compressed_bio::pending_bios is a refcount
  Thus if it's reduced to 0, it can not be increased again.

- To ensure the compressed_bio is not freed by some submitted bios
  If the submitted bio is finished before the next bio submitted, we can free the compressed_bio completely.

But the above code is sometimes confusing, and we can do it better by introducing a new member, compressed_bio::pending_sectors.

Now we use compressed_bio::pending_sectors to indicate whether we have any pending sectors under IO or not yet submitted.

If pending_sectors == 0, we're definitely the last bio of compressed_bio, and it is OK to release the compressed bio.

Now the workflow looks like this:

  btrfs_submit_compressed_read/write()
  {
	cb = kmalloc()
	atomic_set(&cb->pending_bios, 0);
	refcount_set(&cb->pending_sectors,
		     compressed_len >> sectorsize_bits);
	bio = btrfs_alloc_bio();

	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
		if (submit) {
			atomic_inc(&cb->pending_bios);
			ret = btrfs_map_bio();
		}
	}

	/* Submit the last bio */
	atomic_inc(&cb->pending_bios);
	ret = btrfs_map_bio();
  }

For now we still need pending_bios for later error handling, but will remove pending_bios eventually after properly handling the errors.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
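The pending_sectors idea can be modelled in userspace with C11 atomics. This is a sketch under assumptions: `struct sketch_compressed_bio` and both helpers are hypothetical, and an `atomic_uint` stands in for the kernel's refcount type. Each finished bio subtracts the sectors it covered, and only the completion that reaches zero may release the structure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for struct compressed_bio. */
struct sketch_compressed_bio {
	atomic_uint pending_sectors;
};

/* Start with every sector of the compressed extent still pending. */
void sketch_cb_init(struct sketch_compressed_bio *cb, uint32_t compressed_len,
		    unsigned int sectorsize_bits)
{
	atomic_init(&cb->pending_sectors, compressed_len >> sectorsize_bits);
}

/*
 * Called when a bio covering bio_bytes finishes. Returns true only for the
 * completion that brings the counter to zero, i.e. the one that is allowed
 * to release the compressed bio.
 */
bool sketch_cb_end_io(struct sketch_compressed_bio *cb, uint32_t bio_bytes,
		      unsigned int sectorsize_bits)
{
	unsigned int nr = bio_bytes >> sectorsize_bits;
	unsigned int old = atomic_fetch_sub(&cb->pending_sectors, nr);

	assert(old >= nr);	/* never complete more than was pending */
	return old == nr;
}
```

Because the counter starts at the full sector count rather than at a bio count, a bio that completes before the next one is submitted cannot drive it to zero early, so no extra placeholder reference is needed.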
2021-10-26btrfs: subpage: make add_ra_bio_pages() compatibleQu Wenruo2-32/+59
[BUG]
If we remove the subpage limitation in add_ra_bio_pages(), then read a compressed extent which has part of its range in next page, like the following inode layout:

  0               32K             64K             96K             128K
                  |<--------------|-------------->|

Btrfs will trigger ASSERT() in endio function:

  assertion failed: atomic_read(&subpage->readers) >= nbits
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3431!
  Internal error: Oops - BUG: 0 [#1] SMP
  Workqueue: btrfs-endio btrfs_work_helper [btrfs]
  Call trace:
   assertfail.constprop.0+0x28/0x2c [btrfs]
   btrfs_subpage_end_reader+0x148/0x14c [btrfs]
   end_page_read+0x8c/0x100 [btrfs]
   end_bio_extent_readpage+0x320/0x6b0 [btrfs]
   bio_endio+0x15c/0x1dc
   end_workqueue_fn+0x44/0x64 [btrfs]
   btrfs_work_helper+0x74/0x250 [btrfs]
   process_one_work+0x1d4/0x47c
   worker_thread+0x180/0x400
   kthread+0x11c/0x120
   ret_from_fork+0x10/0x30
  ---[ end trace c8b7b552d3bb408c ]---

[CAUSE]
When we read the page range [0, 64K), we find it's a compressed extent, and we will try to add extra pages in add_ra_bio_pages() to avoid reading the same compressed extent.

But when we add such page into the read bio, it doesn't follow the behavior of btrfs_do_readpage() to properly set subpage::readers.

This means, for page [64K, 128K), its subpage::readers is still 0.

And when endio is executed on both pages, since page [64K, 128K) has 0 subpage::readers, it triggers above ASSERT().

[FIX]
Function add_ra_bio_pages() is far from subpage compatible, it always assumes PAGE_SIZE == sectorsize, thus when it skips to the next range it always just skips PAGE_SIZE.

Make it subpage compatible by:

- Skip to next page properly when needed
  If we find there is already a page cache, we need to skip to next page.
  For that case, we shouldn't just skip PAGE_SIZE bytes, but use @pg_index to calculate the next bytenr and continue.

- Only add the page range covered by current extent map
  We need to calculate which range is covered by current extent map and only add that part into the read bio.

- Update subpage::readers before submitting the bio

- Use proper cursor other than confusing @last_offset

- Calculate the missed threshold based on sector size
  It's no longer using missed pages, as for 64K page size, we have at most 3 pages to skip. (If aligned only 2 pages)

- Add ASSERT() to make sure our bytenr is always aligned

- Add comment for the function
  Add a special note for subpage case, as the function won't really work well for subpage cases.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: don't pass compressed pages to btrfs_writepage_endio_finish_ordered()Qu Wenruo1-4/+1
Since async_extent holds the compressed page, it would trigger the new ASSERT() in btrfs_mark_ordered_io_finished() which checks that the range is inside the page. Now btrfs_writepage_endio_finish_ordered() can accept @page == NULL, just pass NULL to btrfs_writepage_endio_finish_ordered(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: use async_chunk::async_cow to replace the confusing pending pointerQu Wenruo1-9/+7
For structure async_chunk, we use a very strange member layout to grab structure async_cow who owns this async_chunk. At initialization, it goes like this: async_chunk[i].pending = &ctx->num_chunks; Then at async_cow_free() we do a super weird freeing: /* * Since the pointer to 'pending' is at the beginning of the array of * async_chunk's, freeing it ensures the whole array has been freed. */ if (atomic_dec_and_test(async_chunk->pending)) kvfree(async_chunk->pending); This is absolutely an abuse of kvfree(). Replace async_chunk::pending with async_chunk::async_cow, so that we can grab the async_cow structure directly, without this strange dancing. And with this change, there is no requirement for any specific member location. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
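The difference between the two layouts can be sketched in userspace C. Everything below is hypothetical scaffolding (the `sketch_*` types, malloc()/free() in place of the kernel allocators, C11 atomics in place of atomic_t); it only illustrates a chunk holding an explicit pointer to its owner instead of a pointer to the owner's counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct sketch_async_cow;

/* Each chunk points back at the structure that owns the whole array. */
struct sketch_async_chunk {
	struct sketch_async_cow *async_cow;
};

struct sketch_async_cow {
	atomic_int num_chunks;			/* may sit anywhere in the struct */
	struct sketch_async_chunk chunks[];
};

struct sketch_async_cow *sketch_async_cow_alloc(int nr)
{
	struct sketch_async_cow *ctx;

	assert(nr > 0);
	ctx = malloc(sizeof(*ctx) + (size_t)nr * sizeof(ctx->chunks[0]));
	if (!ctx)
		return NULL;
	atomic_init(&ctx->num_chunks, nr);
	for (int i = 0; i < nr; i++)
		ctx->chunks[i].async_cow = ctx;
	return ctx;
}

/* Drop one chunk; the last drop frees the owner. Returns true if it was freed. */
bool sketch_async_chunk_put(struct sketch_async_chunk *chunk)
{
	struct sketch_async_cow *ctx = chunk->async_cow;

	if (atomic_fetch_sub(&ctx->num_chunks, 1) == 1) {
		free(ctx);	/* frees the owner itself, not a member address */
		return true;
	}
	return false;
}
```

Since the free goes through the owner pointer, the counter no longer has to be the first member for the release to hit the start of the allocation.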
2021-10-26btrfs: remove unnecessary parameter delalloc_start for writepage_delalloc()Qu Wenruo1-7/+7
In function __extent_writepage() we always pass page start to @delalloc_start for writepage_delalloc(). Thus we don't really need @delalloc_start parameter as we can extract it from @page. Remove @delalloc_start parameter and make __extent_writepage() to declare @page_start and @page_end as const. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: remove unused parameter nr_pages in add_ra_bio_pages()Qu Wenruo1-2/+0
Variable @nr_pages only gets increased but never used. Remove it. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: use single bulk copy operations when logging directoriesFilipe Manana1-10/+15
When logging a directory and inserting a batch of directory items, we are copying the data of each item from a leaf in the fs/subvolume tree to a leaf in a log tree, separately. This is not really needed, since we are copying from a contiguous memory area into another one, so we can use a single copy operation to copy all items at once.

This patch is part of a small patchset that is comprised of the following patches:

  btrfs: loop only once over data sizes array when inserting an item batch
  btrfs: unexport setup_items_for_insert()
  btrfs: use single bulk copy operations when logging directories

This is patch 3/3.

The following test was used to compare performance of a branch without the patchset versus one branch that has the whole patchset applied:

  $ cat dir-fsync-test.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1

  NUM_NEW_FILES=1000000
  NUM_FILE_DELETES=1000
  LEAF_SIZE=16K

  mkfs.btrfs -f -n $LEAF_SIZE $DEV
  mount -o ssd $DEV $MNT

  mkdir $MNT/testdir

  for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
      echo -n > $MNT/testdir/file_$i
  done

  # Fsync the directory, this will log the new dir items and the inodes
  # they point to, because these are new inodes.
  start=$(date +%s%N)
  xfs_io -c "fsync" $MNT/testdir
  end=$(date +%s%N)

  dur=$(( (end - start) / 1000000 ))
  echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"

  # sync to force transaction commit and wipeout the log.
  sync

  del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
  for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
      rm -f $MNT/testdir/file_$i
  done

  # Fsync the directory, this will only log dir items, there are no
  # dentries pointing to new inodes.
  start=$(date +%s%N)
  xfs_io -c "fsync" $MNT/testdir
  end=$(date +%s%N)

  dur=$(( (end - start) / 1000000 ))
  echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"

  umount $MNT

The tests were run on a non-debug kernel (Debian's default kernel config) and were the following:

  *** with a leaf size of 16K, before patchset ***
  dir fsync took 8482 ms after adding 1000000 files
  dir fsync took 166 ms after deleting 1000 files

  *** with a leaf size of 16K, after patchset ***
  dir fsync took 8196 ms after adding 1000000 files (-3.4%)
  dir fsync took 143 ms after deleting 1000 files (-14.9%)

  *** with a leaf size of 64K, before patchset ***
  dir fsync took 12851 ms after adding 1000000 files
  dir fsync took 466 ms after deleting 1000 files

  *** with a leaf size of 64K, after patchset ***
  dir fsync took 12287 ms after adding 1000000 files (-4.5%)
  dir fsync took 414 ms after deleting 1000 files (-11.8%)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
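The core idea of this patch, replacing one copy per item with a single copy over a contiguous area, can be shown with a small userspace sketch. The fixed-size `struct sketch_item` and both helpers are hypothetical; real directory items are variable length and live in tree leaves, which this example does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size item, used only to get a contiguous array. */
struct sketch_item {
	uint64_t objectid;
	uint32_t data_len;
	uint32_t flags;
};

/* One memcpy() call per item. */
void sketch_copy_each(struct sketch_item *dst, const struct sketch_item *src,
		      size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		memcpy(&dst[i], &src[i], sizeof(src[i]));
}

/* One memcpy() call for the whole contiguous batch. */
void sketch_copy_bulk(struct sketch_item *dst, const struct sketch_item *src,
		      size_t nr)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, nr * sizeof(*src));
}
```

Both helpers produce identical destination contents; the bulk variant simply makes one call instead of nr calls.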
2021-10-26btrfs: unexport setup_items_for_insert()Filipe Manana5-59/+68
Since setup_items_for_insert() is not used anymore outside of ctree.c, make it static and remove its prototype from ctree.h. This also requires moving the definition of setup_item_for_insert() from ctree.h to ctree.c and moving down btrfs_duplicate_item() so that it's defined after setup_items_for_insert(). Further, since setup_item_for_insert() is used outside ctree.c, rename it to btrfs_setup_item_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 2/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: loop only once over data sizes array when inserting an item batchFilipe Manana8-72/+131
When inserting a batch of items into a btree, we end up looping over the data sizes array 3 times: 1) Once in the caller of btrfs_insert_empty_items(), when it populates the array with the data sizes for each item; 2) Once at btrfs_insert_empty_items() to sum the elements of the data sizes array and compute the total data size; 3) And then once again at setup_items_for_insert(), where we do exactly the same as what we do at btrfs_insert_empty_items(), to compute the total data size. That is not bad for small arrays, but when the arrays have hundreds of elements, the time spent on looping is not negligible. For example when doing batch inserts of delayed items for dir index items or when logging a directory, it's common to have 200 to 260 dir index items in a single batch when using a leaf size of 16K and using file names between 8 and 12 characters. For a 64K leaf size, multiply that by 4. Taking into account that during directory logging or when flushing delayed dir index items we can have many of those large batches, the time spent on the looping adds up quickly. It's also more important to avoid it at setup_items_for_insert(), since we are holding a write lock on a leaf and, in some cases, on upper nodes of the btree, which causes us to block other tasks that want to access the leaf and nodes for longer than necessary. So change the code so that setup_items_for_insert() and btrfs_insert_empty_items() no longer compute the total data size, and instead rely on the caller to supply it. This makes us loop over the array only once, where we can both populate the data size array and compute the total data size, taking advantage of spatial and temporal locality. To make this more manageable, use a structure to contain all the relevant details for a batch of items (keys array, data sizes array, total data size, number of items), and use it as an argument for btrfs_insert_empty_items() and setup_items_for_insert(). 
This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 1/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
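As a rough illustration of the single-pass idea in patch 1/3, the sketch below fills the data sizes array and accumulates the total in the same loop, then hands both to the consumer in one structure. The struct and function names are hypothetical, the keys array mentioned in the changelog is left out for brevity, and using name lengths as data sizes is only an assumption made to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the batch structure described above. */
struct item_batch {
    const uint32_t *data_sizes;  /* data size of each item */
    uint32_t total_data_size;    /* sum of data_sizes[0..nr-1] */
    int nr;                      /* number of items in the batch */
};

/*
 * Populate the sizes array and compute the total in one pass, so that
 * later stages do not need to loop over the array again.
 */
static void fill_batch(struct item_batch *batch, uint32_t *sizes,
                       const char *const *names, int nr)
{
    uint32_t total = 0;

    assert(batch != NULL && sizes != NULL && names != NULL);
    for (int i = 0; i < nr; i++) {
        sizes[i] = (uint32_t)strlen(names[i]);
        total += sizes[i];
    }
    batch->data_sizes = sizes;
    batch->total_data_size = total;
    batch->nr = nr;
}
```

A consumer that receives the filled structure can read total_data_size directly instead of summing the array while holding locks.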
2021-10-26btrfs: remove btrfs_raid_bio::fs_info memberQu Wenruo4-44/+41
We can grab fs_info reliably from btrfs_raid_bio::bioc, as the bioc is always passed into alloc_rbio(), and only gets released when the raid bio is released. Remove the btrfs_raid_bio::fs_info member, and clean up all the @fs_info parameters for alloc_rbio() callers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: make sure btrfs_io_context::fs_info is always initializedQu Wenruo1-4/+5
Currently btrfs_io_context::fs_info is only initialized in btrfs_map_bio(), but there are call sites like btrfs_map_sblock() which call __btrfs_map_block() directly, leaving bioc::fs_info uninitialized (NULL). Currently this is fine, but a later cleanup will rely on bioc::fs_info to grab fs_info, and this can be a hidden problem for such usage. This patch removes the hidden uninitialized member by always assigning bioc::fs_info at alloc_btrfs_io_context(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: assert that extent buffers are write locked instead of only lockedFilipe Manana5-14/+15
We currently use lockdep_assert_held() at btrfs_assert_tree_locked(), and that checks that we hold a lock either in read mode or write mode. However in all contexts we use btrfs_assert_tree_locked(), we actually want to check if we are holding a write lock on the extent buffer's rw semaphore - it would be a bug if in any of those contexts we were holding a read lock instead. So change btrfs_assert_tree_locked() to use lockdep_assert_held_write() instead and, to make it more explicit, rename btrfs_assert_tree_locked() to btrfs_assert_tree_write_locked(), so that it's clear we want to check we are holding a write lock. For now there are no contexts where we want to assert that we must have a read lock, but in case that is needed in the future, we can add a new helper function that just calls out lockdep_assert_held_read(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: do not take the uuid_mutex in btrfs_rm_deviceJosef Bacik1-5/+5
We got the following lockdep splat while running fstests (specifically btrfs/003 and btrfs/020 in a row) with the new rc. This was uncovered by 87579e9b7d8d ("loop: use worker per cgroup instead of kworker") which converted loop to using workqueues, which comes with lockdep annotations that don't exist with kworkers. The lockdep splat is as follows: WARNING: possible circular locking dependency detected 5.14.0-rc2-custom+ #34 Not tainted ------------------------------------------------------ losetup/156417 is trying to acquire lock: ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600 but task is already holding lock: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x28/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x163/0x3a0 path_openat+0x74d/0xa40 do_filp_open+0x9c/0x140 do_sys_openat2+0xb1/0x170 __x64_sys_openat+0x54/0x90 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #4 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 blkdev_get_by_dev.part.0+0xd1/0x3c0 blkdev_get_by_path+0xc0/0xd0 btrfs_scan_one_device+0x52/0x1f0 [btrfs] btrfs_control_ioctl+0xac/0x170 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (uuid_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 btrfs_rm_device+0x48/0x6a0 [btrfs] btrfs_ioctl+0x2d1c/0x3110 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#11){.+.+}-{0:0}: lo_write_bvec+0x112/0x290 [loop] loop_process_work+0x25f/0xcb0 [loop] process_one_work+0x28f/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: 
process_one_work+0x266/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x1130/0x1dc0 lock_acquire+0xf5/0x320 flush_workqueue+0xae/0x600 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x650 [loop] lo_ioctl+0x29d/0x780 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/156417: #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop] stack backtrace: CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0x10a/0x120 __lock_acquire+0x1130/0x1dc0 lock_acquire+0xf5/0x320 ? flush_workqueue+0x84/0x600 flush_workqueue+0xae/0x600 ? flush_workqueue+0x84/0x600 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x650 [loop] lo_ioctl+0x29d/0x780 [loop] ? __lock_acquire+0x3a0/0x1dc0 ? update_dl_rq_load_avg+0x152/0x360 ? lock_is_held_type+0xa5/0x120 ? find_held_lock.constprop.0+0x2b/0x80 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f645884de6b Usually the uuid_mutex exists to protect the fs_devices that map together all of the devices that match a specific uuid. In rm_device we're messing with the uuid of a device, so it makes sense to protect that here. 
However in doing that it pulls in a whole host of lockdep dependencies, as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus we end up with the dependency chain under the uuid_mutex being added under the normal sb write dependency chain, which causes problems with loop devices. We don't need the uuid mutex here however. If we call btrfs_scan_one_device() before we scratch the super block we will find the fs_devices and not find the device itself and return EBUSY because the fs_devices is open. If we call it after the scratch happens it will not appear to be a valid btrfs file system. We do not need to worry about other fs_devices modifying operations here because we're protected by the exclusive operations locking. So drop the uuid_mutex here in order to fix the lockdep splat. A more detailed explanation from the discussion: We are worried about rm and scan racing with each other, before this change we'll zero the device out under the UUID mutex so when scan does run it'll make sure that it can go through the whole device scan thing without rm messing with us. We aren't worried if the scratch happens first, because the result is we don't think this is a btrfs device and we bail out. The only case we are concerned with is we scratch _after_ scan is able to read the superblock and gets a seemingly valid super block, so lets consider this case. Scan will call device_list_add() with the device we're removing. We'll call find_fsid_with_metadata_uuid() and get our fs_devices for this UUID. At this point we lock the fs_devices->device_list_mutex. This is what protects us in this case, but we have two cases here. 1. We aren't to the device removal part of the RM. We found our device, and device name matches our path, we go down and we set total_devices to our super number of devices, which doesn't affect anything because we haven't done the remove yet. 2. We are past the device removal part, which is protected by the device_list_mutex. 
Scan doesn't find the device, it goes down and does the if (fs_devices->opened) return -EBUSY; check and we bail out. Nothing about this situation is ideal, but the lockdep splat is real, and the fix is safe, though admittedly a bit scary looking. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy more from the discussion ] Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rename struct btrfs_io_bio to btrfs_bioQu Wenruo13-106/+107
Previously we had "struct btrfs_bio", which records IO context for mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra btrfs specific info for logical bytenr bio. With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now. The struct btrfs_bio changes meaning by this commit. There was a suggested name like btrfs_logical_bio but it's a bit long and we'd prefer to use a shorter name. This could be a concern for backports to older kernels where the different meaning could possibly cause confusion or bugs. Comparing the new and old structures, there's no overlap among the struct members so a build would break in case of incorrect backport. We haven't had many backports to bio code anyway so this is more of a theoretical cause of bugs and a matter of precaution but we'll need to keep the semantic change in mind. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: remove btrfs_bio_alloc() helperQu Wenruo4-25/+19
The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(), except it's allocating using BIO_MAX_VECS as @nr_iovecs, and initializes bio->bi_iter.bi_sector. However the naming itself is not using "btrfs_io_bio" to indicate its parameter is "struct btrfs_io_bio" and can be easily confused with "struct btrfs_bio". Considering that assigning bio->bi_iter.bi_sector is such a simple task and there are already tons of call sites doing that manually, there is no need to do that in a helper. Remove the btrfs_bio_alloc() helper, and enhance the btrfs_io_bio_alloc() function to provide a fail-safe value for its @nr_iovecs. And then replace all btrfs_bio_alloc() callers with btrfs_io_bio_alloc(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rename btrfs_bio to btrfs_io_contextQu Wenruo11-315/+325
The structure btrfs_bio is used by two different sites:

- bio->bi_private for mirror based profiles

  For those profiles (SINGLE/DUP/RAID1*/RAID10), this structure records how many mirrors are still pending, and saves the original endio function of the bio.

- RAID56 code

  In that case, RAID56 only utilizes the stripes info, and no longer uses it to trace the pending mirrors.

So btrfs_bio is not always bound to a bio, and contains more info for IO context, thus renaming it will make the naming less confusing.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: keep track of the last logged keys when logging a directoryFilipe Manana4-13/+75
After the first time we log a directory in the current transaction, for each directory item in a changed leaf of the subvolume tree, we have to check if we previously logged the item, in order to overwrite it in case its data changed or skip it in case its data hasn't changed.

Checking if we have logged each item before not only wastes time, but it also adds lock contention on the log tree. So in order to minimize the number of times we do such checks, keep track of the offset of the last key we logged for a directory and, on the next time we log the directory, skip the checks for any new keys that have an offset greater than the offset we have previously saved. This is especially effective for index keys, because the offset for these keys comes from a monotonically increasing counter.

This patch is part of a patchset comprised of the following 5 patches:

  btrfs: remove root argument from btrfs_log_inode() and its callees
  btrfs: remove redundant log root assignment from log_dir_items()
  btrfs: factor out the copying loop of dir items from log_dir_items()
  btrfs: insert items in batches when logging a directory when possible
  btrfs: keep track of the last logged keys when logging a directory

This is patch 5/5.

The following test was used on a non-debug kernel to measure the impact it has on a directory fsync:

  $ cat test-dir-fsync.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1

  NUM_NEW_FILES=100000
  NUM_FILE_DELETES=1000

  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  mkdir $MNT/testdir

  for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
      echo -n > $MNT/testdir/file_$i
  done

  # fsync the directory, this will log the new dir items and the inodes
  # they point to, because these are new inodes.
  start=$(date +%s%N)
  xfs_io -c "fsync" $MNT/testdir
  end=$(date +%s%N)

  dur=$(( (end - start) / 1000000 ))
  echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"

  # sync to force transaction commit and wipeout the log.
  sync

  del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
  for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
      rm -f $MNT/testdir/file_$i
  done

  # fsync the directory, this will only log dir items, there are no
  # dentries pointing to new inodes.
  start=$(date +%s%N)
  xfs_io -c "fsync" $MNT/testdir
  end=$(date +%s%N)

  dur=$(( (end - start) / 1000000 ))
  echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"

  umount $MNT

Test results with NUM_NEW_FILES set to 100 000 and 1 000 000:

  **** before patchset, 100 000 files, 1000 deletes ****

  dir fsync took 848 ms after adding 100000 files
  dir fsync took 175 ms after deleting 1000 files

  **** after patchset, 100 000 files, 1000 deletes ****

  dir fsync took 758 ms after adding 100000 files (-11.2%)
  dir fsync took 63 ms after deleting 1000 files (-94.1%)

  **** before patchset, 1 000 000 files, 1000 deletes ****

  dir fsync took 9945 ms after adding 1000000 files
  dir fsync took 473 ms after deleting 1000 files

  **** after patchset, 1 000 000 files, 1000 deletes ****

  dir fsync took 8677 ms after adding 1000000 files (-13.6%)
  dir fsync took 146 ms after deleting 1000 files (-105.6%)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
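The skip rule described in this changelog can be sketched in user space as follows. This is a minimal model under stated assumptions: the struct, its fields and both function names are hypothetical and do not correspond to the kernel's actual data structures. It only captures the decision of whether a lookup in the log tree is needed for a given key offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-directory state; not the kernel's actual fields. */
struct dir_log_state {
    bool logged_before;           /* directory already logged in this transaction */
    uint64_t last_logged_offset;  /* highest key offset logged so far */
};

/* Returns true if the log tree must be searched for a previous copy of the key. */
static bool need_log_lookup(const struct dir_log_state *st, uint64_t key_offset)
{
    assert(st != NULL);
    if (!st->logged_before)
        return false;  /* first logging: nothing can exist in the log yet */
    /* Keys past the last logged offset cannot have been logged before. */
    return key_offset <= st->last_logged_offset;
}

/* Record that a key with the given offset was written to the log. */
static void note_logged(struct dir_log_state *st, uint64_t key_offset)
{
    assert(st != NULL);
    if (!st->logged_before || key_offset > st->last_logged_offset)
        st->last_logged_offset = key_offset;
    st->logged_before = true;
}
```

Because dir index offsets come from a monotonically increasing counter, newly added entries always land in the "no lookup needed" branch of this model.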
2021-10-26btrfs: insert items in batches when logging a directory when possibleFilipe Manana1-37/+180
When logging a directory, we scan its directory items from the subvolume tree and then copy them one by one into the log tree. This is not efficient, since we are generally able to insert several items in a batch, using a single btree operation for adding several items at once.

The reason we copy items one by one is that we must check if each item was previously logged in the current transaction, and if it was we either overwrite it or skip it in case its content did not change in the subvolume tree (this can happen only for dir item keys, but not for dir index keys), and doing such a check makes it a bit cumbersome to attempt batch insertions.

However, opportunities for batch insertions are very frequent and always arise when:

1) Logging the directory for the first time in the current transaction, as none of the items exist in the log tree yet;

2) Logging new dir index keys, because the offset for new dir index keys comes from a monotonically increasing counter. This means if we keep adding dentries to a directory, through creation of new files and sub-directories or by adding new links or renaming from some other directory into the one we are logging, all the new dir index keys have a new offset that is greater than the offset of any previously logged index keys, so we can insert them in batches into the log tree.

For dir item keys, since their offset depends on the result of a hash function against the dentry's name, unless the directory is being logged for the first time in the current transaction, the chances of being able to insert the items in the log using batches are pretty much random and not predictable, as they depend on the names of the dentries, but it still happens often enough.

So change directory logging to keep track of consecutive directory items that don't exist yet in the log and batch insert them.
This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 4/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
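The batching approach of patch 4/5 can be pictured as finding runs of consecutive items that are not yet in the log. The sketch below is a hedged user-space model: the in_log array stands in for the per-item "was it logged before" check, and the struct and function names are hypothetical rather than taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A run of consecutive items that can be inserted with one batch operation. */
struct batch_range {
    size_t start;  /* index of the first item in the run */
    size_t count;  /* number of items in the run */
};

/*
 * Split the items of a leaf into runs of consecutive items that do not
 * exist in the log yet. 'out' must have room for at least (nr + 1) / 2
 * entries. Returns the number of runs found.
 */
static size_t find_batches(const bool *in_log, size_t nr,
                           struct batch_range *out)
{
    size_t nr_batches = 0;
    size_t i = 0;

    assert(in_log != NULL && out != NULL);
    while (i < nr) {
        if (in_log[i]) {
            /* Already logged: handled individually, ends any run. */
            i++;
            continue;
        }
        size_t start = i;
        while (i < nr && !in_log[i])
            i++;
        out[nr_batches].start = start;
        out[nr_batches].count = i - start;
        nr_batches++;
    }
    return nr_batches;
}
```

When a directory is logged for the first time, every entry of in_log is false and the whole leaf collapses into a single run, which matches case 1) in the changelog.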
2021-10-26btrfs: factor out the copying loop of dir items from log_dir_items()Filipe Manana1-60/+75
In preparation for the next change, move the loop that processes a leaf and copies its directory items to the log, into a separate helper function. This makes the next change simpler and it also helps make log_dir_items() a bit shorter (especially after the next change). This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 3/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: remove redundant log root assignment from log_dir_items()Filipe Manana1-2/+0
At log_dir_items() we are assigning the exact same value to the local variable 'log', once when it's declared and once again shortly after. Remove the later assignment as it's pointless. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 2/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: remove root argument from btrfs_log_inode() and its calleesFilipe Manana1-28/+24
The root argument passed to btrfs_log_inode() is unnecessary, as it is always the root of the inode we are going to log. This root also gets unnecessarily propagated to several functions called by btrfs_log_inode(), and all of them take the inode as an argument as well. So just remove the root argument from these functions and have them get the root from the inode where needed. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 1/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: let the for_treelog test in the allocator stand outJohannes Thumshirn1-3/+4
The statement which decides if an extent allocation on a zoned device is for the dedicated tree-log block group or not and if we can use the block group we picked for this allocation is not easy to read but an important part of the allocator. Rewrite into an if condition instead of a plain boolean test to make it stand out more, like the version which tests for the dedicated data-relocation block group. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rename setup_extent_mapping in relocation codeJohannes Thumshirn1-4/+3
In btrfs code we have two functions called setup_extent_mapping, one in the extent_map code and one in the relocation code. While both are private to their respective implementation, this can still be confusing for the reader. So rename the version in relocation.c to setup_relocation_extent_mapping. No functional changes. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: allow preallocation for relocation inodesJohannes Thumshirn1-33/+2
Now that we use a dedicated block group and regular writes for data relocation, we can preallocate the space needed for a relocated inode, just like we do in regular mode. Essentially this reverts commit 32430c614844 ("btrfs: zoned: enable relocation on a zoned filesystem") as it is not needed anymore. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: check for relocation inodes on zoned btrfs in should_nocowJohannes Thumshirn1-1/+9
Prepare for allowing preallocation for relocation inodes. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: use regular writes for relocationJohannes Thumshirn1-0/+11
Now that we have a dedicated block group for relocation, we can use REQ_OP_WRITE instead of REQ_OP_ZONE_APPEND for writing out the data on relocation. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: only allow one process to add pages to a relocation inodeJohannes Thumshirn1-0/+11
Don't allow more than one process to add pages to a relocation inode on a zoned filesystem, otherwise we cannot guarantee the sequential write rule once we're filling preallocated extents on a zoned filesystem. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: add a dedicated data relocation block groupJohannes Thumshirn6-2/+71
Relocation in a zoned filesystem can fail with a transaction abort with error -22 (EINVAL). This happens because the relocation code assumes that the extents we relocated the data to have the same size the source extents had and ensures this by preallocating the extents.

But in a zoned filesystem we currently can't preallocate the extents as this would break the sequential write required rule. Therefore it can happen that the writeback process kicks in while we're still adding pages to a delalloc range and starts writing out dirty pages.

This then creates destination extents that are smaller than the source extents, triggering the following safety check in get_new_location():

	if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
		ret = -EINVAL;
		goto out;
	}

Temporarily create a dedicated block group for the relocation process, so no non-relocation data writes can interfere with the relocation writes.

This is needed so that we can switch the relocation process on a zoned filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme like in a non-zoned filesystem using REQ_OP_WRITE and preallocation.

Fixes: 32430c614844 ("btrfs: zoned: enable relocation on a zoned filesystem")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: introduce btrfs_is_data_reloc_rootJohannes Thumshirn5-15/+16
There are several places in our codebase where we check if a root is the root of the data reloc tree and subsequent patches will introduce more. Factor out the check into a small helper function instead of open coding it multiple times. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
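A helper of this kind is typically a one-line predicate. The following is a simplified user-space sketch and not the kernel's definition: the toy_root struct is hypothetical, and the object ID constant is an assumption meant to mirror the value of BTRFS_DATA_RELOC_TREE_OBJECTID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value, intended to mirror BTRFS_DATA_RELOC_TREE_OBJECTID. */
#define DATA_RELOC_TREE_OBJECTID ((uint64_t)-9)

/* Hypothetical stand-in for a tree root; only the object ID matters here. */
struct toy_root {
    uint64_t objectid;
};

/* One place for the check instead of open coding the comparison everywhere. */
static bool is_data_reloc_root(const struct toy_root *root)
{
    assert(root != NULL);
    return root->objectid == DATA_RELOC_TREE_OBJECTID;
}
```

Call sites then read as a named question about the root rather than a comparison against a magic object ID, which is the readability gain the changelog is after.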
2021-10-26btrfs: unexport repair_io_failure()Qu Wenruo2-6/+3
Function repair_io_failure() is no longer used outside of extent_io.c since commit 8b9b6f255485 ("btrfs: scrub: cleanup the remaining nodatasum fixup code"), which removed the last external caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: do not commit delayed inode when logging a file in full sync modeFilipe Manana1-16/+5
When logging a regular file in full sync mode, we are currently committing its delayed inode item. This is to ensure that we never miss copying the inode item, with its most up to date data, into the log tree. However that is not necessary since commit e4545de5b035 ("Btrfs: fix fsync data loss after append write"), because even if we don't find the leaf with the inode item when looking for leaves that changed in the current transaction, we end up logging the inode item later using the in-memory content. In case we find the leaf containing the inode item, we already end up using the in-memory inode for filling the inode item in the log tree, and not the inode item that is in the fs/subvolume tree, as it might be not up to date (copy_items() -> fill_inode_item()). So don't commit the delayed inode item, which brings a couple of benefits: 1) Avoid writing the inode item to the fs/subvolume btree, saving time and reducing lock contention on the btree; 2) In case no other item for the inode was changed, added or deleted in the same leaf where the inode item is located, we ended up copying all the items in that leaf to the log tree - it's harmless from a functional point of view, but it wastes time and log tree space. 
This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 10/10 and the following test results compare a branch with the whole patch set applied versus a branch without any of the patches applied.

The following script was used to test dbench with 8 and 16 jobs on a machine with 12 cores, 64G of RAM, a NVME device and using a non-debug kernel config (Debian's default):

  $ cat test.sh
  #!/bin/bash

  if [ $# -ne 1 ]; then
      echo "Use $0 NUM_JOBS"
      exit 1
  fi

  NUM_JOBS=$1

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 120 $NUM_JOBS

  umount $MNT

The results were the following:

8 jobs, before patchset:

 Operation     Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX   4113896     0.009   238.665
 Close       3021699     0.001     0.590
 Rename       174215     0.082   238.733
 Unlink       830977     0.049   238.642
 Deltree          96     2.232     8.022
 Mkdir            48     0.003     0.005
 Qpathinfo   3729013     0.005     2.672
 Qfileinfo    653206     0.001     0.152
 Qfsinfo      683866     0.002     0.526
 Sfileinfo    335055     0.004     1.571
 Find        1441800     0.016     4.288
 WriteX      2049644     0.010     3.982
 ReadX       6449786     0.003     0.969
 LockX         13400     0.002     0.043
 UnlockX       13400     0.001     0.075
 Flush        288349     2.521   245.516

 Throughput 1075.73 MB/sec  8 clients  8 procs  max_latency=245.520 ms

8 jobs, after patchset:

 Operation     Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX   4154282     0.009   156.675
 Close       3051450     0.001     0.843
 Rename       175912     0.072     4.444
 Unlink       839067     0.048    66.050
 Deltree          96     2.131     5.979
 Mkdir            48     0.002     0.004
 Qpathinfo   3765575     0.005     3.079
 Qfileinfo    659582     0.001     0.099
 Qfsinfo      690474     0.002     0.155
 Sfileinfo    338366     0.004     1.419
 Find        1455816     0.016     3.423
 WriteX      2069538     0.010     4.328
 ReadX       6512429     0.003     0.840
 LockX         13530     0.002     0.078
 UnlockX       13530     0.001     0.051
 Flush        291158     2.500   163.468

 Throughput 1105.45 MB/sec  8 clients  8 procs  max_latency=163.474 ms

+2.7% throughput, -40.1% max latency

16 jobs, before patchset:

 Operation     Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX   5457602     0.033   337.098
 Close       4008979     0.002     2.018
 Rename       231051     0.323   254.054
 Unlink      1102209     0.202   337.243
 Deltree         160     6.521    31.720
 Mkdir            80     0.003     0.007
 Qpathinfo   4946147     0.014     6.988
 Qfileinfo    867440     0.001     1.642
 Qfsinfo      907081     0.003     1.821
 Sfileinfo    444433     0.005     2.053
 Find        1912506     0.067     7.854
 WriteX      2724852     0.018     7.428
 ReadX       8553883     0.003     2.059
 LockX         17770     0.003     0.350
 UnlockX       17770     0.002     0.627
 Flush        382533     2.810   353.691

 Throughput 1413.09 MB/sec  16 clients  16 procs  max_latency=353.696 ms

16 jobs, after patchset:

 Operation     Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX   5393156     0.034   303.181
 Close       3961986     0.002     1.502
 Rename       228359     0.320   253.379
 Unlink      1088920     0.206   303.409
 Deltree         160     6.419    30.088
 Mkdir            80     0.003     0.004
 Qpathinfo   4887967     0.015     7.722
 Qfileinfo    857408     0.001     1.651
 Qfsinfo      896343     0.002     2.147
 Sfileinfo    439317     0.005     4.298
 Find        1890018     0.073     8.347
 WriteX      2693356     0.018     6.373
 ReadX       8453485     0.003     3.836
 LockX         17562     0.003     0.486
 UnlockX       17562     0.002     0.635
 Flush        378023     2.802   315.904

 Throughput 1454.46 MB/sec  16 clients  16 procs  max_latency=315.910 ms

+2.9% throughput, -11.3% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
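The percentage summaries in the dbench results above can be reproduced from the reported throughput and max_latency values. The sketch below is not part of the patch; the function names are made up for illustration. It suggests the quoted figures are consistent with a change measured relative to the mean of the two values, and it also prints the change relative to the "before" value for comparison.

```python
def symmetric_pct_change(before: float, after: float) -> float:
    """Percent change relative to the mean of the two measurements."""
    return (after - before) / ((before + after) / 2) * 100


def baseline_pct_change(before: float, after: float) -> float:
    """Percent change relative to the 'before' measurement."""
    return (after - before) / before * 100


# (before, after) pairs copied from the dbench summaries above.
results = {
    "8 jobs throughput (MB/sec)": (1075.73, 1105.45),
    "8 jobs max latency (ms)": (245.520, 163.474),
    "16 jobs throughput (MB/sec)": (1413.09, 1454.46),
    "16 jobs max latency (ms)": (353.696, 315.910),
}

for label, (before, after) in results.items():
    sym = symmetric_pct_change(before, after)
    base = baseline_pct_change(before, after)
    print(f"{label}: {sym:+.1f}% vs mean, {base:+.1f}% vs before")
```

Either convention shows the same direction of change; only the size of the latency improvement differs noticeably between the two.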
2021-10-26btrfs: avoid attempt to drop extents when logging inode for the first timeFilipe Manana1-8/+19
When logging an extent, in the fast fsync path, we always attempt to drop or trim any existing extents with a range that matches or overlaps the range of the extent we are about to log. We do that through a call to btrfs_drop_extents(). However this is not needed when we are logging the inode for the first time in the current transaction, since we have no inode items of the inode in the log tree. Calling btrfs_drop_extents() does a deletion search on the log tree, which is expensive when we have concurrent tasks accessing the log tree because a deletion search always acquires a write lock on the extent buffers at levels 2, 1 and 0, adding significant lock contention, especially taking into account that the height of a log tree rarely (if ever) goes beyond 2 or 3, due to its short life. So skip the call to btrfs_drop_extents() when the inode was not previously logged in the current transaction. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 9/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: avoid search for logged i_size when logging inode if possibleFilipe Manana1-1/+1
If we are logging that an inode exists and the inode was not logged before, we can avoid searching in the log tree for the inode item since we know it does not exist. The search wastes time and adds more lock contention on the extent buffers of the log tree when there are other tasks that are logging other inodes. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 8/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: avoid expensive search when truncating inode items from the logFilipe Manana1-1/+3
Whenever we are logging a file inode in full sync mode we call btrfs_truncate_inode_items() to delete items of the inode we may have previously logged. That results in doing a btree search for deletion, which is expensive because it always acquires write locks for extent buffers at levels 2, 1 and 0, and it balances any node that is less than half full. Acquiring the write locks can block the task if the extent buffers are already locked by another task or block other tasks attempting to lock them, which is especially bad in the case of log trees since they are small due to their short life, with a root node at a level typically not greater than level 2. If we know that we are logging the inode for the first time in the current transaction, we can skip the call to btrfs_truncate_inode_items(), avoiding the deletion search. This change does that. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 7/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: add helper to truncate inode items when logging inodeFilipe Manana1-13/+19
Move the call to btrfs_truncate_inode_items(), and the surrounding retry loop, into a local helper function. This avoids some repetition and avoids making the next change a bit awkward due to a bit of too much indentation. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 6/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: avoid expensive search when dropping inode items from logFilipe Manana1-9/+13
Whenever we are logging a directory inode, logging that an inode exists or logging an inode that has changes in its references or xattrs, we attempt to delete items of this inode we may have previously logged (through calls to drop_objectid_items()). That attempt does a btree search for deletion, which is expensive because it always acquires write locks for extent buffers at levels 2, 1 and 0, and it balances any node that is less than half full. Acquiring the write locks can block the task if the extent buffers are already locked or block other tasks attempting to lock them, which is especially bad in the case of log trees since they are small due to their short life, with a root node at a level typically not greater than level 2. If we know that we are logging the inode for the first time in the current transaction, we can skip the search. This change does that. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 5/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
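The optimization described above (skip the deletion search when the inode is logged for the first time in the current transaction) can be illustrated with a toy model. This is only a sketch, not kernel code: ToyLog, log_inode() and the dictionary that stands in for each inode's logged_trans field are hypothetical, and the counter merely stands in for the cost of a btree deletion search.

```python
class ToyLog:
    """Toy stand-in for a log tree that counts deletion searches."""

    def __init__(self):
        self.items = {}              # (inode number, key) -> value
        self.deletion_searches = 0   # proxy for expensive btree deletion searches

    def drop_inode_items(self, ino):
        # Models the expensive step: search the tree and delete old items.
        self.deletion_searches += 1
        for key in [k for k in self.items if k[0] == ino]:
            del self.items[key]


def log_inode(log, logged_trans, ino, transid, new_items):
    """Log new_items for inode `ino` in transaction `transid`.

    logged_trans maps an inode number to the transaction in which it was
    last logged. Old items can only exist in the log if the inode was
    already logged in this transaction, so the drop is skipped otherwise.
    """
    if logged_trans.get(ino) == transid:
        log.drop_inode_items(ino)
    for key, value in new_items.items():
        log.items[(ino, key)] = value
    logged_trans[ino] = transid


log = ToyLog()
logged_trans = {}
log_inode(log, logged_trans, 257, 10, {"xattr.a": 1, "ref": 2})  # first time: no search
log_inode(log, logged_trans, 257, 10, {"ref": 3})                # logged before: one search
print(log.deletion_searches, sorted(log.items))
```

In this model, the first call performs no deletion search at all, which is the saving the commit message describes for concurrent fsync workloads.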
2021-10-26btrfs: always update the logged transaction when logging new namesFilipe Manana1-39/+34
When we are logging a new name for an inode, due to a link or rename operation, if the inode has ancestor inodes that are new, created in the current transaction, we need to log that these inodes exist. To ensure that a subsequent explicit fsync on one of these ancestor inodes does sync the log, we don't set the logged_trans field of these inodes. This was done in commit 75b463d2b47aef ("btrfs: do not commit logs and transactions during link and rename operations"), to avoid syncing a log after a rename or link operation. In order to allow for future changes to do some optimizations, change this behaviour to always update the logged_trans of any logged inode and don't update the last_log_commit of the inode if we are logging that it exists. This accomplishes that same objective with simpler logic, allowing for some optimizations in the next patches. So just do that simplification. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 4/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: do not log new dentries when logging that a new name existsFilipe Manana1-0/+8
When logging a new name for an inode, due to a link or rename operation, we don't need to log all new dentries of the parent directories and their subdirectories. We only want to log the names of the inode and that any new parent directories exist. So in this case don't trigger logging of the new dentries, which is only needed when doing an explicit fsync on a directory or on a file which requires logging its parent directories. This avoids unnecessary work and reduces contention on the extent buffers of a log tree. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 3/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: remove no longer needed checks for NULL log contextFilipe Manana1-13/+7
Since commit 75b463d2b47aef ("btrfs: do not commit logs and transactions during link and rename operations"), we always pass a non-NULL log context to btrfs_log_inode_parent() and therefore to all the functions that it calls. So remove the checks we have all over the place that test for a NULL log context, making the code shorter and easier to read, as well as reducing the size of the generated code. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 2/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: check if a log tree exists at inode_logged()Filipe Manana1-0/+3
In case an inode was never logged since it was loaded from disk and was modified in the current transaction (its ->last_trans matches the ID of the current transaction), inode_logged() returns true even if there's no existing log tree. In this case we can simply check if a log tree exists and return false if it does not. This avoids a caller of inode_logged() doing some unnecessary, but harmless, work. For btrfs_log_new_name() it avoids it logging an inode in case it was never logged since it was loaded from disk and there is currently no log tree for the inode's root. For the remaining callers of inode_logged(), btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log(), it has no effect since they already check if a log tree exists through their calls to join_running_log_trans(). So just add a check to inode_logged() to verify if a log tree exists, and return false if it does not. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 1/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
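The check described above can be sketched with a simplified model. This is an illustration under assumptions, not the kernel function: the Inode class and the root_has_log_tree parameter are hypothetical stand-ins, and only the conditions named in the commit message are modeled.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    logged_trans: int  # transaction in which the inode was last logged (0 = not since load)
    last_trans: int    # transaction that last modified the inode


def inode_logged(inode: Inode, current_transid: int, root_has_log_tree: bool) -> bool:
    """Simplified model of the decision described in the commit message."""
    if inode.logged_trans == current_transid:
        return True
    # The added check: without a log tree, nothing can have been logged.
    if not root_has_log_tree:
        return False
    # Conservative answer: the inode was modified in this transaction and we
    # cannot tell whether it was logged before being evicted and reloaded.
    return inode.last_trans == current_transid


never_logged = Inode(logged_trans=0, last_trans=7)
print(inode_logged(never_logged, 7, root_has_log_tree=False))
print(inode_logged(never_logged, 7, root_has_log_tree=True))
```

The first call models the case the patch targets: the inode was modified in the current transaction but no log tree exists, so the caller can skip its otherwise harmless extra work.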
2021-10-26btrfs: remove stale comment about the btrfs_show_devnameAnand Jain1-7/+0
There were a few lockdep warnings because btrfs_show_devname() was using device_list_mutex, as recorded in the commits: 0ccd05285e7f ("btrfs: fix a possible umount deadlock") 779bf3fefa83 ("btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex") And finally, commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for device list traversal") removed the device_list_mutex from btrfs_show_devname for performance reasons. This patch removes a stale comment about the function btrfs_show_devname and device_list_mutex. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: update latest_dev when we create a sprout deviceAnand Jain1-0/+2
When we add a device to the seed filesystem (sprouting) it is a new filesystem (and fsid) on the device added. Update the latest_dev so that /proc/self/mounts shows the correct device. Example: $ btrfstune -S1 /dev/vg/seed $ mount /dev/vg/seed /btrfs mount: /btrfs: WARNING: device write-protected, mounted read-only. $ cat /proc/self/mounts | grep btrfs /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0 $ btrfs dev add -f /dev/vg/new /btrfs Before: $ cat /proc/self/mounts | grep btrfs /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0 After: $ cat /proc/self/mounts | grep btrfs /dev/mapper/vg-new /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0 Tested-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: use latest_dev in btrfs_show_devnameAnand Jain1-19/+5
The test case btrfs/238 reports the warning below: WARNING: CPU: 3 PID: 481 at fs/btrfs/super.c:2509 btrfs_show_devname+0x104/0x1e8 [btrfs] CPU: 2 PID: 1 Comm: systemd Tainted: G W O 5.14.0-rc1-custom #72 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: btrfs_show_devname+0x108/0x1b4 [btrfs] show_mountinfo+0x234/0x2c4 m_show+0x28/0x34 seq_read_iter+0x12c/0x3c4 vfs_read+0x29c/0x2c8 ksys_read+0x80/0xec __arm64_sys_read+0x28/0x34 invoke_syscall+0x50/0xf8 do_el0_svc+0x88/0x138 el0_svc+0x2c/0x8c el0t_64_sync_handler+0x84/0xe4 el0t_64_sync+0x198/0x19c Reason: While btrfs_prepare_sprout() moves the fs_devices::devices into fs_devices::seed_list, btrfs_show_devname() searches for the devices and finds none, leading to the warning above. Fix: latest_dev is updated according to the changes to the device list. That means we can use latest_dev->name to show the device name in /proc/self/mounts; the pointer will always be valid as it's assigned before the device is deleted from the list in remove or replace. The RCU protection is sufficient as the device structure is freed after synchronization. Reported-by: Su Yue <l@damenly.su> Tested-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: convert latest_bdev type to btrfs_device and renameAnand Jain6-12/+16
In preparation for fixing a bug in btrfs_show_devname(), convert fs_devices::latest_bdev type from struct block_device to struct btrfs_device and rename the member to fs_devices::latest_dev, so that btrfs_show_devname() can use fs_devices::latest_dev::name. Tested-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: finish relocating block groupNaohiro Aota1-0/+4
We will no longer write to a relocating block group. So, we can finish it now. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: finish fully written block groupNaohiro Aota5-2/+70
If we have written to the zone capacity, the device automatically deactivates the zone. Sync up block group side (the active BG list and zone_is_active flag) with it. We need to do it both on data BGs and metadata BGs. On data side, we add a hook to btrfs_finish_ordered_io(). On metadata side, we use end_extent_buffer_writeback(). To reduce excess lookup of a block group, we mark the last extent buffer in a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for data (ordered_extent), because the address may change due to REQ_OP_ZONE_APPEND. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: avoid chunk allocation if active block group has enough spaceNaohiro Aota3-7/+60
The current extent allocator tries to allocate a new block group when the existing block groups do not have enough space. On a ZNS device, a new block group means a new active zone. If the number of active zones has already reached the max_active_zones, activating a new zone needs to finish an existing zone, leading to wasting the free space there. So, instead, it should reuse the existing active block groups as much as possible when we can't activate any other zones without sacrificing an already activated block group. While at it, I converted find_free_extent_update_loop() to check the found_extent() case early and made the other conditions simpler. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: move ffe_ctl one level upNaohiro Aota1-75/+87
We are passing too many variables as it is from btrfs_reserve_extent() to find_free_extent(). The next commit will add min_alloc_size to ffe_ctl, and that means another pass-through argument. Take this opportunity to move ffe_ctl one level up and drop the redundant arguments. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: activate new block groupNaohiro Aota1-0/+6
Activate a new block group at btrfs_make_block_group(). We do not check the return value. If activation fails, we can try again later at the actual extent allocation phase. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: activate block group on allocationNaohiro Aota1-0/+12
Activate a block group when trying to allocate an extent from it. We check the read-only case and the no space left case before trying to activate a block group, so as not to consume active zones needlessly. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: load active zone info for block groupNaohiro Aota1-0/+24
Load activeness of underlying zones of a block group. When underlying zones are active, we add the block group to the fs_info->zone_active_bgs list. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: implement active zone trackingNaohiro Aota7-2/+226
Add zone_is_active flag to btrfs_block_group. This flag indicates the underlying zones are all active. Such zone active block groups are tracked by the fs_info->zone_active_bgs list. btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying device part. They set/clear the bitmap to indicate zone activeness and track the number of zones we can still activate. btrfs_zone_{activate,finish}() take responsibility for the logical part and the list management. In addition, btrfs_zone_finish() waits for any writes on the block group and sends REQ_OP_ZONE_FINISH to the zone. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
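The device-side bookkeeping described above (a per-zone activeness bitmap plus a count of zones that can still be activated) can be modeled roughly as follows. This is a sketch under assumptions: ToyZonedDevice and its methods are hypothetical, a Python set stands in for the bitmap, and none of the locking or I/O of the real helpers is modeled.

```python
class ToyZonedDevice:
    """Toy model of per-device active zone accounting."""

    def __init__(self, max_active_zones: int):
        self.active_zones = set()                  # stands in for the bitmap
        self.active_zones_left = max_active_zones  # zones we can still activate

    def set_active_zone(self, zone: int) -> bool:
        """Mark a zone active; return False if the device limit is reached."""
        if zone in self.active_zones:
            return True
        if self.active_zones_left == 0:
            return False
        self.active_zones.add(zone)
        self.active_zones_left -= 1
        return True

    def clear_active_zone(self, zone: int) -> None:
        """Mark a zone inactive (for example after it was finished)."""
        if zone in self.active_zones:
            self.active_zones.remove(zone)
            self.active_zones_left += 1


dev = ToyZonedDevice(max_active_zones=2)
print(dev.set_active_zone(0), dev.set_active_zone(1), dev.set_active_zone(2))
dev.clear_active_zone(0)
print(dev.set_active_zone(2), dev.active_zones_left)
```

The third activation fails until an active zone is cleared, which mirrors why a zone has to be finished before another one can be activated once the limit is reached.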
2021-10-26btrfs: zoned: introduce physical_map to btrfs_block_groupNaohiro Aota3-2/+16
We will use a block group's physical location to track active zones and finish fully written zones in the following commits. Since the zone activation is done in the extent allocation context, which is already holding the tree locks, we can't query the chunk tree for the physical locations. So, copy the location info into a block group and use it for activation. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: load active zone information from devicesNaohiro Aota2-1/+60
The ZNS specification defines a limit on the number of zones that can be in the implicit open, explicit open or closed conditions. Any zone in such a condition is defined as an active zone and corresponds to any zone that is being written or that has been only partially written. If the maximum number of active zones is reached, we must either reset or finish some active zones before being able to choose other zones for storing data. Load queue_max_active_zones() and track the number of active zones left on the device. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: finish superblock zone once no space left for new SBNaohiro Aota3-20/+44
If there is no more space left for a new superblock in a superblock zone, then it is better to ZONE_FINISH the zone and free up the active zone count. Since btrfs_advance_sb_log() can now issue REQ_OP_ZONE_FINISH, we also need to convert it to return int for the error case. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: locate superblock position using zone capacityNaohiro Aota1-2/+13
sb_write_pointer() returns the write position of next superblock. For READ, we need a previous location. When the pointer is at the head, the previous one is the last one of the other zone. Calculate the last one's position from zone capacity. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: consider zone as full when no more SB can be writtenNaohiro Aota1-8/+15
We cannot write beyond zone capacity. So, we should consider a zone as "full" when the write pointer goes beyond (capacity - the size of super info). Also, take this opportunity to replace some subtly duplicated code with a loop and fix a typo in a comment. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
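The "full" condition for a superblock zone can be written as a one-line predicate. The sketch below is only an illustration: sb_zone_is_full() and its byte-based parameters are hypothetical, and the 4096-byte superblock size is an assumption about the on-disk superblock size rather than something stated in the commit message.

```python
SUPER_INFO_SIZE = 4096  # assumed superblock size in bytes


def sb_zone_is_full(zone_start: int, capacity: int, write_pointer: int) -> bool:
    """True if another superblock copy no longer fits below the zone capacity.

    All arguments are byte offsets or sizes; write_pointer is absolute,
    capacity is relative to zone_start.
    """
    return (write_pointer - zone_start) > (capacity - SUPER_INFO_SIZE)


capacity = 2 * SUPER_INFO_SIZE
for wp in (0, SUPER_INFO_SIZE, SUPER_INFO_SIZE + 1, capacity):
    print(wp, sb_zone_is_full(0, capacity, wp))
```

With a write pointer exactly one superblock below the capacity, one more copy still fits, so the zone is not yet considered full in this model.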
2021-10-26btrfs: zoned: tweak reclaim threshold for zone capacityNaohiro Aota1-2/+6
With the introduction of zone capacity, the range [capacity, length] is always zone unusable. Counting this region as a reclaim target will cause reclaiming too early. Reclaim block groups based on bytes that can be usable after resetting. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: calculate free space from zone capacityNaohiro Aota4-6/+16
Now that we have introduced capacity in a block group, we need to calculate free space using the capacity instead of the length. Thus, we account capacity - alloc_pointer bytes as free, and account the bytes in [capacity, length] as zone unusable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
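The accounting rule above reduces to simple arithmetic. The following sketch only models the two quantities named in the commit message; zoned_space_accounting() is a hypothetical helper, and other contributions to the zone unusable counter (such as freed space that cannot be reused before a zone reset) are deliberately left out.

```python
MIB = 1024 * 1024


def zoned_space_accounting(length: int, capacity: int, alloc_pointer: int):
    """Return (free_bytes, zone_unusable_bytes) for one block group.

    length:        size of the block group (zone size)
    capacity:      usable bytes from the start of the zone (capacity <= length)
    alloc_pointer: bytes already allocated from the start of the block group
    """
    if not 0 <= alloc_pointer <= capacity <= length:
        raise ValueError("expected 0 <= alloc_pointer <= capacity <= length")
    free_bytes = capacity - alloc_pointer
    zone_unusable = length - capacity  # the range [capacity, length]
    return free_bytes, zone_unusable


free_bytes, unusable = zoned_space_accounting(
    length=256 * MIB, capacity=192 * MIB, alloc_pointer=64 * MIB
)
print(free_bytes // MIB, "MiB free,", unusable // MIB, "MiB zone unusable")
```

Using the length instead of the capacity in this example would overstate the free space by the 64 MiB that can never be written.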
2021-10-26btrfs: zoned: move btrfs_free_excluded_extents out of btrfs_calc_zone_unusableNaohiro Aota2-3/+2
btrfs_free_excluded_extents() is not necessary for btrfs_calc_zone_unusable() and it makes btrfs_calc_zone_unusable() difficult to reuse. Move it out and call btrfs_free_excluded_extents() in the proper context. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: load zone capacity information from devicesNaohiro Aota2-1/+24
The ZNS specification introduces the concept of a Zone Capacity. A zone capacity is an additional per-zone attribute that indicates the number of usable logical blocks within each zone, starting from the first logical block of each zone. It is always smaller than or equal to the zone size. With the SINGLE profile, we can set a block group's "capacity" to be the same as the underlying zone's Zone Capacity. We will limit the allocation so that it does not exceed the capacity in a following commit. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: enable defrag for subpage caseQu Wenruo1-6/+0
With the new infrastructure which has taken subpage into consideration, now we should be safe to allow defrag to work for subpage case. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: remove the old infrastructureQu Wenruo1-313/+0
Now the old infrastructure can all be removed, defrag Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()Qu Wenruo1-149/+55
The function defrag_one_cluster() is able to defrag one range well enough, we only need to do preparation for it, including: - Clamp and align the defrag range - Exclude invalid cases - Proper inode locking The old infrastructures will not be removed in this patch, as it would be too noisy to review. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: introduce helper to defrag one clusterQu Wenruo1-0/+56
This new helper, defrag_one_cluster(), will defrag one cluster (at most 256K): - Collect all initial targets - Kick in readahead when possible - Call defrag_one_range() on each initial target With some extra range clamping. - Update @sectors_defragged parameter This involves one behavior change, the defragged sectors accounting is no longer as accurate as old behavior, as the initial targets are not consistent. We can have new holes punched inside the initial target, and we will skip such holes later. But the defragged sectors accounting doesn't need to be that accurate anyway, thus I don't want to pass those extra accounting burden into defrag_one_range(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: introduce helper to defrag a rangeQu Wenruo1-10/+93
A new helper, defrag_one_range(), is introduced to defrag one range. This function will mostly prepare the needed pages and extent status for defrag_one_locked_target(). As we can only have a consistent view of extent map with page and extent bits locked, we need to re-check the range passed in to get a real target list for defrag_one_locked_target(). Since defrag_collect_targets() will call defrag_lookup_extent() and lock extent range, we also need to teach those two functions to skip extent lock. Thus new parameter, @locked, is introduced to skip extent lock if the caller has already locked the range. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: introduce helper to defrag a contiguous prepared rangeQu Wenruo1-0/+55
A new helper, defrag_one_locked_target(), is introduced to do the real part of defrag. The caller needs to ensure both page and extent bits are locked, and no ordered extent exists for the range, and all writeback is finished. The core defrag part is pretty straight-forward: - Reserve space - Set extent bits to defrag - Update involved pages to be dirty Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: introduce helper to collect target file extentsQu Wenruo1-0/+120
Introduce a helper, defrag_collect_targets(), to collect all possible targets to be defragged. This function will not consider things like max_sectors_to_defrag, thus caller should be responsible to ensure we don't exceed the limit. This function will be the first stage of later defrag rework. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: factor out page preparation into a helperQu Wenruo1-61/+87
In cluster_pages_for_defrag(), we have a complex code block inside one for() loop. The code block prepares one page for defrag and ensures: - The page is locked and set up properly. - No ordered extent exists in the page range. - The page is uptodate. This behavior is pretty common and will be reused by later defrag rework. So factor out the code into its own helper, defrag_prepare_one_page(), for later usage, and clean up the code a little. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: replace hard coded PAGE_SIZE with sectorsizeQu Wenruo1-5/+6
When testing subpage defrag support, I always hit a strange inode nbytes error. After a lot of debugging, it turns out that defrag_lookup_extent() is using PAGE_SIZE as size for lookup_extent_mapping(). Since lookup_extent_mapping() is calling __lookup_extent_mapping() with @strict == 1, this means any extent map smaller than one page will be ignored, preventing subpage defrag from grabbing a correct extent map. There are quite a few PAGE_SIZE usages in ioctl.c, but most of them are correct usages, and can be one of the following cases: - ioctl structure size check We want ioctl structure to be contained inside one page. - real page operations The remaining cases in defrag_lookup_extent() and check_defrag_in_cache() will be addressed in this patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: also check PagePrivate for subpage cases in ↵Qu Wenruo1-2/+3
cluster_pages_for_defrag() In function cluster_pages_for_defrag() we have a window where we unlock page, either start the ordered range or read the content from disk. When we re-lock the page, we need to make sure it still has the correct page->private for subpage. Thus add the extra PagePrivate check here to handle subpage cases properly. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: pass file_ra_state instead of file to btrfs_defrag_file()Qu Wenruo2-11/+20
Currently btrfs_defrag_file() accepts both "struct inode" and "struct file" as parameter. We can easily grab "struct inode" from "struct file" using file_inode() helper. The reason why we need "struct file" is just to re-use its f_ra. Change this to pass "struct file_ra_state" parameter, so that it's more clear what we really want. Since we're here, also add some comments on the function btrfs_defrag_file(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rename and switch to bool btrfs_chunk_readonlyAnand Jain3-17/+19
btrfs_chunk_readonly() checks if the given chunk is writeable. It returns 1 for readonly, and 0 for writeable. So the return argument type bool shall suffice instead of the current type int. Also, rename btrfs_chunk_readonly() to btrfs_chunk_writeable() as we check if the bg is writeable, and helps to keep the logic at the parent function simpler to understand. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: reflink: initialize return value to 0 in btrfs_extent_same()Sidong Yang1-1/+1
Fix a warning reported by smatch that ret could be returned without being initialized. The dedupe operations are supposed to return 0 for a 0 length range but the caller does not pass olen == 0. To keep this behaviour and also fix the warning, initialize ret to 0. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sidong Yang <realwakka@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: subpage: pack all subpage bitmaps into a larger bitmapQu Wenruo3-85/+121
Currently we use u16 bitmap to make 4k sectorsize work for 64K page size. But this u16 bitmap is not large enough to contain larger page size like 128K, nor is it space efficient for 16K page size. To handle both cases, here we pack all subpage bitmaps into a larger bitmap, now btrfs_subpage::bitmaps[] will be the ultimate bitmap for subpage usage. Each sub-bitmap will have its start bit number recorded in btrfs_subpage_info::*_start, and its bitmap length will be recorded in btrfs_subpage_info::bitmap_nr_bits. All subpage bitmap operations will be converted from using direct u16 operations to bitmap operations, with above *_start calculated. For 64K page size with 4K sectorsize, this should not cause much difference. While for 16K page size, we will only need 1 unsigned long (u32) to store all the bitmaps, which saves quite some space. Furthermore, this allows us to support larger page size like 128K and 256K. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
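The packing scheme described above (one large bitmap, with a recorded start bit per sub-bitmap) can be sketched as follows. All names here (`struct packed_info`, `packed_set()`, the three sub-bitmaps chosen) are hypothetical and picked for the illustration; the real structure has its own fields and more sub-bitmaps.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Illustrative subset of per-sector states tracked for one page. */
enum sub_bitmap { SUB_UPTODATE, SUB_DIRTY, SUB_WRITEBACK, SUB_NR };

struct packed_info {
	unsigned int bitmap_nr_bits;	/* sectors per page = bits per sub-bitmap */
	unsigned int start[SUB_NR];	/* first bit of each sub-bitmap */
	unsigned int total_nr_bits;	/* bits needed for all sub-bitmaps */
};

/* Lay the sub-bitmaps out back to back inside one large bitmap. */
static void packed_info_init(struct packed_info *info,
			     unsigned int page_size, unsigned int sectorsize)
{
	unsigned int cur = 0;

	info->bitmap_nr_bits = page_size / sectorsize;
	for (int i = 0; i < SUB_NR; i++) {
		info->start[i] = cur;
		cur += info->bitmap_nr_bits;
	}
	info->total_nr_bits = cur;
}

/* Number of unsigned longs needed to hold the packed bitmap. */
static size_t packed_nr_words(const struct packed_info *info)
{
	return (info->total_nr_bits + BITS_PER_WORD - 1) / BITS_PER_WORD;
}

/* Set the bit for @sector inside the sub-bitmap @which. */
static void packed_set(unsigned long *words, const struct packed_info *info,
		       enum sub_bitmap which, unsigned int sector)
{
	unsigned int bit;

	assert(sector < info->bitmap_nr_bits);
	bit = info->start[which] + sector;
	words[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/* Test the bit for @sector inside the sub-bitmap @which. */
static bool packed_test(const unsigned long *words,
			const struct packed_info *info,
			enum sub_bitmap which, unsigned int sector)
{
	unsigned int bit;

	assert(sector < info->bitmap_nr_bits);
	bit = info->start[which] + sector;
	return (words[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL;
}
```

For a 16K page with 4K sectors each sub-bitmap is 4 bits wide, so the three sub-bitmaps in this sketch occupy 12 bits and fit in a single word.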
2021-10-26fs: remove leftover comments from mandatory locking removalJeff Layton2-7/+1
Stragglers from commit f7e33bdbd6d1 ("fs: remove mandatory file locking support"). Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-10-25fscrypt: improve a few commentsEric Biggers2-3/+13
Improve a few comments. These were extracted from the patch "fscrypt: add support for hardware-wrapped keys" (https://lore.kernel.org/r/20211021181608.54127-4-ebiggers@kernel.org). Link: https://lore.kernel.org/r/20211026021042.6581-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-10-25btrfs: subpage: introduce btrfs_subpage_bitmap_infoQu Wenruo4-3/+72
Currently we use fixed size u16 bitmap for subpage bitmap. This is fine for 4K sectorsize with 64K page size. But for 4K sectorsize and larger page size, the bitmap is too small, while for smaller page size like 16K, u16 bitmaps waste too much space. Here we introduce a new helper structure, btrfs_subpage_bitmap_info, to record the proper bitmap size, and where each bitmap should start at. By this, we can later compact all subpage bitmaps into one u32 bitmap. This patch is the first step. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25btrfs: subpage: make btrfs_alloc_subpage() return btrfs_subpage directlyQu Wenruo3-22/+24
The existing calling convention of btrfs_alloc_subpage() is pretty awful. Change it to a more common pattern by returning struct btrfs_subpage directly and let the caller determine if the call succeeded. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
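The change in calling convention can be illustrated with a toy allocator. This sketch assumes the old shape was "status code plus out-parameter", which the message above does not spell out; the `toy_` names are hypothetical, and the kernel may encode failure differently (for example as an error pointer) where this sketch uses NULL.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for the real structure. */
struct toy_subpage {
	unsigned int readers;
};

/* Assumed "before" shape: status code plus out-parameter. */
static int toy_alloc_subpage_old(struct toy_subpage **ret)
{
	*ret = calloc(1, sizeof(**ret));
	return *ret ? 0 : -1;
}

/* "After" shape: return the object directly, NULL on failure. */
static struct toy_subpage *toy_alloc_subpage(void)
{
	return calloc(1, sizeof(struct toy_subpage));
}
```

The caller of the second form checks the returned pointer instead of a separate status value, so there is one fewer variable to keep consistent.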
2021-10-25btrfs: subpage: only call btrfs_alloc_subpage() when sectorsize is smaller ↵Qu Wenruo2-9/+10
than PAGE_SIZE There are two call sites of btrfs_alloc_subpage(): - btrfs_attach_subpage() We have ensured sectorsize is smaller than PAGE_SIZE - alloc_extent_buffer() We call btrfs_alloc_subpage() unconditionally. The alloc_extent_buffer() forces us to check the sectorsize size against page size inside btrfs_alloc_subpage(). Since the function name, btrfs_alloc_subpage(), already indicates it should only get called for subpage cases, do the check in alloc_extent_buffer() and add an ASSERT() in btrfs_alloc_subpage(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25btrfs: update comment for fs_devices::seed_list in btrfs_rm_deviceSu Yue1-1/+1
Update it since commit 944d3f9fac61 ("btrfs: switch seed device to list api") did conversion from fs_devices::seed to fs_devices::seed_list. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25btrfs: drop unnecessary ret in ioctl_quota_rescan_statusAnand Jain1-3/+2
There is no need for the variable ret after d66105cfa873 ("btrfs: allocate btrfs_ioctl_quota_rescan_args on stack"), remove it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25btrfs: send: simplify send_create_inode_if_neededMarcos Paulo de Souza1-11/+4
The out label is being overused, we can simply return if the condition permits. No functional changes. Reviewed-by: Su Yue <l@damenly.su> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25btrfs: rename btrfs_alloc_chunk to btrfs_create_chunkNikolay Borisov4-10/+10
The user facing function used to allocate new chunks is btrfs_chunk_alloc, unfortunately there is yet another similar sounding function - btrfs_alloc_chunk. This creates confusion, especially since the latter function can be considered "private" in the sense that it implements the first stage of chunk creation and as such is called by btrfs_chunk_alloc. To avoid the awkwardness that comes with having similarly named functions that differ distinctly in purpose, rename btrfs_alloc_chunk to btrfs_create_chunk, given that the main purpose of this function is to orchestrate the whole process of allocating a chunk - reserving space into devices, deciding on characteristics of the stripe size and creating the in-memory structures. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25fs: get rid of the res2 iocb->ki_complete argumentJens Axboe10-21/+21
The second argument was only used by the USB gadget code, yet everyone pays the overhead of passing a zero to be passed into aio, where it ends up being part of the aio res2 value. Now that everybody is passing in zero, kill off the extra argument. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25io_uring: clusterise ki_flags access in rw_prepPavel Begunkov1-10/+11
ioprio setup doesn't depend on other fields that are modified in io_prep_rw() and we can move it down in the function without worrying about performance. It's useful as it makes iocb->ki_flags accesses/modifications closer together, so it's more likely the compiler will cache it in a register and avoid extra reloads. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8ee98779c06f1b59f6039b1e292db4332efd664b.1634987320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25io_uring: kill unused param from io_file_supports_nowaitPavel Begunkov1-4/+3
io_file_supports_nowait() doesn't use rw argument anymore, remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4bd6709fc573d70c866ea656cb7a7dbe94be8026.1634987320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25io_uring: clean up timeout async_data allocationPavel Begunkov1-1/+3
opcode prep functions are one of the first things that are called, we can't have ->async_data allocated at this point and it's certainly a bug. Reflect this assumption in io_timeout_prep() and add a WARN_ONCE just in case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/75a28ca7dbcc5af8b6cd9092819e8384c24dedd4.1634987320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25io_uring: don't try io-wq polling if not supportedPavel Begunkov1-2/+6
If an opcode doesn't support polling, just let it be executed synchronously in iowq, otherwise it will do a nonblock attempt just to fail in io_arm_poll_handler() and return back to blocking execution. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6401256db01b88f448f15fcd241439cb76f5b940.1634987320.git.asml.silence@gmail.com Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25io_uring: check if opcode needs poll first on armingPavel Begunkov1-4/+2
->pollout or ->pollin are set only for opcodes that need a file, so if io_arm_poll_handler() tests them first we can be sure that the request has file set and the ->file check can be removed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9adfe4f543d984875e516fce6da35348aab48668.1634987320.git.asml.silence@gmail.com Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25io_uring: clean iowq submit work cancellationPavel Begunkov1-30/+29
If we've got IO_WQ_WORK_CANCEL in io_wq_submit_work(), handle the error on the same lines as the check instead of having a weird code flow. The main loop doesn't change but moves one indentation level left. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ff4a09cf41f7a22bbb294b6f1faea721e21fe615.1634987320.git.asml.silence@gmail.com Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25io_uring: clean io_wq_submit_work()'s main loopPavel Begunkov1-28/+12
Do a bit of cleaning for the main loop of io_wq_submit_work(). Get rid of the switch, just replace it with a single if as we're retrying in both other cases. Kill the issue_sqe label, get rid of the needs_poll nesting and disambiguate the comment a bit. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ed12ce0c64e051f9a6b8a37a24f8ea554d299c29.1634987320.git.asml.silence@gmail.com Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25gfs2: Fix unused value warning in do_gfs2_set_flags()Tim Gardner1-1/+0
Coverity complains of an unused value: CID 119623 (#1 of 1): Unused value (UNUSED_VALUE) assigned_value: Assigning value -1 to error here, but that stored value is overwritten before it can be used. 237 error = -EPERM; Fix it by removing the assignment. Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25gfs2: check context in gfs2_glock_putAlexander Aring1-0/+3
Add a might_sleep call into gfs2_glock_put which can sleep in DLM when the last reference is released. This will show problems earlier, and not only when the last reference is put. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25gfs2: Fix glock_hash_walk bugsAndreas Gruenbacher1-10/+12
So far, glock_hash_walk took a reference on each glock it iterated over, and it was the examiner's responsibility to drop those references. Dropping the final reference to a glock can sleep and the examiners are called in a RCU critical section with spin locks held, so examiners that didn't need the extra reference had to drop it asynchronously via gfs2_glock_queue_put or similar. This wasn't done correctly in thaw_glock which did call gfs2_glock_put, and not at all in dump_glock_func. Change glock_hash_walk to not take glock references at all. That way, the examiners that don't need them won't have to bother with slow asynchronous puts, and the examiners that do need references can take them themselves. Reported-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
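The ownership change described above (the walker takes no references, and an examiner takes one only if it needs it) can be sketched with a toy table. Everything here is hypothetical: `struct walk_glock`, `toy_hash_walk()` and both examiners are illustration names, a plain array stands in for the hash table, and locking and RCU are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object with a plain integer reference count. */
struct walk_glock {
	int refcount;
	int needs_work;
};

typedef void (*examiner_fn)(struct walk_glock *gl, void *arg);

/* The walker only visits objects; it never touches their refcounts. */
static void toy_hash_walk(struct walk_glock *gls, size_t n,
			  examiner_fn fn, void *arg)
{
	for (size_t i = 0; i < n; i++)
		fn(&gls[i], arg);
}

/* An examiner that only looks at objects needs no reference at all. */
static void count_examiner(struct walk_glock *gl, void *arg)
{
	(void)gl;
	++*(int *)arg;
}

/*
 * An examiner that hands the object to later work takes its own
 * reference; whoever finishes that work is responsible for dropping it.
 */
static void pin_examiner(struct walk_glock *gl, void *arg)
{
	(void)arg;
	if (gl->needs_work)
		gl->refcount++;
}
```

The read-only examiner leaves every reference count untouched, so nothing has to be put back asynchronously after the walk.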
2021-10-25gfs2: Cancel remote delete work asynchronouslyAndreas Gruenbacher1-1/+1
In gfs2_inode_lookup and gfs2_create_inode, we're calling gfs2_cancel_delete_work which currently cancels any remote delete work (delete_work_func) synchronously. This means that if the work is currently running, it will wait for it to finish. We're doing this to prevent a previous instance of an inode from having any influence on the next instance. However, delete_work_func uses gfs2_inode_lookup internally, and we can end up in a deadlock when delete_work_func gets interrupted at the wrong time. For example, (1) An inode's iopen glock has delete work queued, but the inode itself has been evicted from the inode cache. (2) The delete work is preempted before reaching gfs2_inode_lookup. (3) Another process recreates the inode (gfs2_create_inode). It tries to cancel any outstanding delete work, which blocks waiting for the ongoing delete work to finish. (4) The delete work calls gfs2_inode_lookup, which blocks waiting for gfs2_create_inode to instantiate and unlock the new inode => deadlock. It turns out that when the delete work notices that its inode has been re-instantiated, it will do nothing. This means that it's safe to cancel the delete work asynchronously. This prevents the kind of deadlock described above. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2021-10-25gfs2: set glock object after nqBob Peterson1-2/+2
Before this patch, function gfs2_create_inode called glock_set_object to set the gl_object for inode and iopen glocks before the glock was locked. That's wrong because other competing processes like evict may be blocked waiting for the glock and still have gl_object set before the actual eviction can take place. This patch moves the call to glock_set_object until after the glock is acquired in function gfs2_create_inode, so it waits for possibly competing evicts to finish their processing first. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25gfs2: remove RDF_UPTODATE flagBob Peterson3-24/+15
The new GLF_INSTANTIATE_NEEDED flag obsoletes the old rgrp flag GFS2_RDF_UPTODATE, so this patch replaces it like we did with inodes. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25gfs2: Eliminate GIF_INVALID flagBob Peterson4-11/+4
With the addition of the new GLF_INSTANTIATE_NEEDED flag, the GIF_INVALID flag is now redundant. This patch removes it. Since inode_instantiate is only called when instantiation is needed, the check in inode_instantiate is removed too. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-25gfs2: fix GL_SKIP node_scope problemsBob Peterson7-20/+61
Before this patch, when a glock was locked, the very first holder on the queue would unlock the lockref and call the go_instantiate glops function (if one existed), unless GL_SKIP was specified. When we introduced the new node-scope concept, we allowed multiple holders to lock glocks in EX mode and share the lock. But node-scope introduced a new problem: if the first holder has GL_SKIP and the next one does NOT, since it is not the first holder on the queue, the go_instantiate op was not called. Eventually the GL_SKIP holder may call the instantiate sub-function (e.g. gfs2_rgrp_bh_get) but there was still a window of time in which another non-GL_SKIP holder assumes the instantiate function had been called by the first holder. In the case of rgrp glocks, this led to a NULL pointer dereference on the buffer_heads. This patch tries to fix the problem by introducing two new glock flags: GLF_INSTANTIATE_NEEDED, which keeps track of when the instantiate function needs to be called to "fill in" or "read in" the object before it is referenced. GLF_INSTANTIATE_IN_PROG which is used to determine when a process is in the process of reading in the object. Whenever a function needs to reference the object, it checks the GLF_INSTANTIATE_NEEDED flag, and if set, it sets GLF_INSTANTIATE_IN_PROG and calls the glops "go_instantiate" function. As before, the gl_lockref spin_lock is unlocked during the IO operation, which may take a relatively long amount of time to complete. While unlocked, if another process determines go_instantiate is still needed, it sees GLF_INSTANTIATE_IN_PROG is set, and waits for the go_instantiate glop operation to be completed. Once GLF_INSTANTIATE_IN_PROG is cleared, it needs to check GLF_INSTANTIATE_NEEDED again because the other process's go_instantiate operation may not have been successful. Functions that previously called the instantiate sub-functions now call directly into gfs2_instantiate so the new bits are managed properly. 
Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
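The two-flag protocol described in this message can be modelled as a small state machine. This is a single-threaded sketch of the flag logic only: the spin lock, the actual waiting and the I/O are left out, and every name (`struct inst_glock`, `toy_instantiate()`, the `F_` flags, `flaky_op()`) is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define F_INSTANTIATE_NEEDED	(1u << 0)	/* object must be read in before use */
#define F_INSTANTIATE_IN_PROG	(1u << 1)	/* someone is reading it in right now */

struct inst_glock {
	unsigned int flags;
	int (*go_instantiate)(struct inst_glock *gl);	/* returns 0 on success */
	int calls;					/* how often the op ran */
};

enum inst_result {
	INST_OK,		/* object is usable */
	INST_WAIT_AND_RECHECK,	/* another holder is instantiating; wait, then call again */
	INST_FAILED,		/* our attempt failed; NEEDED stays set */
};

static enum inst_result toy_instantiate(struct inst_glock *gl)
{
	int ret;

	if (!(gl->flags & F_INSTANTIATE_NEEDED))
		return INST_OK;

	/*
	 * Someone else is already reading the object in. The caller has to
	 * wait for IN_PROG to clear and then re-check NEEDED, because that
	 * other attempt may have failed.
	 */
	if (gl->flags & F_INSTANTIATE_IN_PROG)
		return INST_WAIT_AND_RECHECK;

	gl->flags |= F_INSTANTIATE_IN_PROG;
	/* Real code would drop its lock around the (slow) read-in here. */
	ret = gl->go_instantiate ? gl->go_instantiate(gl) : 0;
	if (ret == 0)
		gl->flags &= ~F_INSTANTIATE_NEEDED;
	gl->flags &= ~F_INSTANTIATE_IN_PROG;

	return ret == 0 ? INST_OK : INST_FAILED;
}

/* Test helper: fails on the first call, succeeds afterwards. */
static int flaky_op(struct inst_glock *gl)
{
	return gl->calls++ == 0 ? -1 : 0;
}
```

A failed attempt clears only the in-progress flag, which is why a waiter has to look at the needed flag again instead of assuming the object is ready.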
2021-10-25gfs2: split glock instantiation off from do_promoteBob Peterson1-3/+17
Before this patch, function do_promote had a section of code that did the actual instantiation. This patch splits that off into its own function, gfs2_instantiate, which prepares us for the next patch that will use that function. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>