aboutsummaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2 daysMerge tag '6.10-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds8-45/+63
Pull smb client updates from Steve French: - three important fixes to recent netfs conversion to fix various xfstest failures, and rmmod oops - cleanup patch to fix various GCC-14 warnings * tag '6.10-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: fix perf regression with cached writes with netfs conversion cifs: Fix locking in cifs_strict_readv() cifs: Change from mempool_destroy to mempool_exit for request pools smb: smb2pdu.h: Avoid -Wflex-array-member-not-at-end warnings
2 daysMerge tag 'integrity-v6.10' of ↵Linus Torvalds2-2/+2
ssh://ra.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull integrity updates from Mimi Zohar: "Two IMA changes, one EVM change, a use after free bug fix, and a code cleanup to address "-Wflex-array-member-not-at-end" warnings: - The existing IMA {ascii, binary}_runtime_measurements lists include a hard coded SHA1 hash. To address this limitation, define per TPM enabled hash algorithm {ascii, binary}_runtime_measurements lists - Close an IMA integrity init_module syscall measurement gap by defining a new critical-data record - Enable (partial) EVM support on stacked filesystems (overlayfs). Only EVM portable & immutable file signatures are copied up, since they do not contain filesystem specific metadata" * tag 'integrity-v6.10' of ssh://ra.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: ima: add crypto agility support for template-hash algorithm evm: Rename is_unsupported_fs to is_unsupported_hmac_fs fs: Rename SB_I_EVM_UNSUPPORTED to SB_I_EVM_HMAC_UNSUPPORTED evm: Enforce signatures on unsupported filesystem for EVM_INIT_X509 ima: re-evaluate file integrity on file metadata change evm: Store and detect metadata inode attributes changes ima: Move file-change detection variables into new structure evm: Use the metadata inode to calculate metadata hash evm: Implement per signature type decision in security_inode_copy_up_xattr security: allow finer granularity in permitting copy-up of security xattrs ima: Rename backing_inode to real_inode integrity: Avoid -Wflex-array-member-not-at-end warnings ima: define an init_module critical data record ima: Fix use-after-free on a dentry's dname.name
3 daysMerge tag 'net-next-6.10' of ↵Linus Torvalds6-15/+16
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Complete rework of garbage collection of AF_UNIX sockets. AF_UNIX is prone to forming reference count cycles due to fd passing functionality. New method based on Tarjan's Strongly Connected Components algorithm should be both faster and remove a lot of workarounds we accumulated over the years. - Add TCP fraglist GRO support, allowing chaining multiple TCP packets and forwarding them together. Useful for small switches / routers which lack basic checksum offload in some scenarios (e.g. PPPoE). - Support using SMP threads for handling packet backlog i.e. packet processing from software interfaces and old drivers which don't use NAPI. This helps move the processing out of the softirq jumble. - Continue work of converting from rtnl lock to RCU protection. Don't require rtnl lock when reading: IPv6 routing FIB, IPv6 address labels, netdev threaded NAPI sysfs files, bonding driver's sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics, TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot of the link information available via rtnetlink. - Small optimizations from Eric to UDP wake up handling, memory accounting, RPS/RFS implementation, TCP packet sizing etc. - Allow direct page recycling in the bulk API used by XDP, for +2% PPS. - Support peek with an offset on TCP sockets. - Add MPTCP APIs for querying last time packets were received/sent/acked and whether MPTCP "upgrade" succeeded on a TCP socket. - Add intra-node communication shortcut to improve SMC performance. - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol driver. - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver. - Add reset reasons for tracing what caused a TCP reset to be sent. - Introduce direction attribute for xfrm (IPSec) states. State can be used either for input or output packet processing. Things we sprinkled into general kernel code: - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS(). This required touch-ups and renaming of a few existing users. - Add Endian-dependent __counted_by_{le,be} annotations. - Make building selftests "quieter" by printing summaries like "CC object.o" rather than full commands with all the arguments. Netfilter: - Use GFP_KERNEL to clone elements, to deal better with OOM situations and avoid failures in the .commit step. BPF: - Add eBPF JIT for ARCv2 CPUs. - Support attaching kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. "Session mode" is a common use-case for tetragon and bpftrace. - Add the ability to specify and retrieve BPF cookie for raw tracepoint programs in order to ease migration from classic to raw tracepoints. - Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86, ARM64 and RISC-V JITs. This allows inlining functions which need to access per-CPU state. - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction. Support BPF arena on ARM64. - Add a new bpf_wq API for deferring events and refactor process-context bpf_timer code to keep common code where possible. - Harden the BPF verifier's and/or/xor value tracking. - Introduce crypto kfuncs to let BPF programs call kernel crypto APIs. - Support bpf_tail_call_static() helper for BPF programs with GCC 13. - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled. Driver API: - Skip software TC processing completely if all installed rules are marked as HW-only, instead of checking the HW-only flag rule by rule. - Add support for configuring PoE (Power over Ethernet), similar to the already existing support for PoDL (Power over Data Line) config. - Initial bits of a queue control API, for now allowing a single queue to be reset without disturbing packet flow to other queues. - Common (ethtool) statistics for hardware timestamping. Tests and tooling: - Remove the need to create a config file to run the net forwarding tests so that a naive "make run_tests" can exercise them. - Define a method of writing tests which require an external endpoint to communicate with (to send/receive data towards the test machine). Add a few such tests. - Create a shared code library for writing Python tests. Expose the YAML Netlink library from tools/ to the tests for easy Netlink access. - Move netfilter tests under net/, extend them, separate performance tests from correctness tests, and iron out issues found by running them "on every commit". - Refactor BPF selftests to use common network helpers. - Further work filling in YAML definitions of Netlink messages for: nftables, team driver, bonding interfaces, vlan interfaces, VF info, TC u32 mark, TC police action. - Teach Python YAML Netlink to decode attribute policies. - Extend the definition of the "indexed array" construct in the specs to cover arrays of scalars rather than just nests. - Add hyperlinks between definitions in generated Netlink docs. Drivers: - Make sure unsupported flower control flags are rejected by drivers, and make more drivers report errors directly to the application rather than dmesg (large number of driver changes from Asbjørn Sloth Tønnesen). - Ethernet high-speed NICs: - Broadcom (bnxt): - support multiple RSS contexts and steering traffic to them - support XDP metadata - make page pool allocations more NUMA aware - Intel (100G, ice, idpf): - extract datapath code common among Intel drivers into a library - use fewer resources in switchdev by sharing queues with the PF - add PFCP filter support - add Ethernet filter support - use a spinlock instead of HW lock in PTP clock ops - support 5 layer Tx scheduler topology - nVidia/Mellanox: - 800G link modes and 100G SerDes speeds - per-queue IRQ coalescing configuration - Marvell Octeon: - support offloading TC packet mark action - Ethernet NICs consumer, embedded and virtual: - stop lying about skb->truesize in USB Ethernet drivers, it messes up TCP memory calculations - Google cloud vNIC: - support changing ring size via ethtool - support ring reset using the queue control API - VirtIO net: - expose flow hash from RSS to XDP - per-queue statistics - add selftests - Synopsys (stmmac): - support controllers which require an RX clock signal from the MII bus to perform their hardware initialization - TI: - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices - icssg_prueth: add SW TX / RX Coalescing based on hrtimers - cpsw: minimal XDP support - Renesas (ravb): - support describing the MDIO bus - Realtek (r8169): - add support for RTL8168M - Microchip Sparx5: - matchall and flower actions mirred and redirect - Ethernet switches: - nVidia/Mellanox: - improve events processing performance - Marvell: - add support for MV88E6250 family internal PHYs - Microchip: - add DCB and DSCP mapping support for KSZ switches - vsc73xx: convert to PHYLINK - Realtek: - rtl8226b/rtl8221b: add C45 instances and SerDes switching - Many driver changes related to PHYLIB and PHYLINK deprecated API cleanup - Ethernet PHYs: - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY. - micrel: lan8814: add support for PPS out and external timestamp trigger - WiFi: - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices drivers. Modern devices can only be configured using nl80211. - mac80211/cfg80211 - handle color change per link for WiFi 7 Multi-Link Operation - Intel (iwlwifi): - don't support puncturing in 5 GHz - support monitor mode on passive channels - BZ-W device support - P2P with HE/EHT support - re-add support for firmware API 90 - provide channel survey information for Automatic Channel Selection - MediaTek (mt76): - mt7921 LED control - mt7925 EHT radiotap support - mt7920e PCI support - Qualcomm (ath11k): - P2P support for QCA6390, WCN6855 and QCA2066 - support hibernation - ieee80211-freq-limit Device Tree property support - Qualcomm (ath12k): - refactoring in preparation of multi-link support - suspend and hibernation support - ACPI support - debugfs support, including dfs_simulate_radar support - RealTek: - rtw88: RTL8723CS SDIO device support - rtw89: RTL8922AE Wi-Fi 7 PCI device support - rtw89: complete features of new WiFi 7 chip 8922AE including BT-coexistence and Wake-on-WLAN - rtw89: use BIOS ACPI settings to set TX power and channels - rtl8xxxu: enable Management Frame Protection (MFP) support - Bluetooth: - support for Intel BlazarI and Filmore Peak2 (BE201) - support for MediaTek MT7921S SDIO - initial support for Intel PCIe BT driver - remove HCI_AMP support" * tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1827 commits) selftests: netfilter: fix packetdrill conntrack testcase net: gro: fix napi_gro_cb zeroed alignment Bluetooth: btintel_pcie: Refactor and code cleanup Bluetooth: btintel_pcie: Fix warning reported by sparse Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1 Bluetooth: btintel: Fix compiler warning for multi_v7_defconfig config Bluetooth: btintel_pcie: Fix compiler warnings Bluetooth: btintel_pcie: Add *setup* function to download firmware Bluetooth: btintel_pcie: Add support for PCIe transport Bluetooth: btintel: Export few static functions Bluetooth: HCI: Remove HCI_AMP support Bluetooth: L2CAP: Fix div-by-zero in l2cap_le_flowctl_init() Bluetooth: qca: Fix error code in qca_read_fw_build_info() Bluetooth: hci_conn: Use __counted_by() and avoid -Wfamnae warning Bluetooth: btintel: Add support for Filmore Peak2 (BE201) Bluetooth: btintel: Add support for BlazarI LE Create Connection command timeout increased to 20 secs dt-bindings: net: bluetooth: Add MediaTek MT7921S SDIO Bluetooth Bluetooth: compute LE flow credits based on recvbuf space Bluetooth: hci_sync: Use cmd->num_cis instead of magic number ...
3 daysMerge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds1-6/+1
Pull fsverity update from Eric Biggers: "Fix a false positive kmemleak warning" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: use register_sysctl_init() to avoid kmemleak warning
3 daysMerge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linuxLinus Torvalds1-6/+26
Pull fscrypt update from Eric Biggers: "Improve the performance of opening unencrypted files on filesystems that support fscrypt" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fscrypt: try to avoid refing parent dentry in fscrypt_file_open
3 daysMerge tag 'for-linus-6.10-ofs1' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs update from Mike Marshall: "Fix out-of-bounds fsid access. Small fix to quiet warnings from string fortification helpers, suggested by Arnd Bergmann" * tag 'for-linus-6.10-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: fix out-of-bounds fsid access
3 daysMerge tag 'gfs2-for-v6.10' of ↵Linus Torvalds18-258/+320
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Properly fix the glock shrinker this time: it broke in commit "gfs2: Make glock lru list scanning safer" and commit "gfs2: fix glock shrinker ref issues" wasn't actually enough to fix it - On unmount, keep glocks around long enough that no more dlm callbacks can occur on them - Some more folio conversion patches from Matthew Wilcox - Lots of other smaller fixes and cleanups * tag 'gfs2-for-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (27 commits) gfs2: make timeout values more explicit gfs2: Convert gfs2_aspace_writepage() to use a folio gfs2: Add a migrate_folio operation for journalled files gfs2: Simplify gfs2_read_super gfs2: Convert gfs2_page_mkwrite() to use a folio gfs2: gfs2_freeze_unlock cleanup gfs2: Remove and replace gfs2_glock_queue_work gfs2: do_xmote fixes gfs2: finish_xmote cleanup gfs2: Unlock fewer glocks on unmount gfs2: Fix potential glock use-after-free on unmount gfs2: Remove ill-placed consistency check gfs2: Fix lru_count accounting gfs2: Fix "Make glock lru list scanning safer" Revert "gfs2: fix glock shrinker ref issues" gfs2: Fix "ignore unlock failures after withdraw" gfs2: Get rid of unnecessary test_and_set_bit gfs2: Don't set GLF_LOCK in gfs2_dispose_glock_lru gfs2: Replace gfs2_glock_queue_put with gfs2_glock_put_async gfs2: Get rid of gfs2_glock_queue_put in signal_our_withdraw ...
3 daysMerge tag 'dlm-6.10' of ↵Linus Torvalds24-1482/+1363
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This set includes some small fixes, and some big internal changes: - Fix a long standing race between the unlock callback for the last lkb struct, and removing the rsb that became unused after the final unlock. This could lead different nodes to inconsistent info about the rsb master node. - Remove unnecessary refcounting on callback structs, returning to the way things were done in the past. - Do message processing in softirq context. This allows dlm messages to be cleared more quickly and efficiently, reducing long lists of incomplete requests. A future change to run callbacks directly from this context will make this more effective. - The softirq message processing involved a number of patches changing mutexes to spinlocks and rwlocks, and a fair amount of code re-org in preparation. - Use an rhashtable for rsb structs, rather than our old internal hash table implementation. This also required some re-org of lists and locks preparation for the change. - Drop the dlm_scand kthread, and use timers to clear unused rsb structs. Scanning all rsb's periodically was a lot of wasted work. - Fix recent regression in logic for copying LVB data in user space lock requests" * tag 'dlm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: (34 commits) dlm: return -ENOMEM if ls_recover_buf fails dlm: fix sleep in atomic context dlm: use rwlock for lkbidr dlm: use rwlock for rsb hash table dlm: drop dlm_scand kthread and use timers dlm: do not use ref counts for rsb in the toss state dlm: switch to use rhashtable for rsbs dlm: add rsb lists for iteration dlm: merge toss and keep hash table lists into one list dlm: change to single hashtable lock dlm: increment ls_count for dlm_scand dlm: do message processing in softirq context dlm: use spin_lock_bh for message processing dlm: remove schedule in receive path dlm: convert ls_recv_active from rw_semaphore to rwlock dlm: avoid blocking receive at the end of recovery dlm: convert res_lock to spinlock dlm: convert ls_waiters_mutex to spinlock dlm: drop mutex use in waiters recovery dlm: add new struct to save position in dlm_copy_master_names ...
3 daysMerge tag 'for-6.10-tag' of ↵Linus Torvalds50-2309/+2517
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This update brings a few minor performance improvements, otherwise there's a lot of refactoring, cleanups and other sort of not user visible changes. Performance improvements: - inline b-tree locking functions, improvement in metadata-heavy changes - relax locking on a range that's being reflinked, allows read operations to run in parallel - speed up NOCOW write checks (throughput +9% on a sample test) - extent locking ranges have been reduced in several places, namely around delayed ref processing Core: - more page to folio conversions: - relocation - send - compression - inline extent handling - super block write and wait - extent_map structure optimizations: - reduced structure size - code simplifications - add shrinker for allocated objects, the numbers can go high and could exhaust memory on smaller systems (reported) as they may not get an opportunity to be freed fast enough - extent locking optimizations: - reduce locking ranges where it does not seem to be necessary and are safe due to other means of synchronization - potential improvements due to lower contention, allocation/freeing and state management operations of extent state tracking structures - delayed ref cleanups and simplifications - updated trace points - improved error handling, warnings and assertions - cleanups and refactoring, unification of error handling paths" * tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits) btrfs: qgroup: fix initialization of auto inherit array btrfs: count super block write errors in device instead of tracking folio error state btrfs: use the folio iterator in btrfs_end_super_write() btrfs: convert super block writes to folio in write_dev_supers() btrfs: convert super block writes to folio in wait_dev_supers() bio: Export bio_add_folio_nofail to modules btrfs: remove duplicate included header from fs.h btrfs: add a cached state to extent_clear_unlock_delalloc btrfs: push extent lock down in submit_one_async_extent btrfs: push lock_extent down in cow_file_range() btrfs: move can_cow_file_range_inline() outside of the extent lock btrfs: push lock_extent into cow_file_range_inline btrfs: push extent lock into cow_file_range btrfs: push extent lock into run_delalloc_cow btrfs: remove unlock_extent from run_delalloc_compressed btrfs: push extent lock down in run_delalloc_nocow btrfs: adjust while loop condition in run_delalloc_nocow btrfs: push extent lock into run_delalloc_nocow btrfs: push the extent lock into btrfs_run_delalloc_range btrfs: lock extent when doing inline extent in compression ...
3 daysMerge tag 'erofs-for-6.10-rc1' of ↵Linus Torvalds11-212/+555
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "The LZ4 global buffer count is now configurable instead of the previous per-CPU buffers, which is useful for bare metals with hundreds of CPUs. A reserved buffer pool for LZ4 decompression can also be enabled to minimize the tail allocation latencies under the low memory scenarios with heavy memory pressure. In addition, Zstandard algorithm is now supported as an alternative since it has been requested by users for a while. There are some random cleanups as usual. Summary: - Make LZ4 global buffers configurable instead of per-CPU buffers - Add a reserved buffer pool for LZ4 decompression for lower latencies - Support Zstandard compression algorithm as an alternative - Derive fsid from on-disk UUID for .statfs() if possible - Minor cleanups" * tag 'erofs-for-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: Zstandard compression support erofs: clean up z_erofs_load_full_lcluster() erofs: derive fsid from on-disk UUID for .statfs() if possible erofs: add a reserved buffer pool for lz4 decompression erofs: do not use pagepool in z_erofs_gbuf_growsize() erofs: rename per-CPU buffers to global buffer pool and make it configurable erofs: rename utils.c to zutil.c
3 dayssmb3: fix perf regression with cached writes with netfs conversionSteve French1-3/+0
Write through mode is for cache=none, not for default (when caching is allowed if we have a lease). Some tests were running much, much more slowly as a result of disabling caching of writes by default. Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de> Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib") Signed-off-by: Steve French <stfrench@microsoft.com>
3 daysMerge tag 'efi-next-for-v6.10' of ↵Linus Torvalds2-6/+4
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI updates from Ard Biesheuvel: "Only a handful of changes this cycle, consisting of cleanup work and a low-prio bugfix: - Additional cleanup by Tim for the efivarfs variable name length confusion - Avoid freeing a bogus pointer when virtual remapping is omitted in the EFI boot stub" * tag 'efi-next-for-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi: libstub: only free priv.runtime_map when allocated efi: Clear up misconceptions about a maximum variable name size efivarfs: Remove unused internal struct members Documentation: Mark the 'efivars' sysfs interface as removed efi: pstore: Request at most 512 bytes for variable names
4 dayscifs: Fix locking in cifs_strict_readv()Steve French3-10/+28
Fix to take the i_rwsem (through the netfs locking wrappers) before taking cinode->lock_sem. Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib") Reported-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
4 dayscifs: Change from mempool_destroy to mempool_exit for request poolsSteve French1-2/+2
insmod followed by rmmod was oopsing with the new mempools cifs request patch Fixes: edea94a69730 ("cifs: Add mempools for cifs_io_request and cifs_io_subrequest structs") Suggested-by: David Howells <dhowells@redhat.com> Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>
4 dayssmb: smb2pdu.h: Avoid -Wflex-array-member-not-at-end warningsGustavo A. R. Silva3-30/+33
-Wflex-array-member-not-at-end is coming in GCC-14, and we are getting ready to enable it globally. So, in order to avoid ending up with a flexible-array member in the middle of multiple other structs, we use the `__struct_group()` helper to separate the flexible array from the rest of the members in the flexible structure, and use the tagged `struct create_context_hdr` instead of `struct create_context`. So, with these changes, fix 51 of the following warnings[1]: fs/smb/client/../common/smb2pdu.h:1225:31: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Link: https://gist.github.com/GustavoARSilva/772526a39be3dd4db39e71497f0a9893 [1] Link: https://github.com/KSPP/linux/issues/202 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
4 daysMerge tag 'hardening-6.10-rc1' of ↵Linus Torvalds4-45/+20
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "The bulk of the changes here are related to refactoring and expanding the KUnit tests for string helper and fortify behavior. Some trivial strncpy replacements in fs/ were carried in my tree. Also some fixes to SCSI string handling were carried in my tree since the helper for those was introduce here. Beyond that, just little fixes all around: objtool getting confused about LKDTM+KCFI, preparing for future refactors (constification of sysctl tables, additional __counted_by annotations), a Clang UBSAN+i386 crash fix, and adding more options in the hardening.config Kconfig fragment. Summary: - selftests: Add str*cmp tests (Ivan Orlov) - __counted_by: provide UAPI for _le/_be variants (Erick Archer) - Various strncpy deprecation refactors (Justin Stitt) - stackleak: Use a copy of soon-to-be-const sysctl table (Thomas Weißschuh) - UBSAN: Work around i386 -regparm=3 bug with Clang prior to version 19 - Provide helper to deal with non-NUL-terminated string copying - SCSI: Fix older string copying bugs (with new helper) - selftests: Consolidate string helper behavioral tests - selftests: add memcpy() fortify tests - string: Add additional __realloc_size() annotations for "dup" helpers - LKDTM: Fix KCFI+rodata+objtool confusion - hardening.config: Enable KCFI" * tag 'hardening-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (29 commits) uapi: stddef.h: Provide UAPI macros for __counted_by_{le, be} stackleak: Use a copy of the ctl_table argument string: Add additional __realloc_size() annotations for "dup" helpers kunit/fortify: Fix replaced failure path to unbreak __alloc_size hardening: Enable KCFI and some other options lkdtm: Disable CFI checking for perms functions kunit/fortify: Add memcpy() tests kunit/fortify: Do not spam logs with fortify WARNs kunit/fortify: Rename tests to use recommended conventions init: replace deprecated strncpy with strscpy_pad kunit/fortify: Fix mismatched kvalloc()/vfree() usage scsi: qla2xxx: Avoid possible run-time warning with long model_num scsi: mpi3mr: Avoid possible run-time warning with long manufacturer strings scsi: mptfusion: Avoid possible run-time warning with long manufacturer strings fs: ecryptfs: replace deprecated strncpy with strscpy hfsplus: refactor copy_name to not use strncpy reiserfs: replace deprecated strncpy with scnprintf virt: acrn: replace deprecated strncpy with strscpy ubsan: Avoid i386 UBSAN handler crashes with Clang ubsan: Remove 1-element array usage in debug reporting ...
4 daysMerge tag 'execve-6.10-rc1' of ↵Linus Torvalds4-48/+72
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - Provide knob to change (previously fixed) coredump NOTES size (Allen Pais) - Add sched_prepare_exec tracepoint (Marco Elver) - Make /proc/$pid/auxv work under binfmt_elf_fdpic (Max Filippov) - Convert ARCH_HAVE_EXTRA_ELF_NOTES to proper Kconfig (Vignesh Balasubramanian) - Leave a gap between .bss and brk * tag 'execve-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: fs/coredump: Enable dynamic configuration of max file note size binfmt_elf_fdpic: fix /proc/<pid>/auxv binfmt_elf: Leave a gap between .bss and brk Replace macro "ARCH_HAVE_EXTRA_ELF_NOTES" with kconfig tracing: Add sched_prepare_exec tracepoint
4 daysMerge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linuxLinus Torvalds1-2/+1
Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. - MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...
4 daysMerge tag 'vfs-6.10.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds3-42/+82
Pull vfs rw iterator updates from Christian Brauner: "The core fs signalfd, userfaultfd, and timerfd subsystems did still use f_op->read() instead of f_op->read_iter(). Convert them over since we should aim to get rid of f_op->read() at some point. Aside from that io_uring and others want to mark files as FMODE_NOWAIT so it can make use of per-IO nonblocking hints to enable more efficient IO. Converting those users to f_op->read_iter() allows them to be marked with FMODE_NOWAIT" * tag 'vfs-6.10.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: signalfd: convert to ->read_iter() userfaultfd: convert to ->read_iter() timerfd: convert to ->read_iter() new helper: copy_to_iter_full()
4 daysMerge tag 'vfs-6.10.netfs' of ↵Linus Torvalds40-4467/+2835
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull netfs updates from Christian Brauner: "This reworks the netfslib writeback implementation so that pages read from the cache are written to the cache through ->writepages(), thereby allowing the fscache page flag to be retired. The reworking also: - builds on top of the new writeback_iter() infrastructure - makes it possible to use vectored write RPCs as discontiguous streams of pages can be accommodated - makes it easier to do simultaneous content crypto and stream division - provides support for retrying writes and re-dividing a stream - replaces the ->launder_folio() op, so that ->writepages() is used instead - uses mempools to allocate the netfs_io_request and netfs_io_subrequest structs to avoid allocation failure in the writeback path Some code that uses the fscache page flag is retained for compatibility purposes with nfs and ceph. The code is switched to using the synonymous private_2 label instead and marked with deprecation comments. The merge commit contains additional details on the new algorithm that I've left out of here as it would probably be excessively detailed. On top of the netfslib infrastructure this contains the work to convert cifs over to netfslib" * tag 'vfs-6.10.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) cifs: Enable large folio support cifs: Remove some code that's no longer used, part 3 cifs: Remove some code that's no longer used, part 2 cifs: Remove some code that's no longer used, part 1 cifs: Cut over to using netfslib cifs: Implement netfslib hooks cifs: Make add_credits_and_wake_if() clear deducted credits cifs: Add mempools for cifs_io_request and cifs_io_subrequest structs cifs: Set zero_point in the copy_file_range() and remap_file_range() cifs: Move cifs_loose_read_iter() and cifs_file_write_iter() to file.c cifs: Replace the writedata replay bool with a netfs sreq flag cifs: Make wait_mtu_credits take size_t args cifs: Use more fields from netfs_io_subrequest cifs: Replace cifs_writedata with a wrapper around netfs_io_subrequest cifs: Replace cifs_readdata with a wrapper around netfs_io_subrequest cifs: Use alternative invalidation to using launder_folio netfs, afs: Use writeback retry to deal with alternate keys netfs: Miscellaneous tidy ups netfs: Remove the old writeback code netfs: Cut over to using new writeback code ...
4 daysMerge tag 'vfs-6.10.mount' of ↵Linus Torvalds6-312/+330
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount API conversions from Christian Brauner: "This converts qnx6, minix, debugfs, tracefs, freevxfs, and openpromfs to the new mount api, further reducing the number of filesystems relying on the legacy mount api" * tag 'vfs-6.10.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: minix: convert minix to use the new mount api vfs: Convert tracefs to use the new mount API vfs: Convert debugfs to use the new mount API openpromfs: finish conversion to the new mount API freevxfs: Convert freevxfs to the new mount API. qnx6: convert qnx6 to use the new mount api
4 daysMerge tag 'vfs-6.10.misc' of ↵Linus Torvalds30-174/+275
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That means we now have six free FMODE_* bits in total (but bit #6 already got used for FMODE_WRITE_RESTRICTED) - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup) - Add fd_raw cleanup class so we can make use of automatic cleanup provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well - Optimize seq_puts() - Simplify __seq_puts() - Add new anon_inode_getfile_fmode() api to allow specifying f_mode instead of open-coding it in multiple places - Annotate struct file_handle with __counted_by() and use struct_size() - Warn in get_file() whether f_count resurrection from zero is attempted (epoll/drm discussion) - Folio-sophize aio - Export the subvolume id in statx() for both btrfs and bcachefs - Relax linkat(AT_EMPTY_PATH) requirements - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors for dup*() equality replacing kcmp() Cleanups: - Compile out swapfile inode checks when swap isn't enabled - Use (1 << n) notation for FMODE_* bitshifts for clarity - Remove redundant variable assignment in fs/direct-io - Cleanup uses of strncpy in orangefs - Speed up and cleanup writeback - Move fsparam_string_empty() helper into header since it's currently open-coded in multiple places - Add kernel-doc comments to proc_create_net_data_write() - Don't needlessly read dentry->d_flags twice Fixes: - Fix out-of-range warning in nilfs2 - Fix ecryptfs overflow due to wrong encryption packet size calculation - Fix overly long line in xfs file_operations (follow-up to FMODE_* cleanup) - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up to FMODE_* cleanup) - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_* cleanup) - Fix stable offset api to prevent endless loops - Fix afs file server rotations - Prevent xattr node from overflowing the eraseblock in jffs2 - Move fdinfo PTRACE_MODE_READ procfs check into the .permission() operation instead of .open() operation since this caused userspace regressions" * tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) afs: Fix fileserver rotation getting stuck selftests: add F_DUPDFD_QUERY selftests fcntl: add F_DUPFD_QUERY fcntl() file: add fd_raw cleanup class fs: WARN when f_count resurrection is attempted seq_file: Simplify __seq_puts() seq_file: Optimize seq_puts() proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation fs: Create anon_inode_getfile_fmode() xfs: don't call xfs_file_open from xfs_dir_open xfs: drop fop_flags for directories xfs: fix overly long line in the file_operations shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() jffs2: prevent xattr node from overflowing the eraseblock vfs, swap: compile out IS_SWAPFILE() on swapless configs vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements fs/direct-io: remove redundant assignment to variable retval fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading ...
4 daysMerge tag 'vfs-6.10.iomap' of ↵Linus Torvalds1-54/+65
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs iomap updates from Christian Brauner: "This contains a few cleanups to the iomap code. Nothing particularly stands out" * tag 'vfs-6.10.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: do some small logical cleanup in buffered write iomap: make iomap_write_end() return a boolean iomap: use a new variable to handle the written bytes in iomap_write_iter() iomap: don't increase i_size if it's not a write operation iomap: drop the write failure handles when unsharing and zeroing iomap: convert iomap_writepages to writeack_iter
7 daysMerge tag 'mm-hotfixes-stable-2024-05-10-13-14' of ↵Linus Torvalds2-10/+18
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM fixes from Andrew Morton: "18 hotfixes, 7 of which are cc:stable. More fixups for this cycle's page_owner updates. And a few userfaultfd fixes. Otherwise, random singletons - see the individual changelogs for details" * tag 'mm-hotfixes-stable-2024-05-10-13-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add entry for Barry Song selftests/mm: fix powerpc ARCH check mailmap: add entry for John Garry XArray: set the marks correctly when splitting an entry selftests/vDSO: fix runtime errors on LoongArch selftests/vDSO: fix building errors on LoongArch mm,page_owner: don't remove __GFP_NOLOCKDEP in add_stack_record_to_list fs/proc/task_mmu: fix uffd-wp confusion in pagemap_scan_pmd_entry() fs/proc/task_mmu: fix loss of young/dirty bits during pagemap scan mm/vmalloc: fix return value of vb_alloc if size is 0 mm: use memalloc_nofs_save() in page_cache_ra_order() kmsan: compiler_types: declare __no_sanitize_or_inline lib/test_xarray.c: fix error assumptions on check_xa_multi_store_adv_add() tools: fix userspace compilation with new test_xarray changes MAINTAINERS: update URL's for KEYS/KEYRINGS_INTEGRITY and TPM DEVICE DRIVER mm: page_owner: fix wrong information in dump_page_owner maple_tree: fix mas_empty_area_rev() null pointer dereference mm/userfaultfd: reset ptes when close() for wr-protected ones
8 daysafs: Fix fileserver rotation getting stuckDavid Howells1-2/+6
Fix the fileserver rotation code in a couple of ways: (1) op->server_states is an array, not a pointer to a single record, so fix the places that access it to index it. (2) In the places that go through an address list to work out which one has the best priority, fix the loops to skip known failed addresses. Without this, the rotation algorithm may get stuck on addresses that are inaccessible or don't respond. This can be triggered manually by finding a server that advertises a non-routable address and giving it a higher priority, eg.: echo "add udp 192.168.0.0/16 3000" >/proc/fs/afs/addr_prefs if the server, say, includes the address 192.168.7.7 in its address list, and then attempting to access a volume on that server. Fixes: 495f2ae9e355 ("afs: Fix fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/4005300.1712309731@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/998836.1714746152@warthog.procyon.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
8 daysfcntl: add F_DUPFD_QUERY fcntl()Linus Torvalds1-0/+20
Often userspace needs to know whether two file descriptors refer to the same struct file. For example, systemd uses this to filter out duplicate file descriptors in it's file descriptor store (cf. [1]) and vulkan uses it to compare dma-buf fds (cf. [2]). The only api we provided for this was kcmp() but that's not generally available or might be disallowed because it is way more powerful (allows ordering of file pointers, operates on non-current task) etc. So give userspace a simple way of comparing two file descriptors for sameness adding a new fcntl() F_DUDFD_QUERY. Link: https://github.com/systemd/systemd/blob/a4f0e0da3573a10bc5404142be8799418760b1d1/src/basic/fd-util.c#L517 [1] Link: https://gitlab.freedesktop.org/wlroots/wlroots/-/blob/master/render/vulkan/texture.c#L490 [2] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [brauner: commit message] Signed-off-by: Christian Brauner <brauner@kernel.org>
8 daysMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski40-219/+507
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 35d92abfbad8 ("net: hns3: fix kernel crash when devlink reload during initialization") 2a1a1a7b5fd7 ("net: hns3: add command queue trace for hns3") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
9 dayserofs: Zstandard compression supportGao Xiang9-1/+333
Add Zstandard compression as the 4th supported algorithm since it becomes more popular now and some end users have asked this for quite a while [1][2]. Each EROFS physical cluster contains only one valid standard Zstandard frame as described in [3] so that decompression can be performed on a per-pcluster basis independently. Currently, it just leverages multi-call stream decompression APIs with internal sliding window buffers. One-shot or bufferless decompression could be implemented later for even better performance if needed. [1] https://github.com/erofs/erofs-utils/issues/6 [2] https://lore.kernel.org/r/Y08h+z6CZdnS1XBm@B-P7TQMD6M-0146.lan [3] https://www.rfc-editor.org/rfc/rfc8478.txt Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240508234453.17896-1-xiang@kernel.org
9 daysMerge tag '6.9-rc7-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds6-51/+60
Pull smb server fixes from Steve French: "Five ksmbd server fixes, all also for stable - Three fixes related to SMB3 leases (fixes two xfstests, and a locking issue) - Unitialized variable fix - Socket creation fix when bindv6only is set" * tag '6.9-rc7-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: do not grant v2 lease if parent lease key and epoch are not set ksmbd: use rwsem instead of rwlock for lease break ksmbd: avoid to send duplicate lease break notifications ksmbd: off ipv6only for both ipv4/ipv6 binding ksmbd: fix uninitialized symbol 'share' in smb2_tree_connect()
9 daysMerge tag 'fuse-fixes-6.9-final' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Two one-liner fixes for issues introduced in -rc1" * tag 'fuse-fixes-6.9-final' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: virtiofs: include a newline in sysfs tag fuse: verify zero padding in fuse_backing_map
9 daysMerge tag 'exfat-for-6.9-rc8' of ↵Linus Torvalds2-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fixes from Namjae Jeon: - Fix xfstests generic/013 test failure with dirsync mount option - Initialize the reserved fields of deleted file and stream extension dentries to zero * tag 'exfat-for-6.9-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: zero the reserved fields of file and stream extension dentries exfat: fix timing of synchronizing bitmap and inode
9 daysfscrypt: try to avoid refing parent dentry in fscrypt_file_openMateusz Guzik1-6/+26
Merely checking if the directory is encrypted happens for every open when using ext4, at the moment refing and unrefing the parent, costing 2 atomics and serializing opens of different files. The most common case of encryption not being used can be checked for with RCU instead. Sample result from open1_processes -t 20 ("Separate file open/close") from will-it-scale on Sapphire Rapids (ops/s): before: 12539898 after: 25575494 (+103%) v2: - add a comment justifying rcu usage, submitted by Eric Biggers - whack spurious IS_ENCRYPTED check from the refed case Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240508081400.422212-1-mjguzik@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com>
9 daysMerge tag 'bcachefs-2024-05-07.2' of https://evilpiepirate.org/git/bcachefsLinus Torvalds20-71/+150
Pull bcachefs fixes from Kent Overstreet: - Various syzbot fixes; mainly small gaps in validation - Fix an integer overflow in fiemap() which was preventing filefrag from returning the full list of extents - Fix a refcounting bug on the device refcount, turned up by new assertions in the development branch - Fix a device removal/readd bug; write_super() was repeatedly dropping and retaking bch_dev->io_ref references * tag 'bcachefs-2024-05-07.2' of https://evilpiepirate.org/git/bcachefs: bcachefs: Add missing sched_annotate_sleep() in bch2_journal_flush_seq_async() bcachefs: Fix race in bch2_write_super() bcachefs: BCH_SB_LAYOUT_SIZE_BITS_MAX bcachefs: Add missing skcipher_request_set_callback() call bcachefs: Fix snapshot_t() usage in bch2_fs_quota_read_inode() bcachefs: Fix shift-by-64 in bformat_needs_redo() bcachefs: Guard against unknown k.k->type in __bkey_invalid() bcachefs: Add missing validation for superblock section clean bcachefs: Fix assert in bch2_alloc_v4_invalid() bcachefs: fix overflow in fiemap bcachefs: Add a better limit for maximum number of buckets bcachefs: Fix lifetime issue in device iterator helpers bcachefs: Fix bch2_dev_lookup() refcounting bcachefs: Initialize bch_write_op->failed in inline data path bcachefs: Fix refcount put in sb_field_resize error path bcachefs: Inodes need extra padding for varint_decode_fast() bcachefs: Fix early error path in bch2_fs_btree_key_cache_exit() bcachefs: bucket_pos_to_bp_noerror() bcachefs: don't free error pointers bcachefs: Fix a scheduler splat in __bch2_next_write_buffer_flush_journal_buf()
9 daysfs/coredump: Enable dynamic configuration of max file note sizeAllen Pais2-2/+22
Introduce the capability to dynamically configure the maximum file note size for ELF core dumps via sysctl. Why is this being done? We have observed that during a crash when there are more than 65k mmaps in memory, the existing fixed limit on the size of the ELF notes section becomes a bottleneck. The notes section quickly reaches its capacity, leading to incomplete memory segment information in the resulting coredump. This truncation compromises the utility of the coredumps, as crucial information about the memory state at the time of the crash might be omitted. This enhancement removes the previous static limit of 4MB, allowing system administrators to adjust the size based on system-specific requirements or constraints. Eg: $ sysctl -a | grep core_file_note_size_limit kernel.core_file_note_size_limit = 4194304 $ sysctl -n kernel.core_file_note_size_limit 4194304 $echo 519304 > /proc/sys/kernel/core_file_note_size_limit $sysctl -n kernel.core_file_note_size_limit 519304 Attempting to write beyond the ceiling value of 16MB $echo 17194304 > /proc/sys/kernel/core_file_note_size_limit bash: echo: write error: Invalid argument Signed-off-by: Vijay Nag <nagvijay@microsoft.com> Signed-off-by: Allen Pais <apais@linux.microsoft.com> Link: https://lore.kernel.org/r/20240506193700.7884-1-apais@linux.microsoft.com Signed-off-by: Kees Cook <keescook@chromium.org>
9 dayserofs: clean up z_erofs_load_full_lcluster()Gao Xiang2-20/+6
Only four lcluster types here, remove redundant code. No real logic changes. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240508123357.3266173-1-hsiangkao@linux.alibaba.com
10 dayserofs: derive fsid from on-disk UUID for .statfs() if possibleHongzhen Luo1-7/+5
Use the superblock's UUID to generate the fsid when it's non-null. Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Link: https://lore.kernel.org/r/20240409113022.74720-1-hongzhen@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
10 dayserofs: add a reserved buffer pool for lz4 decompressionChunhai Guo3-17/+52
This adds a special global buffer pool (in the end) for reserved pages. Using a reserved pool for LZ4 decompression significantly reduces the time spent on extra temporary page allocation for the extreme cases in low memory scenarios. The table below shows the reduction in time spent on page allocation for LZ4 decompression when using a reserved pool. The results were obtained from multi-app launch benchmarks on ARM64 Android devices running the 5.15 kernel with an 8-core CPU and 8GB of memory. In the benchmark, we launched 16 frequently-used apps, and the camera app was the last one in each round. The data in the table is the average time of camera app for each round. After using the reserved pool, there was an average improvement of 150ms in the overall launch time of our camera app, which was obtained from the systrace log. +--------------+---------------+--------------+---------+ | | w/o page pool | w/ page pool | diff | +--------------+---------------+--------------+---------+ | Average (ms) | 3434 | 21 | -99.38% | +--------------+---------------+--------------+---------+ Based on the benchmark logs, 64 pages are sufficient for 95% of scenarios. This value can be adjusted with a module parameter `reserved_pages`. The default value is 0. This pool is currently only used for the LZ4 decompressor, but it can be applied to more decompressors if needed. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240402131523.2703948-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
10 dayserofs: do not use pagepool in z_erofs_gbuf_growsize()Chunhai Guo1-36/+31
Let's use alloc_pages_bulk_array() for simplicity and get rid of unnecessary pagepool. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240402092757.2635257-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
10 dayserofs: rename per-CPU buffers to global buffer pool and make it configurableChunhai Guo6-161/+166
It will cost more time if compressed buffers are allocated on demand for low-latency algorithms (like lz4) so EROFS uses per-CPU buffers to keep compressed data if in-place decompression is unfulfilled. While it is kind of wasteful of memory for a device with hundreds of CPUs, and only a small number of CPUs concurrently decompress most of the time. This patch renames it as 'global buffer pool' and makes it configurable. This allows two or more CPUs to share a common buffer to reduce memory occupation. Suggested-by: Gao Xiang <xiang@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Link: https://lore.kernel.org/r/20240402100036.2673604-1-guochunhai@vivo.com Signed-off-by: Sandeep Dhavale <dhavale@google.com> Link: https://lore.kernel.org/r/20240408215231.3376659-1-dhavale@google.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
10 dayserofs: rename utils.c to zutil.cChunhai Guo2-21/+13
Currently, utils.c is only useful if CONFIG_EROFS_FS_ZIP is on. So let's rename it to zutil.c as well as avoid its inclusion if CONFIG_EROFS_FS_ZIP is explicitly disabled. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240401135550.2550043-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
10 daysvirtiofs: include a newline in sysfs tagBrian Foster1-1/+1
The internal tag string doesn't contain a newline. Append one when emitting the tag via sysfs. [Stefan] Orthogonal to the newline issue, sysfs_emit(buf, "%s", fs->tag) is needed to prevent format string injection. Signed-off-by: Brian Foster <bfoster@redhat.com> Fixes: a8f62f50b4e4 ("virtiofs: export filesystem tags through sysfs") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
10 daysbtrfs: qgroup: fix initialization of auto inherit arrayDan Carpenter1-1/+1
The "i++" was accidentally left out so it just sets qgids[0] over and over. This can lead to unexpected problems, as the groups[1:] would be all 0, leading to later find_qgroup_rb() unable to find a qgroup and cause snapshot creation failure. Fixes: 5343cd9364ea ("btrfs: qgroup: simple quota auto hierarchy for nested subvolumes") CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: count super block write errors in device instead of tracking folio ↵Matthew Wilcox (Oracle)3-28/+29
error state Currently the error status of super block write is tracked in page/folio status bit Error. For that we need to keep the reference for the whole duration of write and wait. Count the number of superblock writeback errors in the btrfs_device. That means we don't need the folio to stay around until it's waited for, and can avoid the extra call to folio_get/put. Also remove a mention of PageError in a comment as it's the last mention of the page Error state. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: use the folio iterator in btrfs_end_super_write()Matthew Wilcox (Oracle)1-13/+6
Iterate over folios instead of bvecs. Switch the order of unlock and put to be the usual order; we know this folio can't be put until it's been waited for, but that's fragile. Remove the calls to ClearPageUptodate / SetPageUptodate -- if PAGE_SIZE is larger than BTRFS_SUPER_INFO_SIZE, we'd be marking the entire folio uptodate without having actually initialised all the bytes in the page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: convert super block writes to folio in write_dev_supers()Matthew Wilcox (Oracle)1-10/+13
This is a direct conversion from pages to folios, assuming single page folio. Also removes some calls to obsolete APIs and some hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: convert super block writes to folio in wait_dev_supers()Matthew Wilcox (Oracle)1-11/+12
This is a direct conversion from pages to folios, assuming single page folio. Also removes a few calls to compound_head() and calls to obsolete APIs. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove duplicate included header from fs.hThorsten Blum1-1/+0
Remove duplicate included header file linux/blkdev.h . Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: add a cached state to extent_clear_unlock_delallocJosef Bacik3-19/+28
Now that we have the lock_extent tightly coupled with extent_clear_unlock_delalloc we can add a cached state to extent_clear_unlock_delalloc and benefit from skipping the extra lookup when we're doing cow. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: push extent lock down in submit_one_async_extentJosef Bacik1-1/+2
We don't need to include the time we spend in the allocator under our extent lock protection, move it after the allocator and make sure we lock the extent in the error case to ensure we're not clearing these bits without the extent lock held. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: push lock_extent down in cow_file_range()Josef Bacik1-2/+14
Now that we've got the extent lock pushed into cow_file_range() we can push it further down into the allocation loop. This allows us to only hold the extent lock during the dropping of the extent map range and inserting the ordered extent. This makes the error case a little trickier as we'll now have to lock the range before clearing any of the other extent bits for the range, but this is the error path so is less performance critical. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: move can_cow_file_range_inline() outside of the extent lockJosef Bacik1-4/+8
These checks aren't reliant on the extent lock. Move this up into cow_file_range_inline(), and then update encoded writes to call this check before calling __cow_file_range_inline(). This will allow us to skip the extent lock if we're not able to inline the given extent. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: push lock_extent into cow_file_range_inlineJosef Bacik1-5/+8
Now that we've pushed the lock_extent() into cow_file_range() we can push the extent locking into cow_file_range_inline() and move the lock_extent in cow_file_range() to after we call cow_file_range_inline(). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: push extent lock into cow_file_rangeJosef Bacik1-8/+7
Now that cow_file_range is the only function that is called with the range locked, push this call into cow_file_range so we can further narrow the scope. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: push extent lock into run_delalloc_cowJosef Bacik1-8/+7
This is used by zoned but also as the fallback for uncompressed extents when we fail to compress the ranges. Push the extent lock into run_dealloc_cow(), and adjust the compression case to take the extent lock after calling run_delalloc_cow(). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove unlock_extent from run_delalloc_compressedJosef Bacik1-6/+5
Since we immediately unlock the extent range when we enter run_delalloc_compressed() simply move the lock_extent() down to cover cow_file_range() and then remove the unlock_extent() from run_delalloc_compressed. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: push extent lock down in run_delalloc_nocowJosef Bacik1-3/+18
run_delalloc_nocow is a little special because we use the file extents to see if we can nocow a range. We don't actually need the protection of the extent lock to look at the file extents at this point however. We are currently holding the page lock for this range, so we are protected from anybody who would simultaneously be modifying the file extent items for this range. * mmap() - we're holding the page lock. * buffered writes - we're holding the page lock. * direct writes - we're holding the page lock and direct IO has to flush page cache before it's able to continue. * fallocate() - all callers flush the range and wait on ordered extents while holding the inode lock and the mmap lock, so we are again saved by the page lock. We want to use the extent lock to protect 1) The mapping tree for the given range. 2) The ordered extents for the given range. 3) The io_tree for the given range. Push the extent lock down to cover these operations. In the fallback_to_cow() case we simply lock before doing anything and rely on the cow_file_range() helper to handle it's range properly. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: adjust while loop condition in run_delalloc_nocowJosef Bacik1-3/+1
We have the following pattern while (1) { if (cur_offset > end) break; } Which is just while (cur_offset <= end) { ... } so adjust the code to be more clear. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: push extent lock into run_delalloc_nocowJosef Bacik1-5/+7
run_delalloc_nocow is a bit special as it walks through the file extents for the inode and determines what it can nocow and what it can't. This is the more complicated area for extent locking, so start with this function. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: push the extent lock into btrfs_run_delalloc_rangeJosef Bacik2-3/+7
We want to limit the scope of the extent lock to be around operations that can change in flight. Currently we hold the extent lock through the entire writepage operation, which isn't really necessary. We want to protect to make sure nobody has updated DELALLOC. In find_lock_delalloc_range we must lock the range in order to validate the contents of our io_tree. However once we've done that we're safe to unlock the range and continue, as we have the page lock already held for the range. We are protected from all operations at this point. * mmap() - we're holding the page lock, thus are protected. * buffered writes - again, we're protected because we take the page lock for the first and last page in our range for buffered writes so we won't create new delalloc ranges in this area. * direct IO - we invalidate pagecache before attempting to write a new area, which requires the page lock, so again are protected once we're holding the page lock on this range. Additionally this behavior actually already exists for compressed, we unlock the range as soon as we start to process the async extents, and re-lock it during compression. So this is completely safe, and makes the locking more consistent. Make this simple by just pushing the extent lock into btrfs_run_delalloc_range. From there followup patches will push the lock further down into its users. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: lock extent when doing inline extent in compressionJosef Bacik1-10/+7
We currently don't lock the extent when we're doing a cow_file_range_inline() for a compressed extent. This isn't a problem necessarily, but it's inconsistent with the rest of our usage of cow_file_range_inline(). This also leads to some extra weird logic around whether the extent is locked or not. Fix this to lock the extent before calling cow_file_range_inline() in compression to make it consistent with the rest of the inline users. In future patches this will be pushed down into the cow_file_range_inline() helper, so we're fine with the quick and dirty locking here. This patch exists to make the behavior change obvious. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: move extent bit and page cleanup into cow_file_range_inlineJosef Bacik1-53/+51
We duplicate the extent cleanup for cow_file_range_inline() in the cow and compressed case. The encoded case doesn't need to do cleanup the same way, so rename cow_file_range_inline to __cow_file_range_inline and then make cow_file_range_inline handle the extent cleanup appropriately, and update the callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: unlock all the pages with successful inline extent creationJosef Bacik1-14/+1
Since 4750af3bbe5d ("btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig()") we have been unlocking the locked page manually instead of via extent_clear_unlock_delalloc() because of subpage blocksize support. However we actually disable inline extent creation for subpage blocksize support, so this behavior isn't necessary. Remove this code and comment, if at some point the subpage blocksize code grows support for inline extents this can be re-evaluated. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: push all inline logic into cow_file_rangeJosef Bacik1-62/+81
Currently we have a lot of duplicated checks of if (start == 0 && fs_info->sectorsize == PAGE_SIZE) cow_file_range_inline(); Instead of duplicating this check everywhere, consolidate all of the inline extent logic into a helper which documents all of the checks and then use that helper inside of cow_file_range_inline(). With this we can clean up all of the calls to either unconditionally call cow_file_range_inline(), or at least reduce the checks we're doing before we call cow_file_range_inline(); Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: handle errors in btrfs_reloc_clone_csums properlyJosef Bacik4-4/+12
In the cow path we will clone the reloc csums for relocated data extents, and if there's an error we already have an ordered extent and rely on the ordered extent finishing to clean everything up. There's a problem however, we don't mark the ordered extent with an error, we pretend like everything was just fine. If we were at the end of our range we won't actually bubble up this error anywhere, and we could end up inserting an extent that doesn't have csums where it should have them. Fix this by adding a helper to mark the ordered extent with an error, and then use this when we fail to lookup the csums in btrfs_reloc_clone_csums. Use this helper in the other place where we use the same pattern while we're here. This will prevent us from erroneously inserting the extent that doesn't have the required checksums. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: add extra sanity checks for create_io_em()Qu Wenruo1-1/+39
The function create_io_em() is called before we submit an IO, to update the in-memory extent map for the involved range. This patch changes the following aspects: - Does not allow BTRFS_ORDERED_NOCOW type For real NOCOW (excluding NOCOW writes into preallocated ranges) writes, we never call create_io_em(), as we does not need to update the extent map at all. So remove the sanity check allowing BTRFS_ORDERED_NOCOW type. - Add extra sanity checks * PREALLOC - @block_len == len For uncompressed writes. * REGULAR - @block_len == @orig_block_len == @ram_bytes == @len We're creating a new uncompressed extent, and referring all of it. - @orig_start == @start We haven no offset inside the extent. * COMPRESSED - valid @compress_type - @len <= @ram_bytes This is to co-operate with encoded writes, which can cause a new file extent referring only part of a uncompressed extent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: simplify the inline extent map creationQu Wenruo1-10/+10
With the tree-checker ensuring all inline file extents starts at file offset 0 and has a length no larger than sectorsize, we can simplify the calculation to assigned those fixes values directly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: add extra comments on extent_map membersQu Wenruo1-1/+54
The extent_map structure is very critical to btrfs, as it is involved for both read and write paths. Unfortunately the structure is not properly explained, making it pretty hard to understand nor to do further improvement. This patch adds extra comments explaining the major members based on my code reading. Hopefully we can find more members to cleanup in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: drop unused argument of calcu_metadata_size()Naohiro Aota1-6/+5
calcu_metadata_size() has a "reserve" argument, but the only caller always set it to "1". The other usage (reserve = 0) is dropped by a commit 0647bf564f1e ("Btrfs: improve forever loop when doing balance relocation"), which is more than 10 years ago. Drop the argument and simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: simplify return variables in btrfs_drop_subtree()Anand Jain1-9/+7
There's another return variable wret that is only passed to ret on error, we can simply use ret. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: simplify return variables in lookup_extent_data_ref()Anand Jain1-15/+14
First, drop err instead reuse ret, choose to return the error instead of goto fail and then return the same error. Do not initialize the ret until where it has to be initialized. Slight logic change in handling the btrfs_search_slot() and btrfs_next_leaf() return value. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename return variables in btrfs_qgroup_rescan_worker()Anand Jain1-19/+19
Rename ret to ret2 compile and then err to ret. Also, new ret2 is found to be localized within the 'if (trans)' statement, so move its declaration there. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: drop variable err in quick_update_accounting()Anand Jain1-6/+3
In quick_update_accounting() err is used as 2nd return value, which could be achieved just with ret. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: reuse ret instead of err in relocate_tree_blocks()Anand Jain1-11/+8
Coding style fixes the function relocate_tree_blocks(). After the fix, ret is the return value variable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err and ret to ret in build_backref_tree()Anand Jain1-11/+7
Code style fix in the function build_backref_tree(). Drop the ret initialization 0, as we don't need it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename werr and err to ret in __btrfs_wait_marked_extents()Anand Jain1-13/+8
Rename the function's local return variables err and werr to ret. Also, align the variable declarations with the other declarations in the function for better function space alignment. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename werr and err to ret in btrfs_write_marked_extents()Anand Jain1-13/+10
Rename the function's local variable werr and err to ret. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: report filemap_fdata<write|wait>_range() errorAnand Jain1-0/+4
In the function btrfs_write_marked_extents() and in __btrfs_wait_marked_extents() return the actual error if when filemap_fdata<write|wait>_range() fails. Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: use btrfs_is_testing() everywhereDavid Sterba4-9/+8
There are open coded tests of BTRFS_FS_STATE_DUMMY_FS_INFO and we have a wrapper for that that's a compile-time constant when self-tests are not built in. As this is only for development we can save some bytes and conditions on release configs by using the helper in the remaining cases. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: initialize delayed inodes xarray without GFP_ATOMICFilipe Manana1-2/+1
There's no need to initialize the delayed inodes xarray with a GFP_ATOMIC flag because that actually does nothing on the xarray operations. That was needed for radix trees, but for xarrays the allocation flags are passed as the last argument to xa_store() (which we are using correctly). So initialize the delayed inodes xarray with a simple xa_init(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: make try_release_extent_mapping() return a boolFilipe Manana3-13/+13
Currently try_release_extent_mapping() as an int return type, but we use it as a boolean. Its only caller, the release folio callback, also returns a boolean which corresponds to try_release_extent_mapping()'s return value. So change its return value type to bool as well as its helper try_release_extent_state(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: be better releasing extent maps at try_release_extent_mapping()Filipe Manana1-59/+61
At try_release_extent_mapping(), called during the release folio callback (btrfs_release_folio() callchain), we don't release any extent maps in the range if the GFP flags don't allow blocking. This behaviour is exaggerated because: 1) Both searching for extent maps and removing them are not blocking operations. The only thing that it is the cond_resched() call at the end of the loop that searches for and removes extent maps; 2) We currently only operate on a single page, so for the case where block size matches the page size, we can only have one extent map, and for the case where the block size is smaller than the page size, we can have at most 16 extent maps. So it's very unlikely the cond_resched() call will ever block even in the block size smaller than page size scenario. So instead of not removing any extent maps at all in case the GFP glags don't allow blocking, keep removing extent maps while we don't need to reschedule. This makes it safe for the subpage case and for a future where we can process folios with a size larger than a page. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove i_size restriction at try_release_extent_mapping()Filipe Manana1-2/+1
Currently we don't attempt to release extent maps if the inode has an i_size that is not greater than 16M. This condition was added way back in 2008 by commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), without any explanation about it. A quick chat with Chris on slack revealed that the goal was probably to release the extent maps for small files only when closing the inode. This however can be harmful in case we have tons of such files being kept open for very long periods of time, since we will consume more and more pages for extent maps. So remove the condition. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: use btrfs_get_fs_generation() at try_release_extent_mapping()Filipe Manana1-6/+1
Nowadays we have the btrfs_get_fs_generation() to get the current generation of the filesystem, so there's no need anymore to lock the transaction spinlock to read it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename some variables at try_release_extent_mapping()Filipe Manana1-12/+12
Rename the following variables: 1) "btrfs_inode" to "inode", because it's shorter to type and clear, and we don't have a VFS inode here as well, so there's no confusion; 2) "tree" to "io_tree", to be clear which tree we are dealing with, since we use 2 different trees in the function; 3) "map" to "extent_tree" since "map" gives the idea we are dealing with an extent map for example, but we are dealing with the inode's extent tree (the tree which stores extent maps). These also make the next patches simpler. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: add tracepoints for extent map shrinker eventsFilipe Manana2-1/+17
Add some tracepoints for the extent map shrinker to help debug and analyse main events. These have proved useful during development of the shrinker. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: update comment for btrfs_set_inode_full_sync() about lockingFilipe Manana1-3/+5
Nowadays we have a lock used to synchronize mmap writes with reflink and fsync operations (struct btrfs_inode::i_mmap_lock), so update the comment for btrfs_set_inode_full_sync() to mention that it can also be called while holding that mmap lock. Besides being a valid alternative to the inode's VFS lock, we already have the extent map shrinker using that mmap lock instead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: add a shrinker for extent mapsFilipe Manana4-0/+180
Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: add a global per cpu counter to track number of used extent mapsFilipe Manana3-0/+28
Add a per cpu counter that tracks the total number of extent maps that are in extent trees of inodes that belong to fs trees. This is going to be used in an upcoming change that adds a shrinker for extent maps. Only extent maps for fs trees are considered, because for special trees such as the data relocation tree we don't want to evict their extent maps which are critical for the relocation to work, and since those are limited, it's not a concern to have them in memory during the relocation of a block group. Another case are extent maps for free space cache inodes, which must always remain in memory, but those are limited (there's only one per free space cache inode, which means one per block group). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: pass the extent map tree's inode to try_merge_map()Filipe Manana1-7/+6
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to try_merge_map(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change try_merge_map() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: pass the extent map tree's inode to setup_extent_mapping()Filipe Manana1-5/+5
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to setup_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change setup_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: pass the extent map tree's inode to replace_extent_mapping()Filipe Manana1-5/+6
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to replace_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change replace_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: pass the extent map tree's inode to remove_extent_mapping()Filipe Manana4-20/+25
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to remove_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change remove_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: pass the extent map tree's inode to clear_em_logging()Filipe Manana3-4/+6
Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to clear_em_logging(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change clear_em_logging() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: pass the extent map tree's inode to add_extent_mapping()Filipe Manana1-18/+16
Extent maps are always added to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to add_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change add_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: change root->root_key.objectid to btrfs_root_id()Josef Bacik26-241/+218
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: set start on clone before calling copy_extent_buffer_fullJosef Bacik1-2/+8
Our subpage testing started hanging on generic/560 and I bisected it down to 1cab1375ba6d ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations"). This is subtle because we use eb->start to figure out where in the folio we're copying to when we're subpage, as our ->start may refer to an area inside of the folio. For example, assume a 16K page size machine with a 4K node size, and assume that we already have a cloned extent buffer when we cloned the previous search. copy_extent_buffer_full() will do the following when copying the extent buffer path->nodes[0] (src) into cloned (dest): src->start = 8k; // this is the new leaf we're cloning cloned->start = 4k; // this is left over from the previous clone src_addr = folio_address(src->folios[0]); dest_addr = folio_address(dest->folios[0]); memcpy(dest_addr + get_eb_offset_in_folio(dst, 0), src_addr + get_eb_offset_in_folio(src, 0), src->len); Now get_eb_offset_in_folio() is where the problems occur, because for sub-pagesize blocksize we can have multiple eb's per folio, the code for this is as follows size_t get_eb_offset_in_folio(eb, offset) { return (eb->start + offset & (folio_size(eb->folio[0]) - 1)); } So in the above example we are copying into offset 4K inside the folio. However once we update cloned->start to 8K to match the src the math for get_eb_offset_in_folio() changes, and any subsequent reads (i.e. btrfs_item_key_to_cpu()) will start reading from the offset 8K instead of 4K where we copied to, giving us garbage. Fix this by setting start before we co copy_extent_buffer_full() to make sure that we're copying into the same offset inside of the folio that we will read from later. All other sites of copy_extent_buffer_full() are correct because we either set ->start beforehand or we simply don't change it in the case of the tree-log usage. With this fix we now pass generic/560 on our subpage tests. Fixes: 1cab1375ba6d ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations") Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: replace btrfs_delayed_*_ref with btrfs_*_refJosef Bacik2-38/+27
Now that these two structs are the same, move the btrfs_data_ref and btrfs_tree_ref up and use these in the btrfs_delayed_ref_node. Then remove the btrfs_delayed_*_ref structs. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove the btrfs_delayed_ref_node container helpersJosef Bacik1-27/+0
Now that we don't use these helpers anywhere, remove them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: stop referencing btrfs_delayed_tree_ref directlyJosef Bacik2-15/+16
We only ever need to use this to get the level of the tree block ref, so use the btrfs_delayed_ref_owner() helper, which returns the level for the given reference. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: stop referencing btrfs_delayed_data_ref directlyJosef Bacik2-14/+13
Now that most of our elements are inside of btrfs_delayed_ref_node directly and we have helpers for the delayed_data_ref bits, go ahead and remove all direct usage of btrfs_delayed_data_ref and use the helpers where needed. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: make the insert backref helpers take a btrfs_delayed_ref_nodeJosef Bacik1-25/+21
We don't need to pass in all the elements for the backrefs as function arguments, simply pass through the btrfs_delayed_ref_node and then extract the values we need from that. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: drop unnecessary arguments from __btrfs_free_extentJosef Bacik1-15/+8
We have all the information we need in our btrfs_delayed_ref_node, which we already pass into __btrfs_free_extent. Drop the extra arguments and just extract the values from btrfs_delayed_ref_node. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: make __btrfs_inc_extent_ref take a btrfs_delayed_ref_nodeJosef Bacik2-32/+25
We're just extracting the values from btrfs_delayed_ref_node and passing them through, simply pass the btrfs_delayed_ref_node into __btrfs_inc_extent_ref and shrink the function arguments. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename btrfs_data_ref->ino to ->objectidJosef Bacik3-4/+4
This is how we refer to it in the rest of the extent reference related code, make it consistent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: move ->parent and ->ref_root into btrfs_delayed_ref_nodeJosef Bacik4-77/+48
These two members are shared by both the tree refs and data refs, so move them into btrfs_delayed_ref_node proper. This allows us to greatly simplify the comparison code, as the shared refs always only sort on parent, and the non shared refs always sort first on ref_root, and then only data refs sort on their specific fields. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename ->len to ->num_bytes in btrfs_refJosef Bacik8-27/+27
We consistently use ->num_bytes everywhere through the delayed ref code, except in btrfs_ref. Rename btrfs_ref to match all the other code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: unify the btrfs_add_delayed_*_ref helpers into one helperJosef Bacik1-77/+25
Now that these helpers are identical, create a helper function that handles everything properly and strip the individual helpers down to use just the common helper. This cleans up a significant amount of duplicated code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: simplify delayed ref tracepointsJosef Bacik2-14/+4
Now that all of the delayed ref information is in the delayed ref node, drastically simplify the delayed ref tracepoints by simply passing in the btrfs_delayed_ref_node and populating the tracepoints with the values from the structure itself. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: move ref specific initialization into init_delayed_ref_commonJosef Bacik1-14/+11
Now that the btrfs_delayed_ref_node contains a union of the data and metadata specific information we can move the initialization into init_delayed_ref_common and just use the btrfs_ref to initialize the correct fields of the reference. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: initialize btrfs_delayed_ref_head with btrfs_refJosef Bacik1-28/+25
We are calling init_delayed_ref_head with all of the elements from btrfs_ref, clean this up to simply pass in the btrfs_ref and initialize the btrfs_delayed_ref_head with the values from the btrfs_ref directly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: pass btrfs_ref to init_delayed_ref_commonJosef Bacik2-21/+27
We're extracting all of these values from the btrfs_ref we passed in already, just pass the btrfs_ref through to init_delayed_ref_common and get the values directly from the struct. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: move ref_root into btrfs_refJosef Bacik8-88/+73
We have this in both btrfs_tree_ref and btrfs_data_ref, which is just wasting space and making the code more complicated. Move this into btrfs_ref proper and update all the call sites to do the assignment in btrfs_ref. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: do not use a function to initialize btrfs_refJosef Bacik7-96/+131
btrfs_ref currently has ->owning_root, and ->ref_root is shared between the tree ref and data ref, so in order to move that into btrfs_ref proper I would need to add another root parameter to the initialization function. This function has too many arguments, and adding another root will make it easy to make mistakes about which root goes where. Drop the generic ref init function and statically initialize the btrfs_ref in every usage. This makes the code easier to read because we can see what elements we're assigning, and will make the upcoming change moving the ref_root into the btrfs_ref more clear and less error prone than adding a new element to the initialization function. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: embed data_ref and tree_ref in btrfs_delayed_ref_nodeJosef Bacik2-55/+40
We have been embedding btrfs_delayed_ref_node in the btrfs_delayed_data_ref and btrfs_delayed_tree_ref, and then we have two sets of cachep's and a variety of handling that is awkward because of this separation. Instead union these two members inside of btrfs_delayed_ref_node and make that the first class object. This allows us to go down to one cachep for our delayed ref nodes instead of two. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: add a helper to get the delayed ref node from the data/tree refJosef Bacik2-9/+31
We have several different ways we refer to references throughout the code and it's not consistent and there's a bit of duplication. In order to clean this up I want to have one structure we use to define reference information, and one structure we use for the delayed reference information. Start this process by adding a helper to get from the btrfs_delayed_data_ref/btrfs_delayed_tree_ref to the btrfs_delayed_ref_node so that it'll make moving these structures around simpler. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: use btrfs_find_first_inode() at btrfs_prune_dentries()Filipe Manana1-52/+14
Currently btrfs_prune_dentries() has open code to find the first inode in a root with a minimum inode number. Remove that code and make it use the helper btrfs_find_first_inode() for that task. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: export find_next_inode() as btrfs_find_first_inode()Filipe Manana3-80/+85
Export the relocation private helper find_next_inode() to inode.c, as this same logic is also used at btrfs_prune_dentries() and will be used by an upcoming change that adds an extent map shrinker. The next patch will change btrfs_prune_dentries() to use this helper. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: simplify add_extent_mapping() by removing pointless labelFilipe Manana1-4/+4
The add_extent_mapping() function is short and trivial, there's no need to have a label for a quick exit in case of an error, even because there's no error handling needed, we just need to return the error. So remove that label and return directly. Also while at it remove the redundant initialization of 'ret', as that may help avoid some warnings with clang tools such as the one reported/fixed by commit 966de47ff0c9 ("btrfs: remove redundant initialization of variables in log_new_ancestors"). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: tests: error out on unexpected extent map reference countFilipe Manana1-8/+35
In the extent map self tests, when freeing all extent maps from a test extent map tree we are not expecting to find any extent map with a reference count different from 1 (the tree reference). If we find any, we just log a message but we don't fail the test, which makes it very easy to miss any bug/regression - no one reads the test messages unless a test fails. So change the behaviour to make a test fail if we find an extent map in the tree with a reference count different from 1. Make the failure happen only after removing all extent maps, so that we don't leak memory. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: pass an inode to btrfs_add_extent_mapping()Filipe Manana4-98/+95
Instead of passing fs_info and extent map tree arguments to btrfs_add_extent_mapping(), we can pass an inode instead, as extent maps are always inserted in the extent map tree of an inode, and the fs_info can be extracted from the inode (inode->root->fs_info). The only exception is in the self tests where we allocate an extent map tree and then use it to insert/update/remove extent maps. However the tests can be changed to use a test inode and then use the inode's extent map tree. So change btrfs_add_extent_mapping() to have an inode as an argument instead of a fs_info and an extent map tree. This reduces the number of parameters and will also be needed for an upcoming change. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: open code csum_exist_in_range()Filipe Manana1-12/+7
The csum_exist_in_range() function is now too trivial and is only used in one place, so open code it in its single caller. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: make NOCOW checks for existence of checksums in a range more efficientFilipe Manana4-27/+27
Before deciding if we can do a NOCOW write into a range, one of the things we have to do is check if there are checksum items for that range. We do that through the btrfs_lookup_csums_list() function, which searches for checksums and adds them to a list supplied by the caller. But all we need is to check if there is any checksum, we don't need to look for all of them and collect them into a list, which requires more search time in the checksums tree, allocating memory for checksums items to add to the list, copy checksums from a leaf into those list items, then free that memory, etc. This is all unnecessary overhead, wasting mostly CPU time, and perhaps some occasional IO if we need to read from disk any extent buffers. So change btrfs_lookup_csums_list() to allow to return immediately in case it finds any checksum, without the need to add it to a list and read it from a leaf. This is accomplished by allowing a NULL list parameter and making the function return 1 if it found any checksum, 0 if it didn't found any, and a negative value in case of an error. The following test with fio was used to measure performance: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bssplit=4k/20:8k/20:16k/20:32k/20:64k/20 direct=1 numjobs=16 fallocate=posix time_based runtime=300 [file1] size=8G ioengine=io_uring iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test was run on a release kernel (Debian's default kernel config). The results before this patch: WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec The results after this patch: WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: simplify error path for btrfs_lookup_csums_list()Filipe Manana1-8/+9
In the error path we have this while loop that keeps iterating over the csums of the list and then delete them from the list and free them, testing for an error (ret < 0) and list emptyness as the conditions of the while loop. Simplify this by using list_for_each_entry_safe() so there's no need to delete elements from the list and need to test the error condition on each iteration. Also rename the 'fail' label to 'out' since the label is not exclusive to a failure path, as we also end up there when the function succeeds, and it's also a more common label name. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove use of a temporary list at btrfs_lookup_csums_list()Filipe Manana1-5/+3
There's no need to use a temporary list to add the checksums, we can just add them to input list and then on error delete and free any checksums that were added. So simplify and remove the temporary list. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove search_commit parameter from btrfs_lookup_csums_list()Filipe Manana5-16/+7
All the callers of btrfs_lookup_csums_list() pass a value of 0 as the "search_commit" parameter. So remove it and make the function behave as to always search from the regular root. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: add function comment to btrfs_lookup_csums_list()Filipe Manana1-0/+13
Add a function comment to btrfs_lookup_csums_list() to document it. With another upcoming change its parameter list and return value will be less obvious. So add the documentation now so that it can be updated where needed later. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: move btrfs_page_mkwrite() from inode.c into file.cFilipe Manana3-168/+166
btrfs_page_mkwrite() is a struct vm_operations_struct callback and we define that structure in file.c. Currently the function is in inode.c and has to be exported to be used in file.c, which makes no sense because it's not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into file.c. While at it do a few minor style changes: 1) Capitalize the first word of every comment and end each sentence with punctuation; 2) Avoid splitting some statements into two lines when everything fits in 85 characters or less. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove no longer used btrfs_clone_chunk_map()Filipe Manana2-16/+0
There are no more users of btrfs_clone_chunk_map(), the last one (and only one ever) was removed in commit 1ec17ef59168 ("btrfs: zoned: fix use-after-free in do_zone_finish()"). So remove btrfs_clone_chunk_map(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove list_empty() check at warn_about_uncommitted_trans()Filipe Manana1-3/+0
At warn_about_uncommitted_trans(), there's no need to check if the list is empty and return, because list_for_each_entry_safe() is safe to call for an empty list, it simply does nothing. So remove the check. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove pointless return value assignment at btrfs_finish_one_ordered()Filipe Manana1-1/+0
At btrfs_finish_one_ordered() it's pointless to assign 0 to the 'ret' variable because if it has a non-zero value (error), we have already jumped to the 'out' label. So remove that redundant assignment. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove not needed mod_start and mod_len from struct extent_mapFilipe Manana4-27/+3
The mod_start and mod_len fields of struct extent_map were introduced by commit 4e2f84e63dc1 ("Btrfs: improve fsync by filtering extents that we want") in order to avoid too low performance when fsyncing a file that keeps getting extent maps merge, because it resulted in each fsync logging again csum ranges that were already merged before. We don't need this anymore as extent maps in the list of modified extents are never merged with other extent maps and once we log an extent map we remove it from the list of modified extent maps, so it's never logged twice. So remove the mod_start and mod_len fields from struct extent_map and use instead the start and len fields when logging checksums in the fast fsync path. This also makes EXTENT_FLAG_FILLING unused so remove it as well. Running the reproducer from the commit mentioned before, with a larger number of extents and against a null block device, so that IO is fast and we can better see any impact from searching checksums items and logging them, gave the following results from dd: Before this change: 409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s After this change: 409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s So no changes in throughput. The test was done in a release kernel (non-debug, Debian's default kernel config) and its steps are the following: $ mkfs.btrfs -f /dev/nullb0 $ mount /dev/sdb /mnt $ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync $ umount /mnt This also reduces the size of struct extent_map from 128 bytes down to 112 bytes, so now we can have 36 extents maps per 4K page instead of 32. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: free PERTRANS at the end of cleanup_transaction()Boris Burkov1-4/+1
Some of the operations after the free might convert more PERTRANS metadata. Do the freeing as late as possible to eliminate a source of leaked PERTRANS metadata. This helps with the pass rate of generic/269 and generic/475. Reviewed-by: Qu Wenruo <qwu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: compression: migrate compression/decompression paths to foliosQu Wenruo6-255/+251
For both compression and decompression paths, we always require a "struct page **pages" and "unsigned long nr_pages", this involves quite some part of the btrfs compression paths: - All the compression entry points - compressed_bio structure This affects both compression and decompression. - async_extent structure Unfortunately with all those involved parts, there is no good way to split the conversion into smaller patches while still passing compiling. So do this in one big conversion in one go. Please note this is direct page->folio conversion, no change on the page sized folio requirement yet. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: introduce btrfs_alloc_folio_array()Qu Wenruo2-0/+33
The new helper will do the same thing as btrfs_alloc_page_array(), but with folios. One extra difference is, there is no extra helper for bulk allocation, thus it may not be as efficient as the page version. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: migrate insert_inline_extent() to folio interfacesQu Wenruo1-8/+10
Since insert_inline_extent() now only accepts a single page, it's much easier to convert it to use folio interfaces. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: make insert_inline_extent() accept one page directlyQu Wenruo1-22/+25
Since our inline extent cannot accept anything larger than a sector, there is really no need to pass all the compressed pages to insert_inline_extent(). And just in case, expand the ASSERT()s to make sure we only try inline with compressed size no larger than sectorsize. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: compression: convert page allocation to folio interfacesQu Wenruo6-24/+24
Currently we have two wrappers to allocate and free a page for compression usage: - btrfs_alloc_compr_page() - btrfs_free_compr_page() The allocator would try to grab a page from the pool, and only allocate a new page if the pool is empty. The reclaimer would check if the pool is full, and if not full it would put the page into the pool. This patch converts both helpers to use folio interfaces, and allowing further conversion of compression path to folios. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: compression: add error handling for missed page cacheQu Wenruo5-8/+46
For all the supported compression algorithms, the compression path would always need to grab the page cache, then do the compression. Normally we would get a page reference without any problem, since the write path should have already locked the pages in the write range. For the sake of error handling, we should handle the page cache miss case. Adds a common wrapper, btrfs_compress_find_get_page(), which calls find_get_page(), and do the error handling along with an error message. Callers inside compression path would only need to call btrfs_compress_find_get_page(), and error out if it returned any error. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: stop locking the source extent range during reflinkFilipe Manana2-38/+24
Nowadays before starting a reflink operation we do this: 1) Take the VFS lock of the inodes in exclusive mode (a rw semaphore); 2) Take the mmap lock of the inodes (struct btrfs_inode::i_mmap_lock); 3) Flush all delalloc in the source and target ranges; 4) Wait for all ordered extents in the source and target ranges to complete; 5) Lock the source and destination ranges in the inodes' io trees. In step 5 we lock the source range because: 1) We needed to serialize against mmap writes, but that is not needed anymore because nowadays we do that through the inode's i_mmap_lock (step 2). This happens since commit 8c99516a8cdd ("btrfs: exclude mmaps while doing remap"); 2) To serialize against a concurrent relocation and avoid generating a delayed ref for an extent that was just dropped by relocation, see commit d8b552424210 ("Btrfs: fix race between reflink/dedupe and relocation"). Locking the source range however blocks any concurrent reads for that range and makes test case generic/733 fail. So instead of locking the source range during reflinks, make relocation read lock the inode's i_mmap_lock, so that it serializes with a concurrent reflink while still able to run concurrently with mmap writes and allow concurrent reads too. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: qgroup: delete unnecessary check in btrfs_qgroup_check_inherit()Dan Carpenter1-3/+0
This check "if (inherit->num_qgroups > PAGE_SIZE)" is confusing and unnecessary. The problem with the check is that static checkers flag it as a potential mixup of between units of bytes vs number of elements. Fortunately, the check can safely be deleted because the next check is correct and applies an even stricter limit: if (size != struct_size(inherit, qgroups, inherit->num_qgroups)) return -EINVAL; The "inherit" struct ends in a variable array of __u64 and "inherit->num_qgroups" is the number of elements in the array. At the start of the function we check that: if (size < sizeof(*inherit) || size > PAGE_SIZE) return -EINVAL; Thus, since we verify that the whole struct fits within one page, that means that the number of elements in the inherit->qgroups[] array must be less than PAGE_SIZE. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: convert put_file_data() to foliosGoldwyn Rodrigues1-22/+23
Use folio instead of page in put_file_data(). Add a warning in case higher order folio is found, this will be implemented in the future. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: convert relocate_one_page() to folios and renameGoldwyn Rodrigues1-46/+47
Convert page references to folios and call the respective folio functions. Since find_or_create_page() takes a mask argument, call __filemap_get_folio() instead of filemap_grab_folio(). The patch assumes folio size is PAGE_SIZE, add a warning in case it's a higher order that will be implemented in the future. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: page to folio conversion: prealloc_file_extent_cluster()Goldwyn Rodrigues1-6/+6
Convert usage of page to folio in prealloc_file_extent_cluster() Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in btrfs_direct_write()Anand Jain1-24/+24
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in prepare_pages()Anand Jain1-12/+12
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in btrfs_dirty_pages()Anand Jain1-4/+4
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in create_reloc_inode()Anand Jain1-9/+9
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in __btrfs_end_transaction()Anand Jain1-4/+4
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in convert_extent_bit()Anand Jain1-15/+15
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in __set_extent_bit()Anand Jain1-14/+14
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in btrfs_ioctl_snap_destroy()Anand Jain1-33/+33
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in btrfs_cont_expand()Anand Jain1-13/+13
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in btrfs_rmdir()Anand Jain1-11/+11
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: rename err to ret in btrfs_initxattrs()Anand Jain1-5/+5
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: warn if EXTENT_BUFFER_UPTODATE is set while readingTavian Barnes1-0/+7
We recently tracked down a race condition that triggered a read for an extent buffer with EXTENT_BUFFER_UPTODATE already set. While this read was in progress, other concurrent readers would see the UPTODATE bit and return early as if the read was already complete, making accesses to the extent buffer conflict with the read operation that was overwriting it. Add a WARN_ON() to end_bbio_meta_read() for this situation to make similar races easier to spot in the future. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Tavian Barnes <tavianator@tavianator.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: add helper to clear EXTENT_BUFFER_READINGTavian Barnes1-6/+9
We are clearing the bit and waking up any waiters in two different places. Factor that code out into a static helper function. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Tavian Barnes <tavianator@tavianator.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: avoid pointless wake ups of drew lock readersFilipe Manana1-2/+6
When unlocking a write lock on a drew lock, at btrfs_drew_write_unlock(), it's pointless to wake up tasks waiting to acquire a read lock if we didn't decrement the 'writers' counter down to 0, since a read lock can only be acquired when the counter reaches a value of 0. Doing so is harmless from a functional point of view, but it's not efficient due to unnecessarily waking up tasks just for them to sleep again on the waitqueue. So change this to wake up readers only if we decremented the 'writers' counter to 0. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove pointless writepages callback wrapperFilipe Manana3-10/+2
There's no point in having a static writepages callback in inode.c that does nothing besides calling extent_writepages from extent_io.c. So just remove the callback at inode.c and rename extent_writepages() to btrfs_writepages(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove pointless readahead callback wrapperFilipe Manana3-7/+2
There's no point in having a static readahead callback in inode.c that does nothing besides calling extent_readahead() from extent_io.c. So just remove the callback at inode.c and rename extent_readahead() to btrfs_readahead(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: locking: rename __btrfs_tree_lock() and __btrfs_tree_read_lock()Filipe Manana4-14/+14
The __btrfs_tree_lock() and __btrfs_tree_read_lock() are using a naming with a double underscore prefix, which is specially not proper for exported functions. Remove the double underscore prefix from their name and add the "_nested" suffix. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: locking: inline btrfs_tree_lock() and btrfs_tree_read_lock()Filipe Manana2-12/+12
The functions btrfs_tree_lock() and btrfs_tree_read_lock() are very trivial so that can be made inline and avoid call overhead, as they are very often called inside critical sections (when searching a btree for example, attempting to lock a child node/leaf while holding a lock on the parent). So make them static inline, which even reduces the size of the btrfs module a little bit. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1718786 156276 16920 1891982 1cde8e fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1718650 156260 16920 1891830 1cddf6 fs/btrfs/btrfs.ko Running fs_mark also showed a tiny improvement with this script: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 FILES=100000 THREADS=$(nproc --all) echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount $DEV $MNT OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before this change: FSUse% Count Size Files/sec App Overhead 10 1200000 0 180894.0 10705410 16 2400000 0 228211.4 10765738 23 3600000 0 215969.6 11011072 30 4800000 0 199077.1 11145587 46 6000000 0 176624.1 11658470 After this change: FSUse% Count Size Files/sec App Overhead 10 1200000 0 185312.3 10708377 16 2400000 0 229320.4 10858013 23 3600000 0 217958.7 11006167 30 4800000 0 205122.9 11112899 46 6000000 0 178039.1 11438852 Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbtrfs: remove pointless BUG_ON() when creating snapshotFilipe Manana1-2/+0
When creating a snapshot we first check with btrfs_lookup_dir_item() if there is a name collision in the parent directory and then return an error if there's a collision. Then later on when trying to insert a dir item for the snapshot we BUG_ON() if the return value is -EEXIST or -EOVERFLOW: static noinline int create_pending_snapshot(...) { (...) /* check if there is a file/dir which has the same name. */ dir_item = btrfs_lookup_dir_item(...); (...) ret = btrfs_insert_dir_item(...); /* We have check then name at the beginning, so it is impossible. */ BUG_ON(ret == -EEXIST || ret == -EOVERFLOW); if (ret) { btrfs_abort_transaction(trans, ret); goto fail; } (...) } It's impossible to get the -EEXIST because we previously checked for a potential collision with btrfs_lookup_dir_item() and we know that after that no one could have added a colliding name because at this point the transaction is in its critical section, state TRANS_STATE_COMMIT_DOING, so no one can join this transaction to add a colliding name and neither can anyone start a new transaction to do that. As for the -EOVERFLOW, that can't happen as long as we have the extended references feature enabled, which is a mkfs default for many years now. In either case, the BUG_ON() is excessive as we can properly deal with any error and can abort the transaction and jump to the 'fail' label, in which case we'll also get the useful stack trace (just like a BUG_ON()) from the abort if the error is either -EEXIST or -EOVERFLOW. So remove the BUG_ON(). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
10 daysbcachefs: Add missing sched_annotate_sleep() in bch2_journal_flush_seq_async()Kent Overstreet1-0/+6
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 daysbcachefs: Fix race in bch2_write_super()Kent Overstreet1-15/+32
bch2_write_super() was looping over online devices multiple times - dropping and retaking io_ref each time. This meant it could race with device removal; it could increment the sequence number on a device but fail to write it - and then if the device was re-added, it would get confused the next time around thinking a superblock write was silently dropped. Fix this by taking io_ref once, and stashing pointers to online devices in a darray. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysgfs2: make timeout values more explicitWolfram Sang1-3/+2
'timeout' is a vague name for the return value of wait_event_*_timeout because it actually returns the time left. Because the variable is never used later, just drop the return value. Since variable 'timeout' is then only used to carry a fixed timeout value, drop this in favor of a fixed function argument as in the other call to wait_event_timeout() above. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
11 daysMerge tag 'for-6.9-rc7-tag' of ↵Linus Torvalds3-15/+18
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two more fixes, both have some visible effects on user space: - add check if quotas are enabled when passing qgroup inheritance info, this affects snapper that could fail to create a snapshot - do check for leaf/node flag WRITTEN earlier so that nodes are completely validated before access, this used to be done by integrity checker but it's been removed and left an unhandled case" * tag 'for-6.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: make sure that WRITTEN is set on all metadata blocks btrfs: qgroup: do not check qgroup inherit if qgroup is disabled
11 daysbcachefs: BCH_SB_LAYOUT_SIZE_BITS_MAXKent Overstreet2-1/+3
Define a constant for the max superblock size, to avoid a too-large shift. Reported-by: syzbot+a8b0fb419355c91dda7f@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Add missing skcipher_request_set_callback() callKent Overstreet1-0/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix snapshot_t() usage in bch2_fs_quota_read_inode()Kent Overstreet1-5/+3
bch2_fs_quota_read_inode() wasn't entirely updated to the bch2_snapshot_tree() helper, which takes rcu lock. Reported-by: syzbot+a3a9a61224ed3b7f0010@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix shift-by-64 in bformat_needs_redo()Kent Overstreet1-8/+14
Ancient versions of bcachefs produced packed formats that could represent keys that our in memory format cannot represent; bformat_needs_redo() has some tricky shifts to check for this sort of overflow. Reported-by: syzbot+594427aebfefeebe91c6@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Guard against unknown k.k->type in __bkey_invalid()Kent Overstreet1-2/+2
For forwards compatibility we have to allow unknown key types, and only run the checks that make sense against them. Fix a missing guard on k.k->type being known. Reported-by: syzbot+ae4dc916da3ce51f284f@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Add missing validation for superblock section cleanKent Overstreet1-0/+14
We were forgetting to check for jset entries that overrun the end of the section - both in validate and to_text(); to_text() needs to be safe for types that fail to validate. Reported-by: syzbot+c48865e11e7e893ec4ab@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix assert in bch2_alloc_v4_invalid()Kent Overstreet2-4/+8
Reported-by: syzbot+10827fa6b176e1acf1d0@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: fix overflow in fiemapReed Riley1-1/+1
filefrag (and potentially other utilities that call fiemap) sometimes pass ULONG_MAX as the length. fiemap_prep clamps excessively large lengths - but the calculation of end can overflow if it occurs before calling fiemap_prep. When this happens, filefrag assumes it has read to the end and exits. Signed-off-by: Reed Riley <reed@riley.engineer> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Add a better limit for maximum number of bucketsKent Overstreet4-3/+17
The bucket_gens array is a single array allocation (one byte per bucket), and kernel allocations are still limited to INT_MAX. Check this limit to avoid failing the bucket_gens array allocation. Reported-by: syzbot+b29f436493184ea42e2b@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix lifetime issue in device iterator helpersKent Overstreet1-2/+2
bch2_get_next_dev() and bch2_get_next_online_dev() iterate over devices, dropping and taking refs as they go; we can't access the previous device (for ca->dev_idx) after we've dropped our ref to it, unless we take rcu_read_lock() first. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix bch2_dev_lookup() refcountingKent Overstreet1-6/+2
bch2_dev_lookup() is supposed to take a ref on the device it returns, but for_each_member_device() takes refs as it iterates, for_each_member_device_rcu() does not. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Initialize bch_write_op->failed in inline data pathKent Overstreet1-0/+2
Normally this is initialized in __bch2_write(), which is executed in a loop, but the inline data path skips this. Reported-by: syzbot+fd3ccb331eb21f05d13b@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix refcount put in sb_field_resize error pathKent Overstreet1-1/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Inodes need extra padding for varint_decode_fast()Kent Overstreet1-10/+18
Reported-by: syzbot+66b9b74f6520068596a9@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix early error path in bch2_fs_btree_key_cache_exit()Kent Overstreet1-7/+9
Reported-by: syzbot+a35cdb62ec34d44fb062@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: bucket_pos_to_bp_noerror()Kent Overstreet2-5/+11
We don't want the assert when we're checking if the backpointer is valid. Reported-by: syzbot+bf7215c0525098e7747a@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: don't free error pointersKent Overstreet1-1/+2
Reported-by: syzbot+3333603f569fc2ef258c@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix a scheduler splat in __bch2_next_write_buffer_flush_journal_buf()Kent Overstreet1-0/+2
We're using mutex_lock() inside a wait_event() conditional - prepare_to_wait() has already flipped task state, so potentially blocking ops need annotation. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysorangefs: fix out-of-bounds fsid accessMike Marshall1-1/+2
Arnd Bergmann sent a patch to fsdevel, he says: "orangefs_statfs() copies two consecutive fields of the superblock into the statfs structure, which triggers a warning from the string fortification helpers" Jan Kara suggested an alternate way to do the patch to make it more readable. I ran both ideas through xfstests and both seem fine. This patch is based on Jan Kara's suggestion. Signed-off-by: Mike Marshall <hubcap@omnibond.com>
12 daysfs/proc/task_mmu: fix uffd-wp confusion in pagemap_scan_pmd_entry()Ryan Roberts1-9/+13
pagemap_scan_pmd_entry() checks if uffd-wp is set on each pte to avoid unnecessary if set. However it was previously checking with `pte_uffd_wp(ptep_get(pte))` without first confirming that the pte was present. It is only valid to call pte_uffd_wp() for present ptes. For swap ptes, pte_swp_uffd_wp() must be called because the uffd-wp bit may be kept in a different position, depending on the arch. This was leading to test failures in the pagemap_ioctl mm selftest, when bringing up uffd-wp support on arm64 due to incorrectly interpretting the uffd-wp status of migration entries. Let's fix this by using the correct check based on pte_present(). While we are at it, let's pass the pte to make_uffd_wp_pte() to avoid the pointless extra ptep_get() which can't be optimized out due to READ_ONCE() on many arches. Link: https://lkml.kernel.org/r/20240429114104.182890-1-ryan.roberts@arm.com Fixes: 12f6b01a0bcb ("fs/proc/task_mmu: add fast paths to get/clear PAGE_IS_WRITTEN flag") Closes: https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 daysfs/proc/task_mmu: fix loss of young/dirty bits during pagemap scanRyan Roberts1-1/+1
make_uffd_wp_pte() was previously doing: pte = ptep_get(ptep); ptep_modify_prot_start(ptep); pte = pte_mkuffd_wp(pte); ptep_modify_prot_commit(ptep, pte); But if another thread accessed or dirtied the pte between the first 2 calls, this could lead to loss of that information. Since ptep_modify_prot_start() gets and clears atomically, the following is the correct pattern and prevents any possible race. Any access after the first call would see an invalid pte and cause a fault: pte = ptep_modify_prot_start(ptep); pte = pte_mkuffd_wp(pte); ptep_modify_prot_commit(ptep, pte); Link: https://lkml.kernel.org/r/20240429114017.182570-1-ryan.roberts@arm.com Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 daysmm/userfaultfd: reset ptes when close() for wr-protected onesPeter Xu1-0/+4
Userfaultfd unregister includes a step to remove wr-protect bits from all the relevant pgtable entries, but that only covered an explicit UFFDIO_UNREGISTER ioctl, not a close() on the userfaultfd itself. Cover that too. This fixes a WARN trace. The only user visible side effect is the user can observe leftover wr-protect bits even if the user close()ed on an userfaultfd when releasing the last reference of it. However hopefully that should be harmless, and nothing bad should happen even if so. This change is now more important after the recent page-table-check patch we merged in mm-unstable (446dd9ad37d0 ("mm/page_table_check: support userfault wr-protect entries")), as we'll do sanity check on uffd-wp bits without vma context. So it's better if we can 100% guarantee no uffd-wp bit leftovers, to make sure each report will be valid. Link: https://lore.kernel.org/all/000000000000ca4df20616a0fe16@google.com/ Fixes: f369b07c8614 ("mm/uffd: reset write protection when unregister with wp-mode") Analyzed-by: David Hildenbrand <david@redhat.com> Link: https://lkml.kernel.org/r/20240422133311.2987675-1-peterx@redhat.com Reported-by: syzbot+d8426b591c36b21c750e@syzkaller.appspotmail.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
12 daysepoll: be better about file lifetimesLinus Torvalds1-1/+37
epoll can call out to vfs_poll() with a file pointer that may race with the last 'fput()'. That would make f_count go down to zero, and while the ep->mtx locking means that the resulting file pointer tear-down will be blocked until the poll returns, it means that f_count is already dead, and any use of it won't actually get a reference to the file any more: it's dead regardless. Make sure we have a valid ref on the file pointer before we call down to vfs_poll() from the epoll routines. Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/ Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 daysMerge tag 'trace-v6.9-rc6-2' of ↵Linus Torvalds3-59/+195
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing and tracefs fixes from Steven Rostedt: - Fix RCU callback of freeing an eventfs_inode. The freeing of the eventfs_inode from the kref going to zero freed the contents of the eventfs_inode and then used kfree_rcu() to free the inode itself. But the contents should also be protected by RCU. Switch to a call_rcu() that calls a function to free all of the eventfs_inode after the RCU synchronization. - The tracing subsystem maps its own descriptor to a file represented by eventfs. The freeing of this descriptor needs to know when the last reference of an eventfs_inode is released, but currently there is no interface for that. Add a "release" callback to the eventfs_inode entry array that allows for freeing of data that can be referenced by the eventfs_inode being opened. Then increment the ref counter for this descriptor when the eventfs_inode file is created, and decrement/free it when the last reference to the eventfs_inode is released and the file is removed. This prevents races between freeing the descriptor and the opening of the eventfs file. - Fix the permission processing of eventfs. The change to make the permissions of eventfs default to the mount point but keep track of when changes were made had a side effect that could cause security concerns. When the tracefs is remounted with a given gid or uid, all the files within it should inherit that gid or uid. But if the admin had changed the permission of some file within the tracefs file system, it would not get updated by the remount. This caused the kselftest of file permissions to fail the second time it is run. The first time, all changes would look fine, but the second time, because the changes were "saved", the remount did not reset them. Create a link list of all existing tracefs inodes, and clear the saved flags on them on a remount if the remount changes the corresponding gid or uid fields. This also simplifies the code by removing the distinction between the toplevel eventfs and an instance eventfs. They should both act the same. They were different because of a misconception due to the remount not resetting the flags. Now that remount resets all the files and directories to default to the root node if a uid/gid is specified, it makes the logic simpler to implement. * tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Have "events" directory get permissions from its parent eventfs: Do not treat events directory different than other directories eventfs: Do not differentiate the toplevel events directory tracefs: Still use mount point as default permissions for instances tracefs: Reset permissions on remount if permissions are options eventfs: Free all of the eventfs_inode after RCU eventfs/tracing: Add callback for release of an eventfs_inode
13 daysksmbd: do not grant v2 lease if parent lease key and epoch are not setNamjae Jeon1-5/+9
This patch fix xfstests generic/070 test with smb2 leases = yes. cifs.ko doesn't set parent lease key and epoch in create context v2 lease. ksmbd suppose that parent lease and epoch are vaild if data length is v2 lease context size and handle directory lease using this values. ksmbd should hanle it as v1 lease not v2 lease if parent lease key and epoch are not set in create context v2 lease. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
13 daysksmbd: use rwsem instead of rwlock for lease breakNamjae Jeon5-38/+30
lease break wait for lease break acknowledgment. rwsem is more suitable than unlock while traversing the list for parent lease break in ->m_op_list. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
13 daysksmbd: avoid to send duplicate lease break notificationsNamjae Jeon1-6/+15
This patch fixes generic/011 when enable smb2 leases. if ksmbd sends multiple notifications for a file, cifs increments the reference count of the file but it does not decrement the count by the failure of queue_work. So even if the file is closed, cifs does not send a SMB2_CLOSE request. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
13 daysksmbd: off ipv6only for both ipv4/ipv6 bindingNamjae Jeon1-0/+4
ΕΛΕΝΗ reported that ksmbd binds to the IPV6 wildcard (::) by default for ipv4 and ipv6 binding. So IPV4 connections are successful only when the Linux system parameter bindv6only is set to 0 [default value]. If this parameter is set to 1, then the ipv6 wildcard only represents any IPV6 address. Samba creates different sockets for ipv4 and ipv6 by default. This patch off sk_ipv6only to support IPV4/IPV6 connections without creating two sockets. Cc: stable@vger.kernel.org Reported-by: ΕΛΕΝΗ ΤΖΑΒΕΛΛΑ <helentzavellas@yahoo.gr> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
14 dayseventfs: Have "events" directory get permissions from its parentSteven Rostedt (Google)1-6/+24
The events directory gets its permissions from the root inode. But this can cause an inconsistency if the instances directory changes its permissions, as the permissions of the created directories under it should inherit the permissions of the instances directory when directories under it are created. Currently the behavior is: # cd /sys/kernel/tracing # chgrp 1002 instances # mkdir instances/foo # ls -l instances/foo [..] -r--r----- 1 root lkp 0 May 1 18:55 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 18:55 current_tracer -rw-r----- 1 root lkp 0 May 1 18:55 error_log drwxr-xr-x 1 root root 0 May 1 18:55 events --w------- 1 root lkp 0 May 1 18:55 free_buffer drwxr-x--- 2 root lkp 0 May 1 18:55 options drwxr-x--- 10 root lkp 0 May 1 18:55 per_cpu -rw-r----- 1 root lkp 0 May 1 18:55 set_event All the files and directories under "foo" has the "lkp" group except the "events" directory. That's because its getting its default value from the mount point instead of its parent. Have the "events" directory make its default value based on its parent's permissions. That now gives: # ls -l instances/foo [..] -rw-r----- 1 root lkp 0 May 1 21:16 buffer_subbuf_size_kb -r--r----- 1 root lkp 0 May 1 21:16 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:16 current_tracer -rw-r----- 1 root lkp 0 May 1 21:16 error_log drwxr-xr-x 1 root lkp 0 May 1 21:16 events --w------- 1 root lkp 0 May 1 21:16 free_buffer drwxr-x--- 2 root lkp 0 May 1 21:16 options drwxr-x--- 10 root lkp 0 May 1 21:16 per_cpu -rw-r----- 1 root lkp 0 May 1 21:16 set_event Link: https://lore.kernel.org/linux-trace-kernel/20240502200906.161887248@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
14 dayseventfs: Do not treat events directory different than other directoriesSteven Rostedt (Google)1-15/+1
Treat the events directory the same as other directories when it comes to permissions. The events directory was considered different because it's dentry is persistent, whereas the other directory dentries are created when accessed. But the way tracefs now does its ownership by using the root dentry's permissions as the default permissions, the events directory can get out of sync when a remount is performed setting the group and user permissions. Remove the special case for the events directory on setting the attributes. This allows the updates caused by remount to work properly as well as simplifies the code. Link: https://lore.kernel.org/linux-trace-kernel/20240502200906.002923579@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
14 dayseventfs: Do not differentiate the toplevel events directorySteven Rostedt (Google)2-25/+11
The toplevel events directory is really no different than the events directory of instances. Having the two be different caused inconsistencies and made it harder to fix the permissions bugs. Make all events directories act the same. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.846448710@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
14 daystracefs: Still use mount point as default permissions for instancesSteven Rostedt (Google)1-2/+25
If the instances directory's permissions were never change, then have it and its children use the mount point permissions as the default. Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted and changes the permissions, the instance directory and its children should use the new permission. But because both the instance directory and its children use the instance directory's inode for permissions, it misses the update. To demonstrate this: # cd /sys/kernel/tracing/ # mkdir instances/foo # ls -ld instances/foo drwxr-x--- 5 root root 0 May 1 19:07 instances/foo # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld current_tracer -rw-r----- 1 root root 0 May 1 18:57 current_tracer # mount -o remount,gid=1002 . # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/ # ls -ld current_tracer -rw-r----- 1 root lkp 0 May 1 18:57 current_tracer Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been: # ls -ld current_tracer -rw-r----- 1 root root 0 May 1 19:19 current_tracer # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/ # ls -ld instances drwxr-x--- 3 root root 0 May 1 19:19 instances # mount -o remount,gid=1002 . # ls -ld current_tracer -rw-r----- 1 root lkp 0 May 1 19:19 current_tracer # ls -ld instances drwxr-x--- 3 root lkp 0 May 1 19:19 instances # ls -ld instances/foo/ drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/ Where all files were updated by the remount gid update. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.686838327@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
14 daystracefs: Reset permissions on remount if permissions are optionsSteven Rostedt (Google)3-2/+99
There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs). But if a remount happens that specify the permissions, all the files that were not changed by the user gets updated, but the ones that were are not. If the user were to remount the file system with a given permission, then all files and directories within that file system should be updated. This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example: # cd /sys/kernel/tracing # chgrp 1002 current_tracer # ls -l [..] -rw-r----- 1 root root 0 May 1 21:25 buffer_size_kb -rw-r----- 1 root root 0 May 1 21:25 buffer_subbuf_size_kb -r--r----- 1 root root 0 May 1 21:25 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:25 current_tracer -rw-r----- 1 root root 0 May 1 21:25 dynamic_events -r--r----- 1 root root 0 May 1 21:25 dyn_ftrace_total_info -r--r----- 1 root root 0 May 1 21:25 enabled_functions Where current_tracer now has group "lkp". # mount -o remount,gid=1001 . # ls -l -rw-r----- 1 root tracing 0 May 1 21:25 buffer_size_kb -rw-r----- 1 root tracing 0 May 1 21:25 buffer_subbuf_size_kb -r--r----- 1 root tracing 0 May 1 21:25 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:25 current_tracer -rw-r----- 1 root tracing 0 May 1 21:25 dynamic_events -r--r----- 1 root tracing 0 May 1 21:25 dyn_ftrace_total_info -r--r----- 1 root tracing 0 May 1 21:25 enabled_functions Everything changed but the "current_tracer". Add a new link list that keeps track of all the tracefs_inodes which has the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.529542160@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
14 dayseventfs: Free all of the eventfs_inode after RCUSteven Rostedt (Google)1-9/+16
The freeing of eventfs_inode via a kfree_rcu() callback. But the content of the eventfs_inode was being freed after the last kref. This is dangerous, as changes are being made that can access the content of an eventfs_inode from an RCU loop. Instead of using kfree_rcu() use call_rcu() that calls a function to do all the freeing of the eventfs_inode after a RCU grace period has expired. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.370261163@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 43aa6f97c2d03 ("eventfs: Get rid of dentry pointers without refcounts") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>