aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/cgroup
AgeCommit message (Collapse)AuthorFilesLines
5 dayscgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepointsJesper Dangaard Brouer1-12/+58
This closely resembles helpers added for the global cgroup_rstat_lock in commit fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock. Based on production workloads, we observe the fast-path "update" function cgroup_rstat_updated() is invoked around 3 million times per sec, while the "flush" function cgroup_rstat_flush_locked(), walking each possible CPU, can see periodic spikes of 700 invocations/sec. For this reason, the tracepoints are split into normal and fastpath versions for this per-CPU lock. Making it feasible for production to continuously monitor the non-fastpath tracepoint to detect lock contention issues. The reason for monitoring is that lock disables IRQs which can disturb e.g. softirq processing on the local CPUs involved. When the global cgroup_rstat_lock stops disabling IRQs (e.g converted to a mutex), this per CPU lock becomes the next bottleneck that can introduce latency variations. A practical bpftrace script for monitoring contention latency: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_cpu_lock_contended { @start[tid]=nsecs; @cnt[probe]=count()} tracepoint:cgroup:cgroup_rstat_cpu_locked { if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);} @cnt[probe]=count()} interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt); clear(@cnt);}' Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-25cgroup/cpuset: Remove outdated comment in sched_partition_write()Xiu Jianfeng1-3/+0
The comment here is outdated and can cause confusion, from the code perspective, there’s also no need for new comment, so just remove it. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-23cgroup/cpuset: Fix incorrect top_cpuset flagsWaiman Long1-2/+2
Commit 8996f93fc388 ("cgroup/cpuset: Statically initialize more members of top_cpuset") uses an incorrect "<" relational operator for the CS_SCHED_LOAD_BALANCE bit when initializing the top_cpuset. This results in load_balancing turned off by default in the top cpuset which is bad for performance. Fix this by using the BIT() helper macro to set the desired top_cpuset flags and avoid similar mistake from being made in the future. Fixes: 8996f93fc388 ("cgroup/cpuset: Statically initialize more members of top_cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-23cgroup/cpuset: Avoid clearing CS_SCHED_LOAD_BALANCE twiceXiu Jianfeng1-5/+0
In cpuset_css_online(), CS_SCHED_LOAD_BALANCE will be cleared twice, the former one in the is_in_v2_mode() case could be removed because is_in_v2_mode() can be true for cgroup v1 if the "cpuset_v2_mode" mount option is specified, that balance flag change isn't appropriate for this particular case. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-22cgroup/cpuset: Statically initialize more members of top_cpusetXiu Jianfeng1-3/+2
Initializing top_cpuset.relax_domain_level and setting CS_SCHED_LOAD_BALANCE to top_cpuset.flags in cpuset_init() could be completed at the time of top_cpuset definition by compiler. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-19cgroup: Avoid unnecessary looping in cgroup_no_v1()Xiu Jianfeng1-0/+1
No need to continue the for_each_subsys loop after the token matches the name of subsys and cgroup_no_v1_mask is set. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-18cgroup, legacy_freezer: update comment for freezer_css_offline()Xiu Jianfeng1-1/+1
After commit f5d39b020809 ("freezer,sched: Rewrite core freezer logic"), system_freezing_count was replaced by freezer_active. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-18cgroup: don't call cgroup1_pidlist_destroy_all() for v2Xiu Jianfeng1-1/+2
Currently cgroup1_pidlist_destroy_all() will be called when releasing cgroup even if the cgroup is on default hierarchy, however it doesn't make any sense for v2 to destroy pidlist of v1. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-17cgroup_freezer: update comment for freezer_css_online()Xiu Jianfeng1-2/+1
The freezer->lock was replaced by freezer_mutex in commit e5ced8ebb10c ("cgroup_freezer: replace freezer->lock with freezer_mutex"), so the comment here is out-of-date, update it. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-17cgroup/rstat: desc member cgrp in cgroup_rstat_flush_releaseJesper Dangaard Brouer1-0/+1
Recent change to cgroup_rstat_flush_release added a parameter cgrp, which is used by tracepoint to correlate with other tracepoints that also have this cgrp. The kernel test robot detected kernel doc was missing a description of this member. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404170821.HwZGISTY-lkp@intel.com/ Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-16cgroup/rstat: add cgroup_rstat_lock helpers and tracepointsJesper Dangaard Brouer1-8/+39
This commit enhances the ability to troubleshoot the global cgroup_rstat_lock by introducing wrapper helper functions for the lock along with associated tracepoints. Although global, the cgroup_rstat_lock helper APIs and tracepoints take arguments such as cgroup pointer and cpu_in_loop variable. This adjustment is made because flushing occurs per cgroup despite the lock being global. Hence, when troubleshooting, it's important to identify the relevant cgroup. The cpu_in_loop variable is necessary because the global lock may be released within the main flushing loop that traverses CPUs. In the tracepoints, the cpu_in_loop value is set to -1 when acquiring the main lock; otherwise, it denotes the CPU number processed last. The new feature in this patchset is detecting when lock is contended. The tracepoints are implemented with production in mind. For minimum overhead attach to cgroup:cgroup_rstat_lock_contended, which only gets activated when trylock detects lock is contended. A quick production check for issues could be done via this perf commands: perf record -g -e cgroup:cgroup_rstat_lock_contended Next natural question would be asking how long time do lock contenders wait for obtaining the lock. This can be answered by measuring the time between cgroup:cgroup_rstat_lock_contended and cgroup:cgroup_rstat_locked when args->contended is set. Like this bpftrace script: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs} tracepoint:cgroup:cgroup_rstat_locked { if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}} interval:s:1 {time("%H:%M:%S "); print(@wait_ns); }' Extending with time spend holding the lock will be more expensive as this also looks at all the non-contended cases. Like this bpftrace script: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs} tracepoint:cgroup:cgroup_rstat_locked { @locked[tid]=nsecs; if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}} tracepoint:cgroup:cgroup_rstat_unlock { @locked_ns=hist(nsecs-@locked[tid]); delete(@locked[tid]);} interval:s:1 {time("%H:%M:%S "); print(@wait_ns);print(@locked_ns); }' Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-16cgroup/pids: Remove superfluous zeroingMichal Koutný1-2/+0
Atomic counters are in kzalloc'd struct. They are zeroed already and atomic64_t does not need special initialization (cf kernel/trace/trace_clock.c:trace_counter). Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-08cgroup/cpuset: Make cpuset hotplug processing synchronousWaiman Long1-85/+56
Since commit 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()"), cpuset hotplug was done asynchronously via a work function. This is to avoid recursive locking of cgroup_mutex. Since then, the cgroup locking scheme has changed quite a bit. A cpuset_mutex was introduced to protect cpuset specific operations. The cpuset_mutex is then replaced by a cpuset_rwsem. With commit d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order"), cpu_hotplug_lock is acquired before cpuset_rwsem. Later on, cpuset_rwsem is reverted back to cpuset_mutex. All these locking changes allow the hotplug code to call into cpuset core directly. The following commits were also merged due to the asynchronous nature of cpuset hotplug processing. - commit b22afcdf04c9 ("cpu/hotplug: Cure the cpusets trainwreck") - commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs") - commit 28b89b9e6f7b ("cpuset: handle race between CPU hotplug and cpuset_hotplug_work") Clean up all these bandages by making cpuset hotplug processing synchronous again with the exception that the call to cgroup_transfer_tasks() to transfer tasks out of an empty cgroup v1 cpuset, if necessary, will still be done via a work function due to the existing cgroup_mutex -> cpu_hotplug_lock dependency. It is possible to reverse that dependency, but that will require updating a number of different cgroup controllers. This special hotplug code path should be rarely taken anyway. As all the cpuset states will be updated by the end of the hotplug operation, we can revert most the above commits except commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs") which is partially reverted. Also removing some cpus_read_lock trylock attempts in the cpuset partition code as they are no longer necessary since the cpu_hotplug_lock is now held for the whole duration of the cpuset hotplug code path. Signed-off-by: Waiman Long <longman@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-03-12Merge tag 'net-next-6.9' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Large effort by Eric to lower rtnl_lock pressure and remove locks: - Make commonly used parts of rtnetlink (address, route dumps etc) lockless, protected by RCU instead of rtnl_lock. - Add a netns exit callback which already holds rtnl_lock, allowing netns exit to take rtnl_lock once in the core instead of once for each driver / callback. - Remove locks / serialization in the socket diag interface. - Remove 6 calls to synchronize_rcu() while holding rtnl_lock. - Remove the dev_base_lock, depend on RCU where necessary. - Support busy polling on a per-epoll context basis. Poll length and budget parameters can be set independently of system defaults. - Introduce struct net_hotdata, to make sure read-mostly global config variables fit in as few cache lines as possible. - Add optional per-nexthop statistics to ease monitoring / debug of ECMP imbalance problems. - Support TCP_NOTSENT_LOWAT in MPTCP. - Ensure that IPv6 temporary addresses' preferred lifetimes are long enough, compared to other configured lifetimes, and at least 2 sec. - Support forwarding of ICMP Error messages in IPSec, per RFC 4301. - Add support for the independent control state machine for bonding per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control state machine. - Add "network ID" to MCTP socket APIs to support hosts with multiple disjoint MCTP networks. - Re-use the mono_delivery_time skbuff bit for packets which user space wants to be sent at a specified time. Maintain the timing information while traversing veth links, bridge etc. - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets. - Simplify many places iterating over netdevs by using an xarray instead of a hash table walk (hash table remains in place, for use on fastpaths). - Speed up scanning for expired routes by keeping a dedicated list. - Speed up "generic" XDP by trying harder to avoid large allocations. - Support attaching arbitrary metadata to netconsole messages. Things we sprinkled into general kernel code: - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena). - Rework selftest harness to enable the use of the full range of ksft exit code (pass, fail, skip, xfail, xpass). Netfilter: - Allow userspace to define a table that is exclusively owned by a daemon (via netlink socket aliveness) without auto-removing this table when the userspace program exits. Such table gets marked as orphaned and a restarting management daemon can re-attach/regain ownership. - Speed up element insertions to nftables' concatenated-ranges set type. Compact a few related data structures. BPF: - Add BPF token support for delegating a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. - Introduce bpf_arena which is sparse shared memory region between BPF program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and BPF programs. - Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. - Extend the BPF verifier to enable static subprog calls in spin lock critical sections. - Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type. - Add support for retrieval of cookies for perf/kprobe multi links. - Support arbitrary TCP SYN cookie generation / validation in the TC layer with BPF to allow creating SYN flood handling in BPF firewalls. - Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects. Wireless: - Add SPP (signaling and payload protected) AMSDU support. - Support wider bandwidth OFDMA, as required for EHT operation. Driver API: - Major overhaul of the Energy Efficient Ethernet internals to support new link modes (2.5GE, 5GE), share more code between drivers (especially those using phylib), and encourage more uniform behavior. Convert and clean up drivers. - Define an API for querying per netdev queue statistics from drivers. - IPSec: account in global stats for fully offloaded sessions. - Create a concept of Ethernet PHY Packages at the Device Tree level, to allow parameterizing the existing PHY package code. - Enable Rx hashing (RSS) on GTP protocol fields. Misc: - Improvements and refactoring all over networking selftests. - Create uniform module aliases for TC classifiers, actions, and packet schedulers to simplify creating modprobe policies. - Address all missing MODULE_DESCRIPTION() warnings in networking. - Extend the Netlink descriptions in YAML to cover message encapsulation or "Netlink polymorphism", where interpretation of nested attributes depends on link type, classifier type or some other "class type". Drivers: - Ethernet high-speed NICs: - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF. - Intel (100G, ice, idpf): - support E825-C devices - nVidia/Mellanox: - support devices with one port and multiple PCIe links - Broadcom (bnxt): - support n-tuple filters - support configuring the RSS key - Wangxun (ngbe/txgbe): - implement irq_domain for TXGBE's sub-interrupts - Pensando/AMD: - support XDP - optimize queue submission and wakeup handling (+17% bps) - optimize struct layout, saving 28% of memory on queues - Ethernet NICs embedded and virtual: - Google cloud vNIC: - refactor driver to perform memory allocations for new queue config before stopping and freeing the old queue memory - Synopsys (stmmac): - obey queueMaxSDU and implement counters required by 802.1Qbv - Renesas (ravb): - support packet checksum offload - suspend to RAM and runtime PM support - Ethernet switches: - nVidia/Mellanox: - support for nexthop group statistics - Microchip: - ksz8: implement PHY loopback - add support for KSZ8567, a 7-port 10/100Mbps switch - PTP: - New driver for RENESAS FemtoClock3 Wireless clock generator. - Support OCP PTP cards designed and built by Adva. - CAN: - Support recvmsg() flags for own, local and remote traffic on CAN BCM sockets. - Support for esd GmbH PCIe/402 CAN device family. - m_can: - Rx/Tx submission coalescing - wake on frame Rx - WiFi: - Intel (iwlwifi): - enable signaling and payload protected A-MSDUs - support wider-bandwidth OFDMA - support for new devices - bump FW API to 89 for AX devices; 90 for BZ/SC devices - MediaTek (mt76): - mt7915: newer ADIE version support - mt7925: radio temperature sensor support - Qualcomm (ath11k): - support 6 GHz station power modes: Low Power Indoor (LPI), Standard Power) SP and Very Low Power (VLP) - QCA6390 & WCN6855: support 2 concurrent station interfaces - QCA2066 support - Qualcomm (ath12k): - refactoring in preparation for Multi-Link Operation (MLO) support - 1024 Block Ack window size support - firmware-2.bin support - support having multiple identical PCI devices (firmware needs to have ATH12K_FW_FEATURE_MULTI_QRTR_ID) - QCN9274: support split-PHY devices - WCN7850: enable Power Save Mode in station mode - WCN7850: P2P support - RealTek: - rtw88: support for more rtw8811cu and rtw8821cu devices - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL - rtlwifi: speed up USB firmware initialization - rtwl8xxxu: - RTL8188F: concurrent interface support - Channel Switch Announcement (CSA) support in AP mode - Broadcom (brcmfmac): - per-vendor feature support - per-vendor SAE password setup - DMI nvram filename quirk for ACEPC W5 Pro" * tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits) nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y nexthop: Fix out-of-bounds access during attribute validation nexthop: Only parse NHA_OP_FLAGS for dump messages that require it nexthop: Only parse NHA_OP_FLAGS for get messages that require it bpf: move sleepable flag from bpf_prog_aux to bpf_prog bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes() selftests/bpf: Add kprobe multi triggering benchmarks ptp: Move from simple ida to xarray vxlan: Remove generic .ndo_get_stats64 vxlan: Do not alloc tstats manually devlink: Add comments to use netlink gen tool nfp: flower: handle acti_netdevs allocation failure net/packet: Add getsockopt support for PACKET_COPY_THRESH net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID selftests/bpf: Add bpf_arena_htab test. selftests/bpf: Add bpf_arena_list test. selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages bpf: Add helper macro bpf_addr_space_cast() libbpf: Recognize __arena global variables. bpftool: Recognize arena map type ...
2024-03-11Merge tag 'cgroup-for-6.9' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A quiet cycle. One trivial doc update patch. Two patches to drop the now defunct memory_spread_slab feature from cgroup1 cpuset" * tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Mark memory_spread_slab as obsolete cgroup/cpuset: Remove cpuset_do_slab_mem_spread() docs: cgroup-v1: add missing code-block tags
2024-03-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-4/+4
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/page_pool_user.c 0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors") 429679dcf7d9 ("page_pool: fix netlink dump stop/resume") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-29cgroup/cpuset: Fix retval in update_cpumask()Kamalesh Babulal1-1/+1
The update_cpumask(), checks for newly requested cpumask by calling validate_change(), which returns an error on passing an invalid set of cpu(s). Independent of the error returned, update_cpumask() always returns zero, suppressing the error and returning success to the user on writing an invalid cpu range for a cpuset. Fix it by returning retval instead, which is returned by validate_change(). Fixes: 99fe36ba6fc1 ("cgroup/cpuset: Improve temporary cpumasks handling") Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-29cgroup/cpuset: Mark memory_spread_slab as obsoleteXiongwei Song1-0/+1
We've removed the SLAB allocator, cpuset_do_slab_mem_spread() and SLAB_MEM_SPREAD, memory_spread_slab is a no-op now. We can mark memory_spread_slab as obsolete in case someone still wants to use it after cpuset_do_slab_mem_spread() removed. For more details, please check [1]. [1] https://lore.kernel.org/lkml/32bc1403-49da-445a-8c00-9686a3b0d6a3@redhat.com/T/#m8e292e21b00f95a4bb8086371fa7387fa4ea8f60 tj: Description and cosmetic updates. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-28cgroup/cpuset: Fix a memory leak in update_exclusive_cpumask()Waiman Long1-3/+3
Fix a possible memory leak in update_exclusive_cpumask() by moving the alloc_cpumasks() down after the validate_change() check which can fail and still before the temporary cpumasks are needed. Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") Reported-and-tested-by: Mirsad Todorovac <mirsad.todorovac@alu.hr> Closes: https://lore.kernel.org/lkml/14915689-27a3-4cd8-80d2-9c30d0c768b6@alu.unizg.hr Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v6.7+
2024-01-31bpf: treewide: Annotate BPF kfuncs in BTFDaniel Xu1-2/+2
This commit marks kfuncs as such inside the .BTF_ids section. The upshot of these annotations is that we'll be able to automatically generate kfunc prototypes for downstream users. The process is as follows: 1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs 2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for each function inside BTF_KFUNCS sets 3. At runtime, vmlinux or module BTF is made available in sysfs 4. At runtime, bpftool (or similar) can look at provided BTF and generate appropriate prototypes for functions with "bpf_kfunc" tag To ensure future kfunc are similarly tagged, we now also return error inside kfunc registration for untagged kfuncs. For vmlinux kfuncs, we also WARN(), as initcall machinery does not handle errors. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-18Merge tag 'driver-core-6.8-rc1' of ↵Linus Torvalds3-31/+8
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here are the set of driver core and kernfs changes for 6.8-rc1. Nothing major in here this release cycle, just lots of small cleanups and some tweaks on kernfs that in the very end, got reverted and will come back in a safer way next release cycle. Included in here are: - more driver core 'const' cleanups and fixes - fw_devlink=rpm is now the default behavior - kernfs tiny changes to remove some string functions - cpu handling in the driver core is updated to work better on many systems that add topologies and cpus after booting - other minor changes and cleanups All of the cpu handling patches have been acked by the respective maintainers and are coming in here in one series. Everything has been in linux-next for a while with no reported issues" * tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits) Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock" kernfs: convert kernfs_idr_lock to an irq safe raw spinlock class: fix use-after-free in class_register() PM: clk: make pm_clk_add_notifier() take a const pointer EDAC: constantify the struct bus_type usage kernfs: fix reference to renamed function driver core: device.h: fix Excess kernel-doc description warning driver core: class: fix Excess kernel-doc description warning driver core: mark remaining local bus_type variables as const driver core: container: make container_subsys const driver core: bus: constantify subsys_register() calls driver core: bus: make bus_sort_breadthfirst() take a const pointer kernfs: d_obtain_alias(NULL) will do the right thing... driver core: Better advertise dev_err_probe() kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy() kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy() kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy() initramfs: Expose retained initrd as sysfs file fs/kernfs/dir: obey S_ISGID kernel/cgroup: use kernfs_create_dir_ns() ...
2024-01-08Merge tag 'cgroup-for-6.8' of ↵Linus Torvalds5-144/+386
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Yafang Shao added task_get_cgroup1() helper to enable a similar BPF helper so that BPF progs can be more useful on cgroup1 hierarchies. While cgroup1 is mostly in maintenance mode, this addition is very small while having an outsized usefulness for users who are still on cgroup1. Yafang also optimized root cgroup list access by making it RCU protected in the process. - Waiman Long optimized rstat operation leading to substantially lower and more consistent lock hold time while flushing the hierarchical statistics. As the lock can be acquired briefly in various hot paths, this reduction has cascading benefits. - Waiman also improved the quality of isolation for cpuset's isolated partitions. CPUs which are allocated to isolated partitions are now excluded from running unbound work items and cpu_is_isolated() test which is used by vmstat and memcg to reduce interference now includes cpuset isolated CPUs. While it isn't there yet, the hope is eventually reaching parity with the isolation level provided by the `isolcpus` boot param but in a dynamic manner. * tag 'cgroup-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Move rcu_head up near the top of cgroup_root cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() check cgroup: Avoid false cacheline sharing of read mostly rstat_cpu cgroup/rstat: Optimize cgroup_rstat_updated_list() cgroup: Fix documentation for cpu.idle cgroup/cpuset: Expose cpuset.cpus.isolated workqueue: Move workqueue_set_unbound_cpumask() and its helpers inside CONFIG_SYSFS cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked() cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask cgroup/cpuset: Keep track of CPUs in isolated partitions selftests/cgroup: Minor code cleanup and reorganization of test_cpuset_prs.sh workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask selftests: cgroup: Fixes a typo in a comment cgroup: Add a new helper for cgroup1 hierarchy cgroup: Add annotation for holding namespace_sem in current_cgns_cgroup_from_root() cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show() cgroup: Make operations on the cgroup root_list RCU safe cgroup: Remove unnecessary list_empty()
2023-12-15kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()Kees Cook3-4/+4
One of the last remaining users of strlcpy() in the kernel is kernfs_path_from_node_locked(), which passes back the problematic "length we _would_ have copied" return value to indicate truncation. Convert the chain of all callers to use the negative return value (some of which already doing this explicitly). All callers were already also checking for negative return values, so the risk to missed checks looks very low. In this analysis, it was found that cgroup1_release_agent() actually didn't handle the "too large" condition, so this is technically also a bug fix. :) Here's the chain of callers, and resolution identifying each one as now handling the correct return value: kernfs_path_from_node_locked() kernfs_path_from_node() pr_cont_kernfs_path() returns void kernfs_path() sysfs_warn_dup() return value ignored cgroup_path() blkg_path() bfq_bic_update_cgroup() return value ignored TRACE_IOCG_PATH() return value ignored TRACE_CGROUP_PATH() return value ignored perf_event_cgroup() return value ignored task_group_path() return value ignored damon_sysfs_memcg_path_eq() return value ignored get_mm_memcg_path() return value ignored lru_gen_seq_show() return value ignored cgroup_path_from_kernfs_id() return value ignored cgroup_show_path() already converted "too large" error to negative value cgroup_path_ns_locked() cgroup_path_ns() bpf_iter_cgroup_show_fdinfo() return value ignored cgroup1_release_agent() wasn't checking "too large" error proc_cgroup_show() already converted "too large" to negative value Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Waiman Long <longman@redhat.com> Cc: <cgroups@vger.kernel.org> Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com> Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-15kernel/cgroup: use kernfs_create_dir_ns()Max Kellermann1-27/+4
By passing the fsugid to kernfs_create_dir_ns(), we don't need cgroup_kn_set_ugid() any longer. That function was added for exactly this purpose by commit 49957f8e2a43 ("cgroup: newly created dirs and files should be owned by the creator"). Eliminating this piece of duplicate code means we benefit from future improvements to kernfs_create_dir_ns(); for example, both are lacking S_ISGID support currently, which my next patch will add to kernfs_create_dir_ns(). It cannot (easily) be added to cgroup_kn_set_ugid() because we can't dereference struct kernfs_iattrs from there. -- v1 -> v2: 12-digit commit id Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231208093310.297233-1-max.kellermann@ionos.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-07Merge tag 'cgroup-for-6.7-rc4-fixes' of ↵Linus Torvalds1-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "Just one fix. Commit f5d39b020809 ("freezer,sched: Rewrite core freezer logic") changed how freezing state is recorded which made cgroup_freezing() disagree with the actual state of the task while thawing triggering a warning. Fix it by updating cgroup_freezing()" * tag 'cgroup-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup_freezer: cgroup_freezing: Check if not frozen
2023-12-06cgroup/cpuset: Include isolated cpuset CPUs in cpu_is_isolated() checkWaiman Long1-0/+11
Currently, the cpu_is_isolated() function checks only the statically isolated CPUs specified via the "isolcpus" and "nohz_full" kernel command line options. This function is used by vmstat and memcg to reduce interference with isolated CPUs by not doing stat flushing or scheduling works on those CPUs. Workloads running on isolated CPUs within isolated cpuset partitions should receive the same treatment to reduce unnecessary interference. This patch introduces a new cpuset_cpu_is_isolated() function to be called by cpu_is_isolated() so that the set of dynamically created cpuset isolated CPUs will be included in the check. Assuming that testing a bit in a cpumask is atomic, no synchronization primitive is currently used to synchronize access to the cpuset's isolated_cpus mask. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-12-01cgroup/rstat: Optimize cgroup_rstat_updated_list()Waiman Long1-62/+91
The current design of cgroup_rstat_cpu_pop_updated() is to traverse the updated tree in a way to pop out the leaf nodes first before their parents. This can cause traversal of multiple nodes before a leaf node can be found and popped out. IOW, a given node in the tree can be visited multiple times before the whole operation is done. So it is not very efficient and the code can be hard to read. With the introduction of cgroup_rstat_updated_list() to build a list of cgroups to be flushed first before any flushing operation is being done, we can optimize the way the updated tree nodes are being popped by pushing the parents first to the tail end of the list before their children. In this way, most updated tree nodes will be visited only once with the exception of the subtree root as we still need to go back to its parent and popped it out of its updated_children list. This also makes the code easier to read. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-28cgroup_freezer: cgroup_freezing: Check if not frozenTim Van Patten1-1/+7
__thaw_task() was recently updated to warn if the task being thawed was part of a freezer cgroup that is still currently freezing: void __thaw_task(struct task_struct *p) { ... if (WARN_ON_ONCE(freezing(p))) goto unlock; This has exposed a bug in cgroup1 freezing where when CGROUP_FROZEN is asserted, the CGROUP_FREEZING bits are not also cleared at the same time. Meaning, when a cgroup is marked FROZEN it continues to be marked FREEZING as well. This causes the WARNING to trigger, because cgroup_freezing() thinks the cgroup is still freezing. There are two ways to fix this: 1. Whenever FROZEN is set, clear FREEZING for the cgroup and all children cgroups. 2. Update cgroup_freezing() to also verify that FROZEN is not set. This patch implements option (2), since it's smaller and more straightforward. Signed-off-by: Tim Van Patten <timvp@google.com> Tested-by: Mark Hasemeyer <markhas@chromium.org> Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-28cgroup/cpuset: Expose cpuset.cpus.isolatedWaiman Long1-1/+1
The root-only cpuset.cpus.isolated control file shows the current set of isolated CPUs in isolated partitions. This control file is currently exposed only with the cgroup_debug boot command line option which also adds the ".__DEBUG__." prefix. This is actually a useful control file if users want to find out which CPUs are currently in an isolated state by the cpuset controller. Remove CFTYPE_DEBUG flag for this control file and make it available by default without any prefix. The test_cpuset_prs.sh test script and the cgroup-v2.rst documentation file are also updated accordingly. Minor code change is also made in test_cpuset_prs.sh to avoid false test failure when running on debug kernel. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-14sched: psi: fix unprivileged polling against cgroupsJohannes Weiner1-12/+0
519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") breaks unprivileged psi polling on cgroups. Historically, we had a privilege check for polling in the open() of a pressure file in /proc, but were erroneously missing it for the open() of cgroup pressure files. When unprivileged polling was introduced in d82caa273565 ("sched/psi: Allow unprivileged polling of N*2s period"), it needed to filter privileges depending on the exact polling parameters, and as such moved the CAP_SYS_RESOURCE check from the proc open() callback to psi_trigger_create(). Both the proc files as well as cgroup files go through this during write(). This implicitly added the missing check for privileges required for HT polling for cgroups. When 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") followed right after to remove further restrictions on the RT polling window, it incorrectly assumed the cgroup privilege check was still missing and added it to the cgroup open(), mirroring what we used to do for proc files in the past. As a result, unprivileged poll requests that would be supported now get rejected when opening the cgroup pressure file for writing. Remove the cgroup open() check. psi_trigger_create() handles it. Fixes: 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") Reported-by: Luca Boccassi <bluca@debian.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Luca Boccassi <bluca@debian.org> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: stable@vger.kernel.org # 6.5+ Link: https://lore.kernel.org/r/20231026164114.2488682-1-hannes@cmpxchg.org
2023-11-12cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()Waiman Long1-15/+28
When cgroup_rstat_updated() isn't being called concurrently with cgroup_rstat_flush_locked(), its run time is pretty short. When both are called concurrently, the cgroup_rstat_updated() run time can spike to a pretty high value due to high cpu_lock hold time in cgroup_rstat_flush_locked(). This can be problematic if the task calling cgroup_rstat_updated() is a realtime task running on an isolated CPU with a strict latency requirement. The cgroup_rstat_updated() call can happen when there is a page fault even though the task is running in user space most of the time. The percpu cpu_lock is used to protect the update tree - updated_next and updated_children. This protection is only needed when cgroup_rstat_cpu_pop_updated() is being called. The subsequent flushing operation which can take a much longer time does not need that protection as it is already protected by cgroup_rstat_lock. To reduce the cpu_lock hold time, we need to perform all the cgroup_rstat_cpu_pop_updated() calls up front with the lock released afterward before doing any flushing. This patch adds a new cgroup_rstat_updated_list() function to return a singly linked list of cgroups to be flushed. Some instrumentation code are added to measure the cpu_lock hold time right after lock acquisition to after releasing the lock. Parallel kernel build on a 2-socket x86-64 server is used as the benchmarking tool for measuring the lock hold time. The maximum cpu_lock hold time before and after the patch are 100us and 29us respectively. So the worst case time is reduced to about 30% of the original. However, there may be some OS or hardware noises like NMI or SMI in the test system that can worsen the worst case value. Those noises are usually tuned out in a real production environment to get a better result. OTOH, the lock hold time frequency distribution should give a better idea of the performance benefit of the patch. Below were the frequency distribution before and after the patch: Hold time Before patch After patch --------- ------------ ----------- 0-01 us 804,139 13,738,708 01-05 us 9,772,767 1,177,194 05-10 us 4,595,028 4,984 10-15 us 303,481 3,562 15-20 us 78,971 1,314 20-25 us 24,583 18 25-30 us 6,908 12 30-40 us 8,015 40-50 us 2,192 50-60 us 316 60-70 us 43 70-80 us 7 80-90 us 2 >90 us 3 Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-12cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumaskWaiman Long1-20/+96
To make CPUs in isolated cpuset partition closer in isolation to the boot time isolated CPUs specified in the "isolcpus" boot command line option, we need to take those CPUs out of the workqueue unbound cpumask so that work functions from the unbound workqueues won't run on those CPUs. Otherwise, they will interfere the user tasks running on those isolated CPUs. With the introduction of the workqueue_unbound_exclude_cpumask() helper function in an earlier commit, those isolated CPUs can now be taken out from the workqueue unbound cpumask. This patch also updates cgroup-v2.rst to mention that isolated CPUs will be excluded from unbound workqueue cpumask as well as updating test_cpuset_prs.sh to verify the correctness of the new *cpuset.cpus.isolated file, if available via cgroup_debug option. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-12cgroup/cpuset: Keep track of CPUs in isolated partitionsWaiman Long1-63/+127
Add a new internal isolated_cpus mask to keep track of the CPUs that are in isolated partitions. Expose that new cpumask as a new root-only control file ".cpuset.cpus.isolated". tj: Updated patch description to reflect dropping __DEBUG__ prefix. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-09Merge tag 'net-6.7-rc1' of ↵Linus Torvalds1-6/+3
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter and bpf. Current release - regressions: - sched: fix SKB_NOT_DROPPED_YET splat under debug config Current release - new code bugs: - tcp: - fix usec timestamps with TCP fastopen - fix possible out-of-bounds reads in tcp_hash_fail() - fix SYN option room calculation for TCP-AO - tcp_sigpool: fix some off by one bugs - bpf: fix compilation error without CGROUPS - ptp: - ptp_read() should not release queue - fix tsevqs corruption Previous releases - regressions: - llc: verify mac len before reading mac header Previous releases - always broken: - bpf: - fix check_stack_write_fixed_off() to correctly spill imm - fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END - check map->usercnt after timer->timer is assigned - dsa: lan9303: consequently nested-lock physical MDIO - dccp/tcp: call security_inet_conn_request() after setting IP addr - tg3: fix the TX ring stall due to incorrect full ring handling - phylink: initialize carrier state at creation - ice: fix direction of VF rules in switchdev mode Misc: - fill in a bunch of missing MODULE_DESCRIPTION()s, more to come" * tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits) net: ti: icss-iep: fix setting counter value ptp: fix corrupted list in ptp_open ptp: ptp_read should not release queue net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP net: kcm: fill in MODULE_DESCRIPTION() net/sched: act_ct: Always fill offloading tuple iifidx netfilter: nat: fix ipv6 nat redirect with mapped and scoped addresses netfilter: xt_recent: fix (increase) ipv6 literal buffer length ipvs: add missing module descriptions netfilter: nf_tables: remove catchall element in GC sync path netfilter: add missing module descriptions drivers/net/ppp: use standard array-copy-function net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN virtio/vsock: Fix uninit-value in virtio_transport_recv_pkt() r8169: respect userspace disabling IFF_MULTICAST selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg net: phylink: initialize carrier state at creation test/vsock: add dobule bind connect test test/vsock: refactor vsock_accept ...
2023-11-09cgroup: Add a new helper for cgroup1 hierarchyYafang Shao2-1/+34
A new helper is added for cgroup1 hierarchy: - task_get_cgroup1 Acquires the associated cgroup of a task within a specific cgroup1 hierarchy. The cgroup1 hierarchy is identified by its hierarchy ID. This helper function is added to facilitate the tracing of tasks within a particular container or cgroup dir in BPF programs. It's important to note that this helper is designed specifically for cgroup1 only. tj: Use irsqsave/restore as suggested by Hou Tao <houtao@huaweicloud.com>. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Hou Tao <houtao@huaweicloud.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-09cgroup: Add annotation for holding namespace_sem in ↵Yafang Shao1-0/+5
current_cgns_cgroup_from_root() When I initially examined the function current_cgns_cgroup_from_root(), I was perplexed by its lack of holding cgroup_mutex. However, after Michal explained the reason[0] to me, I realized that it already holds the namespace_sem. I believe this intricacy could also confuse others, so it would be advisable to include an annotation for clarification. After we replace the cgroup_mutex with RCU read lock, if current doesn't hold the namespace_sem, the root cgroup will be NULL. So let's add a WARN_ON_ONCE() for it. [0]. https://lore.kernel.org/bpf/afdnpo3jz2ic2ampud7swd6so5carkilts2mkygcaw67vbw6yh@5b5mncf7qyet Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Michal Koutny <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-09cgroup: Eliminate the need for cgroup_mutex in proc_cgroup_show()Yafang Shao1-5/+7
The cgroup root_list is already RCU-safe. Therefore, we can replace the cgroup_mutex with the RCU read lock in some particular paths. This change will be particularly beneficial for frequent operations, such as `cat /proc/self/cgroup`, in a cgroup1-based container environment. I did stress tests with this change, as outlined below (with CONFIG_PROVE_RCU_LIST enabled): - Continuously mounting and unmounting named cgroups in some tasks, for example: cgrp_name=$1 while true do mount -t cgroup -o none,name=$cgrp_name none /$cgrp_name umount /$cgrp_name done - Continuously triggering proc_cgroup_show() in some tasks concurrently, for example: while true; do cat /proc/self/cgroup > /dev/null; done They can ran successfully after implementing this change, with no RCU warnings in dmesg. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-09cgroup: Make operations on the cgroup root_list RCU safeYafang Shao2-8/+18
At present, when we perform operations on the cgroup root_list, we must hold the cgroup_mutex, which is a relatively heavyweight lock. In reality, we can make operations on this list RCU-safe, eliminating the need to hold the cgroup_mutex during traversal. Modifications to the list only occur in the cgroup root setup and destroy paths, which should be infrequent in a production environment. In contrast, traversal may occur frequently. Therefore, making it RCU-safe would be beneficial. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-09cgroup: Remove unnecessary list_empty()Yafang Shao1-4/+3
The root hasn't been removed from the root_list, so the list can't be NULL. However, if it had been removed, attempting to destroy it once more is not possible. Let's replace this with WARN_ON_ONCE() for clarity. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds1-1/+14
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-11-01bpf: Add __bpf_hook_{start,end} macrosDave Marchevsky1-6/+3
Not all uses of __diag_ignore_all(...) in BPF-related code in order to suppress warnings are wrapping kfunc definitions. Some "hook point" definitions - small functions meant to be used as attach points for fentry and similar BPF progs - need to suppress -Wmissing-declarations. We could use __bpf_kfunc_{start,end}_defs added in the previous patch in such cases, but this might be confusing to someone unfamiliar with BPF internals. Instead, this patch adds __bpf_hook_{start,end} macros, currently having the same effect as __bpf_kfunc_{start,end}_defs, then uses them to suppress warnings for two hook points in the kernel itself and some bpf_testmod hook points as well. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Cc: Yafang Shao <laoar.shao@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20231031215625.2343848-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-31Merge tag 'net-next-6.7' of ↵Linus Torvalds1-6/+12
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support usec resolution of TCP timestamps, enabled selectively by a route attribute. - Defer regular TCP ACK while processing socket backlog, try to send a cumulative ACK at the end. Increase single TCP flow performance on a 200Gbit NIC by 20% (100Gbit -> 120Gbit). - The Fair Queuing (FQ) packet scheduler: - add built-in 3 band prio / WRR scheduling - support bypass if the qdisc is mostly idle (5% speed up for TCP RR) - improve inactive flow reporting - optimize the layout of structures for better cache locality - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern replacement for the old MD5 option. - Add more retransmission timeout (RTO) related statistics to TCP_INFO. - Support sending fragmented skbs over vsock sockets. - Make sure we send SIGPIPE for vsock sockets if socket was shutdown(). - Add sysctl for ignoring lower limit on lifetime in Router Advertisement PIO, based on an in-progress IETF draft. - Add sysctl to control activation of TCP ping-pong mode. - Add sysctl to make connection timeout in MPTCP configurable. - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps limit the number of wakeups. - Support netlink GET for MDB (multicast forwarding), allowing user space to request a single MDB entry instead of dumping the entire table. - Support selective FDB flushing in the VXLAN tunnel driver. - Allow limiting learned FDB entries in bridges, prevent OOM attacks. - Allow controlling via configfs netconsole targets which were created via the kernel cmdline at boot, rather than via configfs at runtime. - Support multiple PTP timestamp event queue readers with different filters. - MCTP over I3C. BPF: - Add new veth-like netdevice where BPF program defines the logic of the xmit routine. It can operate in L3 and L2 mode. - Support exceptions - allow asserting conditions which should never be true but are hard for the verifier to infer. With some extra flexibility around handling of the exit / failure: https://lwn.net/Articles/938435/ - Add support for local per-cpu kptr, allow allocating and storing per-cpu objects in maps. Access to those objects operates on the value for the current CPU. This allows to deprecate local one-off implementations of per-CPU storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps. - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is for systemd to re-implement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services. - Enable open-coded task_vma iteration, after maple tree conversion made it hard to directly walk VMAs in tracing programs. - Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF. - Allow source address selection with bpf_*_fib_lookup(). - Add ability to pin BPF timer to the current CPU. - Prevent creation of infinite loops by combining tail calls and fentry/fexit programs. - Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs. - Inherit system settings for CPU security mitigations. - Add BPF v4 CPU instruction support for arm32 and s390x. Changes to common code: - overflow: add DEFINE_FLEX() for on-stack definition of structs with flexible array members. - Process doc update with more guidance for reviewers. Driver API: - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy mutex in most places and remove a lot of smaller locks. - Create a common DPLL configuration API. Allow configuring and querying state of PLL circuits used for clock syntonization, in network time distribution. - Unify fragmented and full page allocation APIs in page pool code. Let drivers be ignorant of PAGE_SIZE. - Rework PHY state machine to avoid races with calls to phy_stop(). - Notify DSA drivers of MAC address changes on user ports, improve correctness of offloads which depend on matching port MAC addresses. - Allow antenna control on injected WiFi frames. - Reduce the number of variants of napi_schedule(). - Simplify error handling when composing devlink health messages. Misc: - A lot of KCSAN data race "fixes", from Eric. - A lot of __counted_by() annotations, from Kees. - A lot of strncpy -> strscpy and printf format fixes. - Replace master/slave terminology with conduit/user in DSA drivers. - Handful of KUnit tests for netdev and WiFi core. Removed: - AppleTalk COPS. - AppleTalk ipddp. - TI AR7 CPMAC Ethernet driver. Drivers: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - add a driver for the Intel E2000 IPUs - make CRC/FCS stripping configurable - cross-timestamping for E823 devices - basic support for E830 devices - use aux-bus for managing client drivers - i40e: report firmware versions via devlink - nVidia/Mellanox: - support 4-port NICs - increase max number of channels to 256 - optimize / parallelize SF creation flow - Broadcom (bnxt): - enhance NIC temperature reporting - support PAM4 speeds and lane configuration - Marvell OcteonTX2: - PTP pulse-per-second output support - enable hardware timestamping for VFs - Solarflare/AMD: - conntrack NAT offload and offload for tunnels - Wangxun (ngbe/txgbe): - expose HW statistics - Pensando/AMD: - support PCI level reset - narrow down the condition under which skbs are linearized - Netronome/Corigine (nfp): - support CHACHA20-POLY1305 crypto in IPsec offload - Ethernet NICs embedded, slower, virtual: - Synopsys (stmmac): - add Loongson-1 SoC support - enable use of HW queues with no offload capabilities - enable PPS input support on all 5 channels - increase TX coalesce timer to 5ms - RealTek USB (r8152): improve efficiency of Rx by using GRO frags - xen: support SW packet timestamping - add drivers for implementations based on TI's PRUSS (AM64x EVM) - nVidia/Mellanox Ethernet datacenter switches: - avoid poor HW resource use on Spectrum-4 by better block selection for IPv6 multicast forwarding and ordering of blocks in ACL region - Ethernet embedded switches: - Microchip: - support configuring the drive strength for EMI compliance - ksz9477: partial ACL support - ksz9477: HSR offload - ksz9477: Wake on LAN - Realtek: - rtl8366rb: respect device tree config of the CPU port - Ethernet PHYs: - support Broadcom BCM5221 PHYs - TI dp83867: support hardware LED blinking - CAN: - add support for Linux-PHY based CAN transceivers - at91_can: clean up and use rx-offload helpers - WiFi: - MediaTek (mt76): - new sub-driver for mt7925 USB/PCIe devices - HW wireless <> Ethernet bridging in MT7988 chips - mt7603/mt7628 stability improvements - Qualcomm (ath12k): - WCN7850: - enable 320 MHz channels in 6 GHz band - hardware rfkill support - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to make scan faster - read board data variant name from SMBIOS - QCN9274: mesh support - RealTek (rtw89): - TDMA-based multi-channel concurrency (MCC) - Silicon Labs (wfx): - Remain-On-Channel (ROC) support - Bluetooth: - ISO: many improvements for broadcast support - mark BCM4378/BCM4387 as BROKEN_LE_CODED - add support for QCA2066 - btmtksdio: enable Bluetooth wakeup from suspend" * tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits) net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos() net: mana: Use xdp_set_features_flag instead of direct assignment vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size() iavf: delete the iavf client interface iavf: add a common function for undoing the interrupt scheme iavf: use unregister_netdev iavf: rely on netdev's own registered state iavf: fix the waiting time for initial reset iavf: in iavf_down, don't queue watchdog_task if comms failed iavf: simplify mutex_trylock+sleep loops iavf: fix comments about old bit locks doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name tools: ynl: introduce option to process unknown attributes or types ipvlan: properly track tx_errors netdevsim: Block until all devices are released nfp: using napi_build_skb() to replace build_skb() net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy" net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN net: dsa: microchip: Refactor switch shutdown routine for WoL preparation ...
2023-10-30Merge tag 'cgroup-for-6.7' of ↵Linus Torvalds2-335/+1001
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cpuset now supports remote partitions where CPUs can be reserved for exclusive use down the tree without requiring all the intermediate nodes to be partitions. This makes it easier to use partitions without modifying existing cgroup hierarchy. - cpuset partition configuration behavior improvement - cgroup_favordynmods= boot param added to allow setting the flag on boot on cgroup1 - Misc code and doc updates * tag 'cgroup-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs/cgroup: Add the list of threaded controllers to cgroup-v2.rst cgroup: use legacy_name for cgroup v1 disable info cgroup/cpuset: Cleanup signedness issue in cpu_exclusive_check() cgroup/cpuset: Enable invalid to valid local partition transition cgroup: add cgroup_favordynmods= command-line option cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition cgroup/cpuset: Documentation update for partition cgroup/cpuset: Check partition conflict with housekeeping setup cgroup/cpuset: Introduce remote partition cgroup/cpuset: Add cpuset.cpus.exclusive for v2 cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2 cgroup/cpuset: Fix load balance state in update_partition_sd_lb() cgroup: Avoid extra dereference in css_populate_dir() cgroup: Check for ret during cgroup1_base_files cft addition
2023-10-26Merge tag 'for-netdev' of ↵Jakub Kicinski1-6/+12
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-26 We've added 51 non-merge commits during the last 10 day(s) which contain a total of 75 files changed, 5037 insertions(+), 200 deletions(-). The main changes are: 1) Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF, from Chuyi Zhou. 2) Fix BPF verifier's iterator convergence logic to use exact states comparison for convergence checks, from Eduard Zingerman, Andrii Nakryiko and Alexei Starovoitov. 3) Add BPF programmable net device where bpf_mprog defines the logic of its xmit routine. It can operate in L3 and L2 mode, from Daniel Borkmann and Nikolay Aleksandrov. 4) Batch of fixes for BPF per-CPU kptr and re-enable unit_size checking for global per-CPU allocator, from Hou Tao. 5) Fix libbpf which eagerly assumed that SHT_GNU_verdef ELF section was going to be present whenever a binary has SHT_GNU_versym section, from Andrii Nakryiko. 6) Fix BPF ringbuf correctness to fold smp_mb__before_atomic() into atomic_set_release(), from Paul E. McKenney. 7) Add a warning if NAPI callback missed xdp_do_flush() under CONFIG_DEBUG_NET which helps checking if drivers were missing the former, from Sebastian Andrzej Siewior. 8) Fix missed RCU read-lock in bpf_task_under_cgroup() which was throwing a warning under sleepable programs, from Yafang Shao. 9) Avoid unnecessary -EBUSY from htab_lock_bucket by disabling IRQ before checking map_locked, from Song Liu. 10) Make BPF CI linked_list failure test more robust, from Kumar Kartikeya Dwivedi. 11) Enable samples/bpf to be built as PIE in Fedora, from Viktor Malik. 12) Fix xsk starving when multiple xsk sockets were associated with a single xsk_buff_pool, from Albert Huang. 13) Clarify the signed modulo implementation for the BPF ISA standardization document that it uses truncated division, from Dave Thaler. 14) Improve BPF verifier's JEQ/JNE branch taken logic to also consider signed bounds knowledge, from Andrii Nakryiko. 15) Add an option to XDP selftests to use multi-buffer AF_XDP xdp_hw_metadata and mark used XDP programs as capable to use frags, from Larysa Zaremba. 16) Fix bpftool's BTF dumper wrt printing a pointer value and another one to fix struct_ops dump in an array, from Manu Bretelle. * tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (51 commits) netkit: Remove explicit active/peer ptr initialization selftests/bpf: Fix selftests broken by mitigations=off samples/bpf: Allow building with custom bpftool samples/bpf: Fix passing LDFLAGS to libbpf samples/bpf: Allow building with custom CFLAGS/LDFLAGS bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free selftests/bpf: Add selftests for netkit selftests/bpf: Add netlink helper library bpftool: Extend net dump with netkit progs bpftool: Implement link show support for netkit libbpf: Add link-based API for netkit tools: Sync if_link uapi header netkit, bpf: Add bpf programmable net device bpf: Improve JEQ/JNE branch taken logic bpf: Fold smp_mb__before_atomic() into atomic_set_release() bpf: Fix unnecessary -EBUSY from htab_lock_bucket xsk: Avoid starving the xsk further down the list bpf: print full verifier states on infinite loop detection selftests/bpf: test if state loops are detected in a tricky case bpf: correct loop detection for iterators convergence ... ==================== Link: https://lore.kernel.org/r/20231026150509.2824-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19cgroup: Prepare for using css_task_iter_*() in BPFChuyi Zhou1-6/+12
This patch makes some preparations for using css_task_iter_*() in BPF Program. 1. Flags CSS_TASK_ITER_* are #define-s and it's not easy for bpf prog to use them. Convert them to enum so bpf prog can take them from vmlinux.h. 2. In the next patch we will add css_task_iter_*() in common kfuncs which is not safe. Since css_task_iter_*() does spin_unlock_irq() which might screw up irq flags depending on the context where bpf prog is running. So we should use irqsave/irqrestore here and the switching is harmless. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-2-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-18hugetlb: memcg: account hugetlb-backed memory in memory controllerNhat Pham1-1/+14
Currently, hugetlb memory usage is not acounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system. For instance, here is one of our usecases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic page, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab etc. fair. What we have had to resort to is to constantly poll hugetlb usage and readjust memory.max. Similar procedure is done to other memory limits (memory.low for e.g). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in memory limits correction, (for e.g when hugetlb usage changes within consecutive runs of the userspace agent), the system could be in an over/underprotected state. This patch rectifies this issue by charging the memcg when the hugetlb folio is utilized, and uncharging when the folio is freed (analogous to the hugetlb controller). Note that we do not charge when the folio is allocated to the hugetlb pool, because at this point it is not owned by any memcg. Some caveats to consider: * This feature is only available on cgroup v2. * There is no hugetlb pool management involved in the memory controller. As stated above, hugetlb folios are only charged towards the memory controller when it is used. Host overcommit management has to consider it when configuring hard limits. * Failure to charge towards the memcg results in SIGBUS. This could happen even if the hugetlb pool still has pages (but the cgroup limit is hit and reclaim attempt fails). * When this feature is enabled, hugetlb pages contribute to memory reclaim protection. low, min limits tuning must take into account hugetlb memory. * Hugetlb pages utilized while this option is not selected will not be tracked by the memory controller (even if cgroup v2 is remounted later on). Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-09cgroup: use legacy_name for cgroup v1 disable infoKamalesh Babulal1-1/+1
cgroup v1 or v2 or both controller names can be passed as arguments to the 'cgroup_no_v1' kernel parameter, though most of the controller's names are the same for both cgroup versions. This can be confusing when both versions are used interchangeably, i.e., passing cgroup_no_v1=io $ sudo dmesg |grep cgroup ... cgroup: Disabling io control group subsystem in v1 mounts cgroup: Disabled controller 'blkio' Make it consistent across the pr_info()'s, by using ss->legacy_name, as the subsystem name, while printing the cgroup v1 controller disabling information in cgroup_init(). Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-10-09cgroup: Remove duplicates in cgroup v1 tasks fileMichal Koutný1-3/+2
One PID may appear multiple times in a preloaded pidlist. (Possibly due to PID recycling but we have reports of the same task_struct appearing with different PIDs, thus possibly involving transfer of PID via de_thread().) Because v1 seq_file iterator uses PIDs as position, it leads to a message: > seq_file: buggy .next function kernfs_seq_next did not update position index Conservative and quick fix consists of removing duplicates from `tasks` file (as opposed to removing pidlists altogether). It doesn't affect correctness (it's sufficient to show a PID once), performance impact would be hidden by unconditional sorting of the pidlist already in place (asymptotically). Link: https://lore.kernel.org/r/20230823174804.23632-1-mkoutny@suse.com/ Suggested-by: Firo Yang <firo.yang@suse.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2023-10-04cgroup/cpuset: Cleanup signedness issue in cpu_exclusive_check()Harshit Mogalapalli1-7/+7
Smatch complains about returning negative error codes from a type bool function. kernel/cgroup/cpuset.c:705 cpu_exclusive_check() warn: signedness bug returning '(-22)' The code works correctly, but it is confusing. The current behavior is that cpu_exclusive_check() returns true if it's *NOT* exclusive. Rename it to cpusets_are_exclusive() and reverse the returns so it returns true if it is exclusive and false if it's not. Update both callers as well. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/r/202309201706.2LhKdM6o-lkp@intel.com/ Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-10-04cgroup/cpuset: Enable invalid to valid local partition transitionWaiman Long1-31/+48
When a local partition becomes invalid, it won't transition back to valid partition automatically if a proper "cpuset.cpus.exclusive" or "cpuset.cpus" change is made. Instead, system administrators have to explicitly echo "root" or "isolated" into the "cpuset.cpus.partition" file at the partition root. This patch now enables the automatic transition of an invalid local partition back to valid when there is a proper "cpuset.cpus.exclusive" or "cpuset.cpus" change. Automatic transition of an invalid remote partition to a valid one, however, is not covered by this patch. They still need an explicit write to "cpuset.cpus.partition" to become valid again. The test_cpuset_prs.sh test script is updated to add new test cases to test this automatic state transition. Reported-by: Pierre Gondois <pierre.gondois@arm.com> Link: https://lore.kernel.org/lkml/9777f0d2-2fdf-41cb-bd01-19c52939ef42@arm.com Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-10-04cgroup: add cgroup_favordynmods= command-line optionLuiz Capitulino1-4/+14
We have a need of using favordynmods with cgroup v1, which doesn't support changing mount flags during remount. Enabling CONFIG_CGROUP_FAVOR_DYNMODS at build-time is not an option because we want to be able to selectively enable it for certain systems. This commit addresses this by introducing the cgroup_favordynmods= command-line option. This option works for both cgroup v1 and v2 and also allows for disabling favorynmods when the kernel built with CONFIG_CGROUP_FAVOR_DYNMODS=y. Also, note that when cgroup_favordynmods=true favordynmods is never disabled in cgroup_destroy_root(). Signed-off-by: Luiz Capitulino <luizcap@amazon.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup/cpuset: Check partition conflict with housekeeping setupWaiman Long1-0/+31
A user can pre-configure certain CPUs in an isolated state at boot time with the "isolcpus" kernel boot command line option. Those CPUs will not be in the housekeeping_cpumask(HK_TYPE_DOMAIN) and so will not be in any sched domains. This may conflict with the partition setup at runtime. Those boot time isolated CPUs should only be used in an isolated partition. This patch adds the necessary check and disallows partition setup if the check fails. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup/cpuset: Introduce remote partitionWaiman Long1-29/+306
One can use "cpuset.cpus.partition" to create multiple scheduling domains or to produce a set of isolated CPUs where load balancing is disabled. The former use case is less common but the latter one can be frequently used especially for the Telco use cases like DPDK. The existing "isolated" partition can be used to produce isolated CPUs if the applications have full control of a system. However, in a containerized environment where all the apps are run in a container, it is hard to distribute out isolated CPUs from the root down given the unified hierarchy nature of cgroup v2. The container running on isolated CPUs can be several layers down from the root. The current partition feature requires that all the ancestors of a leaf partition root must be parititon roots themselves. This can be hard to configure. This patch introduces a new type of partition called remote partition. A remote partition is a partition whose parent is not a partition root itself and its CPUs are acquired directly from available CPUs in the top cpuset through a hierachical distribution of exclusive CPUs down from it. By contrast, the existing type of partitions where their parents have to be valid partition roots are referred to as local partitions as they have to be clustered around a parent partition root. Child local partitons can be created under a remote partition, but a remote partition cannot be created under a local partition. We may relax this limitation in the future if there are use cases for such configuration. Manually writing to the "cpuset.cpus.exclusive" file is not necessary when creating local partitions. However, writing proper values to "cpuset.cpus.exclusive" down the cgroup hierarchy before the target remote partition root is mandatory for the creation of a remote partition. The value in "cpuset.cpus.exclusive.effective" may change if its "cpuset.cpus" or its parent's "cpuset.cpus.exclusive.effective" changes. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup/cpuset: Add cpuset.cpus.exclusive for v2Waiman Long1-34/+239
This patch introduces a new writable "cpuset.cpus.exclusive" control file for v2 which will be added to non-root cpuset enabled cgroups. This new file enables user to set a smaller list of exclusive CPUs to be used in the creation of a cpuset partition. The value written to "cpuset.cpus.exclusive" may not be the effective value being used for the creation of cpuset partition, the effective value will show up in "cpuset.cpus.exclusive.effective" and it is subject to the constraint that it must also be a subset of cpus_allowed and parent's "cpuset.cpus.exclusive.effective". By writing to "cpuset.cpus.exclusive", "cpuset.cpus.exclusive.effective" may be set to a non-empty value even for cgroups that are not valid partition roots yet. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2Waiman Long1-307/+421
The creation of a cpuset partition means dedicating a set of exclusive CPUs to be used by a particular partition only. These exclusive CPUs will not be used by any cpusets outside of that partition. To enable more flexibility in creating partitions, we need a way to distribute exclusive CPUs that can be used in new partitions. Currently, we have a subparts_cpus cpumask in struct cpuset that tracks only the exclusive CPUs used by all the sub-partitions underneath a given cpuset. This patch reworks the way we do exclusive CPUs tracking. The subparts_cpus is now renamed to effective_xcpus which tracks the exclusive CPUs allocated to a partition root including those that are further distributed down to sub-partitions underneath it. IOW, it also includes the exclusive CPUs used by the current partition root. Note that effective_xcpus can contain offline CPUs and it will always be a subset of cpus_allowed. The renamed effective_xcpus is now exposed via a new read-only "cpuset.cpus.exclusive.effective" control file. The new effective_xcpus cpumask should be set to cpus_allowed when a cpuset becomes a partition root and be cleared if it is not a valid partition root. In the next patch, we will enable write to another new control file to enable further control of what can get into effective_xcpus. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup/cpuset: Fix load balance state in update_partition_sd_lb()Waiman Long1-1/+11
Commit a86ce68078b2 ("cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling") adds a new helper function update_partition_sd_lb() to update the load balance state of the cpuset. However the new load balance is determined by just looking at whether the cpuset is a valid isolated partition root or not. That is not enough if the cpuset is not a valid partition root but its parent is in the isolated state (load balance off). Update the function to set the new state to be the same as its parent in this case like what has been done in commit c8c926200c55 ("cgroup/cpuset: Inherit parent's load balance state in v2"). Fixes: a86ce68078b2 ("cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup: Avoid extra dereference in css_populate_dir()Kamalesh Babulal1-2/+2
Use css directly instead of dereferencing it from &cgroup->self, while adding the cgroup v2 cft base and psi files in css_populate_dir(). Both points to the same css, when css->ss is NULL, this avoids extra deferences and makes code consistent in usage across the function. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup: Check for ret during cgroup1_base_files cft additionKamalesh Babulal1-2/+4
There is no check for possible failure while populating cgroup1_base_files cft in css_populate_dir(), like its cgroup v2 counter parts cgroup_{base,psi}_files. In case of failure, the cgroup might not be set up right. Add ret value check to return on failure. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-02cgroup: fix build when CGROUP_SCHED is not enabledLinus Torvalds1-14/+14
Sudip Mukherjee reports that the mips sb1250_swarm_defconfig build fails with the current kernel. It isn't actually MIPS-specific, it's just that that defconfig does not have CGROUP_SCHED enabled like most configs do, and as such shows this error: kernel/cgroup/cgroup.c: In function 'cgroup_local_stat_show': kernel/cgroup/cgroup.c:3699:15: error: implicit declaration of function 'cgroup_tryget_css'; did you mean 'cgroup_tryget'? [-Werror=implicit-function-declaration] 3699 | css = cgroup_tryget_css(cgrp, ss); | ^~~~~~~~~~~~~~~~~ | cgroup_tryget kernel/cgroup/cgroup.c:3699:13: warning: assignment to 'struct cgroup_subsys_state *' from 'int' makes pointer from integer without a cast [-Wint-conversion] 3699 | css = cgroup_tryget_css(cgrp, ss); | ^ because cgroup_tryget_css() only exists when CGROUP_SCHED is enabled, and the cgroup_local_stat_show() function should similarly be guarded by that config option. Move things around a bit to fix this all. Fixes: d1d4ff5d11a5 ("cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED") Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-09-01Merge tag 'cgroup-for-6.6' of ↵Linus Torvalds6-181/+243
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Per-cpu cpu usage stats are now tracked This currently isn't printed out in the cgroupfs interface and can only be accessed through e.g. BPF. Should decide on a not-too-ugly way to show per-cpu stats in cgroupfs - cpuset received some cleanups and prepatory patches for the pending cpus.exclusive patchset which will allow cpuset partitions to be created below non-partition parents, which should ease the management of partition cpusets - A lot of code and documentation cleanup patches - tools/testing/selftests/cgroup/test_cpuset.c added * tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (32 commits) cgroup: Avoid -Wstringop-overflow warnings cgroup:namespace: Remove unused cgroup_namespaces_init() cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants cgroup: clean up if condition in cgroup_pidlist_start() cgroup: fix obsolete function name in cgroup_destroy_locked() Documentation: cgroup-v2.rst: Correct number of stats entries cgroup: fix obsolete function name above css_free_rwork_fn() cgroup/cpuset: fix kernel-doc cgroup: clean up printk() cgroup: fix obsolete comment above cgroup_create() docs: cgroup-v1: fix typo docs: cgroup-v1: correct the term of Page Cache organization in inode cgroup/misc: Store atomic64_t reads to u64 cgroup/misc: Change counters to be explicit 64bit types cgroup/misc: update struct members descriptions cgroup: remove cgrp->kn check in css_populate_dir() cgroup: fix obsolete function name cgroup: use cached local variable parent in for loop cgroup: remove obsolete comment above struct cgroupstats cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED ...
2023-08-17cgroup: Avoid -Wstringop-overflow warningsGustavo A. R. Silva1-2/+2
Change the notation from pointer-to-array to pointer-to-pointer. With this, we avoid the compiler complaining about trying to access a region of size zero as an argument during function calls. This is a workaround to prevent the compiler complaining about accessing an array of size zero when evaluating the arguments of a couple of function calls. See below: kernel/cgroup/cgroup.c: In function 'find_css_set': kernel/cgroup/cgroup.c:1206:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] 1206 | cset = find_existing_css_set(old_cset, cgrp, template); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/cgroup/cgroup.c:1206:16: note: referencing argument 3 of type 'struct cgroup_subsys_state *[0]' kernel/cgroup/cgroup.c:1071:24: note: in a call to function 'find_existing_css_set' 1071 | static struct css_set *find_existing_css_set(struct css_set *old_cset, | ^~~~~~~~~~~~~~~~~~~~~ With the change to pointer-to-pointer, the functions are not prevented from being executed, and they will do what they have to do when CGROUP_SUBSYS_COUNT == 0. Address the following -Wstringop-overflow warnings seen when built with ARM architecture and aspeed_g4_defconfig configuration (notice that under this configuration CGROUP_SUBSYS_COUNT == 0): kernel/cgroup/cgroup.c:1208:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:1258:15: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:6089:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:6153:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] This results in no differences in binary output. Link: https://github.com/KSPP/linux/issues/316 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-14cgroup:namespace: Remove unused cgroup_namespaces_init()Lu Jialin1-6/+0
cgroup_namspace_init() just return 0. Therefore, there is no need to call it during start_kernel. Just remove it. Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: Lu Jialin <lujialin4@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendantsHao Jia1-2/+10
The member variable bstat of the structure cgroup_rstat_cpu records the per-cpu time of the cgroup itself, but does not include the per-cpu time of its descendants. The per-cpu time including descendants is very useful for calculating the per-cpu usage of cgroups. Although we can indirectly obtain the total per-cpu time of the cgroup and its descendants by accumulating the per-cpu bstat of each descendant of the cgroup. But after a child cgroup is removed, we will lose its bstat information. This will cause the cumulative value to be non-monotonic, thus affecting the accuracy of cgroup per-cpu usage. So we add the subtree_bstat variable to record the total per-cpu time of this cgroup and its descendants, which is similar to "cpuacct.usage*" in cgroup v1. And this is also helpful for the migration from cgroup v1 to cgroup v2. After adding this variable, we can obtain the per-cpu time of cgroup and its descendants in user mode through eBPF/drgn, etc. And we are still trying to determine how to expose it in the cgroupfs interface. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-07cgroup: clean up if condition in cgroup_pidlist_start()Miaohe Lin1-1/+1
There's no need to use '<=' when knowing 'l->list[mid] != pid' already. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-03cgroup: fix obsolete function name in cgroup_destroy_locked()Miaohe Lin1-1/+1
Since commit e76ecaeef65c ("cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods"), cgroup_kn_lock_live() is used in cgroup kernfs methods. Update corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-02cgroup: fix obsolete function name above css_free_rwork_fn()Miaohe Lin1-1/+1
Since commit 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item"), css_free_work_fn has been renamed to css_free_rwork_fn. Update corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-02cgroup/cpuset: fix kernel-docCai Xinchen1-0/+2
Add kernel-doc of param @rotor to fix warnings: kernel/cgroup/cpuset.c:4162: warning: Function parameter or member 'rotor' not described in 'cpuset_spread_node' kernel/cgroup/cpuset.c:3771: warning: Function parameter or member 'work' not described in 'cpuset_hotplug_workfn' Signed-off-by: Cai Xinchen <caixinchen1@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-02cgroup: clean up printk()Kamalesh Babulal1-2/+2
Convert the only printk() to use pr_*() helper. No functional change. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-21cgroup: fix obsolete comment above cgroup_create()Miaohe Lin1-2/+1
Since commit 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID"), cgrp is associated with its kernfs_node. Update corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-21cgroup/misc: Store atomic64_t reads to u64Haitao Huang1-1/+1
Change 'new_usage' type to u64 so it can be compared with unsigned 'max' and 'capacity' properly even if the value crosses the signed boundary. Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-19Merge tag 'v6.5-rc2' into sched/core, to pick up fixesIngo Molnar1-1/+1
Sync with upstream fixes before applying EEVDF. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-07-18cgroup/misc: Change counters to be explicit 64bit typesHaitao Huang1-28/+27
So the variables can account for resources of huge quantities even on 32-bit machines. Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-17cgroup: remove cgrp->kn check in css_populate_dir()Kamalesh Babulal1-1/+1
cgroup_create() creates cgrp and assigns the kernfs_node to cgrp->kn, then cgroup_mkdir() populates base and csses cft file by calling css_populate_dir() and cgroup_apply_control_enable() with a valid cgrp->kn. Check for NULL cgrp->kn, will always be false, remove it. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-17cgroup: fix obsolete function nameMiaohe Lin1-1/+1
cgroup_taskset_migrate() has been renamed to cgroup_migrate_execute() since commit e595cd706982 ("cgroup: track migration context in cgroup_mgctx"). Update the corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-17cgroup: use cached local variable parent in for loopMiaohe Lin1-1/+1
Use local variable parent to initialize iter tcgrp in for loop so the size of cgroup.o can be reduced by 64 bytes. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-13sched: add throttled time stat for throttled childrenJosh Don1-0/+34
We currently export the total throttled time for cgroups that are given a bandwidth limit. This patch extends this accounting to also account the total time that each children cgroup has been throttled. This is useful to understand the degree to which children have been affected by the throttling control. Children which are not runnable during the entire throttled period, for example, will not show any self-throttling time during this period. Expose this in a new interface, 'cpu.stat.local', which is similar to how non-hierarchical events are accounted in 'memory.events.local'. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com
2023-07-11cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHEDMiaohe Lin1-22/+22
Put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED to fix the warning of 'cgroup_tryget_css' defined but not used [-Wunused-function] when CONFIG_CGROUP_SCHED is disabled. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10cgroup/cpuset: Allow suppression of sched domain rebuild in ↵Waiman Long1-8/+16
update_cpumasks_hier() A single partition setup and tear-down operation can lead to multiple rebuild_sched_domains_locked() calls which is a waste of effort. This can partly be mitigated by adding a flag to suppress the rebuild_sched_domains_locked() call in update_cpumasks_hier(). Since a Boolean flag has already been passed as the 3rd argument to update_cpumasks_hier(), we can extend that to a full flag word. The sched domain rebuild suppression is now enabled in update_sibling_cpumasks() as all it callers will do the sched domain rebuild after its return later on anyway. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10cgroup/cpuset: Improve temporary cpumasks handlingWaiman Long1-29/+13
The limitation that update_parent_subparts_cpumask() can only use addmask & delmask in the given tmp cpumasks is fragile and may lead to unexpected error. Fix this problem by allocating/freeing a struct tmpmasks in update_cpumask() to avoid reusing the cpumasks in trial_cs. With this change, we can move the update_tasks_cpumask() for the parent and update_sibling_cpumasks() for the sibling to inside update_parent_subparts_cpumask(). Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handlingWaiman Long1-55/+86
Extract out the setting of CS_CPU_EXCLUSIVE and CS_SCHED_LOAD_BALANCE flags as well as the rebuilding of scheduling domains into the new update_partition_exclusive() and update_partition_sd_lb() helper functions to simplify the logic. The update_partition_exclusive() helper is called mainly at the beginning of the caller, but it may be called at the end too. The update_partition_sd_lb() helper is called at the end of the caller. This patch should reduce the chance that cpuset partition will end up in an incorrect state. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10cgroup/cpuset: Inherit parent's load balance state in v2Waiman Long1-3/+30
Since commit f28e22441f35 ("cgroup/cpuset: Add a new isolated cpus.partition type"), the CS_SCHED_LOAD_BALANCE bit of a v2 cpuset can be on or off. The child cpusets of a partition root must have the same setting as its parent or it may screw up the rebuilding of sched domains. Fix this problem by making sure the a child v2 cpuset will follows its parent cpuset load balance state unless the child cpuset is a new partition root itself. Fixes: f28e22441f35 ("cgroup/cpuset: Add a new isolated cpus.partition type") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10cgroup: fix obsolete comment above for_each_css()Miaohe Lin1-1/+1
cgroup_tree_mutex is removed since commit 8353da1f91f1 ("cgroup: remove cgroup_tree_mutex"), update corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10cpuset: Allow setscheduler regardless of manipulated taskMichal Koutný1-3/+16
When we migrate a task between two cgroups, one of the checks is a verification whether we can modify task's scheduler settings (cap_task_setscheduler()). An implicit migration occurs also when enabling a controller on the unified hierarchy (think of parent to child migration). The aforementioned check may be problematic if the caller of the migration (enabling a controller) has no permissions over migrated tasks. For instance, a user's cgroup that ends up running a process of a different user. Although cgroup permissions are configured favorably, the enablement fails due to the foreign process [1]. Change the behavior by relaxing the permissions check on the unified hierarchy when no effective change would happen. This is in accordance with unified hierarchy attachment behavior when permissions of the source to target cgroups are decisive whereas the migrated task is opaque (as opposed to more restrictive check in __cgroup1_procs_write()). Notice that foreign task's affinity may still be modified if the user can modify destination cgroup's cpuset attributes (update_tasks_cpumask() does no permissions check). The permissions check could thus be skipped on v2 even when affinity changes. Stay conservative in this patch though. [1] https://github.com/systemd/systemd/issues/18293#issuecomment-831205649 Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10cgroup/cpuset: avoid unneeded cpuset_mutex re-lockMiaohe Lin1-5/+4
cpuset_mutex unlock and lock pair is only needed when transferring tasks out of empty cpuset. Avoid unneeded cpuset_mutex re-lock when !is_empty to save cpu cycles. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10cgroup/cpuset: simplify the percpu kthreads check in update_tasks_cpumask()Miaohe Lin1-1/+1
kthread_is_per_cpu() can be called directly without checking whether PF_KTHREAD is set in task->flags. So remove PF_KTHREAD check to make code more concise. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10cgroup: minor cleanup for cgroup_extra_stat_show()Miaohe Lin1-4/+5
Make it under CONFIG_CGROUP_SCHED to rid of __maybe_unused annotation. And further fetch cgrp inside cgroup_extra_stat_show() directly to rid of __maybe_unused annotation of cgrp. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10cgroup: remove unneeded return value of cgroup_rm_cftypes_locked()Miaohe Lin1-6/+3
The return value of cgroup_rm_cftypes_locked() is always 0. So remove it to simplify the code. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-07-10sched/psi: use kernfs polling functions for PSI trigger pollingSuren Baghdasaryan1-1/+1
Destroying psi trigger in cgroup_file_release causes UAF issues when a cgroup is removed from under a polling process. This is happening because cgroup removal causes a call to cgroup_file_release while the actual file is still alive. Destroying the trigger at this point would also destroy its waitqueue head and if there is still a polling process on that file accessing the waitqueue, it will step on the freed pointer: do_select vfs_poll do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy wake_up_pollfree(&t->event_wait) // vfs_poll is unblocked synchronize_rcu kfree(t) poll_freewait -> UAF access to the trigger's waitqueue head Patch [1] fixed this issue for epoll() case using wake_up_pollfree(), however the same issue exists for synchronous poll() case. The root cause of this issue is that the lifecycles of the psi trigger's waitqueue and of the file associated with the trigger are different. Fix this by using kernfs_generic_poll function when polling on cgroup-specific psi triggers. It internally uses kernfs_open_node->poll waitqueue head with its lifecycle tied to the file's lifecycle. This also renders the fix in [1] obsolete, so revert it. [1] commit c2dbe32d5db5 ("sched/psi: Fix use-after-free in ep_remove_wait_queue()") Fixes: 0e94682b73bf ("psi: introduce psi monitor") Closes: https://lore.kernel.org/all/20230613062306.101831-1-lujialin4@huawei.com/ Reported-by: Lu Jialin <lujialin4@huawei.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230630005612.1014540-1-surenb@google.com
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds1-21/+5
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-27Merge tag 'cgroup-for-6.5' of ↵Linus Torvalds6-176/+172
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Whenever cpuset needs to rebuild sched_domain, it walked all tasks looking for DEADLINE tasks as they need to be accounted on the new domain. Walking all tasks can be expensive and there may not be any DEADLINE tasks at all. Task iteration is now omitted if there are no DEADLINE tasks - Fixes DEADLINE bandwidth misaccounting after task migration failures - When no controller is enabled, -Wstringop-overflow warning is triggered. The fix patch added an early exit which is too eager and got reverted for now. Will fix later - Everything else is minor cleanups * tag 'cgroup-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: Revert "cgroup: Avoid -Wstringop-overflow warnings" cgroup/misc: Expose misc.current on cgroup v2 root cgroup: Avoid -Wstringop-overflow warnings cgroup: remove obsolete comment on cgroup_on_dfl() cgroup: remove unused task_cgroup_path() cgroup/cpuset: remove unneeded header files cgroup: make cgroup_is_threaded() and cgroup_is_thread_root() static rdmacg: fix kernel-doc warnings in rdmacg cgroup: Replace the css_set call with cgroup_get cgroup: remove unused macro for_each_e_css() cgroup: Update out-of-date comment in cgroup_migrate() cgroup: Replace all non-returning strlcpy with strscpy cgroup/cpuset: remove unneeded header files cgroup/cpuset: Free DL BW in case can_attach() fails sched/deadline: Create DL BW alloc, free & check overflow interface cgroup/cpuset: Iterate only if DEADLINE tasks are present sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets sched/cpuset: Bring back cpuset_mutex cgroup/cpuset: Rename functions dealing with DEADLINE accounting
2023-06-27Merge tag 'sched-core-2023-06-27' of ↵Linus Torvalds1-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scheduler SMP load-balancer improvements: - Avoid unnecessary migrations within SMT domains on hybrid systems. Problem: On hybrid CPU systems, (processors with a mixture of higher-frequency SMT cores and lower-frequency non-SMT cores), under the old code lower-priority CPUs pulled tasks from the higher-priority cores if more than one SMT sibling was busy - resulting in many unnecessary task migrations. Solution: The new code improves the load balancer to recognize SMT cores with more than one busy sibling and allows lower-priority CPUs to pull tasks, which avoids superfluous migrations and lets lower-priority cores inspect all SMT siblings for the busiest queue. - Implement the 'runnable boosting' feature in the EAS balancer: consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection. This improves CPU utilization for certain workloads, while leaves other key workloads unchanged. Scheduler infrastructure improvements: - Rewrite the scheduler topology setup code by consolidating it into the build_sched_topology() helper function and building it dynamically on the fly. - Resolve the local_clock() vs. noinstr complications by rewriting the code: provide separate sched_clock_noinstr() and local_clock_noinstr() functions to be used in instrumentation code, and make sure it is all instrumentation-safe. Fixes: - Fix a kthread_park() race with wait_woken() - Fix misc wait_task_inactive() bugs unearthed by the -rt merge: - Fix UP PREEMPT bug by unifying the SMP and UP implementations - Fix task_struct::saved_state handling - Fix various rq clock update bugs, unearthed by turning on the rq clock debugging code. - Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by creating enough cgroups, by removing the warnign and restricting window size triggers to PSI file write-permission or CAP_SYS_RESOURCE. - Propagate SMT flags in the topology when removing degenerate domain - Fix grub_reclaim() calculation bug in the deadline scheduler code - Avoid resetting the min update period when it is unnecessary, in psi_trigger_destroy(). - Don't balance a task to its current running CPU in load_balance(), which was possible on certain NUMA topologies with overlapping groups. - Fix the sched-debug printing of rq->nr_uninterruptible Cleanups: - Address various -Wmissing-prototype warnings, as a preparation to (maybe) enable this warning in the future. - Remove unused code - Mark more functions __init - Fix shadow-variable warnings" * tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle() sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop() sched/core: Fixed missing rq clock update before calling set_rq_offline() sched/deadline: Update GRUB description in the documentation sched/deadline: Fix bandwidth reclaim equation in GRUB sched/wait: Fix a kthread_park race with wait_woken() sched/topology: Mark set_sched_topology() __init sched/fair: Rename variable cpu_util eff_util arm64/arch_timer: Fix MMIO byteswap sched/fair, cpufreq: Introduce 'runnable boosting' sched/fair: Refactor CPU utilization functions cpuidle: Use local_clock_noinstr() sched/clock: Provide local_clock_noinstr() x86/tsc: Provide sched_clock_noinstr() clocksource: hyper-v: Provide noinstr sched_clock() clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX x86/vdso: Fix gettimeofday masking math64: Always inline u128 version of mul_u64_u64_shr() s390/time: Provide sched_clock_noinstr() loongarch: Provide noinstr sched_clock_read() ...
2023-06-22Revert "cgroup: Avoid -Wstringop-overflow warnings"Tejun Heo1-6/+0
This reverts commit 36de5f303ca1bd6fce74815ef17ef3d8ff8737b5. The commit caused boot failures on some configurations due to cgroup hierarchies not being created at all. Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-21cgroup/misc: Expose misc.current on cgroup v2 rootLeiZhou-971-1/+0
Hello, This patch is to expose misc.current on cgroup v2 root for tracking how much of the resource has been consumed in total on the system. Most of the cloud infrastucture use cgroup to fetch the host information for scheduling purpose. Currently, the misc controller can be used by Intel TDX HKIDs and AMD SEV ASIDs, which are both used for creating encrypted VMs. Intel TDX and AMD SEV are mostly be used by the cloud providers for providing confidential VMs. In actual use of a server, these confidential VMs may be launched in different ways. For the cloud solution, there are kubvirt and coco (tracked by kubepods.slice); on host, they can be booted directly through qemu by end user (tracked by user.slice), etc. In this complex environment, when wanting to know how many resource is used in total it has to iterate through all existing slices to get the value of each misc.current and add them up to calculate the total number of consumed keys. So exposing misc.current to root cgroup tends to give much easier when calculates how much resource has been used in total, which helps to schedule and count resources for the cloud infrastucture. Signed-off-by: LeiZhou-97 <lei.zhou@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-21cgroup: Avoid -Wstringop-overflow warningsGustavo A. R. Silva1-0/+6
Address the following -Wstringop-overflow warnings seen when built with ARM architecture and aspeed_g4_defconfig configuration (notice that under this configuration CGROUP_SUBSYS_COUNT == 0): kernel/cgroup/cgroup.c:1208:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:1258:15: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:6089:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:6153:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] These changes are based on commit d20d30ebb199 ("cgroup: Avoid compiler warnings with no subsystems"). Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-21cgroup: remove obsolete comment on cgroup_on_dfl()Miaohe Lin1-2/+0
The debug feature is supported since commit 8cc38fa7fa31 ("cgroup: make debug an implicit controller on cgroup2"), update corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-12cgroup: Do not corrupt task iteration when rebinding subsystemXiu Jianfeng1-3/+17
We found a refcount UAF bug as follows: refcount_t: addition on 0; use-after-free. WARNING: CPU: 1 PID: 342 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x148 Workqueue: events cpuset_hotplug_workfn Call trace: refcount_warn_saturate+0xa0/0x148 __refcount_add.constprop.0+0x5c/0x80 css_task_iter_advance_css_set+0xd8/0x210 css_task_iter_advance+0xa8/0x120 css_task_iter_next+0x94/0x158 update_tasks_root_domain+0x58/0x98 rebuild_root_domains+0xa0/0x1b0 rebuild_sched_domains_locked+0x144/0x188 cpuset_hotplug_workfn+0x138/0x5a0 process_one_work+0x1e8/0x448 worker_thread+0x228/0x3e0 kthread+0xe0/0xf0 ret_from_fork+0x10/0x20 then a kernel panic will be triggered as below: Unable to handle kernel paging request at virtual address 00000000c0000010 Call trace: cgroup_apply_control_disable+0xa4/0x16c rebind_subsystems+0x224/0x590 cgroup_destroy_root+0x64/0x2e0 css_free_rwork_fn+0x198/0x2a0 process_one_work+0x1d4/0x4bc worker_thread+0x158/0x410 kthread+0x108/0x13c ret_from_fork+0x10/0x18 The race that cause this bug can be shown as below: (hotplug cpu) | (umount cpuset) mutex_lock(&cpuset_mutex) | mutex_lock(&cgroup_mutex) cpuset_hotplug_workfn | rebuild_root_domains | rebind_subsystems update_tasks_root_domain | spin_lock_irq(&css_set_lock) css_task_iter_start | list_move_tail(&cset->e_cset_node[ss->id] while(css_task_iter_next) | &dcgrp->e_csets[ss->id]); css_task_iter_end | spin_unlock_irq(&css_set_lock) mutex_unlock(&cpuset_mutex) | mutex_unlock(&cgroup_mutex) Inside css_task_iter_start/next/end, css_set_lock is hold and then released, so when iterating task(left side), the css_set may be moved to another list(right side), then it->cset_head points to the old list head and it->cset_pos->next points to the head node of new list, which can't be used as struct css_set. To fix this issue, switch from all css_sets to only scgrp's css_sets to patch in-flight iterators to preserve correct iteration, and then update it->cset_head as well. Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com> Link: https://www.spinics.net/lists/cgroups/msg37935.html Suggested-by: Michal Koutný <mkoutny@suse.com> Link: https://lore.kernel.org/all/20230526114139.70274-1-xiujianfeng@huaweicloud.com/ Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Fixes: 2d8f243a5e6e ("cgroup: implement cgroup->e_csets[]") Cc: stable@vger.kernel.org # v3.16+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-12cgroup: remove unused task_cgroup_path()Miaohe Lin1-39/+0
task_cgroup_path() is not used anymore. So remove it. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-12cgroup/cpuset: remove unneeded header filesMiaohe Lin1-2/+0
Remove some unnecessary header files. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-12cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex in ↵Tetsuo Handa1-2/+6
freezer_css_{online,offline}() syzbot is again reporting circular locking dependency between cpu_hotplug_lock and freezer_mutex. Do like what we did with commit 57dcd64c7e036299 ("cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex"). Reported-by: syzbot <syzbot+2ab700fe1829880a2ec6@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=2ab700fe1829880a2ec6 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot <syzbot+2ab700fe1829880a2ec6@syzkaller.appspotmail.com> Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-09cgroup: remove cgroup_rstat_flush_atomic()Yosry Ahmed1-21/+5
Previous patches removed the only caller of cgroup_rstat_flush_atomic(). Remove the function and simplify the code. Link: https://lkml.kernel.org/r/20230421174020.2994750-6-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-05cgroup: make cgroup_is_threaded() and cgroup_is_thread_root() staticMiaohe Lin2-4/+2
They're only called inside cgroup.c. Make them static. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-05rdmacg: fix kernel-doc warnings in rdmacgGaosheng Cui1-0/+2
Fix all kernel-doc warnings in rdmacg: kernel/cgroup/rdma.c:209: warning: Function parameter or member 'cg' not described in 'rdmacg_uncharge_hierarchy' kernel/cgroup/rdma.c:230: warning: Function parameter or member 'cg' not described in 'rdmacg_uncharge' Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-05cgroup: Replace the css_set call with cgroup_getGaosheng Cui1-1/+1
We will release the refcnt of cgroup via cgroup_put, for example in the cgroup_lock_and_drain_offline function, so replace css_get with the cgroup_get function for better readability. Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-06-05cgroup: remove unused macro for_each_e_css()Miaohe Lin1-15/+0
for_each_e_css() is unused now. Remove it. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-24cgroup: Update out-of-date comment in cgroup_migrate()Xiu Jianfeng1-3/+3
Commit 674b745e22b3 ("cgroup: remove rcu_read_lock()/rcu_read_unlock() in critical section of spin_lock_irq()") has removed the rcu_read_lock, which makes the comment out-of-date, so update it. tj: Updated the comment a bit. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-22cgroup: always put cset in cgroup_css_set_put_forkJohn Sperbeck1-9/+8
A successful call to cgroup_css_set_fork() will always have taken a ref on kargs->cset (regardless of CLONE_INTO_CGROUP), so always do a corresponding put in cgroup_css_set_put_fork(). Without this, a cset and its contained css structures will be leaked for some fork failures. The following script reproduces the leak for a fork failure due to exceeding pids.max in the pids controller. A similar thing can happen if we jump to the bad_fork_cancel_cgroup label in copy_process(). [ -z "$1" ] && echo "Usage $0 pids-root" && exit 1 PID_ROOT=$1 CGROUP=$PID_ROOT/foo [ -e $CGROUP ] && rmdir -f $CGROUP mkdir $CGROUP echo 5 > $CGROUP/pids.max echo $$ > $CGROUP/cgroup.procs fork_bomb() { set -e for i in $(seq 10); do /bin/sleep 3600 & done } (fork_bomb) & wait echo $$ > $PID_ROOT/cgroup.procs kill $(cat $CGROUP/cgroup.procs) rmdir $CGROUP Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: John Sperbeck <jsperbeck@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-22cgroup: Replace all non-returning strlcpy with strscpyAzeem Shaikh1-2/+2
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-22cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks()Qi Zheng1-2/+2
The commit 4f7e7236435c ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") fixed the deadlock between cgroup_threadgroup_rwsem and cpus_read_lock() by introducing cgroup_attach_{lock,unlock}() and removing cpus_read_{lock,unlock}() from cpuset_attach(). But cgroup_transfer_tasks() was missed and not handled, which will cause th following warning: WARNING: CPU: 0 PID: 589 at kernel/cpu.c:526 lockdep_assert_cpus_held+0x32/0x40 CPU: 0 PID: 589 Comm: kworker/1:4 Not tainted 6.4.0-rc2-next-20230517 #50 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Workqueue: events cpuset_hotplug_workfn RIP: 0010:lockdep_assert_cpus_held+0x32/0x40 <...> Call Trace: <TASK> cpuset_attach+0x40/0x240 cgroup_migrate_execute+0x452/0x5e0 ? _raw_spin_unlock_irq+0x28/0x40 cgroup_transfer_tasks+0x1f3/0x360 ? find_held_lock+0x32/0x90 ? cpuset_hotplug_workfn+0xc81/0xed0 cpuset_hotplug_workfn+0xcb1/0xed0 ? process_one_work+0x248/0x5b0 process_one_work+0x2b9/0x5b0 worker_thread+0x56/0x3b0 ? process_one_work+0x5b0/0x5b0 kthread+0xf1/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> So just use the cgroup_attach_{lock,unlock}() helper to fix it. Reported-by: Zhao Gongyi <zhaogongyi@bytedance.com> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug") Cc: stable@vger.kernel.org # v5.17+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-20cgroup/cpuset: remove unneeded header filesMiaohe Lin1-21/+0
Remove some unnecessary header files. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-08cgroup/cpuset: Free DL BW in case can_attach() failsDietmar Eggemann1-5/+48
cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks have been checked. DL BW is not allocated per-task but as a sum over all DL tasks migrating. If multiple controllers are attached to the cgroup next to the cpuset controller a non-cpuset can_attach() can fail. In this case free DL BW in cpuset_cancel_attach(). Finally, update cpuset DL task count (nr_deadline_tasks) only in cpuset_attach(). Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-08cgroup/cpuset: Iterate only if DEADLINE tasks are presentJuri Lelli1-0/+3
update_tasks_root_domain currently iterates over all tasks even if no DEADLINE task is present on the cpuset/root domain for which bandwidth accounting is being rebuilt. This has been reported to introduce 10+ ms delays on suspend-resume operations. Skip the costly iteration for cpusets that don't contain DEADLINE tasks. Reported-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/lkml/20230206221428.2125324-1-qyousef@layalina.io/ Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-08sched/cpuset: Keep track of SCHED_DEADLINE task in cpusetsJuri Lelli2-0/+29
Qais reported that iterating over all tasks when rebuilding root domains for finding out which ones are DEADLINE and need their bandwidth correctly restored on such root domains can be a costly operation (10+ ms delays on suspend-resume). To fix the problem keep track of the number of DEADLINE tasks belonging to each cpuset and then use this information (followup patch) to only perform the above iteration if DEADLINE tasks are actually present in the cpuset for which a corresponding root domain is being rebuilt. Reported-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/lkml/20230206221428.2125324-1-qyousef@layalina.io/ Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-08sched/cpuset: Bring back cpuset_mutexJuri Lelli1-79/+80
Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea, as it has been reported to cause slowdowns in workloads that need to change cpuset configuration frequently and it is also not implementing priority inheritance (which causes troubles with realtime workloads). Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it only for SCHED_DEADLINE tasks (other policies don't care about stable cpusets anyway). Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-08cgroup/cpuset: Rename functions dealing with DEADLINE accountingJuri Lelli1-4/+4
rebuild_root_domains() and update_tasks_root_domain() have neutral names, but actually deal with DEADLINE bandwidth accounting. Rename them to use 'dl_' prefix so that intent is more clear. No functional change. Suggested-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-08psi: remove 500ms min window size limitation for triggersSuren Baghdasaryan1-0/+12
Current 500ms min window size for psi triggers limits polling interval to 50ms to prevent polling threads from using too much cpu bandwidth by polling too frequently. However the number of cgroups with triggers is unlimited, so this protection can be defeated by creating multiple cgroups with psi triggers (triggers in each cgroup are served by a single "psimon" kernel thread). Instead of limiting min polling period, which also limits the latency of psi events, it's better to limit psi trigger creation to authorized users only, like we do for system-wide psi triggers (/proc/pressure/* files can be written only by processes with CAP_SYS_RESOURCE capability). This also makes access rules for cgroup psi files consistent with system-wide ones. Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and remove the psi window min size limitation. Suggested-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/
2023-04-29Merge tag 'cgroup-for-6.4' of ↵Linus Torvalds3-49/+57
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cpuset changes including the fix for an incorrect interaction with CPU hotplug and an optimization - Other doc and cosmetic changes * tag 'cgroup-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup-v1/cpusets: update libcgroup project link cgroup/cpuset: Minor updates to test_cpuset_prs.sh cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset cpuset: Clean up cpuset_node_allowed cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers
2023-04-28Merge tag 'sched-core-2023-04-27' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Allow unprivileged PSI poll()ing - Fix performance regression introduced by mm_cid - Improve livepatch stalls by adding livepatch task switching to cond_resched(). This resolves livepatching busy-loop stalls with certain CPU-bound kthreads - Improve sched_move_task() performance on autogroup configs - On core-scheduling CPUs, avoid selecting throttled tasks to run - Misc cleanups, fixes and improvements * tag 'sched-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/clock: Fix local_clock() before sched_clock_init() sched/rt: Fix bad task migration for rt tasks sched: Fix performance regression introduced by mm_cid sched/core: Make sched_dynamic_mutex static sched/psi: Allow unprivileged polling of N*2s period sched/psi: Extract update_triggers side effect sched/psi: Rename existing poll members in preparation sched/psi: Rearrange polling code in preparation sched/fair: Fix inaccurate tally of ttwu_move_affine vhost: Fix livepatch timeouts in vhost_worker() livepatch,sched: Add livepatch task switching to cond_resched() livepatch: Skip task_call_func() for current task livepatch: Convert stack entries array to percpu sched: Interleave cfs bandwidth timers for improved single thread performance at low utilization sched/core: Reduce cost of sched_move_task when config autogroup sched/core: Avoid selecting the task that is throttled to run when core-sched enable sched/topology: Make sched_energy_mutex,update static
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-26Merge tag 'net-next-6.4' of ↵Linus Torvalds1-2/+12
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the default value allows for better BIG TCP performances - Reduce compound page head access for zero-copy data transfers - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible - Threaded NAPI improvements, adding defer skb free support and unneeded softirq avoidance - Address dst_entry reference count scalability issues, via false sharing avoidance and optimize refcount tracking - Add lockless accesses annotation to sk_err[_soft] - Optimize again the skb struct layout - Extends the skb drop reasons to make it usable by multiple subsystems - Better const qualifier awareness for socket casts BPF: - Add skb and XDP typed dynptrs which allow BPF programs for more ergonomic and less brittle iteration through data and variable-sized accesses - Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward - Add more precise memory usage reporting for all BPF map types - Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params - Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton - Bigger batch of BPF verifier improvements to prepare for upcoming BPF open-coded iterators allowing for less restrictive looping capabilities - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF programs to NULL-check before passing such pointers into kfunc - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in local storage maps - Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps - Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree - Add BPF verifier support for ST instructions in convert_ctx_access() which will help new -mcpu=v4 clang flag to start emitting them - Add ARM32 USDT support to libbpf - Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations Protocols: - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value indicates the provenance of the IP address - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition - Add the handshake upcall mechanism, allowing the user-space to implement generic TLS handshake on kernel's behalf - Bridge: support per-{Port, VLAN} neighbor suppression, increasing resilience to nodes failures - SCTP: add support for Fair Capacity and Weighted Fair Queueing schedulers - MPTCP: delay first subflow allocation up to its first usage. This will allow for later better LSM interaction - xfrm: Remove inner/outer modes from input/output path. These are not needed anymore - WiFi: - reduced neighbor report (RNR) handling for AP mode - HW timestamping support - support for randomized auth/deauth TA for PASN privacy - per-link debugfs for multi-link - TC offload support for mac80211 drivers - mac80211 mesh fast-xmit and fast-rx support - enable Wi-Fi 7 (EHT) mesh support Netfilter: - Add nf_tables 'brouting' support, to force a packet to be routed instead of being bridged - Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header. This is needed for BIT TCP support - The iptables 32bit compat interface isn't compiled in by default anymore - Move ip(6)tables builtin icmp matches to the udptcp one. This has the advantage that icmp/icmpv6 match doesn't load the iptables/ip6tables modules anymore when iptables-nft is used - Extended netlink error report for netdevice in flowtables and netdev/chains. Allow for incrementally add/delete devices to netdev basechain. Allow to create netdev chain without device Driver API: - Remove redundant Device Control Error Reporting Enable, as PCI core has already error reporting enabled at enumeration time - Move Multicast DB netlink handlers to core, allowing devices other then bridge to use them - Allow the page_pool to directly recycle the pages from safely localized NAPI - Implement lockless TX queue stop/wake combo macros, allowing for further code de-duplication and sanitization - Add YNL support for user headers and struct attrs - Add partial YNL specification for devlink - Add partial YNL specification for ethtool - Add tc-mqprio and tc-taprio support for preemptible traffic classes - Add tx push buf len param to ethtool, specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device - Add basic LED support for switch/phy - Add NAPI documentation, stop relaying on external links - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory work to make the hardware timestamping layer selectable by user space - Add transceiver support and improve the error messages for CAN-FD controllers New hardware / drivers: - Ethernet: - AMD/Pensando core device support - MediaTek MT7981 SoC - MediaTek MT7988 SoC - Broadcom BCM53134 embedded switch - Texas Instruments CPSW9G ethernet switch - Qualcomm EMAC3 DWMAC ethernet - StarFive JH7110 SoC - NXP CBTX ethernet PHY - WiFi: - Apple M1 Pro/Max devices - RealTek rtl8710bu/rtl8188gu - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset - Bluetooth: - Realtek RTL8821CS, RTL8851B, RTL8852BS - Mediatek MT7663, MT7922 - NXP w8997 - Actions Semi ATS2851 - QTI WCN6855 - Marvell 88W8997 - Can: - STMicroelectronics bxcan stm32f429 Drivers: - Ethernet NICs: - Intel (1G, icg): - add tracking and reporting of QBV config errors - add support for configuring max SDU for each Tx queue - Intel (100G, ice): - refactor mailbox overflow detection to support Scalable IOV - GNSS interface optimization - Intel (i40e): - support XDP multi-buffer - nVidia/Mellanox: - add the support for linux bridge multicast offload - enable TC offload for egress and engress MACVLAN over bond - add support for VxLAN GBP encap/decap flows offload - extend packet offload to fully support libreswan - support tunnel mode in mlx5 IPsec packet offload - extend XDP multi-buffer support - support MACsec VLAN offload - add support for dynamic msix vectors allocation - drop RX page_cache and fully use page_pool - implement thermal zone to report NIC temperature - Netronome/Corigine: - add support for multi-zone conntrack offload - Solarflare/Xilinx: - support offloading TC VLAN push/pop actions to the MAE - support TC decap rules - support unicast PTP - Other NICs: - Broadcom (bnxt): enforce software based freq adjustments only on shared PHC NIC - RealTek (r8169): refactor to addess ASPM issues during NAPI poll - Micrel (lan8841): add support for PTP_PF_PEROUT - Cadence (macb): enable PTP unicast - Engleder (tsnep): add XDP socket zero-copy support - virtio-net: implement exact header length guest feature - veth: add page_pool support for page recycling - vxlan: add MDB data path support - gve: add XDP support for GQI-QPL format - geneve: accept every ethertype - macvlan: allow some packets to bypass broadcast queue - mana: add support for jumbo frame - Ethernet high-speed switches: - Microchip (sparx5): Add support for TC flower templates - Ethernet embedded switches: - Broadcom (b54): - configure 6318 and 63268 RGMII ports - Marvell (mv88e6xxx): - faster C45 bus scan - Microchip: - lan966x: - add support for IS1 VCAP - better TX/RX from/to CPU performances - ksz9477: add ETS Qdisc support - ksz8: enhance static MAC table operations and error handling - sama7g5: add PTP capability - NXP (ocelot): - add support for external ports - add support for preemptible traffic classes - Texas Instruments: - add CPSWxG SGMII support for J7200 and J721E - Intel WiFi (iwlwifi): - preparation for Wi-Fi 7 EHT and multi-link support - EHT (Wi-Fi 7) sniffer support - hardware timestamping support for some devices/firwmares - TX beacon protection on newer hardware - Qualcomm 802.11ax WiFi (ath11k): - MU-MIMO parameters support - ack signal support for management packets - RealTek WiFi (rtw88): - SDIO bus support - better support for some SDIO devices (e.g. MAC address from efuse) - RealTek WiFi (rtw89): - HW scan support for 8852b - better support for 6 GHz scanning - support for various newer firmware APIs - framework firmware backwards compatibility - MediaTek WiFi (mt76): - P2P support - mesh A-MSDU support - EHT (Wi-Fi 7) support - coredump support" * tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits) net: phy: hide the PHYLIB_LEDS knob net: phy: marvell-88x2222: remove unnecessary (void*) conversions tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. net: amd: Fix link leak when verifying config failed net: phy: marvell: Fix inconsistent indenting in led_blink_set lan966x: Don't use xdp_frame when action is XDP_TX tsnep: Add XDP socket zero-copy TX support tsnep: Add XDP socket zero-copy RX support tsnep: Move skb receive action to separate function tsnep: Add functions for queue enable/disable tsnep: Rework TX/RX queue initialization tsnep: Replace modulo operation with mask net: phy: dp83867: Add led_brightness_set support net: phy: Fix reading LED reg property drivers: nfc: nfcsim: remove return value check of `dev_dir` net: phy: dp83867: Remove unnecessary (void*) conversions net: ethtool: coalesce: try to make user settings stick twice net: mana: Check if netdev/napi_alloc_frag returns single page net: mana: Rename mana_refill_rxoob and remove some empty lines net: veth: add page_pool stats ...
2023-04-24Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-6/+4
Pull vfs fget updates from Al Viro: "fget() to fdget() conversions" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fuse_dev_ioctl(): switch to fdget() cgroup_get_from_fd(): switch to fdget_raw() bpf: switch to fdget_raw() build_mount_idmapped(): switch to fdget() kill the last remaining user of proc_ns_fget() SVM-SEV: convert the rest of fget() uses to fdget() in there convert sgx_set_attribute() to fdget()/fdput() convert setns(2) to fdget()/fdput()
2023-04-21Merge branch 'v6.3-rc7'Peter Zijlstra3-39/+150
Sync with the urgent patches; in particular: a53ce18cacb4 ("sched/fair: Sanitize vruntime of entity being migrated") Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2023-04-20cgroup_get_from_fd(): switch to fdget_raw()Al Viro1-6/+4
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-39/+150
Adjacent changes: net/mptcp/protocol.h 63740448a32e ("mptcp: fix accept vs worker race") 2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close") ddb1a072f858 ("mptcp: move first subflow allocation at mpc access time") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"Yosry Ahmed1-2/+2
Patch series "memcg: avoid flushing stats atomically where possible", v3. rstat flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system. The purpose of this series is to minimize the contexts where we flush stats atomically. Patches 1 and 2 are cleanups requested during reviews of prior versions of this series. Patch 3 makes sure we never try to flush from within an irq context. Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for atomic and non-atomic flushing, and make sure we only flush the stats atomically when necessary. Patch 8 is a slightly tangential optimization that limits the work done by rstat flushing in some scenarios. This patch (of 8): cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as "irqs are disabled throughout", which is what the current implementation does (currently under discussion [1]), but is not the intention. The intention is that this function is safe to call from atomic contexts. Name it as such. Link: https://lkml.kernel.org/r/20230330191801.1967435-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20230330191801.1967435-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-12cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpusetWaiman Long1-1/+2
It is found that attaching a task to the top_cpuset does not currently ignore CPUs allocated to subpartitions in cpuset_attach_task(). So the code is changed to fix that. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-04-12cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methodsWaiman Long1-11/+86
In the case of CLONE_INTO_CGROUP, not all cpusets are ready to accept new tasks. It is too late to check that in cpuset_fork(). So we need to add the cpuset_can_fork() and cpuset_cancel_fork() methods to pre-check it before we can allow attachment to a different cpuset. We also need to set the attach_in_progress flag to alert other code that a new task is going to be added to the cpuset. Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups") Suggested-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-04-12cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properlyWaiman Long1-20/+42
By default, the clone(2) syscall spawn a child process into the same cgroup as its parent. With the use of the CLONE_INTO_CGROUP flag introduced by commit ef2c41cf38a7 ("clone3: allow spawning processes into cgroups"), the child will be spawned into a different cgroup which is somewhat similar to writing the child's tid into "cgroup.threads". The current cpuset_fork() method does not properly handle the CLONE_INTO_CGROUP case where the cpuset of the child may be different from that of its parent. Update the cpuset_fork() method to treat the CLONE_INTO_CGROUP case similar to cpuset_attach(). Since the newly cloned task has not been running yet, its actual memory usage isn't known. So it is not necessary to make change to mm in cpuset_fork(). Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups") Reported-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-04-12cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()Waiman Long1-1/+5
After a successful cpuset_can_attach() call which increments the attach_in_progress flag, either cpuset_cancel_attach() or cpuset_attach() will be called later. In cpuset_attach(), tasks in cpuset_attach_wq, if present, will be woken up at the end. That is not the case in cpuset_cancel_attach(). So missed wakeup is possible if the attach operation is somehow cancelled. Fix that by doing the wakeup in cpuset_cancel_attach() as well. Fixes: e44193d39e8d ("cpuset: let hotplug propagation work wait for task attaching") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Cc: stable@vger.kernel.org # v3.11+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-04-12cgroup,freezer: hold cpu_hotplug_lock before freezer_mutexTetsuo Handa1-2/+5
syzbot is reporting circular locking dependency between cpu_hotplug_lock and freezer_mutex, for commit f5d39b020809 ("freezer,sched: Rewrite core freezer logic") replaced atomic_inc() in freezer_apply_state() with static_branch_inc() which holds cpu_hotplug_lock. cpu_hotplug_lock => cgroup_threadgroup_rwsem => freezer_mutex cgroup_file_write() { cgroup_procs_write() { __cgroup_procs_write() { cgroup_procs_write_start() { cgroup_attach_lock() { cpus_read_lock() { percpu_down_read(&cpu_hotplug_lock); } percpu_down_write(&cgroup_threadgroup_rwsem); } } cgroup_attach_task() { cgroup_migrate() { cgroup_migrate_execute() { freezer_attach() { mutex_lock(&freezer_mutex); (...snipped...) } } } } (...snipped...) } } } freezer_mutex => cpu_hotplug_lock cgroup_file_write() { freezer_write() { freezer_change_state() { mutex_lock(&freezer_mutex); freezer_apply_state() { static_branch_inc(&freezer_active) { static_key_slow_inc() { cpus_read_lock(); static_key_slow_inc_cpuslocked(); cpus_read_unlock(); } } } mutex_unlock(&freezer_mutex); } } } Swap locking order by moving cpus_read_lock() in freezer_apply_state() to before mutex_lock(&freezer_mutex) in freezer_change_state(). Reported-by: syzbot <syzbot+c39682e86c9d84152f93@syzkaller.appspotmail.com> Link: https://syzkaller.appspot.com/bug?extid=c39682e86c9d84152f93 Suggested-by: Hillf Danton <hdanton@sina.com> Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-04-05sched/psi: Allow unprivileged polling of N*2s periodDomenico Cerasuolo1-1/+1
PSI offers 2 mechanisms to get information about a specific resource pressure. One is reading from /proc/pressure/<resource>, which gives average pressures aggregated every 2s. The other is creating a pollable fd for a specific resource and cgroup. The trigger creation requires CAP_SYS_RESOURCE, and gives the possibility to pick specific time window and threshold, spawing an RT thread to aggregate the data. Systemd would like to provide containers the option to monitor pressure on their own cgroup and sub-cgroups. For example, if systemd launches a container that itself then launches services, the container should have the ability to poll() for pressure in individual services. But neither the container nor the services are privileged. This patch implements a mechanism to allow unprivileged users to create pressure triggers. The difference with privileged triggers creation is that unprivileged ones must have a time window that's a multiple of 2s. This is so that we can avoid unrestricted spawning of rt threads, and use instead the same aggregation mechanism done for the averages, which runs independently of any triggers. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20230330105418.77061-5-cerasuolodomenico@gmail.com
2023-03-29cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are ↵Waiman Long1-9/+14
updated Similar to commit 3fb906e7fabb ("group/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks"), the whole set of possible CPUs including offline ones should be used for setting cpumasks for tasks in the top cpuset when a cpuset partition is modified as the hotplug code won't update cpumasks for tasks in the top cpuset when CPUs become online or offline. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-29cgroup/cpuset: Skip task update if hotplug doesn't affect current cpusetWaiman Long1-0/+3
If a hotplug event doesn't affect the current cpuset, there is no point to call hotplug_update_tasks() or hotplug_update_tasks_legacy(). So just skip it. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-29Merge branch 'for-6.3-fixes' into for-6.4Tejun Heo2-5/+11
To receive 292fd843de26 ("cgroup/cpuset: Fix partition root's cpuset.cpus update bug") in preparation for further cpuset updates.
2023-03-29cgroup/cpuset: Fix partition root's cpuset.cpus update bugWaiman Long1-2/+10
It was found that commit 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") introduced a bug that corrupted "cpuset.cpus" of a partition root when it was updated. It is because the tmp->new_cpus field of the passed tmp parameter of update_parent_subparts_cpumask() should not be used at all as it contains important cpumask data that should not be overwritten. Fix it by using tmp->addmask instead. Also update update_cpumask() to make sure that trialcs->cpu_allowed will not be corrupted until it is no longer needed. Fixes: 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") Signed-off-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v6.2+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-23cpuset: Clean up cpuset_node_allowedHaifeng Xu1-2/+2
Commit 002f290627c2 ("cpuset: use static key better and convert to new API") has used __cpuset_node_allowed() instead of cpuset_node_allowed() to check whether we can allocate on a memory node. Now this function isn't used by anyone, so we can do the follow things to clean up it. 1. remove unused codes 2. rename __cpuset_node_allowed() to cpuset_node_allowed() 3. update comments in mm/page_alloc.c Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17cgroup: fix display of forceidle time at rootJosh Don1-3/+1
We need to reset forceidle_sum to 0 when reading from root, since the bstat we accumulate into is stack allocated. To make this more robust, just replace the existing cputime reset with a memset of the overall bstat. Signed-off-by: Josh Don <joshdon@google.com> Fixes: 1fcf54deb767 ("sched/core: add forced idle accounting for cgroups") Cc: stable@vger.kernel.org # v6.0+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-17cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappersKamalesh Babulal2-38/+38
Replace mutex_[un]lock() with cgroup_[un]lock() wrappers to stay consistent across cgroup core and other subsystem code, while operating on the cgroup_mutex. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-14cgroup: Make current_cgns_cgroup_dfl() safe to call after exit_task_namespace()Tejun Heo1-2/+12
The commit 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc") added bpf_cgroup_from_id() which calls current_cgns_cgroup_dfl() through cgroup_get_from_id(). However, BPF programs may be attached to a point where current->nsproxy has already been cleared to NULL by exit_task_namespace() and calling bpf_cgroup_from_id() would cause an oops. Just return the system-wide root if nsproxy has been cleared. This allows all cgroups to be looked up after the task passed through exit_task_namespace(), which semantically makes sense. Given that the only way to get this behavior is through BPF programs, it seems safe but let's see what others think. Fixes: 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc") Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/ZBDuVWiFj2jiz3i8@slm.duckdns.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-21Merge tag 'net-next-6.3' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
2023-02-21Merge tag 'cgroup-for-6.3' of ↵Linus Torvalds1-7/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "All the changes are trivial: documentation updates and a trivial code cleanup" * tag 'cgroup-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: fix a few kernel-doc warnings & coding style docs: cgroup-v1: use numbered lists for user interface setup docs: cgroup-v1: add internal cross-references docs: cgroup-v1: make swap extension subsections subsections docs: cgroup-v1: use bullet lists for list of stat file tables docs: cgroup-v1: move hierarchy of accounting caption docs: cgroup-v1: fix footnotes docs: cgroup-v1: use code block for locking order schema docs: cgroup-v1: wrap remaining admonitions in admonition blocks docs: cgroup-v1: replace custom note constructs with appropriate admonition blocks cgroup/cpuset: no need to explicitly init a global static variable
2023-02-20Merge tag 'fs.idmapped.v6.3' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs idmapping updates from Christian Brauner: - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastucture in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs. This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap. Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments. Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings. We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements. In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs. - Enable idmapped mounts for tmpfs and fulfill a longstanding request. A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this. However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up. As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests. * tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits) shmem: support idmapped mounts for tmpfs fs: move mnt_idmap fs: port vfs{g,u}id helpers to mnt_idmap fs: port fs{g,u}id helpers to mnt_idmap fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap fs: port i_{g,u}id_{needs_}update() to mnt_idmap quota: port to mnt_idmap fs: port privilege checking helpers to mnt_idmap fs: port inode_owner_or_capable() to mnt_idmap fs: port inode_init_owner() to mnt_idmap fs: port acl to mnt_idmap fs: port xattr to mnt_idmap fs: port ->permission() to pass mnt_idmap fs: port ->fileattr_set() to pass mnt_idmap fs: port ->set_acl() to pass mnt_idmap fs: port ->get_acl() to pass mnt_idmap fs: port ->tmpfile() to pass mnt_idmap fs: port ->rename() to pass mnt_idmap fs: port ->mknod() to pass mnt_idmap fs: port ->mkdir() to pass mnt_idmap ...
2023-02-10Daniel Borkmann says:Jakub Kicinski1-2/+2
==================== pull-request: bpf-next 2023-02-11 We've added 96 non-merge commits during the last 14 day(s) which contain a total of 152 files changed, 4884 insertions(+), 962 deletions(-). There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c between commit 5b246e533d01 ("ice: split probe into smaller functions") from the net-next tree and commit 66c0e13ad236 ("drivers: net: turn on XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev() is otherwise there a 2nd time, and add XDP features to the existing ice_cfg_netdev() one: [...] ice_set_netdev_features(netdev); netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT | NETDEV_XDP_ACT_XSK_ZEROCOPY; ice_set_ops(netdev); [...] Stephen's merge conflict mail: https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/ The main changes are: 1) Add support for BPF trampoline on s390x which finally allows to remove many test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich. 2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski. 3) Add capability to export the XDP features supported by the NIC. Along with that, add a XDP compliance test tool, from Lorenzo Bianconi & Marek Majtyka. 4) Add __bpf_kfunc tag for marking kernel functions as kfuncs, from David Vernet. 5) Add a deep dive documentation about the verifier's register liveness tracking algorithm, from Eduard Zingerman. 6) Fix and follow-up cleanups for resolve_btfids to be compiled as a host program to avoid cross compile issues, from Jiri Olsa & Ian Rogers. 7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted when testing on different NICs, from Jesper Dangaard Brouer. 8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang. 9) Extend libbpf to add an option for when the perf buffer should wake up, from Jon Doron. 10) Follow-up fix on xdp_metadata selftest to just consume on TX completion, from Stanislav Fomichev. 11) Extend the kfuncs.rst document with description on kfunc lifecycle & stability expectations, from David Vernet. 12) Fix bpftool prog profile to skip attaching to offline CPUs, from Tonghao Zhang. ==================== Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-06cpuset: Call set_cpus_allowed_ptr() with appropriate mask for taskWill Deacon1-7/+11
set_cpus_allowed_ptr() will fail with -EINVAL if the requested affinity mask is not a subset of the task_cpu_possible_mask() for the task being updated. Consequently, on a heterogeneous system with cpusets spanning the different CPU types, updates to the cgroup hierarchy can silently fail to update task affinities when the effective affinity mask for the cpuset is expanded. For example, consider an arm64 system with 4 CPUs, where CPUs 2-3 are the only cores capable of executing 32-bit tasks. Attaching a 32-bit task to a cpuset containing CPUs 0-2 will correctly affine the task to CPU 2. Extending the cpuset to CPUs 0-3, however, will fail to extend the affinity mask of the 32-bit task because update_tasks_cpumask() will pass the full 0-3 mask to set_cpus_allowed_ptr(). Extend update_tasks_cpumask() to take a temporary 'cpumask' paramater and use it to mask the 'effective_cpus' mask with the possible mask for each task being updated. Fixes: 431c69fac05b ("cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()") Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-02-06cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top ↵Waiman Long1-2/+25
cpuset tasks Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask"), relax_compatible_cpus_allowed_ptr() is calling __sched_setaffinity() unconditionally. This helps to expose a bug in the current cpuset hotplug code where the cpumasks of the tasks in the top cpuset are not updated at all when some CPUs become online or offline. It is likely caused by the fact that some of the tasks in the top cpuset, like percpu kthreads, cannot have their cpu affinity changed. One way to reproduce this as suggested by Peter is: - boot machine - offline all CPUs except one - taskset -p ffffffff $$ - online all CPUs Fix this by allowing cpuset_cpus_allowed() to return a wider mask that includes offline CPUs for those tasks that are in the top cpuset. For tasks not in the top cpuset, the old rule applies and only online CPUs will be returned in the mask since hotplug events will update their cpumasks accordingly. Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask") Reported-by: Will Deacon <will@kernel.org> Originally-from: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Will Deacon <will@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-02-02bpf: Add __bpf_kfunc tag to all kfuncsDavid Vernet1-2/+2
Now that we have the __bpf_kfunc tag, we should use add it to all existing kfuncs to ensure that they'll never be elided in LTO builds. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com
2023-01-31cgroup/cpuset: Fix wrong check in update_parent_subparts_cpumask()Waiman Long1-1/+2
It was found that the check to see if a partition could use up all the cpus from the parent cpuset in update_parent_subparts_cpumask() was incorrect. As a result, it is possible to leave parent with no effective cpu left even if there are tasks in the parent cpuset. This can lead to system panic as reported in [1]. Fix this probem by updating the check to fail the enabling the partition if parent's effective_cpus is a subset of the child's cpus_allowed. Also record the error code when an error happens in update_prstate() and add a test case where parent partition and child have the same cpu list and parent has task. Enabling partition in the child will fail in this case. [1] https://www.spinics.net/lists/cgroups/msg36254.html Fixes: f0af1bfc27b5 ("cgroup/cpuset: Relax constraints to partition & cpus changes") Cc: stable@vger.kernel.org # v6.1 Reported-by: Srinivas Pandruvada <srinivas.pandruvada@intel.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-19fs: port ->permission() to pass mnt_idmapChristian Brauner1-1/+1
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-10cgroup/cpuset: fix a few kernel-doc warnings & coding styleRandy Dunlap1-5/+8
Fix kernel-doc notation warnings: kernel/cgroup/cpuset.c:1309: warning: Excess function parameter 'cpuset' description in 'update_parent_subparts_cpumask' kernel/cgroup/cpuset.c:3909: warning: expecting prototype for cpuset_mem_spread_node(). Prototype was for cpuset_spread_node() instead Also drop a blank line before EXPORT_SYMBOL_GPL() to be consistent with kernel coding style. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Waiman Long <longman@redhat.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-04cgroup/cpuset: no need to explicitly init a global static variableDaniel Vacek1-2/+0
cpuset_rwsem is a static variable defined with DEFINE_STATIC_PERCPU_RWSEM(). It's initialized at build time and so there's no need for explicit runtime init leaking one percpu int. Signed-off-by: Daniel Vacek <neelx@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-12-13Merge tag 'mm-stable-2022-12-13' of ↵Linus Torvalds1-6/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - More userfaultfs work from Peter Xu - Several convert-to-folios series from Sidhartha Kumar and Huang Ying - Some filemap cleanups from Vishal Moola - David Hildenbrand added the ability to selftest anon memory COW handling - Some cpuset simplifications from Liu Shixin - Addition of vmalloc tracing support by Uladzislau Rezki - Some pagecache folioifications and simplifications from Matthew Wilcox - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages - Peter Xu utilized the PTE marker code for handling swapin errors - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand - zram support for multiple compression streams from Sergey Senozhatsky - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations - Vishal Moola removed the try_to_release_page() wrapper - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache - David Hildenbrand did some cleanup and repair work on KSM COW breaking - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range() - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages() - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting - David Hildenbrand has fixed some of our MM selftests for 32-bit machines - Many singleton patches, as usual * tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits) mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio mm: mmu_gather: allow more than one batch of delayed rmaps mm: fix typo in struct pglist_data code comment kmsan: fix memcpy tests mm: add cond_resched() in swapin_walk_pmd_entry() mm: do not show fs mm pc for VM_LOCKONFAULT pages selftests/vm: ksm_functional_tests: fixes for 32bit selftests/vm: cow: fix compile warning on 32bit selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem mm,thp,rmap: fix races between updates of subpages_mapcount mm: memcg: fix swapcached stat accounting mm: add nodes= arg to memory.reclaim mm: disable top-tier fallback to reclaim on proactive reclaim selftests: cgroup: make sure reclaim target memcg is unprotected selftests: cgroup: refactor proactive reclaim code to reclaim_until() mm: memcg: fix stale protection of reclaim target memcg mm/mmap: properly unaccount memory on mas_preallocate() failure omfs: remove ->writepage jfs: remove ->writepage ...
2022-12-13Merge tag 'net-next-6.2' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Allow live renaming when an interface is up - Add retpoline wrappers for tc, improving considerably the performances of complex queue discipline configurations - Add inet drop monitor support - A few GRO performance improvements - Add infrastructure for atomic dev stats, addressing long standing data races - De-duplicate common code between OVS and conntrack offloading infrastructure - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements - Netfilter: introduce packet parser for tunneled packets - Replace IPVS timer-based estimators with kthreads to scale up the workload with the number of available CPUs - Add the helper support for connection-tracking OVS offload BPF: - Support for user defined BPF objects: the use case is to allocate own objects, build own object hierarchies and use the building blocks to build own data structures flexibly, for example, linked lists in BPF - Make cgroup local storage available to non-cgroup attached BPF programs - Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers - A relevant bunch of BPF verifier fixes and improvements - Veristat tool improvements to support custom filtering, sorting, and replay of results - Add LLVM disassembler as default library for dumping JITed code - Lots of new BPF documentation for various BPF maps - Add bpf_rcu_read_{,un}lock() support for sleepable programs - Add RCU grace period chaining to BPF to wait for the completion of access from both sleepable and non-sleepable BPF programs - Add support storing struct task_struct objects as kptrs in maps - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer values - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions Protocols: - TCP: implement Protective Load Balancing across switch links - TCP: allow dynamically disabling TCP-MD5 static key, reverting back to fast[er]-path - UDP: Introduce optional per-netns hash lookup table - IPv6: simplify and cleanup sockets disposal - Netlink: support different type policies for each generic netlink operation - MPTCP: add MSG_FASTOPEN and FastOpen listener side support - MPTCP: add netlink notification support for listener sockets events - SCTP: add VRF support, allowing sctp sockets binding to VRF devices - Add bridging MAC Authentication Bypass (MAB) support - Extensions for Ethernet VPN bridging implementation to better support multicast scenarios - More work for Wi-Fi 7 support, comprising conversion of all the existing drivers to internal TX queue usage - IPSec: introduce a new offload type (packet offload) allowing complete header processing and crypto offloading - IPSec: extended ack support for more descriptive XFRM error reporting - RXRPC: increase SACK table size and move processing into a per-local endpoint kernel thread, reducing considerably the required locking - IEEE 802154: synchronous send frame and extended filtering support, initial support for scanning available 15.4 networks - Tun: bump the link speed from 10Mbps to 10Gbps - Tun/VirtioNet: implement UDP segmentation offload support Driver API: - PHY/SFP: improve power level switching between standard level 1 and the higher power levels - New API for netdev <-> devlink_port linkage - PTP: convert existing drivers to new frequency adjustment implementation - DSA: add support for rx offloading - Autoload DSA tagging driver when dynamically changing protocol - Add new PCP and APPTRUST attributes to Data Center Bridging - Add configuration support for 800Gbps link speed - Add devlink port function attribute to enable/disable RoCE and migratable - Extend devlink-rate to support strict prioriry and weighted fair queuing - Add devlink support to directly reading from region memory - New device tree helper to fetch MAC address from nvmem - New big TCP helper to simplify temporary header stripping New hardware / drivers: - Ethernet: - Marvel Octeon CNF95N and CN10KB Ethernet Switches - Marvel Prestera AC5X Ethernet Switch - WangXun 10 Gigabit NIC - Motorcomm yt8521 Gigabit Ethernet - Microchip ksz9563 Gigabit Ethernet Switch - Microsoft Azure Network Adapter - Linux Automation 10Base-T1L adapter - PHY: - Aquantia AQR112 and AQR412 - Motorcomm YT8531S - PTP: - Orolia ART-CARD - WiFi: - MediaTek Wi-Fi 7 (802.11be) devices - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB devices - Bluetooth: - Broadcom BCM4377/4378/4387 Bluetooth chipsets - Realtek RTL8852BE and RTL8723DS - Cypress.CYW4373A0 WiFi + Bluetooth combo device Drivers: - CAN: - gs_usb: bus error reporting support - kvaser_usb: listen only and bus error reporting support - Ethernet NICs: - Intel (100G): - extend action skbedit to RX queue mapping - implement devlink-rate support - support direct read from memory - nVidia/Mellanox (mlx5): - SW steering improvements, increasing rules update rate - Support for enhanced events compression - extend H/W offload packet manipulation capabilities - implement IPSec packet offload mode - nVidia/Mellanox (mlx4): - better big TCP support - Netronome Ethernet NICs (nfp): - IPsec offload support - add support for multicast filter - Broadcom: - RSS and PTP support improvements - AMD/SolarFlare: - netlink extened ack improvements - add basic flower matches to offload, and related stats - Virtual NICs: - ibmvnic: introduce affinity hint support - small / embedded: - FreeScale fec: add initial XDP support - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood - TI am65-cpsw: add suspend/resume support - Mediatek MT7986: add RX wireless wthernet dispatch support - Realtek 8169: enable GRO software interrupt coalescing per default - Ethernet high-speed switches: - Microchip (sparx5): - add support for Sparx5 TC/flower H/W offload via VCAP - Mellanox mlxsw: - add 802.1X and MAC Authentication Bypass offload support - add ip6gre support - Embedded Ethernet switches: - Mediatek (mtk_eth_soc): - improve PCS implementation, add DSA untag support - enable flow offload support - Renesas: - add rswitch R-Car Gen4 gPTP support - Microchip (lan966x): - add full XDP support - add TC H/W offload via VCAP - enable PTP on bridge interfaces - Microchip (ksz8): - add MTU support for KSZ8 series - Qualcomm 802.11ax WiFi (ath11k): - support configuring channel dwell time during scan - MediaTek WiFi (mt76): - enable Wireless Ethernet Dispatch (WED) offload support - add ack signal support - enable coredump support - remain_on_channel support - Intel WiFi (iwlwifi): - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities - 320 MHz channels support - RealTek WiFi (rtw89): - new dynamic header firmware format support - wake-over-WLAN support" * tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits) ipvs: fix type warning in do_div() on 32 bit net: lan966x: Remove a useless test in lan966x_ptp_add_trap() net: ipa: add IPA v4.7 support dt-bindings: net: qcom,ipa: Add SM6350 compatible bnxt: Use generic HBH removal helper in tx path IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver selftests: forwarding: Add bridge MDB test selftests: forwarding: Rename bridge_mdb test bridge: mcast: Support replacement of MDB port group entries bridge: mcast: Allow user space to specify MDB entry routing protocol bridge: mcast: Allow user space to add (*, G) with a source list and filter mode bridge: mcast: Add support for (*, G) with a source list and filter mode bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source bridge: mcast: Add a flag for user installed source entries bridge: mcast: Expose __br_multicast_del_group_src() bridge: mcast: Expose br_multicast_new_group_src() bridge: mcast: Add a centralized error path bridge: mcast: Place netlink policy before validation functions bridge: mcast: Split (*, G) and (S, G) addition into different functions bridge: mcast: Do not derive entry type from its filter mode ...
2022-12-12Merge tag 'cgroup-for-6.2' of ↵Linus Torvalds2-11/+45
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing too interesting: - Add CONFIG_DEBUG_GROUP_REF which makes cgroup refcnt operations kprobable - A couple cpuset optimizations - Other misc changes including doc and test updates" * tag 'cgroup-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: remove rcu_read_lock()/rcu_read_unlock() in critical section of spin_lock_irq() cgroup/cpuset: Improve cpuset_css_alloc() description kselftest/cgroup: Add cleanup() to test_cpuset_prs.sh cgroup/cpuset: Optimize cpuset_attach() on v2 cgroup/cpuset: Skip spread flags update on v2 kselftest/cgroup: Fix gathering number of CPUs cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REF cgroup: Implement DEBUG_CGROUP_REF
2022-12-09Merge branch 'mm-hotfixes-stable' into mm-stableAndrew Morton1-1/+0
2022-12-09memcg: fix possible use-after-free in memcg_write_event_control()Tejun Heo1-1/+0
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+0
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-08memcg: Fix possible use-after-free in memcg_write_event_control()Tejun Heo1-1/+0
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Cc: stable@kernel.org # v3.14+ Reported-by: Jann Horn <jannh@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-11-23cgroup: remove rcu_read_lock()/rcu_read_unlock() in critical section of ↵Ran Tian1-2/+0
spin_lock_irq() spin_lock_irq() already disable preempt, so remove rcu_read_lock(). Signed-off-by: Ran Tian <tianran_trtr@163.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-11-22cgroup/cpuset: Improve cpuset_css_alloc() descriptionKamalesh Babulal1-4/+8
Change the function argument in the description of cpuset_css_alloc() from 'struct cgroup' -> 'struct cgroup_subsys_state'. The change to the argument type was introduced by commit eb95419b023a ("cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methods"). Also, add more information to its description. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Joel Savitz <jsavitz@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-11-14cgroup/cpuset: Optimize cpuset_attach() on v2Waiman Long1-1/+23
It was found that with the default hierarchy, enabling cpuset in the child cgroups can trigger a cpuset_attach() call in each of the child cgroups that have tasks with no change in effective cpus and mems. If there are many processes in those child cgroups, it will burn quite a lot of cpu cycles iterating all the tasks without doing useful work. Optimizing this case by comparing between the old and new cpusets and skip useless update if there is no change in effective cpus and mems. Also mems_allowed are less likely to be changed than cpus_allowed. So skip changing mm if there is no change in effective_mems and CS_MEMORY_MIGRATE is not set. By inserting some instrumentation code and running a simple command in a container 200 times in a cgroup v2 system, it was found that all the cpuset_attach() calls are skipped (401 times in total) as there was no change in effective cpus and mems. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-11-14cgroup/cpuset: Skip spread flags update on v2Waiman Long1-4/+8
Cpuset v2 has no spread flags to set. So we can skip spread flags update if cpuset v2 is being used. Also change the name to cpuset_update_task_spread_flags() to indicate that there are multiple spread flags. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-11-08memory: move hotplug memory notifier priority to same file for easy sortingLiu Shixin1-1/+1
The priority of hotplug memory callback is defined in a different file. And there are some callers using numbers directly. Collect them together into include/linux/memory.h for easy reading. This allows us to sort their priorities more intuitively without additional comments. Link: https://lkml.kernel.org/r/20220923033347.3935160-9-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08cgroup/cpuset: use hotplug_memory_notifier() directlyLiu Shixin1-6/+1
Patch series "mm: Use hotplug_memory_notifier() instead of register_hotmemory_notifier()", v4. Commit f02c69680088 ("include/linux/memory.h: implement register_hotmemory_notifier()") introduced register_hotmemory_notifier() to avoid a compile problem with gcc-4.4.4: When CONFIG_MEMORY_HOTPLUG=n, we don't want the memory-hotplug notifier handlers to be included in the .o files, for space reasons. The existing hotplug_memory_notifier() tries to handle this but testing with gcc-4.4.4 shows that it doesn't work - the hotplug functions are still present in the .o files. Since commit 76ae847497bc52 ("Documentation: raise minimum supported version of GCC to 5.1") has already updated the minimum gcc version to 5.1. The previous problem mentioned in f02c69680088 does not exist. So we can now revert to use hotplug_memory_notifier() directly rather than register_hotmemory_notifier(). In the last patch, we move all hotplug memory notifier priority to same file for easy sorting. This patch (of 8): Commit 76ae847497bc52 ("Documentation: raise minimum supported version of GCC to 5.1") updated the minimum gcc version to 5.1. So the problem mentioned in f02c69680088 ("include/linux/memory.h: implement register_hotmemory_notifier()") no longer exist. So we can now switch to use hotplug_memory_notifier() directly rather than register_hotmemory_notifier(). Link: https://lkml.kernel.org/r/20220923033347.3935160-1-liushixin2@huawei.com Link: https://lkml.kernel.org/r/20220923033347.3935160-2-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-02Merge tag 'for-netdev' of ↵Jakub Kicinski1-0/+1
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2022-11-02 We've added 70 non-merge commits during the last 14 day(s) which contain a total of 96 files changed, 3203 insertions(+), 640 deletions(-). The main changes are: 1) Make cgroup local storage available to non-cgroup attached BPF programs such as tc BPF ones, from Yonghong Song. 2) Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers, from Martin KaFai Lau. 3) Add LLVM disassembler as default library for dumping JITed code in bpftool, from Quentin Monnet. 4) Various kprobe_multi_link fixes related to kernel modules, from Jiri Olsa. 5) Optimize x86-64 JIT with emitting BMI2-based shift instructions, from Jie Meng. 6) Improve BPF verifier's memory type compatibility for map key/value arguments, from Dave Marchevsky. 7) Only create mmap-able data section maps in libbpf when data is exposed via skeletons, from Andrii Nakryiko. 8) Add an autoattach option for bpftool to load all object assets, from Wang Yufen. 9) Various memory handling fixes for libbpf and BPF selftests, from Xu Kuohai. 10) Initial support for BPF selftest's vmtest.sh on arm64, from Manu Bretelle. 11) Improve libbpf's BTF handling to dedup identical structs, from Alan Maguire. 12) Add BPF CI and denylist documentation for BPF selftests, from Daniel Müller. 13) Check BPF cpumap max_entries before doing allocation work, from Florian Lehner. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (70 commits) samples/bpf: Fix typo in README bpf: Remove the obsolte u64_stats_fetch_*_irq() users. bpf: check max_entries before allocating memory bpf: Fix a typo in comment for DFS algorithm bpftool: Fix spelling mistake "disasembler" -> "disassembler" selftests/bpf: Fix bpftool synctypes checking failure selftests/bpf: Panic on hard/soft lockup docs/bpf: Add documentation for new cgroup local storage selftests/bpf: Add test cgrp_local_storage to DENYLIST.s390x selftests/bpf: Add selftests for new cgroup local storage selftests/bpf: Fix test test_libbpf_str/bpf_map_type_str bpftool: Support new cgroup local storage libbpf: Support new cgroup local storage bpf: Implement cgroup storage available to non-cgroup-attached bpf progs bpf: Refactor some inode/task/sk storage functions for reuse bpf: Make struct cgroup btf id global selftests/bpf: Tracing prog can still do lookup under busy lock selftests/bpf: Ensure no task storage failure for bpf_lsm.s prog due to deadlock detection bpf: Add new bpf_task_storage_delete proto with no deadlock detection bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check ... ==================== Link: https://lore.kernel.org/r/20221102062120.5724-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-31cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REFTejun Heo1-0/+1
6ab428604f72 ("cgroup: Implement DEBUG_CGROUP_REF") added a config option which forces cgroup refcnt functions to be not inlined so that they can be kprobed for debugging. However, it forgot export them when the config is enabled breaking modules which make use of css reference counting. Fix it by adding CGROUP_REF_EXPORT() macro to cgroup_refcnt.h which is defined to EXPORT_SYMBOL_GPL when CONFIG_DEBUG_CGROUP_REF is set. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 6ab428604f72 ("cgroup: Implement DEBUG_CGROUP_REF")
2022-10-28cgroup: Implement DEBUG_CGROUP_REFTejun Heo1-0/+5
It's really difficult to debug when cgroup or css refs leak. Let's add a debug option to force the refcnt function to not be inlined so that they can be kprobed for debugging. Signed-off-by: Tejun Heo <tj@kernel.org>
2022-10-25bpf: Implement cgroup storage available to non-cgroup-attached bpf progsYonghong Song1-0/+1
Similar to sk/inode/task storage, implement similar cgroup local storage. There already exists a local storage implementation for cgroup-attached bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper bpf_get_local_storage(). But there are use cases such that non-cgroup attached bpf progs wants to access cgroup local storage data. For example, tc egress prog has access to sk and cgroup. It is possible to use sk local storage to emulate cgroup local storage by storing data in socket. But this is a waste as it could be lots of sockets belonging to a particular cgroup. Alternatively, a separate map can be created with cgroup id as the key. But this will introduce additional overhead to manipulate the new map. A cgroup local storage, similar to existing sk/inode/task storage, should help for this use case. The life-cycle of storage is managed with the life-cycle of the cgroup struct. i.e. the storage is destroyed along with the owning cgroup with a call to bpf_cgrp_storage_free() when cgroup itself is deleted. The userspace map operations can be done by using a cgroup fd as a key passed to the lookup, update and delete operations. Typically, the following code is used to get the current cgroup: struct task_struct *task = bpf_get_current_task_btf(); ... task->cgroups->dfl_cgrp ... and in structure task_struct definition: struct task_struct { .... struct css_set __rcu *cgroups; .... } With sleepable program, accessing task->cgroups is not protected by rcu_read_lock. So the current implementation only supports non-sleepable program and supporting sleepable program will be the next step together with adding rcu_read_lock protection for rcu tagged structures. Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used for cgroup storage available to non-cgroup-attached bpf programs. The old cgroup storage supports bpf_get_local_storage() helper to get the cgroup data. The new cgroup storage helper bpf_cgrp_storage_get() can provide similar functionality. While old cgroup storage pre-allocates storage memory, the new mechanism can also pre-allocate with a user space bpf_map_update_elem() call to avoid potential run-time memory allocation failure. Therefore, the new cgroup storage can provide all functionality w.r.t. the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can be deprecated since the new one can provide the same functionality. Acked-by: David Vernet <void@manifault.com> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-17Merge tag 'cgroup-for-6.1-rc1-fixes' of ↵Linus Torvalds1-20/+79
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix a recent regression where a sleeping kernfs function is called with css_set_lock (spinlock) held - Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file() Multiple users assume that the lookup only works for cgroup2 and breaks when fed a cgroup1 file. Instead, introduce a separate set of functions to lookup both v1 and v2 and use them where the user explicitly wants to support both versions. - Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c. - Add Josef Bacik as a blkcg maintainer. * tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Update MAINTAINERS entry mm: cgroup: fix comments for get from fd/file helpers perf stat: Support old kernels for bperf cgroup counting bpf: cgroup_iter: support cgroup1 using cgroup fd cgroup: add cgroup_v1v2_get_from_[fd/file]() Revert "cgroup: enable cgroup_get_from_file() on cgroup1" cgroup: Reorganize css_set_lock and kernfs path processing
2022-10-14Merge tag 'sched-psi-2022-10-14' of ↵Linus Torvalds1-9/+95
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull PSI updates from Ingo Molnar: - Various performance optimizations, resulting in a 4%-9% speedup in the mmtests/config-scheduler-perfpipe micro-benchmark. - New interface to turn PSI on/off on a per cgroup level. * tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/psi: Per-cgroup PSI accounting disable/re-enable interface sched/psi: Cache parent psi_group to speed up group iteration sched/psi: Consolidate cgroup_psi() sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure sched/psi: Remove NR_ONCPU task accounting sched/psi: Optimize task switch inside shared cgroups again sched/psi: Move private helpers to sched/stats.h sched/psi: Save percpu memory when !psi_cgroups_enabled sched/psi: Don't create cgroup PSI files when psi_disabled sched/psi: Fix periodic aggregation shut off
2022-10-11mm: cgroup: fix comments for get from fd/file helpersYosry Ahmed1-1/+3
Fix the documentation comments for cgroup_[v1v2_]get_from_[fd/file](). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-10-11cgroup: add cgroup_v1v2_get_from_[fd/file]()Yosry Ahmed1-6/+44
Add cgroup_v1v2_get_from_fd() and cgroup_v1v2_get_from_file() that support both cgroup1 and cgroup2. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-10-10Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-10-10Revert "cgroup: enable cgroup_get_from_file() on cgroup1"Tejun Heo1-0/+5
This reverts commit f3a2aebdd6fb90e444d595e46de64e822af419da. The commit enabled looking up v1 cgroups via cgroup_get_from_file(). However, there are multiple users, including CLONE_INTO_CGROUP, which have been assuming that it would only look up v2 cgroups. Returning v1 cgroups breaks them. Let's revert the commit and retry later with a separate lookup interface which allows both v1 and v2. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/000000000000385cbf05ea3f1862@google.com Cc: Yosry Ahmed <yosryahmed@google.com>
2022-10-10cgroup: Reorganize css_set_lock and kernfs path processingMichal Koutný1-13/+27
The commit 74e4b956eb1c incorrectly wrapped kernfs_walk_and_get (might_sleep) under css_set_lock (spinlock). css_set_lock is needed by __cset_cgroup_from_root to ensure stable cset->cgrp_links but not for kernfs_walk_and_get. We only need to make sure that the returned root_cgrp won't be freed under us. This is given in the case of global root because it is static (cgrp_dfl_root.cgrp). When the root_cgrp is lower in the hierarchy, it is pinned by cgroup_ns->root_cset (and `current` task cannot switch namespace asynchronously so ns_proxy pins cgroup_ns). Note this reasoning won't hold for root cgroups in v1 hierarchies, therefore create a special-cased helper function just for the default hierarchy. Fixes: 74e4b956eb1c ("cgroup: Honor caller's cgroup NS when resolving path") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-10-10Merge tag 'cgroup-for-6.1' of ↵Linus Torvalds5-342/+663
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cpuset now support isolated cpus.partition type, which will enable dynamic CPU isolation - pids.peak added to remember the max number of pids used - holes in cgroup namespace plugged - internal cleanups * tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (25 commits) cgroup: use strscpy() is more robust and safer iocost_monitor: reorder BlkgIterator cgroup: simplify code in cgroup_apply_control cgroup: Make cgroup_get_from_id() prettier cgroup/cpuset: remove unreachable code cgroup: Remove CFTYPE_PRESSURE cgroup: Improve cftype add/rm error handling kselftest/cgroup: Add cpuset v2 partition root state test cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule cgroup/cpuset: Relocate a code block in validate_change() cgroup/cpuset: Show invalid partition reason string cgroup/cpuset: Add a new isolated cpus.partition type cgroup/cpuset: Relax constraints to partition & cpus changes cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective cgroup/cpuset: Miscellaneous cleanups & add helper functions cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset cgroup: add pids.peak interface for pids controller cgroup: Remove data-race around cgrp_dfl_visible cgroup: Fix build failure when CONFIG_SHRINKER_DEBUG ...
2022-10-10Merge tag 'sched-core-2022-10-07' of ↵Linus Torvalds1-15/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Debuggability: - Change most occurances of BUG_ON() to WARN_ON_ONCE() - Reorganize & fix TASK_ state comparisons, turn it into a bitmap - Update/fix misc scheduler debugging facilities Load-balancing & regular scheduling: - Improve the behavior of the scheduler in presence of lot of SCHED_IDLE tasks - in particular they should not impact other scheduling classes. - Optimize task load tracking, cleanups & fixes - Clean up & simplify misc load-balancing code Freezer: - Rewrite the core freezer to behave better wrt thawing and be simpler in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting all the fallout. Deadline scheduler: - Fix the DL capacity-aware code - Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period() - Relax/optimize locking in task_non_contending() Cleanups: - Factor out the update_current_exec_runtime() helper - Various cleanups, simplifications" * tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits) sched: Fix more TASK_state comparisons sched: Fix TASK_state comparisons sched/fair: Move call to list_last_entry() in detach_tasks sched/fair: Cleanup loop_max and loop_break sched/fair: Make sure to try to detach at least one movable task sched: Show PF_flag holes freezer,sched: Rewrite core freezer logic sched: Widen TAKS_state literals sched/wait: Add wait_event_state() sched/completion: Add wait_for_completion_state() sched: Add TASK_ANY for wait_task_inactive() sched: Change wait_task_inactive()s match_state freezer,umh: Clean up freezer/initrd interaction freezer: Have {,un}lock_system_sleep() save/restore flags sched: Rename task_running() to task_on_cpu() sched/fair: Cleanup for SIS_PROP sched/fair: Default to false in test_idle_cores() sched/fair: Remove useless check in select_idle_core() sched/fair: Avoid double search on same cpu sched/fair: Remove redundant check in select_idle_smt() ...
2022-10-07Merge tag 'driver-core-6.1-rc1' of ↵Linus Torvalds1-0/+20
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the big set of driver core and debug printk changes for 6.1-rc1. Included in here is: - dynamic debug updates for the core and the drm subsystem. The drm changes have all been acked by the relevant maintainers - kernfs fixes for syzbot reported problems - kernfs refactors and updates for cgroup requirements - magic number cleanups and removals from the kernel tree (they were not being used and they really did not actually do anything) - other tiny cleanups All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (74 commits) docs: filesystems: sysfs: Make text and code for ->show() consistent Documentation: NBD_REQUEST_MAGIC isn't a magic number a.out: restore CMAGIC device property: Add const qualifier to device_get_match_data() parameter drm_print: add _ddebug descriptor to drm_*dbg prototypes drm_print: prefer bare printk KERN_DEBUG on generic fn drm_print: optimize drm_debug_enabled for jump-label drm-print: add drm_dbg_driver to improve namespace symmetry drm-print.h: include dyndbg header drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro drm_print: interpose drm_*dbg with forwarding macros drm: POC drm on dyndbg - use in core, 2 helpers, 3 drivers. drm_print: condense enum drm_debug_category debugfs: use DEFINE_SHOW_ATTRIBUTE to define debugfs_regset32_fops driver core: use IS_ERR_OR_NULL() helper in device_create_groups_vargs() Documentation: ENI155_MAGIC isn't a magic number Documentation: NBD_REPLY_MAGIC isn't a magic number nbd: remove define-only NBD_MAGIC, previously magic number Documentation: FW_HEADER_MAGIC isn't a magic number Documentation: EEPROM_MAGIC_VALUE isn't a magic number ...
2022-09-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+4
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-26mm: multi-gen LRU: kill switchYu Zhao1-1/+0
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that can be disabled include: 0x0001: the multi-gen LRU core 0x0002: walking page table, when arch_has_hw_pte_young() returns true 0x0004: clearing the accessed bit in non-leaf PMD entries, when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y [yYnN]: apply to all the components above E.g., echo y >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0007 echo 5 >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0005 NB: the page table walks happen on the scale of seconds under heavy memory pressure, in which case the mmap_lock contention is a lesser concern, compared with the LRU lock contention and the I/O congestion. So far the only well-known case of the mmap_lock contention happens on Android, due to Scudo [1] which allocates several thousand VMAs for merely a few hundred MBs. The SPF and the Maple Tree also have provided their own assessments [2][3]. However, if walking page tables does worsen the mmap_lock contention, the kill switch can be used to disable it. In this case the multi-gen LRU will suffer a minor performance degradation, as shown previously. Clearing the accessed bit in non-leaf PMD entries can also be disabled, since this behavior was not tested on x86 varieties other than Intel and AMD. [1] https://source.android.com/devices/tech/debug/scudo [2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/ [3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/ Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-23cgroup: use strscpy() is more robust and saferye xingchen1-1/+1
The implementation of strscpy() is more robust and safer. That's now the recommended way to copy NUL terminated strings. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-23cgroup: simplify code in cgroup_apply_controlWilliam Dean1-5/+1
It could directly return 'cgroup_update_dfl_csses' to simplify code. Signed-off-by: William Dean <williamsukatube@163.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-23cgroup: Make cgroup_get_from_id() prettierTejun Heo1-9/+10
After merging 836ac87d ("cgroup: fix cgroup_get_from_id") into for-6.1, its combination with two commits in for-6.1 - 4534dee9 ("cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id") and fa7e439c ("cgroup: Homogenize cgroup_get_from_id() return value") - makes the gotos in the error handling path too ugly while not adding anything of value. All that the gotos are saving is one extra kernfs_put() call. Let's remove the gotos and perform error returns directly. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Michal Koutný <mkoutny@suse.com>
2022-09-23Merge branch 'for-6.0-fixes' into for-6.1Tejun Heo1-0/+4
for-6.0 has the following fix for cgroup_get_from_id(). 836ac87d ("cgroup: fix cgroup_get_from_id") which conflicts with the following two commits in for-6.1. 4534dee9 ("cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id") fa7e439c ("cgroup: Homogenize cgroup_get_from_id() return value") While the resolution is straightforward, the code ends up pretty ugly afterwards. Let's pull for-6.0-fixes into for-6.1 so that the code can be fixed up there. Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-23cgroup: cgroup_get_from_id() must check the looked-up kn is a directoryMing Lei1-1/+4
cgroup has to be one kernfs dir, otherwise kernel panic is caused, especially cgroup id is provide from userspace. Reported-by: Marco Patalano <mpatalan@redhat.com> Fixes: 6b658c4863c1 ("scsi: cgroup: Add cgroup_get_from_id()") Cc: Muneendra <muneendra.kumar@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Cc: stable@vger.kernel.org # v5.14+ Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-12Merge 6.0-rc5 into driver-core-nextGreg Kroah-Hartman3-26/+59
We need the driver core and debugfs changes in this branch. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-09sched/psi: Per-cgroup PSI accounting disable/re-enable interfaceChengming Zhou1-6/+64
PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This may cause non-negligible overhead for some workloads when under deep level of the hierarchy. commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") make PSI to skip per-cgroup stall accounting, only account system-wide to avoid this each level overhead. But for our use case, we also want leaf cgroup PSI stats accounted for userspace adjustment on that cgroup, apart from only system-wide adjustment. So this patch introduce a per-cgroup PSI accounting disable/re-enable interface "cgroup.pressure", which is a read-write single value file that allowed values are "0" and "1", the defaults is "1" so per-cgroup PSI stats is enabled by default. Implementation details: It should be relatively straight-forward to disable and re-enable state aggregation, time tracking, averaging on a per-cgroup level, if we can live with losing history from while it was disabled. I.e. the avgs will restart from 0, total= will have gaps. But it's hard or complex to stop/restart groupc->tasks[] updates, which is not implemented in this patch. So we always update groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when the cgroup PSI stats is disabled. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-09sched/psi: Consolidate cgroup_psi()Chengming Zhou1-5/+5
cgroup_psi() can't return psi_group for root cgroup, so we have many open code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi". This patch move cgroup_psi() definition to <linux/psi.h>, in which we can return psi_system for root cgroup, so can handle all cgroups. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-9-zhouchengming@bytedance.com
2022-09-09sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressureChengming Zhou1-0/+27
Now PSI already tracked workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have obvious impact on some workload productivity, such as web service workload. When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to CPU curr task's cgroups as PSI_IRQ_FULL status. Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in the current task on the CPU, make nothing productive could run even if it were runnable, so we only use PSI_IRQ_FULL. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
2022-09-09sched/psi: Don't create cgroup PSI files when psi_disabledChengming Zhou1-0/+3
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") make PSI can be configured to skip per-cgroup stall accounting. And doesn't expose PSI files in cgroup hierarchy. This patch do the same thing when psi_disabled. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-3-zhouchengming@bytedance.com
2022-09-09Merge branch 'driver-core/driver-core-next'Peter Zijlstra1-0/+20
Pull in dependent cgroup patches Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2022-09-07freezer,sched: Rewrite core freezer logicPeter Zijlstra1-15/+8
Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
2022-09-07cgroup/cpuset: remove unreachable codeJiapeng Chong1-2/+0
The function sched_partition_show cannot execute seq_puts, delete the invalid code. kernel/cgroup/cpuset.c:2849 sched_partition_show() warn: ignoring unreachable code. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2087 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-06Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextPaolo Abeni1-0/+48
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-09-05 The following pull-request contains BPF updates for your *net-next* tree. We've added 106 non-merge commits during the last 18 day(s) which contain a total of 159 files changed, 5225 insertions(+), 1358 deletions(-). There are two small merge conflicts, resolve them as follows: 1) tools/testing/selftests/bpf/DENYLIST.s390x Commit 27e23836ce22 ("selftests/bpf: Add lru_bug to s390x deny list") in bpf tree was needed to get BPF CI green on s390x, but it conflicted with newly added tests on bpf-next. Resolve by adding both hunks, result: [...] lru_bug # prog 'printk': failed to auto-attach: -524 setget_sockopt # attach unexpected error: -524 (trampoline) cb_refs # expected error message unexpected error: -524 (trampoline) cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc) htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline) [...] 2) net/core/filter.c Commit 1227c1771dd2 ("net: Fix data-races around sysctl_[rw]mem_(max|default).") from net tree conflicts with commit 29003875bd5b ("bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from bpf-next tree, result: [...] if (getopt) { if (optname == SO_BINDTODEVICE) return -EINVAL; return sk_getsockopt(sk, SOL_SOCKET, optname, KERNEL_SOCKPTR(optval), KERNEL_SOCKPTR(optlen)); } return sk_setsockopt(sk, SOL_SOCKET, optname, KERNEL_SOCKPTR(optval), *optlen); [...] The main changes are: 1) Add any-context BPF specific memory allocator which is useful in particular for BPF tracing with bonus of performance equal to full prealloc, from Alexei Starovoitov. 2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort to reuse the existing core socket code as much as possible, from Martin KaFai Lau. 3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani. 4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo. 5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF integration with the rstat framework, from Yosry Ahmed. 6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev. 7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao. 8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF backend, from James Hilliard. 9) Fix verifier helper permissions and reference state management for synchronous callbacks, from Kumar Kartikeya Dwivedi. 10) Add support for BPF selftest's xskxceiver to also be used against real devices that support MAC loopback, from Maciej Fijalkowski. 11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet. 12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu. 13) Various minor misc improvements all over the place. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits) bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc. bpf: Remove usage of kmem_cache from bpf_mem_cache. bpf: Remove prealloc-only restriction for sleepable bpf programs. bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs. bpf: Remove tracing program restriction on map types bpf: Convert percpu hash map to per-cpu bpf_mem_alloc. bpf: Add percpu allocation support to bpf_mem_alloc. bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU. bpf: Adjust low/high watermarks in bpf_mem_cache bpf: Optimize call_rcu in non-preallocated hash map. bpf: Optimize element count in non-preallocated hash map. bpf: Relax the requirement to use preallocated hash maps in tracing progs. samples/bpf: Reduce syscall overhead in map_perf_test. selftests/bpf: Improve test coverage of test_maps bpf: Convert hash map to bpf_mem_alloc. bpf: Introduce any context BPF specific memory allocator. selftest/bpf: Add test for bpf_getsockopt() bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt() bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt() bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt() ... ==================== Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-06cgroup: Remove CFTYPE_PRESSURETejun Heo1-27/+37
CFTYPE_PRESSURE is used to flag PSI related files so that they are not created if PSI is disabled during boot. It's a bit weird to use a generic flag to mark a specific file type. Let's instead move the PSI files into its own cftypes array and add/rm them conditionally. This is a bit more code but cleaner. No userland visible changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org>
2022-09-06cgroup: Improve cftype add/rm error handlingTejun Heo1-7/+20
Let's track whether a cftype is currently added or not using a new flag __CFTYPE_ADDED so that duplicate operations can be failed safely and consistently allow using empty cftypes. Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity ↵Waiman Long1-9/+60
rule Currently, changes in "cpust.cpus" of a partition root is not allowed if it violates the sibling cpu exclusivity rule when the check is done in the validate_change() function. That is inconsistent with the other cpuset changes that are always allowed but may make a partition invalid. Update the cpuset code to allow cpumask change even if it violates the sibling cpu exclusivity rule, but invalidate the partition instead just like the other changes. However, other sibling partitions with conflicting cpumask will also be invalidated in order to not violating the exclusivity rule. This behavior is specific to this partition rule violation. Note that a previous commit has made sibling cpu exclusivity rule check the last check of validate_change(). So if -EINVAL is returned, we can be sure that sibling cpu exclusivity rule violation is the only rule that is broken. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Relocate a code block in validate_change()Waiman Long1-16/+16
This patch moves down the exclusive cpu and memory check in validate_change(). There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Show invalid partition reason stringWaiman Long1-18/+75
There are a number of different reasons which can cause a partition to become invalid. A user seeing an invalid partition may not know exactly why. To help user to get a better understanding of the underlying reason, The cpuset.cpus.partition control file, when read, will now report the reason why a partition become invalid. When a partition does become invalid, reading the control file will show "root invalid (<reason>)" where <reason> is a string that describes why the partition is invalid. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Add a new isolated cpus.partition typeWaiman Long1-11/+63
Cpuset v1 uses the sched_load_balance control file to determine if load balancing should be enabled. Cpuset v2 gets rid of sched_load_balance as its use may require disabling load balancing at cgroup root. For workloads that require very low latency like DPDK, the latency jitters caused by periodic load balancing may exceed the desired latency limit. When cpuset v2 is in use, the only way to avoid this latency cost is to use the "isolcpus=" kernel boot option to isolate a set of CPUs. After the kernel boot, however, there is no way to add or remove CPUs from this isolated set. For workloads that are more dynamic in nature, that means users have to provision enough CPUs for the worst case situation resulting in excess idle CPUs. To address this issue for cpuset v2, a new cpuset.cpus.partition type "isolated" is added which allows the creation of a cpuset partition without load balancing. This will allow system administrators to dynamically adjust the size of isolated partition to the current need of the workload without rebooting the system. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Relax constraints to partition & cpus changesWaiman Long1-194/+215
Currently, enabling a partition root is only allowed if all the constraints of a valid partition are satisfied. Even changes to "cpuset.cpus" may not be allowed in some cases. Moreover, there are limits to changes made to a parent cpuset if it is a valid partition root. This is contrary to the general cgroup v2 philosophy. This patch relaxes the constraints of changing the state of "cpuset.cpus" and "cpuset.cpus.partition". Now all valid changes ("member" or "root") to "cpuset.cpus.partition" are allowed even if there are child cpusets underneath it. Trying to make a cpuset a partition root, however, will cause its state to become invalid if the following constraints of a valid partition root are not satisfied. 1) The "cpuset.cpus" is non-empty and exclusive. 2) The parent cpuset is a valid partition root. 3) The "cpuset.cpus" overlaps parent's "cpuset.cpus". Similarly, almost all changes to "cpuset.cpus" are allowed with the exception that if the underlying CS_CPU_EXCLUSIVE flag is set, the exclusivity rule will still apply. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effectiveWaiman Long1-25/+84
Currently, a partition root cannot have empty "cpuset.cpus.effective". As a result, a parent partition root cannot distribute out all its CPUs to child partitions with no CPUs left. However in most cases, there shouldn't be any tasks associated with intermediate nodes of the default hierarchy. So the current rule is too restrictive and can waste valuable CPU resource. To address this issue, we are now allowing a partition to have empty "cpuset.cpus.effective" as long as it has no task. Since cpuset is threaded, no-internal-process rule does not apply. So it is possible to have tasks in a partition root with child sub-partitions even though that should be uncommon. A parent partition with no task can now have all its CPUs distributed out to its child partitions. The top cpuset always have some house-keeping tasks running and so its list of effective cpu can't be empty. Once a partition with empty "cpuset.cpus.effective" is formed, no new task can be moved into it until "cpuset.cpus.effective" becomes non-empty. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>