aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
13 hoursMerge tag 'mm-stable-2024-05-24-11-49' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more mm updates from Andrew Morton: "Jeff Xu's implementation of the mseal() syscall" * tag 'mm-stable-2024-05-24-11-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: selftest mm/mseal read-only elf memory segment mseal: add documentation selftest mm/mseal memory sealing mseal: add mseal syscall mseal: wire up mseal syscall
30 hoursmseal: wire up mseal syscallJeff Xu1-0/+1
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed. To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results. The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples. Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook. Based on the latest upstream code: The first test (measuring time) syscall__ vmas t t_mseal delta_ns per_vma % munmap__ 1 909 944 35 35 104% munmap__ 2 1398 1502 104 52 107% munmap__ 4 2444 2594 149 37 106% munmap__ 8 4029 4323 293 37 107% munmap__ 16 6647 6935 288 18 104% munmap__ 32 11811 12398 587 18 105% mprotect 1 439 465 26 26 106% mprotect 2 1659 1745 86 43 105% mprotect 4 3747 3889 142 36 104% mprotect 8 6755 6969 215 27 103% mprotect 16 13748 14144 396 25 103% mprotect 32 27827 28969 1142 36 104% madvise_ 1 240 262 22 22 109% madvise_ 2 366 442 76 38 121% madvise_ 4 623 751 128 32 121% madvise_ 8 1110 1324 215 27 119% madvise_ 16 2127 2451 324 20 115% madvise_ 32 4109 4642 534 17 113% The second test (measuring cpu cycle) syscall__ vmas cpu cmseal delta_cpu per_vma % munmap__ 1 1790 1890 100 100 106% munmap__ 2 2819 3033 214 107 108% munmap__ 4 4959 5271 312 78 106% munmap__ 8 8262 8745 483 60 106% munmap__ 16 13099 14116 1017 64 108% munmap__ 32 23221 24785 1565 49 107% mprotect 1 906 967 62 62 107% mprotect 2 3019 3203 184 92 106% mprotect 4 6149 6569 420 105 107% mprotect 8 9978 10524 545 68 105% mprotect 16 20448 21427 979 61 105% mprotect 32 40972 42935 1963 61 105% madvise_ 1 434 497 63 63 115% madvise_ 2 752 899 147 74 120% madvise_ 4 1313 1513 200 50 115% madvise_ 8 2271 2627 356 44 116% madvise_ 16 4312 4883 571 36 113% madvise_ 32 8376 9319 943 29 111% Based on the result, for 6.8 kernel, sealing check adds 20-40 nano seconds, or around 50-100 CPU cycles, per VMA. In addition, I applied the sealing to 5.10 kernel: The first test (measuring time) syscall__ vmas t tmseal delta_ns per_vma % munmap__ 1 357 390 33 33 109% munmap__ 2 442 463 21 11 105% munmap__ 4 614 634 20 5 103% munmap__ 8 1017 1137 120 15 112% munmap__ 16 1889 2153 263 16 114% munmap__ 32 4109 4088 -21 -1 99% mprotect 1 235 227 -7 -7 97% mprotect 2 495 464 -30 -15 94% mprotect 4 741 764 24 6 103% mprotect 8 1434 1437 2 0 100% mprotect 16 2958 2991 33 2 101% mprotect 32 6431 6608 177 6 103% madvise_ 1 191 208 16 16 109% madvise_ 2 300 324 24 12 108% madvise_ 4 450 473 23 6 105% madvise_ 8 753 806 53 7 107% madvise_ 16 1467 1592 125 8 108% madvise_ 32 2795 3405 610 19 122% The second test (measuring cpu cycle) syscall__ nbr_vma cpu cmseal delta_cpu per_vma % munmap__ 1 684 715 31 31 105% munmap__ 2 861 898 38 19 104% munmap__ 4 1183 1235 51 13 104% munmap__ 8 1999 2045 46 6 102% munmap__ 16 3839 3816 -23 -1 99% munmap__ 32 7672 7887 216 7 103% mprotect 1 397 443 46 46 112% mprotect 2 738 788 50 25 107% mprotect 4 1221 1256 35 9 103% mprotect 8 2356 2429 72 9 103% mprotect 16 4961 4935 -26 -2 99% mprotect 32 9882 10172 291 9 103% madvise_ 1 351 380 29 29 108% madvise_ 2 565 615 49 25 109% madvise_ 4 872 933 61 15 107% madvise_ 8 1508 1640 132 16 109% madvise_ 16 3078 3323 245 15 108% madvise_ 32 5893 6704 811 25 114% For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30 CPU cycles, there is even decrease in some cases. It might be interesting to compare 5.10 and 6.8 kernel The first test (measuring time) syscall__ vmas t_5_10 t_6_8 delta_ns per_vma % munmap__ 1 357 909 552 552 254% munmap__ 2 442 1398 956 478 316% munmap__ 4 614 2444 1830 458 398% munmap__ 8 1017 4029 3012 377 396% munmap__ 16 1889 6647 4758 297 352% munmap__ 32 4109 11811 7702 241 287% mprotect 1 235 439 204 204 187% mprotect 2 495 1659 1164 582 335% mprotect 4 741 3747 3006 752 506% mprotect 8 1434 6755 5320 665 471% mprotect 16 2958 13748 10790 674 465% mprotect 32 6431 27827 21397 669 433% madvise_ 1 191 240 49 49 125% madvise_ 2 300 366 67 33 122% madvise_ 4 450 623 173 43 138% madvise_ 8 753 1110 357 45 147% madvise_ 16 1467 2127 660 41 145% madvise_ 32 2795 4109 1314 41 147% The second test (measuring cpu cycle) syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma % munmap__ 1 684 1790 1106 1106 262% munmap__ 2 861 2819 1958 979 327% munmap__ 4 1183 4959 3776 944 419% munmap__ 8 1999 8262 6263 783 413% munmap__ 16 3839 13099 9260 579 341% munmap__ 32 7672 23221 15549 486 303% mprotect 1 397 906 509 509 228% mprotect 2 738 3019 2281 1140 409% mprotect 4 1221 6149 4929 1232 504% mprotect 8 2356 9978 7622 953 423% mprotect 16 4961 20448 15487 968 412% mprotect 32 9882 40972 31091 972 415% madvise_ 1 351 434 82 82 123% madvise_ 2 565 752 186 93 133% madvise_ 4 872 1313 442 110 151% madvise_ 8 1508 2271 763 95 151% madvise_ 16 3078 4312 1234 77 140% madvise_ 32 5893 8376 2483 78 142% From 5.10 to 6.8 munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma. mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma. madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma. In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect. When I discuss the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measuring millions of mm syscall in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also this is tested using a single HW and ChromeOS, the data from another HW or distribution might be different. It might be best to take this data with a grain of salt. This patch (of 5): Wire up mseal syscall for all architectures. Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> [Bug #2] Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
37 hoursMerge tag 'trace-fixes-v6.10' of ↵Linus Torvalds3-13/+15
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "Minor last minute fixes: - Fix a very tight race between the ring buffer readers and resizing the ring buffer - Correct some stale comments in the ring buffer code - Fix kernel-doc in the rv code - Add a MODULE_DESCRIPTION to preemptirq_delay_test" * tag 'trace-fixes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Update rv_en(dis)able_monitor doc to match kernel-doc tracing: Add MODULE_DESCRIPTION() to preemptirq_delay_test ring-buffer: Fix a race between readers and resize checks ring-buffer: Correct stale comments related to non-consuming readers
37 hoursMerge tag 'trace-assign-str-v6.10' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing cleanup from Steven Rostedt: "Remove second argument of __assign_str() The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so that it no longer needs the second argument. The __assign_str() is always matched with __string() field that takes a field name and the source for that field: __string(field, source) The TRACE_EVENT() macro logic will save off the source value and then use that value to copy into the ring buffer via the __assign_str(). Before commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string"), the __assign_str() needed the second argument which would perform the same logic as the __string() source parameter did. Not only would this add overhead, but it was error prone as if the __assign_str() source produced something different, it may not have allocated enough for the string in the ring buffer (as the __string() source was used to determine how much to allocate) Now that the __assign_str() just uses the same string that was used in __string() it no longer needs the source parameter. It can now be removed" * tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/treewide: Remove second parameter of __assign_str()
38 hoursMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds3-26/+36
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-net is finally supported in vduse - virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-pci: Check if is_avq is NULL virtio: delete vq in vp_find_vqs_msix() when request_irq() fails MAINTAINERS: add Eugenio Pérez as reviewer vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API vp_vdpa: don't allocate unused msix vectors sound: virtio: drop owner assignment fuse: virtio: drop owner assignment scsi: virtio: drop owner assignment rpmsg: virtio: drop owner assignment nvdimm: virtio_pmem: drop owner assignment wifi: mac80211_hwsim: drop owner assignment vsock/virtio: drop owner assignment net: 9p: virtio: drop owner assignment net: virtio: drop owner assignment net: caif: virtio: drop owner assignment misc: nsm: drop owner assignment iommu: virtio: drop owner assignment drm/virtio: drop owner assignment gpio: virtio: drop owner assignment firmware: arm_scmi: virtio: drop owner assignment ...
2 daystracing/treewide: Remove second parameter of __assign_str()Steven Rostedt (Google)1-1/+1
With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>
3 daysMerge tag 'driver-core-6.10-rc1' of ↵Linus Torvalds1-12/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the small set of driver core and kernfs changes for 6.10-rc1. Nothing major here at all, just a small set of changes for some driver core apis, and minor fixups. Included in here are: - sysfs_bin_attr_simple_read() helper added and used - device_show_string() helper added and used All usages of these were acked by the various maintainers. Also in here are: - kernfs minor cleanup - removed unused functions - typo fix in documentation - pay attention to sysfs_create_link() failures in module.c finally All of these have been in linux-next for a very long time with no reported problems" * tag 'driver-core-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: device property: Fix a typo in the description of device_get_child_node_count() kernfs: mount: Remove unnecessary ‘NULL’ values from knparent scsi: Use device_show_string() helper for sysfs attributes platform/x86: Use device_show_string() helper for sysfs attributes perf: Use device_show_string() helper for sysfs attributes IB/qib: Use device_show_string() helper for sysfs attributes hwmon: Use device_show_string() helper for sysfs attributes driver core: Add device_show_string() helper for sysfs attributes treewide: Use sysfs_bin_attr_simple_read() helper sysfs: Add sysfs_bin_attr_simple_read() helper module: don't ignore sysfs_create_link() failures driver core: Remove unused platform_notify, platform_notify_remove
3 daysMerge tag 'tty-6.10-rc1' of ↵Linus Torvalds4-29/+225
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty / serial updates from Greg KH: "Here is the big set of tty/serial driver changes for 6.10-rc1. Included in here are: - Usual good set of api cleanups and evolution by Jiri Slaby to make the serial interfaces move out of the 1990's by using kfifos instead of hand-rolling their own logic. - 8250_exar driver updates - max3100 driver updates - sc16is7xx driver updates - exar driver updates - sh-sci driver updates - tty ldisc api addition to help refuse bindings - other smaller serial driver updates All of these have been in linux-next for a while with no reported issues" * tag 'tty-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (113 commits) serial: Clear UPF_DEAD before calling tty_port_register_device_attr_serdev() serial: imx: Raise TX trigger level to 8 serial: 8250_pnp: Simplify "line" related code serial: sh-sci: simplify locking when re-issuing RXDMA fails serial: sh-sci: let timeout timer only run when DMA is scheduled serial: sh-sci: describe locking requirements for invalidating RXDMA serial: sh-sci: protect invalidating RXDMA on shutdown tty: add the option to have a tty reject a new ldisc serial: core: Call device_set_awake_path() for console port dt-bindings: serial: brcm,bcm2835-aux-uart: convert to dtschema tty: serial: uartps: Add support for uartps controller reset arm64: zynqmp: Add resets property for UART nodes dt-bindings: serial: cdns,uart: Add optional reset property serial: 8250_pnp: Switch to DEFINE_SIMPLE_DEV_PM_OPS() serial: 8250_exar: Keep the includes sorted serial: 8250_exar: Make type of bit the same in exar_ee_*_bit() serial: 8250_exar: Use BIT() in exar_ee_read() serial: 8250_exar: Switch to use dev_err_probe() serial: 8250_exar: Return directly from switch-cases serial: 8250_exar: Decrease indentation level ...
3 daysMerge tag 'leds-next-6.10' of ↵Linus Torvalds1-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds Pull LED updates from Lee Jones: "Core Frameworks: - Ensure seldom updated triggers have a brightness value before first update New Device Support: - Add support for Simatic IPC Device BX_59A to IPC LEDs Core - Add support for Qualcomm PMI8950 PWM to LPG Core New Functionality: - Add a bunch of new LED function identifiers - Add support for High Resolution Timers in LED Trigger Patten Fix-ups: - Shift out Audio Trigger to the Sound subsystem - Convert suitable calls to devm_* managed resources - Device Tree binding adaptions/conversions/creation - Remove superfluous code/variables/attributes and simplify overall - Use/convert to new/better APIs/helpers/MACROs instead of hand-rolling implementations Bug Fixes: - Repair enabling Torch Mode from V4L2 on the second LED - Ensure PWM is disabled when suspending" * tag 'leds-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds: (28 commits) leds: mt6370: Remove unused field 'reg_cfgs' from 'struct mt6370_priv' leds: lp50xx: Remove unused field 'num_of_banked_leds' from 'struct lp50xx' leds: lp50xx: Remove unused field 'bank_modules' from 'struct lp50xx_led' leds: aat1290: Remove unused field 'torch_brightness' from 'struct aat1290_led' leds: sun50i-a100: Use match_string() helper to simplify the code leds: pwm: Disable PWM when going to suspend leds: trigger: pattern: Add support for hrtimer leds: mt6360: Fix the second LED can not enable torch mode by V4L2 dt-bindings: leds: leds-qcom-lpg: Add support for PMI8950 PWM leds: qcom-lpg: Add support for PMI8950 PWM leds: apu: Remove duplicate DMI lookup data leds: trigger: netdev: Remove not needed call to led_set_brightness in deactivate dt-bindings: leds: Add LED_FUNCTION_SPEED_* for link speed on LAN/WAN dt-bindings: leds: Add LED_FUNCTION_MOBILE for mobile network leds: simatic-ipc-leds-gpio: Add support for module BX-59A dt-bindings: leds: qcom-lpg: Document PM6150L compatible dt-bindings: leds: pca963x: Convert text bindings to YAML leds: an30259a: Use devm_mutex_init() for mutex initialization leds: mlxreg: Use devm_mutex_init() for mutex initialization leds: nic78bx: Use devm API to cleanup module's resources ...
3 daysMerge tag 'riscv-for-linus-6.10-mw1' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Add byte/half-word compare-and-exchange, emulated via LR/SC loops - Support for Rust - Support for Zihintpause in hwprobe - Add PR_RISCV_SET_ICACHE_FLUSH_CTX prctl() - Support lockless lockrefs * tag 'riscv-for-linus-6.10-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (42 commits) riscv: defconfig: Enable CONFIG_CLK_SOPHGO_CV1800 riscv: select ARCH_HAS_FAST_MULTIPLIER riscv: mm: still create swiotlb buffer for kmalloc() bouncing if required riscv: Annotate pgtable_l{4,5}_enabled with __ro_after_init riscv: Remove redundant CONFIG_64BIT from pgtable_l{4,5}_enabled riscv: mm: Always use an ASID to flush mm contexts riscv: mm: Preserve global TLB entries when switching contexts riscv: mm: Make asid_bits a local variable riscv: mm: Use a fixed layout for the MM context ID riscv: mm: Introduce cntx2asid/cntx2version helper macros riscv: Avoid TLB flush loops when affected by SiFive CIP-1200 riscv: Apply SiFive CIP-1200 workaround to single-ASID sfence.vma riscv: mm: Combine the SMP and UP TLB flush code riscv: Only send remote fences when some other CPU is online riscv: mm: Broadcast kernel TLB flushes only when needed riscv: Use IPIs for remote cache/TLB flushes by default riscv: Factor out page table TLB synchronization riscv: Flush the instruction cache during SMP bringup riscv: hwprobe: export Zihintpause ISA extension riscv: misaligned: remove CONFIG_RISCV_M_MODE specific code ...
3 dayskernel: Remove signal hacks for vhost_tasksMike Christie2-7/+2
This removes the signal/coredump hacks added for vhost_tasks in: Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") When that patch was added vhost_tasks did not handle SIGKILL and would try to ignore/clear the signal and continue on until the device's close function was called. In the previous patches vhost_tasks and the vhost drivers were converted to support SIGKILL by cleaning themselves up and exiting. The hacks are no longer needed so this removes them. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20240316004707.45557-10-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
3 daysvhost_task: Handle SIGKILL by flushing work and exitingMike Christie1-19/+34
Instead of lingering until the device is closed, this has us handle SIGKILL by: 1. marking the worker as killed so we no longer try to use it with new virtqueues and new flush operations. 2. setting the virtqueue to worker mapping so no new works are queued. 3. running all the exiting works. Suggested-by: Edward Adam Davis <eadavis@qq.com> Reported-and-tested-by: syzbot+98edc2df894917b3431f@syzkaller.appspotmail.com Message-Id: <tencent_546DA49414E876EEBECF2C78D26D242EE50A@qq.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20240316004707.45557-9-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
3 daysrv: Update rv_en(dis)able_monitor doc to match kernel-docYang Li1-0/+2
The patch updates the function documentation comment for rv_en(dis)able_monitor to adhere to the kernel-doc specification. Link: https://lore.kernel.org/linux-trace-kernel/20240520054239.61784-1-yang.lee@linux.alibaba.com Fixes: 102227b970a15 ("rv: Add Runtime Verification (RV) interface") Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 daystracing: Add MODULE_DESCRIPTION() to preemptirq_delay_testJeff Johnson1-0/+1
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/trace/preemptirq_delay_test.o Link: https://lore.kernel.org/linux-trace-kernel/20240518-md-preemptirq_delay_test-v1-1-387d11b30d85@quicinc.com Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: f96e8577da10 ("lib: Add module for testing preemptoff/irqsoff latency tracers") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 daysring-buffer: Fix a race between readers and resize checksPetr Pavlu1-0/+9
The reader code in rb_get_reader_page() swaps a new reader page into the ring buffer by doing cmpxchg on old->list.prev->next to point it to the new page. Following that, if the operation is successful, old->list.next->prev gets updated too. This means the underlying doubly-linked list is temporarily inconsistent, page->prev->next or page->next->prev might not be equal back to page for some page in the ring buffer. The resize operation in ring_buffer_resize() can be invoked in parallel. It calls rb_check_pages() which can detect the described inconsistency and stop further tracing: [ 190.271762] ------------[ cut here ]------------ [ 190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0 [ 190.271789] Modules linked in: [...] [ 190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1 [ 190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G E 6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f [ 190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014 [ 190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0 [ 190.272023] Code: [...] [ 190.272028] RSP: 0018:ffff9c37463abb70 EFLAGS: 00010206 [ 190.272034] RAX: ffff8eba04b6cb80 RBX: 0000000000000007 RCX: ffff8eba01f13d80 [ 190.272038] RDX: ffff8eba01f130c0 RSI: ffff8eba04b6cd00 RDI: ffff8eba0004c700 [ 190.272042] RBP: ffff8eba0004c700 R08: 0000000000010002 R09: 0000000000000000 [ 190.272045] R10: 00000000ffff7f52 R11: ffff8eba7f600000 R12: ffff8eba0004c720 [ 190.272049] R13: ffff8eba00223a00 R14: 0000000000000008 R15: ffff8eba067a8000 [ 190.272053] FS: 00007f1bd64752c0(0000) GS:ffff8eba7f680000(0000) knlGS:0000000000000000 [ 190.272057] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 190.272061] CR2: 00007f1bd6662590 CR3: 000000010291e001 CR4: 0000000000370ef0 [ 190.272070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 190.272073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 190.272077] Call Trace: [ 190.272098] <TASK> [ 190.272189] ring_buffer_resize+0x2ab/0x460 [ 190.272199] __tracing_resize_ring_buffer.part.0+0x23/0xa0 [ 190.272206] tracing_resize_ring_buffer+0x65/0x90 [ 190.272216] tracing_entries_write+0x74/0xc0 [ 190.272225] vfs_write+0xf5/0x420 [ 190.272248] ksys_write+0x67/0xe0 [ 190.272256] do_syscall_64+0x82/0x170 [ 190.272363] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 190.272373] RIP: 0033:0x7f1bd657d263 [ 190.272381] Code: [...] [ 190.272385] RSP: 002b:00007ffe72b643f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 190.272391] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1bd657d263 [ 190.272395] RDX: 0000000000000002 RSI: 0000555a6eb538e0 RDI: 0000000000000001 [ 190.272398] RBP: 0000555a6eb538e0 R08: 000000000000000a R09: 0000000000000000 [ 190.272401] R10: 0000555a6eb55190 R11: 0000000000000246 R12: 00007f1bd6662500 [ 190.272404] R13: 0000000000000002 R14: 00007f1bd6667c00 R15: 0000000000000002 [ 190.272412] </TASK> [ 190.272414] ---[ end trace 0000000000000000 ]--- Note that ring_buffer_resize() calls rb_check_pages() only if the parent trace_buffer has recording disabled. Recent commit d78ab792705c ("tracing: Stop current tracer when resizing buffer") causes that it is now always the case which makes it more likely to experience this issue. The window to hit this race is nonetheless very small. To help reproducing it, one can add a delay loop in rb_get_reader_page(): ret = rb_head_page_replace(reader, cpu_buffer->reader_page); if (!ret) goto spin; for (unsigned i = 0; i < 1U << 26; i++) /* inserted delay loop */ __asm__ __volatile__ ("" : : : "memory"); rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list; .. and then run the following commands on the target system: echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable while true; do echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1 echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1 done & while true; do for i in /sys/kernel/tracing/per_cpu/*; do timeout 0.1 cat $i/trace_pipe; sleep 0.2 done done To fix the problem, make sure ring_buffer_resize() doesn't invoke rb_check_pages() concurrently with a reader operating on the same ring_buffer_per_cpu by taking its cpu_buffer->reader_lock. Link: https://lore.kernel.org/linux-trace-kernel/20240517134008.24529-3-petr.pavlu@suse.com Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 659f451ff213 ("ring-buffer: Add integrity check at end of iter read") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> [ Fixed whitespace ] Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 daysring-buffer: Correct stale comments related to non-consuming readersPetr Pavlu1-13/+3
Adjust the following code documentation: * Kernel-doc comments for ring_buffer_read_prepare() and ring_buffer_read_finish() mention that recording to the ring buffer is disabled when the read is active. Remove mention of this restriction because it was already lifted in commit 1039221cc278 ("ring-buffer: Do not disable recording when there is an iterator"). * Function ring_buffer_read_finish() performs a self-check of the ring-buffer by locking cpu_buffer->reader_lock and then calling rb_check_pages(). The preceding comment explains that the lock is needed because rb_check_pages() clears the HEAD flag required by readers which might be running in parallel. Remove this explanation because commit 8843e06f67b1 ("ring-buffer: Handle race between rb_move_tail and rb_check_pages") simplified the function so it no longer resets the mentioned flag. Nonetheless, the lock is still needed because a reader swapping a page into the ring buffer can make the underlying doubly-linked list temporarily inconsistent. This is a non-functional change. Link: https://lore.kernel.org/linux-trace-kernel/20240517134008.24529-2-petr.pavlu@suse.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
3 daysMerge tag 'bitmap-for-6.10v2' of https://github.com/norov/linuxLinus Torvalds1-4/+2
Pull bitmap updates from Yury Norov: - topology_span_sane() optimization from Kyle Meyer - fns() rework from Kuan-Wei Chiu (used in cpumask_local_spread() and other places) - headers cleanup from Andy - add a MAINTAINERS record for bitops API * tag 'bitmap-for-6.10v2' of https://github.com/norov/linux: usercopy: Don't use "proxy" headers bitops: Move aligned_byte_mask() to wordpart.h MAINTAINERS: add BITOPS API record bitmap: relax find_nth_bit() limitation on return value lib: make test_bitops compilable into the kernel image bitops: Optimize fns() for improved performance lib/test_bitops: Add benchmark test for fns() Compiler Attributes: Add __always_used macro sched/topology: Optimize topology_span_sane() cpumask: Add for_each_cpu_from()
4 daysMerge tag 'pull-set_blocksize' of ↵Linus Torvalds1-6/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file *' and verifies that the caller has the device opened exclusively" * tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()
4 daysMerge tag 'vfio-v6.10-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds1-0/+2
Pull vfio updates from Alex Williamson: - The vfio fsl-mc bus driver has become orphaned. We'll consider removing it in future releases if a new maintainer isn't found (Alex Williamson) - Improved usage of opaque data in vfio-pci INTx handling, avoiding lookups of the eventfd through the interrupt and irqfd runtime paths (Alex Williamson) - Resolve an error path memory leak introduced in vfio-pci interrupt code (Ye Bin) - Addition of interrupt support for vfio devices exposed on the CDX bus, including a new MSI allocation helper and export of existing helpers for MSI alloc and free (Nipun Gupta) - A new vfio-pci variant driver supporting migration of Intel QAT VF devices for the GEN4 PFs (Xin Zeng & Yahui Cao) - Resolve a possibly circular locking dependency in vfio-pci by avoiding copy_to_user() from a PCI bus walk callback (Alex Williamson) - Trivial docs update to remove a duplicate semicolon (Foryun Ma) * tag 'vfio-v6.10-rc1' of https://github.com/awilliam/linux-vfio: vfio/pci: Restore zero affected bus reset devices warning vfio: remove an extra semicolon vfio/pci: Collect hot-reset devices to local buffer vfio/qat: Add vfio_pci driver for Intel QAT SR-IOV VF devices vfio/cdx: add interrupt support genirq/msi: Add MSI allocation helper and export MSI functions vfio/pci: fix potential memory leak in vfio_intx_enable() vfio/pci: Pass eventfd context object through irqfd vfio/pci: Pass eventfd context to IRQ handler MAINTAINERS: Orphan vfio fsl-mc bus driver
5 daysMerge tag 'fsnotify_for_v6.10-rc1' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: - reduce overhead of fsnotify infrastructure when no permission events are in use - a few small cleanups * tag 'fsnotify_for_v6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: fix UAF from FS_ERROR event on a shutting down filesystem fsnotify: optimize the case of no permission event watchers fsnotify: use an enum for group priority constants fsnotify: move s_fsnotify_connectors into fsnotify_sb_info fsnotify: lazy attach fsnotify_sb_info state to sb fsnotify: create helper fsnotify_update_sb_watchers() fsnotify: pass object pointer and type to fsnotify mark helpers fanotify: merge two checks regarding add of ignore mark fsnotify: create a wrapper fsnotify_find_inode_mark() fsnotify: create helpers to get sb and connp from object fsnotify: rename fsnotify_{get,put}_sb_connectors() fsnotify: Avoid -Wflex-array-member-not-at-end warning fanotify: remove unneeded sub-zero check for unsigned value
5 daysMerge tag 'dma-mapping-6.10-2024-05-20' of ↵Linus Torvalds3-32/+104
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - optimize DMA sync calls when they are no-ops (Alexander Lobakin) - fix swiotlb padding for untrusted devices (Michael Kelley) - add documentation for swiotb (Michael Kelley) * tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping: dma: fix DMA sync for drivers not calling dma_set_mask*() xsk: use generic DMA sync shortcut instead of a custom one page_pool: check for DMA sync shortcut earlier page_pool: don't use driver-set flags field directly page_pool: make sure frag API fields don't span between cachelines iommu/dma: avoid expensive indirect calls for sync operations dma: avoid redundant calls for sync operations dma: compile-out DMA sync op calls when not used iommu/dma: fix zeroing of bounce buffer padding used by untrusted devices swiotlb: remove alloc_size argument to swiotlb_tbl_map_single() Documentation/core-api: add swiotlb documentation
5 daysMerge tag 'mm-nonmm-stable-2024-05-19-11-56' of ↵Linus Torvalds7-10/+64
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: "Mainly singleton patches, documented in their respective changelogs. Notable series include: - Some maintenance and performance work for ocfs2 in Heming Zhao's series "improve write IO performance when fragmentation is high". - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes exposed by fstests". - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean up kfifo.h". - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes for $lx_current and $lx_per_cpu". - After much discussion, a coding-style update from Barry Song explaining one reason why inline functions are preferred over macros. The series is "codingstyle: avoid unused parameters for a function-like macro"" * tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits) fs/proc: fix softlockup in __read_vmcore nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON() scripts: checkpatch: check unused parameters for function-like macro Documentation: coding-style: ask function-like macros to evaluate parameters nilfs2: use __field_struct() for a bitwise field selftests/kcmp: remove unused open mode nilfs2: remove calls to folio_set_error() and folio_clear_error() kernel/watchdog_perf.c: tidy up kerneldoc watchdog: allow nmi watchdog to use raw perf event watchdog: handle comma separated nmi_watchdog command line nilfs2: make superblock data array index computation sparse friendly squashfs: remove calls to set the folio error flag squashfs: convert squashfs_symlink_read_folio to use folio APIs scripts/gdb: fix detection of current CPU in KGDB scripts/gdb: make get_thread_info accept pointers scripts/gdb: fix parameter handling in $lx_per_cpu scripts/gdb: fix failing KGDB detection during probe kfifo: don't use "proxy" headers media: stih-cec: add missing io.h media: rc: add missing io.h ...
6 daysMerge tag 'kgdb-6.10-rc1' of ↵Linus Torvalds2-76/+79
git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux Pull kgdb updates from Daniel Thompson: "Nine patches this cycle and they split into just three topics: - Adopt coccinelle's recommendation to adopt str_plural() - A set of seven patches to refactor kdb_read() to improve both code clarity and its discipline with respect to fixed size buffers. This isn't just a refactor. Between them these also fix a cursor movement redraw problem and two buffer overflows (one latent and one real, albeit difficult to tickle). - Fix an NMI-safety problem when enqueuing kdb's keyboard reset code I wrote eight of the nine patches in this collection so many thanks to Doug Anderson for the reviews. The changes that affects drivers/tty/serial is acked by Greg KH" * tag 'kgdb-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: serial: kgdboc: Fix NMI-safety problems from keyboard reset code kdb: Simplify management of tmpbuffer in kdb_read() kdb: Replace double memcpy() with memmove() in kdb_read() kdb: Use format-specifiers rather than memset() for padding in kdb_read() kdb: Merge identical case statements in kdb_read() kdb: Fix console handling when editing and tab-completing commands kdb: Use format-strings rather than '\0' injection in kdb_read() kdb: Fix buffer overflow during tab-complete kdb: Use str_plural() to fix Coccinelle warning
6 daysMerge tag 'sched-urgent-2024-05-18' of ↵Linus Torvalds5-7/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix a sched_balance_newidle setting bug - Fix bug in the setting of /sys/fs/cgroup/test/cpu.max.burst - Fix variable-shadowing build warning - Extend sched-domains debug output - Fix documentation - Fix comments * tag 'sched-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write() sched/fair: Remove stale FREQUENCY_UTIL comment sched/fair: Fix initial util_avg calculation docs: cgroup-v1: Clarify that domain levels are system-specific sched/debug: Dump domains' level sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level arch/topology: Fix variable naming to avoid shadowing
6 daysMerge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds9-20/+33
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
7 daysMerge tag 'kbuild-v6.10' of ↵Linus Torvalds3-26/+14
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig * tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits) kconfig: use sym_get_choice_menu() in sym_check_prop() rapidio: remove choice for enumeration kconfig: lxdialog: remove initialization with A_NORMAL kconfig: m/nconf: merge two item_add_str() calls kconfig: m/nconf: remove dead code to display value of bool choice kconfig: m/nconf: remove dead code to display children of choice members kconfig: gconf: show checkbox for choice correctly kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal Makefile: remove redundant tool coverage variables kbuild: provide reasonable defaults for tool coverage modules: Drop the .export_symbol section from the final modules kconfig: use menu_list_for_each_sym() in sym_check_choice_deps() kconfig: use sym_get_choice_menu() in conf_write_defconfig() kconfig: add sym_get_choice_menu() helper kconfig: turn defaults and additional prompt for choice members into error kconfig: turn missing prompt for choice members into error kconfig: turn conf_choice() into void function kconfig: use linked list in sym_set_changed() kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED kconfig: gconf: remove debug code ...
7 dayskprobe/ftrace: fix build error due to bad function definitionLinus Torvalds1-1/+1
Commit 1a7d0890dd4a ("kprobe/ftrace: bail out if ftrace was killed") introduced a bad K&R function definition, which we haven't accepted in a long long time. Gcc seems to let it slide, but clang notices with the appropriate error: kernel/kprobes.c:1140:24: error: a function declaration without a prototype is deprecated in all > 1140 | void kprobe_ftrace_kill() | ^ | void but this commit was apparently never in linux-next before it was sent upstream, so it didn't get the appropriate build test coverage. Fixes: 1a7d0890dd4a kprobe/ftrace: bail out if ftrace was killed Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 daysMerge tag 'net-6.10-rc0' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Current release - regressions: - virtio_net: fix missed error path rtnl_unlock after control queue locking rework Current release - new code bugs: - bpf: fix KASAN slab-out-of-bounds in percpu_array_map_gen_lookup, caused by missing nested map handling - drv: dsa: correct initialization order for KSZ88x3 ports Previous releases - regressions: - af_packet: do not call packet_read_pending() from tpacket_destruct_skb() fix performance regression - ipv6: fix route deleting failure when metric equals 0, don't assume 0 means not set / default in this case Previous releases - always broken: - bridge: couple of syzbot-driven fixes" * tag 'net-6.10-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (30 commits) selftests: net: local_termination: annotate the expected failures net: dsa: microchip: Correct initialization order for KSZ88x3 ports MAINTAINERS: net: Update reviewers for TI's Ethernet drivers dt-bindings: net: ti: Update maintainers list l2tp: fix ICMP error handling for UDP-encap sockets net: txgbe: fix to control VLAN strip net: wangxun: match VLAN CTAG and STAG features net: wangxun: fix to change Rx features af_packet: do not call packet_read_pending() from tpacket_destruct_skb() virtio_net: Fix missed rtnl_unlock netrom: fix possible dead-lock in nr_rt_ioctl() idpf: don't skip over ethtool tcp-data-split setting dt-bindings: net: qcom: ethernet: Allow dma-coherent bonding: fix oops during rmmod net/ipv6: Fix route deleting failure when metric equals 0 selftests/net: reduce xfrm_policy test time selftests/bpf: Adjust btf_dump test to reflect recent change in file_operations selftests/bpf: Adjust test_access_variable_array after a kernel function name change selftests/net/lib: no need to record ns name if it already exist net: qrtr: ns: Fix module refcnt ...
7 daysMerge tag 'trace-user-events-v6.10' of ↵Linus Torvalds1-1/+75
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing user-event updates from Steven Rostedt: - Minor update to the user_events interface The ABI of creating a user event states that the fields are separated by semicolons, and spaces should be ignored. But the parsing expected at least one space to be there (which was incorrect). Fix the reading of the string to handle fields separated by semicolons but no space between them. This does extend the API sightly as now "field;field" will now be parsed and not cause an error. But it should not cause any regressions as no logic should expect it to fail. Note, that the logic that parses the event fields to create the trace_event works with no spaces after the semi-colon. It is the logic that tests against existing events that is inconsistent. This causes registering an event without using spaces to succeed if it doesn't exist, but makes the same call that tries to register to the same event, but doesn't use spaces, fail. * tag 'trace-user-events-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: selftests/user_events: Add non-spacing separator check tracing/user_events: Fix non-spaced field matching
7 daysMerge tag 'trace-ringbuffer-v6.10' of ↵Linus Torvalds3-16/+560
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing ring buffer updates from Steven Rostedt: "Add ring_buffer memory mappings. The tracing ring buffer was created based on being mostly used with the splice system call. It is broken up into page ordered sub-buffers and the reader swaps a new sub-buffer with an existing sub-buffer that's part of the write buffer. It then has total access to the swapped out sub-buffer and can do copyless movements of the memory into other mediums (file system, network, etc). The buffer is great for passing around the ring buffer contents in the kernel, but is not so good for when the consumer is the user space task itself. A new interface is added that allows user space to memory map the ring buffer. It will get all the write sub-buffers as well as reader sub-buffer (that is not written to). It can send an ioctl to change which sub-buffer is the new reader sub-buffer. The ring buffer is read only to user space. It only needs to call the ioctl when it is finished with a sub-buffer and needs a new sub-buffer that the writer will not write over. A self test program was also created for testing and can be used as an example for the interface to user space. The libtracefs (external to the kernel) also has code that interacts with this, although it is disabled until the interface is in a official release. It can be enabled by compiling the library with a special flag. This was used for testing applications that perform better with the buffer being mapped. Memory mapped buffers have limitations. The main one is that it can not be used with the snapshot logic. If the buffer is mapped, snapshots will be disabled. If any logic is set to trigger snapshots on a buffer, that buffer will not be allowed to be mapped" * tag 'trace-ringbuffer-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Add cast to unsigned long addr passed to virt_to_page() ring-buffer: Have mmapped ring buffer keep track of missed events ring-buffer/selftest: Add ring-buffer mapping test Documentation: tracing: Add ring-buffer mapping tracing: Allow user-space mapping of the ring-buffer ring-buffer: Introducing ring-buffer mapping functions ring-buffer: Allocate sub-buffers with __GFP_COMP
7 daysMerge tag 'trace-v6.10' of ↵Linus Torvalds3-45/+36
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Remove unused ftrace_direct_funcs variables - Fix a possible NULL pointer dereference race in eventfs - Update do_div() usage in trace event benchmark test - Speedup direct function registration with asynchronous RCU callback. The synchronization was done in the registration code and this caused delays when registering direct callbacks. Move the freeing to a call_rcu() that will prevent delaying of the registering. - Replace simple_strtoul() usage with kstrtoul() * tag 'trace-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Fix a possible null pointer dereference in eventfs_find_events() ftrace: Fix possible use-after-free issue in ftrace_location() ftrace: Remove unused global 'ftrace_direct_func_count' ftrace: Remove unused list 'ftrace_direct_funcs' tracing: Improve benchmark test performance by using do_div() ftrace: Use asynchronous grace period for register_ftrace_direct() ftrace: Replaces simple_strtoul in ftrace
7 daysMerge tag 'probes-v6.10' of ↵Linus Torvalds11-57/+171
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: - tracing/probes: Add new pseudo-types %pd and %pD support for dumping dentry name from 'struct dentry *' and file name from 'struct file *' - uprobes performance optimizations: - Speed up the BPF uprobe event by delaying the fetching of the uprobe event arguments that are not used in BPF - Avoid locking by speculatively checking whether uprobe event is valid - Reduce lock contention by using read/write_lock instead of spinlock for uprobe list operation. This improved BPF uprobe benchmark result 43% on average - rethook: Remove non-fatal warning messages when tracing stack from BPF and skip rcu_is_watching() validation in rethook if possible - objpool: Optimize objpool (which is used by kretprobes and fprobe as rethook backend storage) by inlining functions and avoid caching nr_cpu_ids because it is a const value - fprobe: Add entry/exit callbacks types (code cleanup) - kprobes: Check ftrace was killed in kprobes if it uses ftrace * tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobe/ftrace: bail out if ftrace was killed selftests/ftrace: Fix required features for VFS type test case objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids objpool: enable inlining objpool_push() and objpool_pop() operations rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get() ftrace: make extra rcu_is_watching() validation check optional uprobes: reduce contention on uprobes_tree access rethook: Remove warning messages printed for finding return address of a frame. fprobe: Add entry/exit callbacks types selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD" selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD" Documentation: tracing: add new type '%pd' and '%pD' for kprobe tracing/probes: support '%pD' type for print struct file's name tracing/probes: support '%pd' type for print struct dentry's name uprobes: add speculative lockless system-wide uprobe filter check uprobes: prepare uprobe args buffer lazily uprobes: encapsulate preparation of uprobe args buffer
7 daysMerge tag 'sysctl-6.10-rc1' of ↵Linus Torvalds30-34/+2
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: - Remove sentinel elements from ctl_table structs in kernel/* Removing sentinels in ctl_table arrays reduces the build time size and runtime memory consumed by ~64 bytes per array. Removals for net/, io_uring/, mm/, ipc/ and security/ are set to go into mainline through their respective subsystems making the next release the most likely place where the final series that removes the check for proc_name == NULL will land. This adds to removals already in arch/, drivers/ and fs/. - Adjust ctl_table definitions and references to allow constification - Remove unused ctl_table function arguments - Move non-const elements from ctl_table to ctl_table_header - Make ctl_table pointers const in ctl_table_root structure Making the static ctl_table structs const will increase safety by keeping the pointers to proc_handler functions in .rodata. Though no ctl_tables where made const in this PR, the ground work for making that possible has started with these changes sent by Thomas Weißschuh. * tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: drop now unnecessary out-of-bounds check sysctl: move sysctl type to ctl_table_header sysctl: drop sysctl_is_perm_empty_ctl_table sysctl: treewide: constify argument ctl_table_root::permissions(table) sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table) bpf: Remove the now superfluous sentinel elements from ctl_table array delayacct: Remove the now superfluous sentinel elements from ctl_table array kprobes: Remove the now superfluous sentinel elements from ctl_table array printk: Remove the now superfluous sentinel elements from ctl_table array scheduler: Remove the now superfluous sentinel elements from ctl_table array seccomp: Remove the now superfluous sentinel elements from ctl_table array timekeeping: Remove the now superfluous sentinel elements from ctl_table array ftrace: Remove the now superfluous sentinel elements from ctl_table array umh: Remove the now superfluous sentinel elements from ctl_table array kernel misc: Remove the now superfluous sentinel elements from ctl_table array
8 daysMerge tag 'powerpc-6.10-1' of ↵Linus Torvalds4-18/+36
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Enable BPF Kernel Functions (kfuncs) in the powerpc BPF JIT. - Allow per-process DEXCR (Dynamic Execution Control Register) settings via prctl, notably NPHIE which controls hashst/hashchk for ROP protection. - Install powerpc selftests in sub-directories. Note this changes the way run_kselftest.sh needs to be invoked for powerpc selftests. - Change fadump (Firmware Assisted Dump) to better handle memory add/remove. - Add support for passing additional parameters to the fadump kernel. - Add support for updating the kdump image on CPU/memory add/remove events. - Other small features, cleanups and fixes. Thanks to Andrew Donnellan, Andy Shevchenko, Aneesh Kumar K.V, Arnd Bergmann, Benjamin Gray, Bjorn Helgaas, Christian Zigotzky, Christophe Jaillet, Christophe Leroy, Colin Ian King, Cédric Le Goater, Dr. David Alan Gilbert, Erhard Furtner, Frank Li, GUO Zihua, Ganesh Goudar, Geoff Levand, Ghanshyam Agrawal, Greg Kurz, Hari Bathini, Joel Stanley, Justin Stitt, Kunwu Chan, Li Yang, Lidong Zhong, Madhavan Srinivasan, Mahesh Salgaonkar, Masahiro Yamada, Matthias Schiffer, Naresh Kamboju, Nathan Chancellor, Nathan Lynch, Naveen N Rao, Nicholas Miehlbradt, Ran Wang, Randy Dunlap, Ritesh Harjani, Sachin Sant, Shirisha Ganta, Shrikanth Hegde, Sourabh Jain, Stephen Rothwell, sundar, Thorsten Blum, Vaibhav Jain, Xiaowei Bao, Yang Li, and Zhao Chenhui. * tag 'powerpc-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (85 commits) powerpc/fadump: Fix section mismatch warning powerpc/85xx: fix compile error without CONFIG_CRASH_DUMP powerpc/fadump: update documentation about bootargs_append powerpc/fadump: pass additional parameters when fadump is active powerpc/fadump: setup additional parameters for dump capture kernel powerpc/pseries/fadump: add support for multiple boot memory regions selftests/powerpc/dexcr: Fix spelling mistake "predicition" -> "prediction" KVM: PPC: Book3S HV nestedv2: Fix an error handling path in gs_msg_ops_kvmhv_nestedv2_config_fill_info() KVM: PPC: Fix documentation for ppc mmu caps KVM: PPC: code cleanup for kvmppc_book3s_irqprio_deliver KVM: PPC: Book3S HV nestedv2: Cancel pending DEC exception powerpc/xmon: Check cpu id in commands "c#", "dp#" and "dx#" powerpc/code-patching: Use dedicated memory routines for patching powerpc/code-patching: Test patch_instructions() during boot powerpc64/kasan: Pass virtual addresses to kasan_init_phys_region() powerpc: rename SPRN_HID2 define to SPRN_HID2_750FX powerpc: Fix typos powerpc/eeh: Fix spelling of the word "auxillary" and update comment macintosh/ams: Fix unused variable warning powerpc/Makefile: Remove bits related to the previous use of -mcmodel=large ...
8 dayssched/core: Fix incorrect initialization of the 'burst' parameter in ↵Cheng Yu1-1/+1
cpu_max_write() In the cgroup v2 CPU subsystem, assuming we have a cgroup named 'test', and we set cpu.max and cpu.max.burst: # echo 1000000 > /sys/fs/cgroup/test/cpu.max # echo 1000000 > /sys/fs/cgroup/test/cpu.max.burst then we check cpu.max and cpu.max.burst: # cat /sys/fs/cgroup/test/cpu.max 1000000 100000 # cat /sys/fs/cgroup/test/cpu.max.burst 1000000 Next we set cpu.max again and check cpu.max and cpu.max.burst: # echo 2000000 > /sys/fs/cgroup/test/cpu.max # cat /sys/fs/cgroup/test/cpu.max 2000000 100000 # cat /sys/fs/cgroup/test/cpu.max.burst 1000 ... we find that the cpu.max.burst value changed unexpectedly. In cpu_max_write(), the unit of the burst value returned by tg_get_cfs_burst() is microseconds, while in cpu_max_write(), the burst unit used for calculation should be nanoseconds, which leads to the bug. To fix it, get the burst value directly from tg->cfs_bandwidth.burst. Fixes: f4183717b370 ("sched/fair: Introduce the burstable CFS controller") Reported-by: Qixin Liao <liaoqixin@huawei.com> Signed-off-by: Cheng Yu <serein.chengyu@huawei.com> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240424132438.514720-1-serein.chengyu@huawei.com
8 dayssched/fair: Remove stale FREQUENCY_UTIL commentChristian Loehle1-2/+2
On 05/03/2024 15:05, Vincent Guittot wrote: I'm fine with either and that was my first thought here, too, but it did seem like the comment was mostly placed there to justify the 'unexpected' high utilization when explicitly passing FREQUENCY_UTIL and the need to clamp it then. So removing did feel slightly more natural to me anyway. So alternatively: From: Christian Loehle <christian.loehle@arm.com> Date: Tue, 5 Mar 2024 09:34:41 +0000 Subject: [PATCH] sched/fair: Remove stale FREQUENCY_UTIL mention effective_cpu_util() flags were removed, so remove mentioning of the flag. commit 9c0b4bb7f6303 ("sched/cpufreq: Rework schedutil governor performance estimation") reworked effective_cpu_util() removing enum cpu_util_type. Modify the comment accordingly. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/0e2833ee-0939-44e0-82a2-520a585a0153@arm.com
8 dayssched/fair: Fix initial util_avg calculationDawei Li1-2/+3
Change se->load.weight to se_weight(se) in the calculation for the initial util_avg to avoid unnecessarily inflating the util_avg by 1024 times. The reason is that se->load.weight has the unit/scale as the scaled-up load, while cfs_rg->avg.load_avg has the unit/scale as the true task weight (as mapped directly from the task's nice/priority value). With CONFIG_32BIT, the scaled-up load is equal to the true task weight. With CONFIG_64BIT, the scaled-up load is 1024 times the true task weight. Thus, the current code may inflate the util_avg by 1024 times. The follow-up capping will not allow the util_avg value to go wild. But the calculation should have the correct logic. Signed-off-by: Dawei Li <daweilics@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: https://lore.kernel.org/r/20240315015916.21545-1-daweilics@gmail.com
8 dayssched/debug: Dump domains' levelVitalii Bursov1-0/+1
Knowing domain's level exactly can be useful when setting relax_domain_level or cpuset.sched_relax_domain_level Usage: cat /debug/sched/domains/cpu0/domain1/level to dump cpu0 domain1's level. SDM macro is not used because sd->level is 'int' and it would hide the type mismatch between 'int' and 'u32'. Signed-off-by: Vitalii Bursov <vitaly@bursov.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/9489b6475f6dd6fbc67c617752d4216fa094da53.1714488502.git.vitaly@bursov.com
8 dayssched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_levelVitalii Bursov2-2/+2
Change relax_domain_level checks so that it would be possible to include or exclude all domains from newidle balancing. This matches the behavior described in the documentation: -1 no request. use system default or follow request of others. 0 no search. 1 search siblings (hyperthreads in a core). "2" enables levels 0 and 1, level_max excludes the last (level_max) level, and level_max+1 includes all levels. Fixes: 1d3504fcf560 ("sched, cpuset: customize sched domains, core") Signed-off-by: Vitalii Bursov <vitaly@bursov.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
9 daysMerge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-178/+275
Pull workqueue updates from Tejun Heo: - Work items can now be disabled and enabled, and cancel_work_sync() and disable_work() can be called form atomic contexts for BH work items. This closes feature gap with tasklet and should allow converting all existing tasklet users to BH workqueues. - Improve pool sharing for unbound workqueues with strict affinity. - Misc changes including doc updates, improved debug annotations and cleanups. * tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Use "@..." in function comment to describe variable length argument workqueue: Add destroy_work_on_stack() in workqueue_softirq_dead() workqueue: remove unnecessary import and function in wq_monitor.py workqueue: Introduce enable_and_queue_work() convenience function workqueue: add function in event of workqueue_activate_work workqueue: Cleanup subsys attribute registration workqueue: Use list_last_entry() to get the last idle worker workqueue: Move attrs->cpumask out of worker_pool's properties when attrs->affn_strict workqueue: Use INIT_WORK_ONSTACK in workqueue_softirq_dead() workqueue: Allow cancel_work_sync() and disable_work() from atomic contexts on BH work items workqueue: Remember whether a work item was on a BH workqueue workqueue: Remove WORK_OFFQ_CANCELING workqueue: Implement disable/enable for (delayed) work items workqueue: Preserve OFFQ bits in cancel[_sync] paths
9 daysMerge tag 'cgroup-for-6.10' of ↵Linus Torvalds8-173/+162
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - The locking around cpuset hotplug processing has always been a bit of mess which was worked around by making hotplug processing asynchronous. The asynchronity isn't great and led to other issues. We tried to make the behavior synchronous a while ago but that led to lockdep splats. Waiman took another stab at cleaning up and making it synchronous. The patch has been in -next for well over a month and there haven't been any complaints, so fingers crossed. - Tracepoints added to help understanding rstat lock contentions. - A bunch of minor changes - doc updates, code cleanups and selftests. * tag 'cgroup-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits) cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints selftests/cgroup: Drop define _GNU_SOURCE docs: cgroup-v1: Update page cache removal functions selftests/cgroup: fix uninitialized variables in test_zswap.c selftests/cgroup: cpu_hogger init: use {} instead of {NULL} selftests/cgroup: fix clang warnings: uninitialized fd variable selftests/cgroup: fix clang build failures for abs() calls cgroup/cpuset: Remove outdated comment in sched_partition_write() cgroup/cpuset: Fix incorrect top_cpuset flags cgroup/cpuset: Avoid clearing CS_SCHED_LOAD_BALANCE twice cgroup/cpuset: Statically initialize more members of top_cpuset cgroup: Avoid unnecessary looping in cgroup_no_v1() cgroup, legacy_freezer: update comment for freezer_css_offline() docs, cgroup: add entries for pids to cgroup-v2.rst cgroup: don't call cgroup1_pidlist_destroy_all() for v2 cgroup_freezer: update comment for freezer_css_online() cgroup/rstat: desc member cgrp in cgroup_rstat_flush_release cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints cgroup/pids: Remove superfluous zeroing docs: cgroup-v1: Fix description for css_online ...
9 dayskprobe/ftrace: bail out if ftrace was killedStephen Brennan2-0/+7
If an error happens in ftrace, ftrace_kill() will prevent disarming kprobes. Eventually, the ftrace_ops associated with the kprobes will be freed, yet the kprobes will still be active, and when triggered, they will use the freed memory, likely resulting in a page fault and panic. This behavior can be reproduced quite easily, by creating a kprobe and then triggering a ftrace_kill(). For simplicity, we can simulate an ftrace error with a kernel module like [1]: [1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer sudo perf probe --add commit_creds sudo perf trace -e probe:commit_creds # In another terminal make sudo insmod ftrace_killer.ko # calls ftrace_kill(), simulating bug # Back to perf terminal # ctrl-c sudo perf probe --del commit_creds After a short period, a page fault and panic would occur as the kprobe continues to execute and uses the freed ftrace_ops. While ftrace_kill() is supposed to be used only in extreme circumstances, it is invoked in FTRACE_WARN_ON() and so there are many places where an unexpected bug could be triggered, yet the system may continue operating, possibly without the administrator noticing. If ftrace_kill() does not panic the system, then we should do everything we can to continue operating, rather than leave a ticking time bomb. Link: https://lore.kernel.org/all/20240501162956.229427-1-stephen.s.brennan@oracle.com/ Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Guo Ren <guoren@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
9 daysMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-3/+3
Pull KVM updates from Paolo Bonzini: "ARM: - Move a lot of state that was previously stored on a per vcpu basis into a per-CPU area, because it is only pertinent to the host while the vcpu is loaded. This results in better state tracking, and a smaller vcpu structure. - Add full handling of the ERET/ERETAA/ERETAB instructions in nested virtualisation. The last two instructions also require emulating part of the pointer authentication extension. As a result, the trap handling of pointer authentication has been greatly simplified. - Turn the global (and not very scalable) LPI translation cache into a per-ITS, scalable cache, making non directly injected LPIs much cheaper to make visible to the vcpu. - A batch of pKVM patches, mostly fixes and cleanups, as the upstreaming process seems to be resuming. Fingers crossed! - Allocate PPIs and SGIs outside of the vcpu structure, allowing for smaller EL2 mapping and some flexibility in implementing more or less than 32 private IRQs. - Purge stale mpidr_data if a vcpu is created after the MPIDR map has been created. - Preserve vcpu-specific ID registers across a vcpu reset. - Various minor cleanups and improvements. LoongArch: - Add ParaVirt IPI support - Add software breakpoint support - Add mmio trace events support RISC-V: - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak - Some preparatory work for both TDX and SNP page fault handling. This also cleans up the page fault path, so that the priorities of various kinds of fauls (private page, no memory, write to read-only slot, etc.) are easier to follow. x86: - Minimize amount of time that shadow PTEs remain in the special REMOVED_SPTE state. This is a state where the mmu_lock is held for reading but concurrent accesses to the PTE have to spin; shortening its use allows other vCPUs to repopulate the zapped region while the zapper finishes tearing down the old, defunct page tables. - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is defined by hardware but left for software use. This lets KVM communicate its inability to map GPAs that set bits 51:48 on hosts without 5-level nested page tables. Guest firmware is expected to use the information when mapping BARs; this avoids that they end up at a legal, but unmappable, GPA. - Fixed a bug where KVM would not reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - As usual, a bunch of code cleanups. x86 (AMD): - Implement a new and improved API to initialize SEV and SEV-ES VMs, which will also be extendable to SEV-SNP. The new API specifies the desired encryption in KVM_CREATE_VM and then separately initializes the VM. The new API also allows customizing the desired set of VMSA features; the features affect the measurement of the VM's initial state, and therefore enabling them cannot be done tout court by the hypervisor. While at it, the new API includes two bugfixes that couldn't be applied to the old one without a flag day in userspace or without affecting the initial measurement. When a SEV-ES VM is created with the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected once the VMSA has been encrypted. Also, the FPU and AVX state will be synchronized and encrypted too. - Support for GHCB version 2 as applicable to SEV-ES guests. This, once more, is only accessible when using the new KVM_SEV_INIT2 flow for initialization of SEV-ES VMs. x86 (Intel): - An initial bunch of prerequisite patches for Intel TDX were merged. They generally don't do anything interesting. The only somewhat user visible change is a new debugging mode that checks that KVM's MMU never triggers a #VE virtualization exception in the guest. - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM. Generic: - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Remove .change_pte() MMU notifier - the changes to non-KVM code are small and Andrew Morton asked that I also take those through the KVM tree. The callback was only ever implemented by KVM (which was also the original user of MMU notifiers) but it had been nonfunctional ever since calls to set_pte_at_notify were wrapped with invalidate_range_start and invalidate_range_end... in 2012. Selftests: - Enhance the demand paging test to allow for better reporting and stressing of UFFD performance. - Convert the steal time test to generate TAP-friendly output. - Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed time across two different clock domains. - Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT. - Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell script, to play nice with running in a minimal userspace environment. - Allow skipping the RSEQ test's sanity check that the vCPU was able to complete a reasonable number of KVM_RUNs, as the assert can fail on a completely valid setup. If the test is run on a large-ish system that is otherwise idle, and the test isn't affined to a low-ish number of CPUs, the vCPU task can be repeatedly migrated to CPUs that are in deep sleep states, which results in the vCPU having very little net runtime before the next migration due to high wakeup latencies. - Define _GNU_SOURCE for all selftests to fix a warning that was introduced by a change to kselftest_harness.h late in the 6.9 cycle, and because forcing every test to #define _GNU_SOURCE is painful. - Provide a global pseudo-RNG instance for all tests, so that library code can generate random, but determinstic numbers. - Use the global pRNG to randomly force emulation of select writes from guest code on x86, e.g. to help validate KVM's emulation of locked accesses. - Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception handlers at VM creation, instead of forcing tests to manually trigger the related setup. Documentation: - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits) selftests/kvm: remove dead file KVM: selftests: arm64: Test vCPU-scoped feature ID registers KVM: selftests: arm64: Test that feature ID regs survive a reset KVM: selftests: arm64: Store expected register value in set_id_regs KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope KVM: arm64: Only reset vCPU-scoped feature ID regs once KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs() KVM: arm64: Rename is_id_reg() to imply VM scope KVM: arm64: Destroy mpidr_data for 'late' vCPU creation KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support KVM: arm64: Fix hvhe/nvhe early alias parsing KVM: SEV: Allow per-guest configuration of GHCB protocol version KVM: SEV: Add GHCB handling for termination requests KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests KVM: SEV: Add support to handle AP reset MSR protocol KVM: x86: Explicitly zero kvm_caps during vendor module load KVM: x86: Fully re-initialize supported_mce_cap on vendor module load KVM: x86: Fully re-initialize supported_vm_types on vendor module load KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values ...
9 daysMerge branch 'for-6.10' into test-merge-for-6.10Tejun Heo1-178/+275
9 daysMerge tag 'modules-6.10-rc1' of ↵Linus Torvalds7-89/+112
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull modules updates from Luis Chamberlain: "Finally something fun. Mike Rapoport does some cleanup to allow us to take out module_alloc() out of modules into a new paint shedded execmem_alloc() and execmem_free() so to make emphasis these helpers are actually used outside of modules. It starts with a non-functional changes API rename / placeholders to then allow architectures to define their requirements into a new shiny struct execmem_info with ranges, and requirements for those ranges. Archs now can intitialize this execmem_info as the last part of mm_core_init() if they have to diverge from the norm. Each range is a known type clearly articulated and spelled out in enum execmem_type. Although a lot of this is major cleanup and prep work for future enhancements an immediate clear gain is we get to enable KPROBES without MODULES now. That is ultimately what motiviated to pick this work up again, now with smaller goal as concrete stepping stone" * tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of kprobes: remove dependency on CONFIG_MODULES powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate x86/ftrace: enable dynamic ftrace without CONFIG_MODULES arch: make execmem setup available regardless of CONFIG_MODULES powerpc: extend execmem_params for kprobes allocations arm64: extend execmem_info for generated code allocations riscv: extend execmem_params for generated code allocations mm/execmem, arch: convert remaining overrides of module_alloc to execmem mm/execmem, arch: convert simple overrides of module_alloc to execmem mm: introduce execmem_alloc() and execmem_free() module: make module_memory_{alloc,free} more self-contained sparc: simplify module_alloc() nios2: define virtual address space for modules mips: module: rename MODULE_START to MODULES_VADDR arm64: module: remove unneeded call to kasan_alloc_module_shadow() kallsyms: replace deprecated strncpy with strscpy module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.
10 daysMerge tag 'livepatching-for-6.10' of ↵Linus Torvalds3-31/+31
git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching Pull livepatching update from Petr Mladek: - Use more informative names for the livepatch transition states * tag 'livepatching-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching: livepatch: Rename KLP_* to KLP_TRANSITION_*
10 daysMerge tag 'printk-for-6.10' of ↵Linus Torvalds3-15/+15
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Use no_printk() instead of "if (0) printk()" constructs to avoid generating printk index for messages disabled at compile time - Remove deprecated strncpy/strcpy from printk.c - Remove redundant CONFIG_BASE_FULL in favor of CONFIG_BASE_SMALL * tag 'printk-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: cleanup deprecated uses of strncpy/strcpy printk: Remove redundant CONFIG_BASE_FULL printk: Change type of CONFIG_BASE_SMALL to bool printk: Fix LOG_CPU_MAX_BUF_SHIFT when BASE_SMALL is enabled ceph: Use no_printk() helper dyndbg: Use *no_printk() helpers dev_printk: Add and use dev_no_printk() printk: Let no_printk() use _printk()
10 daysbpf: save extended inner map info for percpu array maps as wellAndrii Nakryiko1-2/+2
ARRAY_OF_MAPS and HASH_OF_MAPS map types have special logic to save a few extra fields required for correct operations of ARRAY maps, when they are used as inner maps. PERCPU_ARRAY maps have similar requirements as they now support generating inline element lookup logic. So make sure that both classes of maps are handled correctly. Reported-by: Jakub Kicinski <kuba@kernel.org> Fixes: db69718b8efa ("bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240515062440.846086-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 daysMerge branch 'for-6.10-base-small' into for-linusPetr Mladek2-2/+2
10 daysring-buffer: Add cast to unsigned long addr passed to virt_to_page()Steven Rostedt (Google)1-1/+1
The sub-buffer pages are held in an unsigned long array, and when it is passed to virt_to_page() a cast is needed. Link: https://lore.kernel.org/all/20240515124808.06279d04@canb.auug.org.au/ Link: https://lore.kernel.org/linux-trace-kernel/20240515010558.4abaefdd@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 117c39200d9d ("ring-buffer: Introducing ring-buffer mapping functions") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
10 daysMerge tag 'net-next-6.10' of ↵Linus Torvalds22-393/+1544
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Complete rework of garbage collection of AF_UNIX sockets. AF_UNIX is prone to forming reference count cycles due to fd passing functionality. New method based on Tarjan's Strongly Connected Components algorithm should be both faster and remove a lot of workarounds we accumulated over the years. - Add TCP fraglist GRO support, allowing chaining multiple TCP packets and forwarding them together. Useful for small switches / routers which lack basic checksum offload in some scenarios (e.g. PPPoE). - Support using SMP threads for handling packet backlog i.e. packet processing from software interfaces and old drivers which don't use NAPI. This helps move the processing out of the softirq jumble. - Continue work of converting from rtnl lock to RCU protection. Don't require rtnl lock when reading: IPv6 routing FIB, IPv6 address labels, netdev threaded NAPI sysfs files, bonding driver's sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics, TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot of the link information available via rtnetlink. - Small optimizations from Eric to UDP wake up handling, memory accounting, RPS/RFS implementation, TCP packet sizing etc. - Allow direct page recycling in the bulk API used by XDP, for +2% PPS. - Support peek with an offset on TCP sockets. - Add MPTCP APIs for querying last time packets were received/sent/acked and whether MPTCP "upgrade" succeeded on a TCP socket. - Add intra-node communication shortcut to improve SMC performance. - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol driver. - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver. - Add reset reasons for tracing what caused a TCP reset to be sent. - Introduce direction attribute for xfrm (IPSec) states. State can be used either for input or output packet processing. Things we sprinkled into general kernel code: - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS(). This required touch-ups and renaming of a few existing users. - Add Endian-dependent __counted_by_{le,be} annotations. - Make building selftests "quieter" by printing summaries like "CC object.o" rather than full commands with all the arguments. Netfilter: - Use GFP_KERNEL to clone elements, to deal better with OOM situations and avoid failures in the .commit step. BPF: - Add eBPF JIT for ARCv2 CPUs. - Support attaching kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. "Session mode" is a common use-case for tetragon and bpftrace. - Add the ability to specify and retrieve BPF cookie for raw tracepoint programs in order to ease migration from classic to raw tracepoints. - Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86, ARM64 and RISC-V JITs. This allows inlining functions which need to access per-CPU state. - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction. Support BPF arena on ARM64. - Add a new bpf_wq API for deferring events and refactor process-context bpf_timer code to keep common code where possible. - Harden the BPF verifier's and/or/xor value tracking. - Introduce crypto kfuncs to let BPF programs call kernel crypto APIs. - Support bpf_tail_call_static() helper for BPF programs with GCC 13. - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled. Driver API: - Skip software TC processing completely if all installed rules are marked as HW-only, instead of checking the HW-only flag rule by rule. - Add support for configuring PoE (Power over Ethernet), similar to the already existing support for PoDL (Power over Data Line) config. - Initial bits of a queue control API, for now allowing a single queue to be reset without disturbing packet flow to other queues. - Common (ethtool) statistics for hardware timestamping. Tests and tooling: - Remove the need to create a config file to run the net forwarding tests so that a naive "make run_tests" can exercise them. - Define a method of writing tests which require an external endpoint to communicate with (to send/receive data towards the test machine). Add a few such tests. - Create a shared code library for writing Python tests. Expose the YAML Netlink library from tools/ to the tests for easy Netlink access. - Move netfilter tests under net/, extend them, separate performance tests from correctness tests, and iron out issues found by running them "on every commit". - Refactor BPF selftests to use common network helpers. - Further work filling in YAML definitions of Netlink messages for: nftables, team driver, bonding interfaces, vlan interfaces, VF info, TC u32 mark, TC police action. - Teach Python YAML Netlink to decode attribute policies. - Extend the definition of the "indexed array" construct in the specs to cover arrays of scalars rather than just nests. - Add hyperlinks between definitions in generated Netlink docs. Drivers: - Make sure unsupported flower control flags are rejected by drivers, and make more drivers report errors directly to the application rather than dmesg (large number of driver changes from Asbjørn Sloth Tønnesen). - Ethernet high-speed NICs: - Broadcom (bnxt): - support multiple RSS contexts and steering traffic to them - support XDP metadata - make page pool allocations more NUMA aware - Intel (100G, ice, idpf): - extract datapath code common among Intel drivers into a library - use fewer resources in switchdev by sharing queues with the PF - add PFCP filter support - add Ethernet filter support - use a spinlock instead of HW lock in PTP clock ops - support 5 layer Tx scheduler topology - nVidia/Mellanox: - 800G link modes and 100G SerDes speeds - per-queue IRQ coalescing configuration - Marvell Octeon: - support offloading TC packet mark action - Ethernet NICs consumer, embedded and virtual: - stop lying about skb->truesize in USB Ethernet drivers, it messes up TCP memory calculations - Google cloud vNIC: - support changing ring size via ethtool - support ring reset using the queue control API - VirtIO net: - expose flow hash from RSS to XDP - per-queue statistics - add selftests - Synopsys (stmmac): - support controllers which require an RX clock signal from the MII bus to perform their hardware initialization - TI: - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices - icssg_prueth: add SW TX / RX Coalescing based on hrtimers - cpsw: minimal XDP support - Renesas (ravb): - support describing the MDIO bus - Realtek (r8169): - add support for RTL8168M - Microchip Sparx5: - matchall and flower actions mirred and redirect - Ethernet switches: - nVidia/Mellanox: - improve events processing performance - Marvell: - add support for MV88E6250 family internal PHYs - Microchip: - add DCB and DSCP mapping support for KSZ switches - vsc73xx: convert to PHYLINK - Realtek: - rtl8226b/rtl8221b: add C45 instances and SerDes switching - Many driver changes related to PHYLIB and PHYLINK deprecated API cleanup - Ethernet PHYs: - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY. - micrel: lan8814: add support for PPS out and external timestamp trigger - WiFi: - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices drivers. Modern devices can only be configured using nl80211. - mac80211/cfg80211 - handle color change per link for WiFi 7 Multi-Link Operation - Intel (iwlwifi): - don't support puncturing in 5 GHz - support monitor mode on passive channels - BZ-W device support - P2P with HE/EHT support - re-add support for firmware API 90 - provide channel survey information for Automatic Channel Selection - MediaTek (mt76): - mt7921 LED control - mt7925 EHT radiotap support - mt7920e PCI support - Qualcomm (ath11k): - P2P support for QCA6390, WCN6855 and QCA2066 - support hibernation - ieee80211-freq-limit Device Tree property support - Qualcomm (ath12k): - refactoring in preparation of multi-link support - suspend and hibernation support - ACPI support - debugfs support, including dfs_simulate_radar support - RealTek: - rtw88: RTL8723CS SDIO device support - rtw89: RTL8922AE Wi-Fi 7 PCI device support - rtw89: complete features of new WiFi 7 chip 8922AE including BT-coexistence and Wake-on-WLAN - rtw89: use BIOS ACPI settings to set TX power and channels - rtl8xxxu: enable Management Frame Protection (MFP) support - Bluetooth: - support for Intel BlazarI and Filmore Peak2 (BE201) - support for MediaTek MT7921S SDIO - initial support for Intel PCIe BT driver - remove HCI_AMP support" * tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1827 commits) selftests: netfilter: fix packetdrill conntrack testcase net: gro: fix napi_gro_cb zeroed alignment Bluetooth: btintel_pcie: Refactor and code cleanup Bluetooth: btintel_pcie: Fix warning reported by sparse Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1 Bluetooth: btintel: Fix compiler warning for multi_v7_defconfig config Bluetooth: btintel_pcie: Fix compiler warnings Bluetooth: btintel_pcie: Add *setup* function to download firmware Bluetooth: btintel_pcie: Add support for PCIe transport Bluetooth: btintel: Export few static functions Bluetooth: HCI: Remove HCI_AMP support Bluetooth: L2CAP: Fix div-by-zero in l2cap_le_flowctl_init() Bluetooth: qca: Fix error code in qca_read_fw_build_info() Bluetooth: hci_conn: Use __counted_by() and avoid -Wfamnae warning Bluetooth: btintel: Add support for Filmore Peak2 (BE201) Bluetooth: btintel: Add support for BlazarI LE Create Connection command timeout increased to 20 secs dt-bindings: net: bluetooth: Add MediaTek MT7921S SDIO Bluetooth Bluetooth: compute LE flow credits based on recvbuf space Bluetooth: hci_sync: Use cmd->num_cis instead of magic number ...
11 daysMerge tag 'pm-6.10-rc1' of ↵Linus Torvalds2-20/+88
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These are mostly cpufreq updates, including a significant intel-pstate driver update and several amd-pstate improvements plus some updates of ARM cpufreq drivers, general fixes and cleanups. Also included are changes related to system sleep, power capping updates adding support for a new platform and a new hardware feature (among other things), a Samsung exynos-asv driver update allowing it to change its Energy Model after adjusting voltage, minor cpuidle and devfreq updates and a small documentation cleanup. Specifics: - Rework the handling of disabled turbo in the intel_pstate driver and make it update the maximum CPU frequency consistently regardless of the reason on top of a number of cleanups (Rafael Wysocki) - Add missing checks for NULL .exit() cpufreq driver callback to the cpufreq core (Viresh Kumar) - Prevent pulicy->max from going above the frequency QoS maximum value when cpufreq_frequency_table_verify() is used (Xuewen Yan) - Prevent a negative CPU number or frequency value from being printed if they are really large (Joshua Yeong) - Update MAINTAINERS entry for amd-pstate to add two new submaintainers and a designated reviewer (Huang Rui) - Clean up the amd-pstate driver and update its documentation (Gautham Shenoy) - Fix the highest frequency issue in the amd-pstate driver which limits performance (Perry Yuan) - Enable CPPC v2 for certain processors in the family 17H, as requested by TR40 processor users who expect improved performance and lower system temperature (Perry Yuan) - Change latency and delay values to be read from platform firmware firstly for more accurate timing (Perry Yuan) - A new quirk is introduced for supporting amd-pstate on legacy processors which either lack CPPC capability, or only only have CPPC v2 capability (Perry Yuan) - Sun50i cpufreq: Add support for opp_supported_hw, H616 platform and general cleanups (Andre Przywara, Martin Botka, Brandon Cheo Fusi, Dan Carpenter, Viresh Kumar) - CPPC cpufreq: Fix possible null pointer dereference (Aleksandr Mishin) - Eliminate uses of of_node_put() from cpufreq (Javier Carrasco, Shivani Gupta) - brcmstb-avs: ISO C90 forbids mixed declarations (Portia Stephens) - mediatek cpufreq: Add support for MT7988A (Sam Shih) - cpufreq-qcom-hw: Add SM4450 compatibles in DT bindings (Tengfei Fan) - Fix struct cpudata::epp_cached kernel-doc in the intel_pstate cpufreq driver (Jeff Johnson) - Fix kerneldoc description of ladder_do_selection() (Jeff Johnson) - Convert the cpuidle kirkwood driver to platform remove callback returning void (Yangtao Li) - Replace deprecated strncpy() with strscpy() in the hibernation core code (Justin Stitt) - Use %ps to simplify debug output in the core system-wide suspend and resume code (Len Brown) - Remove unnecessary else from device_init_wakeup() and make device_wakeup_disable() return void (Dhruva Gole) - Enable PMU support in the Intel TPMI RAPL driver (Zhang Rui) - Add support for ArrowLake-H platform to the Intel RAPL driver (Zhang Rui) - Avoid explicit cpumask allocation on stack in DTPM (Dawei Li) - Make the Samsung exynos-asv driver update the Energy Model after adjusting voltage on top of some preliminary changes of the OPP and Enery Model generic code (Lukasz Luba) - Remove a reference to a function that has been dropped from the power management documentation (Bjorn Helgaas) - Convert the platfrom remove callback to .remove_new for the exyno-nocp, exynos-ppmu, mtk-cci-devfreq, sun8i-a33-mbus, and rk3399_dmc devfreq drivers (Uwe Kleine-König) - Use DEFINE_SIMPLE_PM_OPS for exyno-bus.c driver (Anand Moon)" * tag 'pm-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (68 commits) PM / devfreq: exynos: Use DEFINE_SIMPLE_DEV_PM_OPS for PM functions PM / devfreq: rk3399_dmc: Convert to platform remove callback returning void PM / devfreq: sun8i-a33-mbus: Convert to platform remove callback returning void PM / devfreq: mtk-cci: Convert to platform remove callback returning void PM / devfreq: exynos-ppmu: Convert to platform remove callback returning void PM / devfreq: exynos-nocp: Convert to platform remove callback returning void cpufreq: amd-pstate: fix the highest frequency issue which limits performance cpufreq: intel_pstate: fix struct cpudata::epp_cached kernel-doc cpuidle: ladder: fix ladder_do_selection() kernel-doc powercap: intel_rapl_tpmi: Enable PMU support powercap: intel_rapl: Introduce APIs for PMU support PM: hibernate: replace deprecated strncpy() with strscpy() cpufreq: Fix up printing large CPU numbers and frequency values MAINTAINERS: cpufreq: amd-pstate: Add co-maintainers and reviewer cpufreq: amd-pstate: remove unused variable lowest_nonlinear_freq cpufreq: amd-pstate: fix code format problems cpufreq: amd-pstate: Add quirk for the pstate CPPC capabilities missing cppc_acpi: print error message if CPPC is unsupported cpufreq: amd-pstate: get transition delay and latency value from ACPI tables cpufreq: amd-pstate: Bail out if min/max/nominal_freq is 0 ...
11 dayscgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepointsJesper Dangaard Brouer1-12/+58
This closely resembles helpers added for the global cgroup_rstat_lock in commit fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock. Based on production workloads, we observe the fast-path "update" function cgroup_rstat_updated() is invoked around 3 million times per sec, while the "flush" function cgroup_rstat_flush_locked(), walking each possible CPU, can see periodic spikes of 700 invocations/sec. For this reason, the tracepoints are split into normal and fastpath versions for this per-CPU lock. Making it feasible for production to continuously monitor the non-fastpath tracepoint to detect lock contention issues. The reason for monitoring is that lock disables IRQs which can disturb e.g. softirq processing on the local CPUs involved. When the global cgroup_rstat_lock stops disabling IRQs (e.g converted to a mutex), this per CPU lock becomes the next bottleneck that can introduce latency variations. A practical bpftrace script for monitoring contention latency: bpftrace -e ' tracepoint:cgroup:cgroup_rstat_cpu_lock_contended { @start[tid]=nsecs; @cnt[probe]=count()} tracepoint:cgroup:cgroup_rstat_cpu_locked { if (args->contended) { @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);} @cnt[probe]=count()} interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt); clear(@cnt);}' Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
11 daysMerge tag 'linux_kselftest-kunit-6.10-rc1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kunit updates from Shuah Khan: - fix race condition in try-catch completion - change __kunit_test_suites_init() to exit early if there is nothing to test - change string-stream-test to use KUNIT_DEFINE_ACTION_WRAPPER - move fault tests behind KUNIT_FAULT_TEST Kconfig option - kthread test fixes and improvements - iov_iter test fixes * tag 'linux_kselftest-kunit-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: bail out early in __kunit_test_suites_init() if there are no suites to test kunit: string-stream-test: use KUNIT_DEFINE_ACTION_WRAPPER kunit: test: Move fault tests behind KUNIT_FAULT_TEST Kconfig option kunit: unregister the device on error kunit: Fix race condition in try-catch completion kunit: Add tests for fault kunit: Print last test location on fault kunit: Fix KUNIT_SUCCESS() calls in iov_iter tests kunit: Handle test faults kunit: Fix timeout message kunit: Fix kthread reference kunit: Handle thread creation error
11 daysMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+5
Merge in late fixes to prepare for the 6.10 net-next PR. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysMerge tag 'irq-core-2024-05-12' of ↵Linus Torvalds9-53/+311
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt subsystem updates from Thomas Gleixner: "Core code: - Interrupt storm detection for the lockup watchdog: Lockups which are caused by interrupt storms are not easy to debug because there is no information about the events which make the lockup detector trigger. To make this more user friendly, provide an extenstion to interrupt statistics which allows to take snapshots and an interface to retrieve the delta to the snapshot. Use this new mechanism in the watchdog code to do a two stage lockup analysis by taking the snapshot and printing the deltas for the topmost active interrupts on the second trigger. Note: This contains both the interrupt and the watchdog changes as the latter depend on the former obviously. - Avoid summation loops in the /proc/interrupts output and use the global counter when possible - Skip suspended interrupts on CPU hotplug operations to ensure that they are not delivered before the system resumes the device drivers when coming out of suspend. - On CPU hot-unplug interrupts which are affine to the outgoing CPU are migrated to a different CPU in the affinity mask. This can fail when the CPUs have no vectors left. Instead of giving up try to migrate it to any online CPU and thereby breaking the affinity setting in order to prevent a stale device interrupt which targets an offline CPU - The usual small cleanups Driver code: - Support for the RISCV AIA MSI controller - Make the interrupt allocation for the Loongson PCH controller more flexible to prevent vector exhaustion - The usual set of cleanups and fixes all over the place" * tag 'irq-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) irqchip/gic-v3-its: Remove BUG_ON in its_vpe_irq_domain_alloc cpuidle: Avoid explicit cpumask allocation on stack irqchip/sifive-plic: Avoid explicit cpumask allocation on stack irqchip/riscv-aplic-direct: Avoid explicit cpumask allocation on stack irqchip/loongson-eiointc: Avoid explicit cpumask allocation on stack irqchip/gic-v3-its: Avoid explicit cpumask allocation on stack irqchip/irq-bcm6345-l1: Avoid explicit cpumask allocation on stack cpumask: Introduce cpumask_first_and_and() irqchip/irq-brcmstb-l2: Avoid saving mask on shutdown genirq: Reuse irq_is_nmi() genirq/cpuhotplug: Retry with cpu_online_mask when migration fails genirq/cpuhotplug: Skip suspended interrupts when restoring affinity arm64: dts: st: Add interrupt parent to pinctrl on stm32mp251 arm64: dts: st: Add exti1 and exti2 nodes on stm32mp251 ARM: dts: stm32: List exti parent interrupts on stm32mp131 ARM: dts: stm32: List exti parent interrupts on stm32mp151 arm64: Kconfig.platforms: Enable STM32_EXTI for ARCH_STM32 irqchip/stm32-exti: Mark events reserved with RIF configuration check irqchip/stm32-exti: Skip secure events irqchip/stm32-exti: Convert driver to standard PM ...
11 daysMerge tag 'timers-core-2024-05-12' of ↵Linus Torvalds6-104/+87
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers and timekeeping updates from Thomas Gleixner: "Core code: - Make timekeeping and VDSO time readouts resilent against math overflow: In guest context the kernel is prone to math overflow when the host defers the timer interrupt due to overload, malfunction or malice. This can be mitigated by checking the clocksource delta for the maximum deferrement which is readily available. If that value is exceeded then the code uses a slowpath function which can handle the multiplication overflow. This functionality is enabled unconditionally in the kernel, but made conditional in the VDSO code. The latter is conditional because it allows architectures to optimize the check so it is not causing performance regressions. On X86 this is achieved by reworking the existing check for negative TSC deltas as a negative delta obviously exceeds the maximum deferrement when it is evaluated as an unsigned value. That avoids two conditionals in the hotpath and allows to hide both the negative delta and the large delta handling in the same slow path. - Add an initial minimal ktime_t abstraction for Rust - The usual boring cleanups and enhancements Drivers: - Boring updates to device trees and trivial enhancements in various drivers" * tag 'timers-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) clocksource/drivers/arm_arch_timer: Mark hisi_161010101_oem_info const clocksource/drivers/timer-ti-dm: Remove an unused field in struct dmtimer clocksource/drivers/renesas-ostm: Avoid reprobe after successful early probe clocksource/drivers/renesas-ostm: Allow OSTM driver to reprobe for RZ/V2H(P) SoC dt-bindings: timer: renesas: ostm: Document Renesas RZ/V2H(P) SoC rust: time: doc: Add missing C header links clocksource: Make the int help prompt unit readable in ncurses hrtimer: Rename __hrtimer_hres_active() to hrtimer_hres_active() timerqueue: Remove never used function timerqueue_node_expires() rust: time: Add Ktime vdso: Fix powerpc build U64_MAX undeclared error clockevents: Convert s[n]printf() to sysfs_emit() clocksource: Convert s[n]printf() to sysfs_emit() clocksource: Make watchdog and suspend-timing multiplication overflow safe timekeeping: Let timekeeping_cycles_to_ns() handle both under and overflow timekeeping: Make delta calculation overflow safe timekeeping: Prepare timekeeping_cycles_to_ns() for overflow safety timekeeping: Fold in timekeeping_delta_to_ns() timekeeping: Consolidate timekeeping helpers timekeeping: Refactor timekeeping helpers ...
11 daysftrace: Fix possible use-after-free issue in ftrace_location()Zheng Yejian1-16/+23
KASAN reports a bug: BUG: KASAN: use-after-free in ftrace_location+0x90/0x120 Read of size 8 at addr ffff888141d40010 by task insmod/424 CPU: 8 PID: 424 Comm: insmod Tainted: G W 6.9.0-rc2+ [...] Call Trace: <TASK> dump_stack_lvl+0x68/0xa0 print_report+0xcf/0x610 kasan_report+0xb5/0xe0 ftrace_location+0x90/0x120 register_kprobe+0x14b/0xa40 kprobe_init+0x2d/0xff0 [kprobe_example] do_one_initcall+0x8f/0x2d0 do_init_module+0x13a/0x3c0 load_module+0x3082/0x33d0 init_module_from_file+0xd2/0x130 __x64_sys_finit_module+0x306/0x440 do_syscall_64+0x68/0x140 entry_SYSCALL_64_after_hwframe+0x71/0x79 The root cause is that, in lookup_rec(), ftrace record of some address is being searched in ftrace pages of some module, but those ftrace pages at the same time is being freed in ftrace_release_mod() as the corresponding module is being deleted: CPU1 | CPU2 register_kprobes() { | delete_module() { check_kprobe_address_safe() { | arch_check_ftrace_location() { | ftrace_location() { | lookup_rec() // USE! | ftrace_release_mod() // Free! To fix this issue: 1. Hold rcu lock as accessing ftrace pages in ftrace_location_range(); 2. Use ftrace_location_range() instead of lookup_rec() in ftrace_location(); 3. Call synchronize_rcu() before freeing any ftrace pages both in ftrace_process_locs()/ftrace_release_mod()/ftrace_free_mem(). Link: https://lore.kernel.org/linux-trace-kernel/20240509192859.1273558-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Cc: <mhiramat@kernel.org> Cc: <mark.rutland@arm.com> Cc: <mathieu.desnoyers@efficios.com> Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization") Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysbpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES ofMike Rapoport (IBM)1-1/+1
BPF just-in-time compiler depended on CONFIG_MODULES because it used module_alloc() to allocate memory for the generated code. Since code allocations are now implemented with execmem, drop dependency of CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM. Suggested-by: Björn Töpel <bjorn@kernel.org> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
11 dayskprobes: remove dependency on CONFIG_MODULESMike Rapoport (IBM)2-22/+53
kprobes depended on CONFIG_MODULES because it has to allocate memory for code. Since code allocations are now implemented with execmem, kprobes can be enabled in non-modular kernels. Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the dependency of CONFIG_KPROBES on CONFIG_MODULES. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> [mcgrof: rebase in light of NEED_TASKS_RCU ] Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
11 daysmm/execmem, arch: convert remaining overrides of module_alloc to execmemMike Rapoport (IBM)1-20/+6
Extend execmem parameters to accommodate more complex overrides of module_alloc() by architectures. This includes specification of a fallback range required by arm, arm64 and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for allocation of KASAN shadow required by s390 and x86 and support for late initialization of execmem required by arm64. The core implementation of execmem_alloc() takes care of suppressing warnings when the initial allocation fails but there is a fallback range defined. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Song Liu <song@kernel.org> Tested-by: Liviu Dudau <liviu@dudau.co.uk> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
11 daysmm: introduce execmem_alloc() and execmem_free()Mike Rapoport (IBM)4-24/+16
module_alloc() is used everywhere as a mean to allocate memory for code. Beside being semantically wrong, this unnecessarily ties all subsystems that need to allocate code, such as ftrace, kprobes and BPF to modules and puts the burden of code allocation to the modules code. Several architectures override module_alloc() because of various constraints where the executable memory can be located and this causes additional obstacles for improvements of code allocation. Start splitting code allocation from modules by introducing execmem_alloc() and execmem_free() APIs. Initially, execmem_alloc() is a wrapper for module_alloc() and execmem_free() is a replacement of module_memfree() to allow updating all call sites to use the new APIs. Since architectures define different restrictions on placement, permissions, alignment and other parameters for memory that can be used by different subsystems that allocate executable memory, execmem_alloc() takes a type argument, that will be used to identify the calling subsystem and to allow architectures define parameters for ranges suitable for that subsystem. No functional changes. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
11 daysmodule: make module_memory_{alloc,free} more self-containedMike Rapoport (IBM)1-25/+39
Move the logic related to the memory allocation and freeing into module_memory_alloc() and module_memory_free(). Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
11 dayskallsyms: replace deprecated strncpy with strscpyJustin Stitt1-1/+1
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. The goal is to remove its use completely [2]. namebuf is eventually cleaned of any trailing llvm suffixes using strstr(). This hints that namebuf should be NUL-terminated. static void cleanup_symbol_name(char *s) { char *res; ... res = strstr(s, ".llvm."); ... } Due to this, use strscpy() over strncpy() as it guarantees NUL-termination on the destination buffer. Drop the -1 from the length calculation as it is no longer needed to ensure NUL-termination. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html Link: https://github.com/KSPP/linux/issues/90 [2] Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
11 daysmodule: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.Yifan Hong1-1/+1
If UNUSED_KSYMS_WHITELIST is a file generated before Kbuild runs, and the source tree is in a read-only filesystem, the developer must put the file somewhere and specify an absolute path to UNUSED_KSYMS_WHITELIST. This worked, but if IKCONFIG=y, an absolute path is embedded into .config and eventually into vmlinux, causing the build to be less reproducible when building on a different machine. This patch makes the handling of UNUSED_KSYMS_WHITELIST to be similar to MODULE_SIG_KEY. First, check if UNUSED_KSYMS_WHITELIST is an absolute path, just as before this patch. If so, use the path as is. If it is a relative path, use wildcard to check the existence of the file below objtree first. If it does not exist, fall back to the original behavior of adding $(srctree)/ before the value. After this patch, the developer can put the generated file in objtree, then use a relative path against objtree in .config, eradicating any absolute paths that may be evaluated differently on different machines. Signed-off-by: Yifan Hong <elsk@google.com> Reviewed-by: Elliot Berman <quic_eberman@quicinc.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
11 daysftrace: Remove unused global 'ftrace_direct_func_count'Dr. David Alan Gilbert2-12/+0
Commit 8788ca164eb4b ("ftrace: Remove the legacy _ftrace_direct API") stopped setting the 'ftrace_direct_func_count' variable, but left it around. Clean it up. Link: https://lore.kernel.org/linux-trace-kernel/20240506233305.215735-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysftrace: Remove unused list 'ftrace_direct_funcs'Dr. David Alan Gilbert1-8/+0
Commit 8788ca164eb4b ("ftrace: Remove the legacy _ftrace_direct API") stopped using 'ftrace_direct_funcs' (and the associated struct ftrace_direct_func). Remove them. Build tested only (on x86-64 with FTRACE and DYNAMIC_FTRACE enabled) Link: https://lore.kernel.org/linux-trace-kernel/20240504132303.67538-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysMerge tag 'sched-core-2024-05-13' of ↵Linus Torvalds11-325/+381
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Add cpufreq pressure feedback for the scheduler - Rework misfit load-balancing wrt affinity restrictions - Clean up and simplify the code around ::overutilized and ::overload access. - Simplify sched_balance_newidle() - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES handling that changed the output. - Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch() - Reorganize, clean up and unify most of the higher level scheduler balancing function names around the sched_balance_*() prefix - Simplify the balancing flag code (sched_balance_running) - Miscellaneous cleanups & fixes * tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) sched/pelt: Remove shift of thermal clock sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure() thermal/cpufreq: Remove arch_update_thermal_pressure() sched/cpufreq: Take cpufreq feedback into account cpufreq: Add a cpufreq pressure feedback for the scheduler sched/fair: Fix update of rd->sg_overutilized sched/vtime: Do not include <asm/vtime.h> header s390/irq,nmi: Include <asm/vtime.h> header directly s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover sched/vtime: Get rid of generic vtime_task_switch() implementation sched/vtime: Remove confusing arch_vtime_task_switch() declaration sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized() sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded() sched/fair: Rename root_domain::overload to ::overloaded sched/fair: Use helper functions to access root_domain::overload sched/fair: Check root_domain::overload value before update sched/fair: Combine EAS check with root_domain::overutilized access sched/fair: Simplify the continue_balancing logic in sched_balance_newidle() ...
11 daysMerge tag 'perf-core-2024-05-13' of ↵Linus Torvalds2-129/+148
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events updates from Ingo Molnar: - Combine perf and BPF for fast evalution of HW breakpoint conditions - Add LBR capture support outside of hardware events - Trigger IO signals for watermark_wakeup - Add RAPL support for Intel Arrow Lake and Lunar Lake - Optimize frequency-throttling - Miscellaneous cleanups & fixes * tag 'perf-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) perf/bpf: Mark perf_event_set_bpf_handler() and perf_event_free_bpf_handler() as inline too selftests/perf_events: Test FASYNC with watermark wakeups perf/ring_buffer: Trigger IO signals for watermark_wakeup perf: Move perf_event_fasync() to perf_event.h perf/bpf: Change the !CONFIG_BPF_SYSCALL stubs to static inlines selftest/bpf: Test a perf BPF program that suppresses side effects perf/bpf: Allow a BPF program to suppress all sample side effects perf/bpf: Remove unneeded uses_default_overflow_handler() perf/bpf: Call BPF handler directly, not through overflow machinery perf/bpf: Remove #ifdef CONFIG_BPF_SYSCALL from struct perf_event members perf/bpf: Create bpf_overflow_handler() stub for !CONFIG_BPF_SYSCALL perf/bpf: Reorder bpf_overflow_handler() ahead of __perf_event_overflow() perf/x86/rapl: Add support for Intel Lunar Lake perf/x86/rapl: Add support for Intel Arrow Lake perf/core: Reduce PMU access to adjust sample freq perf/core: Optimize perf_adjust_freq_unthr_context() perf/x86/amd: Don't reject non-sampling events with configured LBR perf/x86/amd: Support capturing LBR from software events perf/x86/amd: Avoid taking branches before disabling LBR perf/x86/amd: Ensure amd_pmu_core_disable_all() is always inlined ...
11 daysMerge tag 'locking-core-2024-05-13' of ↵Linus Torvalds5-38/+83
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Over a dozen code generation micro-optimizations for the atomic and spinlock code - Add more __ro_after_init attributes - Robustify the lockdevent_*() macros * tag 'locking-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/pvqspinlock/x86: Use _Q_LOCKED_VAL in PV_UNLOCK_ASM macro locking/qspinlock/x86: Micro-optimize virt_spin_lock() locking/atomic/x86: Merge __arch{,_try}_cmpxchg64_emu_local() with __arch{,_try}_cmpxchg64_emu() locking/atomic/x86: Introduce arch_try_cmpxchg64_local() locking/pvqspinlock/x86: Remove redundant CMP after CMPXCHG in __raw_callee_save___pv_queued_spin_unlock() locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h locking/pvqspinlock: Use try_cmpxchg_acquire() in trylock_clear_pending() locking/qspinlock: Use atomic_try_cmpxchg_relaxed() in xchg_tail() locking/atomic/x86: Define arch_atomic_sub() family using arch_atomic_add() functions locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32 locking/atomic/x86: Introduce arch_atomic64_try_cmpxchg() to x86_32 locking/atomic/x86: Introduce arch_try_cmpxchg64() for !CONFIG_X86_CMPXCHG64 locking/atomic/x86: Modernize x86_32 arch_{,try_}_cmpxchg64{,_local}() locking/atomic/x86: Correct the definition of __arch_try_cmpxchg128() x86/tsc: Make __use_tsc __ro_after_init x86/kvm: Make kvm_async_pf_enabled __ro_after_init context_tracking: Make context_tracking_key __ro_after_init jump_label,module: Don't alloc static_key_mod for __ro_after_init keys locking/qspinlock: Always evaluate lockevent* non-event parameter once
11 daystracing: Improve benchmark test performance by using do_div()Thorsten Blum1-1/+1
Partially revert commit d6cb38e10810 ("tracing: Use div64_u64() instead of do_div()") and use do_div() again to utilize its faster 64-by-32 division compared to the 64-by-64 division done by div64_u64(). Explicitly cast the divisor bm_cnt to u32 to prevent a Coccinelle warning reported by do_div.cocci. The warning was removed with commit d6cb38e10810 ("tracing: Use div64_u64() instead of do_div()"). Using the faster 64-by-32 division and casting bm_cnt to u32 is safe because we return early from trace_do_benchmark() if bm_cnt > UINT_MAX. This approach is already used twice in trace_do_benchmark() when calculating the standard deviation: do_div(stddev, (u32)bm_cnt); do_div(stddev, (u32)bm_cnt - 1); Link: https://lore.kernel.org/linux-trace-kernel/20240329160229.4874-2-thorsten.blum@toblux.com Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysring-buffer: Have mmapped ring buffer keep track of missed eventsSteven Rostedt (Google)1-6/+47
While testing libtracefs on the mmapped ring buffer, the test that checks if missed events are accounted for failed when using the mapped buffer. This is because the mapped page does not update the missed events that were dropped because the writer filled up the ring buffer before the reader could catch it. Add the missed events to the reader page/sub-buffer when the IOCTL is done and a new reader page is acquired. Note that all accesses to the reader_page via rb_page_commit() had to be switched to rb_page_size(), and rb_page_size() which was just a copy of rb_page_commit() but now it masks out the RB_MISSED bits. This is needed as the mapped reader page is still active in the ring buffer code and where it reads the commit field of the bpage for the size, it now must mask it otherwise the missed bits that are now set will corrupt the size returned. Link: https://lore.kernel.org/linux-trace-kernel/20240312175405.12fb6726@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysMerge tag 'for-netdev' of ↵Jakub Kicinski9-136/+213
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-05-13 We've added 119 non-merge commits during the last 14 day(s) which contain a total of 134 files changed, 9462 insertions(+), 4742 deletions(-). The main changes are: 1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi. 2) Add BPF range computation improvements to the verifier in particular around XOR and OR operators, refactoring of checks for range computation and relaxing MUL range computation so that src_reg can also be an unknown scalar, from Cupertino Miranda. 3) Add support to attach kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. Session mode is a common use-case for tetragon and bpftrace, from Jiri Olsa. 4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf as well as BPF selftest's struct_ops handling, from Andrii Nakryiko. 5) Improvements to BPF selftests in context of BPF gcc backend, from Jose E. Marchesi & David Faust. 6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test- -style in order to retire the old test, run it in BPF CI and additionally expand test coverage, from Jordan Rife. 7) Big batch for BPF selftest refactoring in order to remove duplicate code around common network helpers, from Geliang Tang. 8) Another batch of improvements to BPF selftests to retire obsolete bpf_tcp_helpers.h as everything is available vmlinux.h, from Martin KaFai Lau. 9) Fix BPF map tear-down to not walk the map twice on free when both timer and wq is used, from Benjamin Tissoires. 10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL, from Alexei Starovoitov. 11) Change BTF build scripts to using --btf_features for pahole v1.26+, from Alan Maguire. 12) Small improvements to BPF reusing struct_size() and krealloc_array(), from Andy Shevchenko. 13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions, from Ilya Leoshkevich. 14) Extend TCP ->cong_control() callback in order to feed in ack and flag parameters and allow write-access to tp->snd_cwnd_stamp from BPF program, from Miao Xu. 15) Add support for internal-only per-CPU instructions to inline bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs, from Puranjay Mohan. 16) Follow-up to remove the redundant ethtool.h from tooling infrastructure, from Tushar Vyavahare. 17) Extend libbpf to support "module:<function>" syntax for tracing programs, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits) bpf: make list_for_each_entry portable bpf: ignore expected GCC warning in test_global_func10.c bpf: disable strict aliasing in test_global_func9.c selftests/bpf: Free strdup memory in xdp_hw_metadata selftests/bpf: Fix a few tests for GCC related warnings. bpf: avoid gcc overflow warning in test_xdp_vlan.c tools: remove redundant ethtool.h from tooling infra selftests/bpf: Expand ATTACH_REJECT tests selftests/bpf: Expand getsockname and getpeername tests sefltests/bpf: Expand sockaddr hook deny tests selftests/bpf: Expand sockaddr program return value tests selftests/bpf: Retire test_sock_addr.(c|sh) selftests/bpf: Remove redundant sendmsg test cases selftests/bpf: Migrate ATTACH_REJECT test cases selftests/bpf: Migrate expected_attach_type tests selftests/bpf: Migrate wildcard destination rewrite test selftests/bpf: Migrate sendmsg6 v4 mapped address tests selftests/bpf: Migrate sendmsg deny test cases selftests/bpf: Migrate WILDCARD_IP test selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases ... ==================== Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysftrace: Use asynchronous grace period for register_ftrace_direct()Paul E. McKenney1-4/+9
When running heavy test workloads with KASAN enabled, RCU Tasks grace periods can extend for many tens of seconds, significantly slowing trace registration. Therefore, make the registration-side RCU Tasks grace period be asynchronous via call_rcu_tasks(). Link: https://lore.kernel.org/linux-trace-kernel/ac05be77-2972-475b-9b57-56bef15aa00a@paulmck-laptop Reported-by: Jakub Kicinski <kuba@kernel.org> Reported-by: Alexei Starovoitov <ast@kernel.org> Reported-by: Chris Mason <clm@fb.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysftrace: Replaces simple_strtoul in ftraceYuran Pereira1-4/+3
The function simple_strtoul performs no error checking in scenarios where the input value overflows the intended output variable. This results in this function successfully returning, even when the output does not match the input string (aka the function returns successfully even when the result is wrong). Or as it was mentioned [1], "...simple_strtol(), simple_strtoll(), simple_strtoul(), and simple_strtoull() functions explicitly ignore overflows, which may lead to unexpected results in callers." Hence, the use of those functions is discouraged. This patch replaces all uses of the simple_strtoul with the safer alternatives kstrtoul and kstruint. Callers affected: - add_rec_by_index - set_graph_max_depth_function Side effects of this patch: - Since `fgraph_max_depth` is an `unsigned int`, this patch uses kstrtouint instead of kstrtoul to avoid any compiler warnings that could originate from calling the latter. - This patch ensures that the callers of kstrtou* return accordingly when kstrtoul and kstruint fail for some reason. In this case, both callers this patch is addressing return 0 on error. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#simple-strtol-simple-strtoll-simple-strtoul-simple-strtoull Link: https://lore.kernel.org/linux-trace-kernel/GV1PR10MB656333529A8D7B8AFB28D238E8B4A@GV1PR10MB6563.EURPRD10.PROD.OUTLOOK.COM Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daystracing: Allow user-space mapping of the ring-bufferVincent Donnefort2-5/+100
Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is imposible to do real-time analysis without a copy. A solution for that problem is to let the user-space map the ring-buffer directly. The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID: * Meta-page -- include/uapi/linux/trace_mmap.h for a description * Subbuf ID 0 * Subbuf ID 1 ... It is therefore easy to translate a subbuf ID into an offset in the mapping: reader_id = meta->reader->id; reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data. Mapping will prevent snapshot and buffer size modifications. Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-4-vdonnefort@google.com CC: <linux-mm@kvack.org> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysring-buffer: Introducing ring-buffer mapping functionsVincent Donnefort1-3/+411
In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-3-vdonnefort@google.com CC: <linux-mm@kvack.org> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysring-buffer: Allocate sub-buffers with __GFP_COMPVincent Donnefort1-3/+3
In preparation for the ring-buffer memory mapping, allocate compound pages for the ring-buffer sub-buffers to enable us to map them to user-space with vm_insert_pages(). Link: https://lore.kernel.org/linux-trace-kernel/20240510140435.3550353-2-vdonnefort@google.com Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
11 daysMerge tag 'v6.10-p1' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Remove crypto stats interface Algorithms: - Add faster AES-XTS on modern x86_64 CPUs - Forbid curves with order less than 224 bits in ecc (FIPS 186-5) - Add ECDSA NIST P521 Drivers: - Expose otp zone in atmel - Add dh fallback for primes > 4K in qat - Add interface for live migration in qat - Use dma for aes requests in starfive - Add full DMA support for stm32mpx in stm32 - Add Tegra Security Engine driver Others: - Introduce scope-based x509_certificate allocation" * tag 'v6.10-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (123 commits) crypto: atmel-sha204a - provide the otp content crypto: atmel-sha204a - add reading from otp zone crypto: atmel-i2c - rename read function crypto: atmel-i2c - add missing arg description crypto: iaa - Use kmemdup() instead of kzalloc() and memcpy() crypto: sahara - use 'time_left' variable with wait_for_completion_timeout() crypto: api - use 'time_left' variable with wait_for_completion_killable_timeout() crypto: caam - i.MX8ULP donot have CAAM page0 access crypto: caam - init-clk based on caam-page0-access crypto: starfive - Use fallback for unaligned dma access crypto: starfive - Do not free stack buffer crypto: starfive - Skip unneeded fallback allocation crypto: starfive - Skip dma setup for zeroed message crypto: hisilicon/sec2 - fix for register offset crypto: hisilicon/debugfs - mask the unnecessary info from the dump crypto: qat - specify firmware files for 402xx crypto: x86/aes-gcm - simplify GCM hash subkey derivation crypto: x86/aes-gcm - delete unused GCM assembly code crypto: x86/aes-xts - simplify loop in xts_crypt_slowpath() hwrng: stm32 - repair clock handling ...
11 daysMerge tag 'hardening-6.10-rc1' of ↵Linus Torvalds2-3/+11
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "The bulk of the changes here are related to refactoring and expanding the KUnit tests for string helper and fortify behavior. Some trivial strncpy replacements in fs/ were carried in my tree. Also some fixes to SCSI string handling were carried in my tree since the helper for those was introduce here. Beyond that, just little fixes all around: objtool getting confused about LKDTM+KCFI, preparing for future refactors (constification of sysctl tables, additional __counted_by annotations), a Clang UBSAN+i386 crash fix, and adding more options in the hardening.config Kconfig fragment. Summary: - selftests: Add str*cmp tests (Ivan Orlov) - __counted_by: provide UAPI for _le/_be variants (Erick Archer) - Various strncpy deprecation refactors (Justin Stitt) - stackleak: Use a copy of soon-to-be-const sysctl table (Thomas Weißschuh) - UBSAN: Work around i386 -regparm=3 bug with Clang prior to version 19 - Provide helper to deal with non-NUL-terminated string copying - SCSI: Fix older string copying bugs (with new helper) - selftests: Consolidate string helper behavioral tests - selftests: add memcpy() fortify tests - string: Add additional __realloc_size() annotations for "dup" helpers - LKDTM: Fix KCFI+rodata+objtool confusion - hardening.config: Enable KCFI" * tag 'hardening-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (29 commits) uapi: stddef.h: Provide UAPI macros for __counted_by_{le, be} stackleak: Use a copy of the ctl_table argument string: Add additional __realloc_size() annotations for "dup" helpers kunit/fortify: Fix replaced failure path to unbreak __alloc_size hardening: Enable KCFI and some other options lkdtm: Disable CFI checking for perms functions kunit/fortify: Add memcpy() tests kunit/fortify: Do not spam logs with fortify WARNs kunit/fortify: Rename tests to use recommended conventions init: replace deprecated strncpy with strscpy_pad kunit/fortify: Fix mismatched kvalloc()/vfree() usage scsi: qla2xxx: Avoid possible run-time warning with long model_num scsi: mpi3mr: Avoid possible run-time warning with long manufacturer strings scsi: mptfusion: Avoid possible run-time warning with long manufacturer strings fs: ecryptfs: replace deprecated strncpy with strscpy hfsplus: refactor copy_name to not use strncpy reiserfs: replace deprecated strncpy with scnprintf virt: acrn: replace deprecated strncpy with strscpy ubsan: Avoid i386 UBSAN handler crashes with Clang ubsan: Remove 1-element array usage in debug reporting ...
11 daysMerge tag 'seccomp-6.10-rc1' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp update from Kees Cook: - Prepare for sysctl table constification * tag 'seccomp-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: Constify sysctl subhelpers
12 daysMerge tag 'for-netdev' of ↵Jakub Kicinski1-0/+5
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-05-13 We've added 3 non-merge commits during the last 2 day(s) which contain a total of 2 files changed, 62 insertions(+), 8 deletions(-). The main changes are: 1) Fix a case where syzkaller found that it's unexpectedly possible to attach a cgroup_skb program to the sockopt hooks. The fix adds missing attach_type enforcement for the link_create case along with selftests, from Stanislav Fomichev. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add sockopt case to verify prog_type selftests/bpf: Extend sockopt tests to use BPF_LINK_CREATE bpf: Add BPF_PROG_TYPE_CGROUP_SKB attach type enforcement in BPF_LINK_CREATE ==================== Link: https://lore.kernel.org/r/20240513041845.31040-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysMerge branches 'pm-em' and 'pm-docs'Rafael J. Wysocki1-19/+87
Merge Enery Model update and a power management documentation update for 6.10: - Make the Samsung exynos-asv driver update the Energy Model after adjusting voltage on top of some preliminary changes of the OPP and Enery Model generic code (Lukasz Luba). - Remove a reference to a function that has been dropped from the power management documentation (Bjorn Helgaas). * pm-em: soc: samsung: exynos-asv: Update Energy Model after adjusting voltage PM: EM: Add em_dev_update_chip_binning() PM: EM: Refactor em_adjust_new_capacity() OPP: OF: Export dev_opp_pm_calc_power() for usage from EM * pm-docs: Documentation: PM: Update platform_pci_wakeup_init() reference
12 daysMerge branches 'pm-cpuidle', 'pm-sleep' and 'pm-powercap'Rafael J. Wysocki1-1/+1
Merge cpuidle updates, changes related to system sleep and power capping updates for 6.10: - Fix kerneldoc description of ladder_do_selection() (Jeff Johnson). - Convert the cpuidle kirkwood driver to platform remove callback returning void (Yangtao Li). - Replace deprecated strncpy() with strscpy() in the hibernation core code (Justin Stitt). - Use %ps to simplify debug output in the core system-wide suspend and resume code (Len Brown). - Remove unnecessary else from device_init_wakeup() and make device_wakeup_disable() return void (Dhruva Gole). - Enable PMU support in the Intel TPMI RAPL driver (Zhang Rui). - Add support for ArrowLake-H platform to the Intel RAPL driver (Zhang Rui). - Avoid explicit cpumask allocation on stack in DTPM (Dawei Li). * pm-cpuidle: cpuidle: ladder: fix ladder_do_selection() kernel-doc cpuidle: kirkwood: Convert to platform remove callback returning void * pm-sleep: PM: hibernate: replace deprecated strncpy() with strscpy() PM: sleep: Take advantage of %ps to simplify debug output PM: wakeup: Remove unnecessary else from device_init_wakeup() PM: wakeup: make device_wakeup_disable() return void * pm-powercap: powercap: intel_rapl_tpmi: Enable PMU support powercap: intel_rapl: Introduce APIs for PMU support powercap: intel_rapl: Sort header files powercap: intel_rapl: Add support for ArrowLake-H platform powercap: DTPM: Avoid explicit cpumask allocation on stack
12 daysMerge tag 'kcsan.2024.05.10a' of ↵Linus Torvalds1-0/+17
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull kcsan update from Paul McKenney: "Introduce __data_racy type qualifier This adds a __data_racy type qualifier that enables kernel developers to inform KCSAN that a given variable is a shared variable without needing to mark each and every access. This allows pre-KCSAN code to be correctly (if approximately) instrumented withh very little effort, and also provides people reading the code a clear indication that the variable is in fact shared. In addition, it permits incremental transition to per-access KCSAN marking, so that (for example) a given subsystem can be transitioned one variable at a time, while avoiding large numbers of KCSAN warnings during this transition" * tag 'kcsan.2024.05.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: kcsan, compiler_types: Introduce __data_racy type qualifier
12 daysMerge tag 'rcu.next.v6.10' of https://github.com/urezki/linuxLinus Torvalds18-111/+580
Pull RCU updates from Uladzislau Rezki: - Fix a lockdep complain for lazy-preemptible kernel, remove redundant BH disable for TINY_RCU, remove redundant READ_ONCE() in tree.c, fix false positives KCSAN splat and fix buffer overflow in the print_cpu_stall_info(). - Misc updates related to bpf, tracing and update the MAINTAINERS file. - An improvement of a normal synchronize_rcu() call in terms of latency. It maintains a separate track for sync. users only. This approach bypasses per-cpu nocb-lists thus sync-users do not depend on nocb-list length and how fast regular callbacks are processed. - RCU tasks: switch tasks RCU grace periods to sleep at TASK_IDLE priority, fix some comments, add some diagnostic warning to the exit_tasks_rcu_start() and fix a buffer overflow in the show_rcu_tasks_trace_gp_kthread(). - RCU torture: Increase memory to guest OS, fix a Tasks Rude RCU testing, some updates for TREE09, dump mode information to debug GP kthread state, remove redundant READ_ONCE(), fix some comments about RCU_TORTURE_PIPE_LEN and pipe_count, remove some redundant pointer initialization, fix a hung splat task by when the rcutorture tests start to exit, fix invalid context warning, add '--do-kvfree' parameter to torture test and use slow register unregister callbacks only for rcutype test. * tag 'rcu.next.v6.10' of https://github.com/urezki/linux: (48 commits) rcutorture: Use rcu_gp_slow_register/unregister() only for rcutype test torture: Scale --do-kvfree test time rcutorture: Fix invalid context warning when enable srcu barrier testing rcutorture: Make stall-tasks directly exit when rcutorture tests end rcutorture: Removing redundant function pointer initialization rcutorture: Make rcutorture support print rcu-tasks gp state rcutorture: Use the gp_kthread_dbg operation specified by cur_ops rcutorture: Re-use value stored to ->rtort_pipe_count instead of re-reading rcutorture: Fix rcu_torture_one_read() pipe_count overflow comment rcutorture: Remove extraneous rcu_torture_pipe_update_one() READ_ONCE() rcu: Allocate WQ with WQ_MEM_RECLAIM bit set rcu: Support direct wake-up of synchronize_rcu() users rcu: Add a trace event for synchronize_rcu_normal() rcu: Reduce synchronize_rcu() latency rcu: Fix buffer overflow in print_cpu_stall_info() rcu: Mollify sparse with RCU guard rcu-tasks: Fix show_rcu_tasks_trace_gp_kthread buffer overflow rcu-tasks: Fix the comments for tasks_rcu_exit_srcu_stall_timer rcu-tasks: Replace exit_tasks_rcu_start() initialization with WARN_ON_ONCE() rcu: Remove redundant CONFIG_PROVE_RCU #if condition ...
12 daystracing/user_events: Fix non-spaced field matchingBeau Belgrave1-1/+75
When the ABI was updated to prevent same name w/different args, it missed an important corner case when fields don't end with a space. Typically, space is used for fields to help separate them, like "u8 field1; u8 field2". If no spaces are used, like "u8 field1;u8 field2", then the parsing works for the first time. However, the match check fails on a subsequent register, leading to confusion. This is because the match check uses argv_split() and assumes that all fields will be split upon the space. When spaces are used, we get back { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }. This causes a mismatch, and the user program gets back -EADDRINUSE. Add a method to detect this case before calling argv_split(). If found force a space after the field separator character ';'. This ensures all cases work properly for matching. With this fix, the following are all treated as matching: u8 field1;u8 field2 u8 field1; u8 field2 u8 field1;\tu8 field2 u8 field1;\nu8 field2 Link: https://lore.kernel.org/linux-trace-kernel/20240423162338.292-2-beaub@linux.microsoft.com Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different args event") Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
12 daysMerge branch 'topic/kdump-hotplug' into nextMichael Ellerman3-18/+20
Merge our topic branch containing kdump hotplug changes, more detail from the original cover letter: Commit 247262756121 ("crash: add generic infrastructure for crash hotplug support") added a generic infrastructure that allows architectures to selectively update the kdump image component during CPU or memory add/remove events within the kernel itself. This patch series adds crash hotplug handler for PowerPC and enable support to update the kdump image on CPU/Memory add/remove events. Among the 6 patches in this series, the first two patches make changes to the generic crash hotplug handler to assist PowerPC in adding support for this feature. The last four patches add support for this feature. The following section outlines the problem addressed by this patch series, along with the current solution, its shortcomings, and the proposed resolution. Problem: ======== Due to CPU/Memory hotplug or online/offline events the elfcorehdr (which describes the CPUs and memory of the crashed kernel) and FDT (Flattened Device Tree) of kdump image becomes outdated. Consequently, attempting dump collection with an outdated elfcorehdr or FDT can lead to failed or inaccurate dump collection. Going forward CPU hotplug or online/offline events are referred as CPU/Memory add/remove events. Existing solution and its shortcoming: ====================================== The current solution to address the above issue involves monitoring the CPU/memory add/remove events in userspace using udev rules and whenever there are changes in CPU and memory resources, the entire kdump image is loaded again. The kdump image includes kernel, initrd, elfcorehdr, FDT, purgatory. Given that only elfcorehdr and FDT get outdated due to CPU/Memory add/remove events, reloading the entire kdump image is inefficient. More importantly, kdump remains inactive for a substantial amount of time until the kdump reload completes. Proposed solution: ================== Instead of initiating a full kdump image reload from userspace on CPU/Memory hotplug and online/offline events, the proposed solution aims to update only the necessary kdump image component within the kernel itself.
12 daysriscv, bpf: inline bpf_get_smp_processor_id()Puranjay Mohan2-0/+15
Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit. RISCV saves the pointer to the CPU's task_struct in the TP (thread pointer) register. This makes it trivial to get the CPU's processor id. As thread_info is the first member of task_struct, we can read the processor id from TP + offsetof(struct thread_info, cpu). RISCV64 JIT output for `call bpf_get_smp_processor_id` ====================================================== Before After -------- ------- auipc t1,0x848c ld a5,32(tp) jalr 604(t1) mv a5,a0 Benchmark using [1] on Qemu. ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+------------------+------------------+--------------+ | Name | Before | After | % change | |---------------+------------------+------------------+--------------| | glob-arr-inc | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s | + 24.04% | | arr-inc | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s | + 23.56% | | hash-inc | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s | + 32.18% | +---------------+------------------+------------------+--------------+ NOTE: This benchmark includes changes from this patch and the previous patch that implemented the per-cpu insn. [1] https://github.com/anakryiko/linux/commit/8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/r/20240502151854.9810-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-10Merge tag 'loongarch-kvm-6.10' of ↵Paolo Bonzini22-168/+279
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.10 1. Add ParaVirt IPI support. 2. Add software breakpoint support. 3. Add mmio trace events support.
2024-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-5/+22
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 35d92abfbad8 ("net: hns3: fix kernel crash when devlink reload during initialization") 2a1a1a7b5fd7 ("net: hns3: add command queue trace for hns3") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-09dma: fix DMA sync for drivers not calling dma_set_mask*()Alexander Lobakin2-6/+6
There are several reports that the DMA sync shortcut broke non-coherent devices. dev->dma_need_sync is false after the &device allocation and if a driver didn't call dma_set_mask*(), it will still be false even if the device is not DMA-coherent and thus needs synchronizing. Due to historical reasons, there's still a lot of drivers not calling it. Invert the boolean, so that the sync will be performed by default and the shortcut will be enabled only when calling dma_set_mask*(). Reported-by: Steven Price <steven.price@arm.com> Closes: https://lore.kernel.org/lkml/010686f5-3049-46a1-8230-7752a1b433ff@arm.com Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Closes: https://lore.kernel.org/lkml/46160534-5003-4809-a408-6b3a3f4921e9@samsung.com Fixes: f406c8e4b770. ("dma: avoid redundant calls for sync operations") Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Steven Price <steven.price@arm.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
2024-05-09sched/topology: Optimize topology_span_sane()Kyle Meyer1-4/+2
Optimize topology_span_sane() by removing duplicate comparisons. Since topology_span_sane() is called inside of for_each_cpu(), each previous CPU has already been compared against every other CPU. The current CPU only needs to be compared against higher-numbered CPUs. The total number of comparisons is reduced from N * (N - 1) to N * (N - 1) / 2 on each non-NUMA scheduling domain level. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Yury Norov <yury.norov@gmail.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Yury Norov <yury.norov@gmail.com>
2024-05-09livepatch: Rename KLP_* to KLP_TRANSITION_*Wardenjohn3-31/+31
The original macros of KLP_* is about the state of the transition. Rename macros of KLP_* to KLP_TRANSITION_* to fix the confusing description of klp transition state. Signed-off-by: Wardenjohn <zhangwarden@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Miroslav Benes <mbenes@suse.cz> Link: https://lore.kernel.org/r/20240507050111.38195-2-zhangwarden@gmail.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-05-08seccomp: Constify sysctl subhelpersKees Cook1-2/+2
The read_actions_logged() and write_actions_logged() helpers called by the sysctl proc handler seccomp_actions_logged_handler() are already expecting their sysctl table argument to be read-only. Actually mark the argument as const in preparation[1] for global constification of the sysctl tables. Suggested-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/lkml/20240423-sysctl-const-handler-v3-11-e0beccb836e2@weissschuh.net/ [1] Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20240508171337.work.861-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2024-05-08kernel/watchdog_perf.c: tidy up kerneldocAndrew Morton1-3/+0
It is unconventional to have a blank line between name-of-function and description-of-args. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <song@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-08watchdog: allow nmi watchdog to use raw perf eventSong Liu2-0/+48
NMI watchdog permanently consumes one hardware counters per CPU on the system. For systems that use many hardware counters, this causes more aggressive time multiplexing of perf events. OTOH, some CPUs (mostly Intel) support "ref-cycles" event, which is rarely used. Add kernel cmdline arg nmi_watchdog=rNNN to configure the watchdog to use raw event. For example, on Intel CPUs, we can use "r300" to configure the watchdog to use ref-cycles event. If the raw event does not work, fall back to use "cycles". [akpm@linux-foundation.org: fix kerneldoc] Link: https://lkml.kernel.org/r/20240430060236.1878002-2-song@kernel.org Signed-off-by: Song Liu <song@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-08watchdog: handle comma separated nmi_watchdog command lineSong Liu1-0/+7
Per the document, the kernel can accept comma separated command line like nmi_watchdog=nopanic,0. However, the code doesn't really handle it. Fix the kernel to handle it properly. Link: https://lkml.kernel.org/r/20240430060236.1878002-1-song@kernel.org Signed-off-by: Song Liu <song@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-08crash: add prefix for crash dumping messagesBaoquan He2-2/+4
Add pr_fmt() to kernel/crash_core.c to add the module name to debugging message printed as prefix. And also add prefix 'crashkernel:' to two lines of message printing code in kernel/crash_reserve.c. In kernel/crash_reserve.c, almost all debugging messages have 'crashkernel:' prefix or there's keyword crashkernel at the beginning or in the middle, adding pr_fmt() makes it redundant. Link: https://lkml.kernel.org/r/20240418035843.1562887-1-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-08timers/migration: Prevent out of bounds access on failureLevi Yun1-2/+2
When tmigr_setup_groups() fails the level 0 group allocation, then the cleanup derefences index -1 of the local stack array. Prevent this by checking the loop condition first. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Signed-off-by: Levi Yun <ppbuk5246@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/r/20240506041059.86877-1-ppbuk5246@gmail.com
2024-05-07bpf: Remove redundant page mask of vmf->addressHaiyue Wang1-1/+1
As the comment described in "struct vm_fault": ".address" : 'Faulting virtual address - masked' ".real_address" : 'Faulting virtual address - unmasked' The link [1] said: "Whatever the routes, all architectures end up to the invocation of handle_mm_fault() which, in turn, (likely) ends up calling __handle_mm_fault() to carry out the actual work of allocating the page tables." __handle_mm_fault() does address assignment: .address = address & PAGE_MASK, .real_address = address, This is debug dump by running `./test_progs -a "*arena*"`: [ 69.767494] arena fault: vmf->address = 10000001d000, vmf->real_address = 10000001d008 [ 69.767496] arena fault: vmf->address = 10000001c000, vmf->real_address = 10000001c008 [ 69.767499] arena fault: vmf->address = 10000001b000, vmf->real_address = 10000001b008 [ 69.767501] arena fault: vmf->address = 10000001a000, vmf->real_address = 10000001a008 [ 69.767504] arena fault: vmf->address = 100000019000, vmf->real_address = 100000019008 [ 69.769388] arena fault: vmf->address = 10000001e000, vmf->real_address = 10000001e1e8 So we can use the value of 'vmf->address' to do BPF arena kernel address space cast directly. [1] https://docs.kernel.org/mm/page_tables.html Signed-off-by: Haiyue Wang <haiyue.wang@intel.com> Link: https://lore.kernel.org/r/20240507063358.8048-1-haiyue.wang@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-07kcsan, compiler_types: Introduce __data_racy type qualifierMarco Elver1-0/+17
Based on the discussion at [1], it would be helpful to mark certain variables as explicitly "data racy", which would result in KCSAN not reporting data races involving any accesses on such variables. To do that, introduce the __data_racy type qualifier: struct foo { ... int __data_racy bar; ... }; In KCSAN-kernels, __data_racy turns into volatile, which KCSAN already treats specially by considering them "marked". In non-KCSAN kernels the type qualifier turns into no-op. The generated code between KCSAN-instrumented kernels and non-KCSAN kernels is already huge (inserted calls into runtime for every memory access), so the extra generated code (if any) due to volatile for few such __data_racy variables are unlikely to have measurable impact on performance. Link: https://lore.kernel.org/all/CAHk-=wi3iondeh_9V2g3Qz5oHTRjLsOpoy83hb58MVh=nRZe0A@mail.gmail.com/ [1] Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Marco Elver <elver@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-05-07Merge tag 'kvm-riscv-6.10-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini4-16/+58
KVM/riscv changes for 6.10 - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak
2024-05-07dma: avoid redundant calls for sync operationsAlexander Lobakin2-12/+49
Quite often, devices do not need dma_sync operations on x86_64 at least. Indeed, when dev_is_dma_coherent(dev) is true and dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu() and friends do nothing. However, indirectly calling them when CONFIG_RETPOLINE=y consumes about 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate. Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%. Add dev->need_dma_sync boolean and turn it off during the device initialization (dma_set_mask()) depending on the setup: dev_is_dma_coherent() for the direct DMA, !(sync_single_for_device || sync_single_for_cpu) or the new dma_map_ops flag, %DMA_F_CAN_SKIP_SYNC, advertised for non-NULL DMA ops. Then later, if/when swiotlb is used for the first time, the flag is reset back to on, from swiotlb_tbl_map_single(). On iavf, the UDP trafficgen with XDP_DROP in skb mode test shows +3-5% increase for direct DMA. Suggested-by: Christoph Hellwig <hch@lst.de> # direct DMA shortcut Co-developed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07dma: compile-out DMA sync op calls when not usedAlexander Lobakin2-10/+17
Some platforms do have DMA, but DMA there is always direct and coherent. Currently, even on such platforms DMA sync operations are compiled and called. Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when either sync operations are needed or there is DMA ops or swiotlb or DMA debug is enabled. Compile global dma_sync_*() and dma_need_sync() only when it's set, otherwise provide empty inline stubs. The change allows for future optimizations of DMA sync calls depending on runtime conditions. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07swiotlb: remove alloc_size argument to swiotlb_tbl_map_single()Michael Kelley1-14/+42
Currently swiotlb_tbl_map_single() takes alloc_align_mask and alloc_size arguments to specify an swiotlb allocation that is larger than mapping_size. This larger allocation is used solely by iommu_dma_map_single() to handle untrusted devices that should not have DMA visibility to memory pages that are partially used for unrelated kernel data. Having two arguments to specify the allocation is redundant. While alloc_align_mask naturally specifies the alignment of the starting address of the allocation, it can also implicitly specify the size by rounding up the mapping_size to that alignment. Additionally, the current approach has an edge case bug. iommu_dma_map_page() already does the rounding up to compute the alloc_size argument. But swiotlb_tbl_map_single() then calculates the alignment offset based on the DMA min_align_mask, and adds that offset to alloc_size. If the offset is non-zero, the addition may result in a value that is larger than the max the swiotlb can allocate. If the rounding up is done _after_ the alignment offset is added to the mapping_size (and the original mapping_size conforms to the value returned by swiotlb_max_mapping_size), then the max that the swiotlb can allocate will not be exceeded. In view of these issues, simplify the swiotlb_tbl_map_single() interface by removing the alloc_size argument. Most call sites pass the same value for mapping_size and alloc_size, and they pass alloc_align_mask as zero. Just remove the redundant argument from these callers, as they will see no functional change. For iommu_dma_map_page() also remove the alloc_size argument, and have swiotlb_tbl_map_single() compute the alloc_size by rounding up mapping_size after adding the offset based on min_align_mask. This has the side effect of fixing the edge case bug but with no other functional change. Also add a sanity test on the alloc_align_mask. While IOMMU code currently ensures the granule is not larger than PAGE_SIZE, if that guarantee were to be removed in the future, the downstream effect on the swiotlb might go unnoticed until strange allocation failures occurred. Tested on an ARM64 system with 16K page size and some kernel test-only hackery to allow modifying the DMA min_align_mask and the granule size that becomes the alloc_align_mask. Tested these combinations with a variety of original memory addresses and sizes, including those that reproduce the edge case bug: * 4K granule and 0 min_align_mask * 4K granule and 0xFFF min_align_mask (4K - 1) * 16K granule and 0xFFF min_align_mask * 64K granule and 0xFFF min_align_mask * 64K granule and 0x3FFF min_align_mask (16K - 1) With the changes, all combinations pass. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Petr Tesarik <petr@tesarici.cz> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07printk: cleanup deprecated uses of strncpy/strcpyJustin Stitt1-13/+13
Cleanup some deprecated uses of strncpy() and strcpy() [1]. There doesn't seem to be any bugs with the current code but the readability of this code could benefit from a quick makeover while removing some deprecated stuff as a benefit. The most interesting replacement made in this patch involves concatenating "ttyS" with a digit-led user-supplied string. Instead of doing two distinct string copies with carefully managed offsets and lengths, let's use the more robust and self-explanatory scnprintf(). scnprintf will 1) respect the bounds of @buf, 2) null-terminate @buf, 3) do the concatenation. This allows us to drop the manual NUL-byte assignment. Also, since isdigit() is used about a dozen lines after the open-coded version we'll replace it for uniformity's sake. All the strcpy() --> strscpy() replacements are trivial as the source strings are literals and much smaller than the destination size. No behavioral change here. Use the new 2-argument version of strscpy() introduced in Commit e6584c3964f2f ("string: Allow 2-argument strscpy()"). However, to make this work fully (since the size must be known at compile time), also update the extern-qualified declaration to have the proper size information. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 [2] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [3] Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240429-strncpy-kernel-printk-printk-c-v1-1-4da7926d7b69@google.com [pmladek@suse.com: Removed obsolete brackets and added empty lines.] Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-05-06bpf/verifier: relax MUL range computation checkCupertino Miranda1-5/+1
MUL instruction required that src_reg would be a known value (i.e. src_reg would be a const value). The condition in this case can be relaxed, since the range computation algorithm used in current code already supports a proper range computation for any valid range value on its operands. Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: David Faust <david.faust@oracle.com> Cc: Jose Marchesi <jose.marchesi@oracle.com> Cc: Elena Zannoni <elena.zannoni@oracle.com> Link: https://lore.kernel.org/r/20240506141849.185293-6-cupertino.miranda@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06bpf/verifier: improve XOR and OR range computationCupertino Miranda1-2/+2
Range for XOR and OR operators would not be attempted unless src_reg would resolve to a single value, i.e. a known constant value. This condition is unnecessary, and the following XOR/OR operator handling could compute a possible better range. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com Acked-by: Eduard Zingerman <eddyz87@gmail.com> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: David Faust <david.faust@oracle.com> Cc: Jose Marchesi <jose.marchesi@oracle.com> Cc: Elena Zannoni <elena.zannoni@oracle.com> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Link: https://lore.kernel.org/r/20240506141849.185293-4-cupertino.miranda@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06bpf/verifier: refactor checks for range computationCupertino Miranda1-64/+45
Split range computation checks in its own function, isolating pessimitic range set for dst_reg and failing return to a single point. Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: David Faust <david.faust@oracle.com> Cc: Jose Marchesi <jose.marchesi@oracle.com> Cc: Elena Zannoni <elena.zannoni@oracle.com> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> bpf/verifier: improve code after range computation recent changes. Link: https://lore.kernel.org/r/20240506141849.185293-3-cupertino.miranda@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06bpf/verifier: replace calls to mark_reg_unknown.Cupertino Miranda1-5/+4
In order to further simplify the code in adjust_scalar_min_max_vals all the calls to mark_reg_unknown are replaced by __mark_reg_unknown. static void mark_reg_unknown(struct bpf_verifier_env *env, struct bpf_reg_state *regs, u32 regno) { if (WARN_ON(regno >= MAX_BPF_REG)) { ... mark all regs not init ... return; } __mark_reg_unknown(env, regs + regno); } The 'regno >= MAX_BPF_REG' does not apply to adjust_scalar_min_max_vals(), because it is only called from the following stack: - check_alu_op - adjust_reg_min_max_vals - adjust_scalar_min_max_vals The check_alu_op() does check_reg_arg() which verifies that both src and dst register numbers are within bounds. Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: David Faust <david.faust@oracle.com> Cc: Jose Marchesi <jose.marchesi@oracle.com> Cc: Elena Zannoni <elena.zannoni@oracle.com> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Link: https://lore.kernel.org/r/20240506141849.185293-2-cupertino.miranda@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-06kunit: Handle test faultsMickaël Salaün1-0/+1
Previously, when a kernel test thread crashed (e.g. NULL pointer dereference, general protection fault), the KUnit test hanged for 30 seconds and exited with a timeout error. Fix this issue by waiting on task_struct->vfork_done instead of the custom kunit_try_catch.try_completion, and track the execution state by initially setting try_result with -EINTR and only setting it to 0 if the test passed. Fix kunit_generic_run_threadfn_adapter() signature by returning 0 instead of calling kthread_complete_and_exit(). Because thread's exit code is never checked, always set it to 0 to make it clear. To make this explicit, export kthread_exit() for KUnit tests built as module. Fix the -EINTR error message, which couldn't be reached until now. This is tested with a following patch. Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Gow <davidgow@google.com> Tested-by: Rae Moar <rmoar@google.com> Signed-off-by: Mickaël Salaün <mic@digikod.net> Link: https://lore.kernel.org/r/20240408074625.65017-5-mic@digikod.net Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-05-06printk: Change type of CONFIG_BASE_SMALL to boolYoann Congal2-2/+2
CONFIG_BASE_SMALL is currently a type int but is only used as a boolean. So, change its type to bool and adapt all usages: CONFIG_BASE_SMALL == 0 becomes !IS_ENABLED(CONFIG_BASE_SMALL) and CONFIG_BASE_SMALL != 0 becomes IS_ENABLED(CONFIG_BASE_SMALL). Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Yoann Congal <yoann.congal@smile.fr> Link: https://lore.kernel.org/r/20240505080343.1471198-3-yoann.congal@smile.fr Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-05-06powerpc/dexcr: Add DEXCR prctl interfaceBenjamin Gray1-0/+16
Now that we track a DEXCR on a per-task basis, individual tasks are free to configure it as they like. The interface is a pair of getter/setter prctl's that work on a single aspect at a time (multiple aspects at once is more difficult if there are different rules applied for each aspect, now or in future). The getter shows the current state of the process config, and the setter allows setting/clearing the aspect. Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> [mpe: Account for PR_RISCV_SET_ICACHE_FLUSH_CTX, shrink some longs lines] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240417112325.728010-5-bgray@linux.ibm.com
2024-05-05Merge tag 'irq-urgent-2024-05-05' of ↵Linus Torvalds1-4/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Ingo Molnar: "Fix suspicious RCU usage in __do_softirq()" * tag 'irq-urgent-2024-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: softirq: Fix suspicious RCU usage in __do_softirq()
2024-05-05Merge tag 'probes-fixes-v6.9-rc6' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: - probe-events: Fix memory leak in parsing probe argument. There is a memory leak (forget to free an allocated buffer) in a memory allocation failure path. Fix it to jump to the correct error handling code. * tag 'probes-fixes-v6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()
2024-05-05Merge tag 'trace-v6.9-rc6-2' of ↵Linus Torvalds1-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing and tracefs fixes from Steven Rostedt: - Fix RCU callback of freeing an eventfs_inode. The freeing of the eventfs_inode from the kref going to zero freed the contents of the eventfs_inode and then used kfree_rcu() to free the inode itself. But the contents should also be protected by RCU. Switch to a call_rcu() that calls a function to free all of the eventfs_inode after the RCU synchronization. - The tracing subsystem maps its own descriptor to a file represented by eventfs. The freeing of this descriptor needs to know when the last reference of an eventfs_inode is released, but currently there is no interface for that. Add a "release" callback to the eventfs_inode entry array that allows for freeing of data that can be referenced by the eventfs_inode being opened. Then increment the ref counter for this descriptor when the eventfs_inode file is created, and decrement/free it when the last reference to the eventfs_inode is released and the file is removed. This prevents races between freeing the descriptor and the opening of the eventfs file. - Fix the permission processing of eventfs. The change to make the permissions of eventfs default to the mount point but keep track of when changes were made had a side effect that could cause security concerns. When the tracefs is remounted with a given gid or uid, all the files within it should inherit that gid or uid. But if the admin had changed the permission of some file within the tracefs file system, it would not get updated by the remount. This caused the kselftest of file permissions to fail the second time it is run. The first time, all changes would look fine, but the second time, because the changes were "saved", the remount did not reset them. Create a link list of all existing tracefs inodes, and clear the saved flags on them on a remount if the remount changes the corresponding gid or uid fields. This also simplifies the code by removing the distinction between the toplevel eventfs and an instance eventfs. They should both act the same. They were different because of a misconception due to the remount not resetting the flags. Now that remount resets all the files and directories to default to the root node if a uid/gid is specified, it makes the logic simpler to implement. * tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Have "events" directory get permissions from its parent eventfs: Do not treat events directory different than other directories eventfs: Do not differentiate the toplevel events directory tracefs: Still use mount point as default permissions for instances tracefs: Reset permissions on remount if permissions are options eventfs: Free all of the eventfs_inode after RCU eventfs/tracing: Add callback for release of an eventfs_inode
2024-05-05Merge tag 'dma-mapping-6.9-2024-05-04' of ↵Linus Torvalds1-0/+1
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fix from Christoph Hellwig: - fix the combination of restricted pools and dynamic swiotlb (Will Deacon) * tag 'dma-mapping-6.9-2024-05-04' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: initialise restricted pool list_head when SWIOTLB_DYNAMIC=y
2024-05-04eventfs/tracing: Add callback for release of an eventfs_inodeSteven Rostedt (Google)1-0/+12
Synthetic events create and destroy tracefs files when they are created and removed. The tracing subsystem has its own file descriptor representing the state of the events attached to the tracefs files. There's a race between the eventfs files and this file descriptor of the tracing system where the following can cause an issue: With two scripts 'A' and 'B' doing: Script 'A': echo "hello int aaa" > /sys/kernel/tracing/synthetic_events while : do echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable done Script 'B': echo > /sys/kernel/tracing/synthetic_events Script 'A' creates a synthetic event "hello" and then just writes zero into its enable file. Script 'B' removes all synthetic events (including the newly created "hello" event). What happens is that the opening of the "enable" file has: { struct trace_event_file *file = inode->i_private; int ret; ret = tracing_check_open_get_tr(file->tr); [..] But deleting the events frees the "file" descriptor, and a "use after free" happens with the dereference at "file->tr". The file descriptor does have a reference counter, but there needs to be a way to decrement it from the eventfs when the eventfs_inode is removed that represents this file descriptor. Add an optional "release" callback to the eventfs_entry array structure, that gets called when the eventfs file is about to be removed. This allows for the creating on the eventfs file to increment the tracing file descriptor ref counter. When the eventfs file is deleted, it can call the release function that will call the put function for the tracing file descriptor. This will protect the tracing file from being freed while a eventfs file that references it is being opened. Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240502090315.448cba46@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: Tze-nan wu <Tze-nan.Wu@mediatek.com> Tested-by: Tze-nan Wu (吳澤南) <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-03stackleak: Use a copy of the ctl_table argumentThomas Weißschuh1-3/+3
Sysctl handlers are not supposed to modify the ctl_table passed to them. Adapt the logic to work with a temporary variable, similar to how it is done in other parts of the kernel. This is also a prerequisite to enforce the immutability of the argument through the callbacks. Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Tycho Andersen <tycho@tycho.pizza> Link: https://lore.kernel.org/r/20240503-sysctl-const-stackleak-v1-1-603fecb19170@weissschuh.net Signed-off-by: Kees Cook <keescook@chromium.org>
2024-05-02swsusp: don't bother with setting block sizeAl Viro1-6/+1
same as with the swap... Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski9-72/+105
Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c 66e13b615a0c ("bpf: verifier: prevent userspace memory access") d503a04f8bc0 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT") https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02Merge tag 'net-6.9-rc7' of ↵Linus Torvalds2-2/+40
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf. Relatively calm week, likely due to public holiday in most places. No known outstanding regressions. Current release - regressions: - rxrpc: fix wrong alignmask in __page_frag_alloc_align() - eth: e1000e: change usleep_range to udelay in PHY mdic access Previous releases - regressions: - gro: fix udp bad offset in socket lookup - bpf: fix incorrect runtime stat for arm64 - tipc: fix UAF in error path - netfs: fix a potential infinite loop in extract_user_to_sg() - eth: ice: ensure the copied buf is NUL terminated - eth: qeth: fix kernel panic after setting hsuid Previous releases - always broken: - bpf: - verifier: prevent userspace memory access - xdp: use flags field to disambiguate broadcast redirect - bridge: fix multicast-to-unicast with fraglist GSO - mptcp: ensure snd_nxt is properly initialized on connect - nsh: fix outer header access in nsh_gso_segment(). - eth: bcmgenet: fix racing registers access - eth: vxlan: fix stats counters. Misc: - a bunch of MAINTAINERS file updates" * tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits) MAINTAINERS: mark MYRICOM MYRI-10G as Orphan MAINTAINERS: remove Ariel Elior net: gro: add flush check in udp_gro_receive_segment net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb ipv4: Fix uninit-value access in __ip_make_skb() s390/qeth: Fix kernel panic after setting hsuid vxlan: Pull inner IP header in vxlan_rcv(). tipc: fix a possible memleak in tipc_buf_append tipc: fix UAF in error path rxrpc: Clients must accept conn from any address net: core: reject skb_copy(_expand) for fraglist GSO skbs net: bridge: fix multicast-to-unicast with fraglist GSO mptcp: ensure snd_nxt is properly initialized on connect e1000e: change usleep_range to udelay in PHY mdic access net: dsa: mv88e6xxx: Fix number of databases for 88E6141 / 88E6341 cxgb4: Properly lock TX queue for the selftest. rxrpc: Fix using alignmask being zero for __page_frag_alloc_align() vxlan: Add missing VNI filter counter update in arp_reduce(). vxlan: Fix racy device stats updates. net: qede: use return from qede_parse_actions() ...
2024-05-02swiotlb: initialise restricted pool list_head when SWIOTLB_DYNAMIC=yWill Deacon1-0/+1
Using restricted DMA pools (CONFIG_DMA_RESTRICTED_POOL=y) in conjunction with dynamic SWIOTLB (CONFIG_SWIOTLB_DYNAMIC=y) leads to the following crash when initialising the restricted pools at boot-time: | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 | Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP | pc : rmem_swiotlb_device_init+0xfc/0x1ec | lr : rmem_swiotlb_device_init+0xf0/0x1ec | Call trace: | rmem_swiotlb_device_init+0xfc/0x1ec | of_reserved_mem_device_init_by_idx+0x18c/0x238 | of_dma_configure_id+0x31c/0x33c | platform_dma_configure+0x34/0x80 faddr2line reveals that the crash is in the list validation code: include/linux/list.h:83 include/linux/rculist.h:79 include/linux/rculist.h:106 kernel/dma/swiotlb.c:306 kernel/dma/swiotlb.c:1695 because add_mem_pool() is trying to list_add_rcu() to a NULL 'mem->pools'. Fix the crash by initialising the 'mem->pools' list_head in rmem_swiotlb_device_init() before calling add_mem_pool(). Reported-by: Nikita Ioffe <ioffe@google.com> Tested-by: Nikita Ioffe <ioffe@google.com> Fixes: 1aaa736815eb ("swiotlb: allocate a new memory pool when existing pools are full") Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-02vmlinux: Avoid weak reference to notes sectionArd Biesheuvel1-2/+2
Weak references are references that are permitted to remain unsatisfied in the final link. This means they cannot be implemented using place relative relocations, resulting in GOT entries when using position independent code generation. The notes section should always exist, so the weak annotations can be omitted. Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-05-02kallsyms: Avoid weak references for kallsyms symbolsArd Biesheuvel2-24/+12
kallsyms is a directory of all the symbols in the vmlinux binary, and so creating it is somewhat of a chicken-and-egg problem, as its non-zero size affects the layout of the binary, and therefore the values of the symbols. For this reason, the kernel is linked more than once, and the first pass does not include any kallsyms data at all. For the linker to accept this, the symbol declarations describing the kallsyms metadata are emitted as having weak linkage, so they can remain unsatisfied. During the subsequent passes, the weak references are satisfied by the kallsyms metadata that was constructed based on information gathered from the preceding passes. Weak references lead to somewhat worse codegen, because taking their address may need to produce NULL (if the reference was unsatisfied), and this is not usually supported by RIP or PC relative symbol references. Given that these references are ultimately always satisfied in the final link, let's drop the weak annotation, and instead, provide fallback definitions in the linker script that are only emitted if an unsatisfied reference exists. While at it, drop the FRV specific annotation that these symbols reside in .rodata - FRV is long gone. Tested-by: Nick Desaulniers <ndesaulniers@google.com> # Boot Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://lkml.kernel.org/r/20230504174320.3930345-1-ardb%40kernel.org Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-05-01bpf: crypto: fix build when CONFIG_CRYPTO=mVadim Fedorenko1-1/+1
Crypto subsytem can be build as a module. In this case we still have to build BPF crypto framework otherwise the build will fail. Fixes: 3e1c6f35409f ("bpf: make common crypto API for TC/XDP programs") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202405011634.4JK40epY-lkp@intel.com/ Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Link: https://lore.kernel.org/r/20240501170130.1682309-1-vadfed@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-01hardening: Enable KCFI and some other optionsKees Cook1-0/+8
Add some stuff that got missed along the way: - CONFIG_UNWIND_PATCH_PAC_INTO_SCS=y so SCS vs PAC is hardware selectable. - CONFIG_X86_KERNEL_IBT=y while a default, just be sure. - CONFIG_CFI_CLANG=y globally. - CONFIG_PAGE_TABLE_CHECK=y for userspace mapping sanity. Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240501193709.make.982-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2024-05-01rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()Andrii Nakryiko1-0/+2
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating that RCU is watching when trying to setup rethooko on a function entry. One notable exception when we force rcu_is_watching() check is CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use old-style int3-based workflow instead of relying on ftrace, making RCU watching check important to validate. This further (in addition to improvements in the previous patch) improves BPF multi-kretprobe (which rely on rethook) runtime throughput by 2.3%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/CAEf4BzauQ2WKMjZdc9s0rBWa01BYbgwHN6aNDXQSHYia47pQ-w@mail.gmail.com/ Link: https://lore.kernel.org/all/20240418190909.704286-2-andrii@kernel.org/ Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01ftrace: make extra rcu_is_watching() validation check optionalAndrii Nakryiko1-0/+13
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations. This check is expected to never be true and is mostly useful for low-level validation of ftrace subsystem invariants. For most users it should probably be kept disabled to eliminate unnecessary runtime overhead. This improves BPF multi-kretprobe (relying on ftrace and rethook infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/CAEf4BzauQ2WKMjZdc9s0rBWa01BYbgwHN6aNDXQSHYia47pQ-w@mail.gmail.com/ Link: https://lore.kernel.org/all/20240418190909.704286-1-andrii@kernel.org/ Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01uprobes: reduce contention on uprobes_tree accessJonathan Haslam1-11/+11
Active uprobes are stored in an RB tree and accesses to this tree are dominated by read operations. Currently these accesses are serialized by a spinlock but this leads to enormous contention when large numbers of threads are executing active probes. This patch converts the spinlock used to serialize access to the uprobes_tree RB tree into a reader-writer spinlock. This lock type aligns naturally with the overwhelmingly read-only nature of the tree usage here. Although the addition of reader-writer spinlocks are discouraged [0], this fix is proposed as an interim solution while an RCU based approach is implemented (that work is in a nascent form). This fix also has the benefit of being trivial, self contained and therefore simple to backport. We have used a uprobe benchmark from the BPF selftests [1] to estimate the improvements. Each block of results below show 1 line per execution of the benchmark ("the "Summary" line) and each line is a run with one more thread added - a thread is a "producer". The lines are edited to remove extraneous output. The tests were executed with this driver script: for num_threads in {1..20} do sudo ./bench -a -p $num_threads trig-uprobe-nop | grep Summary done SPINLOCK (BEFORE) ================== Summary: hits 1.396 ± 0.007M/s ( 1.396M/prod) Summary: hits 1.656 ± 0.016M/s ( 0.828M/prod) Summary: hits 2.246 ± 0.008M/s ( 0.749M/prod) Summary: hits 2.114 ± 0.010M/s ( 0.529M/prod) Summary: hits 2.013 ± 0.009M/s ( 0.403M/prod) Summary: hits 1.753 ± 0.008M/s ( 0.292M/prod) Summary: hits 1.847 ± 0.001M/s ( 0.264M/prod) Summary: hits 1.889 ± 0.001M/s ( 0.236M/prod) Summary: hits 1.833 ± 0.006M/s ( 0.204M/prod) Summary: hits 1.900 ± 0.003M/s ( 0.190M/prod) Summary: hits 1.918 ± 0.006M/s ( 0.174M/prod) Summary: hits 1.925 ± 0.002M/s ( 0.160M/prod) Summary: hits 1.837 ± 0.001M/s ( 0.141M/prod) Summary: hits 1.898 ± 0.001M/s ( 0.136M/prod) Summary: hits 1.799 ± 0.016M/s ( 0.120M/prod) Summary: hits 1.850 ± 0.005M/s ( 0.109M/prod) Summary: hits 1.816 ± 0.002M/s ( 0.101M/prod) Summary: hits 1.787 ± 0.001M/s ( 0.094M/prod) Summary: hits 1.764 ± 0.002M/s ( 0.088M/prod) RW SPINLOCK (AFTER) =================== Summary: hits 1.444 ± 0.020M/s ( 1.444M/prod) Summary: hits 2.279 ± 0.011M/s ( 1.139M/prod) Summary: hits 3.422 ± 0.014M/s ( 1.141M/prod) Summary: hits 3.565 ± 0.017M/s ( 0.891M/prod) Summary: hits 2.671 ± 0.013M/s ( 0.534M/prod) Summary: hits 2.409 ± 0.005M/s ( 0.401M/prod) Summary: hits 2.485 ± 0.008M/s ( 0.355M/prod) Summary: hits 2.496 ± 0.003M/s ( 0.312M/prod) Summary: hits 2.585 ± 0.002M/s ( 0.287M/prod) Summary: hits 2.908 ± 0.011M/s ( 0.291M/prod) Summary: hits 2.346 ± 0.016M/s ( 0.213M/prod) Summary: hits 2.804 ± 0.004M/s ( 0.234M/prod) Summary: hits 2.556 ± 0.001M/s ( 0.197M/prod) Summary: hits 2.754 ± 0.004M/s ( 0.197M/prod) Summary: hits 2.482 ± 0.002M/s ( 0.165M/prod) Summary: hits 2.412 ± 0.005M/s ( 0.151M/prod) Summary: hits 2.710 ± 0.003M/s ( 0.159M/prod) Summary: hits 2.826 ± 0.005M/s ( 0.157M/prod) Summary: hits 2.718 ± 0.001M/s ( 0.143M/prod) Summary: hits 2.844 ± 0.006M/s ( 0.142M/prod) The numbers in parenthesis give averaged throughput per thread which is of greatest interest here as a measure of scalability. Improvements are in the order of 22 - 68% with this particular benchmark (mean = 43%). V2: - Updated commit message to include benchmark results. [0] https://docs.kernel.org/locking/spinlocks.html [1] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c Link: https://lore.kernel.org/all/20240422102306.6026-1-jonathan.haslam@gmail.com/ Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01rethook: Remove warning messages printed for finding return address of a frame.Kui-Feng Lee1-1/+1
The function rethook_find_ret_addr() prints a warning message and returns 0 when the target task is running and is not the "current" task in order to prevent the incorrect return address, although it still may return an incorrect address. However, the warning message turns into noise when BPF profiling programs call bpf_get_task_stack() on running tasks in a firm with a large number of hosts. The callers should be aware and willing to take the risk of receiving an incorrect return address from a task that is currently running other than the "current" one. A warning is not needed here as the callers are intent on it. Link: https://lore.kernel.org/all/20240408175140.60223-1-thinker.li@gmail.com/ Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01tracing/probes: support '%pD' type for print struct file's nameYe Bin2-23/+36
As like '%pd' type, this patch supports print type '%pD' for print file's name. For example "name=$arg1:%pD" casts the `$arg1` as (struct file*), dereferences the "file.f_path.dentry.d_name.name" field and stores it to "name" argument as a kernel string. Here is an example: [tracing]# echo 'p:testprobe vfs_read name=$arg1:%pD' > kprobe_event [tracing]# echo 1 > events/kprobes/testprobe/enable [tracing]# grep -q "1" events/kprobes/testprobe/enable [tracing]# echo 0 > events/kprobes/testprobe/enable [tracing]# grep "vfs_read" trace | grep "enable" grep-15108 [003] ..... 5228.328609: testprobe: (vfs_read+0x4/0xbb0) name="enable" Note that this expects the given argument (e.g. $arg1) is an address of struct file. User must ensure it. Link: https://lore.kernel.org/all/20240322064308.284457-3-yebin10@huawei.com/ [Masami: replaced "previous patch" with '%pd' type] Signed-off-by: Ye Bin <yebin10@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01tracing/probes: support '%pd' type for print struct dentry's nameYe Bin5-1/+65
During fault locating, the file name needs to be printed based on the dentry address. The offset needs to be calculated each time, which is troublesome. Similar to printk, kprobe support print type '%pd' for print dentry's name. For example "name=$arg1:%pd" casts the `$arg1` as (struct dentry *), dereferences the "d_name.name" field and stores it to "name" argument as a kernel string. Here is an example: [tracing]# echo 'p:testprobe dput name=$arg1:%pd' > kprobe_events [tracing]# echo 1 > events/kprobes/testprobe/enable [tracing]# grep -q "1" events/kprobes/testprobe/enable [tracing]# echo 0 > events/kprobes/testprobe/enable [tracing]# cat trace | grep "enable" bash-14844 [002] ..... 16912.889543: testprobe: (dput+0x4/0x30) name="enable" grep-15389 [003] ..... 16922.834182: testprobe: (dput+0x4/0x30) name="enable" grep-15389 [003] ..... 16922.836103: testprobe: (dput+0x4/0x30) name="enable" bash-14844 [001] ..... 16931.820909: testprobe: (dput+0x4/0x30) name="enable" Note that this expects the given argument (e.g. $arg1) is an address of struct dentry. User must ensure it. Link: https://lore.kernel.org/all/20240322064308.284457-2-yebin10@huawei.com/ Signed-off-by: Ye Bin <yebin10@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01uprobes: add speculative lockless system-wide uprobe filter checkAndrii Nakryiko1-3/+7
It's very common with BPF-based uprobe/uretprobe use cases to have a system-wide (not PID specific) probes used. In this case uprobe's trace_uprobe_filter->nr_systemwide counter is bumped at registration time, and actual filtering is short circuited at the time when uprobe/uretprobe is triggered. This is a great optimization, and the only issue with it is that to even get to checking this counter uprobe subsystem is taking read-side trace_uprobe_filter->rwlock. This is actually noticeable in profiles and is just another point of contention when uprobe is triggered on multiple CPUs simultaneously. This patch moves this nr_systemwide check outside of filter list's rwlock scope, as rwlock is meant to protect list modification, while nr_systemwide-based check is speculative and racy already, despite the lock (as discussed in [0]). trace_uprobe_filter_remove() and trace_uprobe_filter_add() already check for filter->nr_systewide explicitly outside of __uprobe_perf_filter, so no modifications are required there. Confirming with BPF selftests's based benchmarks. BEFORE (based on changes in previous patch) =========================================== uprobe-nop : 2.732 ± 0.022M/s uprobe-push : 2.621 ± 0.016M/s uprobe-ret : 1.105 ± 0.007M/s uretprobe-nop : 1.396 ± 0.007M/s uretprobe-push : 1.347 ± 0.008M/s uretprobe-ret : 0.800 ± 0.006M/s AFTER ===== uprobe-nop : 2.878 ± 0.017M/s (+5.5%, total +8.3%) uprobe-push : 2.753 ± 0.013M/s (+5.3%, total +10.2%) uprobe-ret : 1.142 ± 0.010M/s (+3.8%, total +3.8%) uretprobe-nop : 1.444 ± 0.008M/s (+3.5%, total +6.5%) uretprobe-push : 1.410 ± 0.010M/s (+4.8%, total +7.1%) uretprobe-ret : 0.816 ± 0.002M/s (+2.0%, total +3.9%) In the above, first percentage value is based on top of previous patch (lazy uprobe buffer optimization), while the "total" percentage is based on kernel without any of the changes in this patch set. As can be seen, we get about 4% - 10% speed up, in total, with both lazy uprobe buffer and speculative filter check optimizations. [0] https://lore.kernel.org/bpf/20240313131926.GA19986@redhat.com/ Reviewed-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/all/20240318181728.2795838-4-andrii@kernel.org/ Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01uprobes: prepare uprobe args buffer lazilyAndrii Nakryiko1-21/+28
uprobe_cpu_buffer and corresponding logic to store uprobe args into it are used for uprobes/uretprobes that are created through tracefs or perf events. BPF is yet another user of uprobe/uretprobe infrastructure, but doesn't need uprobe_cpu_buffer and associated data. For BPF-only use cases this buffer handling and preparation is a pure overhead. At the same time, BPF-only uprobe/uretprobe usage is very common in practice. Also, for a lot of cases applications are very senstivie to performance overheads, as they might be tracing a very high frequency functions like malloc()/free(), so every bit of performance improvement matters. All that is to say that this uprobe_cpu_buffer preparation is an unnecessary overhead that each BPF user of uprobes/uretprobe has to pay. This patch is changing this by making uprobe_cpu_buffer preparation optional. It will happen only if either tracefs-based or perf event-based uprobe/uretprobe consumer is registered for given uprobe/uretprobe. For BPF-only use cases this step will be skipped. We used uprobe/uretprobe benchmark which is part of BPF selftests (see [0]) to estimate the improvements. We have 3 uprobe and 3 uretprobe scenarios, which vary an instruction that is replaced by uprobe: nop (fastest uprobe case), `push rbp` (typical case), and non-simulated `ret` instruction (slowest case). Benchmark thread is constantly calling user space function in a tight loop. User space function has attached BPF uprobe or uretprobe program doing nothing but atomic counter increments to count number of triggering calls. Benchmark emits throughput in millions of executions per second. BEFORE these changes ==================== uprobe-nop : 2.657 ± 0.024M/s uprobe-push : 2.499 ± 0.018M/s uprobe-ret : 1.100 ± 0.006M/s uretprobe-nop : 1.356 ± 0.004M/s uretprobe-push : 1.317 ± 0.019M/s uretprobe-ret : 0.785 ± 0.007M/s AFTER these changes =================== uprobe-nop : 2.732 ± 0.022M/s (+2.8%) uprobe-push : 2.621 ± 0.016M/s (+4.9%) uprobe-ret : 1.105 ± 0.007M/s (+0.5%) uretprobe-nop : 1.396 ± 0.007M/s (+2.9%) uretprobe-push : 1.347 ± 0.008M/s (+2.3%) uretprobe-ret : 0.800 ± 0.006M/s (+1.9) So the improvements on this particular machine seems to be between 2% and 5%. [0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c Reviewed-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/all/20240318181728.2795838-3-andrii@kernel.org/ Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01uprobes: encapsulate preparation of uprobe args bufferAndrii Nakryiko1-37/+41
Move the logic of fetching temporary per-CPU uprobe buffer and storing uprobes args into it to a new helper function. Store data size as part of this buffer, simplifying interfaces a bit, as now we only pass single uprobe_cpu_buffer reference around, instead of pointer + dsize. This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher, and now will be centralized. All this is also in preparation to make this uprobe_cpu_buffer handling logic optional in the next patch. Link: https://lore.kernel.org/all/20240318181728.2795838-2-andrii@kernel.org/ [Masami: update for v6.9-rc3 kernel] Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-01Merge branches 'fixes.2024.04.15a', 'misc.2024.04.12a', ↵Uladzislau Rezki (Sony)12-77/+479
'rcu-sync-normal-improve.2024.04.15a', 'rcu-tasks.2024.04.15a' and 'rcutorture.2024.04.15a' into rcu-merge.2024.04.15a fixes.2024.04.15a: RCU fixes misc.2024.04.12a: Miscellaneous fixes rcu-sync-normal-improve.2024.04.15a: Improving synchronize_rcu() call rcu-tasks.2024.04.15a: Tasks RCU updates rcutorture.2024.04.15a: Torture-test updates
2024-04-30bpf: Add BPF_PROG_TYPE_CGROUP_SKB attach type enforcement in BPF_LINK_CREATEStanislav Fomichev1-0/+5
bpf_prog_attach uses attach_type_to_prog_type to enforce proper attach type for BPF_PROG_TYPE_CGROUP_SKB. link_create uses bpf_prog_get and relies on bpf_prog_attach_check_attach_type to properly verify prog_type <> attach_type association. Add missing attach_type enforcement for the link_create case. Otherwise, it's currently possible to attach cgroup_skb prog types to other cgroup hooks. Fixes: af6eea57437a ("bpf: Implement bpf_link-based cgroup BPF program attachment") Link: https://lore.kernel.org/bpf/0000000000004792a90615a1dde0@google.com/ Reported-by: syzbot+838346b979830606c854@syzkaller.appspotmail.com Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240426231621.2716876-2-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-30Merge patch series "riscv: Create and document PR_RISCV_SET_ICACHE_FLUSH_CTX ↵Palmer Dabbelt1-0/+6
prctl" Charlie Jenkins <charlie@rivosinc.com> says: Improve the performance of icache flushing by creating a new prctl flag PR_RISCV_SET_ICACHE_FLUSH_CTX. The interface is left generic to allow for future expansions such as with the proposed J extension [1]. Documentation is also provided to explain the use case. Patch sent to add PR_RISCV_SET_ICACHE_FLUSH_CTX to man-pages [2]. [1] https://github.com/riscv/riscv-j-extension [2] https://lore.kernel.org/linux-man/20240124-fencei_prctl-v1-1-0bddafcef331@rivosinc.com * b4-shazam-merge: cpumask: Add assign cpu documentation: Document PR_RISCV_SET_ICACHE_FLUSH_CTX prctl riscv: Include riscv_set_icache_flush_ctx prctl riscv: Remove unnecessary irqflags processor.h include Link: https://lore.kernel.org/r/20240312-fencei-v13-0-4b6bdc2bbf32@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-04-30bpf: Add support for kprobe session cookieJiri Olsa2-3/+23
Adding support for cookie within the session of kprobe multi entry and return program. The session cookie is u64 value and can be retrieved be new kfunc bpf_session_cookie, which returns pointer to the cookie value. The bpf program can use the pointer to store (on entry) and load (on return) the value. The cookie value is implemented via fprobe feature that allows to share values between entry and return ftrace fprobe callbacks. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240430112830.1184228-4-jolsa@kernel.org
2024-04-30bpf: Add support for kprobe session contextJiri Olsa2-7/+63
Adding struct bpf_session_run_ctx object to hold session related data, which is atm is_return bool and data pointer coming in following changes. Placing bpf_session_run_ctx layer in between bpf_run_ctx and bpf_kprobe_multi_run_ctx so the session data can be retrieved regardless of if it's kprobe_multi or uprobe_multi link, which support is coming in future. This way both kprobe_multi and uprobe_multi can use same kfuncs to access the session data. Adding bpf_session_is_return kfunc that returns true if the bpf program is executed from the exit probe of the kprobe multi link attached in wrapper mode. It returns false otherwise. Adding new kprobe hook for kprobe program type. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240430112830.1184228-3-jolsa@kernel.org
2024-04-30bpf: Add support for kprobe session attachJiri Olsa2-9/+26
Adding support to attach bpf program for entry and return probe of the same function. This is common use case which at the moment requires to create two kprobe multi links. Adding new BPF_TRACE_KPROBE_SESSION attach type that instructs kernel to attach single link program to both entry and exit probe. It's possible to control execution of the bpf program on return probe simply by returning zero or non zero from the entry bpf program execution to execute or not the bpf program on return probe respectively. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240430112830.1184228-2-jolsa@kernel.org
2024-04-30bpf: Do not walk twice the hash map on freeBenjamin Tissoires1-36/+13
If someone stores both a timer and a workqueue in a hash map, on free, we would walk it twice. Add a check in htab_free_malloced_timers_or_wq and free the timers and workqueues if they are present. Fixes: 246331e3f1ea ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps") Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20240430-bpf-next-v3-2-27afe7f3b17c@kernel.org
2024-04-30bpf: Do not walk twice the map on freeBenjamin Tissoires1-7/+8
If someone stores both a timer and a workqueue in a map, on free we would walk it twice. Add a check in array_map_free_timers_wq and free the timers and workqueues if they are present. Fixes: 246331e3f1ea ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps") Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20240430-bpf-next-v3-1-27afe7f3b17c@kernel.org
2024-04-30PM: hibernate: replace deprecated strncpy() with strscpy()Justin Stitt1-1/+1
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. This kernel config option is simply assigned with the resume_file buffer. It should be NUL-terminated but not necessarily NUL-padded as per its further usage with other string apis: | static int __init find_resume_device(void) | { | if (!strlen(resume_file)) | return -ENOENT; | | pm_pr_dbg("Checking hibernation image partition %s\n", resume_file); Use strscpy() [2] as it guarantees NUL-termination on the destination buffer. Specifically, use the new 2-argument version of strscpy() introduced in Commit e6584c3964f2f ("string: Allow 2-argument strscpy()"). Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Dhruva Gole <d-gole@ti.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-04-29bpf: Switch to krealloc_array()Andy Shevchenko1-1/+1
Let the krealloc_array() copy the original data and check for a multiplication overflow. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240429120005.3539116-1-andriy.shevchenko@linux.intel.com
2024-04-29bpf: Use struct_size()Andy Shevchenko1-5/+7
Use struct_size() instead of hand writing it. This is less verbose and more robust. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240429121323.3818497-1-andriy.shevchenko@linux.intel.com
2024-04-29Merge tag 'wq-for-6.9-rc6-fixes' of ↵Linus Torvalds1-3/+16
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Two doc update patches and the following three fixes: - On single node systems, the default pool is used but the node_nr_active for the default pool was set to min_active. This effectively limited the max concurrency of unbound pools on single node systems to 8 causing performance regressions on some workloads. Fixed by setting the default pool's node_nr_active to max_active. - wq_update_node_max_active() could trigger divide-by-zero if the intersection between the allowed CPUs for an unbound workqueue and online CPUs becomes empty. - When kick_pool() was trying to repatriate a worker to a CPU in its pod by setting task->wake_cpu, it didn't consider whether the CPU being selected is online or not which obviously can lead to subobtimal behaviors. On s390, this triggered a crash in arch code. The workqueue patch removes the gross misbehavior but doesn't fix the crash completely as there's a race window in which CPUs can go down after wake_cpu is set. Need to decide whether the fix should be on the core or arch side" * tag 'wq-for-6.9-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Fix divide error in wq_update_node_max_active() workqueue: The default node_nr_active should have its max set to max_active workqueue: Fix selection of wake_cpu in kick_pool() docs/zh_CN: core-api: Update translation of workqueue.rst to 6.9-rc1 Documentation/core-api: Update events_freezable_power references.
2024-04-30clocksource: Make the int help prompt unit readable in ncursesBorislav Petkov (AMD)1-1/+1
When doing make menuconfig and searching for the CLOCKSOURCE_WATCHDOG_MAX_SKEW_US config item, the help says: │ Symbol: CLOCKSOURCE_WATCHDOG_MAX_SKEW_US [=125] │ Type : integer │ Range : [50 1000] │ Defined at kernel/time/Kconfig:204 │ Prompt: Clocksource watchdog maximum allowable skew (in s) ^^^ │ Depends on: GENERIC_CLOCKEVENTS [=y] && CLOCKSOURCE_WATCHDOG [=y] because on some terminals, it cannot display the 'μ' char, unicode number 0x3bc. So simply write it out so that there's no trouble. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240428102143.26764-1-bp@kernel.org
2024-04-29bpf: Fix verifier assumptions about socket->skAlexei Starovoitov1-5/+18
The verifier assumes that 'sk' field in 'struct socket' is valid and non-NULL when 'socket' pointer itself is trusted and non-NULL. That may not be the case when socket was just created and passed to LSM socket_accept hook. Fix this verifier assumption and adjust tests. Reported-by: Liam Wisehart <liamwisehart@meta.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Fixes: 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20240427002544.68803-1-alexei.starovoitov@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-29Merge tag 'for-netdev' of ↵Jakub Kicinski18-222/+1249
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-04-29 We've added 147 non-merge commits during the last 32 day(s) which contain a total of 158 files changed, 9400 insertions(+), 2213 deletions(-). The main changes are: 1) Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86 BPF JIT. This allows inlining per-CPU array and hashmap lookups and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko. 2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song. 3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction, from Alexei Starovoitov. 4) Add support for passing mark with bpf_fib_lookup helper, from Anton Protopopov. 5) Add a new bpf_wq API for deferring events and refactor sleepable bpf_timer code to keep common code where possible, from Benjamin Tissoires. 6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs to check when NULL is passed for non-NULLable parameters, from Eduard Zingerman. 7) Harden the BPF verifier's and/or/xor value tracking, from Harishankar Vishwanathan. 8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel crypto subsystem, from Vadim Fedorenko. 9) Various improvements to the BPF instruction set standardization doc, from Dave Thaler. 10) Extend libbpf APIs to partially consume items from the BPF ringbuffer, from Andrea Righi. 11) Bigger batch of BPF selftests refactoring to use common network helpers and to drop duplicate code, from Geliang Tang. 12) Support bpf_tail_call_static() helper for BPF programs with GCC 13, from Jose E. Marchesi. 13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled, from Kumar Kartikeya Dwivedi. 14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs, from David Vernet. 15) Extend the BPF verifier to allow different input maps for a given bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu. 16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan. 17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison the stack memory, from Martin KaFai Lau. 18) Improve xsk selftest coverage with new tests on maximum and minimum hardware ring size configurations, from Tushar Vyavahare. 19) Various ReST man pages fixes as well as documentation and bash completion improvements for bpftool, from Rameez Rehman & Quentin Monnet. 20) Fix libbpf with regards to dumping subsequent char arrays, from Quentin Deslandes. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits) bpf, docs: Clarify PC use in instruction-set.rst bpf_helpers.h: Define bpf_tail_call_static when building with GCC bpf, docs: Add introduction for use in the ISA Internet Draft selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args selftests/bpf: dummy_st_ops should reject 0 for non-nullable params bpf: check bpf_dummy_struct_ops program params for test runs selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops selftests/bpf: adjust dummy_st_ops_success to detect additional error bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable selftests/bpf: Add ring_buffer__consume_n test. bpf: Add bpf_guard_preempt() convenience macro selftests: bpf: crypto: add benchmark for crypto functions selftests: bpf: crypto skcipher algo selftests bpf: crypto: add skcipher to bpf crypto bpf: make common crypto API for TC/XDP programs bpf: update the comment for BTF_FIELDS_MAX selftests/bpf: Fix wq test. selftests/bpf: Use make_sockaddr in test_sock_addr selftests/bpf: Use connect_to_addr in test_sock_addr ... ==================== Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-29bounds: Use the right number of bits for power-of-two CONFIG_NR_CPUSMatthew Wilcox (Oracle)1-1/+1
bits_per() rounds up to the next power of two when passed a power of two. This causes crashes on some machines and configurations. Reported-by: Михаил Новоселов <m.novosyolov@rosalinux.ru> Tested-by: Ильфат Гаптрахманов <i.gaptrakhmanov@rosalinux.ru> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3347 Link: https://lore.kernel.org/all/1c978cf1-2934-4e66-e4b3-e81b04cb3571@rosalinux.ru/ Fixes: f2d5dcb48f7b (bounds: support non-power-of-two CONFIG_NR_CPUS) Cc: <stable@vger.kernel.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-04-29tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()LuMingYin1-1/+1
If traceprobe_parse_probe_arg_body() failed to allocate 'parg->fmt', it jumps to the label 'out' instead of 'fail' by mistake.In the result, the buffer 'tmp' is not freed in this case and leaks its memory. Thus jump to the label 'fail' in that error case. Link: https://lore.kernel.org/all/20240427072347.1421053-1-lumingyindetect@126.com/ Fixes: 032330abd08b ("tracing/probes: Cleanup probe argument parser") Signed-off-by: LuMingYin <lumingyindetect@126.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-04-29softirq: Fix suspicious RCU usage in __do_softirq()Zqiang1-4/+8
Currently, the condition "__this_cpu_read(ksoftirqd) == current" is used to invoke rcu_softirq_qs() in ksoftirqd tasks context for non-RT kernels. This works correctly as long as the context is actually task context but this condition is wrong when: - the current task is ksoftirqd - the task is interrupted in a RCU read side critical section - __do_softirq() is invoked on return from interrupt Syzkaller triggered the following scenario: -> finish_task_switch() -> put_task_struct_rcu_user() -> call_rcu(&task->rcu, delayed_put_task_struct) -> __kasan_record_aux_stack() -> pfn_valid() -> rcu_read_lock_sched() <interrupt> __irq_exit_rcu() -> __do_softirq)() -> if (!IS_ENABLED(CONFIG_PREEMPT_RT) && __this_cpu_read(ksoftirqd) == current) -> rcu_softirq_qs() -> RCU_LOCKDEP_WARN(lock_is_held(&rcu_sched_lock_map)) The rcu quiescent state is reported in the rcu-read critical section, so the lockdep warning is triggered. Fix this by splitting out the inner working of __do_softirq() into a helper function which takes an argument to distinguish between ksoftirqd task context and interrupted context and invoke it from the relevant call sites with the proper context information and use that for the conditional invocation of rcu_softirq_qs(). Reported-by: syzbot+dce04ed6d1438ad69656@syzkaller.appspotmail.com Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240427102808.29356-1-qiang.zhang1211@gmail.com Link: https://lore.kernel.org/lkml/8f281a10-b85a-4586-9586-5bbc12dc784f@paulmck-laptop/T/#mea8aba4abfcb97bbf499d169ce7f30c4cff1b0e3
2024-04-28Merge tag 'sched-urgent-2024-04-28' of ↵Linus Torvalds2-16/+36
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix EEVDF corner cases - Fix two nohz_full= related bugs that can cause boot crashes and warnings * tag 'sched-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU sched/isolation: Prevent boot crash when the boot CPU is nohz_full sched/eevdf: Prevent vlag from going out of bounds in reweight_eevdf() sched/eevdf: Fix miscalculation in reweight_entity() when se is not curr sched/eevdf: Always update V if se->on_rq when reweighting
2024-04-28Merge tag 'x86-urgent-2024-04-28' of ↵Linus Torvalds1-4/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Make the CPU_MITIGATIONS=n interaction with conflicting mitigation-enabling boot parameters a bit saner. - Re-enable CPU mitigations by default on non-x86 - Fix TDX shared bit propagation on mprotect() - Fix potential show_regs() system hang when PKE initialization is not fully finished yet. - Add the 0x10-0x1f model IDs to the Zen5 range - Harden #VC instruction emulation some more * tag 'x86-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=n cpu: Re-enable CPU mitigations by default for !X86 architectures x86/tdx: Preserve shared bit on mprotect() x86/cpu: Fix check for RDPKRU in __show_regs() x86/CPU/AMD: Add models 0x10-0x1f to the Zen5 range x86/sev: Check for MWAITX and MONITORX opcodes in the #VC handler
2024-04-28sched/isolation: Fix boot crash when maxcpus < first housekeeping CPUOleg Nesterov1-1/+6
housekeeping_setup() checks cpumask_intersects(present, online) to ensure that the kernel will have at least one housekeeping CPU after smp_init(), but this doesn't work if the maxcpus= kernel parameter limits the number of processors available after bootup. For example, a kernel with "maxcpus=2 nohz_full=0-2" parameters crashes at boot time on a virtual machine with 4 CPUs. Change housekeeping_setup() to use cpumask_first_and() and check that the returned CPU number is valid and less than setup_max_cpus. Another corner case is "nohz_full=0" on a machine with a single CPU or with the maxcpus=1 kernel argument. In this case non_housekeeping_mask is empty and tick_nohz_full_setup() makes no sense. And indeed, the kernel hits the WARN_ON(tick_nohz_full_running) in tick_sched_do_timer(). And how should the kernel interpret the "nohz_full=" parameter? It should be silently ignored, but currently cpulist_parse() happily returns the empty cpumask and this leads to the same problem. Change housekeeping_setup() to check cpumask_empty(non_housekeeping_mask) and do nothing in this case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240413141746.GA10008@redhat.com
2024-04-28sched/isolation: Prevent boot crash when the boot CPU is nohz_fullOleg Nesterov1-1/+10
Documentation/timers/no_hz.rst states that the "nohz_full=" mask must not include the boot CPU, which is no longer true after: 08ae95f4fd3b ("nohz_full: Allow the boot CPU to be nohz_full"). However after: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work") the kernel will crash at boot time in this case; housekeeping_any_cpu() returns an invalid CPU number until smp_init() brings the first housekeeping CPU up. Change housekeeping_any_cpu() to check the result of cpumask_any_and() and return smp_processor_id() in this case. This is just the simple and backportable workaround which fixes the symptom, but smp_processor_id() at boot time should be safe at least for type == HK_TYPE_TIMER, this more or less matches the tick_do_timer_boot_cpu logic. There is no worry about cpu_down(); tick_nohz_cpu_down() will not allow to offline tick_do_timer_cpu (the 1st online housekeeping CPU). Fixes: aae17ebb53cd ("workqueue: Avoid using isolated cpus' timers on queue_delayed_work") Reported-by: Chris von Recklinghausen <crecklin@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240411143905.GA19288@redhat.com Closes: https://lore.kernel.org/all/20240402105847.GA24832@redhat.com/
2024-04-27profiling: Remove create_prof_cpu_mask().Tetsuo Handa1-43/+0
create_prof_cpu_mask() is no longer used after commit 1f44a225777e ("s390: convert interrupt handling to use generic hardirq"). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-04-26Merge tag 'for-netdev' of ↵Jakub Kicinski2-2/+40
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-04-26 We've added 12 non-merge commits during the last 22 day(s) which contain a total of 14 files changed, 168 insertions(+), 72 deletions(-). The main changes are: 1) Fix BPF_PROBE_MEM in verifier and JIT to skip loads from vsyscall page, from Puranjay Mohan. 2) Fix a crash in XDP with devmap broadcast redirect when the latter map is in process of being torn down, from Toke Høiland-Jørgensen. 3) Fix arm64 and riscv64 BPF JITs to properly clear start time for BPF program runtime stats, from Xu Kuohai. 4) Fix a sockmap KCSAN-reported data race in sk_psock_skb_ingress_enqueue, from Jason Xing. 5) Fix BPF verifier error message in resolve_pseudo_ldimm64, from Anton Protopopov. 6) Fix missing DEBUG_INFO_BTF_MODULES Kconfig menu item, from Andrii Nakryiko. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test PROBE_MEM of VSYSCALL_ADDR on x86-64 bpf, x86: Fix PROBE_MEM runtime load check bpf: verifier: prevent userspace memory access xdp: use flags field to disambiguate broadcast redirect arm32, bpf: Reimplement sign-extension mov instruction riscv, bpf: Fix incorrect runtime stats bpf, arm64: Fix incorrect runtime stats bpf: Fix a verifier verbose message bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue MAINTAINERS: bpf: Add Lehui and Puranjay as riscv64 reviewers MAINTAINERS: Update email address for Puranjay Mohan bpf, kconfig: Fix DEBUG_INFO_BTF_MODULES Kconfig definition ==================== Link: https://lore.kernel.org/r/20240426224248.26197-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-26Merge tag 'mm-hotfixes-stable-2024-04-26-13-30' of ↵Linus Torvalds1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 8 are cc:stable and the remaining 3 (nice ratio!) address post-6.8 issues or aren't considered suitable for backporting. All except one of these are for MM. I see no particular theme - it's singletons all over" * tag 'mm-hotfixes-stable-2024-04-26-13-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio() selftests: mm: protection_keys: save/restore nr_hugepages value from launch script stackdepot: respect __GFP_NOLOCKDEP allocation flag hugetlb: check for anon_vma prior to folio allocation mm: zswap: fix shrinker NULL crash with cgroup_disable=memory mm: turn folio_test_hugetlb into a PageType mm: support page_mapcount() on page_has_type() pages mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros mm/hugetlb: fix missing hugetlb_lock for resv uncharge selftests: mm: fix unused and uninitialized variable warning selftests/harness: remove use of LINE_MAX
2024-04-26bpf: verifier: prevent userspace memory accessPuranjay Mohan2-0/+39
With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To thwart invalid memory accesses, the JITs add an exception table entry for all such accesses. But in case the src_reg + offset is a userspace address, the BPF program might read that memory if the user has mapped it. Make the verifier add guard instructions around such memory accesses and skip the load if the address falls into the userspace region. The JITs need to implement bpf_arch_uaddress_limit() to define where the userspace addresses end for that architecture or TASK_SIZE is taken as default. The implementation is as follows: REG_AX = SRC_REG if(offset) REG_AX += offset; REG_AX >>= 32; if (REG_AX <= (uaddress_limit >> 32)) DST_REG = 0; else DST_REG = *(size *)(SRC_REG + offset); Comparing just the upper 32 bits of the load address with the upper 32 bits of uaddress_limit implies that the values are being aligned down to a 4GB boundary before comparison. The above means that all loads with address <= uaddress_limit + 4GB are skipped. This is acceptable because there is a large hole (much larger than 4GB) between userspace and kernel space memory, therefore a correctly functioning BPF program should not access this 4GB memory above the userspace. Let's analyze what this patch does to the following fentry program dereferencing an untrusted pointer: SEC("fentry/tcp_v4_connect") int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk) { *(volatile long *)sk; return 0; } BPF Program before | BPF Program after ------------------ | ----------------- 0: (79) r1 = *(u64 *)(r1 +0) 0: (79) r1 = *(u64 *)(r1 +0) ----------------------------------------------------------------------- 1: (79) r1 = *(u64 *)(r1 +0) --\ 1: (bf) r11 = r1 ----------------------------\ \ 2: (77) r11 >>= 32 2: (b7) r0 = 0 \ \ 3: (b5) if r11 <= 0x8000 goto pc+2 3: (95) exit \ \-> 4: (79) r1 = *(u64 *)(r1 +0) \ 5: (05) goto pc+1 \ 6: (b7) r1 = 0 \-------------------------------------- 7: (b7) r0 = 0 8: (95) exit As you can see from above, in the best case (off=0), 5 extra instructions are emitted. Now, we analyze the same program after it has gone through the JITs of ARM64 and RISC-V architectures. We follow the single load instruction that has the untrusted pointer and see what instrumentation has been added around it. x86-64 JIT ========== JIT's Instrumentation (upstream) --------------------- 0: nopl 0x0(%rax,%rax,1) 5: xchg %ax,%ax 7: push %rbp 8: mov %rsp,%rbp b: mov 0x0(%rdi),%rdi --------------------------------- f: movabs $0x800000000000,%r11 19: cmp %r11,%rdi 1c: jb 0x000000000000002a 1e: mov %rdi,%r11 21: add $0x0,%r11 28: jae 0x000000000000002e 2a: xor %edi,%edi 2c: jmp 0x0000000000000032 2e: mov 0x0(%rdi),%rdi --------------------------------- 32: xor %eax,%eax 34: leave 35: ret The x86-64 JIT already emits some instructions to protect against user memory access. This patch doesn't make any changes for the x86-64 JIT. ARM64 JIT ========= No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: add x9, x30, #0x0 0: add x9, x30, #0x0 4: nop 4: nop 8: paciasp 8: paciasp c: stp x29, x30, [sp, #-16]! c: stp x29, x30, [sp, #-16]! 10: mov x29, sp 10: mov x29, sp 14: stp x19, x20, [sp, #-16]! 14: stp x19, x20, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 24: mov x25, sp 24: mov x25, sp 28: mov x26, #0x0 28: mov x26, #0x0 2c: sub x27, x25, #0x0 2c: sub x27, x25, #0x0 30: sub sp, sp, #0x0 30: sub sp, sp, #0x0 34: ldr x0, [x0] 34: ldr x0, [x0] -------------------------------------------------------------------------------- 38: ldr x0, [x0] ----------\ 38: add x9, x0, #0x0 -----------------------------------\\ 3c: lsr x9, x9, #32 3c: mov x7, #0x0 \\ 40: cmp x9, #0x10, lsl #12 40: mov sp, sp \\ 44: b.ls 0x0000000000000050 44: ldp x27, x28, [sp], #16 \\--> 48: ldr x0, [x0] 48: ldp x25, x26, [sp], #16 \ 4c: b 0x0000000000000054 4c: ldp x21, x22, [sp], #16 \ 50: mov x0, #0x0 50: ldp x19, x20, [sp], #16 \--------------------------------------- 54: ldp x29, x30, [sp], #16 54: mov x7, #0x0 58: add x0, x7, #0x0 58: mov sp, sp 5c: autiasp 5c: ldp x27, x28, [sp], #16 60: ret 60: ldp x25, x26, [sp], #16 64: nop 64: ldp x21, x22, [sp], #16 68: ldr x10, 0x0000000000000070 68: ldp x19, x20, [sp], #16 6c: br x10 6c: ldp x29, x30, [sp], #16 70: add x0, x7, #0x0 74: autiasp 78: ret 7c: nop 80: ldr x10, 0x0000000000000088 84: br x10 There are 6 extra instructions added in ARM64 in the best case. This will become 7 in the worst case (off != 0). RISC-V JIT (RISCV_ISA_C Disabled) ========== No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: nop 0: nop 4: nop 4: nop 8: li a6, 33 8: li a6, 33 c: addi sp, sp, -16 c: addi sp, sp, -16 10: sd s0, 8(sp) 10: sd s0, 8(sp) 14: addi s0, sp, 16 14: addi s0, sp, 16 18: ld a0, 0(a0) 18: ld a0, 0(a0) --------------------------------------------------------------- 1c: ld a0, 0(a0) --\ 1c: mv t0, a0 --------------------------\ \ 20: srli t0, t0, 32 20: li a5, 0 \ \ 24: lui t1, 4096 24: ld s0, 8(sp) \ \ 28: sext.w t1, t1 28: addi sp, sp, 16 \ \ 2c: bgeu t1, t0, 12 2c: sext.w a0, a5 \ \--> 30: ld a0, 0(a0) 30: ret \ 34: j 8 \ 38: li a0, 0 \------------------------------ 3c: li a5, 0 40: ld s0, 8(sp) 44: addi sp, sp, 16 48: sext.w a0, a5 4c: ret There are 7 extra instructions added in RISC-V. Fixes: 800834285361 ("bpf, arm64: Add BPF exception tables") Reported-by: Breno Leitao <leitao@debian.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240424100210.11982-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-26kdb: Simplify management of tmpbuffer in kdb_read()Daniel Thompson1-23/+18
The current approach to filling tmpbuffer with completion candidates is confusing, with the buffer management being especially hard to reason about. That's because it doesn't copy the completion canidate into tmpbuffer, instead of copies a whole bunch of other nonsense and then runs the completion search from the middle of tmpbuffer! Change this to copy nothing but the completion candidate into tmpbuffer. Pretty much everything else in this patch is renaming to reflect the above change: s/p_tmp/tmpbuffer/ s/buf_size/sizeof(tmpbuffer)/ Reviewed-by: Douglas Anderson <dianders@chromium.org> Tested-by: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-7-f236dbe9828d@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2024-04-26kdb: Replace double memcpy() with memmove() in kdb_read()Daniel Thompson1-12/+7
At several points in kdb_read() there are variants of the following code pattern (with offsets slightly altered): memcpy(tmpbuffer, cp, lastchar - cp); memcpy(cp-1, tmpbuffer, lastchar - cp); *(--lastchar) = '\0'; There is no need to use tmpbuffer here, since we can use memmove() instead so refactor in the obvious way. Additionally the strings that are being copied are already properly terminated so let's also change the code so that the library calls also move the terminator. Changing how the terminators are managed has no functional effect for now but might allow us to retire lastchar at a later point. lastchar, although stored as a pointer, is functionally equivalent to caching strlen(buffer). Reviewed-by: Douglas Anderson <dianders@chromium.org> Tested-by: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-6-f236dbe9828d@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2024-04-26kdb: Use format-specifiers rather than memset() for padding in kdb_read()Daniel Thompson1-5/+3
Currently when the current line should be removed from the display kdb_read() uses memset() to fill a temporary buffer with spaces. The problem is not that this could be trivially implemented using a format string rather than open coding it. The real problem is that it is possible, on systems with a long kdb_prompt_str, to write past the end of the tmpbuffer. Happily, as mentioned above, this can be trivially implemented using a format string. Make it so! Cc: stable@vger.kernel.org Reviewed-by: Douglas Anderson <dianders@chromium.org> Tested-by: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-5-f236dbe9828d@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2024-04-26kdb: Merge identical case statements in kdb_read()Daniel Thompson1-9/+1
The code that handles case 14 (down) and case 16 (up) has been copy and pasted despite being byte-for-byte identical. Combine them. Cc: stable@vger.kernel.org # Not a bug fix but it is needed for later bug fixes Reviewed-by: Douglas Anderson <dianders@chromium.org> Tested-by: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-4-f236dbe9828d@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2024-04-26kdb: Fix console handling when editing and tab-completing commandsDaniel Thompson1-0/+5
Currently, if the cursor position is not at the end of the command buffer and the user uses the Tab-complete functions, then the console does not leave the cursor in the correct position. For example consider the following buffer with the cursor positioned at the ^: md kdb_pro 10 ^ Pressing tab should result in: md kdb_prompt_str 10 ^ However this does not happen. Instead the cursor is placed at the end (after then 10) and further cursor movement redraws incorrectly. The same problem exists when we double-Tab but in a different part of the code. Fix this by sending a carriage return and then redisplaying the text to the left of the cursor. Cc: stable@vger.kernel.org Reviewed-by: Douglas Anderson <dianders@chromium.org> Tested-by: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-3-f236dbe9828d@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2024-04-26kdb: Use format-strings rather than '\0' injection in kdb_read()Daniel Thompson1-21/+34
Currently when kdb_read() needs to reposition the cursor it uses copy and paste code that works by injecting an '\0' at the cursor position before delivering a carriage-return and reprinting the line (which stops at the '\0'). Tidy up the code by hoisting the copy and paste code into an appropriately named function. Additionally let's replace the '\0' injection with a proper field width parameter so that the string will be abridged during formatting instead. Cc: stable@vger.kernel.org # Not a bug fix but it is needed for later bug fixes Tested-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-2-f236dbe9828d@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2024-04-26kdb: Fix buffer overflow during tab-completeDaniel Thompson1-8/+13
Currently, when the user attempts symbol completion with the Tab key, kdb will use strncpy() to insert the completed symbol into the command buffer. Unfortunately it passes the size of the source buffer rather than the destination to strncpy() with predictably horrible results. Most obviously if the command buffer is already full but cp, the cursor position, is in the middle of the buffer, then we will write past the end of the supplied buffer. Fix this by replacing the dubious strncpy() calls with memmove()/memcpy() calls plus explicit boundary checks to make sure we have enough space before we start moving characters around. Reported-by: Justin Stitt <justinstitt@google.com> Closes: https://lore.kernel.org/all/CAFhGd8qESuuifuHsNjFPR-Va3P80bxrw+LqvC8deA8GziUJLpw@mail.gmail.com/ Cc: stable@vger.kernel.org Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Justin Stitt <justinstitt@google.com> Tested-by: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-1-f236dbe9828d@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2024-04-25blktrace: convert strncpy() to strscpy_pad()Arnd Bergmann1-2/+1
gcc-9 warns about a possibly non-terminated string copy: kernel/trace/blktrace.c: In function 'do_blk_trace_setup': kernel/trace/blktrace.c:527:2: error: 'strncpy' specified bound 32 equals destination size [-Werror=stringop-truncation] Newer versions are fine here because they see the following explicit nul-termination. Using strscpy_pad() avoids the warning and simplifies the code a little. The padding helps give a clean buffer to userspace. Link: https://lkml.kernel.org/r/20240409140059.3806717-5-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Justin Stitt <justinstitt@google.com> Cc: Alexey Starikovskiy <astarikovskiy@suse.de> Cc: Bob Moore <robert.moore@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Len Brown <lenb@kernel.org> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Richard Russon (FlatCap)" <ldm@flatcap.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25kcov: avoid clang out-of-range warningArnd Bergmann1-1/+2
The area_size is never larger than the maximum on 64-bit architectutes: kernel/kcov.c:634:29: error: result of comparison of constant 1152921504606846975 with expression of type '__u32' (aka 'unsigned int') is always false [-Werror,-Wtautological-constant-out-of-range-compare] if (remote_arg->area_size > LONG_MAX / sizeof(unsigned long)) ~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The compiler can correctly optimize the check away and the code appears correct to me, so just add a cast to avoid the warning. Link: https://lkml.kernel.org/r/20240328143051.1069575-5-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Justin Stitt <justinstitt@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Bill Wendling <morbo@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25regset: use kvzalloc() for regset_get_alloc()Douglas Anderson1-3/+3
While browsing through ChromeOS crash reports, I found one with an allocation failure that looked like this: chrome: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=urgent,mems_allowed=0 CPU: 7 PID: 3295 Comm: chrome Not tainted 5.15.133-20574-g8044615ac35c #1 (HASH:1162 1) Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT) Call trace: ... warn_alloc+0x104/0x174 __alloc_pages+0x5f0/0x6e4 kmalloc_order+0x44/0x98 kmalloc_order_trace+0x34/0x124 __kmalloc+0x228/0x36c __regset_get+0x68/0xcc regset_get_alloc+0x1c/0x28 elf_core_dump+0x3d8/0xd8c do_coredump+0xeb8/0x1378 get_signal+0x14c/0x804 ... An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not a surprise that this allocation failed on a system that's been running for a while. More digging showed that it was fairly easy to see the order 7 allocation by just sending a SIGQUIT to chrome (or other processes) to generate a core dump. The actual amount being allocated was 279,584 bytes and it was for "core_note_type" NT_ARM_SVE. There was quite a bit of discussion [1] on the mailing lists in response to my v1 patch attempting to switch to vmalloc. The overall conclusion was that we could likely reduce the 279,584 byte allocation by quite a bit and Mark Brown has sent a patch to that effect [2]. However even with the 279,584 byte allocation gone there are still 65,552 byte allocations. These are just barely more than the 65,536 bytes and thus would require an order 5 allocation. An order 5 allocation is still something to avoid unless necessary and nothing needs the memory here to be contiguous. Change the allocation to kvzalloc() which should still be efficient for small allocations but doesn't force the memory subsystem to work hard (and maybe fail) at getting a large contiguous chunk. [1] https://lore.kernel.org/r/20240201171159.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid [2] https://lore.kernel.org/r/20240203-arm64-sve-ptrace-regset-size-v1-1-2c3ba1386b9e@kernel.org Link: https://lkml.kernel.org/r/20240205092626.v2.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Dave Martin <Dave.Martin@arm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FASTDavid Hildenbrand1-2/+2
Nowadays, we call it "GUP-fast", the external interface includes functions like "get_user_pages_fast()", and we renamed all internal functions to reflect that as well. Let's make the config option reflect that. Link: https://lkml.kernel.org/r/20240402125516.223131-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: switch mm->get_unmapped_area() to a flagRick Edgecombe2-2/+2
The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm. Since the function pointer only ever points to two functions that are named the same across all arch's, a function pointer is not really required. In addition future changes will want to add versions of the functions that take additional arguments. So to save a pointers worth of bytes in mm_struct, and prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag. Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness. Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into a direct ones. The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config. In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct. Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: rename mm_put_huge_zero_page to mm_put_huge_zero_folioMatthew Wilcox (Oracle)1-1/+1
Also remove mm_get_huge_zero_page() now it has no users. Link: https://lkml.kernel.org/r/20240326202833.523759-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: free up PG_slabMatthew Wilcox (Oracle)1-1/+2
Reclaim the Slab page flag by using a spare bit in PageType. We are perennially short of page flags for various purposes, and now that the original SLAB allocator has been retired, SLUB does not use the mapcount/page_type field. This lets us remove a number of special cases for ignoring mapcount on Slab pages. [willy@infradead.org: update vmcoreinfo] Link: https://lkml.kernel.org/r/ZgGV-O8WYQ_83kxp@casper.infradead.org Link: https://lkml.kernel.org/r/20240321142448.1645400-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: vmalloc: enable memory allocation profilingKent Overstreet1-1/+1
This wrapps all external vmalloc allocation functions with the alloc_hooks() wrapper, and switches internal allocations to _noprof variants where appropriate, for the new memory allocation profiling feature. [surenb@google.com: arch/um: fix forward declaration for vmalloc] Link: https://lkml.kernel.org/r/20240326073750.726636-1-surenb@google.com [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-5-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-31-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25change alloc_pages name in dma_map_ops to avoid name conflictsSuren Baghdasaryan1-2/+2
After redefining alloc_pages, all uses of that name are being replaced. Change the conflicting names to prevent preprocessor from replacing them when it's not intended. Link: https://lkml.kernel.org/r/20240321163705.3067592-18-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25lib: prevent module unloading if memory is not freedSuren Baghdasaryan1-8/+19
Skip freeing module's data section if there are non-zero allocation tags because otherwise, once these allocations are freed, the access to their code tag would cause UAF. Link: https://lkml.kernel.org/r/20240321163705.3067592-13-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25lib: code tagging module supportSuren Baghdasaryan1-0/+4
Add support for code tagging from dynamically loaded modules. Link: https://lkml.kernel.org/r/20240321163705.3067592-12-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm: memcg: add NULL check to obj_cgroup_put()Yosry Ahmed1-4/+2
9 out of 16 callers perform a NULL check before calling obj_cgroup_put(). Move the NULL check in the function, similar to mem_cgroup_put(). The unlikely() NULL check in current_objcg_update() was left alone to avoid dropping the unlikey() annotation as this a fast path. Link: https://lkml.kernel.org/r/20240316015803.2777252-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-27/+37
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_prueth.c net/mac80211/chan.c 89884459a0b9 ("wifi: mac80211: fix idle calculation with multi-link") 87f5500285fb ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()") https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/ net/unix/garbage.c 1971d13ffa84 ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") 4090fa373f0e ("af_unix: Replace garbage collection algorithm.") drivers/net/ethernet/ti/icssg/icssg_prueth.c drivers/net/ethernet/ti/icssg/icssg_common.c 4dcd0e83ea1d ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()") e2dc7bfd677f ("net: ti: icssg-prueth: Move common functions into a separate file") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25cgroup/cpuset: Remove outdated comment in sched_partition_write()Xiu Jianfeng1-3/+0
The comment here is outdated and can cause confusion, from the code perspective, there’s also no need for new comment, so just remove it. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-25cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=nSean Christopherson1-4/+10
Explicitly disallow enabling mitigations at runtime for kernels that were built with CONFIG_CPU_MITIGATIONS=n, as some architectures may omit code entirely if mitigations are disabled at compile time. E.g. on x86, a large pile of Kconfigs are buried behind CPU_MITIGATIONS, and trying to provide sane behavior for retroactively enabling mitigations is extremely difficult, bordering on impossible. E.g. page table isolation and call depth tracking require build-time support, BHI mitigations will still be off without additional kernel parameters, etc. [ bp: Touchups. ] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240420000556.2645001-3-seanjc@google.com
2024-04-25cpu: Re-enable CPU mitigations by default for !X86 architecturesSean Christopherson1-2/+2
Rename x86's to CPU_MITIGATIONS, define it in generic code, and force it on for all architectures exception x86. A recent commit to turn mitigations off by default if SPECULATION_MITIGATIONS=n kinda sorta missed that "cpu_mitigations" is completely generic, whereas SPECULATION_MITIGATIONS is x86-specific. Rename x86's SPECULATIVE_MITIGATIONS instead of keeping both and have it select CPU_MITIGATIONS, as having two configs for the same thing is unnecessary and confusing. This will also allow x86 to use the knob to manage mitigations that aren't strictly related to speculative execution. Use another Kconfig to communicate to common code that CPU_MITIGATIONS is already defined instead of having x86's menu depend on the common CPU_MITIGATIONS. This allows keeping a single point of contact for all of x86's mitigations, and it's not clear that other architectures *want* to allow disabling mitigations at compile-time. Fixes: f337a6a21e2f ("x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n") Closes: https://lkml.kernel.org/r/20240413115324.53303a68%40canb.auug.org.au Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240420000556.2645001-2-seanjc@google.com
2024-04-24mm: turn folio_test_hugetlb into a PageTypeMatthew Wilcox (Oracle)1-3/+2
The current folio_test_hugetlb() can be fooled by a concurrent folio split into returning true for a folio which has never belonged to hugetlbfs. This can't happen if the caller holds a refcount on it, but we have a few places (memory-failure, compaction, procfs) which do not and should not take a speculative reference. Since hugetlb pages do not use individual page mapcounts (they are always fully mapped and use the entire_mapcount field to record the number of mappings), the PageType field is available now that page_mapcount() ignores the value in this field. In compaction and with CONFIG_DEBUG_VM enabled, the current implementation can result in an oops, as reported by Luis. This happens since 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR") effectively added some VM_BUG_ON() checks in the PageHuge() testing path. [willy@infradead.org: update vmcoreinfo] Link: https://lkml.kernel.org/r/ZgGZUvsdhaT1Va-T@casper.infradead.org Link: https://lkml.kernel.org/r/20240321142448.1645400-6-willy@infradead.org Fixes: 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Luis Chamberlain <mcgrof@kernel.org> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218227 Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-24bpf: make common crypto API for TC/XDP programsVadim Fedorenko4-1/+390
Add crypto API support to BPF to be able to decrypt or encrypt packets in TC/XDP BPF programs. Special care should be taken for initialization part of crypto algo because crypto alloc) doesn't work with preemtion disabled, it can be run only in sleepable BPF program. Also async crypto is not supported because of the very same issue - TC/XDP BPF programs are not sleepable. Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Link: https://lore.kernel.org/r/20240422225024.2847039-2-vadfed@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-04-24genirq: Reuse irq_is_nmi()Jinjie Ruan4-14/+14
Move irq_is_nmi() to the internal header file and reuse it all over the place. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423024037.3331215-1-ruanjinjie@huawei.com
2024-04-24genirq/cpuhotplug: Retry with cpu_online_mask when migration failsDongli Zhang1-0/+16
When a CPU goes offline, the interrupts affine to that CPU are re-configured. Managed interrupts undergo either migration to other CPUs or shutdown if all CPUs listed in the affinity are offline. The migration of managed interrupts is guaranteed on x86 because there are interrupt vectors reserved. Regular interrupts are migrated to a still online CPU in the affinity mask or if there is no online CPU to any online CPU. This works as long as the still online CPUs in the affinity mask have interrupt vectors available, but in case that none of those CPUs has a vector available the migration fails and the device interrupt becomes stale. This is not any different from the case where the affinity mask does not contain any online CPU, but there is no fallback operation for this. Instead of giving up, retry the migration attempt with the online CPU mask if the interrupt is not managed, as managed interrupts cannot be affected by this problem. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423073413.79625-1-dongli.zhang@oracle.com
2024-04-24genirq/cpuhotplug: Skip suspended interrupts when restoring affinityDavid Stevens2-7/+16
irq_restore_affinity_of_irq() restarts managed interrupts unconditionally when the first CPU in the affinity mask comes online. That's correct during normal hotplug operations, but not when resuming from S3 because the drivers are not resumed yet and interrupt delivery is not expected by them. Skip the startup of suspended interrupts and let resume_device_irqs() deal with restoring them. This ensures that irqs are not delivered to drivers during the noirq phase of resuming from S3, after non-boot CPUs are brought back online. Signed-off-by: David Stevens <stevensd@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240424090341.72236-1-stevensd@chromium.org
2024-04-24workqueue: Fix divide error in wq_update_node_max_active()Lai Jiangshan1-0/+9
Yue Sun and xingwei lee reported a divide error bug in wq_update_node_max_active(): divide error: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.9.0-rc5 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:wq_update_node_max_active+0x369/0x6b0 kernel/workqueue.c:1605 Code: 24 bf 00 00 00 80 44 89 fe e8 83 27 33 00 41 83 fc ff 75 0d 41 81 ff 00 00 00 80 0f 84 68 01 00 00 e8 fb 22 33 00 44 89 f8 99 <41> f7 fc 89 c5 89 c7 44 89 ee e8 a8 24 33 00 89 ef 8b 5c 24 04 89 RSP: 0018:ffffc9000018fbb0 EFLAGS: 00010293 RAX: 00000000000000ff RBX: 0000000000000001 RCX: ffff888100ada500 RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000080000000 RBP: 0000000000000001 R08: ffffffff815b1fcd R09: 1ffff1100364ad72 R10: dffffc0000000000 R11: ffffed100364ad73 R12: 0000000000000000 R13: 0000000000000100 R14: 0000000000000000 R15: 00000000000000ff FS: 0000000000000000(0000) GS:ffff888135c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb8c06ca6f8 CR3: 000000010d6c6000 CR4: 0000000000750ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> workqueue_offline_cpu+0x56f/0x600 kernel/workqueue.c:6525 cpuhp_invoke_callback+0x4e1/0x870 kernel/cpu.c:194 cpuhp_thread_fun+0x411/0x7d0 kernel/cpu.c:1092 smpboot_thread_fn+0x544/0xa10 kernel/smpboot.c:164 kthread+0x2ed/0x390 kernel/kthread.c:388 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- After analysis, it happens when all of the CPUs in a workqueue's affinity get offine. The problem can be easily reproduced by: # echo 8 > /sys/devices/virtual/workqueue/<any-wq-name>/cpumask # echo 0 > /sys/devices/system/cpu/cpu3/online Use the default max_actives for nodes when all of the CPUs in the workqueue's affinity get offline to fix the problem. Reported-by: Yue Sun <samsun1006219@gmail.com> Reported-by: xingwei lee <xrivendell7@gmail.com> Link: https://lore.kernel.org/lkml/CAEkJfYPGS1_4JqvpSo0=FM0S1ytB8CEbyreLTtWpR900dUZymw@mail.gmail.com/ Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues") Cc: stable@vger.kernel.org Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-04-24bpf: Introduce bpf_preempt_[disable,enable] kfuncsKumar Kartikeya Dwivedi2-2/+81
Introduce two new BPF kfuncs, bpf_preempt_disable and bpf_preempt_enable. These kfuncs allow disabling preemption in BPF programs. Nesting is allowed, since the intended use cases includes building native BPF spin locks without kernel helper involvement. Apart from that, this can be used to per-CPU data structures for cases where programs (or userspace) may preempt one or the other. Currently, while per-CPU access is stable, whether it will be consistent is not guaranteed, as only migration is disabled for BPF programs. Global functions are disallowed from being called, but support for them will be added as a follow up not just preempt kfuncs, but rcu_read_lock kfuncs as well. Static subprog calls are permitted. Sleepable helpers and kfuncs are disallowed in non-preemptible regions. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240424031315.2757363-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-24bpf: Don't check for recursion in bpf_wq_work.Alexei Starovoitov1-13/+5
__bpf_prog_enter_sleepable_recur does recursion check which is not applicable to wq callback. The callback function is part of bpf program and bpf prog might be running on the same cpu. So recursion check would incorrectly prevent callback from running. The code can call __bpf_prog_enter_sleepable(), but run_ctx would be fake, hence use explicit rcu_read_lock_trace(); migrate_disable(); to address this problem. Another reason to open code is __bpf_prog_enter* are not available in !JIT configs. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404241719.IIGdpAku-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202404241811.FFV4Bku3-lkp@intel.com/ Fixes: eb48f6cd41a0 ("bpf: wq: add bpf_wq_init") Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-24sched/pelt: Remove shift of thermal clockVincent Guittot3-27/+3
The optional shift of the clock used by thermal/hw load avg has been introduced to handle case where the signal was not always a high frequency hw signal. Now that cpufreq provides a signal for firmware and SW pressure, we can remove this exception and always keep this PELT signal aligned with other signals. Mark sysctl_sched_migration_cost boot parameter as deprecated Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-6-vincent.guittot@linaro.org
2024-04-24sched/cpufreq: Rename arch_update_thermal_pressure() => ↵Vincent Guittot5-34/+34
arch_update_hw_pressure() Now that cpufreq provides a pressure value to the scheduler, rename arch_update_thermal_pressure into HW pressure to reflect that it returns a pressure applied by HW (i.e. with a high frequency change) and not always related to thermal mitigation but also generated by max current limitation as an example. Such high frequency signal needs filtering to be smoothed and provide an value that reflects the average available capacity into the scheduler time scale. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-5-vincent.guittot@linaro.org
2024-04-24sched/cpufreq: Take cpufreq feedback into accountVincent Guittot1-20/+25
Aggregate the different pressures applied on the capacity of CPUs and create a new function that returns the actual capacity of the CPU: get_actual_cpu_capacity(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/r/20240326091616.3696851-3-vincent.guittot@linaro.org
2024-04-24sched/fair: Fix update of rd->sg_overutilizedVincent Guittot1-1/+1
sg_overloaded is used instead of sg_overutilized to update rd->sg_overutilized. Fixes: 4475cd8bfd9b ("sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240404155738.2866102-1-vincent.guittot@linaro.org
2024-04-24sysctl: treewide: constify argument ctl_table_root::permissions(table)Thomas Weißschuh1-1/+1
The permissions callback should not modify the ctl_table. Enforce this expectation via the typesystem. This is a step to put "struct ctl_table" into .rodata throughout the kernel. The patch was created with the following coccinelle script: @@ identifier func, head, ctl; @@ int func( struct ctl_table_header *head, - struct ctl_table *ctl) + const struct ctl_table *ctl) { ... } (insert_entry() from fs/proc/proc_sysctl.c is a false-positive) No additional occurrences of '.permissions =' were found after a tree-wide search for places missed by the conccinelle script. Reviewed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-04-24bpf: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados1-1/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from bpf_syscall_table. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Joel Granados <j.granados@samsung.com>