aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2023-08-17Merge branch 'netconsole-enable-compile-time-configuration'HEADmainJakub Kicinski2-20/+50
Breno Leitao says: ==================== netconsole: Enable compile time configuration Enable netconsole features to be set at compilation time. Create two Kconfig options that allow users to set extended logs and release prepending features at compilation time. The first patch de-duplicates the initialization code, and the second patch adds the support in the de-duplicated code, avoiding touching two different functions with the same change. ==================== Link: https://lore.kernel.org/r/20230811093158.1678322-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17netconsole: Enable compile time configurationBreno Leitao2-0/+27
Enable netconsole features to be set at compilation time. Create two Kconfig options that allow users to set extended logs and release prepending features at compilation time. Right now, the user needs to pass command line parameters to netconsole, such as "+"/"r" to enable extended logs and version prepending features. With these two options, the user could set the default values for the features at compile time, and don't need to pass it in the command line to get them enabled, simplifying the command line. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230811093158.1678322-3-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17netconsole: Create a allocation helperBreno Leitao1-20/+23
De-duplicate the initialization and allocation code for struct netconsole_target. The same allocation and initialization code is duplicated in two different places in the netconsole subsystem, when the netconsole target is initialized by command line parameters (alloc_param_target()), and dynamically by sysfs (make_netconsole_target()). Create a helper function, and call it from the two different functions. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230811093158.1678322-2-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17net: mdio: fix -Wvoid-pointer-to-enum-cast warningJustin Stitt1-1/+1
When building with clang 18 I see the following warning: | drivers/net/mdio/mdio-xgene.c:338:13: warning: cast to smaller integer | type 'enum xgene_mdio_id' from 'const void *' [-Wvoid-pointer-to-enum-cast] | 338 | mdio_id = (enum xgene_mdio_id)of_id->data; This is due to the fact that `of_id->data` is a void* while `enum xgene_mdio_id` has the size of an int. This leads to truncation and possible data loss. Link: https://github.com/ClangBuiltLinux/linux/issues/1910 Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20230815-void-drivers-net-mdio-mdio-xgene-v1-1-5304342e0659@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17Merge branch 'netem-use-a-seeded-prng-for-loss-and-corruption-events'Jakub Kicinski2-15/+35
François Michel says: ==================== netem: use a seeded PRNG for loss and corruption events In order to reproduce bugs or performance evaluation of network protocols and applications, it is useful to have reproducible test suites and tools. This patch adds a way to specify a PRNG seed through the TCA_NETEM_PRNG_SEED attribute for generating netem loss and corruption events. Initializing the qdisc with the same seed leads to the exact same loss and corruption patterns. If no seed is explicitly specified, the qdisc generates a random seed using get_random_u64(). This patch can be and has been tested using tc from the following iproute2-next fork: https://github.com/francoismichel/iproute2-next For instance, setting the seed 42424242 on the loopback with a loss rate of 10% will systematically drop the 5th, 12th and 24th packet when sending 25 packets. ==================== Link: https://lore.kernel.org/r/20230815092348.1449179-1-francois.michel@uclouvain.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17netem: use seeded PRNG for correlated loss eventsFrançois Michel1-10/+12
Use prandom_u32_state() instead of get_random_u32() to generate the correlated loss events of netem. Signed-off-by: François Michel <francois.michel@uclouvain.be> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20230815092348.1449179-4-francois.michel@uclouvain.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17netem: use a seeded PRNG for generating random lossesFrançois Michel1-5/+6
Use prandom_u32_state() instead of get_random_u32() to generate the random loss events of netem. The state of the prng is part of the prng attribute of struct netem_sched_data. Signed-off-by: François Michel <francois.michel@uclouvain.be> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20230815092348.1449179-3-francois.michel@uclouvain.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17netem: add prng attribute to netem_sched_dataFrançois Michel2-0/+17
Add prng attribute to struct netem_sched_data and allows setting the seed of the PRNG through netlink using the new TCA_NETEM_PRNG_SEED attribute. The PRNG attribute is not actually used yet. Signed-off-by: François Michel <francois.michel@uclouvain.be> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20230815092348.1449179-2-francois.michel@uclouvain.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17net: ena: Use pci_dev_id() to simplify the codeJialin Zhang1-1/+1
PCI core API pci_dev_id() can be used to get the BDF number for a pci device. We don't need to compose it manually. Use pci_dev_id() to simplify the code a little bit. Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Shay Agroskin <shayagr@amazon.com> Link: https://lore.kernel.org/r/20230815024248.3519068-1-zhangjialin11@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17tun: add __exit annotations to module exit func tun_cleanup()Ziyang Xuan1-1/+1
Add missing __exit annotations to module exit func tun_cleanup(). Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230814083000.3893589-1-william.xuanziyang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16Merge tag 'for-netdev' of ↵Jakub Kicinski20-37/+1179
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-08-16 We've added 17 non-merge commits during the last 6 day(s) which contain a total of 20 files changed, 1179 insertions(+), 37 deletions(-). The main changes are: 1) Add a BPF hook in sys_socket() to change the protocol ID from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy applications, from Geliang Tang. 2) Follow-up/fallout fix from the SO_REUSEPORT + bpf_sk_assign work to fix a splat on non-fullsock sks in inet[6]_steal_sock, from Lorenz Bauer. 3) Improvements to struct_ops links to avoid forcing presence of update/validate callbacks. Also add bpf_struct_ops fields documentation, from David Vernet. 4) Ensure libbpf sets close-on-exec flag on gzopen, from Marco Vedovati. 5) Several new tcx selftest additions and bpftool link show support for tcx and xdp links, from Daniel Borkmann. 6) Fix a smatch warning on uninitialized symbol in bpf_perf_link_fill_kprobe, from Yafang Shao. 7) BPF selftest fixes e.g. misplaced break in kfunc_call test, from Yipeng Zou. 8) Small cleanup to remove unused declaration bpf_link_new_file, from Yue Haibing. 9) Small typo fix to bpftool's perf help message, from Daniel T. Lee. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Add mptcpify test selftests/bpf: Fix error checks of mptcp open_and_load selftests/bpf: Add two mptcp netns helpers bpf: Add update_socket_protocol hook bpftool: Implement link show support for xdp bpftool: Implement link show support for tcx selftests/bpf: Add selftest for fill_link_info bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe() net: Fix slab-out-of-bounds in inet[6]_steal_sock bpf: Document struct bpf_struct_ops fields bpf: Support default .validate() and .update() behavior for struct_ops links selftests/bpf: Add various more tcx test cases selftests/bpf: Clean up fmod_ret in bench_rename test script selftests/bpf: Fix repeat option when kfunc_call verification fails libbpf: Set close-on-exec flag on gzopen bpftool: fix perf help message bpf: Remove unused declaration bpf_link_new_file() ==================== Link: https://lore.kernel.org/r/20230816212840.1539-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16Revert "net: ethernet: ti: am65-cpsw: add mqprio qdisc offload in channel mode"Jakub Kicinski3-311/+1
This reverts commit 90bc21aaef4adaefceda2d385756138fc247c0c2. Patch was merged too hastily, Vladimir requested changes in: https://lore.kernel.org/all/20230816121305.5dio5tk3chge2ndh@skbuf/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16Merge branch 'bpf: Force to MPTCP'Martin KaFai Lau4-20/+221
Geliang Tang says: ==================== As is described in the "How to use MPTCP?" section in MPTCP wiki [1]: "Your app should create sockets with IPPROTO_MPTCP as the proto: ( socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP); ). Legacy apps can be forced to create and use MPTCP sockets instead of TCP ones via the mptcpize command bundled with the mptcpd daemon." But the mptcpize (LD_PRELOAD technique) command has some limitations [2]: - it doesn't work if the application is not using libc (e.g. GoLang apps) - in some envs, it might not be easy to set env vars / change the way apps are launched, e.g. on Android - mptcpize needs to be launched with all apps that want MPTCP: we could have more control from BPF to enable MPTCP only for some apps or all the ones of a netns or a cgroup, etc. - it is not in BPF, we cannot talk about it at netdev conf. So this patchset attempts to use BPF to implement functions similer to mptcpize. The main idea is to add a hook in sys_socket() to change the protocol id from IPPROTO_TCP (or 0) to IPPROTO_MPTCP. [1] https://github.com/multipath-tcp/mptcp_net-next/wiki [2] https://github.com/multipath-tcp/mptcp_net-next/issues/79 v14: - Use getsockopt(MPTCP_INFO) to verify mptcp protocol intead of using nstat command. v13: - drop "Use random netns name for mptcp" patch. v12: - update diag_* log of update_socket_protocol. - add 'ip netns show' after 'ip netns del' to check if there is a test did not clean up its netns. - return libbpf_get_error() instead of -EIO for the error from open_and_load(). - Use getsockopt(SOL_PROTOCOL) to verify mptcp protocol intead of using 'ss -tOni'. v11: - add comments about outputs of 'ss' and 'nstat'. - use "err = verify_mptcpify()" instead of using =+. v10: - drop "#ifdef CONFIG_BPF_JIT". - include vmlinux.h and bpf_tracing_net.h to avoid defining some macros. - drop unneeded checks for mptcp. v9: - update comment for 'update_socket_protocol'. v8: - drop the additional checks on the 'protocol' value after the 'update_socket_protocol()' call. v7: - add __weak and __diag_* for update_socket_protocol. v6: - add update_socket_protocol. v5: - add bpf_mptcpify helper. v4: - use lsm_cgroup/socket_create v3: - patch 8: char cmd[128]; -> char cmd[256]; v2: - Fix build selftests errors reported by CI Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/79 ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16selftests/bpf: Add mptcpify testGeliang Tang2-0/+161
Implement a new test program mptcpify: if the family is AF_INET or AF_INET6, the type is SOCK_STREAM, and the protocol ID is 0 or IPPROTO_TCP, set it to IPPROTO_MPTCP. It will be hooked in update_socket_protocol(). Extend the MPTCP test base, add a selftest test_mptcpify() for the mptcpify case. Open and load the mptcpify test prog to mptcpify the TCP sockets dynamically, then use start_server() and connect_to_fd() to create a TCP socket, but actually what's created is an MPTCP socket, which can be verified through 'getsockopt(SOL_PROTOCOL)' and 'getsockopt(MPTCP_INFO)'. Acked-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/364e72f307e7bb38382ec7442c182d76298a9c41.1692147782.git.geliang.tang@suse.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16selftests/bpf: Fix error checks of mptcp open_and_loadGeliang Tang1-11/+1
Return libbpf_get_error(), instead of -EIO, for the error from mptcp_sock__open_and_load(). Load success means prog_fd and map_fd are always valid. So drop these unneeded ASSERT_GE checks for them in mptcp run_test(). Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/db5fcb93293df9ab173edcbaf8252465b80da6f2.1692147782.git.geliang.tang@suse.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16selftests/bpf: Add two mptcp netns helpersGeliang Tang1-10/+21
Add two netns helpers for mptcp tests: create_netns() and cleanup_netns(). Use them in test_base(). These new helpers will be re-used in the following commits introducing new tests. Acked-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/7506371fb6c417b401cc9d7365fe455754f4ba3f.1692147782.git.geliang.tang@suse.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16bpf: Add update_socket_protocol hookGeliang Tang2-1/+40
Add a hook named update_socket_protocol in __sys_socket(), for bpf progs to attach to and update socket protocol. One user case is to force legacy TCP apps to create and use MPTCP sockets instead of TCP ones. Define a fmod_ret set named bpf_mptcp_fmodret_ids, add the hook update_socket_protocol into this set, and register it in bpf_mptcp_kfunc_init(). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/79 Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/ac84be00f97072a46f8a72b4e2be46cbb7fa5053.1692147782.git.geliang.tang@suse.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16bpftool: Implement link show support for xdpDaniel Borkmann1-0/+7
Add support to dump XDP link information to bpftool. This reuses the recently added show_link_ifindex_{plain,json}(). The XDP link info only exposes the ifindex. Below shows an example link dump output, and a cgroup link is included for comparison, too: # bpftool link [...] 10: cgroup prog 2466 cgroup_id 1 attach_type cgroup_inet6_post_bind [...] 16: xdp prog 2477 ifindex enp5s0(3) [...] Equivalent json output: # bpftool link --json [...] { "id": 10, "type": "cgroup", "prog_id": 2466, "cgroup_id": 1, "attach_type": "cgroup_inet6_post_bind" }, [...] { "id": 16, "type": "xdp", "prog_id": 2477, "devname": "enp5s0", "ifindex": 3 } [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/r/20230816095651.10014-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16bpftool: Implement link show support for tcxDaniel Borkmann1-0/+37
Add support to dump tcx link information to bpftool. This adds a common helper show_link_ifindex_{plain,json}() which can be reused also for other link types. The plain text and json device output is the same format as in bpftool net dump. Below shows an example link dump output along with a cgroup link for comparison: # bpftool link [...] 10: cgroup prog 1977 cgroup_id 1 attach_type cgroup_inet6_post_bind [...] 13: tcx prog 2053 ifindex enp5s0(3) attach_type tcx_ingress 14: tcx prog 2080 ifindex enp5s0(3) attach_type tcx_egress [...] Equivalent json output: # bpftool link --json [...] { "id": 10, "type": "cgroup", "prog_id": 1977, "cgroup_id": 1, "attach_type": "cgroup_inet6_post_bind" }, [...] { "id": 13, "type": "tcx", "prog_id": 2053, "devname": "enp5s0", "ifindex": 3, "attach_type": "tcx_ingress" }, { "id": 14, "type": "tcx", "prog_id": 2080, "devname": "enp5s0", "ifindex": 3, "attach_type": "tcx_egress" } [...] Suggested-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20230816095651.10014-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16selftests/bpf: Add selftest for fill_link_infoYafang Shao3-0/+387
Add selftest for the fill_link_info of uprobe, kprobe and tracepoint. The result: $ tools/testing/selftests/bpf/test_progs --name=fill_link_info #79/1 fill_link_info/kprobe_link_info:OK #79/2 fill_link_info/kretprobe_link_info:OK #79/3 fill_link_info/kprobe_invalid_ubuff:OK #79/4 fill_link_info/tracepoint_link_info:OK #79/5 fill_link_info/uprobe_link_info:OK #79/6 fill_link_info/uretprobe_link_info:OK #79/7 fill_link_info/kprobe_multi_link_info:OK #79/8 fill_link_info/kretprobe_multi_link_info:OK #79/9 fill_link_info/kprobe_multi_invalid_ubuff:OK #79 fill_link_info:OK Summary: 1/9 PASSED, 0 SKIPPED, 0 FAILED The test case for kprobe_multi won't be run on aarch64, as it is not supported. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20230813141900.1268-3-laoar.shao@gmail.com
2023-08-16bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe()Yafang Shao1-3/+2
The commit 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event") leads to the following Smatch static checker warning: kernel/bpf/syscall.c:3416 bpf_perf_link_fill_kprobe() error: uninitialized symbol 'type'. That can happens when uname is NULL. So fix it by verifying the uname when we really need to fill it. Fixes: 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Closes: https://lore.kernel.org/bpf/85697a7e-f897-4f74-8b43-82721bebc462@kili.mountain Link: https://lore.kernel.org/bpf/20230813141900.1268-2-laoar.shao@gmail.com
2023-08-16Merge branch 'ipv6-expired-routes'David S. Miller4-25/+172
Kui-Feng Lee says: ==================== Remove expired routes with a separated list of routes. FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes in a separated list of routes having expiration will avoid this potential issue. Background ========== The size of a Linux IPv6 routing table can become a big problem if not managed appropriately. Now, Linux has a garbage collector to remove expired routes periodically. However, this may lead to a situation in which the routing path is blocked for a long period due to an excessive number of routes. For example, years ago, there is a commit c7bb4b89033b ("ipv6: tcp: drop silly ICMPv6 packet too big messages"). The root cause is that malicious ICMPv6 packets were sent back for every small packet sent to them. These packets add routes with an expiration time that prompts the GC to periodically check all routes in the tables, including permanent ones. Why Route Expires ================= Users can add IPv6 routes with an expiration time manually. However, the Neighbor Discovery protocol may also generate routes that can expire. For example, Router Advertisement (RA) messages may create a default route with an expiration time. [RFC 4861] For IPv4, it is not possible to set an expiration time for a route, and there is no RA, so there is no need to worry about such issues. Create Routes with Expires ========================== You can create routes with expires with the command. For example, ip -6 route add 2001:b000:591::3 via fe80::5054:ff:fe12:3457 \ dev enp0s3 expires 30 The route that has been generated will be deleted automatically in 30 seconds. GC of FIB6 ========== The function called fib6_run_gc() is responsible for performing garbage collection (GC) for the Linux IPv6 stack. It checks for the expiration of every route by traversing the trees of routing tables. The time taken to traverse a routing table increases with its size. Holding the routing table lock during traversal is particularly undesirable. Therefore, it is preferable to keep the lock for the shortest possible duration. Solution ======== The cause of the issue is keeping the routing table locked during the traversal of large trees. To solve this problem, we can create a separate list of routes that have expiration. This will prevent GC from checking permanent routes. Result ====== We conducted a test to measure the execution times of fib6_gc_timer_cb() and observed that it enhances the GC of FIB6. During the test, we added permanent routes with the following numbers: 1000, 3000, 6000, and 9000. Additionally, we added a route with an expiration time. Here are the average execution times for the kernel without the patch. - 120020 ns with 1000 permanent routes - 308920 ns with 3000 ... - 581470 ns with 6000 ... - 855310 ns with 9000 ... The kernel with the patch consistently takes around 14000 ns to execute, regardless of the number of permanent routes that are installed. Major changes from v7: - Fix warings raised by the patchwork. Major changes from v6: - Remove unnecessary check of tb6 in fib6_clean_expires_locked(). - Use ib6_clean_expires_locked() instead in fib6_purge_rt(). Major changes from v5: - Change the order of adding new routes to the GC list and starting GC timer. - Remove time measurements from the test case. - Stop forcing GC flush. Major changes from v4: - Detect existence of 'strace' in the test case. Major changes from v3: - Fix the type of arg according to feedback. - Add 1k temporary routes and 5K permanent routes in the test case. Measure time spending on GC with strace. Major changes from v2: - Remove unnecessary and incorrect sysctl restoring in the test case. Major changes from v1: - Moved gc_link to avoid creating a hole in fib6_info. - Moved fib6_set_expires*() and fib6_clean_expires*() to the header file and inlined. And removed duplicated lines. - Added a test case. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16selftests: fib_tests: Add a test case for IPv6 garbage collectionKui-Feng Lee1-3/+69
Add 1000 IPv6 routes with expiration time (w/ and w/o additional 5000 permanet routes in the background.) Wait for a few seconds to make sure they are removed correctly. The expected output of the test looks like the following example. > Fib6 garbage collection test > TEST: ipv6 route garbage collection [ OK ] Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net/ipv6: Remove expired routes with a separated list of routes.Kui-Feng Lee3-22/+103
FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes in a separated list of routes having expiration will avoid this potential issue. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16e1000e: Use PME poll to circumvent unreliable ACPI wakeKai-Heng Feng1-1/+3
On some I219 devices, ethernet cable plugging detection only works once from PCI D3 state. Subsequent cable plugging does set PME bit correctly, but device still doesn't get woken up. Since I219 connects to the root complex directly, it relies on platform firmware (ACPI) to wake it up. In this case, the GPE from _PRW only works for first cable plugging but fails to notify the driver for subsequent plugging events. The issue was originally found on CNP, but the same issue can be found on ADL too. So workaround the issue by continuing use PME poll after first ACPI wake. As PME poll is always used, the runtime suspend restriction for CNP can also be removed. Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net-memcg: Fix scope of sockmem pressure indicatorsAbel Wu2-2/+15
Now there are two indicators of socket memory pressure sit inside struct mem_cgroup, socket_pressure and tcpmem_pressure, indicating memory reclaim pressure in memcg->memory and ->tcpmem respectively. When in legacy mode (cgroupv1), the socket memory is charged into ->tcpmem which is independent of ->memory, so socket_pressure has nothing to do with socket's pressure at all. Things could be worse by taking socket_pressure into consideration in legacy mode, as a pressure in ->memory can lead to premature reclamation/throttling in socket. While for the default mode (cgroupv2), the socket memory is charged into ->memory, and ->tcpmem/->tcpmem_pressure are simply not used. So {socket,tcpmem}_pressure are only used in default/legacy mode respectively for indicating socket memory pressure. This patch fixes the pieces of code that make mixed use of both. Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16nfp: update maintainerLouis Peens1-1/+1
Take over maintainership of the nfp driver from Simon as he is moving away from Corigine. Signed-off-by: Louis Peens <louis.peens@corigine.com> Acked-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: ethernet: ti: am65-cpsw: add mqprio qdisc offload in channel modeGrygorii Strashko3-1/+311
This patch adds MQPRIO Qdisc offload in full 'channel' mode which allows not only setting up pri:tc mapping, but also configuring TX shapers on external port FIFOs. The K3 CPSW MQPRIO Qdisc offload is expected to work with VLAN/priority tagged packets. Non-tagged packets have to be mapped only to TC0. - TX traffic classes must be rated starting from TC that has highest priority and with no gaps - Traffic classes are used starting from 0, that has highest priority - min_rate defines Committed Information Rate (guaranteed) - max_rate defines Excess Information Rate (non guaranteed) and offloaded as (max_rate[i] - tcX_min_rate[i]) - VLAN/priority tagged packets mapped to TC0 will exit switch with VLAN tag priority 0 The configuration example: ethtool -L eth1 tx 5 ethtool --set-priv-flags eth1 p0-rx-ptype-rrobin off tc qdisc add dev eth1 parent root handle 100: mqprio num_tc 3 \ map 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 \ queues 1@0 1@1 1@2 hw 1 mode channel \ shaper bw_rlimit min_rate 0 100mbit 200mbit max_rate 0 101mbit 202mbit tc qdisc replace dev eth2 handle 100: parent root mqprio num_tc 1 \ map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@0 hw 1 ip link add link eth1 name eth1.100 type vlan id 100 ip link set eth1.100 type vlan egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 In the above example two ports share the same TX CPPI queue 0 for low priority traffic. 3 traffic classes are defined for eth1 and mapped to: TC0 - low priority, TX CPPI queue 0 -> ext Port 1 fifo0, no rate limit TC1 - prio 2, TX CPPI queue 1 -> ext Port 1 fifo1, CIR=100Mbit/s, EIR=1Mbit/s TC2 - prio 3, TX CPPI queue 2 -> ext Port 1 fifo2, CIR=200Mbit/s, EIR=2Mbit/s Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Roger Quadros <rogerq@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16Merge branch 'inet-data-races'David S. Miller39-370/+362
Eric Dumazet says: ==================== inet: socket lock and data-races avoidance In this series, I converted 20 bits in "struct inet_sock" and made them truly atomic. This allows to implement many IP_ socket options in a lockless fashion (no need to acquire socket lock), and fixes data-races that were showing up in various KCSAN reports. I also took care of IP_TTL/IP_MINTTL, but left few other options for another series. v4: Rebased after recent mptcp changes. Added Reviewed-by: tags from Simon (thanks !) v3: fixed patch 7, feedback from build bot about ipvs set_mcast_loop() v2: addressed a feedback from a build bot in patch 9 by removing unused issk variable in mptcp_setsockopt_sol_ip_set_transparent() Added Acked-by: tags from Soheil (thanks !) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: implement lockless IP_MINTTLEric Dumazet1-18/+14
inet->min_ttl is already read with READ_ONCE(). Implementing IP_MINTTL socket option set/read without holding the socket lock is easy. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: implement lockless IP_TTLEric Dumazet2-16/+13
ip_select_ttl() is racy, because it reads inet->uc_ttl without proper locking. Add READ_ONCE()/WRITE_ONCE() annotations while allowing IP_TTL socket option to be set/read without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->defer_connect to inet->inet_flagsEric Dumazet6-19/+21
Make room in struct inet_sock by removing this bit field, using one available bit in inet_flags instead. Also move local_port_range to fill the resulting hole, saving 8 bytes on 64bit arches. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->bind_address_no_port to inet->inet_flagsEric Dumazet5-11/+11
IP_BIND_ADDRESS_NO_PORT socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->nodefrag to inet->inet_flagsEric Dumazet6-16/+14
IP_NODEFRAG socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->is_icsk to inet->inet_flagsEric Dumazet8-14/+13
We move single bit fields to inet->inet_flags to avoid races. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->transparent to inet->inet_flagsEric Dumazet11-34/+30
IP_TRANSPARENT socket option can now be set/read without locking the socket. v2: removed unused issk variable in mptcp_setsockopt_sol_ip_set_transparent() v4: rebased after commit 3f326a821b99 ("mptcp: change the mpc check helper to return a sk") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->mc_all to inet->inet_fragsEric Dumazet5-15/+15
IP_MULTICAST_ALL socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->mc_loop to inet->inet_fragsEric Dumazet9-18/+16
IP_MULTICAST_LOOP socket option can now be set/read without locking the socket. v3: fix build bot error reported in ipvs set_mcast_loop() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->hdrincl to inet->inet_flagsEric Dumazet10-41/+31
IP_HDRINCL socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->freebind to inet->inet_flagsEric Dumazet7-22/+22
IP_FREEBIND socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->recverr_rfc4884 to inet->inet_flagsEric Dumazet3-11/+11
IP_RECVERR_RFC4884 socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->recverr to inet->inet_flagsEric Dumazet9-32/+30
IP_RECVERR socket option can now be set/get without locking the socket. This patch potentially avoid data-races around inet->recverr. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: set/get simple options locklesslyEric Dumazet1-56/+62
Now we have inet->inet_flags, we can set following options without having to hold the socket lock: IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS, IP_PASSSEC, IP_RECVORIGDSTADDR, IP_RECVFRAGSIZE. ip_sock_set_pktinfo() no longer hold the socket lock. Similarly we can get the following options whithout holding the socket lock: IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS, IP_PASSSEC, IP_RECVORIGDSTADDR, IP_CHECKSUM, IP_RECVFRAGSIZE. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: introduce inet->inet_flagsEric Dumazet8-71/+83
Various inet fields are currently racy. do_ip_setsockopt() and do_ip_getsockopt() are mostly holding the socket lock, but some (fast) paths do not. Use a new inet->inet_flags to hold atomic bits in the series. Remove inet->cmsg_flags, and use instead 9 bits from inet_flags. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16Merge branch 'redundant-of_match_ptr'David S. Miller6-7/+7
Ruan Jinjie says: ==================== net: Remove redundant of_match_ptr() macro Since these net drivers depend on CONFIG_OF, there is no need to wrap the macro of_match_ptr() here. Changes in v3: - Collect responses from v1 and v2. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16wlcore: spi: Remove redundant of_match_ptr()Ruan Jinjie1-1/+1
The driver depends on CONFIG_OF, it is not necessary to use of_match_ptr() here. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: qualcomm: Remove redundant of_match_ptr()Ruan Jinjie1-1/+1
The driver depends on CONFIG_OF, it is not necessary to use of_match_ptr() here. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: gemini: Remove redundant of_match_ptr()Ruan Jinjie1-2/+2
The driver depends on CONFIG_OF, it is not necessary to use of_match_ptr() here. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: dsa: rzn1-a5psw: Remove redundant of_match_ptr()Ruan Jinjie1-1/+1
The driver depends on CONFIG_OF, it is not necessary to use of_match_ptr() here. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: dsa: realtek: Remove redundant of_match_ptr()Ruan Jinjie2-2/+2
The driver depends on CONFIG_OF, it is not necessary to use of_match_ptr() here. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16nfc: virtual_ncidev: Use module_misc_device macro to simplify the codeLi Zetao1-12/+1
Use the module_misc_device macro to simplify the code, which is the same as declaring with module_init() and module_exit(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16Merge branch 'hns3-ethtool'David S. Miller12-681/+875
Jijie Shao says: ==================== hns3: refactor registers information for ethtool -d refactor registers information for ethtool -d ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: hns3: fix wrong rpu tln reg issueJijie Shao4-2/+71
In the original RPU query command, the status register values of multiple RPU tunnels are accumulated by default, which is unreasonable. This patch Fix it by querying the specified tunnel ID. The tunnel number of the device can be obtained from firmware during initialization. Fixes: ddb54554fa51 ("net: hns3: add DFX registers information for ethtool -d") Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: hns3: Support tlv in regs data for HNS3 VF driverJijie Shao1-24/+61
The dump register function is being refactored. The third step in refactoring is to support tlv info in regs data for HNS3 PF driver. Currently, if we use "ethtool -d" to dump regs value, the output is as follows: offset1: 00 01 02 03 04 05 ... offset2:10 11 12 13 14 15 ... ...... We can't get the value of a register directly. This patch deletes the original separator information and add tag_len_value information in regs data. ethtool can parse register data in key-value format by -d command. a patch will be added to the ethtool to parse regs data in the following format: reg1 : value2 reg2 : value2 ...... Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: hns3: Support tlv in regs data for HNS3 PF driverJijie Shao1-65/+102
The dump register function is being refactored. The second step in refactoring is to support tlv info in regs data for HNS3 PF driver. Currently, if we use "ethtool -d" to dump regs value, the output is as follows: offset1: 00 01 02 03 04 05 ... offset2:10 11 12 13 14 15 ... ...... We can't get the value of a register directly. This patch deletes the original separator information and add tag_len_value information in regs data. ethtool can parse register data in key-value format by -d command. a patch will be added to the ethtool to parse regs data in the following format: reg1 : value2 reg2 : value2 ...... Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: hns3: move dump regs function to a separate fileJijie Shao10-680/+731
The dump register function is being refactored. The first step in refactoring is put the dump regs function into a separate file. Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16Merge branch 'fec-XDP_TX'David S. Miller2-61/+132
Wei Fang says: ==================== net: fec: add XDP_TX feature support This patch set is to support the XDP_TX feature of FEC driver, the first patch is add initial XDP_TX support, and the second patch improves the performance of XDP_TX by not using xdp_convert_buff_to_frame(). Please refer to the commit message of each patch for more details. ==================== Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: fec: improve XDP_TX performanceWei Fang2-70/+75
As suggested by Jesper and Alexander, we can avoid converting xdp_buff to xdp_frame in case of XDP_TX to save a bunch of CPU cycles, so that we can further improve the XDP_TX performance. Before this patch on i.MX8MP-EVK board, the performance shows as follows. root@imx8mpevk:~# ./xdp2 eth0 proto 17: 353918 pkt/s proto 17: 352923 pkt/s proto 17: 353900 pkt/s proto 17: 352672 pkt/s proto 17: 353912 pkt/s proto 17: 354219 pkt/s After applying this patch, the performance is improved. root@imx8mpevk:~# ./xdp2 eth0 proto 17: 369261 pkt/s proto 17: 369267 pkt/s proto 17: 369206 pkt/s proto 17: 369214 pkt/s proto 17: 369126 pkt/s proto 17: 369272 pkt/s Signed-off-by: Wei Fang <wei.fang@nxp.com> Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: fec: add XDP_TX feature supportWei Fang2-21/+87
The XDP_TX feature is not supported before, and all the frames which are deemed to do XDP_TX action actually do the XDP_DROP action. So this patch adds the XDP_TX support to FEC driver. I tested the performance of XDP_TX in XDP_DRV mode and XDP_SKB mode respectively on i.MX8MP-EVK platform, and as suggested by Jesper, I also tested the performance of XDP_REDIRECT on the same platform. And the test steps and results are as follows. XDP_TX test: Step 1: One board is used as generator and connects to switch,and the FEC port of DUT also connects to the switch. Both boards with flow control off. Then the generator runs the pktgen_sample03_burst_single_flow.sh script to generate and send burst traffic to DUT. Note that the size of packet was set to 64 bytes and the procotol of packet was UDP in my test scenario. In addition, the SMAC of the packet need to be different from the MAC of the generator, because the xdp2 program will swap the DMAC and SMAC of the packet and send it back to the generator. If the SMAC of the generated packet is the MAC of the generator, the generator will receive the returned traffic which increase the CPU loading and significantly degrade the transmit speed of the generator, and finally it affects the test of XDP_TX performance. Step 2: The DUT runs the xdp2 program to transmit received UDP packets back out on the same port where they were received. root@imx8mpevk:~# ./xdp2 eth0 proto 17: 353918 pkt/s proto 17: 352923 pkt/s proto 17: 353900 pkt/s proto 17: 352672 pkt/s proto 17: 353912 pkt/s proto 17: 354219 pkt/s root@imx8mpevk:~# ./xdp2 -S eth0 proto 17: 160604 pkt/s proto 17: 160708 pkt/s proto 17: 160564 pkt/s proto 17: 160684 pkt/s proto 17: 160640 pkt/s proto 17: 160720 pkt/s The above results show that the XDP_TX performance of XDP_DRV mode is much better than XDP_SKB mode, more than twice that of XDP_SKB mode, which is in line with our expectation. XDP_REDIRECT test: Step1: Both the generator and the FEC port of the DUT connet to the switch port. All the ports with flow control off, then the generator runs the pktgen script to generate and send burst traffic to DUT. Note that the size of packet was set to 64 bytes and the procotol of packet was UDP in my test scenario. Step2: The DUT runs the xdp_redirect program to redirect the traffic from the FEC port to the FEC port itself. root@imx8mpevk:~# ./xdp_redirect eth0 eth0 Redirecting from eth0 (ifindex 2; driver fec) to eth0 (ifindex 2; driver fec) Summary 232,302 rx/s 0 err,drop/s 232,344 xmit/s Summary 234,579 rx/s 0 err,drop/s 234,577 xmit/s Summary 235,548 rx/s 0 err,drop/s 235,549 xmit/s Summary 234,704 rx/s 0 err,drop/s 234,703 xmit/s Summary 235,504 rx/s 0 err,drop/s 235,504 xmit/s Summary 235,223 rx/s 0 err,drop/s 235,224 xmit/s Summary 234,509 rx/s 0 err,drop/s 234,507 xmit/s Summary 235,481 rx/s 0 err,drop/s 235,482 xmit/s Summary 234,684 rx/s 0 err,drop/s 234,683 xmit/s Summary 235,520 rx/s 0 err,drop/s 235,520 xmit/s Summary 235,461 rx/s 0 err,drop/s 235,461 xmit/s Summary 234,627 rx/s 0 err,drop/s 234,627 xmit/s Summary 235,611 rx/s 0 err,drop/s 235,611 xmit/s Packets received : 3,053,753 Average packets/s : 234,904 Packets transmitted : 3,053,792 Average transmit/s : 234,907 Compared the performance of XDP_TX with XDP_REDIRECT, XDP_TX is also much better than XDP_REDIRECT. It's also in line with our expectation. Signed-off-by: Wei Fang <wei.fang@nxp.com> Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org> Suggested-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16selftests: bonding: remove redundant delete action of device link1_1Zhengchao Shao1-1/+0
When run command "ip netns delete client", device link1_1 has been deleted. So, it is no need to delete link1_1 again. Remove it. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-15Merge tag 'mlx5-updates-2023-08-14' of ↵Jakub Kicinski25-569/+672
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2023-08-14 1) Handle PTP out of order CQEs issue 2) Check FW status before determining reset successful 3) Expose maximum supported SFs via devlink resource 4) MISC cleanups * tag 'mlx5-updates-2023-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: Don't query MAX caps twice net/mlx5: Remove unused MAX HCA capabilities net/mlx5: Remove unused CAPs net/mlx5: Fix error message in mlx5_sf_dev_state_change_handler() net/mlx5: Remove redundant check of mlx5_vhca_event_supported() net/mlx5: Use mlx5_sf_start_function_id() helper instead of directly calling MLX5_CAP_GEN() net/mlx5: Remove redundant SF supported check from mlx5_sf_hw_table_init() net/mlx5: Use auxiliary_device_uninit() instead of device_put() net/mlx5: E-switch, Add checking for flow rule destinations net/mlx5: Check with FW that sync reset completed successfully net/mlx5: Expose max possible SFs via devlink resource net/mlx5e: Add recovery flow for tx devlink health reporter for unhealthy PTP SQ net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs net/mlx5: Consolidate devlink documentation in devlink/mlx5.rst ==================== Link: https://lore.kernel.org/r/20230814214144.159464-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15Merge branch 'net-warn-about-attempts-to-register-negative-ifindex'Jakub Kicinski3-3/+35
Jakub Kicinski says: ==================== net: warn about attempts to register negative ifindex Follow up to the recently posted fix for OvS lacking input validation: https://lore.kernel.org/all/20230814203840.2908710-1-kuba@kernel.org/ Warn about negative ifindex more explicitly and misc YNL updates. ==================== Link: https://lore.kernel.org/r/20230814205627.2914583-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15tools: ynl: add more info to KeyErrors on missing attrsJakub Kicinski1-3/+12
When developing specs its useful to know which attr space YNL was trying to find an attribute in on key error. Instead of printing: KeyError: 0 add info about the space: Exception: Space 'vport' has no attribute with value '0' Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20230814205627.2914583-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15netlink: specs: add ovs_vport new commandJakub Kicinski1-0/+18
Add NEW to the spec, it was useful testing the fix for OvS input validation. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20230814205627.2914583-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15net: warn about attempts to register negative ifindexJakub Kicinski1-0/+5
Since the xarray changes we mix returning valid ifindex and negative errno in a single int returned from dev_index_reserve(). This depends on the fact that ifindexes can't be negative. Otherwise we may insert into the xarray and return a very large negative value. This in turn may break ERR_PTR(). OvS is susceptible to this problem and lacking validation (fix posted separately for net). Reject negative ifindex explicitly. Add a warning because the input validation is better handled by the caller. Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230814205627.2914583-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15eth: r8152: try to use a normal budgetJakub Kicinski1-2/+1
Mario reports that loading r8152 on his system leads to a: netif_napi_add_weight() called with weight 256 warning getting printed. We don't have any solid data on why such high budget was chosen, and it may cause stalls in processing other softirqs and rt threads. So try to switch back to the default (64) weight. If this slows down someone's system we should investigate which part of stopping starting the NAPI poll in this driver are expensive. Reported-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/all/0bfd445a-81f7-f702-08b0-bd5a72095e49@amd.com/ Acked-by: Hayes Wang <hayeswang@realtek.com> Link: https://lore.kernel.org/r/20230814153521.2697982-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15net: e1000e: Remove unused declarationsYue Haibing1-2/+0
Commit bdfe2da6aefd ("e1000e: cosmetic move of function prototypes to the new mac.h") declared but never implemented them. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20230814135821.4808-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15qed: remove unused 'resp_size' calculationArnd Bergmann1-28/+17
Newer versions of clang warn about this variable being assigned but never used: drivers/net/ethernet/qlogic/qed/qed_vf.c:63:67: error: parameter 'resp_size' set but not used [-Werror,-Wunused-but-set-parameter] There is no indication in the git history on how this was ever meant to be used, so just remove the entire calculation and argument passing for it to avoid the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230814074512.1067715-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15net: phy: mediatek-ge-soc: support PHY LEDsDaniel Golle1-9/+426
Implement netdev trigger and primitive bliking offloading as well as simple set_brigthness function for both PHY LEDs of the in-SoC PHYs found in MT7981 and MT7988. For MT7988, read boottrap register and apply LED polarities accordingly to get uniform behavior from all LEDs on MT7988. This requires syscon phandle 'mediatek,pio' present in parenting MDIO bus which should point to the syscon holding the boottrap register. Signed-off-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/dc324d48c00cd7350f3a506eaa785324cae97372.1691977904.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15Merge branch 'nexthop-various-cleanups'Jakub Kicinski1-6/+0
Ido Schimmel says: ==================== nexthop: Various cleanups Benefit from recent bug fixes and simplify the nexthop dump code. No regressions in existing tests: # ./fib_nexthops.sh [...] Tests passed: 234 Tests failed: 0 ==================== Link: https://lore.kernel.org/r/20230813164856.2379822-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15nexthop: Do not increment dump sentinel at the end of the dumpIdo Schimmel1-1/+0
The nexthop and nexthop bucket dump callbacks previously returned a positive return code even when the dump was complete, prompting the core netlink code to invoke the callback again, until returning zero. Zero was only returned by these callbacks when no information was filled in the provided skb, which was achieved by incrementing the dump sentinel at the end of the dump beyond the ID of the last nexthop. This is no longer necessary as when the dump is complete these callbacks return zero. Remove the unnecessary increment. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230813164856.2379822-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15nexthop: Simplify nexthop bucket dumpIdo Schimmel1-5/+0
Before commit f10d3d9df49d ("nexthop: Make nexthop bucket dump more efficient"), rtm_dump_nexthop_bucket_nh() returned a non-zero return code for each resilient nexthop group whose buckets it dumped, regardless if it encountered an error or not. This meant that the sentinel ('dd->ctx->nh.idx') used by the function that walked the different nexthops could not be used as a sentinel for the bucket dump, as otherwise buckets from the same group would be dumped over and over again. This was dealt with by adding another sentinel ('dd->ctx->done_nh_idx') that was incremented by rtm_dump_nexthop_bucket_nh() after successfully dumping all the buckets from a given group. After the previously mentioned commit this sentinel is no longer necessary since the function no longer returns a non-zero return code when successfully dumping all the buckets from a given group. Remove this sentinel and simplify the code. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230813164856.2379822-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15Merge branch 'seg6-add-next-c-sid-support-for-srv6-end-x-behavior'Jakub Kicinski3-20/+1302
Andrea Mayer says: ==================== seg6: add NEXT-C-SID support for SRv6 End.X behavior In the Segment Routing (SR) architecture a list of instructions, called segments, can be added to the packet headers to influence the forwarding and processing of the packets in an SR enabled network. Considering the Segment Routing over IPv6 data plane (SRv6) [1], the segment identifiers (SIDs) are IPv6 addresses (128 bits) and the segment list (SID List) is carried in the Segment Routing Header (SRH). A segment may correspond to a "behavior" that is executed by a node when the packet is received. The Linux kernel currently supports a large subset of the behaviors described in [2] (e.g., End, End.X, End.T and so on). In some SRv6 scenarios, the number of segments carried by the SID List may increase dramatically, reducing the MTU (Maximum Transfer Unit) size and/or limiting the processing power of legacy hardware devices (due to longer IPv6 headers). The NEXT-C-SID mechanism [3] extends the SRv6 architecture by providing several ways to efficiently represent the SID List. By leveraging the NEXT-C-SID, it is possible to encode several SRv6 segments within a single 128 bit SID address (also referenced as Compressed SID Container). In this way, the length of the SID List can be drastically reduced. The NEXT-C-SID mechanism is built upon the "flavors" framework defined in [2]. This framework is already supported by the Linux SRv6 subsystem and is used to modify and/or extend a subset of existing behaviors. In this patchset, we extend the SRv6 End.X behavior in order to support the NEXT-C-SID mechanism. In details, the patchset is made of: - patch 1/2: add NEXT-C-SID support for SRv6 End.X behavior; - patch 2/2: add selftest for NEXT-C-SID in SRv6 End.X behavior. From the user space perspective, we do not need to change the iproute2 code to support the NEXT-C-SID flavor for the SRv6 End.X behavior. However, we will update the man page considering the NEXT-C-SID flavor applied to the SRv6 End.X behavior in a separate patch. [1] - https://datatracker.ietf.org/doc/html/rfc8754 [2] - https://datatracker.ietf.org/doc/html/rfc8986 [3] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression ==================== Link: https://lore.kernel.org/r/20230812180926.16689-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15selftests: seg6: add selftest for NEXT-C-SID flavor in SRv6 End.X behaviorPaolo Lungaroni2-0/+1214
This selftest is designed for testing the support of NEXT-C-SID flavor for SRv6 End.X behavior. It instantiates a virtual network composed of several nodes: hosts and SRv6 routers. Each node is realized using a network namespace that is properly interconnected to others through veth pairs, according to the topology depicted in the selftest script file. The test considers SRv6 routers implementing IPv4/IPv6 L3 VPNs leveraged by hosts for communicating with each other. Such routers i) apply different SRv6 Policies to the traffic received from connected hosts, considering the IPv4 or IPv6 protocols; ii) use the NEXT-C-SID compression mechanism for encoding several SRv6 segments within a single 128-bit SID address, referred to as a Compressed SID (C-SID) container. The NEXT-C-SID is provided as a "flavor" of the SRv6 End.X behavior, enabling it to properly process the C-SID containers. The correct execution of the enabled NEXT-C-SID SRv6 End.X behavior is verified through reachability tests carried out between hosts belonging to the same VPN. Signed-off-by: Paolo Lungaroni <paolo.lungaroni@uniroma2.it> Co-developed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230812180926.16689-3-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15seg6: add NEXT-C-SID support for SRv6 End.X behaviorAndrea Mayer1-20/+88
The NEXT-C-SID mechanism described in [1] offers the possibility of encoding several SRv6 segments within a single 128 bit SID address. Such a SID address is called a Compressed SID (C-SID) container. In this way, the length of the SID List can be drastically reduced. A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address logically structured in three main blocks: i) Locator-Block; ii) Locator-Node Function; iii) Argument. C-SID container +------------------------------------------------------------------+ | Locator-Block |Loc-Node| Argument | | |Function| | +------------------------------------------------------------------+ <--------- B -----------> <- NF -> <------------- A ---------------> (i) The Locator-Block can be any IPv6 prefix available to the provider; (ii) The Locator-Node Function represents the node and the function to be triggered when a packet is received on the node; (iii) The Argument carries the remaining C-SIDs in the current C-SID container. This patch leverages the NEXT-C-SID mechanism previously introduced in the Linux SRv6 subsystem [2] to support SID compression capabilities in the SRv6 End.X behavior [3]. An SRv6 End.X behavior with NEXT-C-SID flavor works as an End.X behavior but it is capable of processing the compressed SID List encoded in C-SID containers. An SRv6 End.X behavior with NEXT-C-SID flavor can be configured to support user-provided Locator-Block and Locator-Node Function lengths. In this implementation, such lengths must be evenly divisible by 8 (i.e. must be byte-aligned), otherwise the kernel informs the user about invalid values with a meaningful error code and message through netlink_ext_ack. If Locator-Block and/or Locator-Node Function lengths are not provided by the user during configuration of an SRv6 End.X behavior instance with NEXT-C-SID flavor, the kernel will choose their default values i.e., 32-bit Locator-Block and 16-bit Locator-Node Function. [1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression [2] - https://lore.kernel.org/all/20220912171619.16943-1-andrea.mayer@uniroma2.it/ [3] - https://datatracker.ietf.org/doc/html/rfc8986#name-endx-l3-cross-connect Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230812180926.16689-2-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15Merge branch 'genetlink-provide-struct-genl_info-to-dumps'Jakub Kicinski45-180/+236
Jakub Kicinski says: ==================== genetlink: provide struct genl_info to dumps One of the biggest (which is not to say only) annoyances with genetlink handling today is that doit and dumpit need some of the same information, but it is passed to them in completely different structs. The implementations commonly end up writing a _fill() method which populates a message and have to pass at least 6 parameters. 3 of which are extracted manually from request info. After a lot of umming and ahing I decided to populate struct genl_info for dumps, without trying to factor out only the common parts. This makes the adoption easiest. In the future we may add a new version of dump which takes struct genl_info *info as the second argument, instead of struct netlink_callback *cb. For now developers have to call genl_info_dump(cb) to get the info. Typical genetlink families no longer get exposed to netlink protocol internals like pid and seq numbers. v3: - correct the condition in ethtool code (patch 10) v2: https://lore.kernel.org/all/20230810233845.2318049-1-kuba@kernel.org/ - replace the GENL_INFO_NTF() macro with init helper - fix the commit messages v1: https://lore.kernel.org/all/20230809182648.1816537-1-kuba@kernel.org/ ==================== Link: https://lore.kernel.org/r/20230814214723.2924989-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15ethtool: netlink: always pass genl_info to .prepare_dataJakub Kicinski25-47/+47
We had a number of bugs in the past because developers forgot to fully test dumps, which pass NULL as info to .prepare_data. .prepare_data implementations would try to access info->extack leading to a null-deref. Now that dumps and notifications can access struct genl_info we can pass it in, and remove the info null checks. Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # pause Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-11-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15ethtool: netlink: simplify arguments to ethnl_default_parse()Jakub Kicinski1-12/+9
Pass struct genl_info directly instead of its members. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15netdev-genl: use struct genl_info for reply constructionJakub Kicinski1-9/+8
Use the just added APIs to make the code simpler. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: add genlmsg_iput() APIJakub Kicinski1-1/+53
Add some APIs and helpers required for convenient construction of replies and notifications based on struct genl_info. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: add a family pointer to struct genl_infoJakub Kicinski2-11/+14
Having family in struct genl_info is quite useful. It cuts down the number of arguments which need to be passed to helpers which already take struct genl_info. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: use attrs from struct genl_infoJakub Kicinski14-23/+22
Since dumps carry struct genl_info now, use the attrs pointer from genl_info and remove the one in struct genl_dumpit_info. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: add struct genl_info to struct genl_dumpit_infoJakub Kicinski2-2/+22
Netlink GET implementations must currently juggle struct genl_info and struct netlink_callback, depending on whether they were called from doit or dumpit. Add genl_info to the dump state and populate the fields. This way implementations can simply pass struct genl_info around. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: remove userhdr from struct genl_infoJakub Kicinski7-27/+33
Only three families use info->userhdr today and going forward we discourage using fixed headers in new families. So having the pointer to user header in struct genl_info is an overkill. Compute the header pointer at runtime. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: make genl_info->nlhdr constJakub Kicinski3-3/+3
struct netlink_callback has a const nlh pointer, make the pointer in struct genl_info const as well, to make copying between the two easier. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: push conditional locking into dumpit/doneJakub Kicinski1-55/+35
Add helpers which take/release the genl mutex based on family->parallel_ops. Remove the separation between handling of ops in locked and parallel families. Future patches would make the duplicated code grow even more. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15net: Fix slab-out-of-bounds in inet[6]_steal_sockLorenz Bauer2-2/+2
Kumar reported a KASAN splat in tcp_v6_rcv: bash-5.2# ./test_progs -t btf_skc_cls_ingress ... [ 51.810085] BUG: KASAN: slab-out-of-bounds in tcp_v6_rcv+0x2d7d/0x3440 [ 51.810458] Read of size 2 at addr ffff8881053f038c by task test_progs/226 The problem is that inet[6]_steal_sock accesses sk->sk_protocol without accounting for request or timewait sockets. To fix this we can't just check sock_common->skc_reuseport since that flag is present on timewait sockets. Instead, add a fullsock check to avoid the out of bands access of sk_protocol. Fixes: 9c02bec95954 ("bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign") Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230815-bpf-next-v2-1-95126eaa4c1b@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14Merge branch 'Update and document struct_ops'Martin KaFai Lau2-6/+56
David Vernet says: ==================== The struct bpf_struct_ops structure in BPF is a framework that allows subsystems to extend themselves using BPF. In commit 68b04864ca425 ("bpf: Create links for BPF struct_ops maps") and commit aef56f2e918bf ("bpf: Update the struct_ops of a bpf_link"), the structure was updated to include new ->validate() and ->update() callbacks respectively in support of allowing struct_ops maps to be created with BPF_F_LINK. The intention was that struct bpf_struct_ops implementations could support map updates through the link. Because map validation and registration would take place in two separate steps for struct_ops maps managed by the link (the first in map update elem, and the latter in link create), the ->validate() callback was added, and any struct_ops implementation that wished to use BPF_F_LINK, even just for lifetime management, would then be required to define both it and ->update(). Not all struct_ops implementations can or will support update, however. For example, the sched_ext struct_ops implementation proposed in [0] will not be able to support atomic map updates because it can race with sysrq, has to cycle tasks through various states in order to safely transition, etc. It can, however, benefit from letting the BPF link automatically evict the struc_ops map when the application exits (e.g. if it crashes). This patch set therefore: 1. Updates the struct_ops implementation to support default values for ->validate() and ->update() so that struct_ops implementations can benefit from BPF_F_LINK management even if they can't support updates. 2. Documents struct bpf_struct_ops so that the semantics are clear and well defined. --- v2: https://lore.kernel.org/bpf/0f5ea3de-c6e7-490f-b5ec-b5c7cd288687@gmail.com/T/ Changes from v2 -> v3: - Add patch 2/2 that documents the struct bpf_struct_ops structure. - Add Kui-Feng's Acked-by tag to patch 1/2. v1: https://lore.kernel.org/lkml/20230811150934.GA542801@maniforge/ Changes from v1 -> v2: - Move the if (!st_map->st_ops->update) check outside of the critical section before we acquire the update_mutex. ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14bpf: Document struct bpf_struct_ops fieldsDavid Vernet1-0/+47
Subsystems that want to implement a struct bpf_struct_ops structure to enable struct_ops maps must currently reverse engineer how the structure works. Given that this is meant to be a way for subsystem maintainers to extend their subsystems using BPF, let's document it to make it a bit easier on them. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230814185908.700553-3-void@manifault.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14bpf: Support default .validate() and .update() behavior for struct_ops linksDavid Vernet1-6/+9
Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also define the .validate() and .update() callbacks in its corresponding struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful in its own right to ensure that the map is unloaded if an application crashes. For example, with sched_ext, we want to automatically unload the host-wide scheduler if the application crashes. We would likely never support updating elements of a sched_ext struct_ops map, so we'd have to implement these callbacks showing that they _can't_ support element updates just to benefit from the basic lifetime management of struct_ops links. Let's enable struct_ops maps to work with BPF_F_LINK even if they haven't defined these callbacks, by assuming that a struct_ops map element cannot be updated by default. Acked-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230814185908.700553-2-void@manifault.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14selftests/bpf: Add various more tcx test casesDaniel Borkmann3-0/+462
Add several new tcx test cases to improve test coverage. This also includes a few new tests with ingress instead of clsact qdisc, to cover the fix from commit dc644b540a2d ("tcx: Fix splat in ingress_destroy upon tcx_entry_free"). # ./test_progs -t tc [...] #234 tc_links_after:OK #235 tc_links_append:OK #236 tc_links_basic:OK #237 tc_links_before:OK #238 tc_links_chain_classic:OK #239 tc_links_chain_mixed:OK #240 tc_links_dev_cleanup:OK #241 tc_links_dev_mixed:OK #242 tc_links_ingress:OK #243 tc_links_invalid:OK #244 tc_links_prepend:OK #245 tc_links_replace:OK #246 tc_links_revision:OK #247 tc_opts_after:OK #248 tc_opts_append:OK #249 tc_opts_basic:OK #250 tc_opts_before:OK #251 tc_opts_chain_classic:OK #252 tc_opts_chain_mixed:OK #253 tc_opts_delete_empty:OK #254 tc_opts_demixed:OK #255 tc_opts_detach:OK #256 tc_opts_detach_after:OK #257 tc_opts_detach_before:OK #258 tc_opts_dev_cleanup:OK #259 tc_opts_invalid:OK #260 tc_opts_mixed:OK #261 tc_opts_prepend:OK #262 tc_opts_replace:OK #263 tc_opts_revision:OK [...] Summary: 44/38 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/8699efc284b75ccdc51ddf7062fa2370330dc6c0.1692029283.git.daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14net: dsa: mv88e6060: add phylink_get_caps implementationRussell King (Oracle)1-0/+45
Add a phylink_get_caps implementation for Marvell 88e6060 DSA switch. This is a fast ethernet switch, with internal PHYs for ports 0 through 4. Port 4 also supports MII, REVMII, REVRMII and SNI. Port 5 supports MII, REVMII, REVRMII and SNI without an internal PHY. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/E1qUkx7-003dMX-9b@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14net/mlx5: Don't query MAX caps twiceShay Drory1-6/+6
Whenever mlx5 driver is probed or reloaded, it queries some capabilities in MAX mode via set_hca_cap() API. Afterwards, the driver queries all capabilities in MAX mode via mlx5_query_hca_caps() API. Since MAX caps are read only caps, querying them twice is redundant. Hence, delete the second query. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Remove unused MAX HCA capabilitiesShay Drory5-74/+30
Each device cap has two modes: MAX and CUR. The driver maintains a cache of both modes of the capabilities. For most device caps, the MAX cap mode is never used. Hence, remove all driver queries of the MAX mode of the said caps as well as their helper MACROs. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Remove unused CAPsShay Drory4-74/+1
mlx5 driver queries the device for VECTOR_CALC and SHAMPO caps, but there isn't any user who requires them. As well as, MLX5_MCAM_REGS_0x9080_0x90FF is queried but not used. Thus, drop all usages and definitions of the mentioned caps above. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Fix error message in mlx5_sf_dev_state_change_handler()Jiri Pirko1-1/+1
sw_function_id contains sfnum, so fix the error message to name the value properly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Remove redundant check of mlx5_vhca_event_supported()Jiri Pirko1-1/+1
Since mlx5_vhca_event_supported() is called in mlx5_sf_dev_supported(), remove the redundant call. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Use mlx5_sf_start_function_id() helper instead of directly calling ↵Jiri Pirko1-3/+3
MLX5_CAP_GEN() There is a helper called mlx5_sf_start_function_id() that wraps up a query to get base SF function id. Use that instead of calling MLX5_CAP_GEN() directly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Remove redundant SF supported check from mlx5_sf_hw_table_init()Jiri Pirko1-3/+2
Since mlx5_sf_supported() check is done as a first thing in mlx5_sf_max_functions(), remove the redundant check. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Use auxiliary_device_uninit() instead of device_put()Jiri Pirko1-1/+1
Instead of using device_put(), use auxiliary_device_uninit() for auxiliary device uninit which internally just calls device_put(). Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: E-switch, Add checking for flow rule destinationsJianbo Liu1-0/+31
Firmware doesn't allow flow rules in FDB to do header rewrite and send packets to both internal and uplink vports. The following syndrome will be generated when trying to offload such kind of rules: mlx5_core 0000:08:00.0: mlx5_cmd_out_err:803:(pid 23569): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x8c8f08), err(-22) To avoid this syndrome, add a checking before creating FTE. If a rule with header rewrite action forwards packets to both VF and PF, an error is returned directly. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Check with FW that sync reset completed successfullyMoshe Shemesh4-7/+40
Even if the PF driver had no error on his part of the sync reset flow, the firmware can see wider picture as it syncs all the PFs in the flow. So add at end of sync reset flow check with firmware by reading MFRL register and initialization segment that the flow had no issue from firmware point of view too. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Expose max possible SFs via devlink resourceShay Drory2-4/+48
Introduce devlink resource for exposing max possible SFs on mlx5 devices. For example: $ devlink resource show pci/0000:00:0b.0 pci/0000:00:0b.0: name max_local_SFs size 5 unit entry dpipe_tables none name max_external_SFs size 0 unit entry dpipe_tables none Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5e: Add recovery flow for tx devlink health reporter for unhealthy PTP SQRahul Rameshbabu5-1/+94
A new check for the tx devlink health reporter is introduced for determining when the PTP port timestamping SQ is considered unhealthy. If there are enough CQEs considered never to be delivered, the space that can be utilized on the SQ decreases significantly, impacting performance and usability of the SQ. The health reporter is triggered when the number of likely never delivered port timestamping CQEs that utilize the space of the PTP SQ is greater than 93.75% of the total capacity of the SQ. A devlink health reporter recover method is also provided for this specific TX error context that restarts the PTP SQ. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEsRahul Rameshbabu7-81/+236
Use a map structure for associating CQEs containing port timestamping information with the appropriate skb. Track order of WQEs submitted using a FIFO. Check if the corresponding port timestamping CQEs from the lookup values in the FIFO are considered dropped due to time elapsed. Return the lookup value to a freelist after consuming the skb. Reuse the freed lookup in future WQE submission iterations. The map structure uses an integer identifier for the key and returns an skb corresponding to that identifier. Embed the integer identifier in the WQE submitted to the WQ for the transmit path when the SQ is a PTP (port timestamping) SQ. The embedded identifier can then be queried using a field in the CQE of the corresponding port timestamping CQ. In the port timestamping napi_poll context, the identifier is queried from the CQE polled from CQ and used to lookup the corresponding skb from the WQE submit path. The skb reference is removed from map and then embedded with the port HW timestamp information from the CQE and eventually consumed. The metadata freelist FIFO is an array containing integer identifiers that can be pushed and popped in the FIFO. The purpose of this structure is bookkeeping what identifier values can safely be used in a subsequent WQE submission and should not contain identifiers that have still not been reaped by processing a corresponding CQE completion on the port timestamping CQ. The ts_cqe_pending_list structure is a combination of an array and linked list. The array is pre-populated with the nodes that will be added and removed from the head of the linked list. Each node contains the unique identifier value associated with the values submitted in the WQEs and retrieved in the port timestamping CQEs. When a WQE is submitted, the node in the array corresponding to the identifier popped from the metadata freelist is added to the end of the CQE pending list and is marked as "in-use". The node is removed from the linked list under two conditions. The first condition is that the corresponding port timestamping CQE is polled in the PTP napi_poll context. The second condition is that more than a second has elapsed since the DMA timestamp value corresponding to the WQE submission. When the first condition occurs, the "in-use" bit in the linked list node is cleared, and the resources corresponding to the WQE submission are then released. The second condition, however, indicates that the port timestamping CQE will likely never be delivered. It's not impossible for the device to post a CQE after an infinite amount of time though highly improbable. In order to be resilient to this improbable case, resources related to the corresponding WQE submission are still kept, the identifier value is not returned to the freelist, and the "in-use" bit is cleared on the node to indicate that it's no longer part of the linked list of "likely to be delivered" port timestamping CQE identifiers. A count for the number of port timestamping CQEs considered highly likely to never be delivered by the device is maintained. This count gets decremented in the unlikely event a port timestamping CQE considered unlikely to ever be delivered is polled in the PTP napi_poll context. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Consolidate devlink documentation in devlink/mlx5.rstRahul Rameshbabu3-314/+179
De-duplicate documentation by removing mellanox/mlx5/devlink.rst. Instead, only use the generic devlink documentation directory to document mlx5 devlink parameters. Avoid providing general devlink tool usage information in mlx5-specific documentation. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14Merge branch 'devlink-introduce-selective-dumps'Jakub Kicinski10-367/+5188
Jiri Pirko says: ==================== devlink: introduce selective dumps Motivation: For SFs, one devlink instance per SF is created. There might be thousands of these on a single host. When a user needs to know port handle for specific SF, he needs to dump all devlink ports on the host which does not scale good. Solution: Allow user to pass devlink handle (and possibly other attributes) alongside the dump command and dump only objects which are matching the selection. Use split ops to generate policies for dump callbacks acccording to the attributes used for selection. The userspace can use ctrl genetlink GET_POLICY command to find out if the selective dumps are supported by kernel for particular command. Example: $ devlink port show auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false $ devlink port show auxiliary/mlx5_core.eth.0 auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false $ devlink port show auxiliary/mlx5_core.eth.1 auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false Extension: patches #12 and #13 extends selection attributes by port index for health reporter dumping. ==================== Link: https://lore.kernel.org/r/20230811155714.1736405-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14netlink: specs: devlink: extend health reporter dump attributes by port indexJiri Pirko4-3/+15
Allow user to pass port index for health reporter dump request. Re-generate the related code. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-14-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: extend health reporter dump selector by port indexJiri Pirko1-1/+14
Introduce a possibility for devlink object to expose attributes it supports for selection of dumped objects. Use this by health reporter to indicate it supports port index based selection of dump objects. Implement this selection mechanism in devlink_nl_cmd_health_reporter_get_dump_one() Example: $ devlink health pci/0000:08:00.0: reporter fw state healthy error 0 recover 0 auto_dump true reporter fw_fatal state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32768: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32769: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32770: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1: reporter fw state healthy error 0 recover 0 auto_dump true reporter fw_fatal state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1/98304: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1/98305: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1/98306: reporter vnic state healthy error 0 recover 0 $ devlink health show pci/0000:08:00.0 pci/0000:08:00.0: reporter fw state healthy error 0 recover 0 auto_dump true reporter fw_fatal state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32768: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32769: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32770: reporter vnic state healthy error 0 recover 0 $ devlink health show pci/0000:08:00.0/32768 pci/0000:08:00.0/32768: reporter vnic state healthy error 0 recover 0 The last command is possible because of this patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-13-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14netlink: specs: devlink: extend per-instance dump commands to accept ↵Jiri Pirko4-78/+799
instance attributes Extend per-instance dump command definitions to accept instance attributes. Allow parsing of devlink handle attributes so they could be used for instance selection. Re-generate the related code. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-12-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: allow user to narrow per-instance dumps by passing handle attrsJiri Pirko1-3/+40
For SFs, one devlink instance per SF is created. There might be thousands of these on a single host. When a user needs to know port handle for specific SF, he needs to dump all devlink ports on the host which does not scale good. Allow user to pass devlink handle attributes alongside the dump command and dump only objects which are under selected devlink instance. Example: $ devlink port show auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false $ devlink port show auxiliary/mlx5_core.eth.0 auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false $ devlink port show auxiliary/mlx5_core.eth.1 auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-11-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: remove converted commands from small opsJiri Pirko2-98/+3
As the commands are already defined in split ops, remove them from small ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-10-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: remove duplicate temporary netlink callback prototypesJiri Pirko1-48/+0
Remove the duplicate temporary netlink callback prototype as the generated ones are already in place. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-9-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14netlink: specs: devlink: add commands that do per-instance dumpJiri Pirko5-3/+4177
Add the definitions for the commands that do per-instance dump and re-generate the related code. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-8-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: pass flags as an arg of dump_one() callbackJiri Pirko5-56/+56
In order to easily set NLM_F_DUMP_FILTERED for partial dumps, pass the flags as an arg of dump_one() callback. Currently, it is always NLM_F_MULTI. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-7-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: introduce dumpit callbacks for split opsJiri Pirko5-147/+144
Introduce dumpit callbacks for generated split ops. Have them as a thin wrapper around iteration function and allow to pass dump_one() function pointer directly without need to store in devlink_cmd structs. Note that the function prototypes are temporary until the generated ones will replace them in a follow-up patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-6-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: rename doit callbacks for per-instance dump commandsJiri Pirko4-45/+52
Rename netlink doit callback functions for the commands that do implement per-instance dump to match the generated names that are going to be introduce in the follow-up patch. Note that the function prototypes are temporary until the generated ones will replace them in a follow-up patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-5-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: introduce devlink_nl_pre_doit_port*() helper functionsJiri Pirko2-4/+28
Define port handling helpers what don't rely on internal_flags. Have __devlink_nl_pre_doit() to accept the flags as a function arg and make devlink_nl_pre_doit() a wrapper helper function calling it. Introduce new helpers devlink_nl_pre_doit_port() and devlink_nl_pre_doit_port_optional() to be used by split ops in follow-up patch. Note that the function prototypes are temporary until the generated ones will replace them in a follow-up patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-4-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: parse rate attrs in doit() callbacksJiri Pirko3-38/+25
No need to give the rate any special treatment in netlink attributes parsing, as unlike for ports, there is only a couple of commands benefiting from that. Remove DEVLINK_NL_FLAG_NEED_RATE*, make pre_doit() callback simpler by moving the rate attributes parsing to rate_*_doit() ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-3-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: parse linecard attr in doit() callbacksJiri Pirko3-21/+13
No need to give the linecards any special treatment in netlink attribute parsing, as unlike for ports, there is only a couple of commands benefiting from that. Remove DEVLINK_NL_FLAG_NEED_LINECARD, make pre_doit() callback simpler by moving the linecard attribute parsing to linecard_[gs]et_doit() ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-2-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14selftests/bpf: Clean up fmod_ret in bench_rename test scriptYipeng Zou1-1/+1
Running the bench_rename test script, the following error occurs: # ./benchs/run_bench_rename.sh base : 0.819 ± 0.012M/s kprobe : 0.538 ± 0.009M/s kretprobe : 0.503 ± 0.004M/s rawtp : 0.779 ± 0.020M/s fentry : 0.726 ± 0.007M/s fexit : 0.691 ± 0.007M/s benchmark 'rename-fmodret' not found The bench_rename_fmodret has been removed in commit b000def2e052 ("selftests: Remove fmod_ret from test_overhead"), thus remove it from the runners in the test script. Fixes: b000def2e052 ("selftests: Remove fmod_ret from test_overhead") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230814030727.3010390-1-zouyipeng@huawei.com
2023-08-14selftests/bpf: Fix repeat option when kfunc_call verification failsYipeng Zou1-1/+1
There is no way where topts.repeat can be set to 1 when tc_test fails. Fix the typo where the break statement slipped by one line. Fixes: fb66223a244f ("selftests/bpf: add test for accessing ctx from syscall program type") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Li Zetao <lizetao1@huawei.com> Link: https://lore.kernel.org/bpf/20230814031434.3077944-1-zouyipeng@huawei.com
2023-08-14libbpf: Set close-on-exec flag on gzopenMarco Vedovati1-2/+2
Enable the close-on-exec flag when using gzopen. This is especially important for multithreaded programs making use of libbpf, where a fork + exec could race with libbpf library calls, potentially resulting in a file descriptor leaked to the new process. This got missed in 59842c5451fe ("libbpf: Ensure libbpf always opens files with O_CLOEXEC"). Fixes: 59842c5451fe ("libbpf: Ensure libbpf always opens files with O_CLOEXEC") Signed-off-by: Marco Vedovati <marco.vedovati@crowdstrike.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230810214350.106301-1-martin.kelly@crowdstrike.com
2023-08-14net: phy: Introduce PSGMII PHY interface modeGabor Juhos4-0/+13
The PSGMII interface is similar to QSGMII. The main difference is that the PSGMII interface combines five SGMII lines into a single link while in QSGMII only four lines are combined. Similarly to the QSGMII, this interface mode might also needs special handling within the MAC driver. It is commonly used by Qualcomm with their QCA807x PHY series and modern WiSoC-s. Add definitions for the PHY layer to allow to express this type of connection between the MAC and PHY. Signed-off-by: Gabor Juhos <j4g8y7@gmail.com> Signed-off-by: Robert Marko <robert.marko@sartura.hr> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14dt-bindings: net: ethernet-controller: add PSGMII modeRobert Marko1-0/+1
Add a new PSGMII mode which is similar to QSGMII with the difference being that it combines 5 SGMII lines into a single link compared to 4 on QSGMII. It is commonly used by Qualcomm on their QCA807x PHY series. Signed-off-by: Robert Marko <robert.marko@sartura.hr> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14Merge branch 'mlxsw-redirection'David S. Miller8-17/+100
Petr Machata says: ==================== mlxsw: Support traffic redirection from a locked bridge port Ido Schimmel writes: It is possible to add a filter that redirects traffic from the ingress of a bridge port that is locked (i.e., performs security / SMAC lookup) and has learning enabled. For example: # ip link add name br0 type bridge # ip link set dev swp1 master br0 # bridge link set dev swp1 learning on locked on mab on # tc qdisc add dev swp1 clsact # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw src_ip 192.0.2.1 action mirred egress redirect dev swp2 In the kernel's Rx path, this filter is evaluated before the Rx handler of the bridge, which means that redirected traffic should not be affected by bridge port configuration such as learning. However, the hardware data path is a bit different and the redirect action (FORWARDING_ACTION in hardware) merely attaches a pointer to the packet, which is later used by the L2 lookup stage to understand how to forward the packet. Between both stages - ingress ACL and L2 lookup - learning and security lookup are performed, which means that redirected traffic is affected by bridge port configuration, unlike in the kernel's data path. The learning discrepancy was handled in commit 577fa14d2100 ("mlxsw: spectrum: Do not process learned records with a dummy FID") by simply ignoring learning notifications generated by the redirected traffic. A similar solution is not possible for the security / SMAC lookup since - unlike learning - the CPU is not involved and packets that failed the lookup are dropped by the device. Instead, solve this by prepending the ignore action to the redirect action and use it to instruct the device to disable both learning and the security / SMAC lookup for redirected traffic. Patch #1 adds the ignore action. Patch #2 prepends the action to the redirect action in flower offload code. Patch #3 removes the workaround in commit 577fa14d2100 ("mlxsw: spectrum: Do not process learned records with a dummy FID") since it is no longer needed. Patch #4 adds a test case. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14selftests: forwarding: Add test case for traffic redirection from a locked portIdo Schimmel1-0/+36
Check that traffic can be redirected from a locked bridge port and that it does not create locked FDB entries. Cc: Hans J. Schultz <netdev@kapio-technology.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mlxsw: spectrum: Stop ignoring learning notifications from redirected trafficIdo Schimmel3-17/+0
As explained in the previous patch, with the ignore action prepended to the redirect action, it is not longer possible for redirected traffic to generate learning notifications. Therefore, remove the workaround that was added in commit 577fa14d2100 ("mlxsw: spectrum: Do not process learned records with a dummy FID") as it is no longer needed. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mlxsw: spectrum_flower: Disable learning and security lookup when redirectingIdo Schimmel3-0/+22
It is possible to add a filter that redirects traffic from the ingress of a bridge port that is locked (i.e., performs security / SMAC lookup) and has learning enabled. For example: # ip link add name br0 type bridge # ip link set dev swp1 master br0 # bridge link set dev swp1 learning on locked on mab on # tc qdisc add dev swp1 clsact # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw src_ip 192.0.2.1 action mirred egress redirect dev swp2 In the kernel's Rx path, this filter is evaluated before the Rx handler of the bridge, which means that redirected traffic should not be affected by bridge port configuration such as learning. However, the hardware data path is a bit different and the redirect action (FORWARDING_ACTION in hardware) merely attaches a pointer to the packet, which is later used by the L2 lookup stage to understand how to forward the packet. Between both stages - ingress ACL and L2 lookup - learning and security lookup are performed, which means that redirected traffic is affected by bridge port configuration, unlike in the kernel's data path. The learning discrepancy was handled in commit 577fa14d2100 ("mlxsw: spectrum: Do not process learned records with a dummy FID") by simply ignoring learning notifications generated by the redirected traffic. A similar solution is not possible for the security / SMAC lookup since - unlike learning - the CPU is not involved and packets that failed the lookup are dropped by the device. Instead, solve this by prepending the ignore action to the redirect action and use it to instruct the device to disable both learning and the security / SMAC lookup for redirected traffic. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mlxsw: core_acl_flex_actions: Add IGNORE_ACTIONIdo Schimmel2-0/+42
Add the IGNORE_ACTION which is used to ignore basic switching functions such as learning on a per-packet basis. The action will be prepended to the FORWARDING_ACTION in subsequent patches. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14net: stmmac: xgmac: show more MAC HW features in debugfsFurong Xu4-6/+31
1. Show TSSTSSEL(Timestamp System Time Source), ADDMACADRSEL(additional MAC addresses), SMASEL(SMA/MDIO Interface), HDSEL(Half-duplex Support) in debugfs. 2. Show exact number of additional MAC address registers for XGMAC2 core. 3. XGMAC2 core does not have different IP checksum offload types, so just show rx_coe instead of rx_coe_type1 or rx_coe_type2. 4. XGMAC2 core does not have rxfifo_over_2048 definition, skip it. Signed-off-by: Furong Xu <0x1207@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14Merge branch 'net-stats-helpers'David S. Miller2-25/+5
Li Zetao says: ==================== Use helper functions to update stats The patch set uses the helper functions dev_sw_netstats_rx_add() and dev_sw_netstats_tx_add() to update stats, which is the same as implementing the function separately. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14vxlan: Use helper functions to update statsLi Zetao1-11/+2
Use the helper functions dev_sw_netstats_rx_add() and dev_sw_netstats_tx_add() to update stats, which helps to provide code readability. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14net: macsec: Use helper functions to update statsLi Zetao1-14/+3
Use the helper functions dev_sw_netstats_rx_add() and dev_sw_netstats_tx_add() to update stats, which helps to provide code readability. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14vmxnet3: Add XDP support.William Tu7-42/+729
The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT. Background: The vmxnet3 rx consists of three rings: ring0, ring1, and dataring. For r0 and r1, buffers at r0 are allocated using alloc_skb APIs and dma mapped to the ring's descriptor. If LRO is enabled and packet size larger than 3K, VMXNET3_MAX_SKB_BUF_SIZE, then r1 is used to mapped the rest of the buffer larger than VMXNET3_MAX_SKB_BUF_SIZE. Each buffer in r1 is allocated using alloc_page. So for LRO packets, the payload will be in one buffer from r0 and multiple from r1, for non-LRO packets, only one descriptor in r0 is used for packet size less than 3k. When receiving a packet, the first descriptor will have the sop (start of packet) bit set, and the last descriptor will have the eop (end of packet) bit set. Non-LRO packets will have only one descriptor with both sop and eop set. Other than r0 and r1, vmxnet3 dataring is specifically designed for handling packets with small size, usually 128 bytes, defined in VMXNET3_DEF_RXDATA_DESC_SIZE, by simply copying the packet from the backend driver in ESXi to the ring's memory region at front-end vmxnet3 driver, in order to avoid memory mapping/unmapping overhead. In summary, packet size: A. < 128B: use dataring B. 128B - 3K: use ring0 (VMXNET3_RX_BUF_SKB) C. > 3K: use ring0 and ring1 (VMXNET3_RX_BUF_SKB + VMXNET3_RX_BUF_PAGE) As a result, the patch adds XDP support for packets using dataring and r0 (case A and B), not the large packet size when LRO is enabled. XDP Implementation: When user loads and XDP prog, vmxnet3 driver checks configurations, such as mtu, lro, and re-allocate the rx buffer size for reserving the extra headroom, XDP_PACKET_HEADROOM, for XDP frame. The XDP prog will then be associated with every rx queue of the device. Note that when using dataring for small packet size, vmxnet3 (front-end driver) doesn't control the buffer allocation, as a result we allocate a new page and copy packet from the dataring to XDP frame. The receive side of XDP is implemented for case A and B, by invoking the bpf program at vmxnet3_rq_rx_complete and handle its returned action. The vmxnet3_process_xdp(), vmxnet3_process_xdp_small() function handles the ring0 and dataring case separately, and decides the next journey of the packet afterward. For TX, vmxnet3 has split header design. Outgoing packets are parsed first and protocol headers (L2/L3/L4) are copied to the backend. The rest of the payload are dma mapped. Since XDP_TX does not parse the packet protocol, the entire XDP frame is dma mapped for transmission and transmitted in a batch. Later on, the frame is freed and recycled back to the memory pool. Performance: Tested using two VMs inside one ESXi vSphere 7.0 machine, using single core on each vmxnet3 device, sender using DPDK testpmd tx-mode attached to vmxnet3 device, sending 64B or 512B UDP packet. VM1 txgen: $ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \ --forward-mode=txonly --eth-peer=0,<mac addr of vm2> option: add "--txonly-multi-flow" option: use --txpkts=512 or 64 byte VM2 running XDP: $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> options: XDP_DROP, XDP_PASS, XDP_TX To test REDIRECT to cpu 0, use $ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop Single core performance comparison with skb-mode. 64B: skb-mode -> native-mode XDP_DROP: 1.6Mpps -> 2.4Mpps XDP_PASS: 338Kpps -> 367Kpps XDP_TX: 1.1Mpps -> 2.3Mpps REDIRECT-drop: 1.3Mpps -> 2.3Mpps 512B: skb-mode -> native-mode XDP_DROP: 863Kpps -> 1.3Mpps XDP_PASS: 275Kpps -> 376Kpps XDP_TX: 554Kpps -> 1.2Mpps REDIRECT-drop: 659Kpps -> 1.2Mpps Demo: https://youtu.be/4lm1CSCi78Q Future work: - XDP frag support - use napi_consume_skb() instead of dev_kfree_skb_any at unmap - stats using u64_stats_t - using bitfield macro BIT() - optimization for DMA synchronization using actual frame length, instead of always max_len Signed-off-by: William Tu <u9012063@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14Merge branch 'ovs-drop-reasons'David S. Miller9-18/+226
Adrian Moreno says: ==================== openvswitch: add drop reasons There is currently a gap in drop visibility in the openvswitch module. This series tries to improve this by adding a new drop reason subsystem for OVS. Apart from adding a new drop reasson subsystem and some common drop reasons, this series takes Eric's preliminary work [1] on adding an explicit drop action and integrates it into the same subsystem. A limitation of this series is that it does not report upcall errors. The reason is that there could be many sources of upcall drops and the most common one, which is the netlink buffer overflow, cannot be reported via kfree_skb() because the skb is freed in the netlink layer (see [2]). Therefore, using a reason for the rare events and not the common one would be even more misleading. I'd propose we add (in a follow up patch) a tracepoint to better report upcall errors. [1] https://lore.kernel.org/netdev/202306300609.tdRdZscy-lkp@intel.com/T/ [2] commit 1100248a5c5c ("openvswitch: Fix double reporting of drops in dropwatch") --- v4 -> v5: - Rebased - Added a helper function to explicitly convert drop reason enum types v3 -> v4: - Changed names of errors following Ilya's suggestions - Moved the ovs-dpctl.py changes from patch 7/7 to 3/7 - Added a test to ensure actions following a drop are rejected rfc2 -> v3: - Rebased on top of latest net-next rfc1 -> rfc2: - Fail when an explicit drop is not the last - Added a drop reason for action errors - Added braces around macros - Dropped patch that added support for masks in ovs-dpctl.py as it's now included in Aaron's series [2]. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14selftests: openvswitch: add explicit drop testcaseAdrian Moreno1-0/+35
Test explicit drops generate the right drop reason. Also, verify that the kernel rejects flows with actions following an explicit drop. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14selftests: openvswitch: add drop reason testcaseAdrian Moreno1-1/+66
Test if the correct drop reason is reported when OVS drops a packet due to an explicit flow. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14net: openvswitch: add misc error drop reasonsAdrian Moreno3-8/+18
Use drop reasons from include/net/dropreason-core.h when a reasonable candidate exists. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14net: openvswitch: add meter drop reasonAdrian Moreno2-1/+2
By using an independent drop reason it makes it easy to distinguish between QoS-triggered or flow-triggered drop. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14net: openvswitch: add explicit drop actionEric Garver5-5/+40
From: Eric Garver <eric@garver.life> This adds an explicit drop action. This is used by OVS to drop packets for which it cannot determine what to do. An explicit action in the kernel allows passing the reason _why_ the packet is being dropped or zero to indicate no particular error happened (i.e: OVS intentionally dropped the packet). Since the error codes coming from userspace mean nothing for the kernel, we squash all of them into only two drop reasons: - OVS_DROP_EXPLICIT_WITH_ERROR to indicate a non-zero value was passed - OVS_DROP_EXPLICIT to indicate a zero value was passed (no error) e.g. trace all OVS dropped skbs # perf trace -e skb:kfree_skb --filter="reason >= 0x30000" [..] 106.023 ping/2465 skb:kfree_skb(skbaddr: 0xffffa0e8765f2000, \ location:0xffffffffc0d9b462, protocol: 2048, reason: 196611) reason: 196611 --> 0x30003 (OVS_DROP_EXPLICIT) Also, this patch allows ovs-dpctl.py to add explicit drop actions as: "drop" -> implicit empty-action drop "drop(0)" -> explicit non-error action drop "drop(42)" -> explicit error action drop Signed-off-by: Eric Garver <eric@garver.life> Co-developed-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14net: openvswitch: add action error drop reasonAdrian Moreno2-1/+2
Add a drop reason for packets that are dropped because an action returns a non-zero error code. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14net: openvswitch: add last-action drop reasonAdrian Moreno4-2/+63
Create a new drop reason subsystem for openvswitch and add the first drop reason to represent last-action drops. Last-action drops happen when a flow has an empty action list or there is no action that consumes the packet (output, userspace, recirc, etc). It is the most common way in which OVS drops packets. Implementation-wise, most of these skb-consuming actions already call "consume_skb" internally and return directly from within the do_execute_actions() loop so with minimal changes we can assume that any skb that exits the loop normally is a packet drop. Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14Merge branch 'mptcp-remove-msk-subflow'David S. Miller8-177/+186
Matthieu Baerts says: ==================== mptcp: get rid of msk->subflow The MPTCP protocol maintains an additional struct socket per connection, mainly to be able to easily use tcp-level struct socket operations. This leads to several side effects, beyond the quite unfortunate / confusing 'subflow' field name: - active and passive sockets behaviour is inconsistent: only active ones have a not NULL msk->subflow, leading to different error handling and different error code returned to the user-space in several places. - active sockets uses an unneeded, larger amount of memory - passive sockets can't successfully go through accept(), disconnect(), accept() sequence, see [1] for more details. The 13 first patches of this series are from Paolo and address all the above, finally getting rid of the blamed field: - The first patch is a minor clean-up. - In the next 11 patches, msk->subflow usage is systematically removed from the MPTCP protocol, replacing it with direct msk->first usage, eventually introducing new core helpers when needed. - The 13th patch finally disposes the field, and it's the only patch in the series intended to produce functional changes. The last and 14th patch is from Kuniyuki and it is not linked to the previous ones: it is a small clean-up to get rid of an unnecessary check in mptcp_init_sock(). [1] https://github.com/multipath-tcp/mptcp_net-next/issues/290 ==================== Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
2023-08-14mptcp: Remove unnecessary test for __mptcp_init_sock()Kuniyuki Iwashima1-7/+2
__mptcp_init_sock() always returns 0 because mptcp_init_sock() used to return the value directly. But after commit 18b683bff89d ("mptcp: queue data for mptcp level retransmission"), __mptcp_init_sock() need not return value anymore. Let's remove the unnecessary test for __mptcp_init_sock() and make it return void. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: get rid of msk->subflowPaolo Abeni2-26/+12
Such field is now unused just as a flag to control the first subflow deletion at close() time. Introduce a new bit flag for that and finally drop the mentioned field. As an intended side effect, now the first subflow sock is not freed before close() even for passive sockets. The msk has no open/active subflows if the first one is closed and the subflow list is singular, update accordingly the state check in mptcp_stream_accept(). Among other benefits, the subflow removal, reduces the amount of memory used on the client side for each mptcp connection, allows passive sockets to go through successful accept()/disconnect()/connect() and makes return error code consistent for failing both passive and active sockets. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/290 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: change the mpc check helper to return a skPaolo Abeni4-55/+38
After the previous patch the __mptcp_nmpc_socket helper is used only to ensure that the MPTCP socket is a suitable status - that is, the mptcp capable handshake is not started yet. Change the return value to the relevant subflow sock, to finally remove the last references to first subflow socket in the MPTCP stack. As a bonus, we can get rid of a few local variables in different functions. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid ssock usage in mptcp_pm_nl_create_listen_socket()Paolo Abeni1-9/+15
This is one of the few remaining spots actually manipulating the first subflow socket. We can leverage the recently introduced inet helpers to get rid of ssock there. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid additional indirection in sockoptPaolo Abeni1-12/+16
The mptcp sockopt infrastructure unneedly uses the first subflow socket struct in a few spots. We are going to remove such field soon, so use directly the first subflow sock instead. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid unneeded indirection in mptcp_stream_accept()Paolo Abeni1-19/+10
We are going to remove the first subflow socket soon, so avoid the additional indirection at accept() time. Instead access directly the first subflow sock, and update mptcp_accept() to operate on it. This allows dropping a duplicated check in mptcp_accept(). No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid additional indirection in mptcp_poll()Paolo Abeni1-3/+3
We are going to remove the first subflow socket soon, so avoid the additional indirection at poll() time. Instead access directly the first subflow sock. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid additional indirection in mptcp_listen()Paolo Abeni1-4/+9
We are going to remove the first subflow socket soon, so avoid the additional indirection via at listen() time. Instead call directly the recently introduced helper on the first subflow sock. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14net: factor out __inet_listen_sk() helperPaolo Abeni2-16/+23
The mptcp protocol maintains an additional socket just to easily invoke a few stream operations on the first subflow. One of them is inet_listen(). Factor out an helper operating directly on the (locked) struct sock, to allow get rid of the above dependency in the next patch without duplicating the existing code. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: mptcp: avoid additional indirection in mptcp_bind()Paolo Abeni1-5/+12
We are going to remove the first subflow socket soon, so avoid the additional indirection via at bind() time. Instead call directly the recently introduced helpers on the first subflow sock. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14net: factor out inet{,6}_bind_sk helpersPaolo Abeni4-5/+15
The mptcp protocol maintains an additional socket just to easily invoke a few stream operations on the first subflow. One of them is bind(). Factor out the helpers operating directly on the struct sock, to allow get rid of the above dependency in the next patch without duplicating the existing code. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid subflow socket usage in mptcp_get_port()Paolo Abeni1-5/+3
We are going to remove the first subflow socket soon, so avoid accessing it in mptcp_get_port(). Instead, access directly the first subflow sock. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid additional __inet_stream_connect() callPaolo Abeni1-16/+33
The mptcp protocol maintains an additional socket just to easily invoke a few stream operations on the first subflow. One of them is __inet_stream_connect(). We are going to remove the first subflow socket soon, so avoid the additional indirection via at connect time, calling directly into the sock-level connect() ops. The sk-level connect never return -EINPROGRESS, cleanup the error path accordingly. Additionally, the ssk status on error is always TCP_CLOSE. Avoid unneeded access to the subflow sk state. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14mptcp: avoid unneeded mptcp_token_destroy() callsPaolo Abeni1-2/+2
The MPTCP protocol currently clears the msk token both at connect() and listen() time. That is needed to deal with failing connect() calls that can create a new token while leaving the sk in TCP_CLOSE,SS_UNCONNECTED status and thus allowing later connect() and/or listen() calls. Let's deal with such failures explicitly, cleaning the token in a timely manner and avoid the confusing early mptcp_token_destroy(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13net: Remove leftover include from nftables.hJörn-Thorben Hinz1-2/+0
Commit db3685b4046f ("net: remove obsolete members from struct net") removed the uses of struct list_head from this header, without removing the corresponding included header. Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13Merge tag 'for-net-next-2023-08-11' of ↵David S. Miller41-1002/+2370
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next bluetooth-next pull request for net-next: - Add new VID/PID for Mediatek MT7922 - Add support multiple BIS/BIG - Add support for Intel Gale Peak - Add support for Qualcomm WCN3988 - Add support for BT_PKT_STATUS for ISO sockets - Various fixes for experimental ISO support - Load FW v2 for RTL8852C - Add support for NXP AW693 chipset - Add support for Mediatek MT2925
2023-08-13net: pcs: lynx: fix lynx_pcs_link_up_sgmii() not doing anything in ↵Vladimir Oltean1-1/+1
fixed-link mode lynx_pcs_link_up_sgmii() is supposed to update the PCS speed and duplex for the non-inband operating modes, and prior to the blamed commit, it did just that, but a mistake sneaked into the conversion and reversed the condition. It is easy for this to go undetected on platforms that also initialize the PCS in the bootloader, because Linux doesn't reset it (although maybe it should). The nature of the bug is that phylink will not touch the IF_MODE_HALF_DUPLEX | IF_MODE_SPEED_MSK fields when it should, and it will apparently keep working if the previous values set by the bootloader were correct. Fixes: c689a6528c22 ("net: pcs: lynx: update PCS driver to use neg_mode") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13Merge branch 'net-pci_dev_id'David S. Miller5-10/+5
Zheng Zengkai says: ==================== net: Use pci_dev_id() to simplify the code PCI core API pci_dev_id() can be used to get the BDF number for a pci device. Use the API to simplify the code. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13net: ngbe: use pci_dev_id() to simplify the codeZheng Zengkai1-2/+1
PCI core API pci_dev_id() can be used to get the BDF number for a pci device. We don't need to compose it manually. Use pci_dev_id() to simplify the code a little bit. Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13net: tc35815: Use pci_dev_id() to simplify the codeZheng Zengkai1-2/+1
PCI core API pci_dev_id() can be used to get the BDF number for a pci device. We don't need to compose it manually. Use pci_dev_id() to simplify the code a little bit. Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13net: smsc: Use pci_dev_id() to simplify the codeZheng Zengkai1-2/+1
PCI core API pci_dev_id() can be used to get the BDF number for a pci device. We don't need to compose it manually. Use pci_dev_id() to simplify the code a little bit. Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13tg3: Use pci_dev_id() to simplify the codeZheng Zengkai1-2/+1
PCI core API pci_dev_id() can be used to get the BDF number for a pci device. We don't need to compose it manually. Use pci_dev_id() to simplify the code a little bit. Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13et131x: Use pci_dev_id() to simplify the codeZheng Zengkai1-2/+1
PCI core API pci_dev_id() can be used to get the BDF number for a pci device. We don't need to compose it manually. Use pci_dev_id() to simplify the code a little bit. Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13net: e1000: Remove unused declarationsYue Haibing2-4/+0
Commit 675ad47375c7 ("e1000: Use netdev_<level>, pr_<level> and dev_<level>") declared but never implemented e1000_get_hw_dev_name(). Commit 1532ecea1deb ("e1000: drop dead pcie code from e1000") removed e1000_check_mng_mode()/e1000_blink_led_start() but not the declarations. Commit c46b59b241ec ("e1000: Remove unused function e1000_mta_set.") removed e1000_mta_set() but not its declaration. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13net/rds: Remove unused function declarationsYue Haibing3-5/+0
Commit 39de8281791c ("RDS: Main header file") declared but never implemented rds_trans_init() and rds_trans_exit(), remove it. Commit d37c9359056f ("RDS: Move loop-only function to loop.c") removed the implementation rds_message_inc_free() but not the declaration. Since commit 55b7ed0b582f ("RDS: Common RDMA transport code") rds_rdma_conn_connect() is never implemented and used. rds_tcp_map_seq() is never implemented and used since commit 70041088e3b9 ("RDS: Add TCP transport to RDS"). Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13netlink: convert nlk->flags to atomic flagsEric Dumazet3-74/+48
sk_diag_put_flags(), netlink_setsockopt(), netlink_getsockopt() and others use nlk->flags without correct locking. Use set_bit(), clear_bit(), test_bit(), assign_bit() to remove data-races. Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13Merge branch 'tcp-oom-probe'David S. Miller4-22/+56
Menglong Dong says: ==================== net: tcp: support probing OOM In this series, we make some small changes to make the tcp retransmission become zero-window probes if the receiver drops the skb because of memory pressure. In the 1st patch, we reply a zero-window ACK if the skb is dropped because out of memory, instead of dropping the skb silently. In the 2nd patch, we allow a zero-window ACK to update the window. In the 3rd patch, fix unexcepted socket die when snd_wnd is 0 in tcp_retransmit_timer(). In the 4th patch, we refactor the debug message in tcp_retransmit_timer() to make it more correct. After these changes, the tcp can probe the OOM of the receiver forever. Changes since v3: - make the timeout "2 * TCP_RTO_MAX" in the 3rd patch - tp->retrans_stamp is not based on jiffies and can't be compared with icsk->icsk_timeout in the 3rd patch. Fix it. - introduce the 4th patch Changes since v2: - refactor the code to avoid code duplication in the 1st patch - use after() instead of max() in tcp_rtx_probe0_timed_out() Changes since v1: - send 0 rwin ACK for the receive queue empty case when necessary in the 1st patch - send the ACK immediately by using the ICSK_ACK_NOW flag in the 1st patch - consider the case of the connection restart from idle, as Neal comment, in the 3rd patch ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13net: tcp: refactor the dbg message in tcp_retransmit_timer()Menglong Dong1-10/+13
The debug message in tcp_retransmit_timer() is slightly wrong, because they could be printed even if we did not receive a new ACK packet from the remote peer. Change it to probing zero-window, as it is a expected case now. The description may be not correct. Adding the duration since the last ACK we received, and the duration of the retransmission, which are useful for debugging. And the message now like this: Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 209ms ago, lasting 209ms Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 404ms ago, lasting 408ms Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 812ms ago, lasting 1224ms Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13net: tcp: fix unexcepted socket die when snd_wnd is 0Menglong Dong1-1/+17
In tcp_retransmit_timer(), a window shrunk connection will be regarded as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not right all the time. The retransmits will become zero-window probes in tcp_retransmit_timer() if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to TCP_RTO_MAX sooner or later. However, the timer can be delayed and be triggered after 122877ms, not TCP_RTO_MAX, as I tested. Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true once the RTO come up to TCP_RTO_MAX, and the socket will die. Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout', which is exact the timestamp of the timeout. However, "tp->rcv_tstamp" can restart from idle, then tp->rcv_tstamp could already be a long time (minutes or hours) in the past even on the first RTO. So we double check the timeout with the duration of the retransmission. Meanwhile, making "2 * TCP_RTO_MAX" as the timeout to avoid the socket dying too soon. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/ Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13net: tcp: allow zero-window ACK update the windowMenglong Dong1-1/+1
Fow now, an ACK can update the window in following case, according to the tcp_may_update_window(): 1. the ACK acknowledged new data 2. the ACK has new data 3. the ACK expand the window and the seq of it is valid Now, we allow the ACK update the window if the window is 0, and the seq/ack of it is valid. This is for the case that the receiver replies an zero-window ACK when it is under memory stress and can't queue the new data. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-13net: tcp: send zero-window ACK when no memoryMenglong Dong3-10/+25
For now, skb will be dropped when no memory, which makes client keep retrans util timeout and it's not friendly to the users. In this patch, we reply an ACK with zero-window in this case to update the snd_wnd of the sender to 0. Therefore, the sender won't timeout the connection and will probe the zero-window with the retransmits. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-11bpftool: fix perf help messageDaniel T. Lee1-1/+1
Currently, bpftool perf subcommand has typo with the help message. $ tools/bpf/bpftool/bpftool perf help Usage: bpftool perf { show | list } bpftool perf help } Since this bpftool perf subcommand help message has the extra bracket, this commit fix the typo by removing the extra bracket. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/r/20230811121603.17429-1-danieltimlee@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-11bpf: Remove unused declaration bpf_link_new_file()Yue Haibing1-1/+0
Commit a3b80e107894 ("bpf: Allocate ID for bpf_link") removed the implementation but not the declaration. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20230809140556.45836-1-yuehaibing@huawei.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-11Merge branch '40GbE' of ↵Jakub Kicinski2-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== i40e: Replace one-element arrays with flexible-array members Replace one-element arrays with flexible-array members in multiple structures. This results in no differences in binary output. * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: i40e: Replace one-element array with flex-array member in struct i40e_profile_aq_section i40e: Replace one-element array with flex-array member in struct i40e_section_table i40e: Replace one-element array with flex-array member in struct i40e_profile_segment i40e: Replace one-element array with flex-array member in struct i40e_package_header ==================== Link: https://lore.kernel.org/r/20230810175302.1964182-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-11ethernet: atarilance: mark init function staticArnd Bergmann1-1/+1
The init function is only referenced locally, so it should be static to avoid this warning: drivers/net/ethernet/amd/atarilance.c:370:28: error: no previous prototype for 'atarilance_probe' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20230810122528.1220434-2-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-11ethernet: ldmvsw: mark ldmvsw_open() staticArnd Bergmann1-2/+1
The function is exported for no reason and should just be static: drivers/net/ethernet/sun/ldmvsw.c:127:5: error: no previous prototype for 'ldmvsw_open' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Simon Horman <horms@kernel.org> # build-tested Link: https://lore.kernel.org/r/20230810122528.1220434-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-11Bluetooth: hci_conn: avoid checking uninitialized CIG/CIS idsPauli Virtanen2-2/+4
The CIS/CIG ids of ISO connections are defined only when the connection is unicast. Fix the lookup functions to check for unicast first. Ensure CIG/CIS IDs have valid value also in state BT_OPEN. Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: hci_event: drop only unbound CIS if Set CIG Parameters failsPauli Virtanen1-5/+24
When user tries to connect a new CIS when its CIG is not configurable, that connection shall fail, but pre-existing connections shall not be affected. However, currently hci_cc_le_set_cig_params deletes all CIS of the CIG on error so it doesn't work, even though controller shall not change CIG/CIS configuration if the command fails. Fix by failing on command error only the connections that are not yet bound, so that we keep the previous CIS configuration like the controller does. Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: btrtl: Load FW v2 otherwise FW v1 for RTL8852CMax Chou1-25/+45
In this commit, prefer to load FW v2 if available. Fallback to FW v1 otherwise. This behavior is only for RTL8852C. Fixes: 9a24ce5e29b1 ("Bluetooth: btrtl: Firmware format v2 support") Cc: stable@vger.kernel.org Suggested-by: Juerg Haefliger <juerg.haefliger@canonical.com> Tested-by: Hilda Wu <hildawu@realtek.com> Signed-off-by: Max Chou <max.chou@realtek.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: Remove unnecessary NULL check before vfree()Ziyang Xuan1-2/+1
Remove unnecessary NULL check which causes coccinelle warning: net/bluetooth/coredump.c:104:2-7: WARNING: NULL check before some freeing functions is not needed. Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_add_adv_monitor()Manish Mandlik1-1/+1
KSAN reports use-after-free in hci_add_adv_monitor(). While adding an adv monitor, hci_add_adv_monitor() calls -> msft_add_monitor_pattern() calls -> msft_add_monitor_sync() calls -> msft_le_monitor_advertisement_cb() calls in an error case -> hci_free_adv_monitor() which frees the *moniter. This is referenced by bt_dev_dbg() in hci_add_adv_monitor(). Fix the bt_dev_dbg() by using handle instead of monitor->handle. Fixes: b747a83690c8 ("Bluetooth: hci_sync: Refactor add Adv Monitor") Signed-off-by: Manish Mandlik <mmandlik@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: Fix potential use-after-free when clear keysMin Li1-8/+8
Similar to commit c5d2b6fa26b5 ("Bluetooth: Fix use-after-free in hci_remove_ltk/hci_remove_irk"). We can not access k after kfree_rcu() call. Fixes: d7d41682efc2 ("Bluetooth: Fix Suspicious RCU usage warnings") Signed-off-by: Min Li <lm0963hack@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: hci_sync: Introduce PTR_UINT/UINT_PTR macrosLuiz Augusto von Dentz3-11/+15
This introduces PTR_UINT/UINT_PTR macros and replace the use of PTR_ERR/ERR_PTR. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: hci_conn: Fix hci_le_set_cig_paramsLuiz Augusto von Dentz1-94/+63
When running with concurrent task only one CIS was being assigned so this attempts to rework the way the PDU is constructed so it is handled later at the callback instead of in place. Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: hci_core: Make hci_is_le_conn_scanning publicLuiz Augusto von Dentz3-42/+21
This moves hci_is_le_conn_scanning to hci_core.h so it can be used by different files without having to duplicate its code. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: hci_conn: Fix not allowing valid CIS IDLuiz Augusto von Dentz1-3/+6
Only the number of CIS shall be limited to 0x1f, the CIS ID in the other hand is up to 0xef. Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: hci_conn: Fix modifying handle while abortingLuiz Augusto von Dentz3-18/+39
This introduces hci_conn_set_handle which takes care of verifying the conditions where the hci_conn handle can be modified, including when hci_conn_abort has been called and also checks that the handles is valid as well. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: ISO: Fix not checking for valid CIG/CIS IDsLuiz Augusto von Dentz1-0/+6
Valid range of CIG/CIS are 0x00 to 0xEF, so this checks they are properly checked before attempting to use HCI_OP_LE_SET_CIG_PARAMS. Fixes: ccf74f2390d6 ("Bluetooth: Add BTPROTO_ISO socket type") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: hci_sync: Fix UAF on hci_abort_conn_syncLuiz Augusto von Dentz1-16/+29
Connections may be cleanup while waiting for the commands to complete so this attempts to check if the connection handle remains valid in case of errors that would lead to call hci_conn_failed: BUG: KASAN: slab-use-after-free in hci_conn_failed+0x1f/0x160 Read of size 8 at addr ffff888001376958 by task kworker/u3:0/52 CPU: 0 PID: 52 Comm: kworker/u3:0 Not tainted 6.5.0-rc1-00527-g2dfe76d58d3a #5615 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014 Workqueue: hci0 hci_cmd_sync_work Call Trace: <TASK> dump_stack_lvl+0x1d/0x70 print_report+0xce/0x620 ? __virt_addr_valid+0xd4/0x150 ? hci_conn_failed+0x1f/0x160 kasan_report+0xd1/0x100 ? hci_conn_failed+0x1f/0x160 hci_conn_failed+0x1f/0x160 hci_abort_conn_sync+0x237/0x360 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: hci_sync: Fix handling of HCI_OP_CREATE_CONN_CANCELLuiz Augusto von Dentz1-0/+11
When sending HCI_OP_CREATE_CONN_CANCEL it shall Wait for HCI_EV_CONN_COMPLETE, not HCI_EV_CMD_STATUS, when the reason is anything but HCI_ERROR_REMOTE_POWER_OFF. This reason is used when suspending or powering off, where we don't want to wait for the peer's response. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: hci_sync: delete CIS in BT_OPEN/CONNECT/BOUND when abortingPauli Virtanen1-7/+9
Dropped CIS that are in state BT_OPEN/BT_BOUND, and in state BT_CONNECT with HCI_CONN_CREATE_CIS unset, should be cleaned up immediately. Closing CIS ISO sockets should result to the hci_conn be deleted, so that potentially pending CIG removal can run. hci_abort_conn cannot refer to them by handle, since their handle is still unset if Set CIG Parameters has not yet completed. This fixes CIS not being terminated if the socket is shut down immediately after connection, so that the hci_abort_conn runs before Set CIG Parameters completes. See new BlueZ test "ISO Connect Close - Success" Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: ISO: handle bound CIS cleanup via hci_connPauli Virtanen2-13/+6
Calling hci_conn_del in __iso_sock_close is invalid. It needs hdev->lock, but it cannot be acquired there due to lock ordering. Fix this by doing cleanup via hci_conn_drop. Return hci_conn with refcount 1 from hci_bind_cis and hci_connect_cis, so that the iso_conn always holds one reference. This also fixes refcounting when error handling. Since hci_conn_abort shall handle termination of connections in any state properly, we can handle BT_CONNECT socket state in the same way as BT_CONNECTED. Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: Remove unused declaration amp_read_loc_info()Yue Haibing1-1/+0
This is introduced in commit 903e45411099 but was never implemented. Fixes: 903e45411099 ("Bluetooth: AMP: Use HCI cmd to Read Loc AMP Assoc") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: btusb: Move btusb_recv_event_intel to btintelLuiz Augusto von Dentz3-74/+76
btusb_recv_event_intel is specific to Intel controllers therefore it shall be placed inside btintel.c so btusb don't have a mix of vendor specific code with the generic parts. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11Bluetooth: btqca: Add WCN3988 supportLuca Weiss3-4/+33
Add support for the Bluetooth chip codenamed APACHE which is part of WCN3988. The firmware for this chip has a slightly different naming scheme compared to most others. For ROM Version 0x0200 we need to use apbtfw10.tlv + apnv10.bin and for ROM version 0x201 apbtfw11.tlv + apnv11.bin Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-08-11dt-bindings: net: qualcomm: Add WCN3988Luca Weiss1-0/+2
Add the compatible for the Bluetooth part of the Qualcomm WCN3988 chipset. Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>