aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux/netdevice.h
AgeCommit message (Collapse)AuthorFilesLines
2024-05-13net: revert partially applied PHY topology seriesJakub Kicinski1-3/+1
The series is causing issues with PHY drivers built as modules. Since it was only partially applied and the merge window has opened let's revert and try again for v6.11. Revert 6916e461e793 ("net: phy: Introduce ethernet link topology representation") Revert 0ec5ed6c130e ("net: sfp: pass the phy_device when disconnecting an sfp module's PHY") Revert e75e4e074c44 ("net: phy: add helpers to handle sfp phy connect/disconnect") Revert fdd353965b52 ("net: sfp: Add helper to return the SFP bus name") Revert 841942bc6212 ("net: ethtool: Allow passing a phy index for some commands") Link: https://lore.kernel.org/all/171242462917.4000.9759453824684907063.git-patchwork-notify@kernel.org/ Link: https://lore.kernel.org/all/20240507102822.2023826-1-maxime.chevallier@bootlin.com/ Link: https://lore.kernel.org/r/20240513154156.104281-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-05queue_api: define queue apiMina Almasry1-0/+3
This API enables the net stack to reset the queues used for devmem TCP. Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Shailend Chand <shailend@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-03net: no longer acquire RTNL in threaded_show()Eric Dumazet1-1/+1
dev->threaded can be read locklessly, if we add corresponding READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240502173926.2010646-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-01net: Protect dev->name by seqlock.Kuniyuki Iwashima1-0/+1
We will convert ioctl(SIOCGARP) to RCU, and then we need to copy dev->name which is currently protected by rtnl_lock(). This patch does the following: 1) Add seqlock netdev_rename_lock to protect dev->name 2) Add netdev_copy_name() that copies dev->name to buffer under netdev_rename_lock 3) Use netdev_copy_name() in netdev_get_name() and drop devnet_rename_sem Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/CANn89iJEWs7AYSJqGCUABeVqOCTkErponfZdT5kV-iD=-SajnQ@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-24net: create a dummy net_device allocatorBreno Leitao1-0/+3
It is impossible to use init_dummy_netdev together with alloc_netdev() as the 'setup' argument. This is because alloc_netdev() initializes some fields in the net_device structure, and later init_dummy_netdev() memzero them all. This causes some problems as reported here: https://lore.kernel.org/all/20240322082336.49f110cc@kernel.org/ Split the init_dummy_netdev() function in two. Create a new function called init_dummy_netdev_core() that does not memzero the net_device structure. Then have init_dummy_netdev() memzero-ing and calling init_dummy_netdev_core(), keeping the old behaviour. init_dummy_netdev_core() is the new function that could be called as an argument for alloc_netdev(). Also, create a helper to allocate and initialize dummy net devices, leveraging init_dummy_netdev_core() as the setup argument. This function basically simplify the allocation of dummy devices, by allocating and initializing it. Freeing the device continue to be done through free_netdev() Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-06net: phy: Introduce ethernet link topology representationMaxime Chevallier1-1/+3
Link topologies containing multiple network PHYs attached to the same net_device can be found when using a PHY as a media converter for use with an SFP connector, on which an SFP transceiver containing a PHY can be used. With the current model, the transceiver's PHY can't be used for operations such as cable testing, timestamping, macsec offload, etc. The reason being that most of the logic for these configuration, coming from either ethtool netlink or ioctls tend to use netdev->phydev, which in multi-phy systems will reference the PHY closest to the MAC. Introduce a numbering scheme allowing to enumerate PHY devices that belong to any netdev, which can in turn allow userspace to take more precise decisions with regard to each PHY's configuration. The numbering is maintained per-netdev, in a phy_device_list. The numbering works similarly to a netdevice's ifindex, with identifiers that are only recycled once INT_MAX has been reached. This prevents races that could occur between PHY listing and SFP transceiver removal/insertion. The identifiers are assigned at phy_attach time, as the numbering depends on the netdevice the phy is attached to. The PHY index can be re-used for PHYs that are persistent. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-03netdevice: add DEFINE_FREE() for dev_putJohannes Berg1-0/+2
For short netdev holds within a function there are still a lot of users of dev_put() rather than netdev_put(). Add DEFINE_FREE() to allow making those safer. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: rps: move received_rps field to a better locationEric Dumazet1-1/+1
Commit 14d898f3c1b3 ("dev: Move received_rps counter next to RPS members in softnet data") was unfortunate: received_rps is dirtied by a cpu and never read by other cpus in fast path. Its presence in the hot RPS cache line (shared by many cpus) is hurting RPS/RFS performance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: rps: change input_queue_tail_incr_save()Eric Dumazet1-15/+0
input_queue_tail_incr_save() is incrementing the sd queue_tail and save it in the flow last_qtail. Two issues here : - no lock protects the write on last_qtail, we should use appropriate annotations. - We can perform this write after releasing the per-cpu backlog lock, to decrease this lock hold duration (move away the cache line miss) Also move input_queue_head_incr() and rps helpers to include/net/rps.h, while adding rps_ prefix to better reflect their role. v2: Fixed a build issue (Jakub and kernel build bots) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: make softnet_data.dropped an atomic_tEric Dumazet1-1/+2
If under extreme cpu backlog pressure enqueue_to_backlog() has to drop a packet, it could do this without dirtying a cache line and potentially slowing down the target cpu. Move sd->dropped into a separate cache line, and make it atomic. In non pressure mode, this field is not touched, no need to consume valuable space in a hot cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: move dev_xmit_recursion() helpers to net/core/dev.hEric Dumazet1-17/+0
Move dev_xmit_recursion() and friends to net/core/dev.h They are only used from net/core/dev.c and net/core/filter.c. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01net: move kick_defer_list_purge() to net/core/dev.hEric Dumazet1-1/+0
kick_defer_list_purge() is defined in net/core/dev.c and used from net/core/skubff.c Because we need softnet_data, include <linux/netdevice.h> from net/core/dev.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-01ip_tunnel: use a separate struct to store tunnel params in the kernelAlexander Lobakin1-3/+4
Unlike IPv6 tunnels which use purely-kernel __ip6_tnl_parm structure to store params inside the kernel, IPv4 tunnel code uses the same ip_tunnel_parm which is being used to talk with the userspace. This makes it difficult to alter or add any fields or use a different format for whatever data. Define struct ip_tunnel_parm_kern, a 1:1 copy of ip_tunnel_parm for now, and use it throughout the code. Define the pieces, where the copy user <-> kernel happens, as standalone functions, and copy the data there field-by-field, so that the kernel-side structure could be easily modified later on and the users wouldn't have to care about this. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-26net: Use backlog-NAPI to clean up the defer_list.Sebastian Andrzej Siewior1-0/+1
The defer_list is a per-CPU list which is used to free skbs outside of the socket lock and on the CPU on which they have been allocated. The list is processed during NAPI callbacks so ideally the list is cleaned up. Should the amount of skbs on the list exceed a certain water mark then the softirq is triggered remotely on the target CPU by invoking a remote function call. The raise of the softirqs via a remote function call leads to waking the ksoftirqd on PREEMPT_RT which is undesired. The backlog-NAPI threads already provide the infrastructure which can be utilized to perform the cleanup of the defer_list. The NAPI state is updated with the input_pkt_queue.lock acquired. It order not to break the state, it is needed to also wake the backlog-NAPI thread with the lock held. This requires to acquire the use the lock in rps_lock_irq*() if the backlog-NAPI threads are used even with RPS disabled. Move the logic of remotely starting softirqs to clean up the defer_list into kick_defer_list_purge(). Make sure a lock is held in rps_lock_irq*() if backlog-NAPI threads are used. Schedule backlog-NAPI for defer_list cleanup if backlog-NAPI is available. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-19net: move dev->state into net_device_read_txrx groupEric Dumazet1-1/+1
dev->state can be read in rx and tx fast paths. netif_running() which needs dev->state is called from - enqueue_to_backlog() [RX path] - __dev_direct_xmit() [TX path] Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240314200845.3050179-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-11net: gro: move two declarations to include/net/gro.hEric Dumazet1-2/+0
Move gro_find_receive_by_type() and gro_find_complete_by_type() to include/net/gro.h where they belong. Also use _NET_GRO_H instead of _NET_IPV6_GRO_H to protect include/net/gro.h from multiple inclusions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240308102230.296224-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07netdev: add per-queue statisticsJakub Kicinski1-0/+3
The ethtool-nl family does a good job exposing various protocol related and IEEE/IETF statistics which used to get dumped under ethtool -S, with creative names. Queue stats don't have a netlink API, yet, and remain a lion's share of ethtool -S output for new drivers. Not only is that bad because the names differ driver to driver but it's also bug-prone. Intuitively drivers try to report only the stats for active queues, but querying ethtool stats involves multiple system calls, and the number of stats is read separately from the stats themselves. Worse still when user space asks for values of the stats, it doesn't inform the kernel how big the buffer is. If number of stats increases in the meantime kernel will overflow user buffer. Add a netlink API for dumping queue stats. Queue information is exposed via the netdev-genl family, so add the stats there. Support per-queue and sum-for-device dumps. Latter will be useful when subsequent patches add more interesting common stats than just bytes and packets. The API does not currently distinguish between HW and SW stats. The expectation is that the source of the stats will either not matter much (good packets) or be obvious (skb alloc errors). Acked-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://lore.kernel.org/r/20240306195509.1502746-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07net: introduce include/net/rps.hEric Dumazet1-82/+0
Move RPS related structures and helpers from include/linux/netdevice.h and include/net/sock.h to a new include file. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07net: move dev_rx_weight to net_hotdataEric Dumazet1-2/+0
dev_rx_weight is read from process_backlog(). Move it to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07net: move dev_tx_weight to net_hotdataEric Dumazet1-1/+0
dev_tx_weight is used in tx fast path. Move it to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07net: move netdev_max_backlog to net_hotdataEric Dumazet1-1/+0
netdev_max_backlog is used in rx fat path. Move it to net_hodata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07net: move ptype_all into net_hotdataEric Dumazet1-1/+0
ptype_all is used in rx/tx fast paths. Move it to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07net: introduce struct net_hotdataEric Dumazet1-1/+0
Instead of spreading networking critical fields all over the places, add a custom net_hotdata structure so that we can precisely control its layout. In this first patch, move : - gro_normal_batch used in rx (GRO stack) - offload_base used in rx and tx (GRO and TSO stacks) Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-4/+10
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/page_pool_user.c 0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors") 429679dcf7d9 ("page_pool: fix netlink dump stop/resume") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-05dpll: move all dpll<>netdev helpers to dpll codeJakub Kicinski1-4/+0
Older versions of GCC really want to know the full definition of the type involved in rcu_assign_pointer(). struct dpll_pin is defined in a local header, net/core can't reach it. Move all the netdev <> dpll code into dpll, where the type is known. Otherwise we'd need multiple function calls to jump between the compilation units. This is the same problem the commit under fixes was trying to address, but with rcu_assign_pointer() not rcu_dereference(). Some of the exports are not needed, networking core can't be a module, we only need exports for the helpers used by drivers. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/all/35a869c8-52e8-177-1d4d-e57578b99b6@linux-m68k.org/ Fixes: 640f41ed33b5 ("dpll: fix build failure due to rcu_dereference_check() on unknown type") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240305013532.694866-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-01net: bql: fix building with BQL disabledArnd Bergmann1-0/+10
It is now possible to disable BQL, but that causes the cpsw driver to break: drivers/net/ethernet/ti/am65-cpsw-nuss.c:297:28: error: no member named 'dql' in 'struct netdev_queue' 297 | dql_avail(&netif_txq->dql), There is already a helper function in net/sch_generic.h that could be used to help here. Move its implementation into the common linux/netdevice.h along with the other bql interfaces and change both users over to the new interface. Fixes: ea7f3cfaa588 ("net: bql: allow the config to be disabled") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-10/+1
Cross-merge networking fixes after downstream PR. Conflicts: net/mptcp/protocol.c adf1bb78dab5 ("mptcp: fix snd_wnd initialization for passive socket") 9426ce476a70 ("mptcp: annotate lockless access for RX path fields") https://lore.kernel.org/all/20240228103048.19255709@canb.auug.org.au/ Adjacent changes: drivers/dpll/dpll_core.c 0d60d8df6f49 ("dpll: rely on rcu for netdev_dpll_pin()") e7f8df0e81bf ("dpll: move xa_erase() call in to match dpll_pin_alloc() error path order") drivers/net/veth.c 1ce7d306ea63 ("veth: try harder when allocating queue memory") 0bef512012b1 ("net: add netdev_lockdep_set_classes() to virtual drivers") drivers/net/wireless/intel/iwlwifi/mvm/d3.c 8c9bef26e98b ("wifi: iwlwifi: mvm: d3: implement suspend with MLO") 78f65fbf421a ("wifi: iwlwifi: mvm: ensure offloading TID queue exists") net/wireless/nl80211.c f78c1375339a ("wifi: nl80211: reject iftype change with mesh ID change") 414532d8aa89 ("wifi: cfg80211: use IEEE80211_MAX_MESH_ID_LEN appropriately") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-26dpll: rely on rcu for netdev_dpll_pin()Eric Dumazet1-10/+1
This fixes a possible UAF in if_nlmsg_size(), which can run without RTNL. Add rcu protection to "struct dpll_pin" Move netdev_dpll_pin() from netdevice.h to dpll.h to decrease name pollution. Note: This looks possible to no longer acquire RTNL in netdev_dpll_pin_assign() later in net-next. v2: do not force rcu_read_lock() in rtnl_dpll_pin_size() (Jiri Pirko) Fixes: 5f1842692880 ("netdev: expose DPLL pin handle for netdevice") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240223123208.3543319-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-26ipv6: prepare inet6_fill_ifinfo() for RCU protectionEric Dumazet1-2/+4
We want to use RCU protection instead of RTNL for inet6_fill_ifinfo(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-5/+5
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/dev.c 9f30831390ed ("net: add rcu safety to rtnl_prop_list_size()") 723de3ebef03 ("net: free altname using an RCU callback") net/unix/garbage.c 11498715f266 ("af_unix: Remove io_uring code for GC.") 25236c91b5ab ("af_unix: Fix task hung while purging oob_skb in GC.") drivers/net/ethernet/renesas/ravb_main.c ed4adc07207d ("net: ravb: Count packets instead of descriptors in GbEth RX path" ) c2da9408579d ("ravb: Add Rx checksum offload support for GbEth") net/mptcp/protocol.c bdd70eb68913 ("mptcp: drop the push_pending field") 28e5c1380506 ("mptcp: annotate lockless accesses around read-mostly fields") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-14net: remove dev_base_lockEric Dumazet1-2/+0
dev_base_lock is not needed anymore, all remaining users also hold RTNL. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net: add netdev_set_operstate() helperEric Dumazet1-1/+1
dev_base_lock is going away, add netdev_set_operstate() helper so that hsr does not have to know core internals. Remove dev_base_lock acquisition from rfc2863_policy() v3: use an "unsigned int" for dev->operstate, so that try_cmpxchg() can work on all arches. ( https://lore.kernel.org/oe-kbuild-all/202402081918.OLyGaea3-lkp@intel.com/ ) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-14net: convert dev->reg_state to u8Eric Dumazet1-9/+14
Prepares things so that dev->reg_state reads can be lockless, by adding WRITE_ONCE() on write side. READ_ONCE()/WRITE_ONCE() do not support bitfields. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-13xdp: rely on skb pointer reference in do_xdp_generic and ↵Lorenzo Bianconi1-1/+1
netif_receive_generic_xdp Rely on skb pointer reference instead of the skb pointer in do_xdp_generic and netif_receive_generic_xdp routine signatures. This is a preliminary patch to add multi-buff support for xdp running in generic mode where we will need to reallocate the skb to avoid linearization and we will need to make it visible to do_xdp_generic() caller. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/c09415b1f48c8620ef4d76deed35050a7bddf7c2.1707729884.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-12net-device: move lstats in net_device_read_txrxEric Dumazet1-5/+5
dev->lstats is notably used from loopback ndo_start_xmit() and other virtual drivers. Per cpu stats updates are dirtying per-cpu data, but the pointer itself is read-only. Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: Simon Horman <horms@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-07Merge tag 'mlx5-updates-2024-02-01' of ↵Jakub Kicinski1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2024-02-01 1) IPSec global stats for xfrm and mlx5 2) XSK memory improvements for non-linear SKBs 3) Software steering debug dump to use seq_file ops 4) Various code clean-ups * tag 'mlx5-updates-2024-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: XDP, Exclude headroom and tailroom from memory calculations net/mlx5e: XSK, Exclude tailroom from non-linear SKBs memory calculations net/mlx5: DR, Change SWS usage to debug fs seq_file interface net/mlx5: Change missing SyncE capability print to debug net/mlx5: Remove initial segmentation duplicate definitions net/mlx5: Return specific error code for timeout on wait_fw_init net/mlx5: SF, Stop waiting for FW as teardown was called net/mlx5: remove fw reporter dump option for non PF net/mlx5: remove fw_fatal reporter dump option for non PF net/mlx5: Rename mlx5_sf_dev_remove Documentation: Fix counter name of mlx5 vnic reporter net/mlx5e: Delete obsolete IPsec code net/mlx5e: Connect mlx5 IPsec statistics with XFRM core xfrm: get global statistics from the offloaded device xfrm: generalize xdo_dev_state_update_curlft to allow statistics update ==================== Link: https://lore.kernel.org/r/20240206005527.1353368-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-07net: Do not return value from init_dummy_netdev()Amit Cohen1-1/+1
init_dummy_netdev() always returns zero and all the callers do not check the returned value. Set the function to not return value, as it is not really used today. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240205103022.440946-1-amcohen@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-05xfrm: generalize xdo_dev_state_update_curlft to allow statistics updateLeon Romanovsky1-1/+1
In order to allow drivers to fill all statistics, change the name of xdo_dev_state_update_curlft to be xdo_dev_state_update_stats. Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-01-04Revert "Introduce PHY listing and link_topology tracking"Jakub Kicinski1-3/+1
This reverts commit 32bb4515e34469975abc936deb0a116c4a445817. This reverts commit d078d480639a4f3b5fc2d56247afa38e0956483a. This reverts commit fcc4b105caa4b844bf043375bf799c20a9c99db1. This reverts commit 345237dbc1bdbb274c9fb9ec38976261ff4a40b8. This reverts commit 7db69ec9cfb8b4ab50420262631fb2d1908b25bf. This reverts commit 95132a018f00f5dad38bdcfd4180d1af955d46f6. This reverts commit 63d5eaf35ac36cad00cfb3809d794ef0078c822b. This reverts commit c29451aefcb42359905d18678de38e52eccb3bb5. This reverts commit 2ab0edb505faa9ac90dee1732571390f074e8113. This reverts commit dedd702a35793ab462fce4c737eeba0badf9718e. This reverts commit 034fcc210349b873ece7356905be5c6ca11eef2a. This reverts commit 9c5625f559ad6fe9f6f733c11475bf470e637d34. This reverts commit 02018c544ef113e980a2349eba89003d6f399d22. Looks like we need more time for reviews, and incremental changes will be hard to make sense of. So revert. Link: https://lore.kernel.org/all/ZZP6FV5sXEf+xd58@shell.armlinux.org.uk/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net-device: move xdp_prog to net_device_read_rxEric Dumazet1-1/+1
xdp_prog is used in receive path, both from XDP enabled drivers and from netif_elide_gro(). This patch also removes two 4-bytes holes. Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240102162220.750823-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-02net-device: move gso_partial_features to net_device_read_txEric Dumazet1-1/+1
dev->gso_partial_features is read from tx fast path for GSO packets. Move it to appropriate section to avoid a cache line miss. Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-01net: phy: Introduce ethernet link topology representationMaxime Chevallier1-1/+3
Link topologies containing multiple network PHYs attached to the same net_device can be found when using a PHY as a media converter for use with an SFP connector, on which an SFP transceiver containing a PHY can be used. With the current model, the transceiver's PHY can't be used for operations such as cable testing, timestamping, macsec offload, etc. The reason being that most of the logic for these configuration, coming from either ethtool netlink or ioctls tend to use netdev->phydev, which in multi-phy systems will reference the PHY closest to the MAC. Introduce a numbering scheme allowing to enumerate PHY devices that belong to any netdev, which can in turn allow userspace to take more precise decisions with regard to each PHY's configuration. The numbering is maintained per-netdev, in a phy_device_list. The numbering works similarly to a netdevice's ifindex, with identifiers that are only recycled once INT_MAX has been reached. This prevents races that could occur between PHY listing and SFP transceiver removal/insertion. The identifiers are assigned at phy_attach time, as the numbering depends on the netdevice the phy is attached to. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-20net: Add MDB bulk deletion device operationIdo Schimmel1-0/+6
Add MDB net device operation that will be invoked by rtnetlink code in response to received 'RTM_DELMDB' messages with the 'NLM_F_BULK' flag set. Subsequent patches will implement the operation in the bridge and VXLAN drivers. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-05net: core: synchronize link-watch when carrier is queriedJohannes Berg1-0/+9
There are multiple ways to query for the carrier state: through rtnetlink, sysfs, and (possibly) ethtool. Synchronize linkwatch work before these operations so that we don't have a situation where userspace queries the carrier state between the driver's carrier off->on transition and linkwatch running and expects it to work, when really (at least) TX cannot work until linkwatch has run. I previously posted a longer explanation of how this applies to wireless [1] but with this wireless can simply query the state before sending data, to ensure the kernel is ready for it. [1] https://lore.kernel.org/all/346b21d87c69f817ea3c37caceb34f1f56255884.camel@sipsolutions.net/ Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231204214706.303c62768415.I1caedccae72ee5a45c9085c5eb49c145ce1c0dd5@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-05net-device: reorganize net_device fast path variablesCoco Li1-53/+64
Reorganize fast path variables on tx-txrx-rx order Fastpath variables end after npinfo. Below data generated with pahole on x86 architecture. Fast path variables span cache lines before change: 12 Fast path variables span cache lines after change: 4 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231204201232.520025-2-lixiaoyan@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-04net: Add NAPI IRQ supportAmritha Nambiar1-0/+6
Add support to associate the interrupt vector number for a NAPI instance. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://lore.kernel.org/r/170147334728.5260.13221803396905901904.stgit@anambiarhost.jf.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-04net: Add queue and napi associationAmritha Nambiar1-0/+8
Add the napi pointer in netdev queue for tracking the napi instance for each queue. This achieves the queue<->napi mapping. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://lore.kernel.org/r/170147331483.5260.15723438819994285695.stgit@anambiarhost.jf.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30Merge tag 'for-netdev' of ↵Jakub Kicinski1-0/+2
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-11-30 We've added 30 non-merge commits during the last 7 day(s) which contain a total of 58 files changed, 1598 insertions(+), 154 deletions(-). The main changes are: 1) Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload, from Stanislav Fomichev with stmmac implementation from Song Yoong Siang. 2) Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques, from Andrii Nakryiko. 3) Add BPF link_info support for uprobe multi link along with bpftool integration for the latter, from Jiri Olsa. 4) Use pkg-config in BPF selftests to determine ld flags which is in particular needed for linking statically, from Akihiko Odaki. 5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18, from Yonghong Song. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits) bpf/tests: Remove duplicate JSGT tests selftests/bpf: Add TX side to xdp_hw_metadata selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP selftests/bpf: Add TX side to xdp_metadata selftests/bpf: Add csum helpers selftests/xsk: Support tx_metadata_len xsk: Add option to calculate TX checksum in SW xsk: Validate xsk_tx_metadata flags xsk: Document tx_metadata_len layout net: stmmac: Add Tx HWTS support to XDP ZC net/mlx5e: Implement AF_XDP TX timestamp and checksum offload tools: ynl: Print xsk-features from the sample xsk: Add TX timestamp and TX checksum offload support xsk: Support tx_metadata_len selftests/bpf: Use pkg-config for libelf selftests/bpf: Override PKG_CONFIG for static builds selftests/bpf: Choose pkg-config for the target bpftool: Add support to display uprobe_multi links selftests/bpf: Add link_info test for uprobe_multi link selftests/bpf: Use bpf_link__destroy in fill_link_info tests ... ==================== Conflicts: Documentation/netlink/specs/netdev.yaml: 839ff60df3ab ("net: page_pool: add nlspec for basic access to page pools") 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support") https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/ While at it also regen, tree is dirty after: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support") looks like code wasn't re-rendered after "render-max" was removed. Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-29xsk: Add TX timestamp and TX checksum offload supportStanislav Fomichev1-0/+2
This change actually defines the (initial) metadata layout that should be used by AF_XDP userspace (xsk_tx_metadata). The first field is flags which requests appropriate offloads, followed by the offload-specific fields. The supported per-device offloads are exported via netlink (new xsk-flags). The offloads themselves are still implemented in a bit of a framework-y fashion that's left from my initial kfunc attempt. I'm introducing new xsk_tx_metadata_ops which drivers are supposed to implement. The drivers are also supposed to call xsk_tx_metadata_request/xsk_tx_metadata_complete in the right places. Since xsk_tx_metadata_{request,_complete} are static inline, we don't incur any extra overhead doing indirect calls. The benefit of this scheme is as follows: - keeps all metadata layout parsing away from driver code - makes it easy to grep and see which drivers implement what - don't need any extra flags to maintain to keep track of what offloads are implemented; if the callback is implemented - the offload is supported (used by netlink reporting code) Two offloads are defined right now: 1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset 2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata area upon completion (tx_timestamp field) XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass metadata pointer). The struct is forward-compatible and can be extended in the future by appending more fields. Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20231127190319.1190813-3-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-28net: page_pool: record pools per netdevJakub Kicinski1-0/+4
Link the page pools with netdevs. This needs to be netns compatible so we have two options. Either we record the pools per netns and have to worry about moving them as the netdev gets moved. Or we record them directly on the netdev so they move with the netdev without any extra work. Implement the latter option. Since pools may outlast netdev we need a place to store orphans. In time honored tradition use loopback for this purpose. Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-4/+26
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/intel/ice/ice_main.c c9663f79cd82 ("ice: adjust switchdev rebuild path") 7758017911a4 ("ice: restore timestamp configuration after device reset") https://lore.kernel.org/all/20231121211259.3348630-1-anthony.l.nguyen@intel.com/ Adjacent changes: kernel/bpf/verifier.c bb124da69c47 ("bpf: keep track of max number of bpf_loop callback iterations") 5f99f312bd3b ("bpf: add register bounds sanity checks and sanitization") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-20net: Move {l,t,d}stats allocation to core and convert veth & vrfDaniel Borkmann1-4/+16
Move {l,t,d}stats allocation to the core and let netdevs pick the stats type they need. That way the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc) - all happening in the core. Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20net, vrf: Move dstats structure to coreDaniel Borkmann1-0/+10
Just move struct pcpu_dstats out of the vrf into the core, and streamline the field names slightly, so they better align with the {t,l}stats ones. No functional change otherwise. A conversion of the u64s to u64_stats_t could be done at a separate point in future. This move is needed as we are moving the {t,l,d}stats allocation/freeing to the core. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231114004220.6495-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-18net: partial revert of the "Make timestamping selectable: seriesJakub Kicinski1-5/+0
Revert following commits: commit acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask") commit 11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer") commit bb8645b00ced ("netlink: specs: Introduce new netlink command to get current timestamp") commit d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers") commit aed5004ee7a0 ("netlink: specs: Introduce new netlink command to list available time stamping layers") commit 51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer") commit 0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC") commit 091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp") commit 152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable") commit ee60ea6be0d3 ("netlink: specs: Introduce time stamping set command") They need more time for reviews. Link: https://lore.kernel.org/all/20231118183529.6e67100c@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-18net: Change the API of PHY default timestamp to MACKory Maincent1-0/+5
Change the API to select MAC default time stamping instead of the PHY. Indeed the PHY is closer to the wire therefore theoretically it has less delay than the MAC timestamping but the reality is different. Due to lower time stamping clock frequency, latency in the MDIO bus and no PHC hardware synchronization between different PHY, the PHY PTP is often less precise than the MAC. The exception is for PHY designed specially for PTP case but these devices are not very widespread. For not breaking the compatibility I introduce a default_timestamp flag in phy_device that is set by the phy driver to know we are using the old API behavior. The phy_set_timestamp function is called at each call of phy_attach_direct. In case of MAC driver using phylink this function is called when the interface is turned up. Then if the interface goes down and up again the last choice of timestamp will be overwritten by the default choice. A solution could be to cache the timestamp status but it can bring other issues. In case of SFP, if we change the module, it doesn't make sense to blindly re-set the timestamp back to PHY, if the new module has a PHY with mediocre timestamping capabilities. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net: Make dev_set_hwtstamp_phylib accessibleKory Maincent1-0/+3
Make the dev_set_hwtstamp_phylib function accessible in prevision to use it from ethtool to reset the tstamp current configuration. Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-27net: Add MDB get device operationIdo Schimmel1-0/+4
Add MDB net device operation that will be invoked by rtnetlink code in response to received RTM_GETMDB messages. Subsequent patches will implement the operation in the bridge and VXLAN drivers. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-19net: introduce napi_is_scheduled helperChristian Marangi1-0/+23
We currently have napi_if_scheduled_mark_missed that can be used to check if napi is scheduled but that does more thing than simply checking it and return a bool. Some driver already implement custom function to check if napi is scheduled. Drop these custom function and introduce napi_is_scheduled that simply check if napi is scheduled atomically. Update any driver and code that implement a similar check and instead use this new helper. Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-13net: Handle bulk delete policy in bridge driverAmit Cohen1-6/+2
The merge commit 92716869375b ("Merge branch 'br-flush-filtering'") added support for FDB flushing in bridge driver. The following patches will extend VXLAN driver to support FDB flushing as well. The netlink message for bulk delete is shared between the drivers. With the existing implementation, there is no way to prevent user from flushing with attributes that are not supported per driver. For example, when VNI will be added, user will not get an error for flush FDB entries in bridge with VNI, although this attribute is not relevant for bridge. As preparation for support of FDB flush in VXLAN driver, move the policy to be handled in bridge driver, later a new policy for VXLAN will be added in VXLAN driver. Do not pass 'vid' as part of ndo_fdb_del_bulk(), as this field is relevant only for bridge. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-11netdev: replace napi_reschedule with napi_scheduleChristian Marangi1-10/+0
Now that napi_schedule return a bool, we can drop napi_reschedule that does the same exact function. The function comes from a very old commit bfe13f54f502 ("ibm_emac: Convert to use napi_struct independent of struct net_device") and the purpose is actually deprecated in favour of different logic. Convert every user of napi_reschedule to napi_schedule. Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Acked-by: Jeff Johnson <quic_jjohnson@quicinc.com> # ath10k Acked-by: Nick Child <nnac123@linux.ibm.com> # ibm Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for can/dev/rx-offload.c Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20231009133754.9834-3-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-11netdev: make napi_schedule return bool on NAPI successful scheduleChristian Marangi1-2/+9
Change napi_schedule to return a bool on NAPI successful schedule. This might be useful for some driver to do additional steps after a NAPI has been scheduled. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231009133754.9834-2-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-11net/core: Introduce netdev_core_stats_inc()Yajun Deng1-17/+4
Although there is a kfree_skb_reason() helper function that can be used to find the reason why this skb is dropped, but most callers didn't increase one of rx_dropped, tx_dropped, rx_nohandler and rx_otherhost_dropped. For the users, people are more concerned about why the dropped in ip is increasing. Introduce netdev_core_stats_inc() for trace the caller of dev_core_stats_*_inc(). Also, add __code to netdev_core_stats_alloc(), as it's called with small probability. And add noinline make sure netdev_core_stats_inc was never inlined. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: add DEV_STATS_READ() helperEric Dumazet1-0/+1
Companion of DEV_STATS_INC() & DEV_STATS_ADD(). This is going to be used in the series. Use it in macsec_get_stats64(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-19net: ethernet: mtk_wed: introduce partial AMSDU offload support for MT7988Sujuan Chen1-0/+1
Introduce partial AMSDU offload support for MT7988 SoC in order to merge in hw packets belonging to the same AMSDU before passing them to the WLAN nic. Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Sujuan Chen <sujuan.chen@mediatek.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-17netdev: expose DPLL pin handle for netdeviceJiri Pirko1-0/+21
In case netdevice represents a SyncE port, the user needs to understand the connection between netdevice and associated DPLL pin. There might me multiple netdevices pointing to the same pin, in case of VF/SF implementation. Add a IFLA Netlink attribute to nest the DPLL pin handle, similar to how it is implemented for devlink port. Add a struct dpll_pin pointer to netdev and protect access to it by RTNL. Expose netdev_dpll_pin_set() and netdev_dpll_pin_clear() helpers to the drivers so they can set/clear the DPLL pin relationship to netdev. Note that during the lifetime of struct dpll_pin the pin handle does not change. Therefore it is save to access it lockless. It is drivers responsibility to call netdev_dpll_pin_clear() before dpll_pin_put(). Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-03Merge tag 'for-netdev' of ↵Jakub Kicinski1-51/+4
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2023-08-03 We've added 54 non-merge commits during the last 10 day(s) which contain a total of 84 files changed, 4026 insertions(+), 562 deletions(-). The main changes are: 1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer, Daniel Borkmann 2) Support new insns from cpu v4 from Yonghong Song 3) Non-atomically allocate freelist during prefill from YiFei Zhu 4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu 5) Add tracepoint to xdp attaching failure from Leon Hwang 6) struct netdev_rx_queue and xdp.h reshuffling to reduce rebuild time from Jakub Kicinski * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits) net: invert the netdevice.h vs xdp.h dependency net: move struct netdev_rx_queue out of netdevice.h eth: add missing xdp.h includes in drivers selftests/bpf: Add testcase for xdp attaching failure tracepoint bpf, xdp: Add tracepoint to xdp attaching failure selftests/bpf: fix static assert compilation issue for test_cls_*.c bpf: fix bpf_probe_read_kernel prototype mismatch riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework libbpf: fix typos in Makefile tracing: bpf: use struct trace_entry in struct syscall_tp_t bpf, devmap: Remove unused dtab field from bpf_dtab_netdev bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry netfilter: bpf: Only define get_proto_defrag_hook() if necessary bpf: Fix an array-index-out-of-bounds issue in disasm.c net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn docs/bpf: Fix malformed documentation bpf: selftests: Add defrag selftests bpf: selftests: Support custom type and proto for client sockets bpf: selftests: Support not connecting client socket netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link ... ==================== Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03net: invert the netdevice.h vs xdp.h dependencyJakub Kicinski1-7/+4
xdp.h is far more specific and is included in only 67 other files vs netdevice.h's 1538 include sites. Make xdp.h include netdevice.h, instead of the other way around. This decreases the incremental allmodconfig builds size when xdp.h is touched from 5947 to 662 objects. Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h is a mega-header in its own right so it's nice to avoid xdp.h getting included there as well. The only unfortunate part is that the typedef for xdp_features_t has to move to netdevice.h, since its embedded in struct netdevice. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03net: move struct netdev_rx_queue out of netdevice.hJakub Kicinski1-44/+0
struct netdev_rx_queue is touched in only a few places and having it defined in netdevice.h brings in the dependency on xdp.h, because struct xdp_rxq_info gets embedded in struct netdev_rx_queue. In prep for removal of xdp.h from netdevice.h move all the netdev_rx_queue stuff to a new header. We could technically break the new header up to avoid the sysfs.h include but it's so rarely included it doesn't seem to be worth it at this point. Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03bonding: support balance-alb with openvswitchMateusz Kowalski1-0/+5
Commit d5410ac7b0ba ("net:bonding:support balance-alb interface with vlan to bridge") introduced a support for balance-alb mode for interfaces connected to the linux bridge by fixing missing matching of MAC entry in FDB. In our testing we discovered that it still does not work when the bond is connected to the OVS bridge as show in diagram below: eth1(mac:eth1_mac)--bond0(balance-alb,mac:eth0_mac)--eth0(mac:eth0_mac) | bond0.150(mac:eth0_mac) | ovs_bridge(ip:bridge_ip,mac:eth0_mac) This patch fixes it by checking not only if the device is a bridge but also if it is an openvswitch. Signed-off-by: Mateusz Kowalski <mko@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/9fe7297c-609e-208b-c77b-3ceef6eb51a4@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-02net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted ↵Vladimir Oltean1-0/+4
drivers It is desirable that the new .ndo_hwtstamp_set() API gives more uniformity, less overhead and future flexibility w.r.t. the PHY timestamping behavior. Currently there are some drivers which allow PHY timestamping through the procedure mentioned in Documentation/networking/timestamping.rst. They don't do anything locally if phy_has_hwtstamp() is set, except for lan966x which installs PTP packet traps. Centralize that behavior in a new dev_set_hwtstamp_phylib() code function, which calls either phy_mii_ioctl() for the phylib PHY, or .ndo_hwtstamp_set() of the netdev, based on a single policy (currently simplistic: phy_has_hwtstamp()). Any driver converted to .ndo_hwtstamp_set() will automatically opt into the centralized phylib timestamping policy. Unconverted drivers still get to choose whether they let the PHY handle timestamping or not. Netdev drivers with integrated PHY drivers that don't use phylib presumably don't set dev->phydev, and those will always see HWTSTAMP_SOURCE_NETDEV requests even when converted. The timestamping policy will remain 100% up to them. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/20230801142824.1772134-13-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02net: add hwtstamping helpers for stackable net devicesMaxim Georgiev1-0/+5
The stackable net devices with hwtstamping support (vlan, macvlan, bonding) only pass the hwtstamping ops to the lower (real) device. These drivers are the first that need to be converted to the new timestamping API, because if they aren't prepared to handle that, then no real device driver cannot be converted to the new API either. After studying what vlan_dev_ioctl(), macvlan_eth_ioctl() and bond_eth_ioctl() have in common, here we propose two generic implementations of ndo_hwtstamp_get() and ndo_hwtstamp_set() which can be called by those 3 drivers, with "dev" being their lower device. These helpers cover both cases, when the lower driver is converted to the new API or unconverted. We need some hacks in case of an unconverted driver, namely to stuff some pointers in struct kernel_hwtstamp_config which shouldn't have been there (since the new API isn't supposed to need it). These will be removed when all drivers will have been converted to the new API. Signed-off-by: Maxim Georgiev <glipus@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230801142824.1772134-3-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02net: add NDOs for configuring hardware timestampingMaxim Georgiev1-0/+16
Current hardware timestamping API for NICs requires implementing .ndo_eth_ioctl() for SIOCGHWTSTAMP and SIOCSHWTSTAMP. That API has some boilerplate such as request parameter translation between user and kernel address spaces, handling possible translation failures correctly, etc. Since it is the same all across the board, it would be desirable to handle it through generic code. Here we introduce .ndo_hwtstamp_get() and .ndo_hwtstamp_set(), which implement that boilerplate and allow drivers to just act upon requests. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Maxim Georgiev <glipus@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/20230801142824.1772134-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28net: convert some netlink netdev iterators to depend on the xarrayJakub Kicinski1-0/+3
Reap the benefits of easier iteration thanks to the xarray. Convert just the genetlink ones, those are easier to test. Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230726185530.2247698-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27net: Remove unused declaration dev_restart()YueHaibing1-2/+0
This is not used, so can remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230726143715.24700-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-24net: add missing net_device::xdp_zc_max_segs descriptionMaciej Fijalkowski1-0/+2
Cited commit under 'Fixes' tag introduced new member to struct net_device without providing description of it - fix it. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/all/20230720141613.61488b9e@canb.auug.org.au/ Fixes: 13ce2daa259a ("xsk: add new netlink attribute dedicated for ZC max frags") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Simon Horman <simon.horman@corigine.com> # build-tested Link: https://lore.kernel.org/r/20230721145808.596298-1-maciej.fijalkowski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-19bpf: Add fd-based tcx multi-prog infra with link supportDaniel Borkmann1-9/+6
This work refactors and adds a lightweight extension ("tcx") to the tc BPF ingress and egress data path side for allowing BPF program management based on fds via bpf() syscall through the newly added generic multi-prog API. The main goal behind this work which we also presented at LPC [0] last year and a recent update at LSF/MM/BPF this year [3] is to support long-awaited BPF link functionality for tc BPF programs, which allows for a model of safe ownership and program detachment. Given the rise in tc BPF users in cloud native environments, this becomes necessary to avoid hard to debug incidents either through stale leftover programs or 3rd party applications accidentally stepping on each others toes. As a recap, a BPF link represents the attachment of a BPF program to a BPF hook point. The BPF link holds a single reference to keep BPF program alive. Moreover, hook points do not reference a BPF link, only the application's fd or pinning does. A BPF link holds meta-data specific to attachment and implements operations for link creation, (atomic) BPF program update, detachment and introspection. The motivation for BPF links for tc BPF programs is multi-fold, for example: - From Meta: "It's especially important for applications that are deployed fleet-wide and that don't "control" hosts they are deployed to. If such application crashes and no one notices and does anything about that, BPF program will keep running draining resources or even just, say, dropping packets. We at FB had outages due to such permanent BPF attachment semantics. With fd-based BPF link we are getting a framework, which allows safe, auto-detachable behavior by default, unless application explicitly opts in by pinning the BPF link." [1] - From Cilium-side the tc BPF programs we attach to host-facing veth devices and phys devices build the core datapath for Kubernetes Pods, and they implement forwarding, load-balancing, policy, EDT-management, etc, within BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently experienced hard-to-debug issues in a user's staging environment where another Kubernetes application using tc BPF attached to the same prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath it. The goal is to establish a clear/safe ownership model via links which cannot accidentally be overridden. [0,2] BPF links for tc can co-exist with non-link attachments, and the semantics are in line also with XDP links: BPF links cannot replace other BPF links, BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF links and lastly only non-BPF links can replace non-BPF links. In case of Cilium, this would solve mentioned issue of safe ownership model as 3rd party applications would not be able to accidentally wipe Cilium programs, even if they are not BPF link aware. Earlier attempts [4] have tried to integrate BPF links into core tc machinery to solve cls_bpf, which has been intrusive to the generic tc kernel API with extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could be wiped from the qdisc also. Locking a tc BPF program in place this way, is getting into layering hacks given the two object models are vastly different. We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF attach API, so that the BPF link implementation blends in naturally similar to other link types which are fd-based and without the need for changing core tc internal APIs. BPF programs for tc can then be successively migrated from classic cls_bpf to the new tc BPF link without needing to change the program's source code, just the BPF loader mechanics for attaching is sufficient. For the current tc framework, there is no change in behavior with this change and neither does this change touch on tc core kernel APIs. The gist of this patch is that the ingress and egress hook have a lightweight, qdisc-less extension for BPF to attach its tc BPF programs, in other words, a minimal entry point for tc BPF. The name tcx has been suggested from discussion of earlier revisions of this work as a good fit, and to more easily differ between the classic cls_bpf attachment and the fd-based one. For the ingress and egress tcx points, the device holds a cache-friendly array with program pointers which is separated from control plane (slow-path) data. Earlier versions of this work used priority to determine ordering and expression of dependencies similar as with classic tc, but it was challenged that for something more future-proof a better user experience is required. Hence this resulted in the design and development of the generic attach/detach/query API for multi-progs. See prior patch with its discussion on the API design. tcx is the first user and later we plan to integrate also others, for example, one candidate is multi-prog support for XDP which would benefit and have the same 'look and feel' from API perspective. The goal with tcx is to have maximum compatibility to existing tc BPF programs, so they don't need to be rewritten specifically. Compatibility to call into classic tcf_classify() is also provided in order to allow successive migration or both to cleanly co-exist where needed given its all one logical tc layer and the tcx plus classic tc cls/act build one logical overall processing pipeline. tcx supports the simplified return codes TCX_NEXT which is non-terminating (go to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT. The fd-based API is behind a static key, so that when unused the code is also not entered. The struct tcx_entry's program array is currently static, but could be made dynamic if necessary at a point in future. The a/b pair swap design has been chosen so that for detachment there are no allocations which otherwise could fail. The work has been tested with tc-testing selftest suite which all passes, as well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB. Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews of this work. [0] https://lpc.events/event/16/contributions/1353/ [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19xsk: add new netlink attribute dedicated for ZC max fragsMaciej Fijalkowski1-0/+1
Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will carry maximum fragments that underlying ZC driver is able to handle on TX side. It is going to be included in netlink response only when driver supports ZC. Any value higher than 1 implies multi-buffer ZC support on underlying device. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-24Merge tag 'for-netdev' of ↵Jakub Kicinski1-0/+9
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-06-23 We've added 49 non-merge commits during the last 24 day(s) which contain a total of 70 files changed, 1935 insertions(+), 442 deletions(-). The main changes are: 1) Extend bpf_fib_lookup helper to allow passing the route table ID, from Louis DeLosSantos. 2) Fix regsafe() in verifier to call check_ids() for scalar registers, from Eduard Zingerman. 3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and() and a rework of bpf_cpumask_any*() kfuncs. Additionally, add selftests, from David Vernet. 4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings, from Gilad Sever. 5) Change bpf_link_put() to use workqueue unconditionally to fix it under PREEMPT_RT, from Sebastian Andrzej Siewior. 6) Follow-ups to address issues in the bpf_refcount shared ownership implementation, from Dave Marchevsky. 7) A few general refactorings to BPF map and program creation permissions checks which were part of the BPF token series, from Andrii Nakryiko. 8) Various fixes for benchmark framework and add a new benchmark for BPF memory allocator to BPF selftests, from Hou Tao. 9) Documentation improvements around iterators and trusted pointers, from Anton Protopopov. 10) Small cleanup in verifier to improve allocated object check, from Daniel T. Lee. 11) Improve performance of bpf_xdp_pointer() by avoiding access to shared_info when XDP packet does not have frags, from Jesper Dangaard Brouer. 12) Silence a harmless syzbot-reported warning in btf_type_id_size(), from Yonghong Song. 13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper, from Jarkko Sakkinen. 14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits) bpf, docs: Document existing macros instead of deprecated bpf, docs: BPF Iterator Document selftests/bpf: Fix compilation failure for prog vrf_socket_lookup selftests/bpf: Add vrf_socket_lookup tests bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint bpf: Factor out socket lookup functions for the TC hookpoint. selftests/bpf: Set the default value of consumer_cnt as 0 selftests/bpf: Ensure that next_cpu() returns a valid CPU number selftests/bpf: Output the correct error code for pthread APIs selftests/bpf: Use producer_cnt to allocate local counter array xsk: Remove unused inline function xsk_buff_discard() bpf: Keep BPF_PROG_LOAD permission checks clear of validations bpf: Centralize permissions checks for all BPF map types bpf: Inline map creation logic in map_create() function bpf: Move unprivileged checks into map_create() and bpf_prog_load() bpf: Remove in_atomic() from bpf_link_put(). selftests/bpf: Verify that check_ids() is used for scalars in regsafe() bpf: Verify scalar ids mapping in regsafe() using check_ids() selftests/bpf: Check if mark_chain_precision() follows scalar ids ... ==================== Link: https://lore.kernel.org/r/20230623211256.8409-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-21bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindingsGilad Sever1-0/+9
When calling bpf_sk_lookup_tcp(), bpf_sk_lookup_udp() or bpf_skc_lookup_tcp() from tc/xdp ingress, VRF socket bindings aren't respoected, i.e. unbound sockets are returned, and bound sockets aren't found. VRF binding is determined by the sdif argument to sk_lookup(), however when called from tc the IP SKB control block isn't initialized and thus inet{,6}_sdif() always returns 0. Fix by calculating sdif for the tc/xdp flows by observing the device's l3 enslaved state. The cg/sk_skb hooking points which are expected to support inet{,6}_sdif() pass sdif=-1 which makes __bpf_skc_lookup() use the existing logic. Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF") Signed-off-by: Gilad Sever <gilad9366@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Eyal Birger <eyal.birger@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/bpf/20230621104211.301902-4-gilad9366@gmail.com
2023-06-15net: create device lookup API with reference trackingJakub Kicinski1-0/+4
New users of dev_get_by_index() and dev_get_by_name() keep getting added and it would be nice to steer them towards the APIs with reference tracking. Add variants of those calls which allocate the reference tracker and use them in a couple of places. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-10net: move gso declarations and functions to their own filesEric Dumazet1-25/+1
Move declarations into include/net/gso.h and code into net/core/gso.c Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fomichev <sdf@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-07net: sched: add rcu annotations around qdisc->qdisc_sleepingEric Dumazet1-1/+1
syzbot reported a race around qdisc->qdisc_sleeping [1] It is time we add proper annotations to reads and writes to/from qdisc->qdisc_sleeping. [1] BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1: qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331 __tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174 tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547 rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmsg+0x1e3/0x270 net/socket.c:2586 __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0: dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115 qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103 tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693 rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmsg+0x1e3/0x270 net/socket.c:2586 __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023 Fixes: 3a7d0d07a386 ("net: sched: extend Qdisc with rcu") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@nvidia.com> Acked-by: Jamal Hadi Salim<jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-07rfs: annotate lockless accesses to RFS sock flow tableEric Dumazet1-2/+5
Add READ_ONCE()/WRITE_ONCE() on accesses to the sock flow table. This also prevents a (smart ?) compiler to remove the condition in: if (table->ents[index] != newval) table->ents[index] = newval; We need the condition to avoid dirtying a shared cache line. Fixes: fec5e652e58f ("rfs: Receive Flow Steering") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23net: optimize napi_threaded_poll() vs RPS/RFSEric Dumazet1-0/+3
We use napi_threaded_poll() in order to reduce our softirq dependency. We can add a followup of 821eba962d95 ("net: optimize napi_schedule_rps()") to further remove the need of firing NET_RX_SOFTIRQ whenever RPS/RFS are used. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-20net: move dropreason.h to dropreason-core.hJohannes Berg1-1/+1
This will, after the next patch, hold only the core drop reasons and minimal infrastructure. Fix a small kernel-doc issue while at it, to avoid the move triggering a checker. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14page_pool: allow caching from safely localized NAPIJakub Kicinski1-0/+3
Recent patches to mlx5 mentioned a regression when moving from driver local page pool to only using the generic page pool code. Page pool has two recycling paths (1) direct one, which runs in safe NAPI context (basically consumer context, so producing can be lockless); and (2) via a ptr_ring, which takes a spin lock because the freeing can happen from any CPU; producer and consumer may run concurrently. Since the page pool code was added, Eric introduced a revised version of deferred skb freeing. TCP skbs are now usually returned to the CPU which allocated them, and freed in softirq context. This places the freeing (producing of pages back to the pool) enticingly close to the allocation (consumer). If we can prove that we're freeing in the same softirq context in which the consumer NAPI will run - lockless use of the cache is perfectly fine, no need for the lock. Let drivers link the page pool to a NAPI instance. If the NAPI instance is scheduled on the same CPU on which we're freeing - place the pages in the direct cache. With that and patched bnxt (XDP enabled to engage the page pool, sigh, bnxt really needs page pool work :() I see a 2.6% perf boost with a TCP stream test (app on a different physical core than softirq). The CPU use of relevant functions decreases as expected: page_pool_refill_alloc_cache 1.17% -> 0% _raw_spin_lock 2.41% -> 0.98% Only consider lockless path to be safe when NAPI is scheduled - in practice this should cover majority if not all of steady state workloads. It's usually the NAPI kicking in that causes the skb flush. The main case we'll miss out on is when application runs on the same CPU as NAPI. In that case we don't use the deferred skb free path. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+2
Conflicts: tools/testing/selftests/net/config 62199e3f1658 ("selftests: net: Add VXLAN MDB test") 3a0385be133e ("selftests: add the missing CONFIG_IP_SCTP in net config") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13xdp: rss hash types representationJesper Dangaard Brouer1-1/+2
The RSS hash type specifies what portion of packet data NIC hardware used when calculating RSS hash value. The RSS types are focused on Internet traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often get hash value zero and no RSS type. For L3 focused on IPv4 vs. IPv6, and L4 primarily TCP vs UDP, but some hardware supports SCTP. Hardware RSS types are differently encoded for each hardware NIC. Most hardware represent RSS hash type as a number. Determining L3 vs L4 often requires a mapping table as there often isn't a pattern or sorting according to ISO layer. The patch introduce a XDP RSS hash type (enum xdp_rss_hash_type) that contains both BITs for the L3/L4 types, and combinations to be used by drivers for their mapping tables. The enum xdp_rss_type_bits get exposed to BPF via BTF, and it is up to the BPF-programmer to match using these defines. This proposal change the kfunc API bpf_xdp_metadata_rx_hash() adding a pointer value argument for provide the RSS hash type. Change signature for all xmo_rx_hash calls in drivers to make it compile. The RSS type implementations for each driver comes as separate patches. Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-10net: piggy back on the memory barrier in bql when waking queuesJakub Kicinski1-1/+1
Drivers call netdev_tx_completed_queue() right before netif_txq_maybe_wake(). If BQL is enabled netdev_tx_completed_queue() should issue a memory barrier, so we can depend on that separating the stop check from the consumer index update, instead of adding another barrier in netif_txq_maybe_wake(). This matters more than the barriers on the xmit path, because the wake condition is almost always true. So we issue the consumer side barrier often. Wrap netdev_tx_completed_queue() in a local helper to issue the barrier even if BQL is disabled. Keep the same semantics as netdev_tx_completed_queue() (barrier only if bytes != 0) to make it clear that the barrier is conditional. Plus since macro gets pkt/byte counts as arguments now - we can skip waking if there were no packets completed. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10net: provide macros for commonly copied lockless queue stop/wake codeJakub Kicinski1-0/+1
A lot of drivers follow the same scheme to stop / start queues without introducing locks between xmit and NAPI tx completions. I'm guessing they all copy'n'paste each other's code. The original code dates back all the way to e1000 and Linux 2.6.19. Smaller drivers shy away from the scheme and introduce a lock which may cause deadlocks in netpoll. Provide macros which encapsulate the necessary logic. The macros do not prevent false wake ups, the extra barrier required to close that race is not worth it. See discussion in: https://lore.kernel.org/all/c39312a2-4537-14b4-270c-9fe1fbb91e89@gmail.com/ Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-09net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stubVladimir Oltean1-6/+0
There was a sort of rush surrounding commit 88c0a6b503b7 ("net: create a netdev notifier for DSA to reject PTP on DSA master"), due to a desire to convert DSA's attempt to deny TX timestamping on a DSA master to something that doesn't block the kernel-wide API conversion from ndo_eth_ioctl() to ndo_hwtstamp_set(). What was required was a mechanism that did not depend on ndo_eth_ioctl(), and what was provided was a mechanism that did not depend on ndo_eth_ioctl(), while at the same time introducing something that wasn't absolutely necessary - a new netdev notifier. There have been objections from Jakub Kicinski that using notifiers in general when they are not absolutely necessary creates complications to the control flow and difficulties to maintainers who look at the code. So there is a desire to not use notifiers. In addition to that, the notifier chain gets called even if there is no DSA in the system and no one is interested in applying any restriction. Take the model of udp_tunnel_nic_ops and introduce a stub mechanism, through which net/core/dev_ioctl.c can call into DSA even when CONFIG_NET_DSA=m. Compared to the code that existed prior to the notifier conversion, aka what was added in commits: - 4cfab3566710 ("net: dsa: Add wrappers for overloaded ndo_ops") - 3369afba1e46 ("net: Call into DSA netdevice_ops wrappers") this is different because we are not overloading any struct net_device_ops of the DSA master anymore, but rather, we are exposing a rather specific functionality which is orthogonal to which API is used to enable it - ndo_eth_ioctl() or ndo_hwtstamp_set(). Also, what is similar is that both approaches use function pointers to get from built-in code to DSA. There is no point in replicating the function pointers towards __dsa_master_hwtstamp_validate() once for every CPU port (dev->dsa_ptr). Instead, it is sufficient to introduce a singleton struct dsa_stubs, built into the kernel, which contains a single function pointer to __dsa_master_hwtstamp_validate(). I find this approach preferable to what we had originally, because dev->dsa_ptr->netdev_ops->ndo_do_ioctl() used to require going through struct dsa_port (dev->dsa_ptr), and so, this was incompatible with any attempts to add any data encapsulation and hide DSA data structures from the outside world. Link: https://lore.kernel.org/netdev/20230403083019.120b72fd@kernel.org/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-03net: create a netdev notifier for DSA to reject PTP on DSA masterVladimir Oltean1-1/+8
The fact that PTP 2-step TX timestamping is broken on DSA switches if the master also timestamps the same packets is documented by commit f685e609a301 ("net: dsa: Deny PTP on master if switch supports it"). We attempt to help the users avoid shooting themselves in the foot by making DSA reject the timestamping ioctls on an interface that is a DSA master, and the switch tree beneath it contains switches which are aware of PTP. The only problem is that there isn't an established way of intercepting ndo_eth_ioctl calls, so DSA creates avoidable burden upon the network stack by creating a struct dsa_netdevice_ops with overlaid function pointers that are manually checked from the relevant call sites. There used to be 2 such dsa_netdevice_ops, but now, ndo_eth_ioctl is the only one left. There is an ongoing effort to migrate driver-visible hardware timestamping control from the ndo_eth_ioctl() based API to a new ndo_hwtstamp_set() model, but DSA actively prevents that migration, since dsa_master_ioctl() is currently coded to manually call the master's legacy ndo_eth_ioctl(), and so, whenever a network device driver would be converted to the new API, DSA's restrictions would be circumvented, because any device could be used as a DSA master. The established way for unrelated modules to react on a net device event is via netdevice notifiers. So we create a new notifier which gets called whenever there is an attempt to change hardware timestamping settings on a device. Finally, there is another reason why a netdev notifier will be a good idea, besides strictly DSA, and this has to do with PHY timestamping. With ndo_eth_ioctl(), all MAC drivers must manually call phy_has_hwtstamp() before deciding whether to act upon SIOCSHWTSTAMP, otherwise they must pass this ioctl to the PHY driver via phy_mii_ioctl(). With the new ndo_hwtstamp_set() API, it will be desirable to simply not make any calls into the MAC device driver when timestamping should be performed at the PHY level. But there exist drivers, such as the lan966x switch, which need to install packet traps for PTP regardless of whether they are the layer that provides the hardware timestamps, or the PHY is. That would be impossible to support with the new API. The proposal there, too, is to introduce a netdev notifier which acts as a better cue for switching drivers to add or remove PTP packet traps, than ndo_hwtstamp_set(). The one introduced here "almost" works there as well, except for the fact that packet traps should only be installed if the PHY driver succeeded to enable hardware timestamping, whereas here, we need to deny hardware timestamping on the DSA master before it actually gets enabled. This is why this notifier is called "PRE_", and the notifier that would get used for PHY timestamping and packet traps would be called NETDEV_CHANGE_HWTSTAMP. This isn't a new concept, for example NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER do the same thing. In expectation of future netlink UAPI, we also pass a non-NULL extack pointer to the netdev notifier, and we make DSA populate it with an informative reason for the rejection. To avoid making it go to waste, we make the ioctl-based dev_set_hwtstamp() create a fake extack and print the message to the kernel log. Link: https://lore.kernel.org/netdev/20230401191215.tvveoi3lkawgg6g4@skbuf/ Link: https://lore.kernel.org/netdev/20230310164451.ls7bbs6pdzs4m6pw@skbuf/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-02net: minor reshuffle of napi_structJakub Kicinski1-2/+3
napi_id is read by GRO and drivers to mark skbs, and it currently sits at the end of the structure, in a mostly unused cache line. Move it up into a hole, and separate the clearly control path fields from the important ones. Before: struct napi_struct { struct list_head poll_list; /* 0 16 */ long unsigned int state; /* 16 8 */ int weight; /* 24 4 */ int defer_hard_irqs_count; /* 28 4 */ long unsigned int gro_bitmask; /* 32 8 */ int (*poll)(struct napi_struct *, int); /* 40 8 */ int poll_owner; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ struct net_device * dev; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct gro_list gro_hash[8]; /* 64 192 */ /* --- cacheline 4 boundary (256 bytes) --- */ struct sk_buff * skb; /* 256 8 */ struct list_head rx_list; /* 264 16 */ int rx_count; /* 280 4 */ /* XXX 4 bytes hole, try to pack */ struct hrtimer timer; /* 288 64 */ /* XXX last struct has 4 bytes of padding */ /* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */ struct list_head dev_list; /* 352 16 */ struct hlist_node napi_hash_node; /* 368 16 */ /* --- cacheline 6 boundary (384 bytes) --- */ unsigned int napi_id; /* 384 4 */ /* XXX 4 bytes hole, try to pack */ struct task_struct * thread; /* 392 8 */ /* size: 400, cachelines: 7, members: 17 */ /* sum members: 388, holes: 3, sum holes: 12 */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 16 bytes */ }; After: struct napi_struct { struct list_head poll_list; /* 0 16 */ long unsigned int state; /* 16 8 */ int weight; /* 24 4 */ int defer_hard_irqs_count; /* 28 4 */ long unsigned int gro_bitmask; /* 32 8 */ int (*poll)(struct napi_struct *, int); /* 40 8 */ int poll_owner; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ struct net_device * dev; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct gro_list gro_hash[8]; /* 64 192 */ /* --- cacheline 4 boundary (256 bytes) --- */ struct sk_buff * skb; /* 256 8 */ struct list_head rx_list; /* 264 16 */ int rx_count; /* 280 4 */ unsigned int napi_id; /* 284 4 */ struct hrtimer timer; /* 288 64 */ /* XXX last struct has 4 bytes of padding */ /* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */ struct task_struct * thread; /* 352 8 */ struct list_head dev_list; /* 360 16 */ struct hlist_node napi_hash_node; /* 376 16 */ /* size: 392, cachelines: 7, members: 17 */ /* sum members: 388, holes: 1, sum holes: 4 */ /* paddings: 1, sum paddings: 4 */ /* forced alignments: 1 */ /* last cacheline: 8 bytes */ } __attribute__((__aligned__(8))); Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-30net: add softnet_data.in_net_rx_actionEric Dumazet1-0/+1
We want to make two optimizations in napi_schedule_rps() and ____napi_schedule() which require to know if these helpers are called from net_rx_action(), instead of being called from other contexts. sd.in_net_rx_action is only read/written by the owning cpu. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-23docs: networking: document NAPIJakub Kicinski1-5/+8
Add basic documentation about NAPI. We can stop linking to the ancient doc on the LF wiki. Link: https://lore.kernel.org/all/20230315223044.471002-1-kuba@kernel.org/ Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> # for ctucanfd-driver.rst Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20230322053848.198452-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-22netdev: Enforce index cap in netdev_get_tx_queueNick Child1-0/+1
When requesting a TX queue at a given index, warn on out-of-bounds referencing if the index is greater than the allocated number of queues. Specifically, since this function is used heavily in the networking stack use DEBUG_NET_WARN_ON_ONCE to avoid executing a new branch on every packet. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://lore.kernel.org/r/20230321150725.127229-2-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+4
net/wireless/nl80211.c b27f07c50a73 ("wifi: nl80211: fix puncturing bitmap policy") cbbaf2bb829b ("wifi: nl80211: add a command to enable/disable HW timestamping") https://lore.kernel.org/all/20230314105421.3608efae@canb.auug.org.au tools/testing/selftests/net/Makefile 62199e3f1658 ("selftests: net: Add VXLAN MDB test") 13715acf8ab5 ("selftest: Add test for bind() conflicts.") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-17net: Add MDB net device operationsIdo Schimmel1-0/+21
Add MDB net device operations that will be invoked by rtnetlink code in response to received RTM_{NEW,DEL,GET}MDB messages. Subsequent patches will implement these operations in the bridge and VXLAN drivers. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-15net: tunnels: annotate lockless accesses to dev->needed_headroomEric Dumazet1-2/+4
IP tunnels can apparently update dev->needed_headroom in their xmit path. This patch takes care of three tunnels xmit, and also the core LL_RESERVED_SPACE() and LL_RESERVED_SPACE_EXTRA() helpers. More changes might be needed for completeness. BUG: KCSAN: data-race in ip_tunnel_xmit / ip_tunnel_xmit read to 0xffff88815b9da0ec of 2 bytes by task 888 on cpu 1: ip_tunnel_xmit+0x1270/0x1730 net/ipv4/ip_tunnel.c:803 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 write to 0xffff88815b9da0ec of 2 bytes by task 2379 on cpu 0: ip_tunnel_xmit+0x1294/0x1730 net/ipv4/ip_tunnel.c:804 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip6_finish_output2+0x9bc/0xc50 net/ipv6/ip6_output.c:134 __ip6_finish_output net/ipv6/ip6_output.c:195 [inline] ip6_finish_output+0x39a/0x4e0 net/ipv6/ip6_output.c:206 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip6_output+0xeb/0x220 net/ipv6/ip6_output.c:227 dst_output include/net/dst.h:444 [inline] NF_HOOK include/linux/netfilter.h:302 [inline] mld_sendpack+0x438/0x6a0 net/ipv6/mcast.c:1820 mld_send_cr net/ipv6/mcast.c:2121 [inline] mld_ifc_work+0x519/0x7b0 net/ipv6/mcast.c:2653 process_one_work+0x3e6/0x750 kernel/workqueue.c:2390 worker_thread+0x5f2/0xa10 kernel/workqueue.c:2537 kthread+0x1ac/0x1e0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 value changed: 0x0dd4 -> 0x0e14 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 2379 Comm: kworker/0:0 Not tainted 6.3.0-rc1-syzkaller-00002-g8ca09d5fa354-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023 Workqueue: mld mld_ifc_work Fixes: 8eb30be0352d ("ipv6: Create ip6_tnl_xmit") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230310191109.2384387-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-07net: remove enum skb_free_reasonEric Dumazet1-11/+7
enum skb_drop_reason is more generic, we can adopt it instead. Provide dev_kfree_skb_irq_reason() and dev_kfree_skb_any_reason(). This means drivers can use more precise drop reasons if they want to. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Link: https://lore.kernel.org/r/20230306204313.10492-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-20net: make default_rps_mask a per netns attributePaolo Abeni1-1/+0
That really was meant to be a per netns attribute from the beginning. The idea is that once proper isolation is in place in the main namespace, additional demux in the child namespaces will be redundant. Let's make child netns default rps mask empty by default. To avoid bloating the netns with a possibly large cpumask, allocate it on-demand during the first write operation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-17Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-2/+0
Some of the devlink bits were tricky, but I think I got it right. Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-16devlink: Fix netdev notifier chain corruptionIdo Schimmel1-2/+0
Cited commit changed devlink to register its netdev notifier block on the global netdev notifier chain instead of on the per network namespace one. However, when changing the network namespace of the devlink instance, devlink still tries to unregister its notifier block from the chain of the old namespace and register it on the chain of the new namespace. This results in corruption of the notifier chains, as the same notifier block is registered on two different chains: The global one and the per network namespace one. In turn, this causes other problems such as the inability to dismantle namespaces due to netdev reference count issues. Fix by preventing devlink from moving its notifier block between namespaces. Reproducer: # echo "10 1" > /sys/bus/netdevsim/new_device # ip netns add test123 # devlink dev reload netdevsim/netdevsim10 netns test123 # ip netns del test123 [ 71.935619] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 71.938348] leaked reference. Fixes: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230215073139.1360108-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-10Daniel Borkmann says:Jakub Kicinski1-0/+4
==================== pull-request: bpf-next 2023-02-11 We've added 96 non-merge commits during the last 14 day(s) which contain a total of 152 files changed, 4884 insertions(+), 962 deletions(-). There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c between commit 5b246e533d01 ("ice: split probe into smaller functions") from the net-next tree and commit 66c0e13ad236 ("drivers: net: turn on XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev() is otherwise there a 2nd time, and add XDP features to the existing ice_cfg_netdev() one: [...] ice_set_netdev_features(netdev); netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT | NETDEV_XDP_ACT_XSK_ZEROCOPY; ice_set_ops(netdev); [...] Stephen's merge conflict mail: https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/ The main changes are: 1) Add support for BPF trampoline on s390x which finally allows to remove many test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich. 2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski. 3) Add capability to export the XDP features supported by the NIC. Along with that, add a XDP compliance test tool, from Lorenzo Bianconi & Marek Majtyka. 4) Add __bpf_kfunc tag for marking kernel functions as kfuncs, from David Vernet. 5) Add a deep dive documentation about the verifier's register liveness tracking algorithm, from Eduard Zingerman. 6) Fix and follow-up cleanups for resolve_btfids to be compiled as a host program to avoid cross compile issues, from Jiri Olsa & Ian Rogers. 7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted when testing on different NICs, from Jesper Dangaard Brouer. 8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang. 9) Extend libbpf to add an option for when the perf buffer should wake up, from Jon Doron. 10) Follow-up fix on xdp_metadata selftest to just consume on TX completion, from Stanislav Fomichev. 11) Extend the kfuncs.rst document with description on kfunc lifecycle & stability expectations, from David Vernet. 12) Fix bpftool prog profile to skip attaching to offline CPUs, from Tonghao Zhang. ==================== Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09net: introduce default_rps_mask netns attributePaolo Abeni1-0/+1
If RPS is enabled, this allows configuring a default rps mask, which is effective since receive queue creation time. A default RPS mask allows the system admin to ensure proper isolation, avoiding races at network namespace or device creation time. The default RPS mask is initially empty, and can be modified via a newly added sysctl entry. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-07net, xdp: Add missing xdp_features descriptionLorenzo Bianconi1-0/+1
Add missing xdp_features field description in the struct net_device documentation. This patch fix the following warning: [...] ./include/linux/netdevice.h:2375: warning: Function parameter or member 'xdp_features' not described in 'net_device' [...] Fixes: d3d854fd6a1d ("netdev-genl: create a simple family for netdev stuff") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/7878544903d855b49e838c9d59f715bde0b5e63b.1675705948.git.lorenzo@kernel.org
2023-02-02netdev-genl: create a simple family for netdev stuffJakub Kicinski1-0/+3
Add a Netlink spec-compatible family for netdevs. This is a very simple implementation without much thought going into it. It allows us to reap all the benefits of Netlink specs, one can use the generic client to issue the commands: $ ./cli.py --spec netdev.yaml --dump dev_get [{'ifindex': 1, 'xdp-features': set()}, {'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}}, {'ifindex': 3, 'xdp-features': {'rx-sg'}}] the generic python library does not have flags-by-name support, yet, but we also don't have to carry strings in the messages, as user space can get the names from the spec. Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Co-developed-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-01net: add gso_ipv4_max_size and gro_ipv4_max_size per deviceXin Long1-0/+6
This patch introduces gso_ipv4_max_size and gro_ipv4_max_size per device and adds netlink attributes for them, so that IPV4 BIG TCP can be guarded by a separate tunable in the next patch. To not break the old application using "gso/gro_max_size" for IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size" in netif_set_gso/gro_max_size() if the new size isn't greater than GSO_LEGACY_MAX_SIZE, so that nothing will change even if userspace doesn't realize the new netlink attributes. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-28Merge tag 'for-netdev' of ↵Jakub Kicinski1-0/+8
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2023-01-28 We've added 124 non-merge commits during the last 22 day(s) which contain a total of 124 files changed, 6386 insertions(+), 1827 deletions(-). The main changes are: 1) Implement XDP hints via kfuncs with initial support for RX hash and timestamp metadata kfuncs, from Stanislav Fomichev and Toke Høiland-Jørgensen. Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk 2) Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case, from Andrii Nakryiko. 3) Significantly reduce the search time for module symbols by livepatch and BPF, from Jiri Olsa and Zhen Lei. 4) Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals, from David Vernet. 5) Fix several issues in the dynptr processing such as stack slot liveness propagation, missing checks for PTR_TO_STACK variable offset, etc, from Kumar Kartikeya Dwivedi. 6) Various performance improvements, fixes, and introduction of more than just one XDP program to XSK selftests, from Magnus Karlsson. 7) Big batch to BPF samples to reduce deprecated functionality, from Daniel T. Lee. 8) Enable struct_ops programs to be sleepable in verifier, from David Vernet. 9) Reduce pr_warn() noise on BTF mismatches when they are expected under the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien. 10) Describe modulo and division by zero behavior of the BPF runtime in BPF's instruction specification document, from Dave Thaler. 11) Several improvements to libbpf API documentation in libbpf.h, from Grant Seltzer. 12) Improve resolve_btfids header dependencies related to subcmd and add proper support for HOSTCC, from Ian Rogers. 13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room() helper along with BPF selftests, from Ziyang Xuan. 14) Simplify the parsing logic of structure parameters for BPF trampoline in the x86-64 JIT compiler, from Pu Lehui. 15) Get BTF working for kernels with CONFIG_RUST enabled by excluding Rust compilation units with pahole, from Martin Rodriguez Reboredo. 16) Get bpf_setsockopt() working for kTLS on top of TCP sockets, from Kui-Feng Lee. 17) Disable stack protection for BPF objects in bpftool given BPF backends don't support it, from Holger Hoffstätte. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits) selftest/bpf: Make crashes more debuggable in test_progs libbpf: Add documentation to map pinning API functions libbpf: Fix malformed documentation formatting selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket. bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt(). bpf/selftests: Verify struct_ops prog sleepable behavior bpf: Pass const struct bpf_prog * to .check_member libbpf: Support sleepable struct_ops.s section bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable selftests/bpf: Fix vmtest static compilation error tools/resolve_btfids: Alter how HOSTCC is forced tools/resolve_btfids: Install subcmd headers bpf/docs: Document the nocast aliasing behavior of ___init bpf/docs: Document how nested trusted fields may be defined bpf/docs: Document cpumask kfuncs in a new file selftests/bpf: Add selftest suite for cpumask kfuncs selftests/bpf: Add nested trust selftests suite bpf: Enable cpumasks to be queried and used as kptrs bpf: Disallow NULLable pointers for trusted kfuncs ... ==================== Link: https://lore.kernel.org/r/20230128004827.21371-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26xfrm: extend add state callback to set failure reasonLeon Romanovsky1-1/+1
Almost all validation logic is in the drivers, but they are missing reliable way to convey failure reason to userspace applications. Let's use extack to return this information to users. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26xfrm: extend add policy callback to set failure reasonLeon Romanovsky1-1/+1
Almost all validation logic is in the drivers, but they are missing reliable way to convey failure reason to userspace applications. Let's use extack to return this information to users. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-23bpf: XDP metadata RX kfuncsStanislav Fomichev1-0/+8
Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible XDP metatada kfuncs. Not all devices have to implement them. If kfunc is not supported by the target device, the default implementation is called instead. The verifier, at load time, replaces a call to the generic kfunc with a call to the per-device one. Per-device kfunc pointers are stored in separate struct xdp_metadata_ops. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-8-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-12net: add IFF_NO_ADDRCONF and use it in bonding to prevent ipv6 addrconfXin Long1-1/+2
Currently, in bonding it reused the IFF_SLAVE flag and checked it in ipv6 addrconf to prevent ipv6 addrconf. However, it is not a proper flag to use for no ipv6 addrconf, for bonding it has to move IFF_SLAVE flag setting ahead of dev_open() in bond_enslave(). Also, IFF_MASTER/SLAVE are historical flags used in bonding and eql, as Jiri mentioned, the new devices like Team, Failover do not use this flag. So as Jiri suggested, this patch adds IFF_NO_ADDRCONF in priv_flags of the device to indicate no ipv6 addconf, and uses it in bonding and moves IFF_SLAVE flag setting back to its original place. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09Merge tag 'ipsec-next-2022-12-09' of ↵Jakub Kicinski1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== ipsec-next 2022-12-09 1) Add xfrm packet offload core API. From Leon Romanovsky. 2) Add xfrm packet offload support for mlx5. From Leon Romanovsky and Raed Salem. 3) Fix a typto in a error message. From Colin Ian King. * tag 'ipsec-next-2022-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: (38 commits) xfrm: Fix spelling mistake "oflload" -> "offload" net/mlx5e: Open mlx5 driver to accept IPsec packet offload net/mlx5e: Handle ESN update events net/mlx5e: Handle hardware IPsec limits events net/mlx5e: Update IPsec soft and hard limits net/mlx5e: Store all XFRM SAs in Xarray net/mlx5e: Provide intermediate pointer to access IPsec struct net/mlx5e: Skip IPsec encryption for TX path without matching policy net/mlx5e: Add statistics for Rx/Tx IPsec offloaded flows net/mlx5e: Improve IPsec flow steering autogroup net/mlx5e: Configure IPsec packet offload flow steering net/mlx5e: Use same coding pattern for Rx and Tx flows net/mlx5e: Add XFRM policy offload logic net/mlx5e: Create IPsec policy offload tables net/mlx5e: Generalize creation of default IPsec miss group and rule net/mlx5e: Group IPsec miss handles into separate struct net/mlx5e: Make clear what IPsec rx_err does net/mlx5e: Flatten the IPsec RX add rule path net/mlx5e: Refactor FTE setup code to be more clear net/mlx5e: Move IPsec flow table creation to separate function ... ==================== Link: https://lore.kernel.org/r/20221209093310.4018731-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-05xfrm: add support to HW update soft and hard limitsLeon Romanovsky1-0/+1
Both in RX and TX, the traffic that performs IPsec packet offload transformation is accounted by HW. It is needed to properly handle hard limits that require to drop the packet. It means that XFRM core needs to update internal counters with the one that accounted by the HW, so new callbacks are introduced in this patch. In case of soft or hard limit is occurred, the driver should call to xfrm_state_check_expire() that will perform key rekeying exactly as done by XFRM core. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: add an interface to offload policyLeon Romanovsky1-0/+3
Extend netlink interface to add and delete XFRM policy from the device. This functionality is a first step to implement packet IPsec offload solution. Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-03net: add netdev_sw_irq_coalesce_default_on()Heiner Kallweit1-0/+1
Add a helper for drivers wanting to set SW IRQ coalescing by default. The related sysfs attributes can be used to override the default values. Follow Jakub's suggestion and put this functionality into net core so that drivers wanting to use software interrupt coalescing per default don't have to open-code it. Note that this function needs to be called before the netdevice is registered. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-28Daniel Borkmann says:Jakub Kicinski1-1/+1
==================== bpf-next 2022-11-25 We've added 101 non-merge commits during the last 11 day(s) which contain a total of 109 files changed, 8827 insertions(+), 1129 deletions(-). The main changes are: 1) Support for user defined BPF objects: the use case is to allocate own objects, build own object hierarchies and use the building blocks to build own data structures flexibly, for example, linked lists in BPF, from Kumar Kartikeya Dwivedi. 2) Add bpf_rcu_read_{,un}lock() support for sleepable programs, from Yonghong Song. 3) Add support storing struct task_struct objects as kptrs in maps, from David Vernet. 4) Batch of BPF map documentation improvements, from Maryam Tahhan and Donald Hunter. 5) Improve BPF verifier to propagate nullness information for branches of register to register comparisons, from Eduard Zingerman. 6) Fix cgroup BPF iter infra to hold reference on the start cgroup, from Hou Tao. 7) Fix BPF verifier to not mark fentry/fexit program arguments as trusted given it is not the case for them, from Alexei Starovoitov. 8) Improve BPF verifier's realloc handling to better play along with dynamic runtime analysis tools like KASAN and friends, from Kees Cook. 9) Remove legacy libbpf mode support from bpftool, from Sahid Orentino Ferdjaoui. 10) Rework zero-len skb redirection checks to avoid potentially breaking existing BPF test infra users, from Stanislav Fomichev. 11) Two small refactorings which are independent and have been split out of the XDP queueing RFC series, from Toke Høiland-Jørgensen. 12) Fix a memory leak in LSM cgroup BPF selftest, from Wang Yufen. 13) Documentation on how to run BPF CI without patch submission, from Daniel Müller. Signed-off-by: Jakub Kicinski <kuba@kernel.org> ==================== Link: https://lore.kernel.org/r/20221125012450.441-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-16net: add atomic_long_t to net_device_stats fieldsEric Dumazet1-23/+35
Long standing KCSAN issues are caused by data-race around some dev->stats changes. Most performance critical paths already use per-cpu variables, or per-queue ones. It is reasonable (and more correct) to use atomic operations for the slow paths. This patch adds an union for each field of net_device_stats, so that we can convert paths that are not yet protected by a spinlock or a mutex. netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64 Note that the memcpy() we were using on 64bit arches had no provision to avoid load-tearing, while atomic_long_read() is providing the needed protection at no cost. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-15dev: Move received_rps counter next to RPS members in softnet dataToke Høiland-Jørgensen1-1/+1
Move the received_rps counter value next to the other RPS-related members in softnet_data. This closes two four-byte holes in the structure, making room for another pointer in the first two cache lines without bumping the xmit struct to its own line. Acked-by: Song Liu <song@kernel.org> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20221108140601.149971-2-toke@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-09net: introduce a helper to move notifier block to different namespaceJiri Pirko1-0/+2
Currently, net_dev() netdev notifier variant follows the netdev with per-net notifier from namespace to namespace. This is implemented by move_netdevice_notifiers_dev_net() helper. For devlink it is needed to re-register per-net notifier during devlink reload. Introduce a new helper called move_netdevice_notifier_net() and share the unregister/register code with existing move_netdevice_notifiers_dev_net() helper. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-09net/core: Allow live renaming when an interface is upAndy Ren1-3/+1
Allow a network interface to be renamed when the interface is up. As described in the netconsole documentation [1], when netconsole is used as a built-in, it will bring up the specified interface as soon as possible. As a result, user space will not be able to rename the interface since the kernel disallows renaming of interfaces that are administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set by the kernel. The original solution [2] to this problem was to add a new parameter to the netconsole configuration parameters that allows renaming of the interface used by netconsole while it is administratively up. However, during the discussion that followed, it became apparent that we have no reason to keep the current restriction and instead we should allow user space to rename interfaces regardless of their administrative state: 1. The restriction was put in place over 20 years ago when renaming was only possible via IOCTL and before rtnetlink started notifying user space about such changes like it does today. 2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version 5.2 and no regressions were reported. 3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about the administrative state of interface. Therefore, allow user space to rename running interfaces by removing the restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in possible triage by emitting a message to the kernel log that an interface was renamed while UP. [1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/ Signed-off-by: Andy Ren <andy.ren@getcruise.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-03net: remove unused ndo_get_devlink_portJiri Pirko1-5/+0
Remove ndo_get_devlink_port which is no longer used alongside with the implementations in drivers. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-03net: devlink: track netdev with devlink_port assignedJiri Pirko1-0/+19
Currently, ethernet drivers are using devlink_port_type_eth_set() and devlink_port_type_clear() to set devlink port type and link to related netdev. Instead of calling them directly, let the driver use SET_NETDEV_DEVLINK_PORT macro to assign devlink_port pointer and let devlink to track it. Note the devlink port pointer is static during the time netdevice is registered. In devlink code, use per-namespace netdev notifier to track the netdevices with devlink_port assigned and change the internal devlink_port type and related type pointer accordingly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-31rtnetlink: pass netlink message header and portid to rtnl_configure_link()Hangbin Liu1-2/+0
This patch pass netlink message header and portid to rtnl_configure_link() All the functions in this call chain need to add the parameters so we can use them in the last call rtnl_notify(), and notify the userspace about the new link info if NLM_F_ECHO flag is set. - rtnl_configure_link() - __dev_notify_flags() - rtmsg_ifinfo() - rtmsg_ifinfo_event() - rtmsg_ifinfo_build_skb() - rtmsg_ifinfo_send() - rtnl_notify() Also move __dev_notify_flags() declaration to net/core/dev.h, as Jakub suggested. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-30net: remove unused netdev_unregistering()Juhee Kang1-5/+0
Currently, use dev->reg_state == NETREG_UNREGISTERING to check the status which is NETREG_UNREGISTERING, rather than using netdev_unregistering. Also, A helper function which is netdev_unregistering on nedevice.h is no longer used. Thus, netdev_unregistering removes from netdevice.h. Signed-off-by: Juhee Kang <claudiajkang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-15Revert "net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}"Jakub Kicinski1-4/+6
This reverts commit 854701ba4c39afae2362ba19a580c461cb183e4f. We have more violations around, which leads to: WARNING: CPU: 2 PID: 1 at include/linux/cpumask.h:110 __netif_set_xps_queue+0x14e/0x770 Let's back this out and retry with a larger clean up in -next. Fixes: 854701ba4c39 ("net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}") Link: https://lore.kernel.org/all/20221014030459.3272206-2-guoren@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-10Merge tag 'bitmap-6.1-rc1' of https://github.com/norov/linuxLinus Torvalds1-6/+4
Pull bitmap updates from Yury Norov: - Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES (Phil Auld) - cleanup nr_cpu_ids vs nr_cpumask_bits mess (me) This series cleans that mess and adds new config FORCE_NR_CPUS that allows to optimize cpumask subsystem if the number of CPUs is known at compile-time. - optimize find_bit() functions (me) Reworks find_bit() functions based on new FIND_{FIRST,NEXT}_BIT() macros. - add find_nth_bit() (me) Adds find_nth_bit(), which is ~70 times faster than bitcounting with for_each() loop: for_each_set_bit(bit, mask, size) if (n-- == 0) return bit; Also adds bitmap_weight_and() to let people replace this pattern: tmp = bitmap_alloc(nbits); bitmap_and(tmp, map1, map2, nbits); weight = bitmap_weight(tmp, nbits); bitmap_free(tmp); with a single bitmap_weight_and() call. - repair cpumask_check() (me) After switching cpumask to use nr_cpu_ids, cpumask_check() started generating many false-positive warnings. This series fixes it. - Add for_each_cpu_andnot() and for_each_cpu_andnot() (Valentin Schneider) Extends the API with one more function and applies it in sched/core. * tag 'bitmap-6.1-rc1' of https://github.com/norov/linux: (28 commits) sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() lib/test_cpumask: Add for_each_cpu_and(not) tests cpumask: Introduce for_each_cpu_andnot() lib/find_bit: Introduce find_next_andnot_bit() cpumask: fix checking valid cpu range lib/bitmap: add tests for for_each() loops lib/find: optimize for_each() macros lib/bitmap: introduce for_each_set_bit_wrap() macro lib/find_bit: add find_next{,_and}_bit_wrap cpumask: switch for_each_cpu{,_not} to use for_each_bit() net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and} cpumask: add cpumask_nth_{,and,andnot} lib/bitmap: remove bitmap_ord_to_pos lib/bitmap: add tests for find_nth_bit() lib: add find_nth{,_and,_andnot}_bit() lib/bitmap: add bitmap_weight_and() lib/bitmap: don't call __bitmap_weight() in kernel code tools: sync find_bit() implementation lib/find_bit: optimize find_next_bit() functions lib/find_bit: create find_first_zero_bit_le() ...
2022-10-01net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}Yury Norov1-6/+4
The functions require to be passed with a cpu index prior to one that is the first to start search, so the valid input range is [-1, nr_cpu_ids-1). However, the code checks against [-1, nr_cpu_ids). Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-09-29net/sched: query offload capabilities through ndo_setup_tc()Vladimir Oltean1-0/+1
When adding optional new features to Qdisc offloads, existing drivers must reject the new configuration until they are coded up to act on it. Since modifying all drivers in lockstep with the changes in the Qdisc can create problems of its own, it would be nice if there existed an automatic opt-in mechanism for offloading optional features. Jakub proposes that we multiplex one more kind of call through ndo_setup_tc(): one where the driver populates a Qdisc-specific capability structure. First user will be taprio in further changes. Here we are introducing the definitions for the base functionality. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220923163310.3192733-3-vladimir.oltean@nxp.com/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: skb: introduce and use a single page frag cachePaolo Abeni1-0/+1
After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") we are observing 10-20% regressions in performance tests with small packets. The perf trace points to high pressure on the slab allocator. This change tries to improve the allocation schema for small packets using an idea originally suggested by Eric: a new per CPU page frag is introduced and used in __napi_alloc_skb to cope with small allocation requests. To ensure that the above does not lead to excessive truesize underestimation, the frag size for small allocation is inflated to 1K and all the above is restricted to build with 4K page size. Note that we need to update accordingly the run-time check introduced with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper"). Alex suggested a smart page refcount schema to reduce the number of atomic operations and deal properly with pfmemalloc pages. Under small packet UDP flood, I measure a 15% peak tput increases. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28net: drop the weight argument from netif_napi_addJakub Kicinski1-3/+2
We tell driver developers to always pass NAPI_POLL_WEIGHT as the weight to netif_napi_add(). This may be confusing to newcomers, drop the weight argument, those who really need to tweak the weight can use netif_napi_add_weight(). Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN Link: https://lore.kernel.org/r/20220927132753.750069-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-20net: introduce iterators over synced hw addressesVladimir Oltean1-0/+6
Some network drivers use __dev_mc_sync()/__dev_uc_sync() and therefore program the hardware only with addresses with a non-zero sync_cnt. Some of the above drivers also need to save/restore the address filtering lists when certain events happen, and they need to walk through the struct net_device :: uc and struct net_device :: mc lists. But these lists contain unsynced addresses too. To keep the appearance of an elementary form of data encapsulation, provide iterators through these lists that only look at entries with a non-zero sync_cnt, instead of filtering entries out from device drivers. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-02net: remove netif_tx_napi_add()Jakub Kicinski1-2/+0
All callers are now gone. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-02net: bql: add more documentationEric Dumazet1-6/+26
Add some documentation for netdev_tx_sent_queue() and netdev_tx_completed_queue() Stating that netdev_tx_completed_queue() must be called once per TX completion round is apparently not obvious for everybody. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-3/+17
drivers/net/ethernet/mellanox/mlx5/core/en_fs.c 21234e3a84c7 ("net/mlx5e: Fix use after free in mlx5e_fs_init()") c7eafc5ed068 ("net/mlx5e: Convert ethtool_steering member of flow_steering struct to pointer") https://lore.kernel.org/all/20220825104410.67d4709c@canb.auug.org.au/ https://lore.kernel.org/all/20220823055533.334471-1-saeed@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-25netdev: Use try_cmpxchg in napi_if_scheduled_mark_missedUros Bizjak1-2/+2
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in napi_if_scheduled_mark_missed. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails, enabling further code simplifications. Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20220822143243.2798-1-ubizjak@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-24net: Fix data-races around sysctl_devconf_inherit_init_net.Kuniyuki Iwashima1-0/+9
While reading sysctl_devconf_inherit_init_net, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 856c395cfa63 ("net: introduce a knob to control whether to inherit devconf config") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-24net: Fix data-races around sysctl_fb_tunnels_only_for_init_net.Kuniyuki Iwashima1-3/+8
While reading sysctl_fb_tunnels_only_for_init_net, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-22Remove DECnet support from kernelStephen Hemminger1-4/+0
DECnet is an obsolete network protocol that receives more attention from kernel janitors than users. It belongs in computer protocol history museum not in Linux kernel. It has been "Orphaned" in kernel since 2010. The iproute2 support for DECnet was dropped in 5.0 release. The documentation link on Sourceforge says it is abandoned there as well. Leave the UAPI alone to keep userspace programs compiling. This means that there is still an empty neighbour table for AF_DECNET. The table of /proc/sys/net entries was updated to match current directories and reformatted to be alphabetical. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Ahern <dsahern@kernel.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c 9c5de246c1db ("net: sparx5: mdb add/del handle non-sparx5 devices") fbb89d02e33a ("net: sparx5: Allow mdb entries to both CPU and ports") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-24net: fix IFF_TX_SKB_NO_LINEAR definitionDan Carpenter1-1/+1
The "1<<31" shift has a sign extension bug so IFF_TX_SKB_NO_LINEAR is 0xffffffff80000000 instead of 0x0000000080000000. Fixes: c2ff53d8049f ("net: Add priv_flags for allow tx skb without linear") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://lore.kernel.org/r/YrRrcGttfEVnf85Q@kili Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: adopt u64_stats_t in struct pcpu_sw_netstatsEric Dumazet1-8/+8
As explained in commit 316580b69d0a ("u64_stats: provide u64_stats_t type") we should use u64_stats_t and related accessors to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: rename reference+tracking helpersJakub Kicinski1-12/+12
Netdev reference helpers have a dev_ prefix for historic reasons. Renaming the old helpers would be too much churn but we can rename the tracking ones which are relatively recent and should be the default for new code. Rename: dev_hold_track() -> netdev_hold() dev_put_track() -> netdev_put() dev_replace_track() -> netdev_ref_replace() Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-22net: wrap the wireless pointers in struct net_device in an ifdefJakub Kicinski1-2/+6
Most protocol-specific pointers in struct net_device are under a respective ifdef. Wireless is the notable exception. Since there's a sizable number of custom-built kernels for datacenter workloads which don't build wireless it seems reasonable to ifdefy those pointers as well. While at it move IPv4 and IPv6 pointers up, those are special for obvious reasons. Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> # ieee802154 Acked-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
drivers/net/ethernet/mellanox/mlx5/core/main.c b33886971dbc ("net/mlx5: Initialize flow steering during driver probe") 40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support") f2b41b32cde8 ("net/mlx5: Remove ipsec_ops function table") https://lore.kernel.org/all/20220519040345.6yrjromcdistu7vh@sx1/ 16d42d313350 ("net/mlx5: Drain fw_reset when removing device") 8324a02c342a ("net/mlx5: Add exit route when waiting for FW") https://lore.kernel.org/all/20220519114119.060ce014@canb.auug.org.au/ tools/testing/selftests/net/mptcp/mptcp_join.sh e274f7154008 ("selftests: mptcp: add subflow limits test-cases") b6e074e171bc ("selftests: mptcp: add infinite map testcase") 5ac1d2d63451 ("selftests: mptcp: Add tests for userspace PM type") https://lore.kernel.org/all/20220516111918.366d747f@canb.auug.org.au/ net/mptcp/options.c ba2c89e0ea74 ("mptcp: fix checksum byte order") 1e39e5a32ad7 ("mptcp: infinite mapping sending") ea66758c1795 ("tcp: allow MPTCP to update the announced window") https://lore.kernel.org/all/20220519115146.751c3a37@canb.auug.org.au/ net/mptcp/pm.c 95d686517884 ("mptcp: fix subflow accounting on close") 4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs") https://lore.kernel.org/all/20220516111435.72f35dca@canb.auug.org.au/ net/mptcp/subflow.c ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure") 0348c690ed37 ("mptcp: add the fallback check") f8d4bcacff3b ("mptcp: infinite mapping receiving") https://lore.kernel.org/all/20220519115837.380bb8d4@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-16net: fix dev_fill_forward_path with pppoe + bridgeFelix Fietkau1-1/+1
When calling dev_fill_forward_path on a pppoe device, the provided destination address is invalid. In order for the bridge fdb lookup to succeed, the pppoe code needs to update ctx->daddr to the correct value. Fix this by storing the address inside struct net_device_path_ctx Fixes: f6efc675c9dd ("net: ppp: resolve forwarding path for bridge pppoe devices") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16net: fix possible race in skb_attempt_defer_free()Eric Dumazet1-0/+1
A cpu can observe sd->defer_count reaching 128, and call smp_call_function_single_async() Problem is that the remote CPU can clear sd->defer_count before the IPI is run/acknowledged. Other cpus can queue more packets and also decide to call smp_call_function_single_async() while the pending IPI was not yet delivered. This is a common issue with smp_call_function_single_async(). Callers must ensure correct synchronization and serialization. I triggered this issue while experimenting smaller threshold. Performing the call to smp_call_function_single_async() under sd->defer_lock protection did not solve the problem. Commit 5a18ceca6350 ("smp: Allow smp_call_function_single_async() to insert locked csd") replaced an informative WARN_ON_ONCE() with a return of -EBUSY, which is often ignored. Test of CSD_FLAG_LOCK presence is racy anyway. Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: allow gro_max_size to exceed 65536Alexander Duyck1-1/+5
Allow the gro_max_size to exceed a value larger than 65536. There weren't really any external limitations that prevented this other than the fact that IPv4 only supports a 16 bit length field. Since we have the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to exceed this value and for IPv4 and non-TCP flows we can cap things at 65536 via a constant rather than relying on gro_max_size. [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: limit GSO_MAX_SIZE to 524280 bytesEric Dumazet1-2/+5
Make sure we will not overflow shinfo->gso_segs Minimal TCP MSS size is 8 bytes, and shinfo->gso_segs is a 16bit field. TCP_MIN_GSO_SIZE is currently defined in include/net/tcp.h, it seems cleaner to not bring tcp details into include/linux/netdevice.h Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: allow gso_max_size to exceed 65536Alexander Duyck1-1/+3
The code for gso_max_size was added originally to allow for debugging and workaround of buggy devices that couldn't support TSO with blocks 64K in size. The original reason for limiting it to 64K was because that was the existing limits of IPv4 and non-jumbogram IPv6 length fields. With the addition of Big TCP we can remove this limit and allow the value to potentially go up to UINT_MAX and instead be limited by the tso_max_size value. So in order to support this we need to go through and clean up the remaining users of the gso_max_size value so that the values will cap at 64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper limit for GSO_MAX_SIZE. v6: (edumazet) fixed a compile error if CONFIG_IPV6=n, in a new sk_trim_gso_size() helper. netif_set_tso_max_size() caps the requested TSO size with GSO_MAX_SIZE. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-11net: add include/net/net_debug.hEric Dumazet1-140/+1
Remove from include/linux/netdevice.h helpers that send debug/info/warnings to syslog. We plan adding more helpers in following patches. v2: added two includes, and 'struct net_device' forward declaration to avoid compile errors (kernel bots) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-10ptp: Support late timestamp determinationGerhard Engleder1-0/+21
If a physical clock supports a free running cycle counter, then timestamps shall be based on this time too. For TX it is known in advance before the transmission if a timestamp based on the free running cycle counter is needed. For RX it is impossible to know which timestamp is needed before the packet is received and assigned to a socket. Support late timestamp determination by a network device. Therefore, an address/cookie is stored within the new netdev_data field of struct skb_shared_hwtstamps. This address/cookie is provided to a new network device function called ndo_get_tstamp(), which returns a timestamp based on the normal/adjustable time or based on the free running cycle counter. If function is not supported, then timestamp handling is not changed. This mechanism is intended for RX, but TX use is also possible. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-09rtnetlink: add extack support in fdb del handlersAlaa Mohamed1-1/+1
Add extack support to .ndo_fdb_del in netdevice.h and all related methods. Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-06net: move netif_set_gso_max helpersJakub Kicinski1-21/+0
These are now internal to the core, no need to expose them. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-06net: don't allow user space to lift the device limitsJakub Kicinski1-0/+9
Up until commit 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation") the gso_max_segs and gso_max_size of a device were not controlled from user space. The quoted commit added the ability to control them because of the following setup: netns A | netns B veth<->veth eth0 If eth0 has TSO limitations and user wants to efficiently forward traffic between eth0 and the veths they should copy the TSO limitations of eth0 onto the veths. This would happen automatically for macvlans or ipvlan but veth users are not so lucky (given the loose coupling). Unfortunately the commit in question allowed users to also override the limits on real HW devices. It may be useful to control the max GSO size and someone may be using that ability (not that I know of any user), so create a separate set of knobs to reliably record the TSO limitations. Validate the user requests. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-06net: add netif_inherit_tso_max()Jakub Kicinski1-0/+3
To make later patches smaller create a helper for inheriting the TSO limitations of a lower device. The TSO in the name is not an accident, subsequent patches will replace GSO with TSO in more names. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-03netdev: reshuffle netif_napi_add() APIs to allow dropping weightJakub Kicinski1-17/+33
Most drivers should not have to worry about selecting the right weight for their NAPI instances and pass NAPI_POLL_WEIGHT. It'd be best if we didn't require the argument at all and selected the default internally. This change prepares the ground for such reshuffling, allowing for a smooth transition. The following API should remain after the next release cycle: netif_napi_add() netif_napi_add_weight() netif_napi_add_tx() netif_napi_add_tx_weight() Where the _weight() variants take an explicit weight argument. I opted for a _weight() suffix rather than a __ prefix, because we use __ in places to mean that caller needs to also issue a synchronize_net() call. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20220502232703.396351-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-30net: inline dev_queue_xmit()Pavel Begunkov1-2/+12
Inline dev_queue_xmit() and dev_queue_xmit_accel(), they both are small proxy functions doing nothing but redirecting the control flow to __dev_queue_xmit(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-13/+10
include/linux/netdevice.h net/core/dev.c 6510ea973d8d ("net: Use this_cpu_inc() to increment net->core_stats") 794c24e9921f ("net-core: rx_otherhost_dropped to core_stats") https://lore.kernel.org/all/20220428111903.5f4304e0@canb.auug.org.au/ drivers/net/wan/cosa.c d48fea8401cf ("net: cosa: fix error check return value of register_chrdev()") 89fbca3307d4 ("net: wan: remove support for COSA and SRP synchronous serial boards") https://lore.kernel.org/all/20220428112130.1f689e5e@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-26net: Use this_cpu_inc() to increment net->core_statsSebastian Andrzej Siewior1-12/+9
The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes netdev_core_stats_alloc() to return a per-CPU pointer. netdev_core_stats_alloc() will allocate memory on its first invocation which breaks on PREEMPT_RT because it requires non-atomic context for memory allocation. This can be avoided by enabling preemption in netdev_core_stats_alloc() assuming the caller always disables preemption. It might be better to replace local_inc() with this_cpu_inc() now that dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does not rely on already disabled preemption. This results in less instructions on x86-64: local_inc: | incl %gs:__preempt_count(%rip) # __preempt_count | movq 488(%rdi), %rax # _1->core_stats, _22 | testq %rax, %rax # _22 | je .L585 #, | add %gs:this_cpu_off(%rip), %rax # this_cpu_off, tcp_ptr__ | .L586: | testq %rax, %rax # _27 | je .L587 #, | incq (%rax) # _6->a.counter | .L587: | decl %gs:__preempt_count(%rip) # __preempt_count this_cpu_inc(), this patch: | movq 488(%rdi), %rax # _1->core_stats, _5 | testq %rax, %rax # _5 | je .L591 #, | .L585: | incq %gs:(%rax) # _18->rx_dropped Use unsigned long as type for the counter. Use this_cpu_inc() to increment the counter. Use a plain read of the counter. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-26net: generalize skb freeing deferral to per-cpu listsEric Dumazet1-0/+5
Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket lock is released") helped bulk TCP flows to move the cost of skbs frees outside of critical section where socket lock was held. But for RPC traffic, or hosts with RFS enabled, the solution is far from being ideal. For RPC traffic, recvmsg() has to return to user space right after skb payload has been consumed, meaning that BH handler has no chance to pick the skb before recvmsg() thread. This issue is more visible with BIG TCP, as more RPC fit one skb. For RFS, even if BH handler picks the skbs, they are still picked from the cpu on which user thread is running. Ideally, it is better to free the skbs (and associated page frags) on the cpu that originally allocated them. This patch removes the per socket anchor (sk->defer_list) and instead uses a per-cpu list, which will hold more skbs per round. This new per-cpu list is drained at the end of net_action_rx(), after incoming packets have been processed, to lower latencies. In normal conditions, skbs are added to the per-cpu list with no further action. In the (unlikely) cases where the cpu does not run net_action_rx() handler fast enough, we use an IPI to raise NET_RX_SOFTIRQ on the remote cpu. Also, we do not bother draining the per-cpu list from dev_cpu_dead() This is because skbs in this list have no requirement on how fast they should be freed. Note that we can add in the future a small per-cpu cache if we see any contention on sd->defer_lock. Tested on a pair of hosts with 100Gbit NIC, RFS enabled, and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around page recycling strategy used by NIC driver (its page pool capacity being too small compared to number of skbs/pages held in sockets receive queues) Note that this tuning was only done to demonstrate worse conditions for skb freeing for this particular test. These conditions can happen in more general production workload. 10 runs of one TCP_STREAM flow Before: Average throughput: 49685 Mbit. Kernel profiles on cpu running user thread recvmsg() show high cost for skb freeing related functions (*) 57.81% [kernel] [k] copy_user_enhanced_fast_string (*) 12.87% [kernel] [k] skb_release_data (*) 4.25% [kernel] [k] __free_one_page (*) 3.57% [kernel] [k] __list_del_entry_valid 1.85% [kernel] [k] __netif_receive_skb_core 1.60% [kernel] [k] __skb_datagram_iter (*) 1.59% [kernel] [k] free_unref_page_commit (*) 1.16% [kernel] [k] __slab_free 1.16% [kernel] [k] _copy_to_iter (*) 1.01% [kernel] [k] kfree (*) 0.88% [kernel] [k] free_unref_page 0.57% [kernel] [k] ip6_rcv_core 0.55% [kernel] [k] ip6t_do_table 0.54% [kernel] [k] flush_smp_call_function_queue (*) 0.54% [kernel] [k] free_pcppages_bulk 0.51% [kernel] [k] llist_reverse_order 0.38% [kernel] [k] process_backlog (*) 0.38% [kernel] [k] free_pcp_prepare 0.37% [kernel] [k] tcp_recvmsg_locked (*) 0.37% [kernel] [k] __list_add_valid 0.34% [kernel] [k] sock_rfree 0.34% [kernel] [k] _raw_spin_lock_irq (*) 0.33% [kernel] [k] __page_cache_release 0.33% [kernel] [k] tcp_v6_rcv (*) 0.33% [kernel] [k] __put_page (*) 0.29% [kernel] [k] __mod_zone_page_state 0.27% [kernel] [k] _raw_spin_lock After patch: Average throughput: 73076 Mbit. Kernel profiles on cpu running user thread recvmsg() looks better: 81.35% [kernel] [k] copy_user_enhanced_fast_string 1.95% [kernel] [k] _copy_to_iter 1.95% [kernel] [k] __skb_datagram_iter 1.27% [kernel] [k] __netif_receive_skb_core 1.03% [kernel] [k] ip6t_do_table 0.60% [kernel] [k] sock_rfree 0.50% [kernel] [k] tcp_v6_rcv 0.47% [kernel] [k] ip6_rcv_core 0.45% [kernel] [k] read_tsc 0.44% [kernel] [k] _raw_spin_lock_irqsave 0.37% [kernel] [k] _raw_spin_lock 0.37% [kernel] [k] native_irq_return_iret 0.33% [kernel] [k] __inet6_lookup_established 0.31% [kernel] [k] ip6_protocol_deliver_rcu 0.29% [kernel] [k] tcp_rcv_established 0.29% [kernel] [k] llist_reverse_order v2: kdoc issue (kernel bots) do not defer if (alloc_cpu == smp_processor_id()) (Paolo) replace the sk_buff_head with a single-linked list (Jakub) add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-19net: sched: use queue_mapping to pick tx queueTonghao Zhang1-0/+3
This patch fixes issue: * If we install tc filters with act_skbedit in clsact hook. It doesn't work, because netdev_core_pick_tx() overwrites queue_mapping. $ tc filter ... action skbedit queue_mapping 1 And this patch is useful: * We can use FQ + EDT to implement efficient policies. Tx queues are picked by xps, ndo_select_queue of netdev driver, or skb hash in netdev_core_pick_tx(). In fact, the netdev driver, and skb hash are _not_ under control. xps uses the CPUs map to select Tx queues, but we can't figure out which task_struct of pod/containter running on this cpu in most case. We can use clsact filters to classify one pod/container traffic to one Tx queue. Why ? In containter networking environment, there are two kinds of pod/ containter/net-namespace. One kind (e.g. P1, P2), the high throughput is key in these applications. But avoid running out of network resource, the outbound traffic of these pods is limited, using or sharing one dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods (e.g. Pn), the low latency of data access is key. And the traffic is not limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc. This choice provides two benefits. First, contention on the HTB/FQ Qdisc lock is significantly reduced since fewer CPUs contend for the same queue. More importantly, Qdisc contention can be eliminated completely if each CPU has its own FIFO Qdisc for the second kind of pods. There must be a mechanism in place to support classifying traffic based on pods/container to different Tx queues. Note that clsact is outside of Qdisc while Qdisc can run a classifier to select a sub-queue under the lock. In general recording the decision in the skb seems a little heavy handed. This patch introduces a per-CPU variable, suggested by Eric. The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit(). - Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag is set in qdisc->enqueue() though tx queue has been selected in netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared firstly in __dev_queue_xmit(), is useful: - Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy: For example, eth0, macvlan in pod, which root Qdisc install skbedit queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping because there is no filters in clsact or tx Qdisc of this netdev. Same action taked in eth0, ixgbe in Host. - Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it in __dev_queue_xmit when processing next packets. For performance reasons, use the static key. If user does not config the NET_EGRESS, the patch will not be compiled. +----+ +----+ +----+ | P1 | | P2 | | Pn | +----+ +----+ +----+ | | | +-----------+-----------+ | | clsact/skbedit | MQ v +-----------+-----------+ | q0 | q1 | qn v v v HTB/FQ HTB/FQ ... FIFO Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Talal Ahmad <talalahmad@google.com> Cc: Kevin Hao <haokexin@gmail.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Kees Cook <keescook@chromium.org> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Antoine Tenart <atenart@kernel.org> Cc: Wei Wang <weiwan@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-13net: add ndo_fdb_del_bulkNikolay Aleksandrov1-0/+9
Add a new netdev op called ndo_fdb_del_bulk, it will be later used for driver-specific bulk delete implementation dispatched from rtnetlink. The first user will be the bridge, we need it to signal to rtnetlink from the driver that we support bulk delete operation (NLM_F_BULK). Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-07net-core: rx_otherhost_dropped to core_statsJeffrey Ji1-0/+2
Increment rx_otherhost_dropped counter when packet dropped due to mismatched dest MAC addr. An example when this drop can occur is when manually crafting raw packets that will be consumed by a user space application via a tap device. For testing purposes local traffic was generated using trafgen for the client and netcat to start a server Tested: Created 2 netns, sent 1 packet using trafgen from 1 to the other with "{eth(daddr=$INCORRECT_MAC...}", verified that iproute2 showed the counter was incremented. (Also had to modify iproute2 to show the stat, additional patch for that coming next.) Signed-off-by: Jeffrey Ji <jeffreyji@google.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220406172600.1141083-1-jeffreyjilinux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-07net: extract a few internals from netdevice.hJakub Kicinski1-70/+2
There's a number of functions and static variables used under net/core/ but not from the outside. We currently dump most of them into netdevice.h. That bad for many reasons: - netdevice.h is very cluttered, hard to figure out what the APIs are; - netdevice.h is very long; - we have to touch netdevice.h more which causes expensive incremental builds. Create a header under net/core/ and move some declarations. The new header is also a bit of a catch-all but that's fine, if we create more specific headers people will likely over-think where their declaration fit best. And end up putting them in netdevice.h, again. More work should be done on splitting netdevice.h into more targeted headers, but that'd be more time consuming so small steps. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-06net: ethernet: mtk_eth_soc: implement flow offloading to WED devicesFelix Fietkau1-0/+7
This allows hardware flow offloading from Ethernet to WLAN on MT7622 SoC Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-05net: ensure net_todo_list is processed quicklyJohannes Berg1-1/+2
In [1], Will raised a potential issue that the cfg80211 code, which does (from a locking perspective) rtnl_lock() wiphy_lock() rtnl_unlock() might be suspectible to ABBA deadlocks, because rtnl_unlock() calls netdev_run_todo(), which might end up calling rtnl_lock() again, which could then deadlock (see the comment in the code added here for the scenario). Some back and forth and thinking ensued, but clearly this can't happen if the net_todo_list is empty at the rtnl_unlock() here. Clearly, the code here cannot actually put an entry on it, and all other users of rtnl_unlock() will empty it since that will always go through netdev_run_todo(), emptying the list. So the only other way to get there would be to add to the list and then unlock the RTNL without going through rtnl_unlock(), which is only possible through __rtnl_unlock(). However, this isn't exported and not used in many places, and none of them seem to be able to unregister before using it. Therefore, add a WARN_ON() in the code to ensure this invariant won't be broken, so that the cfg80211 (or any similar) code stays safe. [1] https://lore.kernel.org/r/Yjzpo3TfZxtKPMAG@google.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20220404113847.0ee02e4a70da.Ic73d206e217db20fd22dcec14fe5442ca732804b@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-28net: move net_unlink_todo() out of the headerJohannes Berg1-10/+0
There's no reason for this to be in netdevice.h, it's all just used in dev.c. Also make it no longer inline and let the compiler decide to do that by itself. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20220325225023.f49b9056fe1c.I6b901a2df00000837a9bd251a8dd259bd23f5ded@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-22netdevice: add missing dm_private kdocJakub Kicinski1-0/+2
Building htmldocs complains: include/linux/netdevice.h:2295: warning: Function parameter or member 'dm_private' not described in 'net_device' Fixes: b26ef81c46ed ("drop_monitor: remove quadratic behavior") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220322051053.1883186-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-03-18net: set default rss queues num to physical cores / 2Íñigo Huguet1-1/+0
Network drivers can call to netif_get_num_default_rss_queues to get the default number of receive queues to use. Right now, this default number is min(8, num_online_cpus()). Instead, as suggested by Jakub, use the number of physical cores divided by 2 as a way to avoid wasting CPU resources and to avoid using both CPU threads, but still allowing to scale for high-end processors with many cores. As an exception, select 2 queues for processors with 2 cores, because otherwise it won't take any advantage of RSS despite being SMP capable. Tested: Processor Intel Xeon E5-2620 (2 sockets, 6 cores/socket, 2 threads/core). NIC Broadcom NetXtreme II BCM57810 (10GBps). Ran some tests with `perf stat iperf3 -R`, with parallelisms of 1, 8 and 24, getting the following results: - Number of queues: 6 (instead of 8) - Network throughput: not affected - CPU usage: utilized 0.05-0.12 CPUs more than before (having 24 CPUs this is only 0.2-0.5% higher) - Reduced the number of context switches by 7-50%, being more noticeable when using a higher number of parallel threads. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> Link: https://lore.kernel.org/r/20220315091832.13873-1-ihuguet@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-14net: disable preemption in dev_core_stats_XXX_inc() helpersEric Dumazet1-1/+5
syzbot was kind enough to remind us that dev->{tx_dropped|rx_dropped} could be increased in process context. BUG: using smp_processor_id() in preemptible [00000000] code: syz-executor413/3593 caller is netdev_core_stats_alloc+0x98/0x110 net/core/dev.c:10298 CPU: 1 PID: 3593 Comm: syz-executor413 Not tainted 5.17.0-rc7-syzkaller-02426-g97aeb877de7f #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 check_preemption_disabled+0x16b/0x170 lib/smp_processor_id.c:49 netdev_core_stats_alloc+0x98/0x110 net/core/dev.c:10298 dev_core_stats include/linux/netdevice.h:3855 [inline] dev_core_stats_rx_dropped_inc include/linux/netdevice.h:3866 [inline] tun_get_user+0x3455/0x3ab0 drivers/net/tun.c:1800 tun_chr_write_iter+0xe1/0x200 drivers/net/tun.c:2015 call_write_iter include/linux/fs.h:2074 [inline] new_sync_write+0x431/0x660 fs/read_write.c:503 vfs_write+0x7cd/0xae0 fs/read_write.c:590 ksys_write+0x12d/0x250 fs/read_write.c:643 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f2cf4f887e3 Code: 5d 41 5c 41 5d 41 5e e9 9b fd ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18 RSP: 002b:00007ffd50dd5fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007ffd50dd6000 RCX: 00007f2cf4f887e3 RDX: 000000000000002a RSI: 0000000000000000 RDI: 00000000000000c8 RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd50dd5ff0 R14: 00007ffd50dd5fe8 R15: 00007ffd50dd5fe4 </TASK> Fixes: 625788b58445 ("net: add per-cpu storage and net->core_stats") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: jeffreyji <jeffreyji@google.com> Cc: Brian Vazquez <brianvv@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220312214505.3294762-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-11net: add per-cpu storage and net->core_statsEric Dumazet1-9/+37
Before adding yet another possibly contended atomic_long_t, it is time to add per-cpu storage for existing ones: dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler Because many devices do not have to increment such counters, allocate the per-cpu storage on demand, so that dev_get_stats() does not have to spend considerable time folding zero counters. Note that some drivers have abused these counters which were supposed to be only used by core networking stack. v4: should use per_cpu_ptr() in dev_get_stats() (Jakub) v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo) v2: add a missing include (reported by kernel test robot <lkp@intel.com>) Change in netdev_core_stats_alloc() (Jakub) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: jeffreyji <jeffreyji@google.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+2
net/dsa/dsa2.c commit afb3cc1a397d ("net: dsa: unlock the rtnl_mutex when dsa_master_setup() fails") commit e83d56537859 ("net: dsa: replay master state events in dsa_tree_{setup,teardown}_master") https://lore.kernel.org/all/20220307101436.7ae87da0@canb.auug.org.au/ drivers/net/ethernet/intel/ice/ice.h commit 97b0129146b1 ("ice: Fix error with handling of bonding MTU") commit 43113ff73453 ("ice: add TTY for GNSS module for E810T device") https://lore.kernel.org/all/20220310112843.3233bcf1@canb.auug.org.au/ drivers/staging/gdm724x/gdm_lte.c commit fc7f750dc9d1 ("staging: gdm724x: fix use after free in gdm_lte_rx()") commit 4bcc4249b4cf ("staging: Use netif_rx().") https://lore.kernel.org/all/20220308111043.1018a59d@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-07net: Fix esp GSO on inter address family tunnels.Steffen Klassert1-0/+2
The esp tunnel GSO handlers use skb_mac_gso_segment to push the inner packet to the segmentation handlers. However, skb_mac_gso_segment takes the Ethernet Protocol ID from 'skb->protocol' which is wrong for inter address family tunnels. We fix this by introducing a new skb_eth_gso_segment function. This function can be used if it is necessary to pass the Ethernet Protocol ID directly to the segmentation handler. First users of this function will be the esp4 and esp6 tunnel segmentation handlers. Fixes: c35fe4106b92 ("xfrm: Add mode handlers for IPsec on layer 2") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-03-07net: Remove netif_rx_any_context() and netif_rx_ni().Sebastian Andrzej Siewior1-10/+0
Remove netif_rx_any_context and netif_rx_ni() because there are no more users in tree. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-03net: dev: Add hardware stats supportPetr Machata1-0/+42
Offloading switch device drivers may be able to collect statistics of the traffic taking place in the HW datapath that pertains to a certain soft netdevice, such as VLAN. Add the necessary infrastructure to allow exposing these statistics to the offloaded netdevice in question. The API was shaped by the following considerations: - Collection of HW statistics is not free: there may be a finite number of counters, and the act of counting may have a performance impact. It is therefore necessary to allow toggling whether HW counting should be done for any particular SW netdevice. - As the drivers are loaded and removed, a particular device may get offloaded and unoffloaded again. At the same time, the statistics values need to stay monotonic (modulo the eventual 64-bit wraparound), increasing only to reflect traffic measured in the device. To that end, the netdevice keeps around a lazily-allocated copy of struct rtnl_link_stats64. Device drivers then contribute to the values kept therein at various points. Even as the driver goes away, the struct stays around to maintain the statistics values. - Different HW devices may be able to count different things. The motivation behind this patch in particular is exposure of HW counters on Nvidia Spectrum switches, where the only practical approach to counting traffic on offloaded soft netdevices currently is to use router interface counters, and count L3 traffic. Correspondingly that is the statistics suite added in this patch. Other devices may be able to measure different kinds of traffic, and for that reason, the APIs are built to allow uniform access to different statistics suites. - Because soft netdevices and offloading drivers are only loosely bound, a netdevice uses a notifier chain to communicate with the drivers. Several new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages to the offloading drivers. - Devices can have various conditions for when a particular counter is available. As the device is configured and reconfigured, the device offload may become or cease being suitable for counter binding. A netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to ping offloading drivers and determine whether anyone currently implements a given statistics suite. This information can then be propagated to user space. When the driver decides to unoffload a netdevice, it can use a newly-added function, netdev_offload_xstats_report_delta(), to record outstanding collected statistics, before destroying the HW counter. This patch adds a helper, call_netdevice_notifiers_info_robust(), for dispatching a notifier with the possibility of unwind when one of the consumers bails. Given the wish to eventually get rid of the global notifier block altogether, this helper only invokes the per-netns notifier block. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-23drop_monitor: remove quadratic behaviorEric Dumazet1-1/+3
drop_monitor is using an unique list on which all netdevices in the host have an element, regardless of their netns. This scales poorly, not only at device unregister time (what I caught during my netns dismantle stress tests), but also at packet processing time whenever trace_napi_poll_hit() is called. If the intent was to avoid adding one pointer in 'struct net_device' then surely we prefer O(1) behavior. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-14net: dev: Makes sure netif_rx() can be invoked in any context.Sebastian Andrzej Siewior1-2/+12
Dave suggested a while ago (eleven years by now) "Let's make netif_rx() work in all contexts and get rid of netif_rx_ni()". Eric agreed and pointed out that modern devices should use netif_receive_skb() to avoid the overhead. In the meantime someone added another variant, netif_rx_any_context(), which behaves as suggested. netif_rx() must be invoked with disabled bottom halves to ensure that pending softirqs, which were raised within the function, are handled. netif_rx_ni() can be invoked only from process context (bottom halves must be enabled) because the function handles pending softirqs without checking if bottom halves were disabled or not. netif_rx_any_context() invokes on the former functions by checking in_interrupts(). netif_rx() could be taught to handle both cases (disabled and enabled bottom halves) by simply disabling bottom halves while invoking netif_rx_internal(). The local_bh_enable() invocation will then invoke pending softirqs only if the BH-disable counter drops to zero. Eric is concerned about the overhead of BH-disable+enable especially in regard to the loopback driver. As critical as this driver is, it will receive a shortcut to avoid the additional overhead which is not needed. Add a local_bh_disable() section in netif_rx() to ensure softirqs are handled if needed. Provide __netif_rx() which does not disable BH and has a lockdep assert to ensure that interrupts are disabled. Use this shortcut in the loopback driver and in drivers/net/*.c. Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they can be removed once they are no more users left. Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14net_sched: add __rcu annotation to netdev->qdiscEric Dumazet1-1/+1
syzbot found a data-race [1] which lead me to add __rcu annotations to netdev->qdisc, and proper accessors to get LOCKDEP support. [1] BUG: KCSAN: data-race in dev_activate / qdisc_lookup_rcu write to 0xffff888168ad6410 of 8 bytes by task 13559 on cpu 1: attach_default_qdiscs net/sched/sch_generic.c:1167 [inline] dev_activate+0x2ed/0x8f0 net/sched/sch_generic.c:1221 __dev_open+0x2e9/0x3a0 net/core/dev.c:1416 __dev_change_flags+0x167/0x3f0 net/core/dev.c:8139 rtnl_configure_link+0xc2/0x150 net/core/rtnetlink.c:3150 __rtnl_newlink net/core/rtnetlink.c:3489 [inline] rtnl_newlink+0xf4d/0x13e0 net/core/rtnetlink.c:3529 rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594 netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmsg+0x195/0x230 net/socket.c:2496 __do_sys_sendmsg net/socket.c:2505 [inline] __se_sys_sendmsg net/socket.c:2503 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888168ad6410 of 8 bytes by task 13560 on cpu 0: qdisc_lookup_rcu+0x30/0x2e0 net/sched/sch_api.c:323 __tcf_qdisc_find+0x74/0x3a0 net/sched/cls_api.c:1050 tc_del_tfilter+0x1c7/0x1350 net/sched/cls_api.c:2211 rtnetlink_rcv_msg+0x5ba/0x7e0 net/core/rtnetlink.c:5585 netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmsg+0x195/0x230 net/socket.c:2496 __do_sys_sendmsg net/socket.c:2505 [inline] __se_sys_sendmsg net/socket.c:2503 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0xffffffff85dee080 -> 0xffff88815d96ec00 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 13560 Comm: syz-executor.2 Not tainted 5.17.0-rc3-syzkaller-00116-gf1baf68e1383-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 470502de5bdb ("net: sched: unlock rules update API") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@mellanox.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-08net: add dev->dev_registered_trackerEric Dumazet1-0/+3
Convert one dev_hold()/dev_put() pair in register_netdevice() and unregister_netdevice_many() to dev_hold_track() and dev_put_track(). This would allow to detect a rogue dev_put() a bit earlier. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-05net: refine dev_put()/dev_hold() debuggingEric Dumazet1-23/+46
We are still chasing some syzbot reports where we think a rogue dev_put() is called with no corresponding prior dev_hold(). Unfortunately it eats a reference on dev->dev_refcnt taken by innocent dev_hold_track(), meaning that the refcount saturation splat comes too late to be useful. Make sure that 'not tracked' dev_put() and dev_hold() better use CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure: Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free() to be called with a NULL @trackerp parameter, and to use a separate refcount only to detect too many put() even in the following case: dev_hold_track(dev, tracker_1, GFP_ATOMIC); dev_hold(dev); dev_put(dev); dev_put(dev); // Should complain loudly here. dev_put_track(dev, tracker_1); // instead of here Add clarification about netdev_tracker_alloc() role. v2: I replaced the dev_put() in linkwatch_do_dev() with __dev_put() because callers called netdev_tracker_free(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-20net: fix information leakage in /proc/net/ptypeCongyu Liu1-0/+1
In one net namespace, after creating a packet socket without binding it to a device, users in other net namespaces can observe the new `packet_type` added by this packet socket by reading `/proc/net/ptype` file. This is minor information leakage as packet socket is namespace aware. Add a net pointer in `packet_type` to keep the net namespace of of corresponding packet socket. In `ptype_seq_show`, this net pointer must be checked when it is not NULL. Fixes: 2feb27dbe00c ("[NETNS]: Minor information leak via /proc/net/ptype file.") Signed-off-by: Congyu Liu <liu3101@purdue.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06gro: add ability to control gro max packet sizeCoco Li1-0/+11
Eric Dumazet suggested to allow users to modify max GRO packet size. We have seen GRO being disabled by users of appliances (such as wifi access points) because of claimed bufferbloat issues, or some work arounds in sch_cake, to split GRO/GSO packets. Instead of disabling GRO completely, one can chose to limit the maximum packet size of GRO packets, depending on their latency constraints. This patch adds a per device gro_max_size attribute that can be changed with ip link command. ip link set dev eth0 gro_max_size 16000 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-30net: remove references to CONFIG_IRDA in network header filesLukas Bulwahn1-1/+1
Commit d64c2a76123f ("staging: irda: remove the irda network stack and drivers") removes the config IRDA. Remove the remaining references to this non-existing config in the network header files. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20211229113620.19368-1-lukas.bulwahn@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
include/net/sock.h commit 8f905c0e7354 ("inet: fully convert sk->sk_rx_dst to RCU rules") commit 43f51df41729 ("net: move early demux fields close to sk_refcnt") https://lore.kernel.org/all/20211222141641.0caa0ab3@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-19flow_offload: allow user to offload tc action to net deviceBaowen Zheng1-0/+1
Use flow_indr_dev_register/flow_indr_dev_setup_offload to offload tc action. We need to call tc_cleanup_flow_action to clean up tc action entry since in tc_setup_action, some actions may hold dev refcnt, especially the mirror action. Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-17net: fix typo in a commentXiang wangx1-1/+1
The double 'as' in a comment is repeated, thus it should be removed. Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-15net: add net device refcount tracker to struct packet_typeEric Dumazet1-0/+1
Most notable changes are in af_packet, tipc ones are trivial. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jon Maloy <jmaloy@redhat.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-14net: dev_replace_track() cleanupEric Dumazet1-6/+4
Use existing helpers (netdev_tracker_free() and netdev_tracker_alloc()) to remove ifdefery. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20211214151515.312535-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-10net: add networking namespace refcount trackerEric Dumazet1-8/+1
We have 100+ syzbot reports about netns being dismantled too soon, still unresolved as of today. We think a missing get_net() or an extra put_net() is the root cause. In order to find the bug(s), and be able to spot future ones, this patch adds CONFIG_NET_NS_REFCNT_TRACKER and new helpers to precisely pair all put_net() with corresponding get_net(). To use these helpers, each data structure owning a refcount should also use a "netns_tracker" to pair the get and put. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-07net: watchdog: add net device refcount trackerEric Dumazet1-0/+2
Add a netdevice_tracker inside struct net_device, to track the self reference when a device has an active watchdog timer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06net: linkwatch: add net device refcount trackerEric Dumazet1-0/+2
Add a netdevice_tracker inside struct net_device, to track the self reference when a device is in lweventlist. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06net: dst: add net device refcount tracking to dst_entryEric Dumazet1-0/+17
We want to track all dev_hold()/dev_put() to ease leak hunting. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06net: add net device refcount tracker to struct netdev_queueEric Dumazet1-0/+2
This will help debugging pesky netdev reference leaks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06net: add net device refcount tracker to struct netdev_rx_queueEric Dumazet1-0/+2
This helps debugging net device refcount leaks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06net: add net device refcount tracker infrastructureEric Dumazet1-0/+45
net device are refcounted. Over the years we had numerous bugs caused by imbalanced dev_hold() and dev_put() calls. The general idea is to be able to precisely pair each decrement with a corresponding prior increment. Both share a cookie, basically a pointer to private data storing stack traces. This patch adds dev_hold_track() and dev_put_track(). To use these helpers, each data structure owning a refcount should also use a "netdevice_tracker" to pair the hold and put. netdevice_tracker dev_tracker; ... dev_hold_track(dev, &dev_tracker, GFP_ATOMIC); ... dev_put_track(dev, &dev_tracker); Whenever a leak happens, we will get precise stack traces of the point dev_hold_track() happened, at device dismantle phase. We will also get a stack trace if too many dev_put_track() for the same netdevice_tracker are attempted. This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-6/+13
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-01net: annotate data-races on txq->xmit_lock_ownerEric Dumazet1-6/+13
syzbot found that __dev_queue_xmit() is reading txq->xmit_lock_owner without annotations. No serious issue there, let's document what is happening there. BUG: KCSAN: data-race in __dev_queue_xmit / __dev_queue_xmit write to 0xffff888139d09484 of 4 bytes by interrupt on cpu 0: __netif_tx_unlock include/linux/netdevice.h:4437 [inline] __dev_queue_xmit+0x948/0xf70 net/core/dev.c:4229 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_hh_output include/net/neighbour.h:511 [inline] neigh_output include/net/neighbour.h:525 [inline] ip6_finish_output2+0x995/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x3e/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 read to 0xffff888139d09484 of 4 bytes by interrupt on cpu 1: __dev_queue_xmit+0x5e3/0xf70 net/core/dev.c:4213 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_resolve_output+0x3db/0x410 net/core/neighbour.c:1523 neigh_output include/net/neighbour.h:527 [inline] ip6_finish_output2+0x9be/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x8d/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 kcsan_setup_watchpoint+0x94/0x420 kernel/kcsan/core.c:443 folio_test_anon include/linux/page-flags.h:581 [inline] PageAnon include/linux/page-flags.h:586 [inline] zap_pte_range+0x5ac/0x10e0 mm/memory.c:1347 zap_pmd_range mm/memory.c:1467 [inline] zap_pud_range mm/memory.c:1496 [inline] zap_p4d_range mm/memory.c:1517 [inline] unmap_page_range+0x2dc/0x3d0 mm/memory.c:1538 unmap_single_vma+0x157/0x210 mm/memory.c:1583 unmap_vmas+0xd0/0x180 mm/memory.c:1615 exit_mmap+0x23d/0x470 mm/mmap.c:3170 __mmput+0x27/0x1b0 kernel/fork.c:1113 mmput+0x3d/0x50 kernel/fork.c:1134 exit_mm+0xdb/0x170 kernel/exit.c:507 do_exit+0x608/0x17a0 kernel/exit.c:819 do_group_exit+0xce/0x180 kernel/exit.c:929 get_signal+0xfc3/0x1550 kernel/signal.c:2852 arch_do_signal_or_restart+0x8c/0x2e0 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300 do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0xffffffff Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 28712 Comm: syz-executor.0 Tainted: G W 5.16.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211130170155.2331929-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>