.. SPDX-License-Identifier: GPL-2.0 ===================================== Network Devices, the Kernel, and You! ===================================== Introduction ============ The following is a random collection of documentation regarding network devices. It is intended for driver developers. struct net_device lifetime rules ================================ Network device structures need to persist even after module is unloaded and must be allocated with alloc_netdev_mqs() and friends. If device has registered successfully, it will be freed on last use by free_netdev(). This is required to handle the pathological case cleanly (example: ``rmmod mydriver needs_free_netdev = true; } static void my_destructor(struct net_device *dev) { some_obj_destroy(priv->obj); some_uninit(priv); } int create_link() { struct my_device_priv *priv; int err; ASSERT_RTNL(); dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup); if (!dev) return -ENOMEM; priv = netdev_priv(dev); /* Implicit constructor */ err = some_init(priv); if (err) goto err_free_dev; priv->obj = some_obj_create(); if (!priv->obj) { err = -ENOMEM; goto err_some_uninit; } /* End of constructor, set the destructor: */ dev->priv_destructor = my_destructor; err = register_netdevice(dev); if (err) /* register_netdevice() calls destructor on failure */ goto err_free_dev; /* If anything fails now unregister_netdevice() (or unregister_netdev()) * will take care of calling my_destructor and free_netdev(). */ return 0; err_some_uninit: some_uninit(priv); err_free_dev: free_netdev(dev); return err; } If struct net_device.priv_destructor is set it will be called by the core some time after unregister_netdevice(), it will also be called if register_netdevice() fails. The callback may be invoked with or without ``rtnl_lock`` held. There is no explicit constructor callback, driver "constructs" the private netdev state after allocating it and before registration. Setting struct net_device.needs_free_netdev makes core call free_netdevice() automatically after unregister_netdevice() when all references to the device are gone. It only takes effect after a successful call to register_netdevice() so if register_netdevice() fails driver is responsible for calling free_netdev(). free_netdev() is safe to call on error paths right after unregister_netdevice() or when register_netdevice() fails. Parts of netdev (de)registration process happen after ``rtnl_lock`` is released, therefore in those cases free_netdev() will defer some of the processing until ``rtnl_lock`` is released. Devices spawned from struct rtnl_link_ops should never free the struct net_device directly. .ndo_init and .ndo_uninit ~~~~~~~~~~~~~~~~~~~~~~~~~ ``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device registration and de-registration, under ``rtnl_lock``. Drivers can use those e.g. when parts of their init process need to run under ``rtnl_lock``. ``.ndo_init`` runs before device is visible in the system, ``.ndo_uninit`` runs during de-registering after device is closed but other subsystems may still have outstanding references to the netdevice. MTU === Each network device has a Maximum Transfer Unit. The MTU does not include any link layer protocol overhead. Upper layer protocols must not pass a socket buffer (skb) to a device to transmit with more data than the mtu. The MTU does not include link layer header overhead, so for example on Ethernet if the standard MTU is 1500 bytes used, the actual skb will contain up to 1514 bytes because of the Ethernet header. Devices should allow for the 4 byte VLAN header as well. Segmentation Offload (GSO, TSO) is an exception to this rule. The upper layer protocol may pass a large socket buffer to the device transmit routine, and the device will break that up into separate packets based on the current MTU. MTU is symmetrical and applies both to receive and transmit. A device must be able to receive at least the maximum size packet allowed by the MTU. A network device may use the MTU as mechanism to size receive buffers, but the device should allow packets with VLAN header. With standard Ethernet mtu of 1500 bytes, the device should allow up to 1518 byte packets (1500 + 14 header + 4 tag). The device may either: drop, truncate, or pass up oversize packets, but dropping oversize packets is preferred. struct net_device synchronization rules ======================================= ndo_open: Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API. Context: process ndo_stop: Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API. Context: process Note: netif_running() is guaranteed false ndo_do_ioctl: Synchronization: rtnl_lock() semaphore. This is only called by network subsystems internally, not by user space calling ioctl as it was in before linux-5.14. ndo_siocbond: Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API. Context: process Used by the bonding driver for the SIOCBOND family of ioctl commands. ndo_siocwandev: Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API. Context: process Used by the drivers/net/wan framework to handle the SIOCWANDEV ioctl with the if_settings structure. ndo_siocdevprivate: Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API. Context: process This is used to implement SIOCDEVPRIVATE ioctl helpers. These should not be added to new drivers, so don't use. ndo_eth_ioctl: Synchronization: rtnl_lock() semaphore. In addition, netdev instance lock if the driver implements queue management or shaper API. Context: process ndo_get_stats: Synchronization: RCU (can be called concurrently with the stats update path). Context: atomic (can't sleep under RCU) ndo_start_xmit: Synchronization: __netif_tx_lock spinlock. When the driver sets dev->lltx this will be called without holding netif_tx_lock. In this case the driver has to lock by itself when needed. The locking there should also properly protect against set_rx_mode. WARNING: use of dev->lltx is deprecated. Don't use it for new drivers. Context: Process with BHs disabled or BH (timer), will be called with interrupts disabled by netconsole. Return codes: * NETDEV_TX_OK everything ok. * NETDEV_TX_BUSY Cannot transmit packet, try later Usually a bug, means queue start/stop flow control is broken in the driver. Note: the driver must NOT put the skb in its DMA ring. ndo_tx_timeout: Synchronization: netif_tx_lock spinlock; all TX queues frozen. Context: BHs disabled Notes: netif_queue_stopped() is guaranteed true ndo_set_rx_mode: Synchronization: netif_addr_lock spinlock. Context: BHs disabled ndo_setup_tc: ``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` are running under NFT locks (i.e. no ``rtnl_lock`` and no device instance lock). The rest of ``tc_setup_type`` types run under netdev instance lock if the driver implements queue management or shaper API. Most ndo callbacks not specified in the list above are running under ``rtnl_lock``. In addition, netdev instance lock is taken as well if the driver implements queue management or shaper API. struct napi_struct synchronization rules ======================================== napi->poll: Synchronization: NAPI_STATE_SCHED bit in napi->state. Device driver's ndo_stop method will invoke napi_disable() on all NAPI instances which will do a sleeping poll on the NAPI_STATE_SCHED napi->state bit, waiting for all pending NAPI activity to cease. Context: softirq will be called with interrupts disabled by netconsole. netdev instance lock ==================== Historically, all networking control operations were protected by a single global lock known as ``rtnl_lock``. There is an ongoing effort to replace this global lock with separate locks for each network namespace. Additionally, properties of individual netdev are increasingly protected by per-netdev locks. For device drivers that implement shaping or queue management APIs, all control operations will be performed under the netdev instance lock. Drivers can also explicitly request instance lock to be held during ops by setting ``request_ops_lock`` to true. Code comments and docs refer to drivers which have ops called under the instance lock as "ops locked". See also the documentation of the ``lock`` member of struct net_device. In the future, there will be an option for individual drivers to opt out of using ``rtnl_lock`` and instead perform their control operations directly under the netdev instance lock. Devices drivers are encouraged to rely on the instance lock where possible. For the (mostly software) drivers that need to interact with the core stack, there are two sets of interfaces: ``dev_xxx``/``netdev_xxx`` and ``netif_xxx`` (e.g., ``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx``/``netdev_xxx`` functions handle acquiring the instance lock themselves, while the ``netif_xxx`` functions assume that the driver has already acquired the instance lock. struct net_device_ops --------------------- ``ndos`` are called without holding the instance lock for most drivers. "Ops locked" drivers will have most of the ``ndos`` invoked under the instance lock. struct ethtool_ops ------------------ Similarly to ``ndos`` the instance lock is only held for select drivers. For "ops locked" drivers all ethtool ops without exceptions should be called under the instance lock. struct netdev_stat_ops ---------------------- "qstat" ops are invoked under the instance lock for "ops locked" drivers, and under rtnl_lock for all other drivers. struct net_shaper_ops --------------------- All net shaper callbacks are invoked while holding the netdev instance lock. ``rtnl_lock`` may or may not be held. Note that supporting net shapers automatically enables "ops locking". struct netdev_queue_mgmt_ops ---------------------------- All queue management callbacks are invoked while holding the netdev instance lock. ``rtnl_lock`` may or may not be held. Note that supporting struct netdev_queue_mgmt_ops automatically enables "ops locking". Notifiers and netdev instance lock ---------------------------------- For device drivers that implement shaping or queue management APIs, some of the notifiers (``enum netdev_cmd``) are running under the netdev instance lock. The following netdev notifiers are always run under the instance lock: * ``NETDEV_XDP_FEAT_CHANGE`` For devices with locked ops, currently only the following notifiers are running under the lock: * ``NETDEV_CHANGE`` * ``NETDEV_REGISTER`` * ``NETDEV_UP`` The following notifiers are running without the lock: * ``NETDEV_UNREGISTER`` There are no clear expectations for the remaining notifiers. Notifiers not on the list may run with or without the instance lock, potentially even invoking the same notifier type with and without the lock from different code paths. The goal is to eventually ensure that all (or most, with a few documented exceptions) notifiers run under the instance lock. Please extend this documentation whenever you make explicit assumption about lock being held from a notifier. NETDEV_INTERNAL symbol namespace ================================ Symbols exported as NETDEV_INTERNAL can only be used in networking core and drivers which exclusively flow via the main networking list and trees. Note that the inverse is not true, most symbols outside of NETDEV_INTERNAL are not expected to be used by random code outside netdev either. Symbols may lack the designation because they predate the namespaces, or simply due to an oversight.