Netdev features mess and how to get out from it alive
=====================================================

Author:
	Michał Mirosław <mirq-linux@rere.qmqm.pl>



 Part I: Feature sets
======================

Long gone are the days when a network card would just take and give packets
verbatim. Today's devices add multiple features and bugs (read: offloads)
that relieve an OS of various tasks like generating and checking checksums,
splitting packets, and classifying them. Those capabilities and their state
are commonly referred to as netdev features in the Linux kernel world.

There are currently three sets of features relevant to the driver, and
one used internally by the network core:

 1. netdev->hw_features set contains features whose state may be
    changed (enabled or disabled) for a particular device at the user's
    request. This set should be initialized in the ndo_init callback and
    not changed later.

 2. netdev->features set contains features which are currently enabled
    for a device. This should be changed only by the network core or in
    error paths of the ndo_set_features callback.

 3. netdev->vlan_features set contains features whose state is inherited
    by child VLAN devices (limits netdev->features set). This is currently
    used for all VLAN devices whether tags are stripped or inserted in
    hardware or software.

 4. netdev->wanted_features set contains the feature set requested by the
    user. This set is filtered by the ndo_fix_features callback whenever it
    or some device-specific conditions change. This set is internal to the
    networking core and should not be referenced in drivers.

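To make the roles of the four sets concrete, here is a minimal user-space
sketch that models them as plain 64-bit masks. Everything in it is a
hypothetical stand-in: the struct, the MODEL_F_* bit positions and the helper
names are not kernel definitions, and treating one feature as "always on"
(enabled but absent from hw_features) is an assumption of the model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; the real definitions live in
 * include/linux/netdev_features.h. */
#define MODEL_F_SG      (1ULL << 0)
#define MODEL_F_IP_CSUM (1ULL << 1)
#define MODEL_F_TSO     (1ULL << 2)
#define MODEL_F_HIGHDMA (1ULL << 3)

/* Hypothetical stand-in for the feature fields of a net device. */
struct model_netdev {
	uint64_t hw_features;     /* 1. may be toggled on user's request */
	uint64_t features;        /* 2. currently enabled */
	uint64_t vlan_features;   /* 3. inherited by child VLAN devices */
	uint64_t wanted_features; /* 4. requested by user, core-internal */
};

/* What a driver might set up at init time. It never touches
 * wanted_features, which belongs to the core. */
static void model_driver_init(struct model_netdev *dev)
{
	dev->hw_features = MODEL_F_SG | MODEL_F_IP_CSUM | MODEL_F_TSO;
	/* Assumption: HIGHDMA is always on, so enabled but not toggleable. */
	dev->features = dev->hw_features | MODEL_F_HIGHDMA;
	dev->vlan_features = MODEL_F_SG | MODEL_F_IP_CSUM;
}

/* Core-side handling of a user request: only bits present in hw_features
 * are recorded in wanted_features. Returns false if the bit is fixed. */
static bool model_user_request(struct model_netdev *dev, uint64_t bit, bool on)
{
	if (!(dev->hw_features & bit))
		return false;
	if (on)
		dev->wanted_features |= bit;
	else
		dev->wanted_features &= ~bit;
	return true;
}
```

In this model a request to turn MODEL_F_TSO on is recorded, while a request
to turn MODEL_F_HIGHDMA off is refused because the bit is not in hw_features.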


 Part II: Controlling enabled features
=======================================

When the current feature set (netdev->features) is to be changed, a new set
is calculated and filtered by calling the ndo_fix_features callback
and netdev_fix_features(). If the resulting set differs from the current
set, it is passed to the ndo_set_features callback and (if the callback
returns success) replaces the value stored in netdev->features.
A NETDEV_FEAT_CHANGE notification is issued after that whenever the current
set might have changed.

The following events trigger recalculation:
 1. device registration, after ndo_init has returned success
 2. a user request to change the state of features
 3. a call to netdev_update_features()

ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks
are treated as always returning success.

A driver that wants to trigger recalculation must do so by calling
netdev_update_features() while holding rtnl_lock. This should not be done
from ndo_*_features callbacks. netdev->features should not be modified by
the driver except by means of the ndo_fix_features callback (and the
ndo_set_features error path described in Part III).

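The recalculation sequence described in this part can be sketched as a small
user-space model. It is not the kernel implementation: the struct layout, the
model_* and demo_* names, the identity model_core_fix() and the notification
counter are all hypothetical, and starting from wanted_features alone is a
simplification.

```c
#include <stddef.h>
#include <stdint.h>

struct model_netdev;

/* Hypothetical mirror of the two driver callbacks. */
struct model_ops {
	uint64_t (*fix_features)(struct model_netdev *dev, uint64_t features);
	int (*set_features)(struct model_netdev *dev, uint64_t features);
};

struct model_netdev {
	uint64_t features;          /* currently enabled set */
	uint64_t wanted_features;   /* set requested by user */
	const struct model_ops *ops;
	unsigned int notifications; /* counts modelled FEAT_CHANGE events */
};

/* Placeholder for core-imposed limits; this model imposes none. */
static uint64_t model_core_fix(uint64_t features)
{
	return features;
}

/* Returns 1 if the enabled set might have changed, 0 otherwise. */
static int model_update_features(struct model_netdev *dev)
{
	uint64_t features = dev->wanted_features;
	int err = 0;

	/* Missing callbacks are treated as always returning success. */
	if (dev->ops && dev->ops->fix_features)
		features = dev->ops->fix_features(dev, features);
	features = model_core_fix(features);

	if (features == dev->features)
		return 0;	/* nothing to do, set_features is not called */

	if (dev->ops && dev->ops->set_features)
		err = dev->ops->set_features(dev, features);
	if (err == 0)
		dev->features = features;	/* success: store new set */

	dev->notifications++;	/* set might have changed: notify */
	return 1;
}

/* Demo callbacks for an imaginary device supporting only bits 0 and 1. */
static uint64_t demo_fix(struct model_netdev *dev, uint64_t features)
{
	(void)dev;
	return features & 0x3;
}

static int demo_set(struct model_netdev *dev, uint64_t features)
{
	(void)dev;
	(void)features;
	return 0;
}

static const struct model_ops demo_ops = {
	.fix_features = demo_fix,
	.set_features = demo_set,
};
```

With demo_ops attached, requesting only an unsupported bit leaves the enabled
set untouched and produces no notification, because the filtered set equals
the current one.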


 Part III: Implementation hints
================================

* ndo_fix_features:

All dependencies between features should be resolved here. The resulting
set can be reduced further by limitations imposed by the networking core (as
coded in netdev_fix_features()). For this reason it is safer to disable a
feature when its dependencies are not met than to force the dependency on.

This callback should not modify hardware or driver state (it should be
stateless). It can be called multiple times between successive
ndo_set_features calls.

The callback must not alter features contained in the NETIF_F_SOFT_FEATURES
or NETIF_F_NEVER_CHANGE sets. The exception is NETIF_F_VLAN_CHALLENGED, but
care must be taken as the change won't affect already configured VLANs.

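A stateless fix-up of this kind might look like the following user-space
sketch. The dependency it enforces (TSO needing TX checksumming) belongs to an
imaginary device, and the MODEL_F_* bits and function name are hypothetical;
a real callback also receives the net device as an argument.

```c
#include <stdint.h>

/* Illustrative bit positions, not the kernel's. */
#define MODEL_F_IP_CSUM (1ULL << 0)
#define MODEL_F_TSO     (1ULL << 1)
#define MODEL_F_GRO     (1ULL << 2)	/* stands in for a software feature */

/* Pure function of its input: no hardware access and no driver state is
 * touched, so calling it any number of times is harmless. */
static uint64_t model_fix_features(uint64_t features)
{
	/* Imaginary hardware rule: TSO only works with TX checksumming.
	 * Drop the dependent feature instead of forcing the dependency on. */
	if (!(features & MODEL_F_IP_CSUM))
		features &= ~MODEL_F_TSO;

	/* MODEL_F_GRO is deliberately passed through unchanged. */
	return features;
}
```
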
* ndo_set_features:

Hardware should be reconfigured to match the passed feature set. The set
should not be altered unless some error condition happens that can't
be reliably detected in ndo_fix_features. In this case, the callback
should update netdev->features to match the resulting hardware state.
Errors returned are not (and cannot be) propagated anywhere except dmesg.
(Note: a successful return is zero; >0 means a silent error.)

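The error path can be illustrated with a user-space sketch. The struct, the
tso_broken flag that simulates a failure visible only at reconfiguration time,
and the function name are hypothetical; the return convention follows the note
above (zero for success, a positive value for a silent error).

```c
#include <stdint.h>

/* Illustrative bit positions, not the kernel's. */
#define MODEL_F_RXCSUM (1ULL << 0)
#define MODEL_F_TSO    (1ULL << 1)

struct model_netdev {
	uint64_t features;	/* currently enabled set */
	int tso_broken;		/* simulated failure seen only on reconfig */
};

/* Returns 0 when the hardware matches 'features'; the caller then stores
 * the new set. On failure, records what the hardware actually ended up
 * with in dev->features and returns a positive value. */
static int model_set_features(struct model_netdev *dev, uint64_t features)
{
	uint64_t changed = dev->features ^ features;

	if ((changed & MODEL_F_TSO) && (features & MODEL_F_TSO) &&
	    dev->tso_broken) {
		/* Everything except TSO was applied. */
		dev->features = features & ~MODEL_F_TSO;
		return 1;
	}
	return 0;
}
```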


 Part IV: Features
===================

For the current list of features, see include/linux/netdev_features.h.
This section describes the semantics of some of them.

* Transmit checksumming

For a complete description, see the comments near the top of
include/linux/skbuff.h.

Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
It means that the device can fill a TCP/UDP-like checksum anywhere in the
packet, whatever headers there might be.

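The superset relation can be expressed as a small capability check. This is a
user-space sketch with hypothetical names and bit positions, and it is not
meant to reproduce any helper the kernel provides.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions, not the kernel's. */
#define MODEL_F_IP_CSUM   (1ULL << 0)
#define MODEL_F_IPV6_CSUM (1ULL << 1)
#define MODEL_F_HW_CSUM   (1ULL << 2)

enum model_l3 { MODEL_L3_IPV4, MODEL_L3_IPV6, MODEL_L3_OTHER };

/* Can a device with these features checksum a packet of this L3 type? */
static bool model_can_tx_csum(uint64_t features, enum model_l3 l3)
{
	if (features & MODEL_F_HW_CSUM)
		return true;	/* generic: any headers are fine */

	switch (l3) {
	case MODEL_L3_IPV4:
		return (features & MODEL_F_IP_CSUM) != 0;
	case MODEL_L3_IPV6:
		return (features & MODEL_F_IPV6_CSUM) != 0;
	default:
		return false;
	}
}
```
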
* Transmit TCP segmentation offload

NETIF_F_TSO_ECN means that the hardware can properly split packets with the
CWR bit set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6
(NETIF_F_TSO6).

* Transmit UDP segmentation offload

NETIF_F_GSO_UDP_L4 accepts a single UDP header with a payload that exceeds
gso_size. On segmentation, it segments the payload on gso_size boundaries and
replicates the network and UDP headers (fixing up the last one if less than
gso_size).

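The boundary arithmetic can be worked through in a few lines of user-space C.
The helper names are hypothetical, and returning 0 for an empty payload or a
zero gso_size is a choice of this sketch. For a 3000-byte payload with a
gso_size of 1200, it yields three segments, the last carrying 600 bytes.

```c
#include <stddef.h>

/* Number of segments produced when the payload is cut on gso_size
 * boundaries (ceiling division). */
static size_t model_udp_seg_count(size_t payload_len, size_t gso_size)
{
	if (gso_size == 0 || payload_len == 0)
		return 0;
	return (payload_len + gso_size - 1) / gso_size;
}

/* Payload length of the final segment, which may be shorter than
 * gso_size and therefore needs its header lengths fixed up. */
static size_t model_udp_last_seg_len(size_t payload_len, size_t gso_size)
{
	size_t rem;

	if (gso_size == 0 || payload_len == 0)
		return 0;
	rem = payload_len % gso_size;
	return rem ? rem : gso_size;
}
```
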
* Transmit DMA from high memory

On platforms where this is relevant, NETIF_F_HIGHDMA signals that
ndo_start_xmit can handle skbs with frags in high memory.

* Transmit scatter-gather

These features say that ndo_start_xmit can handle fragmented skbs:
NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST ---
chained skbs (skb->next/prev list).

* Software features

Features contained in NETIF_F_SOFT_FEATURES are features of the networking
stack. Drivers should not change behaviour based on them.

* LLTX driver (deprecated for hardware drivers)

NETIF_F_LLTX is meant to be used by drivers that don't need locking at all,
e.g. software tunnels.

This is also used in a few legacy drivers that implement their
own locking; don't use it for new (hardware) drivers.

* netns-local device

NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between
network namespaces (e.g. loopback).

Don't use it in drivers.

* VLAN challenged

NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN
headers. Some drivers set this because the cards can't handle the bigger MTU.
[FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU
VLANs. This may not be useful, though.]

* rx-fcs

This requests that the NIC append the Ethernet Frame Check Sequence (FCS)
to the end of the skb data. This allows sniffers and other tools to
read the CRC recorded by the NIC on receipt of the packet.

* rx-all

This requests that the NIC receive all possible frames, including errored
frames (such as those with a bad FCS). This can be helpful when sniffing a
link with bad packets on it. Some NICs may receive more packets if also put
into normal PROMISC mode.

* rx-gro-hw

This requests that the NIC enable Hardware GRO (generic receive offload).
Hardware GRO is basically the exact reverse of TSO, and is generally
stricter than Hardware LRO. A packet stream merged by Hardware GRO must
be re-segmentable by GSO or TSO back to the exact original packet stream.
Hardware GRO is dependent on RXCSUM since every packet successfully merged
by hardware must also have the checksum verified by hardware.