.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===============================================
Ethernet switch device driver model (switchdev)
===============================================

Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us>

Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com>


The Ethernet switch device driver model (switchdev) is an in-kernel driver
model for switch devices which offload the forwarding (data) plane from the
kernel.

Figure 1 is a block diagram showing the components of the switchdev model for
an example setup using a data-center-class switch ASIC chip. Other setups
with SR-IOV or soft switches, such as OVS, are possible.

::


                             User-space tools

       user space                   |
      +-------------------------------------------------------------------+
       kernel                       | Netlink
                                    |
                     +--------------+-------------------------------+
                     |         Network stack                        |
                     |           (Linux)                            |
                     |                                              |
                     +----------------------------------------------+

                           sw1p2     sw1p4     sw1p6
                      sw1p1  +  sw1p3  +  sw1p5  +          eth1
                        +    |    +    |    +    |            +
                        |    |    |    |    |    |            |
                     +--+----+----+----+----+----+---+  +-----+-----+
                     |         Switch driver         |  |    mgmt   |
                     |        (this document)        |  |   driver  |
                     |                               |  |           |
                     +--------------+----------------+  +-----------+
                                    |
       kernel                       | HW bus (eg PCI)
      +-------------------------------------------------------------------+
       hardware                     |
                     +--------------+----------------+
                     |         Switch device (sw1)   |
                     |  +----+                       +--------+
                     |  |    v offloaded data path   | mgmt port
                     |  |    |                       |
                     +--|----|----+----+----+----+---+
                        |    |    |    |    |    |
                        +    +    +    +    +    +
                       p1   p2   p3   p4   p5   p6

                             front-panel ports


                                    Fig 1.


Include Files
-------------

::

    #include <linux/netdevice.h>
    #include <net/switchdev.h>


Configuration
-------------

Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure that switchdev
model support is built for the driver.


Switch Ports
------------

On switchdev driver initialization, the driver will allocate and register a
struct net_device (using register_netdev()) for each enumerated physical switch
port, called the port netdev. A port netdev is the software representation of
the physical port and provides a conduit for control traffic to/from the
controller (the kernel) and the network, as well as an anchor point for higher
level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using
standard netdev tools (iproute2, ethtool, etc.), the port netdev can also
give the user access to the physical properties of the switch port, such
as PHY link state and I/O statistics.

There is (currently) no higher-level kernel object for the switch beyond the
port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops.

A switch management port is outside the scope of the switchdev driver model.
Typically, the management port does not participate in the offloaded data plane
and is driven by a different driver, such as a NIC driver, bound to the
management port device.

Switch ID
^^^^^^^^^

The switchdev driver must implement the net_device operation
ndo_get_port_parent_id for each port netdev, returning the same physical ID for
each port of a switch. The ID must be unique between switches on the same
system. The ID does not need to be unique between switches on different
systems.

The switch ID is used to locate ports on a switch and to know if aggregated
ports belong to the same switch.

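The comparison implied above can be sketched in userspace C. This is only an
illustration, not kernel code: ``struct phys_item_id`` is a stand-in modeled on
the kernel's ``struct netdev_phys_item_id``, and ``lag_on_single_switch()`` is
a hypothetical helper showing how a consumer might check that aggregated ports
share one switch ID.

```c
#include <stdbool.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct netdev_phys_item_id. */
#define PHYS_ID_MAX_LEN 32

struct phys_item_id {
	unsigned char id[PHYS_ID_MAX_LEN];
	unsigned char id_len;
};

/* Two ports are on the same switch when both length and bytes match. */
bool phys_id_same(const struct phys_item_id *a, const struct phys_item_id *b)
{
	return a->id_len == b->id_len &&
	       memcmp(a->id, b->id, a->id_len) == 0;
}

/* Hypothetical helper: do all n member ports report the same switch ID? */
bool lag_on_single_switch(const struct phys_item_id *ports, int n)
{
	for (int i = 1; i < n; i++)
		if (!phys_id_same(&ports[0], &ports[i]))
			return false;
	return n > 0;
}
```

A driver that cannot offload a LAG spanning two switches could use such a
check to decide whether to fall back to software forwarding.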
Port Netdev Naming
^^^^^^^^^^^^^^^^^^

Udev rules should be used for port netdev naming, using some unique attribute
of the port as a key, for example the port MAC address or the port PHYS name.
Hard-coding of kernel netdev names within the driver is discouraged; let the
kernel pick the default netdev name, and let udev set the final name based on a
port attribute.

Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
useful for dynamically-named ports where the device names its ports based on
external configuration. For example, if a physical 40G port is split logically
into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
name for each port using port PHYS name. The udev rule would be::

    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
            ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"

Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0
would be sub-port 0 on port 1 on switch 1.

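As a rough illustration of the convention, the following userspace C sketch
formats the "pYsZ" part that a driver might report as the port PHYS name, with
udev prepending "swX" as in the rule above. ``format_phys_port_name()`` is a
hypothetical helper, not a kernel API.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write "pY" (unsplit port) or "pYsZ" (split port)
 * into buf. A negative subport means the port is not split.
 * Returns the value of snprintf().
 */
int format_phys_port_name(char *buf, size_t len, unsigned int port,
			  int subport)
{
	if (subport < 0)
		return snprintf(buf, len, "p%u", port);
	return snprintf(buf, len, "p%us%d", port, subport);
}
```

With port 1 split four ways, the four netdevs would report p1s0 through p1s3,
which the udev rule turns into sw1p1s0 through sw1p1s3 on switch 1.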
Port Features
^^^^^^^^^^^^^

NETIF_F_NETNS_LOCAL

If the switchdev driver (and device) only supports offloading of the default
network namespace (netns), the driver should set this feature flag to prevent
the port netdev from being moved out of the default netns. A netns-aware
driver/device would not set this flag and would be responsible for partitioning
hardware to preserve netns containment. This means hardware cannot forward
traffic from a port in one namespace to another port in another namespace.

Port Topology
^^^^^^^^^^^^^

The port netdevs representing the physical switch ports can be organized into
higher-level switching constructs. The default construct is a standalone
router port, used to offload L3 forwarding. Two or more ports can be bonded
together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge
L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3
tunnels can be built on ports. These constructs are built using standard Linux
tools such as the bridge driver, the bonding/team drivers, and netlink-based
tools such as iproute2.

The switchdev driver can know a particular port's position in the topology by
monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a
bond will see its upper master change. If that bond is moved into a bridge,
the bond's upper master will change. And so on. The driver tracks such
movements, by registering for netdevice events and acting on
NETDEV_CHANGEUPPER, to know a port's position in the overall topology.

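The bookkeeping described above can be modeled in a few lines of userspace C.
This is a simplified sketch, not the kernel notifier API: ``struct dev``,
``change_upper()``, ``port_bridge()`` and ``port_in_lag()`` are hypothetical
names, and a real driver would receive the upper device from the
NETDEV_CHANGEUPPER notifier payload rather than as function arguments.

```c
#include <stddef.h>

enum dev_kind { DEV_PORT, DEV_BOND, DEV_BRIDGE };

struct dev {
	enum dev_kind kind;
	struct dev *upper;	/* current upper master, or NULL */
};

/* What a driver would record on a CHANGEUPPER-style event:
 * linking sets the upper master, unlinking clears it. */
void change_upper(struct dev *lower, struct dev *upper, int linking)
{
	lower->upper = linking ? upper : NULL;
}

/* Walk up the chain of uppers to find the bridge a port belongs to. */
struct dev *port_bridge(struct dev *port)
{
	for (struct dev *d = port->upper; d; d = d->upper)
		if (d->kind == DEV_BRIDGE)
			return d;
	return NULL;
}

/* A port is a LAG member when its direct upper is a bond. */
int port_in_lag(const struct dev *port)
{
	return port->upper && port->upper->kind == DEV_BOND;
}
```

Moving a port into a bond and then the bond into a bridge makes
``port_bridge()`` return the bridge for that port, mirroring the two
notifications described in the text.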
L2 Forwarding Offload
---------------------

The idea is to offload the L2 data forwarding (switching) path from the kernel
to the switchdev device by mirroring bridge FDB entries down to the device. An
FDB entry is the {port, MAC, VLAN} tuple forwarding destination.

To offload L2 bridging, the switchdev driver/device should support:

- Static FDB entries installed on a bridge port
- Notification of learned/forgotten source MAC/VLANs from the device
- STP state changes on the port
- VLAN flooding of multicast/broadcast and unknown unicast packets

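To make the {port, MAC, VLAN} tuple concrete, here is a minimal userspace
model of an FDB keyed by (MAC, VLAN) with the port as the forwarding
destination. It is a sketch under simplifying assumptions (a small fixed-size
array, linear search); the names ``fdb_table``, ``fdb_add()``, ``fdb_del()``
and ``fdb_lookup()`` are illustrative and do not correspond to kernel
functions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FDB_MAX 64

struct fdb_entry {
	uint8_t mac[6];
	uint16_t vid;
	int port;	/* forwarding destination */
	bool used;
};

struct fdb_table {
	struct fdb_entry e[FDB_MAX];
};

struct fdb_entry *fdb_find(struct fdb_table *t, const uint8_t *mac,
			   uint16_t vid)
{
	for (int i = 0; i < FDB_MAX; i++)
		if (t->e[i].used && t->e[i].vid == vid &&
		    memcmp(t->e[i].mac, mac, 6) == 0)
			return &t->e[i];
	return NULL;
}

/* Add an entry, or move an existing one to a new port.
 * Returns false when the table is full. */
bool fdb_add(struct fdb_table *t, const uint8_t *mac, uint16_t vid, int port)
{
	struct fdb_entry *e = fdb_find(t, mac, vid);

	for (int i = 0; !e && i < FDB_MAX; i++)
		if (!t->e[i].used)
			e = &t->e[i];
	if (!e)
		return false;
	memcpy(e->mac, mac, 6);
	e->vid = vid;
	e->port = port;
	e->used = true;
	return true;
}

bool fdb_del(struct fdb_table *t, const uint8_t *mac, uint16_t vid)
{
	struct fdb_entry *e = fdb_find(t, mac, vid);

	if (!e)
		return false;
	e->used = false;
	return true;
}

/* Returns the egress port, or -1 for an unknown destination (flood). */
int fdb_lookup(struct fdb_table *t, const uint8_t *mac, uint16_t vid)
{
	struct fdb_entry *e = fdb_find(t, mac, vid);

	return e ? e->port : -1;
}
```

The same MAC address in two VLANs yields two independent entries, which is why
the VLAN is part of the key.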
Static FDB Entries
^^^^^^^^^^^^^^^^^^

A driver which implements the ``ndo_fdb_add``, ``ndo_fdb_del`` and
``ndo_fdb_dump`` operations is able to support the command below, which adds a
static bridge FDB entry::

    bridge fdb add dev DEV ADDRESS [vlan VID] [self] static

(The "static" keyword is mandatory: if it is not specified, the entry defaults
to being "local", which means that it should not be forwarded.)

The "self" keyword (optional because it is implicit) has the role of
instructing the kernel to fulfill the operation through the ``ndo_fdb_add``
implementation of the ``DEV`` device itself. If ``DEV`` is a bridge port, this
will bypass the bridge and therefore leave the software database out of sync
with the hardware one.

To avoid this, the "master" keyword can be used::

    bridge fdb add dev DEV ADDRESS [vlan VID] master static

The above command instructs the kernel to search for a master interface of
``DEV`` and fulfill the operation through the ``ndo_fdb_add`` method of that
master. This time, the bridge generates a ``SWITCHDEV_FDB_ADD_TO_DEVICE``
notification which the port driver can handle and use to program its hardware
table. This way, the software and the hardware database will both contain this
static FDB entry.

Note: for new switchdev drivers that offload the Linux bridge, implementing the
``ndo_fdb_add`` and ``ndo_fdb_del`` bridge bypass methods is strongly
discouraged: all static FDB entries should be added on a bridge port using the
"master" flag. The ``ndo_fdb_dump`` is an exception and can be implemented to
visualize the hardware tables, if the device does not have an interrupt for
notifying the operating system of newly learned/forgotten dynamic FDB
addresses. In that case, the hardware FDB might end up having entries that the
software FDB does not, and implementing ``ndo_fdb_dump`` is the only way to see
them.

Note: by default, the bridge does not filter on VLAN and only bridges untagged
traffic. To enable VLAN support, turn on VLAN filtering::

    echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering

Notification of Learned/Forgotten Source MAC/VLANs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the MAC/VLAN/port tuples. The switch driver,
in turn, will notify the bridge driver using the switchdev notifier call::

    err = call_switchdev_notifiers(val, dev, info, extack);

Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
forgetting, and info points to a struct switchdev_notifier_fdb_info. On
SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge
command will label these entries "offload"::

    $ bridge fdb
    52:54:00:12:35:01 dev sw1p1 master br0 permanent
    00:02:00:00:02:00 dev sw1p1 master br0 offload
    00:02:00:00:02:00 dev sw1p1 self
    52:54:00:12:35:02 dev sw1p2 master br0 permanent
    00:02:00:00:03:00 dev sw1p2 master br0 offload
    00:02:00:00:03:00 dev sw1p2 self
    33:33:00:00:00:01 dev eth0 self permanent
    01:00:5e:00:00:01 dev eth0 self permanent
    33:33:ff:00:00:00 dev eth0 self permanent
    01:80:c2:00:00:0e dev eth0 self permanent
    33:33:00:00:00:01 dev br0 self permanent
    01:00:5e:00:00:01 dev br0 self permanent
    33:33:ff:12:35:01 dev br0 self permanent

Learning on the port should be disabled on the bridge using the bridge command::

    bridge link set dev DEV learning off

Learning on the device port should be enabled, as well as learning_sync::

    bridge link set dev DEV learning on self
    bridge link set dev DEV learning_sync on self

The learning_sync attribute enables syncing of the learned/forgotten FDB entry
to the bridge's FDB. It's possible, but not optimal, to enable learning on the
device port and on the bridge port, and disable learning_sync.

To support learning, the driver implements switchdev op
switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_{PRE_}BRIDGE_FLAGS.

FDB Ageing
^^^^^^^^^^

The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
the responsibility of the port driver/device to age out these entries. If the
port device supports ageing, when the FDB entry expires, it will notify the
driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the
device does not support ageing, the driver can simulate ageing using a
garbage collection timer to monitor FDB entries. Expired entries are reported
to the bridge using SWITCHDEV_FDB_DEL. See the rocker driver for an example of
a driver running an ageing timer.

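A driver-side garbage collection pass of the kind described above might look
roughly like the following userspace C sketch. It is not taken from the rocker
driver: ``struct aged_entry``, ``fdb_gc_sweep()`` and the ``notify`` callback
(standing in for the SWITCHDEV_FDB_DEL notification) are assumptions made for
illustration, and timestamps are plain seconds from any monotonic clock.

```c
#include <stddef.h>
#include <stdint.h>

struct aged_entry {
	uint64_t last_seen;	/* seconds, from a monotonic clock */
	int in_use;
};

/* Callback standing in for notifying the bridge of a deleted entry. */
typedef void (*fdb_del_notify_t)(size_t index, void *ctx);

/*
 * One garbage-collection pass: expire every entry that has been idle for
 * longer than ageing_time seconds. Returns the number of expired entries.
 */
size_t fdb_gc_sweep(struct aged_entry *tbl, size_t n, uint64_t now,
		    uint64_t ageing_time, fdb_del_notify_t notify, void *ctx)
{
	size_t expired = 0;

	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].in_use || now - tbl[i].last_seen <= ageing_time)
			continue;
		tbl[i].in_use = 0;
		if (notify)
			notify(i, ctx);
		expired++;
	}
	return expired;
}
```

In a driver, a pass like this would typically be run from a periodic timer
or delayed work item, with the interval chosen relative to the ageing time.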
To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The
notification will reset the FDB entry's last-used time to now. The driver
should rate limit refresh notifications, for example, to no more than once a
second. (The last-used time is visible using the bridge -s fdb option.)

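The rate limit on refresh notifications can be sketched as a per-entry
timestamp check. The code below is a userspace illustration only;
``struct refresh_state`` and ``fdb_refresh_allowed()`` are hypothetical names,
and the caller is assumed to send the SWITCHDEV_FDB_ADD notification only when
the function returns true.

```c
#include <stdbool.h>
#include <stdint.h>

struct refresh_state {
	uint64_t last_refresh_ms;
	bool ever_refreshed;
};

/*
 * Returns true when a refresh notification may be sent now, allowing at
 * most one notification per interval_ms for this entry.
 */
bool fdb_refresh_allowed(struct refresh_state *s, uint64_t now_ms,
			 uint64_t interval_ms)
{
	if (s->ever_refreshed && now_ms - s->last_refresh_ms < interval_ms)
		return false;
	s->last_refresh_ms = now_ms;
	s->ever_refreshed = true;
	return true;
}
```

With ``interval_ms`` set to 1000, hits on a busy entry that arrive a few
milliseconds apart produce a single notification per second.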
STP State Change on Port
^^^^^^^^^^^^^^^^^^^^^^^^

Internally or with a third-party STP protocol implementation (e.g. mstpd), the
bridge driver maintains the STP state for ports, and will notify the switch
driver of STP state change on a port using the switchdev op
switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_STP_STATE.

State is one of BR_STATE_*. The switch driver can use STP state updates to
update the ingress packet filter list for the port. For example, if the port is
DISABLED, no packets should pass, but if the port moves to BLOCKED, then STP
BPDUs and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass.

Note that STP BPDUs are untagged and STP state applies to all VLANs on the port
so packet filters should be applied consistently across untagged and tagged
VLANs on the port.

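The filter decision described above can be written as a small pure function.
The sketch below is userspace C for illustration: the enum values mirror the
BR_STATE_* constants, the link-local test follows the 01:80:c2:xx:xx:xx
pattern given in the text, and treating LISTENING and LEARNING like BLOCKING
for ingress filtering is an assumption of this sketch, not something the
document specifies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the BR_STATE_* constants. */
enum stp_state {
	STP_DISABLED = 0,
	STP_LISTENING = 1,
	STP_LEARNING = 2,
	STP_FORWARDING = 3,
	STP_BLOCKING = 4,
};

/* IEEE link-local multicast as described in the text: 01:80:c2:xx:xx:xx. */
bool is_link_local(const uint8_t *mac)
{
	return mac[0] == 0x01 && mac[1] == 0x80 && mac[2] == 0xc2;
}

/* Should a frame with this destination MAC pass the ingress filter? */
bool stp_ingress_allowed(enum stp_state state, const uint8_t *dst_mac)
{
	switch (state) {
	case STP_DISABLED:
		return false;
	case STP_FORWARDING:
		return true;
	default:
		/* BLOCKING and, by assumption, LISTENING/LEARNING:
		 * only link-local control traffic is let through. */
		return is_link_local(dst_mac);
	}
}
```

Because the function takes no VLAN argument, the same verdict applies to
tagged and untagged frames, matching the note about consistent filtering.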
Flooding L2 domain
^^^^^^^^^^^^^^^^^^

For a given L2 VLAN domain, the switch device should flood multicast/broadcast
and unknown unicast packets to all ports in the domain, if allowed by the
port's current STP state. The switch driver, knowing which ports are within
which VLAN L2 domain, can program the switch device for flooding. The packet
may be sent to the port netdev for processing by the bridge driver. The
bridge should not reflood the packet to the same ports the device flooded,
otherwise there will be duplicate packets on the wire.

To avoid duplicate packets, the switch driver should mark a packet as already
forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark
the skb using the ingress bridge port's mark and prevent it from being forwarded
through any bridge port with the same mark.

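The duplicate-suppression rule can be modeled as a single predicate. This is a
userspace sketch of the logic as described in the text, not the bridge
driver's code: ``struct pkt``, ``struct bridge_port`` and
``sw_egress_allowed()`` are hypothetical names, and the assumption is that all
ports of one switch carry the same mark while software-only ports carry a
different one.

```c
#include <stdbool.h>

struct pkt {
	bool offload_fwd_mark;	/* set by the switch driver on receive */
	int ingress_mark;	/* mark of the bridge port it arrived on */
};

struct bridge_port {
	int mark;	/* assumed shared by all ports of one switch */
};

/*
 * Software bridge egress check: if the hardware already forwarded the
 * packet, skip every port that carries the ingress port's mark.
 */
bool sw_egress_allowed(const struct pkt *p, const struct bridge_port *out)
{
	return !p->offload_fwd_mark || p->ingress_mark != out->mark;
}
```

A packet the device has already flooded is therefore still forwarded in
software to ports outside the switch (for example a veth in the same bridge),
but not back to sibling switch ports.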
It is possible for the switch device to not handle flooding and push the
packets up to the bridge driver for flooding. This is not ideal as the number
of ports in the L2 domain scales, since the device is much more efficient at
flooding packets than software.

If supported by the device, flood control can be offloaded to it, preventing
certain netdevs from flooding unicast traffic for which there is no FDB entry.

IGMP Snooping
^^^^^^^^^^^^^

In order to support IGMP snooping, the port netdevs should trap to the bridge
driver all IGMP join and leave messages.
The bridge multicast module will notify port netdevs of every multicast group
change, whether the group is statically configured or dynamically joined/left.
The hardware implementation should forward all registered multicast
traffic groups only to the configured ports.

L3 Routing Offload
------------------

Offloading L3 routing requires that the device be programmed with FIB entries
from the kernel, with the device doing the FIB lookup and forwarding. The
device does a longest prefix match (LPM) on FIB entries matching the route
prefix and forwards the packet to the matching FIB entry's nexthop(s) egress
ports.

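For readers unfamiliar with longest prefix match, the lookup the device
performs can be illustrated with a linear-scan model in userspace C. Real
hardware uses TCAMs or tries rather than a scan; ``struct route``,
``prefix_mask()`` and ``lpm_lookup()`` are names invented for this sketch, and
addresses are IPv4 in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

struct route {
	uint32_t dst;		/* prefix, host byte order */
	int dst_len;		/* prefix length, 0..32 */
	int nexthop_port;	/* egress port of the nexthop */
};

/* Network mask for a prefix length; /0 must not shift by 32. */
uint32_t prefix_mask(int len)
{
	return len == 0 ? 0 : ~(uint32_t)0 << (32 - len);
}

/* Return the matching route with the longest prefix, or NULL. */
const struct route *lpm_lookup(const struct route *tbl, size_t n,
			       uint32_t addr)
{
	const struct route *best = NULL;

	for (size_t i = 0; i < n; i++) {
		uint32_t mask = prefix_mask(tbl[i].dst_len);

		if ((addr & mask) != (tbl[i].dst & mask))
			continue;
		if (!best || tbl[i].dst_len > best->dst_len)
			best = &tbl[i];
	}
	return best;
}
```

An address covered by both a /30 route and the default route resolves to the
/30 entry, since the longer prefix wins.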
To program the device, the driver has to register a FIB notifier handler
using register_fib_notifier. The following events are available:

===================  ===================================================
FIB_EVENT_ENTRY_ADD  used for both adding a new FIB entry to the device,
                     or modifying an existing entry on the device.
FIB_EVENT_ENTRY_DEL  used for removing a FIB entry
FIB_EVENT_RULE_ADD,
FIB_EVENT_RULE_DEL   used to propagate FIB rule changes
===================  ===================================================

FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass::

    struct fib_entry_notifier_info {
            struct fib_notifier_info info; /* must be first */
            u32 dst;
            int dst_len;
            struct fib_info *fi;
            u8 tos;
            u8 type;
            u32 tb_id;
            u32 nlflags;
    };

to add/modify/delete IPv4 dst/dst_len prefix on table tb_id. The ``*fi``
structure holds details on the route and route's nexthops. ``*dev`` is one
of the port netdevs mentioned in the route's next hop list.

Routes offloaded to the device are labeled with "offload" in the ip route
listing::

    $ ip route show
    default via 192.168.0.2 dev eth0
    11.0.0.0/30 dev sw1p1  proto kernel  scope link  src 11.0.0.2 offload
    11.0.0.4/30 via 11.0.0.1 dev sw1p1  proto zebra  metric 20 offload
    11.0.0.8/30 dev sw1p2  proto kernel  scope link  src 11.0.0.10 offload
    11.0.0.12/30 via 11.0.0.9 dev sw1p2  proto zebra  metric 20 offload
    12.0.0.2  proto zebra  metric 30 offload
            nexthop via 11.0.0.1  dev sw1p1 weight 1
            nexthop via 11.0.0.9  dev sw1p2 weight 1
    12.0.0.3 via 11.0.0.1 dev sw1p1  proto zebra  metric 20 offload
    12.0.0.4 via 11.0.0.9 dev sw1p2  proto zebra  metric 20 offload
    192.168.0.0/24 dev eth0  proto kernel  scope link  src 192.168.0.15

The "offload" flag is set if at least one device offloads the FIB entry.

XXX: add/mod/del IPv6 FIB API

Nexthop Resolution
^^^^^^^^^^^^^^^^^^

The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
the switch device to forward the packet with the correct dst MAC address, the
nexthop gateways must be resolved to the neighbor's MAC address. Neighbor MAC
address discovery comes via the ARP (or ND) process and is available via the
arp_tbl neighbor table. To resolve the route's nexthop gateways, the driver
should trigger the kernel's neighbor resolution process. See the rocker
driver's rocker_port_ipv4_resolve() for an example.

The driver can monitor for updates to arp_tbl using the netevent notifier
NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops
for the routes as arp_tbl is updated. The driver implements ndo_neigh_destroy
to know when arp_tbl neighbor entries are purged from the port.

Device driver expected behavior
-------------------------------

Below is a set of defined behaviors that switchdev-enabled network devices must
adhere to.

Configuration-less state
^^^^^^^^^^^^^^^^^^^^^^^^

Upon driver bring up, the network devices must be fully operational, and the
backing driver must configure the network device such that it is possible to
send and receive traffic to this network device and it is properly separated
from other network devices/ports (e.g. as is frequent with a switch ASIC). How
this is achieved is heavily hardware dependent, but a simple solution can be to
use per-port VLAN identifiers unless a better mechanism is available
(proprietary metadata for each network port for instance).

The network device must be capable of running a full IP protocol stack
including multicast, DHCP, IPv4/6, etc. If necessary, it should program the
appropriate filters for VLAN, multicast, unicast, etc. The underlying device
driver must effectively be configured in a similar fashion to what it would do
when IGMP snooping is enabled for IP multicast over these switchdev network
devices and unsolicited multicast must be filtered as early as possible in
the hardware.

When configuring VLANs on top of the network device, all VLANs must be working,
irrespective of the state of other network devices (e.g. other ports being part
of a VLAN-aware bridge doing ingress VID checking). See below for details.

If the device implements, e.g., VLAN filtering, putting the interface in
promiscuous mode should allow the reception of all VLAN tags (including those
not present in the filter(s)).

Bridged switch ports
^^^^^^^^^^^^^^^^^^^^

When a switchdev-enabled network device is added as a bridge member, it should
not disrupt any functionality of non-bridged network devices and they
should continue to behave as normal network devices. The expected behavior
depends on the bridge configuration knobs documented below.

Bridge VLAN filtering
^^^^^^^^^^^^^^^^^^^^^

The Linux bridge allows the configuration of a VLAN filtering mode (statically,
at device creation time, and dynamically, during run time) which must be
observed by the underlying switchdev network device/hardware:

- with VLAN filtering turned off: the bridge is strictly VLAN unaware and its
  data path will process all Ethernet frames as if they are VLAN-untagged.
  The bridge VLAN database can still be modified, but the modifications should
  have no effect while VLAN filtering is turned off. Frames ingressing the
  device with a VID that is not programmed into the bridge/switch's VLAN table
  must be forwarded and may be processed using a VLAN device (see below).

- with VLAN filtering turned on: the bridge is VLAN-aware and frames ingressing
  the device with a VID that is not programmed into the bridge/switch's VLAN
  table must be dropped (strict VID checking).

| 471 | When there is a VLAN device (e.g. sw0p1.100) configured on top of a switchdev |
| 472 | network device which is a bridge port member, the behavior of the software |
| 473 | network stack must be preserved, or the configuration must be refused if that |
| 474 | is not possible. |
| 475 | |
| 476 | - with VLAN filtering turned off, the bridge will process all ingress traffic |
| 477 | for the port, except for the traffic tagged with a VLAN ID destined for a |
| 478 | VLAN upper. The VLAN upper interface (which consumes the VLAN tag) can even |
| 479 | be added to a second bridge, which includes other switch ports or software |
| 480 | interfaces. Some approaches to ensure that the forwarding domain for traffic |
| 481 | belonging to the VLAN upper interfaces is managed properly: |
| 482 | |
| 483 | * If forwarding destinations can be managed per VLAN, the hardware could be |
| 484 | configured to map all traffic, except the packets tagged with a VID |
| 485 | belonging to a VLAN upper interface, to an internal VID corresponding to |
| 486 | untagged packets. This internal VID spans all ports of the VLAN-unaware |
| 487 | bridge. The VID corresponding to the VLAN upper interface spans the |
| 488 | physical port of that VLAN interface, as well as the other ports that |
| 489 | might be bridged with it. |
| 490 | * Treat bridge ports with VLAN upper interfaces as standalone, and let |
| 491 | forwarding be handled in the software data path. |
| 492 | |
| 493 | - with VLAN filtering turned on, these VLAN devices can be created as long as |
| 494 | the bridge does not have an existing VLAN entry with the same VID on any |
| 495 | bridge port. These VLAN devices cannot be enslaved into the bridge since they |
| 496 | duplicate functionality/use case with the bridge's VLAN data path processing. |
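
The rule for creating a VLAN upper on a bridge port can be sketched as a small user-space model. This is only an illustration under stated assumptions: ``vlan_upper_allowed()`` and its parameters are hypothetical names, not kernel API, and a real driver may additionally refuse configurations its hardware cannot offload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check, not a kernel function: may a VLAN upper with
 * upper_vid be created on a port of this bridge?
 *
 * bridge_vids lists every VID that has a VLAN entry on any bridge port.
 */
static bool vlan_upper_allowed(bool vlan_filtering,
                               const uint16_t *bridge_vids, size_t n_vids,
                               uint16_t upper_vid)
{
    size_t i;

    assert(upper_vid >= 1 && upper_vid <= 4094);

    /* VLAN-unaware bridge: the bridge VLAN database has no effect. */
    if (!vlan_filtering)
        return true;

    /* VLAN-aware bridge: refuse a VID the bridge already uses. */
    for (i = 0; i < n_vids; i++)
        if (bridge_vids[i] == upper_vid)
            return false;

    return true;
}
```

For example, with bridge VLAN entries 1, 10 and 100 and VLAN filtering turned on, the model allows an upper for VID 200 and refuses one for VID 100.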
| 497 | |
| 498 | Non-bridged network ports of the same switch fabric must not be disturbed in any |
| 499 | way by the enabling of VLAN filtering on the bridge device(s). If the VLAN |
| 500 | filtering setting is global to the entire chip, then the standalone ports |
| 501 | should indicate to the network stack that VLAN filtering is required by setting |
| 502 | 'rx-vlan-filter: on [fixed]' in the ethtool features. |
| 503 | |
| 504 | Because VLAN filtering can be turned on/off at runtime, the switchdev driver |
| 505 | must be able to reconfigure the underlying hardware on the fly to honor the |
| 506 | toggling of that option and behave appropriately. If that is not possible, the |
| 507 | switchdev driver can also refuse to support dynamic toggling of the VLAN |
| 508 | filtering knob at runtime and require a destruction of the bridge device(s) and |
| 509 | creation of new bridge device(s) with a different VLAN filtering value to |
| 510 | ensure VLAN awareness is pushed down to the hardware. |
| 511 | |
| 512 | Even when VLAN filtering in the bridge is turned off, the underlying switch |
| 513 | hardware and driver may still configure themselves in a VLAN-aware mode provided |
| 514 | that the behavior described above is observed. |
| 515 | |
| 516 | The VLAN protocol of the bridge plays a role in deciding whether a packet is |
| 517 | treated as tagged or not: a bridge using the 802.1ad protocol must treat both |
| 518 | VLAN-untagged packets and packets tagged with 802.1Q headers as |
| 519 | untagged. |
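
A minimal sketch of this classification, assuming the decision depends only on whether the frame's outer EtherType matches the bridge's VLAN protocol. The helper name is hypothetical. The symmetric case (an 802.1Q bridge seeing an 802.1ad-tagged frame as untagged) is an assumption of the sketch and is not stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Tag protocol identifiers (ETH_P_8021Q and ETH_P_8021AD in Linux). */
#define TPID_8021Q  0x8100u
#define TPID_8021AD 0x88A8u

/*
 * Hypothetical helper: does a frame with this outer EtherType count as
 * VLAN-tagged for a bridge using bridge_vlan_proto?
 */
static bool frame_is_tagged_for_bridge(uint16_t bridge_vlan_proto,
                                       uint16_t frame_ethertype)
{
    assert(bridge_vlan_proto == TPID_8021Q ||
           bridge_vlan_proto == TPID_8021AD);

    /* Only a tag of the bridge's own protocol counts as a VLAN tag. */
    return frame_ethertype == bridge_vlan_proto;
}
```

With this model, an 802.1ad bridge treats a frame carrying only an 802.1Q header the same way as an untagged IPv4 frame (EtherType 0x0800).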
| 520 | |
| 521 | The 802.1p (VID 0) tagged packets must be treated in the same way by the device |
| 522 | as untagged packets, since the bridge device does not allow the manipulation of |
| 523 | VID 0 in its database. |
| 524 | |
| 525 | When the bridge has VLAN filtering enabled and a PVID is not configured on the |
| 526 | ingress port, untagged and 802.1p tagged packets must be dropped. When the bridge |
| 527 | has VLAN filtering enabled and a PVID exists on the ingress port, untagged and |
| 528 | priority-tagged packets must be accepted and forwarded according to the |
| 529 | bridge's port membership of the PVID VLAN. When the bridge has VLAN filtering |
| 530 | disabled, the presence/lack of a PVID should not influence the packet |
| 531 | forwarding decision. |
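
The ingress rules of this section can be summarized in a small user-space model. It is a sketch, not driver code: ``struct port_vlan_state`` and ``classify_ingress()`` are hypothetical, VLAN uppers are not modelled, and the model assumes that a port with a PVID is also a member of the PVID VLAN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VID_NONE 0u     /* untagged or 802.1p (VID 0) priority-tagged */
#define VID_MAX  4094u

/* Hypothetical per-port state; not the kernel's data structures. */
struct port_vlan_state {
    bool vlan_filtering;        /* bridge VLAN filtering setting */
    uint16_t pvid;              /* 0 means no PVID configured */
    bool member[VID_MAX + 1];   /* VLAN table membership, indexed by VID */
};

enum ingress_verdict { INGRESS_DROP, INGRESS_FORWARD };

/*
 * Decide whether a frame is accepted on ingress and, if it is, which
 * VID it is classified to (VID_NONE when the bridge is VLAN-unaware).
 */
static enum ingress_verdict
classify_ingress(const struct port_vlan_state *p, uint16_t frame_vid,
                 uint16_t *classified_vid)
{
    assert(frame_vid <= VID_MAX);
    *classified_vid = VID_NONE;

    /* Filtering off: forward regardless of VID, PVID or VLAN table. */
    if (!p->vlan_filtering)
        return INGRESS_FORWARD;

    /* Untagged and VID 0 frames take the PVID, or are dropped. */
    if (frame_vid == VID_NONE) {
        if (p->pvid == 0)
            return INGRESS_DROP;
        frame_vid = p->pvid;
    }

    /* Strict VID checking against the VLAN table. */
    if (!p->member[frame_vid])
        return INGRESS_DROP;

    *classified_vid = frame_vid;
    return INGRESS_FORWARD;
}
```

Treating VID 0 and untagged frames through the same branch reflects the requirement that 802.1p tagged packets be handled like untagged ones.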
| 532 | |
| 533 | Bridge IGMP snooping |
| 534 | ^^^^^^^^^^^^^^^^^^^^ |
| 535 | |
| 536 | The Linux bridge allows the configuration of IGMP snooping (statically, at |
| 537 | interface creation time, or dynamically, during runtime) which must be observed |
| 538 | by the underlying switchdev network device/hardware in the following way: |
| 539 | |
| 540 | - when IGMP snooping is turned off, multicast traffic must be flooded to all |
| 541 | ports within the same bridge that have mcast_flood=true. The CPU/management |
| 542 | port should ideally not be flooded (unless the ingress interface has |
| 543 | IFF_ALLMULTI or IFF_PROMISC) and continue to learn multicast traffic through |
| 544 | the network stack notifications. If the hardware is not capable of doing that |
| 545 | then the CPU/management port must also be flooded and multicast filtering |
| 546 | happens in software. |
| 547 | |
| 548 | - when IGMP snooping is turned on, multicast traffic must selectively flow |
| 549 | to the appropriate network ports (including CPU/management port). Flooding of |
| 550 | unknown multicast should be only towards the ports connected to a multicast |
| 551 | router (the local device may also act as a multicast router). |
| 552 | |
| 553 | The switch must adhere to RFC 4541 and flood multicast traffic accordingly |
| 554 | since that is what the Linux bridge implementation does. |
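
As an illustration of the two modes, the following user-space sketch computes an egress port mask for a multicast frame. All names are hypothetical. The model leaves out the CPU/management port and the always-flooded link-local groups, and it assumes that traffic for a known group goes to the subscribed ports plus the multicast router ports, which is how RFC 4541 describes data forwarding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per bridge port, up to 32 ports. */
struct mcast_bridge_state {
    bool snooping;          /* bridge multicast snooping on/off */
    uint32_t members;       /* ports that belong to this bridge */
    uint32_t mcast_flood;   /* ports with mcast_flood=true */
    uint32_t router_ports;  /* ports connected to a multicast router */
};

/*
 * Egress port mask for a multicast frame received on ingress_port.
 * group_ports holds the ports subscribed to the destination group and
 * is only used when group_known is true.
 */
static uint32_t mcast_egress_mask(const struct mcast_bridge_state *br,
                                  unsigned int ingress_port,
                                  bool group_known, uint32_t group_ports)
{
    uint32_t mask;

    assert(ingress_port < 32);

    if (!br->snooping)
        mask = br->mcast_flood;                 /* flood */
    else if (group_known)
        mask = group_ports | br->router_ports;  /* selective forwarding */
    else
        mask = br->router_ports;                /* unknown multicast */

    /* Stay within the bridge and never reflect to the ingress port. */
    return mask & br->members & ~(UINT32_C(1) << ingress_port);
}
```

A bitmask keeps the sketch close to how many switch ASICs represent forwarding destinations, but that choice is an assumption of the example and not a requirement of switchdev.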
| 555 | |
| 556 | Because IGMP snooping can be turned on/off at runtime, the switchdev driver |
| 557 | must be able to reconfigure the underlying hardware on the fly to honor the |
| 558 | toggling of that option and behave appropriately. |
| 559 | |
| 560 | A switchdev driver can also refuse to support dynamic toggling of the multicast |
| 561 | snooping knob at runtime and require the destruction of the bridge device(s) |
| 562 | and creation of new bridge device(s) with a different multicast snooping |
| 563 | value. |