.. SPDX-License-Identifier: GPL-2.0

======
AF_XDP
======

Overview
========

AF_XDP is an address family that is optimized for high performance
packet processing.

This document assumes that the reader is familiar with BPF and XDP. If
not, the Cilium project has an excellent reference guide at
http://cilium.readthedocs.io/en/latest/bpf/.

Using the XDP_REDIRECT action from an XDP program, the program can
redirect ingress frames to other XDP enabled netdevs, using the
bpf_redirect_map() function. AF_XDP sockets make it possible for XDP
programs to redirect frames to a memory buffer in a user-space
application.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two rings: the RX ring and the
TX ring. A socket can receive packets on the RX ring and it can send
packets on the TX ring. These rings are registered and sized with the
setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
to have at least one of these rings for each socket. An RX or TX
descriptor ring points to a data buffer in a memory area called a
UMEM. RX and TX can share the same UMEM so that a packet does not have
to be copied between RX and TX. Moreover, if a packet needs to be kept
for a while due to a possible retransmit, the descriptor that points
to that packet can be changed to point to another and reused right
away. This again avoids copying data.

The UMEM consists of a number of equally sized chunks. A descriptor in
one of the rings references a frame by referencing its addr. The addr
is simply an offset within the entire UMEM region. The user space
allocates memory for this UMEM using whatever means it feels is most
appropriate (malloc, mmap, huge pages, etc.). This memory area is then
registered with the kernel using the new setsockopt XDP_UMEM_REG. The
UMEM also has two rings: the FILL ring and the COMPLETION ring. The
fill ring is used by the application to pass addrs down to the kernel,
which fills the corresponding frames with RX packet data. References
to these frames will then appear in the RX ring once each packet has
been received. The completion ring, on the other hand, contains frame
addrs that the kernel has transmitted completely and that can now be
used again by user space, for either TX or RX. Thus, the frame addrs
appearing in the completion ring are addrs that were previously
transmitted using the TX ring. In summary, the RX and FILL rings are
used for the RX path and the TX and COMPLETION rings are used for the
TX path.

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow.

The UMEM can be shared between processes, if desired. If a process
wants to do this, it simply skips the registration of the UMEM and its
corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
call and submits the XSK of the process it would like to share UMEM
with as well as its own newly created XSK. The new process will then
receive frame addr references in its own RX ring that point to this
shared UMEM. Note that since the ring structures are
single-consumer / single-producer (for performance reasons), the new
process has to create its own socket with associated RX and TX rings,
since it cannot share these with the other process. This is also the
reason that there is only one set of FILL and COMPLETION rings per
UMEM. It is the responsibility of a single process to handle the UMEM.

How are packets then distributed from an XDP program to the XSKs?
There is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
user-space application can place an XSK at an arbitrary place in this
map. The XDP program can then redirect a packet to a specific index in
this map and at this point XDP validates that the XSK in that map was
indeed bound to that device and ring number. If not, the packet is
dropped. If the map is empty at that index, the packet is also
dropped. This also means that it is currently mandatory to have an XDP
program loaded (and one XSK in the XSKMAP) to be able to get any
traffic to user space through the XSK.

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed, which uses
SKBs together with the generic XDP support and copies out the data to
user space. This is a fallback mode that works for any network
device. On the other hand, if the driver has support for XDP, it will
be used by the AF_XDP code to provide better performance, but there is
still a copy of the data into user space.

Concepts
========

In order to use an AF_XDP socket, a number of associated objects need
to be set up.

Jonathan Corbet has also written an excellent article on LWN,
"Accelerating networking with AF_XDP". It can be found at
https://lwn.net/Articles/750845/.

UMEM
----

UMEM is a region of virtually contiguous memory, divided into
equal-sized frames. A UMEM is associated with a netdev and a specific
queue id of that netdev. It is created and configured (chunk size,
headroom, start address and size) by using the XDP_UMEM_REG setsockopt
system call. A UMEM is bound to a netdev and queue id via the bind()
system call.
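
The registration step can be sketched as follows. This is a minimal
illustration, not taken from the sample application: the helper names
make_umem_reg() and register_umem() are hypothetical, and the fallback
value for SOL_XDP is only there for older libc headers. Only the four
parameters listed above (start address, size, chunk size, headroom)
are set; any other fields of struct xdp_umem_reg are left zeroed.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef SOL_XDP
#define SOL_XDP 283 /* fallback for older libc headers */
#endif

/* Hypothetical helper: describe a UMEM of `size` bytes starting at
 * `area`, split into `chunk_size`-byte chunks, with `headroom` bytes
 * reserved at the start of each chunk. */
static struct xdp_umem_reg make_umem_reg(void *area, uint64_t size,
                                         uint32_t chunk_size,
                                         uint32_t headroom)
{
    struct xdp_umem_reg reg;

    memset(&reg, 0, sizeof(reg));
    reg.addr = (uint64_t)(uintptr_t)area;
    reg.len = size;
    reg.chunk_size = chunk_size;
    reg.headroom = headroom;
    return reg;
}

/* Hypothetical helper: register the UMEM on an AF_XDP socket.
 * Returns 0 on success, -1 with errno set on failure. */
static int register_umem(int xsk_fd, const struct xdp_umem_reg *reg)
{
    return setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_REG, reg, sizeof(*reg));
}
```

A caller would allocate the memory area first, build the descriptor
with make_umem_reg(area, size, 4096, 0) and pass it to
register_umem() together with a file descriptor obtained from
socket().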

An AF_XDP socket is linked to a single UMEM, but one UMEM can have
multiple AF_XDP sockets. To share a UMEM created via one socket A,
the next socket B can do this by setting the XDP_SHARED_UMEM flag in
struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
of A to struct sockaddr_xdp member sxdp_shared_umem_fd.
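
As a rough sketch of the sharing setup described above, socket B could
build its bind() address like this. The helper names
shared_umem_addr() and bind_shared() are hypothetical, and the AF_XDP
fallback define is only an assumption for older libc headers.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44 /* fallback for older libc headers */
#endif

/* Hypothetical helper: build the bind() address for socket B, which
 * reuses the UMEM that was registered through socket A (fd_a). */
static struct sockaddr_xdp shared_umem_addr(uint32_t ifindex,
                                            uint32_t queue_id, int fd_a)
{
    struct sockaddr_xdp sxdp;

    memset(&sxdp, 0, sizeof(sxdp));
    sxdp.sxdp_family = AF_XDP;
    sxdp.sxdp_ifindex = ifindex;
    sxdp.sxdp_queue_id = queue_id;
    sxdp.sxdp_flags = XDP_SHARED_UMEM;
    sxdp.sxdp_shared_umem_fd = (uint32_t)fd_a;
    return sxdp;
}

/* Hypothetical helper: bind socket B using the address built above.
 * Returns 0 on success, -1 with errno set on failure. */
static int bind_shared(int fd_b, const struct sockaddr_xdp *sxdp)
{
    return bind(fd_b, (const struct sockaddr *)sxdp, sizeof(*sxdp));
}
```

Socket B still needs its own RX and/or TX ring before the bind, but it
skips the UMEM registration and the Fill and Completion ring setup.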

The UMEM has two single-producer/single-consumer rings that are used
to transfer ownership of UMEM frames between the kernel and the
user-space application.

Rings
-----

There are four different kinds of rings: Fill, Completion, RX and
TX. All rings are single-producer/single-consumer, so the user-space
application needs explicit synchronization if multiple
processes/threads are reading/writing to them.

The UMEM uses two rings: Fill and Completion. Each socket associated
with the UMEM must have an RX ring, a TX ring or both. Say that there
is a setup with four sockets (all doing TX and RX). Then there will be
one Fill ring, one Completion ring, four TX rings and four RX rings.

The rings are head(producer)/tail(consumer) based rings. A producer
writes the data ring at the index pointed out by the struct xdp_ring
producer member, and then increases the producer index. A consumer
reads the data ring at the index pointed out by the struct xdp_ring
consumer member, and then increases the consumer index.

The rings are configured and created via the _RING setsockopt system
calls and mmapped to user-space using the appropriate offset to mmap()
(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
XDP_UMEM_PGOFF_COMPLETION_RING).

The size of each ring must be a power of two.
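
One practical consequence of the power-of-two requirement is that a
free-running producer or consumer counter can be turned into a slot
index with a single mask, which is what the naive dequeue and enqueue
code in the Usage section does. A small sketch, with the hypothetical
helper names is_power_of_two() and ring_slot():

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n is a valid ring size, i.e. a non-zero power of two. */
static bool is_power_of_two(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Map a free-running 32-bit counter to a slot in a ring with
 * ring_size entries. Only correct when ring_size is a power of two. */
static uint32_t ring_slot(uint32_t counter, uint32_t ring_size)
{
    return counter & (ring_size - 1);
}
```

Because the counters are unsigned 32-bit values, they can wrap around
without any special handling as long as the mask is applied on every
access.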

UMEM Fill Ring
~~~~~~~~~~~~~~

The Fill ring is used to transfer ownership of UMEM frames from
user-space to kernel-space. The UMEM addrs are passed in the ring. As
an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
16 chunks and can pass addrs between 0 and 64k.
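
The arithmetic in this example can be written out as a short helper
that lists the base addr of every chunk, for instance to seed the Fill
ring at startup. The function name umem_chunk_addrs() is hypothetical
and the code is only a sketch of the calculation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the base addr of every chunk into
 * addrs[]. Returns the number of chunks, or 0 if chunk_size is 0 or
 * the output array (max entries) is too small. */
static size_t umem_chunk_addrs(uint64_t umem_size, uint64_t chunk_size,
                               uint64_t *addrs, size_t max)
{
    uint64_t n, i;

    if (chunk_size == 0)
        return 0;

    n = umem_size / chunk_size;
    if (n > max)
        return 0;

    for (i = 0; i < n; i++)
        addrs[i] = i * chunk_size; /* offset from the UMEM start */

    return (size_t)n;
}
```

For a 64k UMEM with 4k chunks this yields 16 addrs: 0, 4096, 8192 and
so on up to 61440.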

Frames passed to the kernel are used for the ingress path (RX rings).

The user application produces UMEM addrs to this ring. Note that the
kernel will mask the incoming addr. E.g. for a chunk size of 2k, the
log2(2048) = 11 least significant bits of the addr will be masked
off, meaning that 2048, 2050 and 3000 refer to the same chunk.
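
The masking can be reproduced in user space to see which chunk an addr
falls into. The sketch below assumes a power-of-two chunk size, as in
the example; chunk_base() is a hypothetical name and this is not the
kernel's actual code.

```c
#include <stdint.h>

/* Clear the low log2(chunk_size) bits of addr, giving the base addr
 * of the chunk that addr falls into. chunk_size must be a power of
 * two. */
static uint64_t chunk_base(uint64_t addr, uint64_t chunk_size)
{
    return addr & ~(chunk_size - 1);
}
```

With a chunk size of 2048, the addrs 2048, 2050 and 3000 all map to
chunk base 2048, while 4096 starts the next chunk.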


UMEM Completion Ring
~~~~~~~~~~~~~~~~~~~~

The Completion Ring is used to transfer ownership of UMEM frames from
kernel-space to user-space. Just like the Fill ring, UMEM addrs are
used.

Frames passed from the kernel to user-space are frames that have been
sent (TX ring) and can be used by user-space again.

The user application consumes UMEM addrs from this ring.


RX Ring
~~~~~~~

The RX ring is the receiving side of a socket. Each entry in the ring
is a struct xdp_desc descriptor. The descriptor contains the UMEM
offset (addr) and the length of the data (len).

If no frames have been passed to the kernel via the Fill ring, no
descriptors will (or can) appear on the RX ring.

The user application consumes struct xdp_desc descriptors from this
ring.

TX Ring
~~~~~~~

The TX ring is used to send frames. The struct xdp_desc descriptor is
filled (UMEM addr and length) and passed into the ring.

To start the transfer a sendmsg() system call is required. This might
be relaxed in the future.

The user application produces struct xdp_desc descriptors to this
ring.
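
A minimal sketch of the two TX steps, filling a descriptor and then
kicking the kernel, could look like this. The helper names
make_tx_desc() and kick_tx() are hypothetical. Passing an empty
message with MSG_DONTWAIT is an assumption about how the kick is
issued; check the sample application for the exact call it uses.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

/* Hypothetical helper: describe one frame to send. addr is the
 * offset of the packet data within the UMEM, len its length in
 * bytes. */
static struct xdp_desc make_tx_desc(uint64_t addr, uint32_t len)
{
    struct xdp_desc desc;

    memset(&desc, 0, sizeof(desc));
    desc.addr = addr;
    desc.len = len;
    return desc;
}

/* Hypothetical helper: ask the kernel to process the TX ring. The
 * message itself is empty, since the data is described by the
 * descriptors already placed on the ring. Returns -1 with errno set
 * on failure. */
static int kick_tx(int xsk_fd)
{
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    return (int)sendmsg(xsk_fd, &msg, MSG_DONTWAIT);
}
```

The descriptor returned by make_tx_desc() would be written into the TX
ring before kick_tx() is called, and the frame's addr later shows up
on the Completion ring once the kernel is done with it.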

XSKMAP / BPF_MAP_TYPE_XSKMAP
----------------------------

On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP)
that is used in conjunction with bpf_redirect_map() to pass the
ingress frame to a socket.

The user application inserts the socket into the map, via the bpf()
system call.

Note that if an XDP program tries to redirect to a socket that does
not match the queue configuration and netdev, the frame will be
dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
queue 17. Only the XDP program executing for eth0 and queue 17 will
successfully pass data to the socket. Please refer to the sample
application (in samples/bpf/) for an example.

Usage
=====

In order to use AF_XDP sockets there are two parts needed. The
user-space application and the XDP program. For a complete setup and
usage example, please refer to the sample application. The user-space
side is xdpsock_user.c and the XDP side xdpsock_kern.c.

Naive ring dequeue and enqueue could look like this::

    // struct xdp_rxtx_ring {
    //     __u32 *producer;
    //     __u32 *consumer;
    //     struct xdp_desc *desc;
    // };

    // struct xdp_umem_ring {
    //     __u32 *producer;
    //     __u32 *consumer;
    //     __u64 *desc;
    // };

    // typedef struct xdp_rxtx_ring RING;
    // typedef struct xdp_umem_ring RING;

    // typedef struct xdp_desc RING_TYPE;
    // typedef __u64 RING_TYPE;

    // RING_SIZE is the number of entries in the ring (a power of two).

    int dequeue_one(RING *ring, RING_TYPE *item)
    {
        __u32 entries = *ring->producer - *ring->consumer;

        if (entries == 0)
            return -1;

        // read-barrier!

        *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
        (*ring->consumer)++;
        return 0;
    }

    int enqueue_one(RING *ring, const RING_TYPE *item)
    {
        __u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);

        if (free_entries == 0)
            return -1;

        ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;

        // write-barrier!

        (*ring->producer)++;
        return 0;
    }


For a more optimized version, please refer to the sample application.

Sample application
==================

There is an xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with both private and shared
UMEMs. Say that you would like your UDP traffic from port 4242 to end
up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
for this::

    ethtool -N p3p2 rx-flow-hash udp4 fn
    ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
        action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using::

    samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N". All options
can be displayed with "-h", as usual.

Credits
=======

- Björn Töpel (AF_XDP core)
- Magnus Karlsson (AF_XDP core)
- Alexander Duyck
- Alexei Starovoitov
- Daniel Borkmann
- Jesper Dangaard Brouer
- John Fastabend
- Jonathan Corbet (LWN coverage)
- Michael S. Tsirkin
- Qi Z Zhang
- Willem de Bruijn