.. SPDX-License-Identifier: GPL-2.0

======
AF_XDP
======

Overview
========

AF_XDP is an address family that is optimized for high performance
packet processing.

This document assumes that the reader is familiar with BPF and XDP. If
not, the Cilium project has an excellent reference guide at
http://cilium.readthedocs.io/en/latest/bpf/.

Using the XDP_REDIRECT action from an XDP program, the program can
redirect ingress frames to other XDP enabled netdevs, using the
bpf_redirect_map() function. AF_XDP sockets enable the possibility for
XDP programs to redirect frames to a memory buffer in a user-space
application.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two rings: the RX ring and the
TX ring. A socket can receive packets on the RX ring and it can send
packets on the TX ring. These rings are registered and sized with the
setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
to have at least one of these rings for each socket. An RX or TX
descriptor ring points to a data buffer in a memory area called a
UMEM. RX and TX can share the same UMEM so that a packet does not have
to be copied between RX and TX. Moreover, if a packet needs to be kept
for a while due to a possible retransmit, the descriptor that points
to that packet can be changed to point to another and reused right
away. This again avoids copying data.

The UMEM consists of a number of equally sized chunks. A descriptor in
one of the rings references a frame by referencing its addr. The addr
is simply an offset within the entire UMEM region. The user space
allocates memory for this UMEM using whatever means it feels is most
appropriate (malloc, mmap, huge pages, etc.). This memory area is then
registered with the kernel using the new setsockopt XDP_UMEM_REG. The
UMEM also has two rings: the FILL ring and the COMPLETION ring. The
FILL ring is used by the application to send down addrs for the kernel
to fill in with RX packet data. References to these frames will then
appear in the RX ring once each packet has been received. The
COMPLETION ring, on the other hand, contains frame addrs that the
kernel has transmitted completely and can now be used again by user
space, for either TX or RX. Thus, the frame addrs appearing in the
COMPLETION ring are addrs that were previously transmitted using the
TX ring. In summary, the RX and FILL rings are used for the RX path
and the TX and COMPLETION rings are used for the TX path.

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow.

The UMEM can be shared between processes, if desired. If a process
wants to do this, it simply skips the registration of the UMEM and its
corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
call and submits the XSK of the process it would like to share UMEM
with as well as its own newly created XSK socket. The new process will
then receive frame addr references in its own RX ring that point to
this shared UMEM. Note that since the ring structures are
single-consumer / single-producer (for performance reasons), the new
process has to create its own socket with associated RX and TX rings,
since it cannot share this with the other process. This is also the
reason that there is only one set of FILL and COMPLETION rings per
UMEM. It is the responsibility of a single process to handle the UMEM.

How are packets then distributed from an XDP program to the XSKs? There
is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
user-space application can place an XSK at an arbitrary place in this
map. The XDP program can then redirect a packet to a specific index in
this map and at this point XDP validates that the XSK in that map was
indeed bound to that device and ring number. If not, the packet is
dropped. If the map is empty at that index, the packet is also
dropped. This also means that it is currently mandatory to have an XDP
program loaded (and one XSK in the XSKMAP) to be able to get any
traffic to user space through the XSK.

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed. It uses SKBs
together with the generic XDP support and copies out the data to user
space. This is a fallback mode that works for any network device. On
the other hand, if the driver has support for XDP, it will be used by
the AF_XDP code to provide better performance, but there is still a
copy of the data into user space.

Concepts
========

In order to use an AF_XDP socket, a number of associated objects need
to be set up. These objects and their options are explained in the
following sections.

For an overview on how AF_XDP works, you can also take a look at the
Linux Plumbers paper from 2018 on the subject:
http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf. Do
NOT consult the paper from 2017 on "AF_PACKET v4", the first attempt
at AF_XDP. Nearly everything has changed since then. Jonathan Corbet has
also written an excellent article on LWN, "Accelerating networking
with AF_XDP". It can be found at https://lwn.net/Articles/750845/.

UMEM
----

UMEM is a region of virtually contiguous memory, divided into
equal-sized frames. A UMEM is associated with a netdev and a specific
queue id of that netdev. It is created and configured (chunk size,
headroom, start address and size) by using the XDP_UMEM_REG setsockopt
system call. A UMEM is bound to a netdev and queue id via the bind()
system call.

An AF_XDP socket is linked to a single UMEM, but one UMEM can have
multiple AF_XDP sockets. To share a UMEM created via one socket A,
the next socket B can do this by setting the XDP_SHARED_UMEM flag in
struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
of A to struct sockaddr_xdp member sxdp_shared_umem_fd.
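
The sharing step described above can be sketched as follows. This is a
minimal illustration, not code taken from the kernel tree: the helper
name shared_umem_addr() is hypothetical, and the fallback AF_XDP
definition is an assumption for C libraries that do not yet provide it.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44 /* assumption: fallback for C libraries lacking AF_XDP */
#endif

/* Hypothetical helper: build the bind() address for socket B, which
 * reuses the UMEM that was registered on socket A (umem_owner_fd). */
static struct sockaddr_xdp shared_umem_addr(unsigned int ifindex,
                                            unsigned int queue_id,
                                            int umem_owner_fd)
{
    struct sockaddr_xdp sxdp;

    assert(umem_owner_fd >= 0);
    memset(&sxdp, 0, sizeof(sxdp));
    sxdp.sxdp_family = AF_XDP;
    sxdp.sxdp_ifindex = ifindex;
    sxdp.sxdp_queue_id = queue_id;
    sxdp.sxdp_flags = XDP_SHARED_UMEM;               /* share A's UMEM */
    sxdp.sxdp_shared_umem_fd = (__u32)umem_owner_fd; /* fd of socket A */
    return sxdp;
}
```

Socket B would then pass the returned structure to bind(), cast to
struct sockaddr *, together with sizeof(struct sockaddr_xdp).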

The UMEM has two single-producer/single-consumer rings that are used
to transfer ownership of UMEM frames between the kernel and the
user-space application.

Rings
-----

There are four different kinds of rings: FILL, COMPLETION, RX and
TX. All rings are single-producer/single-consumer, so the user-space
application needs explicit synchronization if multiple
processes/threads are reading/writing to them.

The UMEM uses two rings: FILL and COMPLETION. Each socket associated
with the UMEM must have an RX queue, TX queue or both. Say that there
is a setup with four sockets (all doing TX and RX). Then there will be
one FILL ring, one COMPLETION ring, four TX rings and four RX rings.

The rings are head(producer)/tail(consumer) based rings. A producer
writes the data ring at the index pointed out by the struct xdp_ring
producer member, and then increases the producer index. A consumer
reads the data ring at the index pointed out by the struct xdp_ring
consumer member, and then increases the consumer index.

The rings are configured and created via the _RING setsockopt system
calls and mmapped to user-space using the appropriate offset to mmap()
(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
XDP_UMEM_PGOFF_COMPLETION_RING).

The size of each ring must be a power of two.
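
The power-of-two requirement is what lets the free-running producer
and consumer indices be turned into array slots with a simple mask.
The sketch below illustrates that arithmetic. The function names are
hypothetical and are not part of any kernel or libbpf API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A ring size is usable only if it is a non-zero power of two. */
static bool ring_size_ok(uint32_t size)
{
    return size != 0 && (size & (size - 1)) == 0;
}

/* Entries ready for the consumer. Unsigned subtraction stays correct
 * even after the 32-bit producer index has wrapped around. */
static uint32_t ring_entries(uint32_t producer, uint32_t consumer)
{
    return producer - consumer;
}

/* Map a free-running index to a slot in the descriptor array. */
static uint32_t ring_slot(uint32_t index, uint32_t size)
{
    assert(ring_size_ok(size));
    return index & (size - 1);
}
```

Because the indices are never reduced modulo the ring size when they
are stored, a full ring (producer - consumer == size) can be told apart
from an empty one (producer == consumer).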

UMEM Fill Ring
~~~~~~~~~~~~~~

The FILL ring is used to transfer ownership of UMEM frames from
user-space to kernel-space. The UMEM addrs are passed in the ring. As
an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
16 chunks and can pass addrs between 0 and 64k.

Frames passed to the kernel are used for the ingress path (RX rings).

The user application produces UMEM addrs to this ring. Note that, if
running the application with aligned chunk mode, the kernel will mask
the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of
the addr will be masked off, meaning that 2048, 2050 and 3000 refer
to the same chunk. If the user application is run in the unaligned
chunks mode, then the incoming addr will be left untouched.
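
The aligned-mode masking and the chunk arithmetic from the examples
above can be reproduced in a few lines. This is only a sketch of the
calculation under the stated chunk sizes. The helper names are
hypothetical, and the kernel's own implementation is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Clear the log2(chunk_size) least significant bits of addr, giving
 * the start of the chunk that addr falls into (aligned chunk mode). */
static uint64_t aligned_chunk_addr(uint64_t addr, uint64_t chunk_size)
{
    assert(chunk_size != 0 && (chunk_size & (chunk_size - 1)) == 0);
    return addr & ~(chunk_size - 1);
}

/* Number of chunks that fit into a UMEM of the given size. */
static uint64_t umem_chunk_count(uint64_t umem_size, uint64_t chunk_size)
{
    assert(chunk_size != 0);
    return umem_size / chunk_size;
}
```

With a 2k chunk size, the addrs 2048, 2050 and 3000 all map to 2048,
and a 64k UMEM with 4k chunks yields 16 chunks.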

UMEM Completion Ring
~~~~~~~~~~~~~~~~~~~~

The COMPLETION ring is used to transfer ownership of UMEM frames from
kernel-space to user-space. Just like the FILL ring, UMEM addrs are
used.

Frames passed from the kernel to user-space are frames that have been
sent (TX ring) and can be used by user-space again.

The user application consumes UMEM addrs from this ring.

RX Ring
~~~~~~~

The RX ring is the receiving side of a socket. Each entry in the ring
is a struct xdp_desc descriptor. The descriptor contains the UMEM
offset (addr) and the length of the data (len).

If no frames have been passed to the kernel via the FILL ring, no
descriptors will (or can) appear on the RX ring.

The user application consumes struct xdp_desc descriptors from this
ring.

TX Ring
~~~~~~~

The TX ring is used to send frames. The struct xdp_desc descriptor is
filled (addr and len) and passed into the ring.

To start the transfer, a sendmsg() system call is required. This might
be relaxed in the future.

The user application produces struct xdp_desc descriptors to this
ring.

Libbpf
======

Libbpf is a helper library for eBPF and XDP that makes using these
technologies a lot simpler. It also contains specific helper functions
in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It
contains two types of functions: those that can be used to make the
setup of an AF_XDP socket easier and ones that can be used in the data
plane to access the rings safely and quickly. To see an example on how
to use this API, please take a look at the sample application in
samples/bpf/xdpsock_user.c which uses libbpf for both setup and data
plane operations.

We recommend that you use this library unless you have become a power
user. It will make your program a lot simpler.

XSKMAP / BPF_MAP_TYPE_XSKMAP
============================

On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP)
that is used in conjunction with bpf_redirect_map() to pass the ingress
frame to a socket.

The user application inserts the socket into the map, via the bpf()
system call.
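
As a rough sketch of that insertion without any helper library, the
raw bpf() system call can be issued with the BPF_MAP_UPDATE_ELEM
command. The function name xskmap_insert() is hypothetical, and the
code assumes the map was created with 4-byte keys and values, as in the
XSKMAP definitions shown elsewhere in this document.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Hypothetical helper: store xsk_fd at position index of the XSKMAP
 * referred to by map_fd. Returns 0 on success, -1 with errno set on
 * failure. */
static int xskmap_insert(int map_fd, int index, int xsk_fd)
{
    union bpf_attr attr;

    assert(index >= 0);
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (__u32)map_fd;
    attr.key = (__u64)(unsigned long)&index;    /* pointer to the key */
    attr.value = (__u64)(unsigned long)&xsk_fd; /* pointer to the value */
    attr.flags = BPF_ANY;                       /* create or overwrite */

    return (int)syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
```

In practice most applications would let libbpf do this rather than
invoking the system call directly.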

Note that if an XDP program tries to redirect to a socket that does
not match the queue configuration and netdev, the frame will be
dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
queue 17. Only the XDP program executing for eth0 and queue 17 will
successfully pass data to the socket. Please refer to the sample
application (samples/bpf/) for an example.

Configuration Flags and Socket Options
======================================

These are the various configuration flags that can be used to control
and monitor the behavior of AF_XDP sockets.

XDP_COPY and XDP_ZERO_COPY bind flags
-------------------------------------

When you bind to a socket, the kernel will first try to use zero-copy
mode. If zero-copy is not supported, it will fall back on using copy
mode, i.e. copying all packets out to user space. But if you would
like to force a certain mode, you can use the following flags. If you
pass the XDP_COPY flag to the bind call, the kernel will force the
socket into copy mode. If it cannot use copy mode, the bind call will
fail with an error. Conversely, the XDP_ZERO_COPY flag will force the
socket into zero-copy mode or fail.

XDP_SHARED_UMEM bind flag
-------------------------

This flag enables you to bind multiple sockets to the same UMEM. It
works on the same queue id, between queue ids and between
netdevs/devices. In this mode, each socket has its own RX and TX
rings as usual, but you are going to have one or more FILL and
COMPLETION ring pairs. You have to create one of these pairs per
unique netdev and queue id tuple that you bind to.

We start with the case where we would like to share a UMEM between
sockets bound to the same netdev and queue id. The UMEM (tied to the
first socket created) will only have a single FILL ring and a single
COMPLETION ring as there is only one unique netdev,queue_id tuple that
we have bound to. To use this mode, create the first socket and bind
it in the normal way. Create a second socket and create an RX and a TX
ring, or at least one of them, but no FILL or COMPLETION rings as the
ones from the first socket will be used. In the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
sockets this way.

Which socket will a packet then arrive on? This is decided by the XDP
program. Put all the sockets in the XSKMAP and just indicate which
index in the array you would like to send each packet to. A simple
round-robin example of distributing packets is shown below:

.. code-block:: c

   #include <linux/bpf.h>
   #include "bpf_helpers.h"

   #define MAX_SOCKS 16

   struct {
       __uint(type, BPF_MAP_TYPE_XSKMAP);
       __uint(max_entries, MAX_SOCKS);
       __uint(key_size, sizeof(int));
       __uint(value_size, sizeof(int));
   } xsks_map SEC(".maps");

   static unsigned int rr;

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       rr = (rr + 1) & (MAX_SOCKS - 1);

       return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
   }

Note that since there is only a single set of FILL and COMPLETION
rings, and they are single-producer, single-consumer rings, you need
to make sure that multiple processes or threads do not use these rings
concurrently. There are no synchronization primitives in the
libbpf code that protect multiple users at this point in time.

Libbpf uses this mode if you create more than one socket tied to the
same UMEM. However, note that you need to supply the
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
xsk_socket__create calls and load your own XDP program as there is no
built-in one in libbpf that will route the traffic for you.

The second case is when you share a UMEM between sockets that are
bound to different queue ids and/or netdevs. In this case you have to
create one FILL ring and one COMPLETION ring for each unique
netdev,queue_id pair. Let us say you want to create two sockets bound
to two different queue ids on the same netdev. Create the first socket
and bind it in the normal way. Create a second socket and create an RX
and a TX ring, or at least one of them, and then one FILL and
COMPLETION ring for this socket. Then in the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field as you registered the UMEM on that
socket. These two sockets will now share one and the same UMEM.

There is no need to supply an XDP program like the one in the previous
case where sockets were bound to the same queue id and
device. Instead, use the NIC's packet steering capabilities to steer
the packets to the right queue. In the previous example, there is only
one queue shared among sockets, so the NIC cannot do this steering. It
can only steer between queues.

In libbpf, you need to use the xsk_socket__create_shared() API as it
takes a reference to a FILL ring and a COMPLETION ring that will be
created for you and bound to the shared UMEM. You can use this
function for all the sockets you create, or you can use it for the
second and following ones and use xsk_socket__create() for the first
one. Both methods yield the same result.

Note that a UMEM can be shared between sockets on the same queue id
and device, as well as between queues on the same device and between
devices at the same time.

XDP_USE_NEED_WAKEUP bind flag
-----------------------------

This option adds support for a new flag called need_wakeup that is
present in the FILL ring and the TX ring, the rings for which user
space is a producer. When this option is set in the bind call, the
need_wakeup flag will be set if the kernel needs to be explicitly
woken up by a syscall to continue processing packets. If the flag is
zero, no syscall is needed.

If the flag is set on the FILL ring, the application needs to call
poll() to be able to continue to receive packets on the RX ring. This
can happen, for example, when the kernel has detected that there are no
more buffers on the FILL ring and no buffers left on the RX HW ring of
the NIC. In this case, interrupts are turned off as the NIC cannot
receive any packets (as there are no buffers to put them in), and the
need_wakeup flag is set so that user space can put buffers on the
FILL ring and then call poll() so that the kernel driver can put these
buffers on the HW ring and start to receive packets.

If the flag is set for the TX ring, it means that the application
needs to explicitly notify the kernel to send any packets put on the
TX ring. This can be accomplished either by a poll() call, as in the
RX path, or by calling sendto().

An example of how to use this flag can be found in
samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers
would look like this for the TX path:

.. code-block:: c

   if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
       sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);

I.e., only use the syscall if the flag is set.

We recommend that you always enable this mode as it usually leads to
better performance especially if you run the application and the
driver on the same core, but also if you use different cores for the
application and the kernel driver, as it reduces the number of
syscalls needed for the TX path.

XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts
------------------------------------------------------

These setsockopts set the number of descriptors that the RX, TX,
FILL, and COMPLETION rings respectively should have. It is mandatory
to set the size of at least one of the RX and TX rings. If you set
both, you will be able to both receive and send traffic from your
application, but if you only want to do one of them, you can save
resources by only setting up one of them. Both the FILL ring and the
COMPLETION ring are mandatory as you need to have a UMEM tied to your
socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
first one does not have a UMEM and should in that case not have any
FILL or COMPLETION rings created as the ones from the shared UMEM will
be used. Note that the rings are single-producer single-consumer, so
do not try to access them from multiple processes at the same
time. See the XDP_SHARED_UMEM section.
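
A minimal sketch of sizing one ring with these setsockopts follows.
The wrapper name xsk_set_ring_size() is hypothetical, the local
power-of-two check simply mirrors the ring size rule stated earlier in
this document, and the fallback SOL_XDP definition is an assumption for
C libraries that do not yet provide it.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef SOL_XDP
#define SOL_XDP 283 /* assumption: fallback for C libraries lacking SOL_XDP */
#endif

/* Hypothetical wrapper: request ndescs descriptors for one ring of the
 * AF_XDP socket fd. ring_opt selects which ring is sized. Returns 0 on
 * success, -1 with errno set on failure. */
static int xsk_set_ring_size(int fd, int ring_opt, int ndescs)
{
    assert(ring_opt == XDP_RX_RING || ring_opt == XDP_TX_RING ||
           ring_opt == XDP_UMEM_FILL_RING ||
           ring_opt == XDP_UMEM_COMPLETION_RING);

    /* Reject sizes that are not a power of two before asking the kernel. */
    if (ndescs <= 0 || (ndescs & (ndescs - 1)) != 0) {
        errno = EINVAL;
        return -1;
    }
    return setsockopt(fd, SOL_XDP, ring_opt, &ndescs, sizeof(ndescs));
}
```

A receive-only socket with its own UMEM would call this for
XDP_UMEM_FILL_RING, XDP_UMEM_COMPLETION_RING and XDP_RX_RING, and leave
XDP_TX_RING unset.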

In libbpf, you can create Rx-only and Tx-only sockets by supplying
NULL to the rx and tx arguments, respectively, to the
xsk_socket__create function.

If you create a Tx-only socket, we recommend that you do not put any
packets on the fill ring. If you do this, drivers might think you are
going to receive something when you in fact will not, and this can
negatively impact performance.

XDP_UMEM_REG setsockopt
-----------------------

This setsockopt registers a UMEM to a socket. This is the area that
contains all the buffers that packets can reside in. The call takes a
pointer to the beginning of this area and the size of it. Moreover, it
also has a parameter called chunk_size that is the size that the UMEM
is divided into. It can only be 2K or 4K at the moment. If you have a
UMEM area that is 128K and a chunk size of 2K, this means that you
will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
area and that your largest packet size can be 2K.

There is also an option to set the headroom of each single buffer in
the UMEM. If you set this to N bytes, it means that the packet will
start N bytes into the buffer leaving the first N bytes for the
application to use. The final option is the flags field, but it will
be dealt with in separate sections for each UMEM flag.
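
The registration parameters described above map onto struct
xdp_umem_reg from linux/if_xdp.h. The sketch below fills in that
structure and passes it to setsockopt. The helper names are
hypothetical, the flags field is left at zero, and the fallback SOL_XDP
definition is an assumption for C libraries that do not yet provide it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef SOL_XDP
#define SOL_XDP 283 /* assumption: fallback for C libraries lacking SOL_XDP */
#endif

/* Hypothetical helper: describe a UMEM area of size bytes starting at
 * area, split into chunk_size chunks with headroom bytes reserved at
 * the start of each chunk. */
static struct xdp_umem_reg umem_reg_desc(void *area, uint64_t size,
                                         uint32_t chunk_size,
                                         uint32_t headroom)
{
    struct xdp_umem_reg reg;

    assert(headroom < chunk_size);
    memset(&reg, 0, sizeof(reg));
    reg.addr = (uint64_t)(uintptr_t)area;
    reg.len = size;
    reg.chunk_size = chunk_size;
    reg.headroom = headroom;
    return reg;
}

/* Register the described UMEM on the AF_XDP socket fd. */
static int umem_register(int fd, const struct xdp_umem_reg *reg)
{
    return setsockopt(fd, SOL_XDP, XDP_UMEM_REG, reg, sizeof(*reg));
}
```

For the 128K area with 2K chunks from the text, reg.len / reg.chunk_size
evaluates to 64, the maximum number of packets the UMEM can hold.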

XDP_STATISTICS getsockopt
-------------------------

Gets drop statistics of a socket that can be useful for debug
purposes. The supported statistics are shown below:

.. code-block:: c

   struct xdp_statistics {
       __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
       __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
       __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
   };
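
Reading these counters could look roughly like the following sketch.
The wrapper name xsk_read_stats() is hypothetical, and the fallback
SOL_XDP definition is an assumption for C libraries that do not yet
provide it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef SOL_XDP
#define SOL_XDP 283 /* assumption: fallback for C libraries lacking SOL_XDP */
#endif

/* Hypothetical wrapper: fetch the drop counters of the AF_XDP socket
 * fd into *stats. Returns 0 on success, -1 with errno set on failure. */
static int xsk_read_stats(int fd, struct xdp_statistics *stats)
{
    socklen_t optlen = sizeof(*stats);

    assert(stats != NULL);
    return getsockopt(fd, SOL_XDP, XDP_STATISTICS, stats, &optlen);
}
```

An application might poll this periodically and log any increase in
rx_dropped, which would suggest that the FILL or RX ring is not being
serviced quickly enough.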

XDP_OPTIONS getsockopt
----------------------

Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.

Usage
=====

In order to use AF_XDP sockets, two parts are needed: the
user-space application and the XDP program. For a complete setup and
usage example, please refer to the sample application. The user-space
side is xdpsock_user.c and the XDP side is part of libbpf.

The XDP code sample included in tools/lib/bpf/xsk.c is the following:

.. code-block:: c

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       int index = ctx->rx_queue_index;

       // A set entry here means that the corresponding queue_id
       // has an active AF_XDP socket bound to it.
       if (bpf_map_lookup_elem(&xsks_map, &index))
           return bpf_redirect_map(&xsks_map, index, 0);

       return XDP_PASS;
   }
Magnus Karlsson | b4b8faa | 2018-05-02 13:01:36 +0200 | [diff] [blame] | 479 | |
Magnus Karlsson | e0e4f8e | 2019-10-21 10:57:04 +0200 | [diff] [blame] | 480 | A simple but not so performance ring dequeue and enqueue could look |
| 481 | like this: |
| 482 | |
.. code-block:: c

    // RING and RING_TYPE stand for one of the two pairs below,
    // depending on which kind of ring is being accessed.

    // struct xdp_rxtx_ring {
    //     __u32 *producer;
    //     __u32 *consumer;
    //     struct xdp_desc *desc;
    // };

    // struct xdp_umem_ring {
    //     __u32 *producer;
    //     __u32 *consumer;
    //     __u64 *desc;
    // };

    // typedef struct xdp_rxtx_ring RING;
    // typedef struct xdp_umem_ring RING;

    // typedef struct xdp_desc RING_TYPE;
    // typedef __u64 RING_TYPE;

    // RING_SIZE is the number of entries in the ring. The index
    // masking below requires it to be a power of two.

    int dequeue_one(RING *ring, RING_TYPE *item)
    {
        __u32 entries = *ring->producer - *ring->consumer;

        if (entries == 0)
            return -1;

        // read-barrier! The entry must not be read before the
        // producer index.

        *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
        (*ring->consumer)++;
        return 0;
    }

    int enqueue_one(RING *ring, const RING_TYPE *item)
    {
        __u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);

        if (free_entries == 0)
            return -1;

        ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;

        // write-barrier! The entry must be written before the
        // producer index is updated.

        (*ring->producer)++;
        return 0;
    }

However, please use the libbpf functions instead, as they are optimized
and ready to use. They will make your life easier.

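To make the two barrier placeholders concrete, here is a minimal
stand-alone sketch that expresses them with C11 acquire/release
atomics. It is an illustration under assumptions, not the libbpf
implementation: ``struct demo_ring``, ``DEMO_RING_SIZE`` and the
``demo_*`` functions are hypothetical names, the ring lives in ordinary
process memory rather than in an area shared with the kernel, and it is
only correct for one producer and one consumer.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical ring size for this sketch; must be a power of two. */
#define DEMO_RING_SIZE 8u

/* Hypothetical stand-alone ring, not the kernel-shared layout. */
struct demo_ring {
    _Atomic uint32_t producer;
    _Atomic uint32_t consumer;
    uint64_t desc[DEMO_RING_SIZE];
};

static int demo_dequeue_one(struct demo_ring *ring, uint64_t *item)
{
    uint32_t cons = atomic_load_explicit(&ring->consumer,
                                         memory_order_relaxed);
    /* Acquire plays the role of the read barrier: once the new
     * producer index is seen, the entry written before it is too. */
    uint32_t prod = atomic_load_explicit(&ring->producer,
                                         memory_order_acquire);

    if (prod == cons)
        return -1;

    *item = ring->desc[cons & (DEMO_RING_SIZE - 1)];
    /* Release: finish reading the entry before handing the slot back. */
    atomic_store_explicit(&ring->consumer, cons + 1,
                          memory_order_release);
    return 0;
}

static int demo_enqueue_one(struct demo_ring *ring, const uint64_t *item)
{
    uint32_t prod = atomic_load_explicit(&ring->producer,
                                         memory_order_relaxed);
    uint32_t cons = atomic_load_explicit(&ring->consumer,
                                         memory_order_acquire);

    if (prod - cons == DEMO_RING_SIZE)
        return -1;

    ring->desc[prod & (DEMO_RING_SIZE - 1)] = *item;
    /* Release plays the role of the write barrier: the entry is
     * visible before the updated producer index is. */
    atomic_store_explicit(&ring->producer, prod + 1,
                          memory_order_release);
    return 0;
}
```

The indices are free-running 32-bit counters, so ``prod - cons`` gives
the fill level even after wrap-around, and masking with
``DEMO_RING_SIZE - 1`` selects the slot.
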
Sample application
==================

There is an xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with private UMEMs. Say that
you would like your UDP traffic from port 4242 to end up in queue 16,
on which we will enable AF_XDP. Here, we use ethtool for this::

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using::

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N". All options
can be displayed with "-h", as usual.

This sample application uses libbpf to make the setup and usage of
AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is
really used to make something more advanced, take a look at the libbpf
code in tools/lib/bpf/xsk.[ch].

FAQ
===

Q: I am not seeing any traffic on the socket. What am I doing wrong?

A: When a netdev of a physical NIC is initialized, Linux usually
   allocates one RX and TX queue pair per core. So on an 8-core system,
   queue ids 0 to 7 will be allocated, one per core. In the AF_XDP
   bind call or the xsk_socket__create libbpf function call, you
   specify a specific queue id to bind to and it is only the traffic
   towards that queue you are going to get on your socket. So in the
   example above, if you bind to queue 0, you are NOT going to get any
   traffic that is distributed to queues 1 through 7. If you are
   lucky, you will see the traffic, but usually it will end up on one
   of the queues you have not bound to.

   There are a number of ways to solve the problem of getting the
   traffic you want to the queue id you bound to. If you want to see
   all the traffic, you can force the netdev to only have 1 queue, queue
   id 0, and then bind to queue 0. You can use ethtool to do this::

     sudo ethtool -L <interface> combined 1

   If you want to only see part of the traffic, you can program the
   NIC through ethtool to steer your traffic to a single queue id
   that you can bind your XDP socket to. Here is one example in which
   UDP traffic to and from port 4242 is sent to queue 2::

     sudo ethtool -N <interface> rx-flow-hash udp4 fn
     sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
       4242 action 2

   A number of other ways are possible, all depending on the
   capabilities of the NIC you have.

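As a reminder of where the queue id enters the picture when the raw
uapi is used, the following sketch fills in the address that is passed
to bind(). ``demo_fill_xsk_addr`` is a hypothetical helper written for
this illustration; ``struct sockaddr_xdp`` and its fields come from
``linux/if_xdp.h``. The fallback definition of ``AF_XDP`` is only there
for C libraries that do not provide it.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44   /* fallback for older C library headers */
#endif

/* Hypothetical helper: prepare the address that binds an XSK to a
 * single queue of a single netdev. Only traffic that the NIC places
 * on queue_id can reach a socket bound with this address. */
static void demo_fill_xsk_addr(struct sockaddr_xdp *sxdp,
                               unsigned int ifindex,
                               unsigned int queue_id)
{
    memset(sxdp, 0, sizeof(*sxdp));
    sxdp->sxdp_family = AF_XDP;
    sxdp->sxdp_ifindex = ifindex;
    sxdp->sxdp_queue_id = queue_id;
}
```

An application would then call
``bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp))`` on its AF_XDP
socket once the UMEM and rings have been set up. With the second
ethtool example above, ``queue_id`` would be 2.
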
Q: Can I use the XSKMAP to implement a switch between different umems
   in copy mode?

A: The short answer is no, that is not supported at the moment. The
   XSKMAP can only be used to switch traffic coming in on queue id X
   to sockets bound to the same queue id X. The XSKMAP can contain
   sockets bound to different queue ids, for example X and Y, but only
   traffic coming in from queue id Y can be directed to sockets bound
   to the same queue id Y. In zero-copy mode, you should use the
   switch, or other distribution mechanism, in your NIC to direct
   traffic to the correct queue id and socket.

Q: My packets are sometimes corrupted. What is wrong?

A: Care has to be taken not to feed the same buffer in the UMEM into
   more than one ring at the same time. If you for example feed the
   same buffer into the FILL ring and the TX ring at the same time, the
   NIC might receive data into the buffer at the same time it is
   sending it. This will cause some packets to become corrupted. The
   same thing goes for feeding the same buffer into the FILL rings
   belonging to different queue ids or netdevs bound with the
   XDP_SHARED_UMEM flag.

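One way to catch this kind of mistake during development is to track,
in the application, which ring currently owns each UMEM chunk. The
sketch below is purely illustrative: every name in it is hypothetical,
the chunk count and size are arbitrary, and nothing like it is part of
the AF_XDP uapi or libbpf.

```c
#include <stdbool.h>
#include <stdint.h>

/* Arbitrary values chosen for this sketch. */
#define DEMO_NUM_CHUNKS 4u
#define DEMO_CHUNK_SIZE 2048u

enum demo_owner {
    DEMO_OWNER_APP = 0,  /* held by the application, safe to hand out */
    DEMO_OWNER_FILL,     /* posted on a FILL ring */
    DEMO_OWNER_TX,       /* posted on a TX ring */
};

/* Zero-initialized, every chunk starts out owned by the application. */
struct demo_umem_tracker {
    enum demo_owner owner[DEMO_NUM_CHUNKS];
};

/* Record that the chunk containing addr is being posted on a ring.
 * Returns false if the chunk is already on some ring, in which case
 * the caller must not post it again. */
static bool demo_give(struct demo_umem_tracker *t, uint64_t addr,
                      enum demo_owner ring)
{
    uint64_t chunk = addr / DEMO_CHUNK_SIZE;

    if (ring == DEMO_OWNER_APP || chunk >= DEMO_NUM_CHUNKS)
        return false;
    if (t->owner[chunk] != DEMO_OWNER_APP)
        return false;

    t->owner[chunk] = ring;
    return true;
}

/* Record that the chunk came back to the application, for example
 * after its address was read from the RX or COMPLETION ring. */
static bool demo_take_back(struct demo_umem_tracker *t, uint64_t addr)
{
    uint64_t chunk = addr / DEMO_CHUNK_SIZE;

    if (chunk >= DEMO_NUM_CHUNKS || t->owner[chunk] == DEMO_OWNER_APP)
        return false;

    t->owner[chunk] = DEMO_OWNER_APP;
    return true;
}
```

Calling ``demo_give()`` before every FILL or TX enqueue and checking
its return value turns a silent double submission into an error that
can be reported immediately.
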
Credits
=======

- Björn Töpel (AF_XDP core)
- Magnus Karlsson (AF_XDP core)
- Alexander Duyck
- Alexei Starovoitov
- Daniel Borkmann
- Jesper Dangaard Brouer
- John Fastabend
- Jonathan Corbet (LWN coverage)
- Michael S. Tsirkin
- Qi Z Zhang
- Willem de Bruijn