.. SPDX-License-Identifier: GPL-2.0

======
AF_XDP
======

Overview
========

AF_XDP is an address family that is optimized for high performance
packet processing.

This document assumes that the reader is familiar with BPF and XDP. If
not, the Cilium project has an excellent reference guide at
http://cilium.readthedocs.io/en/latest/bpf/.

Using the XDP_REDIRECT action from an XDP program, the program can
redirect ingress frames to other XDP enabled netdevs, using the
bpf_redirect_map() function. AF_XDP sockets enable the possibility for
XDP programs to redirect frames to a memory buffer in a user-space
application.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two rings: the RX ring and the
TX ring. A socket can receive packets on the RX ring and it can send
packets on the TX ring. These rings are registered and sized with the
setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
to have at least one of these rings for each socket. An RX or TX
descriptor ring points to a data buffer in a memory area called a
UMEM. RX and TX can share the same UMEM so that a packet does not have
to be copied between RX and TX. Moreover, if a packet needs to be kept
for a while due to a possible retransmit, the descriptor that points
to that packet can be changed to point to another and reused right
away. This again avoids copying data.

The UMEM consists of a number of equally sized chunks. A descriptor in
one of the rings references a frame by referencing its addr. The addr
is simply an offset within the entire UMEM region. The user space
allocates memory for this UMEM using whatever means it feels is most
appropriate (malloc, mmap, huge pages, etc). This memory area is then
registered with the kernel using the new setsockopt XDP_UMEM_REG. The
UMEM also has two rings: the FILL ring and the COMPLETION ring. The
FILL ring is used by the application to send down addrs for the kernel
to fill in with RX packet data. References to these frames will then
appear in the RX ring once each packet has been received. The
COMPLETION ring, on the other hand, contains frame addrs that the
kernel has transmitted completely and that can now be used again by
user space, for either TX or RX. Thus, the frame addrs appearing in the
COMPLETION ring are addrs that were previously transmitted using the
TX ring. In summary, the RX and FILL rings are used for the RX path
and the TX and COMPLETION rings are used for the TX path.

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow.

The UMEM can be shared between processes, if desired. If a process
wants to do this, it simply skips the registration of the UMEM and its
corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
call and submits the XSK of the process it would like to share UMEM
with as well as its own newly created XSK socket. The new process will
then receive frame addr references in its own RX ring that point to
this shared UMEM. Note that since the ring structures are
single-consumer / single-producer (for performance reasons), the new
process has to create its own socket with associated RX and TX rings,
since it cannot share this with the other process. This is also the
reason that there is only one set of FILL and COMPLETION rings per
UMEM. It is the responsibility of a single process to handle the UMEM.

How are packets then distributed from an XDP program to the XSKs? There
is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
user-space application can place an XSK at an arbitrary place in this
map. The XDP program can then redirect a packet to a specific index in
this map and at this point XDP validates that the XSK in that map was
indeed bound to that device and ring number. If not, the packet is
dropped. If the map is empty at that index, the packet is also
dropped. This also means that it is currently mandatory to have an XDP
program loaded (and one XSK in the XSKMAP) to be able to get any
traffic to user space through the XSK.

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed, which uses SKBs
together with the generic XDP support and copies out the data to user
space. This is a fallback mode that works for any network device. On
the other hand, if the driver has support for XDP, it will be used by
the AF_XDP code to provide better performance, but there is still a
copy of the data into user space.

Concepts
========

In order to use an AF_XDP socket, a number of associated objects need
to be set up. These objects and their options are explained in the
following sections.

For an overview of how AF_XDP works, you can also take a look at the
Linux Plumbers paper from 2018 on the subject:
http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf. Do
NOT consult the paper from 2017 on "AF_PACKET v4", the first attempt
at AF_XDP. Nearly everything has changed since then. Jonathan Corbet has
also written an excellent article on LWN, "Accelerating networking
with AF_XDP". It can be found at https://lwn.net/Articles/750845/.

UMEM
----

UMEM is a region of virtually contiguous memory, divided into
equal-sized frames. A UMEM is associated with a netdev and a specific
queue id of that netdev. It is created and configured (chunk size,
headroom, start address and size) by using the XDP_UMEM_REG setsockopt
system call. A UMEM is bound to a netdev and queue id, via the bind()
system call.

An AF_XDP socket is linked to a single UMEM, but one UMEM can have
multiple AF_XDP sockets. To share a UMEM created via one socket A,
the next socket B can do this by setting the XDP_SHARED_UMEM flag in
struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
of A to struct sockaddr_xdp member sxdp_shared_umem_fd.
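
As a minimal sketch of the sharing step described above, the bind
address for socket B could be filled in as follows. The helper name
fill_shared_umem_addr() is hypothetical and only used for illustration;
the struct sockaddr_xdp fields and the XDP_SHARED_UMEM flag come from
linux/if_xdp.h. The AF_XDP fallback define is an assumption for older
libc headers.

.. code-block:: c

   #include <string.h>
   #include <sys/socket.h>
   #include <linux/if_xdp.h>

   #ifndef AF_XDP
   #define AF_XDP 44 /* assumed fallback for older libc headers */
   #endif

   /* Hypothetical helper: prepare the bind() address for a socket B
    * that shares the UMEM registered on socket A (fd_a).
    */
   static void fill_shared_umem_addr(struct sockaddr_xdp *sxdp,
                                     unsigned int ifindex,
                                     unsigned int queue_id, int fd_a)
   {
       memset(sxdp, 0, sizeof(*sxdp));
       sxdp->sxdp_family = AF_XDP;
       sxdp->sxdp_ifindex = ifindex;
       sxdp->sxdp_queue_id = queue_id;
       sxdp->sxdp_flags = XDP_SHARED_UMEM;
       sxdp->sxdp_shared_umem_fd = fd_a;
   }

The filled-in structure would then be passed to bind() on socket B,
cast to struct sockaddr \* together with sizeof(struct sockaddr_xdp).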

The UMEM has two single-producer/single-consumer rings that are used
to transfer ownership of UMEM frames between the kernel and the
user-space application.

Rings
-----

There are four different kinds of rings: FILL, COMPLETION, RX and
TX. All rings are single-producer/single-consumer, so the user-space
application needs explicit synchronization if multiple
processes/threads are reading/writing to them.

The UMEM uses two rings: FILL and COMPLETION. Each socket associated
with the UMEM must have an RX queue, TX queue or both. Say that there
is a setup with four sockets (all doing TX and RX). Then there will be
one FILL ring, one COMPLETION ring, four TX rings and four RX rings.

The rings are head(producer)/tail(consumer) based rings. A producer
writes to the data ring at the index pointed to by the struct xdp_ring
producer member, and then increases the producer index. A consumer
reads the data ring at the index pointed to by the struct xdp_ring
consumer member, and then increases the consumer index.
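
To illustrate the producer/consumer accounting, here is a small sketch
of how the number of used and free entries could be derived from the
two indices. The helper names ring_used() and ring_free() are
hypothetical. The sketch assumes free-running 32-bit indices, so the
unsigned subtraction stays correct when the counters wrap around.

.. code-block:: c

   #include <stdint.h>

   /* Entries that have been produced but not yet consumed. */
   static uint32_t ring_used(uint32_t producer, uint32_t consumer)
   {
       return producer - consumer;
   }

   /* Entries the producer may still write before the ring is full. */
   static uint32_t ring_free(uint32_t producer, uint32_t consumer,
                             uint32_t size)
   {
       return size - ring_used(producer, consumer);
   }

For example, with producer index 1 and consumer index 0xFFFFFFFF, the
producer has wrapped around and two entries are outstanding.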

The rings are configured and created via the _RING setsockopt system
calls and mmapped to user-space using the appropriate offset to mmap()
(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
XDP_UMEM_PGOFF_COMPLETION_RING).

The size of each ring must be a power of two.
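
The power-of-two requirement is what allows a ring index to be turned
into a slot with a cheap bit mask instead of a modulo operation. A
minimal sketch, using the hypothetical helper names ring_size_valid()
and ring_slot():

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* True if n is a non-zero power of two. */
   static bool ring_size_valid(uint32_t n)
   {
       return n != 0 && (n & (n - 1)) == 0;
   }

   /* Map a free-running index to a slot; size must be a power of two. */
   static uint32_t ring_slot(uint32_t index, uint32_t size)
   {
       return index & (size - 1);
   }

An application could use a check like ring_size_valid() on its chosen
ring sizes before issuing the _RING setsockopt calls.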

UMEM Fill Ring
~~~~~~~~~~~~~~

The FILL ring is used to transfer ownership of UMEM frames from
user-space to kernel-space. The UMEM addrs are passed in the ring. As
an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
16 chunks and can pass addrs between 0 and 64k.

Frames passed to the kernel are used for the ingress path (RX rings).

The user application produces UMEM addrs to this ring. Note that, if
running the application with aligned chunk mode, the kernel will mask
the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of
the addr will be masked off, meaning that 2048, 2050 and 3000 refer
to the same chunk. If the user application is run in the unaligned
chunks mode, then the incoming addr will be left untouched.
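
The masking in aligned chunk mode can be sketched as follows. The
helper name umem_chunk_base() is hypothetical, and the sketch assumes
that the chunk size is a power of two. It mirrors the arithmetic
described above, not the kernel's actual code.

.. code-block:: c

   #include <stdint.h>

   /* Clear the log2(chunk_size) least significant bits of addr. */
   static uint64_t umem_chunk_base(uint64_t addr, uint64_t chunk_size)
   {
       return addr & ~(chunk_size - 1);
   }

With a chunk size of 2048, the addrs 2048, 2050 and 3000 all map to
the chunk starting at 2048.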


UMEM Completion Ring
~~~~~~~~~~~~~~~~~~~~

The COMPLETION ring is used to transfer ownership of UMEM frames from
kernel-space to user-space. Just like the FILL ring, UMEM addrs are
used.

Frames passed from the kernel to user-space are frames that have been
sent (TX ring) and can be used by user-space again.

The user application consumes UMEM addrs from this ring.


RX Ring
~~~~~~~

The RX ring is the receiving side of a socket. Each entry in the ring
is a struct xdp_desc descriptor. The descriptor contains the UMEM
offset (addr) and the length of the data (len).

If no frames have been passed to the kernel via the FILL ring, no
descriptors will (or can) appear on the RX ring.

The user application consumes struct xdp_desc descriptors from this
ring.

TX Ring
~~~~~~~

The TX ring is used to send frames. The struct xdp_desc descriptor is
filled in (addr and len) and passed into the ring.

To start the transfer a sendmsg() system call is required. This might
be relaxed in the future.

The user application produces struct xdp_desc descriptors to this
ring.
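
As a small sketch of the descriptor setup described above, a TX
descriptor could be prepared like this before it is written into the
ring. The helper name fill_tx_desc() is hypothetical. struct xdp_desc
and its addr, len and options fields come from linux/if_xdp.h.

.. code-block:: c

   #include <string.h>
   #include <linux/if_xdp.h>

   /* Hypothetical helper: describe a frame of len bytes that starts
    * at offset addr within the UMEM. The options field is left at 0.
    */
   static void fill_tx_desc(struct xdp_desc *desc, __u64 addr, __u32 len)
   {
       memset(desc, 0, sizeof(*desc));
       desc->addr = addr;
       desc->len = len;
   }

The descriptor still has to be placed in the TX ring and the producer
index advanced before the kernel can see it.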

Libbpf
======

Libbpf is a helper library for eBPF and XDP that makes using these
technologies a lot simpler. It also contains specific helper functions
in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It
contains two types of functions: those that can be used to make the
setup of AF_XDP sockets easier and ones that can be used in the data
plane to access the rings safely and quickly. To see an example of how
to use this API, please take a look at the sample application in
samples/bpf/xdpsock_user.c which uses libbpf for both setup and data
plane operations.

We recommend that you use this library unless you have become a power
user. It will make your program a lot simpler.

XSKMAP / BPF_MAP_TYPE_XSKMAP
============================

On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP)
that is used in conjunction with bpf_redirect_map() to pass the ingress
frame to a socket.

The user application inserts the socket into the map, via the bpf()
system call.

Note that if an XDP program tries to redirect to a socket that does
not match the queue configuration and netdev, the frame will be
dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
queue 17. Only the XDP program executing for eth0 and queue 17 will
successfully pass data to the socket. Please refer to the sample
application in samples/bpf/ for an example.

Configuration Flags and Socket Options
======================================

These are the various configuration flags that can be used to control
and monitor the behavior of AF_XDP sockets.

XDP_COPY and XDP_ZEROCOPY bind flags
------------------------------------

When you bind to a socket, the kernel will first try to use zero-copy
mode. If zero-copy is not supported, it will fall back on using copy
mode, i.e. copying all packets out to user space. But if you would
like to force a certain mode, you can use the following flags. If you
pass the XDP_COPY flag to the bind call, the kernel will force the
socket into copy mode. If it cannot use copy mode, the bind call will
fail with an error. Conversely, the XDP_ZEROCOPY flag will force the
socket into zero-copy mode or fail.

XDP_SHARED_UMEM bind flag
-------------------------

This flag enables you to bind multiple sockets to the same UMEM. It
works on the same queue id, between queue ids and between
netdevs/devices. In this mode, each socket has its own RX and TX
rings as usual, but you are going to have one or more FILL and
COMPLETION ring pairs. You have to create one of these pairs per
unique netdev and queue id tuple that you bind to.

We start with the case where we would like to share a UMEM between
sockets bound to the same netdev and queue id. The UMEM (tied to the
first socket created) will only have a single FILL ring and a single
COMPLETION ring as there is only one unique netdev,queue_id tuple that
we have bound to. To use this mode, create the first socket and bind
it in the normal way. Create a second socket and create an RX and a TX
ring, or at least one of them, but no FILL or COMPLETION rings as the
ones from the first socket will be used. In the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
sockets this way.

Which socket will a packet then arrive on? This is decided by the XDP
program. Put all the sockets in the XSKMAP and just indicate which
index in the array you would like to send each packet to. A simple
round-robin example of distributing packets is shown below:

.. code-block:: c

   #include <linux/bpf.h>
   #include "bpf_helpers.h"

   #define MAX_SOCKS 16

   struct {
       __uint(type, BPF_MAP_TYPE_XSKMAP);
       __uint(max_entries, MAX_SOCKS);
       __uint(key_size, sizeof(int));
       __uint(value_size, sizeof(int));
   } xsks_map SEC(".maps");

   static unsigned int rr;

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       rr = (rr + 1) & (MAX_SOCKS - 1);

       return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
   }

Note that since there is only a single set of FILL and COMPLETION
rings, and they are single producer, single consumer rings, you need
to make sure that multiple processes or threads do not use these rings
concurrently. There are no synchronization primitives in the
libbpf code that protect multiple users at this point in time.

Libbpf uses this mode if you create more than one socket tied to the
same UMEM. However, note that you need to supply the
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
xsk_socket__create calls and load your own XDP program as there is no
built in one in libbpf that will route the traffic for you.

The second case is when you share a UMEM between sockets that are
bound to different queue ids and/or netdevs. In this case you have to
create one FILL ring and one COMPLETION ring for each unique
netdev,queue_id pair. Let us say you want to create two sockets bound
to two different queue ids on the same netdev. Create the first socket
and bind it in the normal way. Create a second socket and create an RX
and a TX ring, or at least one of them, and then one FILL and
COMPLETION ring for this socket. Then in the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field as you registered the UMEM on that
socket. These two sockets will now share one and the same UMEM.

There is no need to supply an XDP program like the one in the previous
case where sockets were bound to the same queue id and
device. Instead, use the NIC's packet steering capabilities to steer
the packets to the right queue. In the previous example, there is only
one queue shared among sockets, so the NIC cannot do this steering. It
can only steer between queues.

In libbpf, you need to use the xsk_socket__create_shared() API as it
takes a reference to a FILL ring and a COMPLETION ring that will be
created for you and bound to the shared UMEM. You can use this
function for all the sockets you create, or you can use it for the
second and following ones and use xsk_socket__create() for the first
one. Both methods yield the same result.

Note that a UMEM can be shared between sockets on the same queue id
and device, as well as between queues on the same device and between
devices at the same time.

XDP_USE_NEED_WAKEUP bind flag
-----------------------------

This option adds support for a new flag called need_wakeup that is
present in the FILL ring and the TX ring, the rings for which user
space is a producer. When this option is set in the bind call, the
need_wakeup flag will be set if the kernel needs to be explicitly
woken up by a syscall to continue processing packets. If the flag is
zero, no syscall is needed.

If the flag is set on the FILL ring, the application needs to call
poll() to be able to continue to receive packets on the RX ring. This
can happen, for example, when the kernel has detected that there are no
more buffers on the FILL ring and no buffers left on the RX HW ring of
the NIC. In this case, interrupts are turned off as the NIC cannot
receive any packets (as there are no buffers to put them in), and the
need_wakeup flag is set so that user space can put buffers on the
FILL ring and then call poll() so that the kernel driver can put these
buffers on the HW ring and start to receive packets.

If the flag is set for the TX ring, it means that the application
needs to explicitly notify the kernel to send any packets put on the
TX ring. This can be accomplished either by a poll() call, as in the
RX path, or by calling sendto().

An example of how to use this flag can be found in
samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers
would look like this for the TX path:

.. code-block:: c

   if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
       sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);

I.e., only use the syscall if the flag is set.

We recommend that you always enable this mode as it usually leads to
better performance, especially if you run the application and the
driver on the same core, but also if you use different cores for the
application and the kernel driver, as it reduces the number of
syscalls needed for the TX path.

XDP_{RX\|TX\|UMEM_FILL\|UMEM_COMPLETION}_RING setsockopts
---------------------------------------------------------

These setsockopts set the number of descriptors that the RX, TX,
FILL, and COMPLETION rings respectively should have. It is mandatory
to set the size of at least one of the RX and TX rings. If you set
both, you will be able to both receive and send traffic from your
application, but if you only want to do one of them, you can save
resources by only setting up one of them. Both the FILL ring and the
COMPLETION ring are mandatory as you need to have a UMEM tied to your
socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
first one does not have a UMEM and should in that case not have any
FILL or COMPLETION rings created as the ones from the shared UMEM will
be used. Note that the rings are single-producer single-consumer, so
do not try to access them from multiple processes at the same
time. See the XDP_SHARED_UMEM section.

In libbpf, you can create Rx-only and Tx-only sockets by supplying
NULL to the rx and tx arguments, respectively, to the
xsk_socket__create function.

If you create a Tx-only socket, we recommend that you do not put any
packets on the fill ring. If you do this, drivers might think you are
going to receive something when you in fact will not, and this can
negatively impact performance.

XDP_UMEM_REG setsockopt
-----------------------

This setsockopt registers a UMEM to a socket. This is the area that
contains all the buffers that packets can reside in. The call takes a
pointer to the beginning of this area and the size of it. Moreover, it
also has a parameter called chunk_size that is the size that the UMEM
is divided into. It can only be 2K or 4K at the moment. If you have a
UMEM area that is 128K and a chunk size of 2K, this means that you
will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
area and that your largest packet size can be 2K.
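
The registration parameters and the chunk arithmetic above can be
sketched like this. The helper names fill_umem_reg() and
umem_num_chunks() are hypothetical. struct xdp_umem_reg and its addr,
len, chunk_size and headroom fields come from linux/if_xdp.h.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>
   #include <linux/if_xdp.h>

   /* Hypothetical helper: describe a UMEM area for XDP_UMEM_REG. */
   static void fill_umem_reg(struct xdp_umem_reg *mr, void *area,
                             __u64 size, __u32 chunk_size, __u32 headroom)
   {
       memset(mr, 0, sizeof(*mr));
       mr->addr = (__u64)(uintptr_t)area;
       mr->len = size;
       mr->chunk_size = chunk_size;
       mr->headroom = headroom;
   }

   /* Maximum number of packets the UMEM can hold at once. */
   static __u64 umem_num_chunks(const struct xdp_umem_reg *mr)
   {
       return mr->len / mr->chunk_size;
   }

The filled-in structure would then be handed to setsockopt() with the
XDP_UMEM_REG option on the AF_XDP socket. For a 128K area and 2K
chunks, umem_num_chunks() gives 64.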

There is also an option to set the headroom of each single buffer in
the UMEM. If you set this to N bytes, it means that the packet will
start N bytes into the buffer leaving the first N bytes for the
application to use. The final option is the flags field, but it will
be dealt with in separate sections for each UMEM flag.

XDP_STATISTICS getsockopt
-------------------------

Gets drop statistics of a socket that can be useful for debug
purposes. The supported statistics are shown below:

.. code-block:: c

   struct xdp_statistics {
       __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
       __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
       __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
   };

XDP_OPTIONS getsockopt
----------------------

Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.

Usage
=====

In order to use AF_XDP sockets, two parts are needed: the
user-space application and the XDP program. For a complete setup and
usage example, please refer to the sample application. The user-space
side is xdpsock_user.c and the XDP side is part of libbpf.

The XDP code sample included in tools/lib/bpf/xsk.c is the following:

.. code-block:: c

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       int index = ctx->rx_queue_index;

       // A set entry here means that the corresponding queue_id
       // has an active AF_XDP socket bound to it.
       if (bpf_map_lookup_elem(&xsks_map, &index))
           return bpf_redirect_map(&xsks_map, index, 0);

       return XDP_PASS;
   }

A simple but not so performant ring dequeue and enqueue could look
like this:

.. code-block:: c

   // struct xdp_rxtx_ring {
   //     __u32 *producer;
   //     __u32 *consumer;
   //     struct xdp_desc *desc;
   // };

   // struct xdp_umem_ring {
   //     __u32 *producer;
   //     __u32 *consumer;
   //     __u64 *desc;
   // };

   // typedef struct xdp_rxtx_ring RING;
   // typedef struct xdp_umem_ring RING;

   // typedef struct xdp_desc RING_TYPE;
   // typedef __u64 RING_TYPE;

   int dequeue_one(RING *ring, RING_TYPE *item)
   {
       // Number of entries produced but not yet consumed.
       __u32 entries = *ring->producer - *ring->consumer;

       if (entries == 0)
           return -1;

       // read-barrier!

       *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
       (*ring->consumer)++;
       return 0;
   }

   int enqueue_one(RING *ring, const RING_TYPE *item)
   {
       // Number of slots still available to the producer.
       __u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);

       if (free_entries == 0)
           return -1;

       ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;

       // write-barrier!

       (*ring->producer)++;
       return 0;
   }

Magnus Karlsson	e0e4f8e	2019-10-21 10:57:04 +0200	[diff] [blame]	532	But please use the libbpf functions, as they are optimized and ready to
				533	use. They will make your life easier.
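
To make the two barriers in the pseudo-code above concrete, here is a
minimal user-space sketch of the same single-producer/single-consumer
ring. It is an illustration only: the names ``demo_ring``,
``demo_enqueue_one`` and ``demo_dequeue_one`` are hypothetical, the use
of C11 atomics with acquire/release ordering is an assumption of this
sketch rather than what libbpf does internally, and the ring memory here
is ordinary memory instead of the area shared with the kernel.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u	/* must be a power of two */

/* Hypothetical stand-in for a UMEM ring. In a real application the
 * producer, consumer and desc pointers refer to the mmap()ed ring area. */
struct demo_ring {
	_Atomic uint32_t *producer;
	_Atomic uint32_t *consumer;
	uint64_t *desc;
};

static int demo_enqueue_one(struct demo_ring *ring, uint64_t item)
{
	uint32_t prod = atomic_load_explicit(ring->producer,
					     memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(ring->consumer,
					     memory_order_acquire);

	if (RING_SIZE - (prod - cons) == 0)
		return -1;	/* ring is full */

	ring->desc[prod & (RING_SIZE - 1)] = item;

	/* write-barrier: the entry becomes visible before the new index */
	atomic_store_explicit(ring->producer, prod + 1, memory_order_release);
	return 0;
}

static int demo_dequeue_one(struct demo_ring *ring, uint64_t *item)
{
	uint32_t cons = atomic_load_explicit(ring->consumer,
					     memory_order_relaxed);
	/* read-barrier: the entry is read only after the producer index */
	uint32_t prod = atomic_load_explicit(ring->producer,
					     memory_order_acquire);

	if (prod - cons == 0)
		return -1;	/* ring is empty */

	*item = ring->desc[cons & (RING_SIZE - 1)];

	/* hand the slot back to the producer */
	atomic_store_explicit(ring->consumer, cons + 1, memory_order_release);
	return 0;
}
```

Because the indices are free-running 32-bit counters that are only
masked when indexing ``desc``, the full and empty checks keep working
when the counters wrap around.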
Magnus Karlsson	b4b8faa	2018-05-02 13:01:36 +0200	[diff] [blame]	534
				535	Sample application
				536	==================
				537
				538	There is an xdpsock benchmarking/test application included that
Magnus Karlsson	e0e4f8e	2019-10-21 10:57:04 +0200	[diff] [blame]	539	demonstrates how to use AF_XDP sockets with private UMEMs. Say that
				540	you would like your UDP traffic from port 4242 to end up in queue 16,
				541	on which we will enable AF_XDP. Here, we use ethtool for this::
Magnus Karlsson	b4b8faa	2018-05-02 13:01:36 +0200	[diff] [blame]	542
				543	ethtool -N p3p2 rx-flow-hash udp4 fn
				544	ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
				545	action 16
				546
				547	Running the rxdrop benchmark in XDP_DRV mode can then be done
				548	using::
				549
				550	samples/bpf/xdpsock -i p3p2 -q 16 -r -N
				551
				552	For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
				553	can be displayed with "-h", as usual.
				554
Magnus Karlsson	e0e4f8e	2019-10-21 10:57:04 +0200	[diff] [blame]	555	This sample application uses libbpf to make the setup and usage of
				556	AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is
				557	really used to make something more advanced, take a look at the libbpf
				558	code in tools/lib/bpf/xsk.[ch].
				559
Magnus Karlsson	0f4a9b7	2019-02-21 10:21:28 +0100	[diff] [blame]	560	FAQ
				561	=======
				562
				563	Q: I am not seeing any traffic on the socket. What am I doing wrong?
				564
				565	A: When a netdev of a physical NIC is initialized, Linux usually
Magnus Karlsson	e0e4f8e	2019-10-21 10:57:04 +0200	[diff] [blame]	566	allocates one RX and TX queue pair per core. So on an 8-core system,
Magnus Karlsson	0f4a9b7	2019-02-21 10:21:28 +0100	[diff] [blame]	567	queue ids 0 to 7 will be allocated, one per core. In the AF_XDP
				568	bind call or the xsk_socket__create libbpf function call, you
				569	specify a specific queue id to bind to and it is only the traffic
				570	towards that queue you are going to get on your socket. So in the
				571	example above, if you bind to queue 0, you are NOT going to get any
				572	traffic that is distributed to queues 1 through 7. If you are
				573	lucky, you will see the traffic, but usually it will end up on one
				574	of the queues you have not bound to.
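
As a rough illustration of where the queue id enters the picture, the
sketch below fills in the ``struct sockaddr_xdp`` that is handed to
bind() on an XSK. The helper name ``demo_fill_xdp_addr`` is
hypothetical, and the fallback definition of ``AF_XDP`` is an assumption
made for older C libraries that do not define it; socket creation, UMEM
registration and the bind() call itself are left out.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44	/* assumed fallback for C libraries lacking it */
#endif

/* Hypothetical helper: prepare the address used when binding an XSK. */
static void demo_fill_xdp_addr(struct sockaddr_xdp *sxdp,
			       unsigned int ifindex, unsigned int queue_id)
{
	memset(sxdp, 0, sizeof(*sxdp));
	sxdp->sxdp_family = AF_XDP;
	sxdp->sxdp_ifindex = ifindex;	/* netdev to bind to */
	sxdp->sxdp_queue_id = queue_id;	/* only this queue is received */
}
```

A socket bound with ``queue_id`` 0 therefore only sees packets that the
NIC places on queue 0, regardless of how many other queues exist.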
				575
				576	There are a number of ways to solve the problem of getting the
				577	traffic you want to the queue id you bound to. If you want to see
				578	all the traffic, you can force the netdev to only have 1 queue, queue
				579	id 0, and then bind to queue 0. You can use ethtool to do this::
				580
Randy Dunlap	221fb72	2019-05-20 14:22:25 -0700	[diff] [blame]	581	sudo ethtool -L <interface> combined 1
Magnus Karlsson	0f4a9b7	2019-02-21 10:21:28 +0100	[diff] [blame]	582
				583	If you want to only see part of the traffic, you can program the
				584	NIC through ethtool to steer your traffic to a single queue id
				585	that you can bind your XDP socket to. Here is one example in which
				586	UDP traffic to and from port 4242 is sent to queue 2::
				587
Randy Dunlap	221fb72	2019-05-20 14:22:25 -0700	[diff] [blame]	588	sudo ethtool -N <interface> rx-flow-hash udp4 fn
				589	sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
				590	4242 action 2
Magnus Karlsson	0f4a9b7	2019-02-21 10:21:28 +0100	[diff] [blame]	591
Magnus Karlsson	e0e4f8e	2019-10-21 10:57:04 +0200	[diff] [blame]	592	A number of other ways are possible, all depending on the capabilities
Magnus Karlsson	0f4a9b7	2019-02-21 10:21:28 +0100	[diff] [blame]	593	of the NIC you have.
				594
Magnus Karlsson	e0e4f8e	2019-10-21 10:57:04 +0200	[diff] [blame]	595	Q: Can I use the XSKMAP to implement a switch between different umems
				596	in copy mode?
				597
				598	A: The short answer is no, that is not supported at the moment. The
				599	XSKMAP can only be used to switch traffic coming in on queue id X
				600	to sockets bound to the same queue id X. The XSKMAP can contain
				601	sockets bound to different queue ids, for example X and Y, but only
				602	traffic coming in from queue id Y can be directed to sockets bound
				603	to the same queue id Y. In zero-copy mode, you should use the
				604	switch, or other distribution mechanism, in your NIC to direct
				605	traffic to the correct queue id and socket.
				606
Magnus Karlsson	acabf32	2020-08-28 10:26:29 +0200	[diff] [blame]	607	Q: My packets are sometimes corrupted. What is wrong?
				608
				609	A: Care has to be taken not to feed the same buffer in the UMEM into
				610	more than one ring at the same time. If you for example feed the
				611	same buffer into the FILL ring and the TX ring at the same time, the
				612	NIC might receive data into the buffer at the same time it is
				613	sending it. This will cause some packets to become corrupted. The same
				614	applies to feeding the same buffer into the FILL rings
				615	belonging to different queue ids or netdevs bound with the
				616	XDP_SHARED_UMEM flag.
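
One way to avoid this class of bug is to track buffer ownership
explicitly in the application. The sketch below is a hypothetical free
list of UMEM frame addresses (``demo_frame_pool`` and its helpers are
illustrative names, and the frame count and size are arbitrary): a frame
is either in the pool or posted to exactly one ring, so the same address
cannot reach the FILL ring and the TX ring at the same time.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_FRAMES 4u
#define DEMO_FRAME_SIZE 2048u

/* Hypothetical free list of UMEM frame addresses owned by user space. */
struct demo_frame_pool {
	uint64_t free_addr[DEMO_NUM_FRAMES];
	size_t num_free;
};

static void demo_pool_init(struct demo_frame_pool *pool)
{
	for (size_t i = 0; i < DEMO_NUM_FRAMES; i++)
		pool->free_addr[i] = (uint64_t)i * DEMO_FRAME_SIZE;
	pool->num_free = DEMO_NUM_FRAMES;
}

/* Take a frame before posting it to the FILL ring or the TX ring.
 * Returns -1 when every frame is already in flight. */
static int demo_pool_get(struct demo_frame_pool *pool, uint64_t *addr)
{
	if (pool->num_free == 0)
		return -1;
	*addr = pool->free_addr[--pool->num_free];
	return 0;
}

/* Return a frame once it has come back on the RX or COMPLETION ring
 * and the application is done with its contents. */
static int demo_pool_put(struct demo_frame_pool *pool, uint64_t addr)
{
	if (pool->num_free == DEMO_NUM_FRAMES)
		return -1;	/* more puts than gets: bookkeeping bug */
	pool->free_addr[pool->num_free++] = addr;
	return 0;
}
```

With the XDP_SHARED_UMEM flag, the same idea applies across sockets: a
single pool, protected by whatever locking the application needs, would
have to serve all FILL rings that share the UMEM.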
				617
Magnus Karlsson	b4b8faa	2018-05-02 13:01:36 +0200	[diff] [blame]	618	Credits
				619	=======
				620
				621	- Björn Töpel (AF_XDP core)
				622	- Magnus Karlsson (AF_XDP core)
				623	- Alexander Duyck
				624	- Alexei Starovoitov
				625	- Daniel Borkmann
				626	- Jesper Dangaard Brouer
				627	- John Fastabend
				628	- Jonathan Corbet (LWN coverage)
				629	- Michael S. Tsirkin
				630	- Qi Z Zhang
				631	- Willem de Bruijn