.. SPDX-License-Identifier: GPL-2.0
				2
				3	===========
				4	Packet MMAP
				5	===========
				6
				7	Abstract
				8	========
				9
This file documents the mmap() facility available with the PACKET
socket interface. This type of socket is used to

i) capture network traffic with utilities like tcpdump,
ii) transmit network traffic, or for any other purpose that needs raw
    access to a network interface.
				16
				17	Howto can be found at:
				18
				19	https://sites.google.com/site/packetmmap/
				20
				21	Please send your comments to
				22	- Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
				23	- Johann Baudy
				24
				25	Why use PACKET_MMAP
				26	===================
				27
A capture process that does not use PACKET_MMAP (plain AF_PACKET) is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet; it requires two if you want to get the packet's timestamp
(as libpcap always does).
				32
PACKET_MMAP, on the other hand, is very efficient. It provides a
size-configurable circular buffer mapped in user space that can be used to
either send or receive packets. Reading packets then only requires waiting for
them; most of the time there is no need to issue a single system call. For
transmission, multiple packets can be sent through one system call to get the
highest bandwidth. Sharing a buffer between the kernel and the user
also has the benefit of minimizing packet copies.
				40
It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At least, if you are capturing
at high speeds (this is relative to the CPU speed), you should check whether the
device driver of your network interface card supports some sort of interrupt
load mitigation or (even better) NAPI, and make sure it is
enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
supported by the devices of your network. Pinning the IRQs of your network
interface card to a CPU can also be an advantage.
				49
				50	How to use mmap() to improve capture process
				51	============================================
				52
				53	From the user standpoint, you should use the higher level libpcap library, which
				54	is a de facto standard, portable across nearly all operating systems
				55	including Win32.
				56
Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
TPACKET_V3 support was added in version 1.5.0.
				59
				60	How to use mmap() directly to improve capture process
				61	=====================================================
				62
From the system call standpoint, the use of PACKET_MMAP involves
the following process::


    [setup]     socket() -------> creation of the capture socket
                setsockopt() ---> allocation of the circular buffer (ring)
                                  option: PACKET_RX_RING
                mmap() ---------> mapping of the allocated buffer to the
                                  user process

    [capture]   poll() ---------> to wait for incoming packets

    [shutdown]  close() --------> destruction of the capture socket and
                                  deallocation of all associated
                                  resources.
				78
				79
Socket creation and destruction are straightforward, and are done
the same way with or without PACKET_MMAP::

    int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface, where link-level
information can be captured, or SOCK_DGRAM for the cooked
interface, where link-level information capture is not
supported and a link-level pseudo-header is provided
by the kernel.
				90
				91	The destruction of the socket and all associated resources
				92	is done by a simple call to close(fd).
				93
As without PACKET_MMAP, it is possible to use one socket
for capture and transmission. This can be done by mapping the
allocated RX and TX buffer rings with a single mmap() call.
See "Mapping and use of the circular buffer (ring)".

The following sections describe the PACKET_MMAP settings and their constraints,
as well as the mapping of the circular buffer in the user process and
the use of this buffer.
				102
				103	How to use mmap() directly to improve transmission process
				104	==========================================================
The transmission process is similar to capture, as shown below::

    [setup]         socket() -------> creation of the transmission socket
                    setsockopt() ---> allocation of the circular buffer (ring)
                                      option: PACKET_TX_RING
                    bind() ---------> bind transmission socket with a network interface
                    mmap() ---------> mapping of the allocated buffer to the
                                      user process

    [transmission]  poll() ---------> wait for free packets (optional)
                    send() ---------> send all packets that are set as ready in
                                      the ring
                                      The flag MSG_DONTWAIT can be used to return
                                      before end of transfer.

    [shutdown]      close() --------> destruction of the transmission socket and
                                      deallocation of all associated resources.
				122
Socket creation and destruction are also straightforward, and are done
the same way as for capture, described in the previous section::

    int fd = socket(PF_PACKET, mode, 0);

The protocol can optionally be 0 if we only want to transmit
via this socket, which avoids an expensive call to packet_rcv().
In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
set. Otherwise, use htons(ETH_P_ALL) or any other protocol, for example.
				132
				133	Binding the socket to your network interface is mandatory (with zero copy) to
				134	know the header size of frames used in the circular buffer.
				135
As with capture, each frame contains two parts::

     --------------------
    | struct tpacket_hdr | Header. It contains the status
    |                    | of this frame
    |--------------------|
    | data buffer        |
    .                    .  Data that will be sent over the network interface.
    .                    .
     --------------------

bind() associates the socket with your network interface through the
sll_ifindex field of struct sockaddr_ll.
				149
				150	Initialization example::
				151
    struct sockaddr_ll my_addr;
    struct ifreq s_ifr;
    ...

    /* copy the interface name; snprintf() always NUL-terminates */
    snprintf(s_ifr.ifr_name, sizeof(s_ifr.ifr_name), "%s", "eth0");

    /* get interface index of eth0 */
    ioctl(this->socket, SIOCGIFINDEX, &s_ifr);

    /* fill sockaddr_ll struct to prepare binding */
    my_addr.sll_family = AF_PACKET;
    my_addr.sll_protocol = htons(ETH_P_ALL);
    my_addr.sll_ifindex = s_ifr.ifr_ifindex;

    /* bind socket to eth0 */
    bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
				168
				169	A complete tutorial is available at: https://sites.google.com/site/packetmmap/
				170
				171	By default, the user should put data at::
				172
				173	frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
				174
				175	So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
				176	the beginning of the user data will be at::
				177
				178	frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
				179
				180	If you wish to put user data at a custom offset from the beginning of
				181	the frame (for payload alignment with SOCK_RAW mode for instance) you
				182	can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
				183	to make this work it must be enabled previously with setsockopt()
				184	and the PACKET_TX_HAS_OFF option.
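
The two expressions for the default data position given in this section can be
checked against each other. The following is a minimal sketch, assuming the
TPACKET_V1 header (struct tpacket_hdr) and the macros exported by
<linux/if_packet.h>; the helper names tx_data_offset() and tx_data_ptr() are
hypothetical and only serve the illustration.

```c
#include <stddef.h>
#include <linux/if_packet.h>

/* Default offset of the user data inside a TX frame:
 * TPACKET_HDRLEN - sizeof(struct sockaddr_ll). */
static size_t tx_data_offset(void)
{
    return TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
}

/* Pointer to the start of the user data in the frame at frame_base
 * (hypothetical helper, not part of any kernel or libc API). */
static void *tx_data_ptr(void *frame_base)
{
    return (char *)frame_base + tx_data_offset();
}
```

Because TPACKET_HDRLEN is the aligned header size plus
sizeof(struct sockaddr_ll), the offset returned here should equal
TPACKET_ALIGN(sizeof(struct tpacket_hdr)).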
				185
				186	PACKET_MMAP settings
				187	====================
				188
Setting up PACKET_MMAP from user-level code is done with a call like:

- Capture process::

      setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))

- Transmission process::

      setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))

The most significant argument in the previous call is the req parameter,
which must have the following structure::

    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };
				209
This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory.
Mapping it into the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.
				214
				215	Frames are grouped in blocks. Each block is a physically contiguous
				216	region of memory and holds tp_block_size/tp_frame_size frames. The total number
				217	of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because::
				218
				219	frames_per_block = tp_block_size/tp_frame_size
				220
				221	indeed, packet_set_ring checks that the following condition is true::
				222
				223	frames_per_block * tp_block_nr == tp_frame_nr
				224
Let's see an example, with the following values::

    tp_block_size= 4096
    tp_frame_size= 2048
    tp_block_nr  = 4
    tp_frame_nr  = 8

we will get the following buffer structure::

    block #1                 block #2
    +---------+---------+    +---------+---------+
    | frame 1 | frame 2 |    | frame 3 | frame 4 |
    +---------+---------+    +---------+---------+

    block #3                 block #4
    +---------+---------+    +---------+---------+
    | frame 5 | frame 6 |    | frame 7 | frame 8 |
    +---------+---------+    +---------+---------+
				243
A frame can be of any size, on the only condition that it fits in a block. A block
can only hold an integer number of frames, or in other words, a frame cannot
span two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".
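
Since tp_frame_nr is redundant, it can be derived from the other three fields
rather than typed by hand. The following is a minimal sketch of that
derivation; struct ring_req is a local stand-in for the four fields of struct
tpacket_req quoted above, and ring_req_init() is a hypothetical helper, not
part of the kernel interface.

```c
/* Local mirror of the four struct tpacket_req fields (illustration only;
 * real code would use the definition from <linux/if_packet.h>). */
struct ring_req {
    unsigned int tp_block_size;
    unsigned int tp_block_nr;
    unsigned int tp_frame_size;
    unsigned int tp_frame_nr;
};

/* Fill req so that frames_per_block * tp_block_nr == tp_frame_nr.
 * Returns 0 on success, -1 if not even one frame fits in a block. */
static int ring_req_init(struct ring_req *req, unsigned int block_size,
                         unsigned int block_nr, unsigned int frame_size)
{
    unsigned int frames_per_block;

    if (frame_size == 0 || frame_size > block_size)
        return -1;

    frames_per_block = block_size / frame_size;

    req->tp_block_size = block_size;
    req->tp_block_nr = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr = frames_per_block * block_nr;
    return 0;
}
```

With the values of the example above (4096, 4, 2048) the helper yields a
tp_frame_nr of 8. The sketch does not guard against overflow of the final
multiplication.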
				249
				250	PACKET_MMAP setting constraints
				251	===============================
				252
				253	In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
				254	the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
16384 in a 64 bit architecture.

				257	Block size limit
				258	----------------
				259
				260	As stated earlier, each block is a contiguous physical region of memory. These
				261	memory regions are allocated with calls to the __get_free_pages() function. As
				262	the name indicates, this function allocates pages of memory, and the second
				263	argument is "order" or a power of two number of pages, that is
				264	(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
				265	order=2 ==> 16384 bytes, etc. The maximum size of a
				266	region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
				267	precisely the limit can be calculated as::
				268
				269	PAGE_SIZE << MAX_ORDER
				270
    In an i386 architecture PAGE_SIZE is 4096 bytes
    In a 2.4/i386 kernel MAX_ORDER is 10
    In a 2.6/i386 kernel MAX_ORDER is 11

So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.

The page size can also be determined dynamically with the getpagesize(2)
system call.
				283
				284	Block number limit
				285	------------------
				286
				287	To understand the constraints of PACKET_MMAP, we have to see the structure
				288	used to hold the pointers to each block.
				289
Currently, this structure is a vector called pg_vec, dynamically allocated with
kmalloc; its size limits the number of blocks that can be allocated::

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1
				302
kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator, which is ultimately responsible for doing the allocation and
hence imposes the maximum memory that kmalloc can allocate.

In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo.
				311
				312	In a 32 bit architecture, pointers are 4 bytes long, so the total number of
				313	pointers to blocks is::
				314
				315	131072/4 = 32768 blocks
				316
				317	PACKET_MMAP buffer size calculator
				318	==================================
				319
				320	Definitions:
				321
============== ================================================================
<size-max>     is the maximum size allocable with kmalloc
               (see /proc/slabinfo)
<pointer size> depends on the architecture -- ``sizeof(void *)``
<page size>    depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order>    is the value defined with MAX_ORDER
<frame size>   it's an upper bound of frame's capture size (more on this later)
============== ================================================================
				330
from these definitions we will derive::

    <block number> = <size-max>/<pointer size>
    <block size> = <pagesize> << <max-order>

so, the max buffer size is::

    <block number> * <block size>

and the number of frames is::

    <block number> * <block size> / <frame size>
				343
Suppose the following parameters, which apply for a 2.6 kernel and an
i386 architecture::

    <size-max> = 131072 bytes
    <pointer size> = 4 bytes
    <pagesize> = 4096 bytes
    <max-order> = 11

and a value for <frame size> of 2048 bytes. These parameters will yield::

    <block number> = 131072/4 = 32768 blocks
    <block size> = 4096 << 11 = 8 MiB.

and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames.

Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space; in the case of
an i386 kernel, kernel memory is limited to 1GiB.
				363
Memory allocations are not freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority; this basically means that
the allocation can wait and swap other processes' memory in order to allocate
the necessary memory, so normally the limits can be reached.
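
The calculator above can be written out as a small helper. This is only a
sketch of the arithmetic in this section: struct ring_limits and
ring_limits_calc() are hypothetical names, and the inputs are the same
<size-max>, <pointer size>, <pagesize>, <max-order> and <frame size> values
defined above, supplied by the caller.

```c
#include <stdint.h>

struct ring_limits {
    uint64_t block_number;  /* <size-max> / <pointer size> */
    uint64_t block_size;    /* <pagesize> << <max-order> */
    uint64_t buffer_bytes;  /* <block number> * <block size> */
    uint64_t frames;        /* buffer_bytes / <frame size> */
};

/* Theoretical upper bounds only; as noted above, the result may exceed
 * what the kernel can actually allocate. Divisors must be non-zero. */
static struct ring_limits ring_limits_calc(uint64_t size_max,
                                           uint64_t pointer_size,
                                           uint64_t page_size,
                                           unsigned int max_order,
                                           uint64_t frame_size)
{
    struct ring_limits l;

    l.block_number = size_max / pointer_size;
    l.block_size = page_size << max_order;
    l.buffer_bytes = l.block_number * l.block_size;
    l.frames = l.buffer_bytes / frame_size;
    return l;
}
```

Calling ring_limits_calc(131072, 4, 4096, 11, 2048) reproduces the figures of
the worked example: 32768 blocks of 8 MiB each and 134217728 frames.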
				368
				369	Other constraints
				370	-----------------
				371
If you check the source code you will see that what is drawn here as a frame
is not only the link-level frame. At the beginning of each frame there is a
header called struct tpacket_hdr, used in PACKET_MMAP to hold the link-level
frame's meta information, such as a timestamp. So what is drawn here as a frame
is really the following (from include/linux/if_packet.h)::

    /*
      Frame structure:

      - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
      - struct tpacket_hdr
      - pad to TPACKET_ALIGNMENT=16
      - struct sockaddr_ll
      - Gap, chosen so that packet data (Start+tp_net) aligns to
        TPACKET_ALIGNMENT=16
      - Start+tp_mac: [ Optional MAC header ]
      - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
      - Pad to align to TPACKET_ALIGNMENT=16
    */

The following are conditions that are checked in packet_set_ring:
				393
				394	- tp_block_size must be a multiple of PAGE_SIZE (1)
				395	- tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
				396	- tp_frame_size must be a multiple of TPACKET_ALIGNMENT
				397	- tp_frame_nr must be exactly frames_per_block*tp_block_nr
				398
				399	Note that tp_block_size should be chosen to be a power of two or there will
				400	be a waste of memory.
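
The conditions listed above can be checked in user space before calling
setsockopt(). The following is a sketch of such a pre-check and not the
kernel's own code: ring_req_valid() is a hypothetical helper, the alignment
value 16 is taken from the frame layout quoted above, and the page size and
header length are passed in by the caller (for instance from getpagesize(2)
and TPACKET_HDRLEN).

```c
#include <stdbool.h>

#define RING_ALIGNMENT 16u  /* TPACKET_ALIGNMENT, as quoted above */

/* Returns true if the four ring parameters satisfy the listed conditions. */
static bool ring_req_valid(unsigned int block_size, unsigned int block_nr,
                           unsigned int frame_size, unsigned int frame_nr,
                           unsigned int page_size, unsigned int hdrlen)
{
    if (page_size == 0 || block_size == 0 || block_size % page_size != 0)
        return false;   /* block size must be a multiple of the page size */
    if (frame_size <= hdrlen)
        return false;   /* frame must be larger than the header */
    if (frame_size % RING_ALIGNMENT != 0)
        return false;   /* frame size must be a multiple of the alignment */
    if (frame_size > block_size)
        return false;   /* at least one frame must fit in a block */

    /* tp_frame_nr must be exactly frames_per_block * tp_block_nr */
    return (block_size / frame_size) * block_nr == frame_nr;
}
```

A failed pre-check here only mirrors the conditions listed in this section;
the kernel may reject a request for other reasons as well.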
				401
				402	Mapping and use of the circular buffer (ring)
				403	---------------------------------------------
				404
The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several physically
discontiguous blocks of memory, they are contiguous in user space, hence
just one call to mmap is needed::

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If tp_frame_size is a divisor of tp_block_size, frames will be
contiguously spaced by tp_frame_size bytes. If not, after every
tp_block_size/tp_frame_size frames there will be a gap between
the frames. This is because a frame cannot span two
blocks.
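
The position of a frame inside the mapping, including the gap at the end of
each block, can be computed from the frame index. A minimal sketch, assuming
frames are numbered from 0 and that tp_frame_size is non-zero and not larger
than tp_block_size; frame_offset() is a hypothetical helper name.

```c
#include <stddef.h>

/* Byte offset of frame `index` from the start of the mapped ring. */
static size_t frame_offset(size_t index, size_t block_size, size_t frame_size)
{
    size_t frames_per_block = block_size / frame_size;
    size_t block = index / frames_per_block;  /* which block holds the frame */
    size_t slot = index % frames_per_block;   /* position inside that block */

    return block * block_size + slot * frame_size;
}
```

For example, with a block size of 4096 and a frame size of 1536 only two
frames fit in a block, so frame 2 starts at offset 4096 rather than 3072.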
				417
To use one socket for capture and transmission, the mapping of both the
RX and TX buffer rings has to be done with one call to mmap::

    ...
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
    setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
    ...
    rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    tx_ring = rx_ring + size;

RX must be the first as the kernel maps the TX ring memory right
after the RX one.
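
The snippet above uses the same size for both rings. When the two rings
differ, the same layout rule still gives the TX start address. The sketch
below assumes that each ring occupies tp_block_size * tp_block_nr bytes of the
mapping; ring_bytes() and tx_ring_start() are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes occupied by one ring (assumed: tp_block_size * tp_block_nr). */
static size_t ring_bytes(unsigned int block_size, unsigned int block_nr)
{
    return (size_t)block_size * block_nr;
}

/* Start of the TX ring inside a mapping that begins with the RX ring. */
static uint8_t *tx_ring_start(uint8_t *map, size_t rx_bytes)
{
    return map + rx_bytes;
}
```

Under this assumption the length passed to mmap would be the sum of
ring_bytes() for the RX and the TX request.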
				430
At the beginning of each frame there is a status field (see
struct tpacket_hdr). If this field is 0, the frame is ready
to be used by the kernel. If not, there is a frame the user can read
and the following flags apply:
				435
				436	Capture process
				437	^^^^^^^^^^^^^^^
				438
From include/linux/if_packet.h::

    #define TP_STATUS_COPY          (1 << 1)
    #define TP_STATUS_LOSING        (1 << 2)
    #define TP_STATUS_CSUMNOTREADY  (1 << 3)
    #define TP_STATUS_CSUM_VALID    (1 << 7)
				445
====================== =======================================================
TP_STATUS_COPY         This flag indicates that the frame (and associated
                       meta information) has been truncated because it's
                       larger than tp_frame_size. This packet can be
                       read entirely with recvfrom().

                       In order to make this work it must be
                       enabled previously with setsockopt() and
                       the PACKET_COPY_THRESH option.

                       The number of frames that can be buffered to
                       be read with recvfrom is limited like a normal socket.
                       See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING       indicates there were packet drops from last time
                       statistics were checked with getsockopt() and
                       the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY currently it's used for outgoing IP packets whose
                       checksum will be computed in hardware. So while
                       reading the packet we should not try to check the
                       checksum.

TP_STATUS_CSUM_VALID   This flag indicates that at least the transport
                       header checksum of the packet has been already
                       validated on the kernel side. If the flag is not set
                       then we are free to check the checksum by ourselves
                       provided that TP_STATUS_CSUMNOTREADY is also not set.
====================== =======================================================
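
The rule relating the two checksum flags can be expressed as a single test on
the status word. A minimal sketch, with the flag values copied from the
defines above under local names so that it stands alone;
should_verify_checksum() is a hypothetical helper.

```c
#include <stdbool.h>

/* Same values as TP_STATUS_CSUMNOTREADY and TP_STATUS_CSUM_VALID above. */
#define ST_CSUMNOTREADY (1UL << 3)
#define ST_CSUM_VALID   (1UL << 7)

/* True when user space may verify the checksum itself: the kernel has not
 * validated it already, and it is not still pending in hardware. */
static bool should_verify_checksum(unsigned long tp_status)
{
    return !(tp_status & (ST_CSUM_VALID | ST_CSUMNOTREADY));
}
```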
				475
For convenience there are also the following defines::

    #define TP_STATUS_KERNEL        0
    #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet, it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet;
once the packet is read, the user must zero the status field so that the kernel
can use that frame buffer again.

The user can use poll (any other variant should apply too) to check if new
packets are in the ring::

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

Checking the status value first and then polling for frames does not
introduce a race condition.
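
The hand-off of frames between kernel and user can be sketched as a loop over
the ring. This is an illustration under simplifying assumptions: frames are
taken to be contiguous (tp_frame_size divides tp_block_size), the status word
is read as the unsigned long that opens each frame, memory ordering is
ignored, and drain_ring() is a hypothetical helper. It runs on any buffer laid
out this way, not only on a real mapped ring.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_STATUS_KERNEL 0UL  /* same value as TP_STATUS_KERNEL */
#define RING_STATUS_USER   1UL  /* same value as TP_STATUS_USER */

/* Consume consecutive user-owned frames starting at *next and return how
 * many were handed back to the kernel. *next is advanced with wrap-around. */
static unsigned int drain_ring(void *ring, unsigned int frame_nr,
                               size_t frame_size, unsigned int *next)
{
    unsigned int handled = 0;

    while (handled < frame_nr) {
        volatile unsigned long *status = (volatile unsigned long *)
            ((uint8_t *)ring + (size_t)*next * frame_size);

        if (!(*status & RING_STATUS_USER))
            break;  /* frame still belongs to the kernel */

        /* ... a real program would process the packet here ... */

        *status = RING_STATUS_KERNEL;  /* give the frame back */
        *next = (*next + 1) % frame_nr;
        handled++;
    }
    return handled;
}
```

When drain_ring() returns 0, the frame at *next still has status
TP_STATUS_KERNEL, which is the situation in which the poll() call shown above
would be issued.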
				501
				502	Transmission process
				503	^^^^^^^^^^^^^^^^^^^^
				504
These defines are also used for transmission::

    #define TP_STATUS_AVAILABLE        0 // Frame is available
    #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
    #define TP_STATUS_SENDING          2 // Frame is currently in transmission
    #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct
				511
				512	First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
				513	packet, the user fills a data buffer of an available frame, sets tp_len to
				514	current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
				515	This can be done on multiple frames. Once the user is ready to transmit, it
				516	calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
				517	forwarded to the network device. The kernel updates each status of sent
				518	frames with TP_STATUS_SENDING until the end of transfer.
				519
				520	At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
				521
::

    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(this->socket, NULL, 0, 0);

The user can also use poll() to check if a buffer is available:

(status == TP_STATUS_SENDING)

::

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLOUT;
    retval = poll(&pfd, 1, timeout);
				539
				540	What TPACKET versions are available and when to use them?
				541	=========================================================
				542
				543	::
				544
    int val = tpacket_version;
    socklen_t len = sizeof(val);

    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
    getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);
				548
				549	where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
				550
TPACKET_V1:
    - Default if not otherwise specified by setsockopt(2)
    - RX_RING, TX_RING available

TPACKET_V1 --> TPACKET_V2:
    - Made 64 bit clean due to unsigned long usage in TPACKET_V1
      structures, thus this also works on 64 bit kernel with 32 bit
      userspace and the like
    - Timestamp resolution in nanoseconds instead of microseconds
    - RX_RING, TX_RING available
    - VLAN metadata information available for packets
      (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
      in the tpacket2_hdr structure:

        - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
          that the tp_vlan_tci field has valid VLAN TCI value
        - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
          indicates that the tp_vlan_tpid field has valid VLAN TPID value

    - How to switch to TPACKET_V2:

        1. Replace struct tpacket_hdr by struct tpacket2_hdr
        2. Query header len and save
        3. Set protocol version to 2, set up ring as usual
        4. For getting the sockaddr_ll,
           use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
           ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``
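
Step 2 of the list above ("Query header len and save") can be done with the
PACKET_HDRLEN socket option. A minimal sketch: the option takes the protocol
version as input and returns the matching header length in the same integer.
query_tpacket_hdrlen() is a hypothetical helper name.

```c
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Returns the header length for `version` (for instance TPACKET_V2),
 * or -1 if getsockopt() fails; errno is left as set by getsockopt(). */
static int query_tpacket_hdrlen(int fd, int version)
{
    int val = version;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
        return -1;
    return val;
}
```

The returned value is the hdrlen used in step 4 when locating the
sockaddr_ll.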
				578
TPACKET_V2 --> TPACKET_V3:
    - Flexible buffer implementation for RX_RING:
        1. Blocks can be configured with non-static frame-size
        2. Read/poll is at a block-level (as opposed to packet-level)
        3. Added poll timeout to avoid indefinite user-space wait
           on idle links
        4. Added user-configurable knobs:

            4.1 block::timeout
            4.2 tpkt_hdr::sk_rxhash

    - RX Hash data available in user space
    - TX_RING semantics are conceptually similar to TPACKET_V2;
      use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
      instead of TPACKET2_HDRLEN. In the current implementation,
      the tp_next_offset field in the tpacket3_hdr MUST be set to
      zero, indicating that the ring does not hold variable sized frames.
      Packets with non-zero values of tp_next_offset will be dropped.
				597
				598	AF_PACKET fanout mode
				599	=====================
				600
				601	In the AF_PACKET fanout mode, packet reception can be load balanced among
				602	processes. This also works in combination with mmap(2) on packet sockets.
				603
				604	Currently implemented fanout policies are:
				605
				606	- PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
				607	- PACKET_FANOUT_LB: schedule to socket by round-robin
				608	- PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
				609	- PACKET_FANOUT_RND: schedule to socket by random selection
				610	- PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
				611	- PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping
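
A policy is selected by passing a single integer to setsockopt() with the
PACKET_FANOUT option. The example program in this section builds that integer
as fanout_id | (fanout_type << 16). The sketch below isolates that packing
step; fanout_arg() is a hypothetical helper name.

```c
#include <stdint.h>

/* Pack a 16-bit fanout group id (low half) and a fanout type such as
 * PACKET_FANOUT_HASH or PACKET_FANOUT_LB (high half) into one integer. */
static int fanout_arg(uint16_t group_id, uint16_t type)
{
    return (int)((uint32_t)group_id | ((uint32_t)type << 16));
}
```

Sockets that pass the same group id join the same fanout group, which is why
the example program derives the id once in the parent and reuses it in every
child.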
				612
				613	Minimal example code by David S. Miller (try things like "./test eth0 hash",
				614	"./test eth0 lb", etc.)::
				615
				616	#include <stddef.h>
				617	#include <stdlib.h>
				618	#include <stdio.h>
				619	#include <string.h>
				620
				621	#include <sys/types.h>
				622	#include <sys/wait.h>
				623	#include <sys/socket.h>
				624	#include <sys/ioctl.h>
				625
				626	#include <unistd.h>
				627
				628	#include <linux/if_ether.h>
				629	#include <linux/if_packet.h>
				630
				631	#include <net/if.h>
				632
				633	static const char *device_name;
				634	static int fanout_type;
				635	static int fanout_id;
				636
				637	#ifndef PACKET_FANOUT
				638	# define PACKET_FANOUT 18
				639	# define PACKET_FANOUT_HASH 0
				640	# define PACKET_FANOUT_LB 1
				641	#endif
				642
    static int setup_socket(void)
    {
        int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
        struct sockaddr_ll ll;
        struct ifreq ifr;
        int fanout_arg;

        /* Return -1 on failure so that the caller's (fd < 0) check works. */
        if (fd < 0) {
            perror("socket");
            return -1;
        }

        memset(&ifr, 0, sizeof(ifr));
        /* ifr is zeroed above, so the name stays NUL-terminated */
        strncpy(ifr.ifr_name, device_name, sizeof(ifr.ifr_name) - 1);
        err = ioctl(fd, SIOCGIFINDEX, &ifr);
        if (err < 0) {
            perror("SIOCGIFINDEX");
            return -1;
        }

        memset(&ll, 0, sizeof(ll));
        ll.sll_family = AF_PACKET;
        ll.sll_ifindex = ifr.ifr_ifindex;
        err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
        if (err < 0) {
            perror("bind");
            return -1;
        }

        /* group id in the low 16 bits, fanout type in the high 16 bits */
        fanout_arg = (fanout_id | (fanout_type << 16));
        err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                         &fanout_arg, sizeof(fanout_arg));
        if (err) {
            perror("setsockopt");
            return -1;
        }

        return fd;
    }
				682
				683	static void fanout_thread(void)
				684	{
				685	int fd = setup_socket();
				686	int limit = 10000;
				687
				688	if (fd < 0)
				689	exit(fd);
				690
				691	while (limit-- > 0) {
				692	char buf[1600];
				693	int err;
				694
				695	err = read(fd, buf, sizeof(buf));
				696	if (err < 0) {
				697	perror("read");
				698	exit(EXIT_FAILURE);
				699	}
				700	if ((limit % 10) == 0)
				701	fprintf(stdout, "(%d) \n", getpid());
				702	}
				703
				704	fprintf(stdout, "%d: Received 10000 packets\n", getpid());
				705
				706	close(fd);
				707	exit(0);
				708	}
				709
				710	int main(int argc, char **argp)
				711	{
				712	int fd, err;
				713	int i;
				714
				715	if (argc != 3) {
            fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
				717	return EXIT_FAILURE;
				718	}
				719
				720	if (!strcmp(argp[2], "hash"))
				721	fanout_type = PACKET_FANOUT_HASH;
				722	else if (!strcmp(argp[2], "lb"))
				723	fanout_type = PACKET_FANOUT_LB;
				724	else {
				725	fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
				726	exit(EXIT_FAILURE);
				727	}
				728
				729	device_name = argp[1];
				730	fanout_id = getpid() & 0xffff;
				731
				732	for (i = 0; i < 4; i++) {
				733	pid_t pid = fork();
				734
				735	switch (pid) {
				736	case 0:
				737	fanout_thread();
				738
				739	case -1:
				740	perror("fork");
				741	exit(EXIT_FAILURE);
				742	}
				743	}
				744
				745	for (i = 0; i < 4; i++) {
				746	int status;
				747
				748	wait(&status);
				749	}
				750
				751	return 0;
				752	}
				753
				754	AF_PACKET TPACKET_V3 example
				755	============================
				756
AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
sizes by doing its own memory management. It is based on blocks where polling
works on a per block basis instead of per ring as in TPACKET_V2 and its predecessor.
				760
				761	It is said that TPACKET_V3 brings the following benefits:
				762
				763	* ~15% - 20% reduction in CPU-usage
				764	* ~20% increase in packet capture rate
				765	* ~2x increase in packet density
				766	* Port aggregation analysis
				767	* Non static frame size to capture entire packet payload
				768
				769	So it seems to be a good candidate to be used with packet fanout.

Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
it with ``gcc -Wall -O2 blob.c``, and try things like ``./a.out eth0``, etc.)::

    /* Written from scratch, but kernel-to-user space API usage
     * dissected from lolpcap:
     *  Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
     *  License: GPL, version 2.0
     */

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <assert.h>
    #include <net/if.h>
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <poll.h>
    #include <unistd.h>
    #include <signal.h>
    #include <inttypes.h>
    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>

    #ifndef likely
    # define likely(x)      __builtin_expect(!!(x), 1)
    #endif
    #ifndef unlikely
    # define unlikely(x)    __builtin_expect(!!(x), 0)
    #endif

    /* Layout of the header at the start of each block in the ring */
    struct block_desc {
        uint32_t version;
        uint32_t offset_to_priv;
        struct tpacket_hdr_v1 h1;
    };

    struct ring {
        struct iovec *rd;       /* one iovec per block */
        uint8_t *map;           /* start of the mmap()ed ring */
        struct tpacket_req3 req;
    };

    static unsigned long packets_total = 0, bytes_total = 0;
    /* Written from the signal handler, hence volatile */
    static volatile sig_atomic_t sigint = 0;

    static void sighandler(int num)
    {
        sigint = 1;
    }

    static int setup_socket(struct ring *ring, char *netdev)
    {
        int err, i, fd, v = TPACKET_V3;
        struct sockaddr_ll ll;
        unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
        unsigned int blocknum = 64;

        fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
            perror("socket");
            exit(1);
        }

        /* Select TPACKET_V3 before the ring is created */
        err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
        if (err < 0) {
            perror("setsockopt");
            exit(1);
        }

        memset(&ring->req, 0, sizeof(ring->req));
        ring->req.tp_block_size = blocksiz;
        ring->req.tp_frame_size = framesiz;
        ring->req.tp_block_nr = blocknum;
        ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
        ring->req.tp_retire_blk_tov = 60;
        ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;

        err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
                         sizeof(ring->req));
        if (err < 0) {
            perror("setsockopt");
            exit(1);
        }

        ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
                         PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
        if (ring->map == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }

        /* Record the start address and length of every block */
        ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
        assert(ring->rd);
        for (i = 0; i < ring->req.tp_block_nr; ++i) {
            ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
            ring->rd[i].iov_len = ring->req.tp_block_size;
        }

        memset(&ll, 0, sizeof(ll));
        ll.sll_family = PF_PACKET;
        ll.sll_protocol = htons(ETH_P_ALL);
        ll.sll_ifindex = if_nametoindex(netdev);
        ll.sll_hatype = 0;
        ll.sll_pkttype = 0;
        ll.sll_halen = 0;

        err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
        if (err < 0) {
            perror("bind");
            exit(1);
        }

        return fd;
    }

    static void display(struct tpacket3_hdr *ppd)
    {
        struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
        struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);

        if (eth->h_proto == htons(ETH_P_IP)) {
            struct sockaddr_in ss, sd;
            char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];

            memset(&ss, 0, sizeof(ss));
            ss.sin_family = PF_INET;
            ss.sin_addr.s_addr = ip->saddr;
            getnameinfo((struct sockaddr *) &ss, sizeof(ss),
                        sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);

            memset(&sd, 0, sizeof(sd));
            sd.sin_family = PF_INET;
            sd.sin_addr.s_addr = ip->daddr;
            getnameinfo((struct sockaddr *) &sd, sizeof(sd),
                        dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);

            printf("%s -> %s, ", sbuff, dbuff);
        }

        printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
    }

    static void walk_block(struct block_desc *pbd, const int block_num)
    {
        int num_pkts = pbd->h1.num_pkts, i;
        unsigned long bytes = 0;
        struct tpacket3_hdr *ppd;

        /* The first packet header sits at a block-relative offset */
        ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
                                       pbd->h1.offset_to_first_pkt);
        for (i = 0; i < num_pkts; ++i) {
            bytes += ppd->tp_snaplen;
            display(ppd);

            /* Each header carries the offset to the next packet */
            ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
                                           ppd->tp_next_offset);
        }

        packets_total += num_pkts;
        bytes_total += bytes;
    }

    static void flush_block(struct block_desc *pbd)
    {
        pbd->h1.block_status = TP_STATUS_KERNEL;
    }

    static void teardown_socket(struct ring *ring, int fd)
    {
        munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
        free(ring->rd);
        close(fd);
    }

    int main(int argc, char **argp)
    {
        int fd, err;
        socklen_t len;
        struct ring ring;
        struct pollfd pfd;
        unsigned int block_num = 0, blocks = 64;
        struct block_desc *pbd;
        struct tpacket_stats_v3 stats;

        if (argc != 2) {
            fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
            return EXIT_FAILURE;
        }

        signal(SIGINT, sighandler);

        memset(&ring, 0, sizeof(ring));
        fd = setup_socket(&ring, argp[argc - 1]);
        assert(fd > 0);

        memset(&pfd, 0, sizeof(pfd));
        pfd.fd = fd;
        pfd.events = POLLIN | POLLERR;
        pfd.revents = 0;

        while (likely(!sigint)) {
            pbd = (struct block_desc *) ring.rd[block_num].iov_base;

            /* Block still owned by the kernel: wait until it is retired */
            if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
                poll(&pfd, 1, -1);
                continue;
            }

            walk_block(pbd, block_num);
            flush_block(pbd);
            block_num = (block_num + 1) % blocks;
        }

        len = sizeof(stats);
        err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
        if (err < 0) {
            perror("getsockopt");
            exit(1);
        }

        fflush(stdout);
        printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
               stats.tp_packets, bytes_total, stats.tp_drops,
               stats.tp_freeze_q_cnt);

        teardown_socket(&ring, fd);
        return 0;
    }

PACKET_QDISC_BYPASS
===================

If there is a requirement to load the network with many packets in a similar
fashion as pktgen does, you might set the following option after socket
creation::

    int one = 1;
    setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));

This has the side effect that packets sent through PF_PACKET will bypass the
kernel's qdisc layer and are forcibly pushed to the driver directly. This means
that packets are not buffered, tc disciplines are ignored, increased loss can
occur, and such packets are no longer visible to other PF_PACKET sockets. So,
you have been warned; generally, this can be useful for stress testing various
components of a system.

By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
on PF_PACKET sockets.
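
Since the option is off by default and silently changes delivery semantics, it
may help to wrap the call and check its result. The following is only a
sketch: the helper name ``set_qdisc_bypass`` is hypothetical, and it assumes
kernel headers recent enough to define PACKET_QDISC_BYPASS.

```c
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: enable (non-zero) or disable (zero) qdisc bypass on
 * an already created packet socket. Returns 0 on success, or -1 with errno
 * set by setsockopt() on failure.
 */
static int set_qdisc_bypass(int fd, int enable)
{
    int val = enable ? 1 : 0;

    return setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &val, sizeof(val));
}
```

A caller would typically invoke ``set_qdisc_bypass(fd, 1)`` right after
socket() and fall back to the regular qdisc path if the call fails.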

PACKET_TIMESTAMP
================

The PACKET_TIMESTAMP setting determines the source of the timestamp in
the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
NIC is capable of timestamping packets in hardware, you can request those
hardware timestamps to be used. Note: you may need to enable the generation
of hardware timestamps with SIOCSHWTSTAMP (see related information from
Documentation/networking/timestamping.rst).

PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING::

    int req = SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));

For the mmap(2)ed ring buffers, such timestamps are stored in the
``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members.
To determine what kind of timestamp has been reported, the tp_status field
is binary or'ed with the following possible bits ...

::

    TP_STATUS_TS_RAW_HARDWARE
    TP_STATUS_TS_SOFTWARE

... that are equivalent to their ``SOF_TIMESTAMPING_*`` counterparts. For the
RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
software fallback was invoked within PF_PACKET's processing code (less
precise).
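
The RX_RING rule above can be sketched as a small classifier over tp_status.
The enum and the function name ``rx_timestamp_kind`` are hypothetical names
introduced for illustration; only the ``TP_STATUS_TS_*`` bits come from
``<linux/if_packet.h>``.

```c
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical classification of the timestamp found in an RX frame */
enum ts_kind {
    TS_FALLBACK,        /* neither bit set: PF_PACKET software fallback */
    TS_SOFTWARE,        /* TP_STATUS_TS_SOFTWARE */
    TS_RAW_HARDWARE,    /* TP_STATUS_TS_RAW_HARDWARE */
};

static enum ts_kind rx_timestamp_kind(uint32_t tp_status)
{
    if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
        return TS_RAW_HARDWARE;
    if (tp_status & TP_STATUS_TS_SOFTWARE)
        return TS_SOFTWARE;
    return TS_FALLBACK;
}
```

The function only inspects the timestamp bits, so it can be applied to the
status word of any received frame regardless of the other status flags.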

Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
ii) call sendto() e.g. in blocking mode, iii) wait for the status of the
relevant frames to be updated, i.e. for the frames to be handed back to the
application, iv) walk through the frames to pick up the individual hw/sw
timestamps.

Only (!) if transmit timestamping is enabled are these bits combined with
TP_STATUS_AVAILABLE using a binary OR, so you must check for that in your
application (e.g. ``!(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))``
in a first step to see if the frame belongs to the application, and then
extract the type of timestamp from tp_status in a second step)!
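
The two-step check for TX_RING frames could look roughly like the sketch
below. Both function names are hypothetical; the status bits are the ones
defined in ``<linux/if_packet.h>``.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/if_packet.h>

/* Step 1: the frame is back with the application once the kernel is
 * neither waiting to send it nor currently sending it. */
static bool tx_frame_is_ours(uint32_t tp_status)
{
    return !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}

/* Step 2: the timestamp members are only meaningful if one of the
 * timestamp bits was or'ed into the status of a frame we own. */
static bool tx_frame_has_timestamp(uint32_t tp_status)
{
    return tx_frame_is_ours(tp_status) &&
           (tp_status & (TP_STATUS_TS_RAW_HARDWARE | TP_STATUS_TS_SOFTWARE));
}
```

Note that a frame marked TP_STATUS_WRONG_FORMAT also passes the first step,
so an application would still need to handle that case separately.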

If you don't care about timestamps and leave transmit timestamping disabled,
checking for TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If
in the TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and
``tp_{n,u}sec`` members do not contain a valid value. For TX_RINGs, by default
no timestamp is generated!

See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst
for more information on hardware timestamps.

Miscellaneous bits
==================

- Packet sockets work well together with Linux socket filters, thus you also
  might want to have a look at Documentation/networking/filter.rst

THANKS
======

Jesse Brandeburg, for fixing my grammatical/spelling errors