============
SNMP counter
============

This document explains the meaning of SNMP counters.

General IPv4 counters
=====================
All layer 4 packets and ICMP packets will change these counters, but
these counters won't be changed by layer 2 packets (such as STP) or
ARP packets.

* IpInReceives

Defined in `RFC1213 ipInReceives`_

.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26

The number of packets received by the IP layer. It is incremented at the
beginning of the ip_rcv function and is always updated together with
IpExtInOctets. It is incremented even if the packet is dropped
later (e.g. because the IP header is invalid or the checksum is wrong).
It counts the aggregated segments after GRO/LRO.

* IpInDelivers

Defined in `RFC1213 ipInDelivers`_

.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28

The number of packets delivered to the upper layer protocols, e.g. TCP, UDP,
ICMP and so on. If no one listens on a raw socket, only packets of
kernel-supported protocols are delivered. If someone listens on a raw
socket, all valid IP packets are delivered.

* IpOutRequests

Defined in `RFC1213 ipOutRequests`_

.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28

The number of packets sent via the IP layer, for both unicast and
multicast packets. It is always updated together with
IpExtOutOctets.

* IpExtInOctets and IpExtOutOctets

They are Linux kernel extensions with no RFC definitions. Note that
RFC1213 does define ifInOctets and ifOutOctets, but they
are different things. ifInOctets and ifOutOctets include the MAC
layer header size, but IpExtInOctets and IpExtOutOctets don't; they
only include the IP layer header and the IP layer data.

* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts

They indicate the number of four kinds of ECN IP packets. Please refer to
`Explicit Congestion Notification`_ for more details.

.. _Explicit Congestion Notification: https://tools.ietf.org/html/rfc3168#page-6

These 4 counters count how many packets are received per ECN
status. They count the real number of frames regardless of LRO/GRO. So
for the same packet, you might find that IpInReceives counts 1, but
IpExtInNoECTPkts counts 2 or more.

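Because the four ECN counters count individual frames while IpInReceives
counts aggregated segments, their ratio gives a rough idea of how much
GRO/LRO aggregation is happening. The sketch below is only an
illustration: the function name and the sample values are hypothetical,
and the ratio is an estimate, since packets dropped early may be counted
in one counter but not the others.

```python
ECN_COUNTERS = (
    "IpExtInNoECTPkts",
    "IpExtInECT0Pkts",
    "IpExtInECT1Pkts",
    "IpExtInCEPkts",
)


def frames_per_ip_packet(counters):
    """Estimate the average number of wire frames per packet seen by IP.

    `counters` is a dict of counter name -> value. Returns None when
    IpInReceives is zero, because the ratio is undefined in that case.
    """
    received = counters["IpInReceives"]
    if received == 0:
        return None
    frames = sum(counters[name] for name in ECN_COUNTERS)
    return frames / received


# Hypothetical sample values, not taken from a real machine.
sample = {
    "IpInReceives": 1000,
    "IpExtInNoECTPkts": 2400,
    "IpExtInECT0Pkts": 100,
    "IpExtInECT1Pkts": 0,
    "IpExtInCEPkts": 0,
}
ratio = frames_per_ip_packet(sample)
```

A ratio close to 1 suggests that little aggregation takes place; a larger
value suggests that several frames are merged before reaching the IP layer.
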
* IpInHdrErrors

Defined in `RFC1213 ipInHdrErrors`_. It indicates the packet is
dropped due to an IP header error. It might happen in both the IP input
and IP forward paths.

.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpInAddrErrors

Defined in `RFC1213 ipInAddrErrors`_. It is incremented in two
scenarios: (1) The IP address is invalid. (2) The destination IP
address is not a local address and IP forwarding is not enabled.

.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInNoRoutes

This counter means the packet is dropped because the IP stack receives a
packet and can't find a route for it in the route table. It might
happen when IP forwarding is enabled, the destination IP address is
not a local address, and there is no route for the destination IP
address.

* IpInUnknownProtos

Defined in `RFC1213 ipInUnknownProtos`_. It is incremented if the
layer 4 protocol is unsupported by the kernel. If an application is using
a raw socket, the kernel always delivers the packet to the raw socket
and this counter is not incremented.

.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInTruncatedPkts

For an IPv4 packet, it means the actual data size is smaller than the
"Total Length" field in the IPv4 header.

* IpInDiscards

Defined in `RFC1213 ipInDiscards`_. It indicates the packet is dropped
in the IP receiving path due to kernel internal reasons (e.g. not
enough memory).

.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutDiscards

Defined in `RFC1213 ipOutDiscards`_. It indicates the packet is
dropped in the IP sending path due to kernel internal reasons.

.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutNoRoutes

Defined in `RFC1213 ipOutNoRoutes`_. It indicates the packet is
dropped in the IP sending path because no route is found for it.

.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29

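On Linux, these counters can be read from /proc/net/snmp (the RFC1213
counters) and /proc/net/netstat (the IpExt and TcpExt extensions). Both
files use pairs of lines: a header line with field names, then a line with
values. The counter names used in this document are the line prefix joined
with the field name. A minimal parsing sketch follows; the function name
and the sample text are hypothetical, and the sample shows only a few
fields.

```python
def parse_snmp_text(text):
    """Parse /proc/net/snmp style text into {"IpInReceives": 1500, ...}."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Lines come in pairs: field names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            raise ValueError(f"mismatched lines: {prefix!r} vs {value_prefix!r}")
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix + name] = int(number)
    return counters


# Hypothetical, shortened sample; a real file has many more fields.
SAMPLE = """\
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors InDelivers OutRequests
Ip: 2 64 1500 3 1 1490 1200
IpExt: InNoRoutes InTruncatedPkts InOctets OutOctets
IpExt: 0 2 900000 450000
"""

counters = parse_snmp_text(SAMPLE)
```

On a Linux machine, the same function could be applied to the real files,
e.g. ``parse_snmp_text(open("/proc/net/snmp").read())``.
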
ICMP counters
=============
* IcmpInMsgs and IcmpOutMsgs

Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_

.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutMsgs: https://tools.ietf.org/html/rfc1213#page-43

As mentioned in RFC1213, these two counters include errors; they
are incremented even if the ICMP packet has an invalid type. The
ICMP output path checks the header of a raw socket, so
IcmpOutMsgs is still updated if the IP header is constructed by
a userspace program.

* ICMP named types

| These counters cover most of the common ICMP types. They are:
| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
| IcmpInParmProbs: `RFC1213 icmpInParmProbs`_
| IcmpInSrcQuenchs: `RFC1213 icmpInSrcQuenchs`_
| IcmpInRedirects: `RFC1213 icmpInRedirects`_
| IcmpInEchos: `RFC1213 icmpInEchos`_
| IcmpInEchoReps: `RFC1213 icmpInEchoReps`_
| IcmpInTimestamps: `RFC1213 icmpInTimestamps`_
| IcmpInTimestampReps: `RFC1213 icmpInTimestampReps`_
| IcmpInAddrMasks: `RFC1213 icmpInAddrMasks`_
| IcmpInAddrMaskReps: `RFC1213 icmpInAddrMaskReps`_
| IcmpOutDestUnreachs: `RFC1213 icmpOutDestUnreachs`_
| IcmpOutTimeExcds: `RFC1213 icmpOutTimeExcds`_
| IcmpOutParmProbs: `RFC1213 icmpOutParmProbs`_
| IcmpOutSrcQuenchs: `RFC1213 icmpOutSrcQuenchs`_
| IcmpOutRedirects: `RFC1213 icmpOutRedirects`_
| IcmpOutEchos: `RFC1213 icmpOutEchos`_
| IcmpOutEchoReps: `RFC1213 icmpOutEchoReps`_
| IcmpOutTimestamps: `RFC1213 icmpOutTimestamps`_
| IcmpOutTimestampReps: `RFC1213 icmpOutTimestampReps`_
| IcmpOutAddrMasks: `RFC1213 icmpOutAddrMasks`_
| IcmpOutAddrMaskReps: `RFC1213 icmpOutAddrMaskReps`_

.. _RFC1213 icmpInDestUnreachs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInTimeExcds: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInParmProbs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInRedirects: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchos: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchoReps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestamps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestampReps: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMasks: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-43

.. _RFC1213 icmpOutDestUnreachs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutTimeExcds: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutParmProbs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutRedirects: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutEchos: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutEchoReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestamps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestampReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMasks: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-46

Every ICMP type has two counters: 'In' and 'Out'. E.g., for the ICMP
Echo packet, they are IcmpInEchos and IcmpOutEchos. Their meanings are
straightforward. The 'In' counter means the kernel receives such a packet
and the 'Out' counter means the kernel sends such a packet.

* ICMP numeric types

They are IcmpMsgInType[N] and IcmpMsgOutType[N], where [N] indicates the
ICMP type number. These counters track all kinds of ICMP packets. The
ICMP type number definitions can be found in the `ICMP parameters`_
document.

.. _ICMP parameters: https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml

For example, if the Linux kernel sends an ICMP Echo packet,
IcmpMsgOutType8 increases by 1. And if the kernel gets an ICMP Echo Reply
packet, IcmpMsgInType0 increases by 1.

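The naming rule for the numeric counters can be written down as a small
helper. This is only a sketch of the naming convention described above;
the function name is hypothetical.

```python
def icmp_numeric_counter(icmp_type, direction):
    """Build the numeric ICMP counter name for a type number.

    `direction` is "In" for received packets or "Out" for sent packets.
    """
    if direction not in ("In", "Out"):
        raise ValueError("direction must be 'In' or 'Out'")
    if not 0 <= icmp_type <= 255:
        raise ValueError("the ICMP type is an 8-bit number")
    return f"IcmpMsg{direction}Type{icmp_type}"


ECHO_REQUEST = 8  # ICMP Echo
ECHO_REPLY = 0    # ICMP Echo Reply

sent_echo = icmp_numeric_counter(ECHO_REQUEST, "Out")
received_reply = icmp_numeric_counter(ECHO_REPLY, "In")
```
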
* IcmpInCsumErrors

This counter indicates the checksum of the ICMP packet is
wrong. The kernel verifies the checksum after updating IcmpInMsgs and
before updating IcmpMsgInType[N]. If a packet has a bad checksum,
IcmpInMsgs is updated but none of the IcmpMsgInType[N] counters is updated.

* IcmpInErrors and IcmpOutErrors

Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_

.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutErrors: https://tools.ietf.org/html/rfc1213#page-43

When an error occurs in the ICMP packet handler path, these two
counters are updated. The receiving packet path uses IcmpInErrors
and the sending packet path uses IcmpOutErrors. When IcmpInCsumErrors
is increased, IcmpInErrors is always increased too.

Relationship of the ICMP counters
---------------------------------
The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
are updated at the same time. The sum of IcmpMsgInType[N] plus
IcmpInErrors should be equal to or larger than IcmpInMsgs. When the kernel
receives an ICMP packet, it follows the logic below:

1. increase IcmpInMsgs
2. if there is any error, update IcmpInErrors and finish the process
3. update IcmpMsgInType[N]
4. handle the packet depending on the type; if there is any error, update
   IcmpInErrors and finish the process

So if all errors occur in step (2), IcmpInMsgs should be equal to the
sum of IcmpMsgInType[N] plus IcmpInErrors. If all errors occur in
step (4), IcmpInMsgs should be equal to the sum of
IcmpMsgInType[N]. If errors occur in both step (2) and step (4),
IcmpInMsgs should be less than the sum of IcmpMsgInType[N] plus
IcmpInErrors.

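The receive logic above can be modelled in a few lines to check the
relationship between the counters. This is a toy model, not kernel code:
the function names and the two error flags are hypothetical, and the
per-type counter on the receive path is taken to be IcmpMsgInType[N], in
line with the first sentence of this section.

```python
from collections import Counter


def icmp_receive(counters, icmp_type, early_error=False, handler_error=False):
    """Update the counters the way the four steps above describe."""
    counters["IcmpInMsgs"] += 1                    # step 1
    if early_error:                                # step 2, e.g. bad checksum
        counters["IcmpInErrors"] += 1
        return
    counters[f"IcmpMsgInType{icmp_type}"] += 1     # step 3
    if handler_error:                              # step 4
        counters["IcmpInErrors"] += 1


def in_type_sum(counters):
    """Sum of all IcmpMsgInType[N] counters."""
    return sum(value for name, value in counters.items()
               if name.startswith("IcmpMsgInType"))


counters = Counter()
icmp_receive(counters, 8)                       # handled without error
icmp_receive(counters, 0)                       # handled without error
icmp_receive(counters, 8, early_error=True)     # error in step 2
icmp_receive(counters, 3, handler_error=True)   # error in step 4
```

With errors in both step (2) and step (4), IcmpInMsgs (4 here) is less
than the sum of IcmpMsgInType[N] plus IcmpInErrors (3 + 2 here).
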
General TCP counters
====================
* TcpInSegs

Defined in `RFC1213 tcpInSegs`_

.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets received by the TCP layer. As mentioned in
RFC1213, it includes the packets received in error, such as checksum
errors, invalid TCP headers and so on. Only one error is not included:
the layer 2 destination address is not the NIC's layer 2
address. It might happen if the packet is a multicast or broadcast
packet, or the NIC is in promiscuous mode. In these situations, the
packets are delivered to the TCP layer, but the TCP layer discards
these packets before increasing TcpInSegs. The TcpInSegs counter
isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
counter only increases by 1.

* TcpOutSegs

Defined in `RFC1213 tcpOutSegs`_

.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets sent by the TCP layer. As mentioned in RFC1213,
it excludes the retransmitted packets. But it includes the SYN, ACK
and RST packets. Unlike TcpInSegs, TcpOutSegs is aware of
GSO, so if a packet is split into 2 by GSO, TcpOutSegs
increases by 2.

* TcpActiveOpens

Defined in `RFC1213 tcpActiveOpens`_

.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer sends a SYN and enters the SYN-SENT
state. Every time TcpActiveOpens increases by 1, TcpOutSegs should always
increase by 1.

* TcpPassiveOpens

Defined in `RFC1213 tcpPassiveOpens`_

.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer receives a SYN, replies with a SYN+ACK, and enters
the SYN-RCVD state.

* TcpExtTCPRcvCoalesce

When packets are received by the TCP layer and have not yet been read by the
application, the TCP layer will try to merge them. This counter
indicates how many packets are merged in such a situation. If GRO is
enabled, lots of packets are merged by GRO; these packets
are not counted in TcpExtTCPRcvCoalesce.

* TcpExtTCPAutoCorking

When sending packets, the TCP layer will try to merge small packets into
a bigger one. This counter increases by 1 for every packet merged in such a
situation. Please refer to the LWN article for more details:
https://lwn.net/Articles/576263/

* TcpExtTCPOrigDataSent

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is quoted below::

  TCPOrigDataSent: number of outgoing packets with original data (excluding
  retransmission but including data-in-SYN). This counter is different from
  TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
  more useful to track the TCP retransmission rate.

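As the quoted text says, TcpExtTCPOrigDataSent is a natural denominator
for a retransmission rate. The sketch below assumes the numerator is
TcpRetransSegs, the RFC1213 retransmitted-segments counter, which is not
described in this section; the function name and the sample values are
hypothetical. Since the counters are cumulative, the rate is computed from
the difference between two snapshots.

```python
def retransmission_rate(before, after):
    """Retransmitted segments per original data segment between snapshots.

    `before` and `after` are dicts of counter name -> value taken at two
    points in time. Returns None if no original data was sent in between.
    """
    sent = after["TcpExtTCPOrigDataSent"] - before["TcpExtTCPOrigDataSent"]
    retrans = after["TcpRetransSegs"] - before["TcpRetransSegs"]
    if sent <= 0:
        return None
    return retrans / sent


# Hypothetical snapshots taken some seconds apart.
before = {"TcpExtTCPOrigDataSent": 10000, "TcpRetransSegs": 100}
after = {"TcpExtTCPOrigDataSent": 12000, "TcpRetransSegs": 150}
rate = retransmission_rate(before, after)  # 50 retransmits / 2000 segments
```
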
* TCPSynRetrans

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is quoted below::

  TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
  retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

* TCPFastOpenActiveFail

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is quoted below::

  TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
  the remote does not accept it or the attempts timed out.

.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd

* TcpExtListenOverflows and TcpExtListenDrops

When the kernel receives a SYN from a client and the TCP accept queue
is full, the kernel drops the SYN and adds 1 to TcpExtListenOverflows.
At the same time the kernel also adds 1 to TcpExtListenDrops. When a
TCP socket is in the LISTEN state and the kernel needs to drop a packet,
the kernel always adds 1 to TcpExtListenDrops. So an increase of
TcpExtListenOverflows always comes with an increase of
TcpExtListenDrops, but TcpExtListenDrops can also increase without
TcpExtListenOverflows increasing, e.g. a memory allocation failure
also increases TcpExtListenDrops.

Note: The above explanation is based on kernel version 4.10 or above. On
an older kernel, the TCP stack behaves differently when the TCP accept
queue is full. On the older kernel, the TCP stack won't drop the SYN; it
completes the 3-way handshake. As the accept queue is full, the TCP
stack keeps the socket in the TCP half-open queue. As it is in the
half-open queue, the TCP stack sends SYN+ACK on an exponential backoff
timer. After the client replies with an ACK, the TCP stack checks whether
the accept queue is still full. If it is not full, the stack moves the
socket to the accept queue. If it is full, the stack keeps the socket in
the half-open queue, and the next time the client replies with an ACK,
this socket gets another chance to move to the accept queue.


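Since every accept queue overflow is also counted as a listen drop, the
difference between the two counters gives the drops that have other
causes. The sketch below assumes the kernel 4.10 or later behaviour
described above; the function name and the sample values are hypothetical.

```python
def listen_drop_breakdown(before, after):
    """Split listen drops between two snapshots by cause."""
    drops = after["TcpExtListenDrops"] - before["TcpExtListenDrops"]
    overflows = (after["TcpExtListenOverflows"]
                 - before["TcpExtListenOverflows"])
    return {
        "accept_queue_full": overflows,
        # Drops in LISTEN state not caused by a full accept queue,
        # e.g. a memory allocation failure.
        "other_reasons": drops - overflows,
    }


# Hypothetical snapshots of the two counters.
before = {"TcpExtListenDrops": 10, "TcpExtListenOverflows": 8}
after = {"TcpExtListenDrops": 25, "TcpExtListenOverflows": 18}
breakdown = listen_drop_breakdown(before, after)
```
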
TCP Fast Open
=============
* TcpEstabResets

Defined in `RFC1213 tcpEstabResets`_.

.. _RFC1213 tcpEstabResets: https://tools.ietf.org/html/rfc1213#page-48

* TcpAttemptFails

Defined in `RFC1213 tcpAttemptFails`_.

.. _RFC1213 tcpAttemptFails: https://tools.ietf.org/html/rfc1213#page-48

* TcpOutRsts

Defined in `RFC1213 tcpOutRsts`_. The RFC says this counter indicates
the 'segments sent containing the RST flag', but in the Linux kernel, this
counter indicates the segments the kernel tried to send. The sending
process might fail due to some errors (e.g. a memory allocation failure).

.. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52

* TcpExtTCPSpuriousRtxHostQueues

When the TCP stack wants to retransmit a packet and finds that the packet
is not lost in the network, but has not been sent yet, the TCP
stack gives up the retransmission and updates this counter. It
might happen if a packet stays too long in a qdisc or driver
queue.

* TcpEstabResets

The socket receives a RST packet in the Establish or CloseWait state.

* TcpExtTCPKeepAlive

This counter indicates how many keepalive packets were sent. Keepalive
is not enabled by default. A userspace program can enable it by
setting the SO_KEEPALIVE socket option.

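Enabling keepalive from a userspace program takes one setsockopt call. The
sketch below uses the Python standard library; the helper name is
hypothetical. Note that it only turns the option on: no keepalive packet is
sent until the connection has been idle for the keepalive time.

```python
import socket


def enable_keepalive(sock):
    """Turn on SO_KEEPALIVE and report whether the option is now set."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


# The option can be set before the socket is connected.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    keepalive_on = enable_keepalive(sock)
```
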
* TcpExtTCPSpuriousRTOs

The number of spurious retransmission timeouts detected by the `F-RTO`_
algorithm.

.. _F-RTO: https://tools.ietf.org/html/rfc5682

TCP Fast Path
=============
When the kernel receives a TCP packet, it has two paths to handle the
packet: one is the fast path, the other is the slow path. The comment in
the kernel code provides a good explanation of them, quoted below::

  It is split into a fast path and a slow path. The fast path is
  disabled when:

  - A zero window was announced from us
  - zero window probing
    is only handled properly on the slow path.
  - Out of order segments arrived.
  - Urgent data is expected.
  - There is no buffer space left
  - Unexpected TCP flags/window values/header lengths are received
    (detected by checking the TCP header against pred_flags)
  - Data is sent in both directions. The fast path only supports pure senders
    or pure receivers (this means either the sequence number or the ack
    value must stay constant)
  - Unexpected TCP option.

The kernel will try to use the fast path unless any of the above conditions
is satisfied. If the packets are out of order, the kernel handles
them in the slow path, which means the performance might not be very
good. The kernel also enters the slow path if "Delayed ack" is
used, because when using "Delayed ack", the data is sent in both
directions. When the TCP window scale option is not used, the kernel will
try to enable the fast path immediately when the connection enters the
established state. But if the TCP window scale option is used, the kernel
disables the fast path at first, and tries to enable it after the kernel
receives packets.

* TcpExtTCPPureAcks and TcpExtTCPHPAcks

If a packet has the ACK flag set and has no data, it is a pure ACK packet. If
the kernel handles it in the fast path, TcpExtTCPHPAcks increases by 1;
if the kernel handles it in the slow path, TcpExtTCPPureAcks
increases by 1.

* TcpExtTCPHPHits

If a TCP packet has data (which means it is not a pure ACK packet),
and this packet is handled in the fast path, TcpExtTCPHPHits
increases by 1.


TCP abort
=========
* TcpExtTCPAbortOnData

It means the TCP layer has data in flight, but needs to close the
connection. So the TCP layer sends a RST to the other side, indicating the
connection is not closed gracefully. An easy way to increase this
counter is using the SO_LINGER option. Please refer to the SO_LINGER
section of the `socket man page`_:

.. _socket man page: http://man7.org/linux/man-pages/man7/socket.7.html

By default, when an application closes a connection, the close function
returns immediately and the kernel tries to send the in-flight data
asynchronously. If you use the SO_LINGER option with l_onoff set to 1 and
l_linger set to a positive number, the close function won't return
immediately, but waits until the in-flight data is acked by the other
side; the max wait time is l_linger seconds. If you set l_onoff to 1 and
l_linger to 0, when the application closes a connection, the kernel sends
a RST immediately and increases the TcpExtTCPAbortOnData counter.

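The l_onoff=1, l_linger=0 setting described above can be applied from
Python by packing the two fields of struct linger. This is a sketch under
the assumption that struct linger consists of two C ints, as on Linux; the
helper names are hypothetical. It only configures the socket; the RST and
the counter update happen later, when a connected socket is closed.

```python
import socket
import struct

LINGER_FORMAT = "ii"  # struct linger { int l_onoff; int l_linger; }


def set_abortive_close(sock):
    """Set l_onoff=1 and l_linger=0 so that close() aborts the connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack(LINGER_FORMAT, 1, 0))


def get_linger(sock):
    """Read back (l_onoff, l_linger) from the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize(LINGER_FORMAT))
    return struct.unpack(LINGER_FORMAT, raw)


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    set_abortive_close(sock)
    l_onoff, l_linger = get_linger(sock)
```
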
* TcpExtTCPAbortOnClose

This counter means the application has unread data in the TCP layer when
the application wants to close the TCP connection. In such a situation,
the kernel sends a RST to the other side of the TCP connection.

* TcpExtTCPAbortOnMemory

When an application closes a TCP connection, the kernel still needs to track
the connection and let it complete the TCP disconnect process. E.g. an
app calls the close method of a socket, the kernel sends a FIN to the other
side of the connection, and then the app has no relationship with the
socket any more. But the kernel needs to keep the socket; this socket
becomes an orphan socket. The kernel waits for the reply from the other
side, and the socket finally comes to the TIME_WAIT state. When the kernel
does not have enough memory to keep the orphan socket, it sends a RST to
the other side and deletes the socket. In such a situation, the kernel
adds 1 to TcpExtTCPAbortOnMemory. Two conditions would trigger
TcpExtTCPAbortOnMemory:

1. the memory used by the TCP protocol is higher than the third value of
   tcp_mem. Please refer to the tcp_mem section in the `TCP man page`_:

.. _TCP man page: http://man7.org/linux/man-pages/man7/tcp.7.html

2. the orphan socket count is higher than net.ipv4.tcp_max_orphans


* TcpExtTCPAbortOnTimeout

This counter increases when any of the TCP timers expires. In such a
situation, the kernel won't send a RST; it just gives up the connection.

* TcpExtTCPAbortOnLinger

When a TCP connection enters the FIN_WAIT_2 state, instead of waiting
for the FIN packet from the other side, the kernel could send a RST and
delete the socket immediately. This is not the default behavior of the
Linux kernel TCP stack. By configuring the TCP_LINGER2 socket option,
you can let the kernel follow this behavior.

* TcpExtTCPAbortFailed

The kernel TCP layer sends a RST if the `RFC2525 2.17 section`_ is
satisfied. If an internal error occurs during this process,
TcpExtTCPAbortFailed is increased.

.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50

TCP Hybrid Slow Start
=====================
The Hybrid Slow Start algorithm is an enhancement of the traditional
TCP congestion window Slow Start algorithm. It uses two pieces of
information to detect whether the max bandwidth of the TCP path is
approached. The two pieces of information are ACK train length and
increase in packet delay. For detailed information, please refer to the
`Hybrid Slow Start paper`_. When either the ACK train length or the packet
delay hits a specific threshold, the congestion control algorithm
enters the Congestion Avoidance state. As of v4.20, two congestion
control algorithms use Hybrid Slow Start: cubic (the
default congestion control algorithm) and cdg. Four snmp counters
relate to the Hybrid Slow Start algorithm.

.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf

				548	* TcpExtTCPHystartTrainDetect
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	549
How many times the ACK train length threshold is detected.

* TcpExtTCPHystartTrainCwnd

The sum of CWND detected by ACK train length. Dividing this value by
TcpExtTCPHystartTrainDetect gives the average CWND detected by the
ACK train length.

* TcpExtTCPHystartDelayDetect

How many times the packet delay threshold is detected.

* TcpExtTCPHystartDelayCwnd

The sum of CWND detected by packet delay. Dividing this value by
TcpExtTCPHystartDelayDetect gives the average CWND detected by the
packet delay.
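
The averages described above can be computed from the raw counters. The
following is a minimal sketch, assuming the input uses the paired
name-row/value-row layout of /proc/net/netstat. The helper names
parse_netstat and average_cwnd are hypothetical, and the sample values
are made up for illustration.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into a {name: value} dict."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # The file comes in line pairs: a row of names, then a row of values.
    for names_row, values_row in zip(lines[0::2], lines[1::2]):
        prefix, names = names_row.split(":", 1)
        values = values_row.split(":", 1)[1]
        for name, value in zip(names.split(), values.split()):
            counters[prefix + name] = int(value)
    return counters


def average_cwnd(counters, kind):
    """Average CWND per detection; kind is "Train" or "Delay"."""
    detect = counters.get("TcpExtTCPHystart%sDetect" % kind, 0)
    total = counters.get("TcpExtTCPHystart%sCwnd" % kind, 0)
    # Without any detection there is nothing to average.
    return total / detect if detect else None


# Made-up sample values, for illustration only.
sample = (
    "TcpExt: TCPHystartTrainDetect TCPHystartTrainCwnd "
    "TCPHystartDelayDetect TCPHystartDelayCwnd\n"
    "TcpExt: 4 200 0 0\n"
)
stats = parse_netstat(sample)
print(average_cwnd(stats, "Train"))
print(average_cwnd(stats, "Delay"))
```

On a live Linux system the same functions could be fed the contents of
/proc/net/netstat instead of the sample string.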
				567
yupeng	8e2ea53	2018-12-12 00:14:10 -0800	[diff] [blame]	568	TCP retransmission and congestion control
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	569	=========================================
TCP has two retransmission mechanisms: SACK and fast recovery. They
are mutually exclusive. When SACK is enabled, the kernel TCP stack uses
SACK; otherwise it uses fast recovery. SACK is a TCP option, which is
defined in `RFC2018`_. Fast recovery is defined in `RFC6582`_ and is
also called 'Reno'.

TCP congestion control is a big and complex topic. To understand the
related SNMP counters, we need to know the states of the congestion
control state machine. There are 5 states: Open, Disorder, CWR,
Recovery and Loss. For details about these states, please refer to
pages 5 and 6 of this document:
https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf
				583
				584	.. _RFC2018: https://tools.ietf.org/html/rfc2018
				585	.. _RFC6582: https://tools.ietf.org/html/rfc6582
				586
				587	* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	588
When congestion control enters the Recovery state, TcpExtTCPSackRecovery
increases by 1 if SACK is used, and TcpExtTCPRenoRecovery increases by 1
if SACK is not used. These two counters mean the TCP stack begins to
retransmit the lost packets.
				593
				594	* TcpExtTCPSACKReneging
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	595
A packet was acknowledged by SACK, but the receiver has dropped this
packet, so the sender needs to retransmit it. In this situation, the
sender adds 1 to TcpExtTCPSACKReneging. Although it is unusual, the TCP
protocol allows a receiver to drop a packet which has been acknowledged
by SACK. The sender doesn't really know what happened on the receiver
side. The sender just waits until the RTO expires for this packet, then
assumes the packet has been dropped by the receiver.
				604
				605	* TcpExtTCPRenoReorder
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	606
The reordered packet is detected by fast recovery. This counter is only
used if SACK is disabled. The fast recovery algorithm detects reordering
by the number of duplicate ACKs. E.g., if retransmission is triggered,
and the original retransmitted packet is not lost but just out of
order, the receiver acknowledges multiple times: once for the
retransmitted packet and once for the arrival of the original out of
order packet. Thus the sender finds more ACKs than it expects, and
knows that reordering occurred.
				615
				616	* TcpExtTCPTSReorder
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	617
The reordered packet is detected when a hole is filled. E.g., assume the
sender sends packets 1,2,3,4,5, and the receiving order is
1,2,4,5,3. When the sender receives the ACK of packet 3 (which fills
the hole), either of two conditions lets TcpExtTCPTSReorder increase by
1: (1) packet 3 has not been retransmitted yet; (2) packet 3 has been
retransmitted but the timestamp of packet 3's ACK is earlier than the
retransmission timestamp.
				625
				626	* TcpExtTCPSACKReorder
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	627
The reordered packet is detected by SACK. SACK has two methods to
detect reordering: (1) A DSACK is received by the sender. It means the
sender sent the same packet more than once, and the only reason is
that the sender believed an out of order packet was lost and so sent
the packet again. (2) Assume packets 1,2,3,4,5 are sent by the sender,
and the sender has received SACKs for packets 2 and 5. If the sender
now receives a SACK for packet 4 and has not retransmitted that packet
yet, the sender knows packet 4 is out of order. The kernel TCP stack
increases TcpExtTCPSACKReorder in both of the above scenarios.
				638
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	639	* TcpExtTCPSlowStartRetrans
				640
				641	The TCP stack wants to retransmit a packet and the congestion control
				642	state is 'Loss'.
				643
				644	* TcpExtTCPFastRetrans
				645
				646	The TCP stack wants to retransmit a packet and the congestion control
				647	state is not 'Loss'.
				648
				649	* TcpExtTCPLostRetransmit
				650
				651	A SACK points out that a retransmission packet is lost again.
				652
				653	* TcpExtTCPRetransFail
				654
				655	The TCP stack tries to deliver a retransmission packet to lower layers
				656	but the lower layers return an error.
				657
				658	* TcpExtTCPSynRetrans
				659
				660	The TCP stack retransmits a SYN packet.
				661
yupeng	8e2ea53	2018-12-12 00:14:10 -0800	[diff] [blame]	662	DSACK
				663	=====
DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
duplicate packets to the sender. There are two kinds of
duplication: (1) a packet which has already been acknowledged is
duplicated; (2) an out of order packet is duplicated. The TCP stack
counts these two kinds of duplication on both the receiver side and
the sender side.

.. _RFC2883: https://tools.ietf.org/html/rfc2883
				672
				673	* TcpExtTCPDSACKOldSent
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	674
yupeng	8e2ea53	2018-12-12 00:14:10 -0800	[diff] [blame]	675	The TCP stack receives a duplicate packet which has been acked, so it
				676	sends a DSACK to the sender.
				677
				678	* TcpExtTCPDSACKOfoSent
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	679
yupeng	8e2ea53	2018-12-12 00:14:10 -0800	[diff] [blame]	680	The TCP stack receives an out of order duplicate packet, so it sends a
				681	DSACK to the sender.
				682
				683	* TcpExtTCPDSACKRecv
Randy Dunlap	65e9a6d	2019-03-17 17:17:45 -0700	[diff] [blame]	684
yupeng	a6c7c7a	2019-01-11 15:07:24 -0800	[diff] [blame]	685	The TCP stack receives a DSACK, which indicates an acknowledged
yupeng	8e2ea53	2018-12-12 00:14:10 -0800	[diff] [blame]	686	duplicate packet is received.
				687
				688	* TcpExtTCPDSACKOfoRecv
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	689
The TCP stack receives a DSACK, which indicates an out of order
duplicate packet is received.
				692
yupeng	a6c7c7a	2019-01-11 15:07:24 -0800	[diff] [blame]	693	invalid SACK and DSACK
Randy Dunlap	65e9a6d	2019-03-17 17:17:45 -0700	[diff] [blame]	694	======================
When a SACK (or DSACK) block is invalid, a corresponding counter is
updated. The validation method is based on the start/end sequence
numbers of the SACK block. For more details, please refer to the comment
of the function tcp_is_sackblock_valid in the kernel source code. A
SACK option can have up to 4 blocks, and they are checked
individually. E.g., if 3 blocks of a SACK are invalid, the
corresponding counter is updated 3 times. The comment of the
`Add counters for discarded SACK blocks`_ patch has additional
explanation.
				704
				705	.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32
				706
				707	* TcpExtTCPSACKDiscard
Randy Dunlap	65e9a6d	2019-03-17 17:17:45 -0700	[diff] [blame]	708
This counter indicates how many SACK blocks are invalid. If the invalid
SACK block is caused by ACK reordering, the TCP stack will only ignore
it and won't update this counter.
				712
				713	* TcpExtTCPDSACKIgnoredOld and TcpExtTCPDSACKIgnoredNoUndo
Randy Dunlap	65e9a6d	2019-03-17 17:17:45 -0700	[diff] [blame]	714
When a DSACK block is invalid, one of these two counters is
updated. Which counter is updated depends on the undo_marker flag
of the TCP socket. If the undo_marker is not set, the TCP stack is
unlikely to have retransmitted any packets. If we still receive an
invalid DSACK block, the reason might be that a packet was duplicated in
the middle of the network. In such a scenario, TcpExtTCPDSACKIgnoredNoUndo
is updated. If the undo_marker is set, TcpExtTCPDSACKIgnoredOld
is updated. As implied by its name, it might be an old packet.
				723
				724	SACK shift
Randy Dunlap	65e9a6d	2019-03-17 17:17:45 -0700	[diff] [blame]	725	==========
The Linux networking stack stores data in the sk_buff struct (skb for
short). If a SACK block spans multiple skbs, the TCP stack will try
to rearrange data in these skbs. E.g., if a SACK block acknowledges seq
10 to 15, skb1 has seq 10 to 13, and skb2 has seq 14 to 20, then seq 14
and 15 in skb2 are moved to skb1. This operation is 'shift'. If a
SACK block acknowledges seq 10 to 20, skb1 has seq 10 to 13, and skb2 has
seq 14 to 20, then all data in skb2 is moved to skb1 and skb2 is
discarded. This operation is 'merge'.
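
The two examples above can be replayed with a toy model. This is only a
sketch of the bookkeeping, assuming each skb is represented by an
inclusive (first_seq, last_seq) pair and that skb2 directly follows
skb1. The function name apply_sack is hypothetical and the code does not
mirror the kernel's implementation.

```python
def apply_sack(skb1, skb2, sack):
    """Return (new_skb1, new_skb2_or_None, operation) for one SACK block.

    All ranges are inclusive (first_seq, last_seq) pairs, and skb2 is
    assumed to start right after skb1 ends.
    """
    sack_start, sack_end = sack
    covers_skb1 = sack_start <= skb1[0] and skb1[1] < sack_end
    if not covers_skb1:
        # The block does not reach past skb1, so nothing moves.
        return skb1, skb2, "none"
    if sack_end >= skb2[1]:
        # All of skb2 is acknowledged: move everything, drop skb2.
        return (skb1[0], skb2[1]), None, "merge"
    # Only the head of skb2 is acknowledged: move that part into skb1.
    return (skb1[0], sack_end), (sack_end + 1, skb2[1]), "shift"


print(apply_sack((10, 13), (14, 20), (10, 15)))
print(apply_sack((10, 13), (14, 20), (10, 20)))
```

The first call models the 'shift' example (seq 14 and 15 move to skb1)
and the second models the 'merge' example (skb2 disappears).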
				734
				735	* TcpExtTCPSackShifted
Randy Dunlap	65e9a6d	2019-03-17 17:17:45 -0700	[diff] [blame]	736
A skb is shifted.

* TcpExtTCPSackMerged

A skb is merged.

* TcpExtTCPSackShiftFallback

A skb should be shifted or merged, but the TCP stack doesn't do it for
some reason.
				747
yupeng	2b965472	2018-12-29 21:46:38 -0800	[diff] [blame]	748	TCP out of order
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	749	================
yupeng	2b965472	2018-12-29 21:46:38 -0800	[diff] [blame]	750	* TcpExtTCPOFOQueue
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	751
yupeng	2b965472	2018-12-29 21:46:38 -0800	[diff] [blame]	752	The TCP layer receives an out of order packet and has enough memory
				753	to queue it.
				754
				755	* TcpExtTCPOFODrop
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	756
The TCP layer receives an out of order packet but doesn't have enough
memory, so it drops the packet. Such packets are not counted in
TcpExtTCPOFOQueue.
				760
				761	* TcpExtTCPOFOMerge
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	762
The received out of order packet overlaps with the previous
packet. The overlapping part is dropped. All TcpExtTCPOFOMerge
packets are also counted in TcpExtTCPOFOQueue.
				766
				767	TCP PAWS
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	768	========
PAWS (Protection Against Wrapped Sequence numbers) is an algorithm
which is used to drop old packets. It depends on the TCP
timestamps. For detailed information, please refer to the `timestamp wiki`_
and the `RFC of PAWS`_.
				773
				774	.. _RFC of PAWS: https://tools.ietf.org/html/rfc1323#page-17
				775	.. _timestamp wiki: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_timestamps
				776
				777	* TcpExtPAWSActive
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	778
yupeng	2b965472	2018-12-29 21:46:38 -0800	[diff] [blame]	779	Packets are dropped by PAWS in Syn-Sent status.
				780
				781	* TcpExtPAWSEstab
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	782
yupeng	2b965472	2018-12-29 21:46:38 -0800	[diff] [blame]	783	Packets are dropped by PAWS in any status other than Syn-Sent.
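
The timestamp comparison at the heart of PAWS can be sketched as
follows. This is a simplified illustration of the rule in the PAWS RFC
(reject a segment whose timestamp value is older than the most recently
accepted one, using 32-bit wraparound arithmetic). It leaves out the
extra conditions the kernel applies, and the function names are
hypothetical.

```python
def to_s32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def paws_reject(ts_recent, seg_tsval):
    """True if the segment timestamp is older than the last accepted one."""
    # A positive signed difference means seg_tsval lies "before" ts_recent,
    # even when the 32-bit timestamp counter has wrapped around.
    return to_s32(ts_recent - seg_tsval) > 0


def paws_counter(status):
    """Counter named in this section for a drop in the given status."""
    return "TcpExtPAWSActive" if status == "Syn-Sent" else "TcpExtPAWSEstab"


print(paws_reject(1000, 999))
print(paws_reject(0xFFFFFFF0, 5))
print(paws_counter("Syn-Sent"))
```

The second call shows why signed arithmetic is used: a timestamp of 5
that follows 0xFFFFFFF0 is newer, not older, once wraparound is taken
into account.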
				784
				785	TCP ACK skip
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	786	============
In some scenarios, the kernel avoids sending duplicate ACKs too
frequently. Please find more details in the tcp_invalid_ratelimit
section of the `sysctl document`_. When the kernel decides to skip an ACK
due to tcp_invalid_ratelimit, it updates one of the counters below
to indicate in which scenario the ACK was skipped. The ACK
is only skipped if the received packet is either a SYN packet or
has no data.
				794
				795	.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
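
The rate limiting idea can be sketched as a small class. This is an
illustration only, not the kernel code: the class name AckRateLimiter is
hypothetical, time is passed in as plain milliseconds, and the 500 ms
default is taken from the tcp_invalid_ratelimit description in the
sysctl document.

```python
class AckRateLimiter:
    """Allow at most one duplicate ACK per ratelimit_ms interval."""

    def __init__(self, ratelimit_ms=500):
        self.ratelimit_ms = ratelimit_ms
        self.last_ack_ms = None
        # Stands in for one of the TcpExtTCPACKSkipped* counters.
        self.skipped = 0

    def should_send(self, now_ms):
        """Return True to send the ACK, False to skip and count it."""
        recently_sent = (
            self.last_ack_ms is not None
            and now_ms - self.last_ack_ms < self.ratelimit_ms
        )
        if recently_sent:
            self.skipped += 1
            return False
        self.last_ack_ms = now_ms
        return True


limiter = AckRateLimiter()
for now in (0, 100, 499, 500):
    print(now, limiter.should_send(now))
print("skipped:", limiter.skipped)
```

In this model, which skipped counter is updated would depend only on the
scenario in which should_send was called, as described for each counter
below.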
				796
				797	* TcpExtTCPACKSkippedSynRecv
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	798
The ACK is skipped in Syn-Recv status. The Syn-Recv status means the
TCP stack has received a SYN and replied with a SYN+ACK, and is now
waiting for an ACK. Generally, the TCP stack doesn't need to send an ACK
in the Syn-Recv status, but in several scenarios it does. E.g., the TCP
stack receives the same SYN packet repeatedly, the received packet does
not pass the PAWS check, or the received packet's sequence number is out
of window. If the ACK sending frequency is higher than
tcp_invalid_ratelimit allows, the TCP stack skips sending the ACK and
increases TcpExtTCPACKSkippedSynRecv.
				809
				810
				811	* TcpExtTCPACKSkippedPAWS
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	812
The ACK is skipped because the PAWS (Protection Against Wrapped Sequence
numbers) check fails. If the PAWS check fails in the Syn-Recv, Fin-Wait-2
or Time-Wait status, the skipped ACK is counted in
TcpExtTCPACKSkippedSynRecv, TcpExtTCPACKSkippedFinWait2 or
TcpExtTCPACKSkippedTimeWait respectively. In all other statuses, the
skipped ACK is counted in TcpExtTCPACKSkippedPAWS.
				819
				820	* TcpExtTCPACKSkippedSeq
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	821
The sequence number is out of window, the timestamp passes the PAWS
check, and the TCP status is not Syn-Recv, Fin-Wait-2, or Time-Wait.
				824
				825	* TcpExtTCPACKSkippedFinWait2
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	826
The ACK is skipped in Fin-Wait-2 status. The reason is either that the
PAWS check fails or that the received sequence number is out of window.

* TcpExtTCPACKSkippedTimeWait

The ACK is skipped in Time-Wait status. The reason is either that the
PAWS check fails or that the received sequence number is out of window.
				834
				835	* TcpExtTCPACKSkippedChallenge
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	836
The ACK is skipped if the ACK is a challenge ACK. RFC 5961 defines
3 kinds of challenge ACK; please refer to `RFC 5961 section 3.2`_,
`RFC 5961 section 4.2`_ and `RFC 5961 section 5.2`_. Besides these
three scenarios, in some TCP statuses the Linux TCP stack also
sends challenge ACKs if the ACK number is before the first
unacknowledged number (stricter than `RFC 5961 section 5.2`_).
				843
				844	.. _RFC 5961 section 3.2: https://tools.ietf.org/html/rfc5961#page-7
				845	.. _RFC 5961 section 4.2: https://tools.ietf.org/html/rfc5961#page-9
				846	.. _RFC 5961 section 5.2: https://tools.ietf.org/html/rfc5961#page-11
				847
yupeng	a6c7c7a	2019-01-11 15:07:24 -0800	[diff] [blame]	848	TCP receive window
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	849	==================
yupeng	a6c7c7a	2019-01-11 15:07:24 -0800	[diff] [blame]	850	* TcpExtTCPWantZeroWindowAdv
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	851
Depending on current memory usage, the TCP stack tries to set the receive
window to zero. But the receive window might still be a nonzero
value. For example, if the previous window size is 10, and the TCP
stack receives 3 bytes, the current window size is 7 even if the
window size calculated from the memory usage is zero.
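
The example above, together with the two counters that follow, can be
modelled in a few lines. This is a toy sketch under simplifying
assumptions: window scaling is ignored, the function name advertise is
hypothetical, and the counters live in a plain Counter object rather
than in the kernel.

```python
from collections import Counter


def advertise(prev_window, bytes_received, wanted_window, counters):
    """Pick the window to advertise without retracting what was offered."""
    # Part of the previously advertised window that is still unused.
    still_open = prev_window - bytes_received
    new_window = wanted_window
    if new_window < still_open:
        if new_window == 0:
            counters["TcpExtTCPWantZeroWindowAdv"] += 1
        # The already advertised window must not shrink.
        new_window = still_open
    if new_window == 0 and prev_window != 0:
        counters["TcpExtTCPToZeroWindowAdv"] += 1
    elif new_window != 0 and prev_window == 0:
        counters["TcpExtTCPFromZeroWindowAdv"] += 1
    return new_window


counters = Counter()
print(advertise(10, 3, 0, counters))  # the example from the text
print(dict(counters))
```

With a previous window of 10, 3 bytes received and a wanted window of
zero, the sketch advertises 7 and only TcpExtTCPWantZeroWindowAdv is
updated.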
				857
				858	* TcpExtTCPToZeroWindowAdv
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	859
The TCP receive window is set to zero from a nonzero value.
				861
				862	* TcpExtTCPFromZeroWindowAdv
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	863
The TCP receive window is set to a nonzero value from zero.
				865
				866
				867	Delayed ACK
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	868	===========
TCP Delayed ACK is a technique which is used for reducing the
packet count in the network. For more details, please refer to the
`Delayed ACK wiki`_.
				872
				873	.. _Delayed ACK wiki: https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment
				874
				875	* TcpExtDelayedACKs
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	876
yupeng	a6c7c7a	2019-01-11 15:07:24 -0800	[diff] [blame]	877	A delayed ACK timer expires. The TCP stack will send a pure ACK packet
				878	and exit the delayed ACK mode.
				879
				880	* TcpExtDelayedACKLocked
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	881
A delayed ACK timer expires, but the TCP stack can't send an ACK
immediately because the socket is locked by a userspace program. The
TCP stack will send a pure ACK later (after the userspace program
unlocks the socket). When the TCP stack sends the pure ACK later, it
also updates TcpExtDelayedACKs and exits the delayed ACK
mode.
				888
				889	* TcpExtDelayedACKLost
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	890
It is updated when the TCP stack receives a packet which has already been
ACKed. A delayed ACK loss might cause this issue, but it can also be
triggered by other causes, such as a packet being duplicated in the
network.
				895
				896	Tail Loss Probe (TLP)
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	897	=====================
TLP is an algorithm which is used to detect TCP packet loss. For more
details, please refer to the `TLP paper`_.
				900
				901	.. _TLP paper: https://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
				902
				903	* TcpExtTCPLossProbes
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	904
yupeng	a6c7c7a	2019-01-11 15:07:24 -0800	[diff] [blame]	905	A TLP probe packet is sent.
				906
				907	* TcpExtTCPLossProbeRecovery
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	908
yupeng	a6c7c7a	2019-01-11 15:07:24 -0800	[diff] [blame]	909	A packet loss is detected and recovered by TLP.
yupeng	8e2ea53	2018-12-12 00:14:10 -0800	[diff] [blame]	910
yupeng	132c4e9	2019-02-09 14:46:18 -0800	[diff] [blame]	911	TCP Fast Open
				912	=============
TCP Fast Open is a technology which allows data transfer before the
3-way handshake completes. Please refer to the `TCP Fast Open wiki`_ for a
general description.
				916
				917	.. _TCP Fast Open wiki: https://en.wikipedia.org/wiki/TCP_Fast_Open
				918
				919	* TcpExtTCPFastOpenActive
				920
When the TCP stack receives an ACK packet in the SYN-SENT status, and
the ACK packet acknowledges the data in the SYN packet, the TCP stack
understands that the TFO cookie is accepted by the other side, and it
updates this counter.
				925
				926	* TcpExtTCPFastOpenActiveFail
				927
This counter indicates that the TCP stack initiated a TCP Fast Open,
but it failed. This counter is updated in three scenarios: (1)
the other side doesn't acknowledge the data in the SYN packet; (2) the
SYN packet which has the TFO cookie times out at least once; (3)
after the 3-way handshake, the retransmission timeout happens
net.ipv4.tcp_retries1 times, because some middle-boxes may black-hole
fast open after the handshake.
				935
				936	* TcpExtTCPFastOpenPassive
				937
				938	This counter indicates how many times the TCP stack accepts the fast
				939	open request.
				940
				941	* TcpExtTCPFastOpenPassiveFail
				942
This counter indicates how many times the TCP stack rejects the fast
open request. This is caused by either an invalid TFO cookie or an
error the TCP stack finds during the socket creation process.
				946
				947	* TcpExtTCPFastOpenListenOverflow
				948
When the number of pending fast open requests is larger than
fastopenq->max_qlen, the TCP stack rejects the fast open request
and updates this counter. When this counter is updated, the TCP stack
won't update TcpExtTCPFastOpenPassive or
TcpExtTCPFastOpenPassiveFail. The fastopenq->max_qlen is set by the
TCP_FASTOPEN socket option and cannot be larger than
net.core.somaxconn. For example:
				956
				957	setsockopt(sfd, SOL_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
				958
				959	* TcpExtTCPFastOpenCookieReqd
				960
				961	This counter indicates how many times a client wants to request a TFO
				962	cookie.
				963
				964	SYN cookies
				965	===========
SYN cookies are used to mitigate SYN floods. For details, please refer
to the `SYN cookies wiki`_.
				968
				969	.. _SYN cookies wiki: https://en.wikipedia.org/wiki/SYN_cookies
				970
				971	* TcpExtSyncookiesSent
				972
				973	It indicates how many SYN cookies are sent.
				974
				975	* TcpExtSyncookiesRecv
				976
				977	How many reply packets of the SYN cookies the TCP stack receives.
				978
				979	* TcpExtSyncookiesFailed
				980
The MSS decoded from the SYN cookie is invalid. When this counter is
updated, the received packet won't be treated as a SYN cookie and the
TcpExtSyncookiesRecv counter won't be updated.
				984
				985	Challenge ACK
				986	=============
For details of challenge ACK, please refer to the explanation of
TcpExtTCPACKSkippedChallenge.
				989
				990	* TcpExtTCPChallengeACK
				991
				992	The number of challenge acks sent.
				993
				994	* TcpExtTCPSYNChallenge
				995
The number of challenge ACKs sent in response to SYN packets. After
updating this counter, the TCP stack might send a challenge ACK and
update the TcpExtTCPChallengeACK counter, or it might skip sending
the challenge ACK and update the TcpExtTCPACKSkippedChallenge counter.
				1000
				1001	prune
				1002	=====
When a socket is under memory pressure, the TCP stack will try to
reclaim memory from the receiving queue and the out of order queue. One
of the reclaiming methods is 'collapse', which means allocating a big
skb, copying the contiguous skbs into the single big skb, and freeing
these contiguous skbs.
				1008
				1009	* TcpExtPruneCalled
				1010
The TCP stack tries to reclaim memory for a socket. After updating this
counter, the TCP stack will try to collapse the out of order queue and
the receiving queue. If the memory is still not enough, the TCP stack
will try to discard packets from the out of order queue (and update the
TcpExtOfoPruned counter).
				1016
				1017	* TcpExtOfoPruned
				1018
The TCP stack tries to discard packets from the out of order queue.
				1020
				1021	* TcpExtRcvPruned
				1022
After 'collapse' and discarding packets from the out of order queue, if
the memory actually used is still larger than the maximum allowed memory,
this counter is updated. It means the 'prune' fails.
				1026
				1027	* TcpExtTCPRcvCollapsed
				1028
				1029	This counter indicates how many skbs are freed during 'collapse'.
				1030
yupeng	b08794a	2018-11-10 13:38:12 -0800	[diff] [blame]	1031	examples
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	1032	========
yupeng	b08794a	2018-11-10 13:38:12 -0800	[diff] [blame]	1033
				1034	ping test
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	1035	---------
Run the ping command against the public DNS server 8.8.8.8::
				1037
				1038	nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
				1039	PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
				1040	64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms
				1041
				1042	--- 8.8.8.8 ping statistics ---
				1043	1 packets transmitted, 1 received, 0% packet loss, time 0ms
				1044	rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms
				1045
The nstat result::
				1047
				1048	nstatuser@nstat-a:~$ nstat
				1049	#kernel
				1050	IpInReceives 1 0.0
				1051	IpInDelivers 1 0.0
				1052	IpOutRequests 1 0.0
				1053	IcmpInMsgs 1 0.0
				1054	IcmpInEchoReps 1 0.0
				1055	IcmpOutMsgs 1 0.0
				1056	IcmpOutEchos 1 0.0
				1057	IcmpMsgInType0 1 0.0
				1058	IcmpMsgOutType8 1 0.0
				1059	IpExtInOctets 84 0.0
				1060	IpExtOutOctets 84 0.0
				1061	IpExtInNoECTPkts 1 0.0
				1062
The Linux server sent an ICMP Echo packet, so IpOutRequests,
IcmpOutMsgs, IcmpOutEchos and IcmpMsgOutType8 were each increased by 1.
The server got an ICMP Echo Reply from 8.8.8.8, so IpInReceives,
IcmpInMsgs, IcmpInEchoReps and IcmpMsgInType0 were each increased by 1.
The ICMP Echo Reply was passed to the ICMP layer via the IP layer, so
IpInDelivers was increased by 1. The default ping data size is 48, so an
ICMP Echo packet and its corresponding Echo Reply packet consist of:
				1070
				1071	* 14 bytes MAC header
				1072	* 20 bytes IP header
				1073	* 16 bytes ICMP header
				1074	* 48 bytes data (default value of the ping command)
				1075
The IP layer does not count the MAC header, so IpExtInOctets and
IpExtOutOctets are 20+16+48=84.
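
The arithmetic can be written out as a short sketch. The sizes are the
ones listed above, and the function name ip_octets is hypothetical.

```python
MAC_HEADER = 14   # link layer header, not visible to the IP counters
IP_HEADER = 20
ICMP_HEADER = 16  # figure used in the list above
PING_DATA = 48


def ip_octets(ip_header=IP_HEADER, icmp_header=ICMP_HEADER, data=PING_DATA):
    """Bytes counted by IpExtInOctets/IpExtOutOctets for one packet."""
    return ip_header + icmp_header + data


print(ip_octets())               # value reported by nstat
print(MAC_HEADER + ip_octets())  # same packet including the MAC header
```

The first figure matches the "56(84) bytes of data" line printed by ping
and the IpExtInOctets/IpExtOutOctets values in the nstat output.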
yupeng	80cc495	2018-11-16 11:17:40 -0800	[diff] [blame]	1077
				1078	tcp 3-way handshake
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	1079	-------------------
yupeng	80cc495	2018-11-16 11:17:40 -0800	[diff] [blame]	1080	On server side, we run::
				1081
				1082	nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
				1083	Listening on [0.0.0.0] (family 0, port 9000)
				1084
				1085	On client side, we run::
				1086
				1087	nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
				1088	Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
				1089
The server listened on TCP port 9000, the client connected to it, and
they completed the 3-way handshake.

On the server side, we can find the nstat output below::

nstatuser@nstat-b:~$ nstat | grep -i tcp
				1096	TcpPassiveOpens 1 0.0
				1097	TcpInSegs 2 0.0
				1098	TcpOutSegs 1 0.0
				1099	TcpExtTCPPureAcks 1 0.0
				1100
On the client side, we can find the nstat output below::

nstatuser@nstat-a:~$ nstat | grep -i tcp
				1104	TcpActiveOpens 1 0.0
				1105	TcpInSegs 1 0.0
				1106	TcpOutSegs 2 0.0
				1107
When the server received the first SYN, it replied with a SYN+ACK and
entered the SYN-RCVD state, so TcpPassiveOpens increased by 1. The server
received the SYN, sent the SYN+ACK, and received the ACK, so the server
sent 1 packet and received 2 packets: TcpInSegs increased by 2 and
TcpOutSegs increased by 1. The last ACK of the 3-way handshake is a pure
ACK without data, so TcpExtTCPPureAcks increased by 1.

When the client sent the SYN, it entered the SYN-SENT state, so
TcpActiveOpens increased by 1. The client sent the SYN, received the
SYN+ACK, and sent the ACK, so the client sent 2 packets and received 1
packet: TcpInSegs increased by 1 and TcpOutSegs increased by 2.
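
The counting in the two paragraphs above can be replayed with a toy
tally. This sketch only reproduces the bookkeeping of the example. It
does not model where the kernel actually increments each counter, and
the names HANDSHAKE and tally are hypothetical.

```python
from collections import Counter

# The three segments of the handshake as (sender, flags), in order.
HANDSHAKE = [("client", "SYN"), ("server", "SYN+ACK"), ("client", "ACK")]


def tally(segments):
    """Tally the counters shown in the example for each side."""
    stats = {"client": Counter(), "server": Counter()}
    for sender, flags in segments:
        receiver = "server" if sender == "client" else "client"
        stats[sender]["TcpOutSegs"] += 1
        stats[receiver]["TcpInSegs"] += 1
        if flags == "SYN":
            stats[sender]["TcpActiveOpens"] += 1
            stats[receiver]["TcpPassiveOpens"] += 1
        elif flags == "ACK":
            # A bare ACK that carries no data.
            stats[receiver]["TcpExtTCPPureAcks"] += 1
    return stats


for side, counters in tally(HANDSHAKE).items():
    print(side, dict(counters))
```

The resulting per-side totals agree with the two nstat outputs shown
above.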
				1119
				1120	TCP normal traffic
Randy Dunlap	ae5220c	2019-01-13 20:17:41 -0800	[diff] [blame]	1121	------------------
yupeng	80cc495	2018-11-16 11:17:40 -0800	[diff] [blame]	1122	Run nc on server::
				1123
				1124	nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
				1125	Listening on [0.0.0.0] (family 0, port 9000)
				1126
				1127	Run nc on client::
				1128
				1129	nstatuser@nstat-a:~$ nc -v nstat-b 9000
				1130	Connection to nstat-b 9000 port [tcp/*] succeeded!
				1131
				1132	Input a string in the nc client ('hello' in our example)::
				1133
				1134	nstatuser@nstat-a:~$ nc -v nstat-b 9000
				1135	Connection to nstat-b 9000 port [tcp/*] succeeded!
				1136	hello
				1137
				1138	The client side nstat output::
				1139
				1140	nstatuser@nstat-a:~$ nstat
				1141	#kernel
				1142	IpInReceives 1 0.0
				1143	IpInDelivers 1 0.0
				1144	IpOutRequests 1 0.0
				1145	TcpInSegs 1 0.0
				1146	TcpOutSegs 1 0.0
				1147	TcpExtTCPPureAcks 1 0.0
				1148	TcpExtTCPOrigDataSent 1 0.0
				1149	IpExtInOctets 52 0.0
				1150	IpExtOutOctets 58 0.0
				1151	IpExtInNoECTPkts 1 0.0
				1152
				1153	The server side nstat output::
				1154
				1155	nstatuser@nstat-b:~$ nstat
				1156	#kernel
				1157	IpInReceives 1 0.0
				1158	IpInDelivers 1 0.0
				1159	IpOutRequests 1 0.0
				1160	TcpInSegs 1 0.0
				1161	TcpOutSegs 1 0.0
				1162	IpExtInOctets 58 0.0
				1163	IpExtOutOctets 52 0.0
				1164	IpExtInNoECTPkts 1 0.0
				1165
Input a string in nc client side again ('world' in our example)::
				1167
				1168	nstatuser@nstat-a:~$ nc -v nstat-b 9000
				1169	Connection to nstat-b 9000 port [tcp/*] succeeded!
				1170	hello
				1171	world
				1172
				1173	Client side nstat output::
				1174
				1175	nstatuser@nstat-a:~$ nstat
				1176	#kernel
				1177	IpInReceives 1 0.0
				1178	IpInDelivers 1 0.0
				1179	IpOutRequests 1 0.0
				1180	TcpInSegs 1 0.0
				1181	TcpOutSegs 1 0.0
				1182	TcpExtTCPHPAcks 1 0.0
				1183	TcpExtTCPOrigDataSent 1 0.0
				1184	IpExtInOctets 52 0.0
				1185	IpExtOutOctets 58 0.0
				1186	IpExtInNoECTPkts 1 0.0
				1187
				1188
				1189	Server side nstat output::
				1190
				1191	nstatuser@nstat-b:~$ nstat
				1192	#kernel
				1193	IpInReceives 1 0.0
				1194	IpInDelivers 1 0.0
				1195	IpOutRequests 1 0.0
				1196	TcpInSegs 1 0.0
				1197	TcpOutSegs 1 0.0
				1198	TcpExtTCPHPHits 1 0.0
				1199	IpExtInOctets 58 0.0
				1200	IpExtOutOctets 52 0.0
				1201	IpExtInNoECTPkts 1 0.0
				1202
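The two client-side runs above can also be compared mechanically. The sketch below is only an illustration: ``parse_nstat`` is a hypothetical helper, and the two sample strings are abbreviated copies of the client-side nstat outputs shown above.

```python
def parse_nstat(text):
    """Parse nstat output lines into {counter_name: increment}."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the '#kernel' header, blank lines and anything malformed.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        counters[fields[0]] = int(fields[1])
    return counters


# Abbreviated copies of the first and second client-side outputs.
first_client = """#kernel
IpInReceives 1 0.0
TcpInSegs 1 0.0
TcpOutSegs 1 0.0
TcpExtTCPPureAcks 1 0.0
TcpExtTCPOrigDataSent 1 0.0
"""

second_client = """#kernel
IpInReceives 1 0.0
TcpInSegs 1 0.0
TcpOutSegs 1 0.0
TcpExtTCPHPAcks 1 0.0
TcpExtTCPOrigDataSent 1 0.0
"""

first = parse_nstat(first_client)
second = parse_nstat(second_client)

# Counters that appear in only one of the two runs.
only_first = sorted(set(first) - set(second))
only_second = sorted(set(second) - set(first))
print(only_first, only_second)
```

Feeding it the real output of two consecutive ``nstat`` runs should work the same way, since it only relies on the name and the first numeric column.
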
Comparing the first and the second client-side nstat outputs, we find
one difference: the first one had a 'TcpExtTCPPureAcks', but the second
one had a 'TcpExtTCPHPAcks'. The two server-side outputs differ too: the
second one had a 'TcpExtTCPHPHits', but the first one didn't. The network
traffic patterns were exactly the same: the client sent a packet to the
server, and the server replied with an ACK. But the kernel handled them in
different ways. When the TCP window scale option is not used, the kernel
tries to enable the fast path as soon as the connection enters the
established state. If the TCP window scale option is used, the kernel
disables the fast path at first and tries to enable it after it has
received packets. We can use the 'ss' command to verify whether the
window scale option is used, e.g. run the command below on either the
server or the client::

  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
  Netid    Recv-Q     Send-Q            Local Address:Port             Peer Address:Port
  tcp      0          0               192.168.122.250:40654         192.168.122.251:9000
             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98

The 'wscale:7,7' means both server and client set the window scale
option to 7. Now we can explain the nstat output in our test:

In the first client-side nstat output, the client sent a packet and the
server replied with an ACK. When the kernel handled this ACK, the fast path
was not enabled, so the ACK was counted in 'TcpExtTCPPureAcks'.

In the second client-side nstat output, the client sent a packet again
and received another ACK from the server. This time the fast path was
enabled and the ACK qualified for it, so the ACK was handled by the fast
path and counted in 'TcpExtTCPHPAcks'.

In the first server-side nstat output, the fast path was not enabled,
so there was no 'TcpExtTCPHPHits'.

In the second server-side nstat output, the fast path was enabled
and the packet received from the client qualified for it, so it
was counted in 'TcpExtTCPHPHits'.

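The window scale check done with 'ss' above can be scripted. This is a minimal sketch, assuming the detail line has the ``wscale:N,M`` form shown in the ss output in this section; ``parse_wscale`` is a hypothetical helper, not part of any tool.

```python
import re


def parse_wscale(ss_info_line):
    """Return the two window scale factors from an 'ss -i' detail line, or None."""
    match = re.search(r"\bwscale:(\d+),(\d+)", ss_info_line)
    if match is None:
        # No wscale field was printed; presumably the option is not in use.
        return None
    return int(match.group(1)), int(match.group(2))


sample = "ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448"
print(parse_wscale(sample))
print(parse_wscale("ts sack cubic rto:204"))
```

A ``None`` result would correspond to the case where the kernel tries to enable the fast path as soon as the connection is established.
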
TcpExtTCPAbortOnClose
---------------------
On the server side, we run the Python script below::

  import socket
  import time

  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  # Keep the connection open without ever reading from it.
  while True:
      time.sleep(9999999)

This Python script listens on port 9000, but doesn't read anything from
the connection.

On the client side, we send the string "hello" with nc::

  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000

Back on the server side, the server has received the "hello"
packet and the TCP layer has acked it, but the application hasn't
read it yet. We type Ctrl-C to terminate the server script. Then we
find that TcpExtTCPAbortOnClose increased by 1 on the server side::

  nstatuser@nstat-b:~$ nstat | grep -i abort
  TcpExtTCPAbortOnClose           1                  0.0

If we run tcpdump on the server side, we can see that the server sent a
RST after we typed Ctrl-C.

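If nstat is not available, the same counter can be read from /proc/net/netstat, which stores each group as a line of names followed by a line of values. The sketch below assumes that layout; ``parse_proc_netstat`` is a hypothetical helper and the sample text is made up and shortened.

```python
def parse_proc_netstat(text):
    """Parse /proc/net/netstat style text into nstat-like counter names."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # The file comes in pairs: a header line with names, then a line with values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            # "TcpExt" + "TCPAbortOnClose" gives the name nstat prints.
            counters[prefix + name] = int(number)
    return counters


# Shortened, made-up sample in the same layout as /proc/net/netstat.
sample = (
    "TcpExt: TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory\n"
    "TcpExt: 0 1 0\n"
    "IpExt: InNoRoutes InOctets\n"
    "IpExt: 0 58\n"
)
counters = parse_proc_netstat(sample)
print(counters["TcpExtTCPAbortOnClose"])
```

On a real system the input would be ``open('/proc/net/netstat').read()``. Those values are totals rather than the increments nstat shows by default, so the test would need one reading before and one after the Ctrl-C.
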
TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
--------------------------------------------------
Below is an example which lets the orphan socket count exceed
net.ipv4.tcp_max_orphans.
Change tcp_max_orphans to a smaller value on the client::

  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"

Client code (creates 64 connections to the server)::

  nstatuser@nstat-a:~$ cat client_orphan.py
  import socket
  import time

  server = 'nstat-b' # server address
  port = 9000

  count = 64

  connection_list = []

  # Open `count` connections and keep them all alive.
  for i in range(count):
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.connect((server, port))
      connection_list.append(s)
      print("connection_count: %d" % len(connection_list))

  while True:
      time.sleep(99999)

Server code (accepts 64 connections from the client)::

  nstatuser@nstat-b:~$ cat server_orphan.py
  import socket
  import time

  port = 9000
  count = 64

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(count)
  connection_list = []
  # Accept connections forever and never close them.
  while True:
      sock, addr = s.accept()
      connection_list.append((sock, addr))
      print("connection_count: %d" % len(connection_list))

Run the Python scripts on the server and the client.

On server::

  python3 server_orphan.py

On client::

  python3 client_orphan.py

Run iptables on server::

  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP

Type Ctrl-C on the client to stop client_orphan.py.

Check TcpExtTCPAbortOnMemory on client::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnMemory          54                 0.0

Check the orphan socket count on client::

  nstatuser@nstat-a:~$ ss -s
  Total: 131 (kernel 0)
  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0

  Transport Total     IP        IPv6
  *         0         -         -
  RAW       1         0         1
  UDP       1         1         0
  TCP       14        13        1
  INET      16        14        2
  FRAG      0         0         0

The explanation of the test: after running server_orphan.py and
client_orphan.py, we have 64 connections between the server and the
client. After the iptables command, the server drops all packets from
the client. When we type Ctrl-C on client_orphan.py, the client system
tries to close these connections, and before they are closed
gracefully they become orphan sockets. As iptables on the server
blocks packets from the client, the server never receives the FIN
from the client, so all connections on the client are stuck in the
FIN_WAIT_1 state and remain orphan sockets until they time out. We
wrote 10 to /proc/sys/net/ipv4/tcp_max_orphans, so the client system
keeps only 10 orphan sockets; for all other orphan sockets, the client
system sends a RST and deletes them. We had 64 connections, so
the 'ss -s' command shows that the system has 10 orphan sockets, and the
value of TcpExtTCPAbortOnMemory was 54.

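The numbers in this test (10 orphans kept, 54 aborted) follow from simple counting. The sketch below is a deliberately simplified model, not kernel code: ``close_connections`` is a hypothetical function that assumes every closed socket becomes an orphan and that an exact count is checked on every close.

```python
def close_connections(total, max_orphans):
    """Model closing `total` sockets whose FIN is never acknowledged."""
    orphans = 0
    aborted = 0
    for _ in range(total):
        orphans += 1  # the closed socket becomes an orphan
        if orphans > max_orphans:
            # Over the limit: reset the connection and drop the orphan.
            orphans -= 1
            aborted += 1
    return orphans, aborted


print(close_connections(64, 10))
```

The first value corresponds to the 'orphaned' field of 'ss -s' and the second to TcpExtTCPAbortOnMemory.
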
An additional note about the orphan socket count: you can find the
exact orphan socket count with the 'ss -s' command, but when the kernel
decides whether to increase TcpExtTCPAbortOnMemory and send a RST, it
doesn't always check the exact orphan socket count. To improve
performance, the kernel checks an approximate count first; only if the
approximate count is more than tcp_max_orphans does it check the
exact count. So if the approximate count is less than
tcp_max_orphans but the exact count is more than tcp_max_orphans, you
will find that TcpExtTCPAbortOnMemory is not increased at all. If
tcp_max_orphans is large enough, this won't occur, but if you decrease
tcp_max_orphans to a small value as in our test, you might hit this
issue. That is why the client in our test set up 64 connections although
tcp_max_orphans was 10. If the client only set up 11 connections, we
would not see TcpExtTCPAbortOnMemory change.

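The two-step check described above can be written down as a small model. It is a sketch of the logic in this paragraph only; ``too_many_orphans`` and its parameters are hypothetical names, and the counts are passed in by hand instead of being read from the kernel.

```python
def too_many_orphans(approx_count, exact_count, max_orphans):
    """Two-step check: cheap approximate count first, exact count only if needed."""
    if approx_count > max_orphans:
        return exact_count > max_orphans
    # The approximate count is trusted when it is under the limit.
    return False


# The approximate counter lags behind: the limit is exceeded but not noticed.
missed = too_many_orphans(approx_count=0, exact_count=11, max_orphans=10)
# Both counts are over the limit: the socket is aborted.
caught = too_many_orphans(approx_count=40, exact_count=64, max_orphans=10)
print(missed, caught)
```

The first call shows why a test with only 11 connections may leave TcpExtTCPAbortOnMemory unchanged.
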
Continuing the previous test, we wait for several minutes. Because
iptables on the server blocks the traffic, the server never receives the
FIN, and all the client's orphan sockets finally time out in the
FIN_WAIT_1 state. After a few minutes, we find
10 timeouts on the client::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnTimeout         10                 0.0

TcpExtTCPAbortOnLinger
----------------------
The server side code::

  nstatuser@nstat-b:~$ cat server_linger.py
  import socket
  import time

  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)

The client side code::

  nstatuser@nstat-a:~$ cat client_linger.py
  import socket
  import struct

  server = 'nstat-b' # server address
  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
  s.connect((server, port))
  s.close()

Run server_linger.py on server::

  nstatuser@nstat-b:~$ python3 server_linger.py

Run client_linger.py on client::

  nstatuser@nstat-a:~$ python3 client_linger.py

After running client_linger.py, check the output of nstat::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnLinger          1                  0.0

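The ``struct.pack('ii', 1, 10)`` argument in client_linger.py is the C ``struct linger``: two ints, ``l_onoff`` and ``l_linger`` (seconds). The sketch below sets the option on a fresh, unconnected socket and reads it back; ``linger_roundtrip`` is a hypothetical helper, and the round trip is assumed to behave this way on Linux.

```python
import socket
import struct


def linger_roundtrip(onoff, seconds):
    """Set SO_LINGER on a fresh TCP socket and read the value back."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # struct linger is two C ints: l_onoff and l_linger (seconds).
        s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                     struct.pack('ii', onoff, seconds))
        raw = s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                           struct.calcsize('ii'))
        return struct.unpack('ii', raw)
    finally:
        s.close()


result = linger_roundtrip(1, 10)
print(result)
```

No traffic is sent here, so this only confirms how the option is encoded; it does not change TcpExtTCPAbortOnLinger.
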
TcpExtTCPRcvCoalesce
--------------------
On the server, we run a program which listens on TCP port 9000, but
doesn't read any data::

  import socket
  import time
  port = 9000
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)

Save the above code as server_coalesce.py, and run::

  python3 server_coalesce.py

On the client, save the code below as client_coalesce.py::

  import socket
  server = 'nstat-b'
  port = 9000
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect((server, port))

Run::

  nstatuser@nstat-a:~$ python3 -i client_coalesce.py

We use '-i' to enter interactive mode, then send a packet::

  >>> s.send(b'foo')
  3

Send a packet again::

  >>> s.send(b'bar')
  3

On the server, run nstat::

  ubuntu@nstat-b:~$ nstat
  #kernel
  IpInReceives                    2                  0.0
  IpInDelivers                    2                  0.0
  IpOutRequests                   2                  0.0
  TcpInSegs                       2                  0.0
  TcpOutSegs                      2                  0.0
  TcpExtTCPRcvCoalesce            1                  0.0
  IpExtInOctets                   110                0.0
  IpExtOutOctets                  104                0.0
  IpExtInNoECTPkts                2                  0.0

The client sent two packets and the server didn't read any data. When
the second packet arrived at the server, the first packet was still in
the receive queue. So the TCP layer merged the two packets, and we
find that TcpExtTCPRcvCoalesce increased by 1.

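The octet counters in the nstat output above can be checked by hand. The sketch below assumes a 20-byte IPv4 header without options, a 20-byte TCP header and a 12-byte TCP timestamp option on every segment; these sizes are assumptions about the test setup, and ``ip_octets`` is a hypothetical helper.

```python
IP_HEADER = 20          # IPv4 header without options (assumed)
TCP_HEADER = 20         # TCP header without options
TCP_TIMESTAMP_OPT = 12  # timestamp option with padding (assumed enabled)


def ip_octets(payload_len):
    """IP-level size of one established-state TCP segment."""
    return IP_HEADER + TCP_HEADER + TCP_TIMESTAMP_OPT + payload_len


# Two data segments received by the server: 'foo' and 'bar'.
in_octets = ip_octets(len(b'foo')) + ip_octets(len(b'bar'))
# Two pure ACKs sent back by the server.
out_octets = 2 * ip_octets(0)
print(in_octets, out_octets)
```

Under these assumptions the result matches the IpExtInOctets and IpExtOutOctets values shown above (110 and 104).
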
TcpExtListenOverflows and TcpExtListenDrops
-------------------------------------------
On server, run the nc command, listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On client, run 3 nc commands in different terminals::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

The nc command only accepts 1 connection, and the accept queue length
is 1. In the current Linux implementation, setting the queue length to n
means the actual queue length is n+1. We have now created 3 connections:
1 is accepted by nc and 2 are in the accept queue, so the accept queue is
full.

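The "n means n+1" rule above can be illustrated with a one-line model. This is a sketch of the rule as stated in the text, not kernel code; ``accept_queue_is_full`` is a hypothetical name.

```python
def accept_queue_is_full(queued, backlog):
    """Model of the 'n means n+1' rule: full only when queued > backlog."""
    return queued > backlog


backlog = 1  # the queue length the text attributes to nc
states = [accept_queue_is_full(queued, backlog) for queued in range(3)]
print(states)
```

With a backlog of 1, the queue is reported full only once 2 connections are waiting, which is the situation after the third nc client connects.
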
Before running the 4th nc, we clean the nstat history on the server::

  nstatuser@nstat-b:~$ nstat -n

Run the 4th nc on the client::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000

If the nc server is running on kernel 4.10 or a later version, you
won't see the "Connection to ... succeeded!" string, because the kernel
drops the SYN if the accept queue is full. If the nc server is running
on an older kernel, you would see that the connection succeeded,
because the kernel would complete the 3-way handshake and keep the socket
on the half-open queue. I did the test on kernel 4.15. Below is the nstat
output on the server::

  nstatuser@nstat-b:~$ nstat
  #kernel
  IpInReceives                    4                  0.0
  IpInDelivers                    4                  0.0
  TcpInSegs                       4                  0.0
  TcpExtListenOverflows           4                  0.0
  TcpExtListenDrops               4                  0.0
  IpExtInOctets                   240                0.0
  IpExtInNoECTPkts                4                  0.0

Both TcpExtListenOverflows and TcpExtListenDrops were 4. If the time
between the 4th nc and the nstat was longer, the values of
TcpExtListenOverflows and TcpExtListenDrops would be larger, because
the SYN of the 4th nc was dropped and the client kept retrying.

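How large the two counters get depends on how many SYNs the client has re-sent by the time nstat runs. The sketch below assumes an initial retransmission timeout of 1 second that doubles on every retry and a limit of 6 retries; these values are assumptions about the client's configuration, and both function names are hypothetical.

```python
def syn_send_times(initial_rto=1.0, retries=6):
    """Times (seconds after the first SYN) at which each SYN is sent."""
    times = [0.0]
    rto = initial_rto
    for _ in range(retries):
        times.append(times[-1] + rto)
        rto *= 2  # exponential backoff
    return times


def syns_sent_by(elapsed, **kwargs):
    """Number of SYNs the client has sent `elapsed` seconds after the first."""
    return sum(1 for t in syn_send_times(**kwargs) if t <= elapsed)


print(syn_send_times())
print(syns_sent_by(10))
```

Under these assumptions a value of 4 would mean nstat was run somewhere between 7 and 15 seconds after the 4th nc was started.
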
IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
-------------------------------------------------
Server A IP address: 192.168.122.250

Server B IP address: 192.168.122.251

Prepare on server A, add a route to server B::

  $ sudo ip route add 8.8.8.8/32 via 192.168.122.251

Prepare on server B, disable send_redirects for all interfaces::

  $ sudo sysctl -w net.ipv4.conf.all.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.ens3.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.lo.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.default.send_redirects=0

We want to let server A send a packet to 8.8.8.8 and route the packet
to server B. When server B receives such a packet, it might send an ICMP
Redirect message to server A; setting send_redirects to 0 disables
this behavior.

First, generate IpInAddrErrors. On server B, we disable IP forwarding::

  $ sudo sysctl -w net.ipv4.conf.all.forwarding=0

On server A, we send packets to 8.8.8.8::

  $ nc -v 8.8.8.8 53

On server B, we check the output of nstat::

  $ nstat
  #kernel
  IpInReceives                    3                  0.0
  IpInAddrErrors                  3                  0.0
  IpExtInOctets                   180                0.0
  IpExtInNoECTPkts                3                  0.0

As we let server A route 8.8.8.8 to server B and disabled IP
forwarding on server B, server A sent packets to server B, and server B
dropped them and increased IpInAddrErrors. As the SYN packet is
re-sent when no SYN+ACK arrives, we find
multiple IpInAddrErrors.

Second, generate IpExtInNoRoutes. On server B, we enable IP
forwarding::

  $ sudo sysctl -w net.ipv4.conf.all.forwarding=1

Check the route table of server B and remove the default route::

  $ ip route show
  default via 192.168.122.1 dev ens3 proto static
  192.168.122.0/24 dev ens3 proto kernel scope link src 192.168.122.251
  $ sudo ip route delete default via 192.168.122.1 dev ens3 proto static

On server A, we contact 8.8.8.8 again::

  $ nc -v 8.8.8.8 53
  nc: connect to 8.8.8.8 port 53 (tcp) failed: Network is unreachable

On server B, run nstat::

  $ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpOutRequests                   1                  0.0
  IcmpOutMsgs                     1                  0.0
  IcmpOutDestUnreachs             1                  0.0
  IcmpMsgOutType3                 1                  0.0
  IpExtInNoRoutes                 1                  0.0
  IpExtInOctets                   60                 0.0
  IpExtOutOctets                  88                 0.0
  IpExtInNoECTPkts                1                  0.0

We enabled IP forwarding on server B, so when server B received a packet
whose destination IP address is 8.8.8.8, it tried to forward
the packet. We had deleted the default route, so there was no route for
8.8.8.8; server B increased IpExtInNoRoutes and sent an "ICMP
Destination Unreachable" message to server A.

Third, generate IpOutNoRoutes. Run ping command on server B::

  $ ping -c 1 8.8.8.8
  connect: Network is unreachable

Run nstat on server B::

  $ nstat
  #kernel
  IpOutNoRoutes                   1                  0.0

We have deleted the default route on server B. Server B couldn't find
a route for the 8.8.8.8 IP address, so server B increased
IpOutNoRoutes.

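The three cases in this section can be summarized as a small decision function. This is a simplified model of the behavior observed in the tests above, not the kernel's routing code; ``classify`` and its parameters are hypothetical names, and the route lists mirror server B's table with and without the default route.

```python
import ipaddress


def classify(dst, local_addrs, routes, forwarding, locally_generated=False):
    """Return the counter this simplified model would increment, or None."""
    addr = ipaddress.ip_address(dst)
    has_route = any(addr in ipaddress.ip_network(net) for net in routes)
    if locally_generated:
        return None if has_route else "IpOutNoRoutes"
    if dst in local_addrs:
        return None  # delivered locally
    if not forwarding:
        return "IpInAddrErrors"
    if not has_route:
        return "IpExtInNoRoutes"
    return None  # forwarded


local = {"192.168.122.251"}
with_default = ["0.0.0.0/0", "192.168.122.0/24"]
without_default = ["192.168.122.0/24"]

first = classify("8.8.8.8", local, with_default, forwarding=False)
second = classify("8.8.8.8", local, without_default, forwarding=True)
third = classify("8.8.8.8", local, without_default, forwarding=True,
                 locally_generated=True)
print(first, second, third)
```

The three calls correspond to the first, second and third test, in that order.
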
TcpExtTCPACKSkippedSynRecv
--------------------------
In this test, we send 3 identical SYN packets from client to server. The
first SYN makes the server create a socket, set it to the Syn-Recv state,
and reply with a SYN/ACK. The second SYN makes the server reply with the
SYN/ACK again and record the reply time (the duplicate ACK reply time). The
third SYN makes the server check the previous duplicate ACK reply time
and decide to skip the duplicate ACK, then increase the
TcpExtTCPACKSkippedSynRecv counter.

Run tcpdump to capture a SYN packet::

  nstatuser@nstat-a:~$ sudo tcpdump -c 1 -w /tmp/syn.pcap port 9000
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

Open another terminal, run nc command::

  nstatuser@nstat-a:~$ nc nstat-b 9000

As nstat-b wasn't listening on port 9000, it replied with a RST and
the nc command exited immediately. That was enough for tcpdump
to capture a SYN packet. A Linux server might use hardware
offload for the TCP checksum, so the checksum in /tmp/syn.pcap
might not be correct. We call tcprewrite to fix it::

  nstatuser@nstat-a:~$ tcprewrite --infile=/tmp/syn.pcap --outfile=/tmp/syn_fixcsum.pcap --fixcsum

On nstat-b, we run nc to listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On nstat-a, we block packets from port 9000; otherwise nstat-a would send
a RST to nstat-b::

  nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP

Send the SYN 3 times to nstat-b::

  nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done

Check the SNMP counter on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSynRecv      1                  0.0

As we expected, TcpExtTCPACKSkippedSynRecv is 1.

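The "record the reply time, then skip" behavior described at the start of this test can be modelled with a small rate limiter. The sketch below is an illustration, not kernel code: ``OowAckLimiter`` is a hypothetical class, and the 500 ms interval is an assumed value for the net.ipv4.tcp_invalid_ratelimit sysctl.

```python
class OowAckLimiter:
    """Rate limiter for replies to duplicate or out-of-window packets."""

    def __init__(self, ratelimit_ms=500):
        self.ratelimit_ms = ratelimit_ms  # assumed sysctl value
        self.last_reply_ms = None
        self.skipped = 0

    def should_reply(self, now_ms):
        recent = (self.last_reply_ms is not None
                  and now_ms - self.last_reply_ms < self.ratelimit_ms)
        if recent:
            self.skipped += 1  # would bump a TcpExtTCPACKSkipped* counter
            return False
        self.last_reply_ms = now_ms
        return True


limiter = OowAckLimiter()
# The first SYN only creates the Syn-Recv socket, so the limiter is
# consulted for the second and third SYN (here 10 ms and 20 ms later).
replies = [limiter.should_reply(t) for t in (10, 20)]
print(replies, limiter.skipped)
```

In this model, spacing the replayed SYNs further apart than the rate limit would leave the counter at 0, so the test relies on the replays arriving close together.
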
TcpExtTCPACKSkippedPAWS
-----------------------
To trigger PAWS, we can send an old SYN.

On nstat-b, let nc listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On nstat-a, run tcpdump to capture a SYN::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/paws_pre.pcap -c 1 port 9000
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-a, run nc as a client to connect to nstat-b::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

Now tcpdump has captured the SYN and exited. We should fix the
checksum::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/paws_pre.pcap --outfile /tmp/paws.pcap --fixcsum

Send the SYN packet twice::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/paws.pcap; done

On nstat-b, check the SNMP counter::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedPAWS         1                  0.0

We sent two SYNs via tcpreplay, and both failed the PAWS check.
nstat-b replied with an ACK to the first SYN, skipped the ACK
for the second SYN, and updated TcpExtTCPACKSkippedPAWS.

TcpExtTCPACKSkippedSeq
----------------------
To trigger TcpExtTCPACKSkippedSeq, we send packets which have a valid
timestamp (to pass the PAWS check) but an out-of-window sequence
number. The Linux TCP stack will not skip the reply if the packet
carries data, so we need a pure ACK packet. To generate such a packet, we
create two sockets: one on port 9000, another on port 9001. Then
we capture an ACK on port 9001 and change the source/destination port
numbers to match the port 9000 socket. We can then trigger
TcpExtTCPACKSkippedSeq with this packet.

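What "out of window" means for a sequence number can be sketched with 32-bit wrapping arithmetic. The code below is a simplified model of the usual TCP acceptability test, with hypothetical names and made-up example values; it is not taken from the kernel source.

```python
MASK = 0xFFFFFFFF


def before(seq1, seq2):
    """True if seq1 precedes seq2 in 32-bit wrapping sequence space."""
    return ((seq1 - seq2) & MASK) > 0x7FFFFFFF


def after(seq1, seq2):
    """True if seq1 follows seq2 in 32-bit wrapping sequence space."""
    return before(seq2, seq1)


def in_window(seq, end_seq, rcv_wup, rcv_nxt, rcv_wnd):
    """Acceptability check for a segment covering [seq, end_seq)."""
    window_end = (rcv_nxt + rcv_wnd) & MASK
    return not before(end_seq, rcv_wup) and not after(seq, window_end)


# A pure ACK has no payload, so seq == end_seq.
ok = in_window(1000, 1000, rcv_wup=1000, rcv_nxt=1000, rcv_wnd=29200)
# A sequence number taken from an unrelated connection is far outside.
bad = in_window(5000000, 5000000, rcv_wup=1000, rcv_nxt=1000, rcv_wnd=29200)
print(ok, bad)
```

The ACK captured on the port 9001 connection plays the role of the second case: its sequence number belongs to a different connection, so it is very unlikely to fall inside the port 9000 socket's window.
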
On nstat-b, open two terminals, run two nc commands to listen on both
port 9000 and port 9001::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)

On nstat-a, run two nc clients::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

  nstatuser@nstat-a:~$ nc -v nstat-b 9001
  Connection to nstat-b 9001 port [tcp/*] succeeded!

On nstat-a, run tcpdump to capture an ACK::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/seq_pre.pcap -c 1 dst port 9001
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-b, send a packet via the port 9001 socket. E.g. we sent a
string 'foo' in our example::

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)
  Connection from nstat-a 42132 received!
  foo

On nstat-a, tcpdump should have captured the ACK. We should check
the source port numbers of the two nc clients::

  nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
  State  Recv-Q   Send-Q         Local Address:Port           Peer Address:Port
  ESTAB  0        0            192.168.122.250:50208       192.168.122.251:9000
  ESTAB  0        0            192.168.122.250:42132       192.168.122.251:9001

Run tcprewrite, change port 9001 to port 9000, change port 42132 to
port 50208::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum

Now /tmp/seq.pcap is the packet we need. Send it to nstat-b::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/seq.pcap; done

Check TcpExtTCPACKSkippedSeq on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSeq          1                  0.0