Blame - Documentation/networking/snmp_counter.rst - SHIFTPHONES/mainline/linux

yupeng	b08794a	2018-11-10 13:38:12 -0800	[diff] [blame]	1	============
				2	SNMP counter
				3	============
				4
				5	This document explains the meaning of SNMP counters.
				6
				7	General IPv4 counters
				8	=====================
				9	All layer 4 packets and ICMP packets will change these counters, but
				10	these counters won't be changed by layer 2 packets (such as STP) or
				11	ARP packets.
				12
				13	* IpInReceives
				14	Defined in `RFC1213 ipInReceives`_
				15
				16	.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26
				17
				18	The number of packets received by the IP layer. It is incremented at the
				19	beginning of the ip_rcv function and is always updated together with
yupeng	8e2ea53	2018-12-12 00:14:10 -0800	[diff] [blame^]	20	IpExtInOctets. It is incremented even if the packet is dropped
				21	later (e.g. because the IP header is invalid or the checksum is wrong).
				22	It counts the number of aggregated segments after
yupeng	b08794a	2018-11-10 13:38:12 -0800	[diff] [blame]	23	GRO/LRO.
				24
				25	* IpInDelivers
				26	Defined in `RFC1213 ipInDelivers`_
				27
				28	.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28
				29
				30	The number of packets delivered to the upper layer protocols, e.g. TCP, UDP,
				31	ICMP and so on. If no one listens on a raw socket, only packets of kernel
				32	supported protocols are delivered. If someone listens on a raw
				33	socket, all valid IP packets are delivered.
				34
				35	* IpOutRequests
				36	Defined in `RFC1213 ipOutRequests`_
				37
				38	.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28
				39
				40	The number of packets sent via the IP layer, for both unicast and
				41	multicast packets. It is always updated together with
				42	IpExtOutOctets.
				43
				44	* IpExtInOctets and IpExtOutOctets
yupeng	80cc495	2018-11-16 11:17:40 -0800	[diff] [blame]	45	They are Linux kernel extensions and have no RFC definitions. Note that
yupeng	b08794a	2018-11-10 13:38:12 -0800	[diff] [blame]	46	RFC1213 does define ifInOctets and ifOutOctets, but they
				47	are different things. The ifInOctets and ifOutOctets include the MAC
				48	layer header size but IpExtInOctets and IpExtOutOctets don't; they
				49	only include the IP layer header and the IP layer data.
				50
				51	* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts
				52	They indicate the number of four kinds of ECN IP packets. Please refer to
				53	`Explicit Congestion Notification`_ for more details.
				54
				55	.. _Explicit Congestion Notification: https://tools.ietf.org/html/rfc3168#page-6
				56
				57	These 4 counters count how many packets are received per ECN
				58	status. They count the real frame number regardless of LRO/GRO. So
				59	for the same packet, you might find that IpInReceives counts 1, but
				60	IpExtInNoECTPkts counts 2 or more.
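
As a minimal sketch of how the four ECN counters partition incoming packets, the snippet below maps the two ECN bits (the low-order bits of the IPv4 TOS byte, as laid out in RFC 3168) to the counter names listed above. The helper names ``ecn_counter_for_tos`` and ``count_ecn`` are hypothetical and only illustrate the classification; they are not kernel functions.

```python
# ECN codepoints from RFC 3168, keyed by the two low-order bits of the TOS byte.
ECN_COUNTERS = {
    0b00: "IpExtInNoECTPkts",  # Not-ECT
    0b01: "IpExtInECT1Pkts",   # ECT(1)
    0b10: "IpExtInECT0Pkts",   # ECT(0)
    0b11: "IpExtInCEPkts",     # CE (congestion experienced)
}


def ecn_counter_for_tos(tos: int) -> str:
    """Return the counter name matching the ECN bits of a TOS byte."""
    return ECN_COUNTERS[tos & 0b11]


def count_ecn(tos_values):
    """Tally a sequence of TOS bytes per ECN counter (illustrative helper)."""
    totals = {name: 0 for name in ECN_COUNTERS.values()}
    for tos in tos_values:
        totals[ecn_counter_for_tos(tos)] += 1
    return totals
```

Each received frame lands in exactly one of the four buckets, so the four counters together count every received frame once.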
				61
yupeng	8e2ea53	2018-12-12 00:14:10 -0800	[diff] [blame^]	62	* IpInHdrErrors
				63	Defined in `RFC1213 ipInHdrErrors`_. It indicates that the packet was
				64	dropped due to an IP header error. It might happen in both the IP input
				65	and IP forward paths.
				66
				67	.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27
				68
				69	* IpInAddrErrors
				70	Defined in `RFC1213 ipInAddrErrors`_. It is incremented in two
				71	scenarios: (1) The IP address is invalid. (2) The destination IP
				72	address is not a local address and IP forwarding is not enabled.
				73
				74	.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27
				75
				76	* IpExtInNoRoutes
				77	This counter means the packet is dropped when the IP stack receives a
				78	packet and can't find a route for it from the route table. It might
				79	happen when IP forwarding is enabled and the destination IP address is
				80	not a local address and there is no route for the destination IP
				81	address.
				82
				83	* IpInUnknownProtos
				84	Defined in `RFC1213 ipInUnknownProtos`_. It will be increased if the
				85	layer 4 protocol is unsupported by kernel. If an application is using
				86	raw socket, kernel will always deliver the packet to the raw socket
				87	and this counter won't be increased.
				88
				89	.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27
				90
				91	* IpExtInTruncatedPkts
				92	For an IPv4 packet, it means the actual data size is smaller than the
				93	"Total Length" field in the IPv4 header.
				94
				95	* IpInDiscards
				96	Defined in `RFC1213 ipInDiscards`_. It indicates that the packet was dropped
				97	in the IP receiving path due to kernel internal reasons (e.g. not
				98	enough memory).
				99
				100	.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28
				101
				102	* IpOutDiscards
				103	Defined in `RFC1213 ipOutDiscards`_. It indicates that the packet was
				104	dropped in the IP sending path due to kernel internal reasons.
				105
				106	.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28
				107
				108	* IpOutNoRoutes
				109	Defined in `RFC1213 ipOutNoRoutes`_. It indicates that the packet was
				110	dropped in the IP sending path because no route was found for it.
				111
				112	.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29
				113
yupeng	b08794a	2018-11-10 13:38:12 -0800	[diff] [blame]	114	ICMP counters
				115	=============
				116	* IcmpInMsgs and IcmpOutMsgs
				117	Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_
				118
				119	.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
				120	.. _RFC1213 icmpOutMsgs: https://tools.ietf.org/html/rfc1213#page-43
				121
				122	As mentioned in RFC1213, these two counters include errors; they
				123	are incremented even if the ICMP packet has an invalid type. The
				124	ICMP output path will check the header of a raw socket, so
				125	IcmpOutMsgs is still updated if the IP header is constructed by
				126	a userspace program.
				127
				128	* ICMP named types
				129	| These counters include most of the common ICMP types, they are:
				130	| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
				131	| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
				132	| IcmpInParmProbs: `RFC1213 icmpInParmProbs`_
				133	| IcmpInSrcQuenchs: `RFC1213 icmpInSrcQuenchs`_
				134	| IcmpInRedirects: `RFC1213 icmpInRedirects`_
				135	| IcmpInEchos: `RFC1213 icmpInEchos`_
				136	| IcmpInEchoReps: `RFC1213 icmpInEchoReps`_
				137	| IcmpInTimestamps: `RFC1213 icmpInTimestamps`_
				138	| IcmpInTimestampReps: `RFC1213 icmpInTimestampReps`_
				139	| IcmpInAddrMasks: `RFC1213 icmpInAddrMasks`_
				140	| IcmpInAddrMaskReps: `RFC1213 icmpInAddrMaskReps`_
				141	| IcmpOutDestUnreachs: `RFC1213 icmpOutDestUnreachs`_
				142	| IcmpOutTimeExcds: `RFC1213 icmpOutTimeExcds`_
				143	| IcmpOutParmProbs: `RFC1213 icmpOutParmProbs`_
				144	| IcmpOutSrcQuenchs: `RFC1213 icmpOutSrcQuenchs`_
				145	| IcmpOutRedirects: `RFC1213 icmpOutRedirects`_
				146	| IcmpOutEchos: `RFC1213 icmpOutEchos`_
				147	| IcmpOutEchoReps: `RFC1213 icmpOutEchoReps`_
				148	| IcmpOutTimestamps: `RFC1213 icmpOutTimestamps`_
				149	| IcmpOutTimestampReps: `RFC1213 icmpOutTimestampReps`_
				150	| IcmpOutAddrMasks: `RFC1213 icmpOutAddrMasks`_
				151	| IcmpOutAddrMaskReps: `RFC1213 icmpOutAddrMaskReps`_
				152
				153	.. _RFC1213 icmpInDestUnreachs: https://tools.ietf.org/html/rfc1213#page-41
				154	.. _RFC1213 icmpInTimeExcds: https://tools.ietf.org/html/rfc1213#page-41
				155	.. _RFC1213 icmpInParmProbs: https://tools.ietf.org/html/rfc1213#page-42
				156	.. _RFC1213 icmpInSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-42
				157	.. _RFC1213 icmpInRedirects: https://tools.ietf.org/html/rfc1213#page-42
				158	.. _RFC1213 icmpInEchos: https://tools.ietf.org/html/rfc1213#page-42
				159	.. _RFC1213 icmpInEchoReps: https://tools.ietf.org/html/rfc1213#page-42
				160	.. _RFC1213 icmpInTimestamps: https://tools.ietf.org/html/rfc1213#page-42
				161	.. _RFC1213 icmpInTimestampReps: https://tools.ietf.org/html/rfc1213#page-43
				162	.. _RFC1213 icmpInAddrMasks: https://tools.ietf.org/html/rfc1213#page-43
				163	.. _RFC1213 icmpInAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-43
				164
				165	.. _RFC1213 icmpOutDestUnreachs: https://tools.ietf.org/html/rfc1213#page-44
				166	.. _RFC1213 icmpOutTimeExcds: https://tools.ietf.org/html/rfc1213#page-44
				167	.. _RFC1213 icmpOutParmProbs: https://tools.ietf.org/html/rfc1213#page-44
				168	.. _RFC1213 icmpOutSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-44
				169	.. _RFC1213 icmpOutRedirects: https://tools.ietf.org/html/rfc1213#page-44
				170	.. _RFC1213 icmpOutEchos: https://tools.ietf.org/html/rfc1213#page-45
				171	.. _RFC1213 icmpOutEchoReps: https://tools.ietf.org/html/rfc1213#page-45
				172	.. _RFC1213 icmpOutTimestamps: https://tools.ietf.org/html/rfc1213#page-45
				173	.. _RFC1213 icmpOutTimestampReps: https://tools.ietf.org/html/rfc1213#page-45
				174	.. _RFC1213 icmpOutAddrMasks: https://tools.ietf.org/html/rfc1213#page-45
				175	.. _RFC1213 icmpOutAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-46
				176
				177	Every ICMP type has two counters: 'In' and 'Out'. E.g., for the ICMP
				178	Echo packet, they are IcmpInEchos and IcmpOutEchos. Their meanings are
				179	straightforward. The 'In' counter means kernel receives such a packet
				180	and the 'Out' counter means kernel sends such a packet.
				181
				182	* ICMP numeric types
				183	They are IcmpMsgInType[N] and IcmpMsgOutType[N], where [N] is the
				184	ICMP type number. These counters track all kinds of ICMP packets. The
				185	ICMP type number definitions can be found in the `ICMP parameters`_
				186	document.
				187
				188	.. _ICMP parameters: https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml
				189
				190	For example, if the Linux kernel sends an ICMP Echo packet,
				191	IcmpMsgOutType8 increases by 1. If the kernel gets an ICMP Echo Reply
				192	packet, IcmpMsgInType0 increases by 1.
				193
				194	* IcmpInCsumErrors
				195	This counter indicates that the checksum of the ICMP packet is
				196	wrong. The kernel verifies the checksum after updating IcmpInMsgs and
				197	before updating IcmpMsgInType[N]. If a packet has a bad checksum,
				198	IcmpInMsgs is updated but none of IcmpMsgInType[N] is updated.
				199
				200	* IcmpInErrors and IcmpOutErrors
				201	Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_
				202
				203	.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
				204	.. _RFC1213 icmpOutErrors: https://tools.ietf.org/html/rfc1213#page-43
				205
				206	When an error occurs in the ICMP packet handler path, these two
				207	counters are updated. The receiving packet path uses IcmpInErrors
				208	and the sending packet path uses IcmpOutErrors. When IcmpInCsumErrors
				209	is incremented, IcmpInErrors is always incremented too.
				210
				211	relationship of the ICMP counters
				212	---------------------------------
				213	The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
				214	are updated at the same time. The sum of IcmpMsgInType[N] plus
				215	IcmpInErrors should be equal to or larger than IcmpInMsgs. When the kernel
				216	receives an ICMP packet, it follows the logic below:
				217
				218	1. increase IcmpInMsgs
				219	2. if there is any error, update IcmpInErrors and finish the process
				220	3. update IcmpMsgInType[N]
				221	4. handle the packet depending on the type; if there is any error, update
				222	IcmpInErrors and finish the process
				223
				224	So if all errors occur in step (2), IcmpInMsgs should be equal to the
				225	sum of IcmpMsgInType[N] plus IcmpInErrors. If all errors occur in
				226	step (4), IcmpInMsgs should be equal to the sum of
				227	IcmpMsgInType[N]. If the errors occur in both step (2) and step (4),
				228	IcmpInMsgs should be less than the sum of IcmpMsgInType[N] plus
				229	IcmpInErrors.
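
The receive-path accounting can be modelled in a few lines. This is only a sketch of the four steps, not kernel code: the function ``icmp_receive`` and its ``early_error`` / ``handler_error`` flags are hypothetical, and the per-type counter on the receive path is taken to be IcmpMsgInType[N].

```python
from collections import Counter


def icmp_receive(counters, icmp_type, early_error=False, handler_error=False):
    """Apply the four receive steps to a Counter of SNMP counters."""
    counters["IcmpInMsgs"] += 1                  # step 1
    if early_error:                              # step 2, e.g. a bad checksum
        counters["IcmpInErrors"] += 1
        return
    counters[f"IcmpMsgInType{icmp_type}"] += 1   # step 3
    if handler_error:                            # step 4, error in the type handler
        counters["IcmpInErrors"] += 1


def in_type_sum(counters):
    """Sum of all IcmpMsgInType[N] counters."""
    return sum(v for k, v in counters.items() if k.startswith("IcmpMsgInType"))
```

Feeding the model one clean packet, one packet failing in step (2) and one failing in step (4) gives IcmpInMsgs = 3, a type sum of 2 and IcmpInErrors = 2, so the type sum plus IcmpInErrors (4) is larger than IcmpInMsgs, as stated above.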
				230
yupeng	80cc495	2018-11-16 11:17:40 -0800	[diff] [blame]	231	General TCP counters
				232	====================
				233	* TcpInSegs
				234	Defined in `RFC1213 tcpInSegs`_
				235
				236	.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48
				237
				238	The number of packets received by the TCP layer. As mentioned in
				239	RFC1213, it includes the packets received in error, such as checksum
				240	errors, invalid TCP headers and so on. Only one error isn't included:
				241	the layer 2 destination address is not the NIC's layer 2
				242	address. It might happen if the packet is a multicast or broadcast
				243	packet, or the NIC is in promiscuous mode. In these situations, the
				244	packets are delivered to the TCP layer, but the TCP layer discards
				245	these packets before increasing TcpInSegs. The TcpInSegs counter
				246	isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
				247	counter only increases by 1.
				248
				249	* TcpOutSegs
				250	Defined in `RFC1213 tcpOutSegs`_
				251
				252	.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48
				253
				254	The number of packets sent by the TCP layer. As mentioned in RFC1213,
				255	it excludes the retransmitted packets, but it includes the SYN, ACK
				256	and RST packets. Unlike TcpInSegs, TcpOutSegs is aware of
				257	GSO, so if a packet is split into 2 by GSO, TcpOutSegs
				258	increases by 2.
				259
				260	* TcpActiveOpens
				261	Defined in `RFC1213 tcpActiveOpens`_
				262
				263	.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47
				264
				265	It means the TCP layer sends a SYN and enters the SYN-SENT
				266	state. Every time TcpActiveOpens increases by 1, TcpOutSegs should always
				267	increase by 1.
				268
				269	* TcpPassiveOpens
				270	Defined in `RFC1213 tcpPassiveOpens`_
				271
				272	.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47
				273
				274	It means the TCP layer receives a SYN, replies with a SYN+ACK, and enters
				275	the SYN-RCVD state.
				276
yupeng	712ee16	2018-11-25 23:35:46 -0800	[diff] [blame]	277	* TcpExtTCPRcvCoalesce
				278	When packets are received by the TCP layer and have not been read by the
				279	application, the TCP layer will try to merge them. This counter
				280	indicates how many packets are merged in such a situation. If GRO is
				281	enabled, lots of packets are merged by GRO; these packets
				282	aren't counted in TcpExtTCPRcvCoalesce.
				283
				284	* TcpExtTCPAutoCorking
				285	When sending packets, the TCP layer will try to merge small packets into
				286	a bigger one. This counter increases by 1 for every packet merged in such a
				287	situation. Please refer to the LWN article for more details:
				288	https://lwn.net/Articles/576263/
				289
				290	* TcpExtTCPOrigDataSent
				291	This counter is explained by `kernel commit f19c29e3e391`_; the
				292	explanation is pasted below::
				293
				294	TCPOrigDataSent: number of outgoing packets with original data (excluding
				295	retransmission but including data-in-SYN). This counter is different from
				296	TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
				297	more useful to track the TCP retransmission rate.
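
As a hedged illustration of the retransmission rate mentioned in the quoted commit message, the sketch below divides a retransmitted segment count by TcpExtTCPOrigDataSent. Using TcpRetransSegs as the numerator is an assumption made here (that counter is not described in this document), and ``retransmission_rate`` is a hypothetical helper name.

```python
def retransmission_rate(retrans_segs: int, orig_data_sent: int) -> float:
    """Ratio of retransmitted segments to original data segments.

    retrans_segs:   assumed to be the TcpRetransSegs counter
    orig_data_sent: the TcpExtTCPOrigDataSent counter
    """
    if orig_data_sent == 0:
        # No original data sent in the sampling interval: report no retransmissions.
        return 0.0
    return retrans_segs / orig_data_sent
```

Both inputs should be deltas taken over the same sampling interval, for example from two consecutive nstat runs.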
				298
				299	* TCPSynRetrans
				300	This counter is explained by `kernel commit f19c29e3e391`_; the
				301	explanation is pasted below::
				302
				303	TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
				304	retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
				305
				306	* TCPFastOpenActiveFail
				307	This counter is explained by `kernel commit f19c29e3e391`_; the
				308	explanation is pasted below::
				309
				310	TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
				311	the remote does not accept it or the attempts timed out.
				312
				313	.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
				314
				315	* TcpExtListenOverflows and TcpExtListenDrops
				316	When the kernel receives a SYN from a client and the TCP accept queue
				317	is full, the kernel drops the SYN and adds 1 to TcpExtListenOverflows.
				318	At the same time the kernel also adds 1 to TcpExtListenDrops. When a
				319	TCP socket is in LISTEN state and the kernel needs to drop a packet,
				320	the kernel always adds 1 to TcpExtListenDrops. So an increase of
				321	TcpExtListenOverflows makes TcpExtListenDrops increase at the
				322	same time, but TcpExtListenDrops can also increase without
				323	TcpExtListenOverflows increasing, e.g. a memory allocation failure
				324	also makes TcpExtListenDrops increase.
				325
				326	Note: The above explanation is based on kernel 4.10 or later. On
				327	an older kernel, the TCP stack behaves differently when the TCP accept
				328	queue is full. On the older kernel, the TCP stack won't drop the SYN; it
				329	completes the 3-way handshake. As the accept queue is full, the TCP
				330	stack keeps the socket in the TCP half-open queue. As it is in the
				331	half-open queue, the TCP stack sends SYN+ACK on an exponential backoff
				332	timer. After the client replies with an ACK, the TCP stack checks whether the
				333	accept queue is still full. If it is not full, it moves the socket to the accept
				334	queue. If it is full, it keeps the socket in the half-open queue, and the next
				335	time the client replies with an ACK, this socket gets another chance to move
				336	to the accept queue.
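
The relationship between the two counters on a 4.10 or later kernel can be sketched as a small model. The function ``on_syn`` and its ``queue_full`` / ``alloc_failed`` flags are hypothetical names used only for illustration; the real kernel checks are more involved.

```python
from collections import Counter


def on_syn(counters, queue_full, alloc_failed=False):
    """Model a SYN arriving at a listening socket; return True if it is kept."""
    if queue_full:
        # Accept queue overflow: both counters move together.
        counters["TcpExtListenOverflows"] += 1
        counters["TcpExtListenDrops"] += 1
        return False
    if alloc_failed:
        # Any other drop in LISTEN state only touches TcpExtListenDrops.
        counters["TcpExtListenDrops"] += 1
        return False
    return True
```

In this model TcpExtListenDrops can never be smaller than TcpExtListenOverflows, which matches the description above.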
				337
				338
yupeng	80cc495	2018-11-16 11:17:40 -0800	[diff] [blame]	339	TCP Fast Path
				340	=============
				341	When the kernel receives a TCP packet, it has two paths to handle the
				342	packet: one is the fast path, the other is the slow path. The comment in the kernel
				343	code provides a good explanation of them; it is pasted below::
				344
				345	It is split into a fast path and a slow path. The fast path is
				346	disabled when:
				347
				348	- A zero window was announced from us
				349	- zero window probing
				350	is only handled properly on the slow path.
				351	- Out of order segments arrived.
				352	- Urgent data is expected.
				353	- There is no buffer space left
				354	- Unexpected TCP flags/window values/header lengths are received
				355	(detected by checking the TCP header against pred_flags)
				356	- Data is sent in both directions. The fast path only supports pure senders
				357	or pure receivers (this means either the sequence number or the ack
				358	value must stay constant)
				359	- Unexpected TCP option.
				360
				361	The kernel will try to use the fast path unless any of the above conditions
				362	is satisfied. If the packets are out of order, the kernel will handle
				363	them in the slow path, which means the performance might not be very
				364	good. The kernel also enters the slow path if "Delayed ack" is
				365	used, because when using "Delayed ack", the data is sent in both
				366	directions. When the TCP window scale option is not used, the kernel will
				367	try to enable the fast path immediately when the connection enters the
				368	established state, but if the TCP window scale option is used, the kernel
				369	will disable the fast path at first, and try to enable it after it
				370	receives packets.
				371
				372	* TcpExtTCPPureAcks and TcpExtTCPHPAcks
				373	If a packet has the ACK flag set and has no data, it is a pure ACK packet. If
				374	the kernel handles it in the fast path, TcpExtTCPHPAcks increases by 1;
				375	if the kernel handles it in the slow path, TcpExtTCPPureAcks
				376	increases by 1.
				377
				378	* TcpExtTCPHPHits
				379	If a TCP packet has data (which means it is not a pure ACK packet),
				380	and this packet is handled in the fast path, TcpExtTCPHPHits
				381	increases by 1.
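
The three counters above can be summarized as a small decision function. This is a sketch of the rules as described in this section only; ``path_counter`` is a hypothetical helper, and data packets handled in the slow path return ``None`` because none of these three counters covers them.

```python
from typing import Optional


def path_counter(has_data: bool, fast_path: bool) -> Optional[str]:
    """Name the counter bumped for a received segment, per the rules above."""
    if has_data:
        # Data segment: only the fast path has a dedicated counter here.
        return "TcpExtTCPHPHits" if fast_path else None
    # Pure ACK: one counter per path.
    return "TcpExtTCPHPAcks" if fast_path else "TcpExtTCPPureAcks"
```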
				382
				383
				384	TCP abort
				385	=========
				386
				387
				388	* TcpExtTCPAbortOnData
				389	It means the TCP layer has data in flight, but needs to close the
				390	connection. So the TCP layer sends a RST to the other side, indicating that the
				391	connection is not closed gracefully. An easy way to increase this
				392	counter is using the SO_LINGER option. Please refer to the SO_LINGER
				393	section of the `socket man page`_:
				394
				395	.. _socket man page: http://man7.org/linux/man-pages/man7/socket.7.html
				396
				397	By default, when an application closes a connection, the close function
				398	returns immediately and the kernel tries to send the in-flight data
				399	asynchronously. If you use the SO_LINGER option, set l_onoff to 1, and l_linger
				400	to a positive number, the close function won't return immediately, but
				401	waits until the in-flight data is acked by the other side; the max wait
				402	time is l_linger seconds. If you set l_onoff to 1 and l_linger to 0,
				403	when the application closes a connection, the kernel sends a RST
				404	immediately and increases the TcpExtTCPAbortOnData counter.
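
A minimal sketch of the l_onoff=1, l_linger=0 configuration, assuming Linux where ``struct linger`` is two C ints. The helper names ``enable_abortive_close`` and ``get_linger`` are hypothetical; only the standard ``socket`` and ``struct`` modules are used.

```python
import socket
import struct

LINGER_FMT = "ii"  # struct linger { int l_onoff; int l_linger; } on Linux


def enable_abortive_close(sock: socket.socket) -> None:
    """Set l_onoff=1 and l_linger=0 so that close() resets the connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack(LINGER_FMT, 1, 0))


def get_linger(sock: socket.socket):
    """Read back (l_onoff, l_linger) from the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize(LINGER_FMT))
    return struct.unpack(LINGER_FMT, raw)
```

Calling ``enable_abortive_close`` on a connected socket before ``close()`` is the setup described above for making the kernel send a RST.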
				405
				406	* TcpExtTCPAbortOnClose
				407	This counter means the application has unread data in the TCP layer when
				408	the application wants to close the TCP connection. In such a situation,
				409	kernel will send a RST to the other side of the TCP connection.
				410
				411	* TcpExtTCPAbortOnMemory
				412	When an application closes a TCP connection, the kernel still needs to track
				413	the connection and let it complete the TCP disconnect process. E.g. an
				414	app calls the close method of a socket, the kernel sends a FIN to the other
				415	side of the connection, and then the app has no relationship with the
				416	socket any more. But the kernel needs to keep the socket; this socket
				417	becomes an orphan socket. The kernel waits for the reply of the other side,
				418	and finally enters the TIME_WAIT state. When the kernel doesn't have
				419	enough memory to keep the orphan socket, it sends a RST to
				420	the other side and deletes the socket. In such a situation, the kernel
				421	adds 1 to TcpExtTCPAbortOnMemory. Two conditions trigger
				422	TcpExtTCPAbortOnMemory:
				423
				424	1. the memory used by the TCP protocol is higher than the third value of
				425	the tcp_mem. Please refer to the tcp_mem section in the `TCP man page`_:
				426
				427	.. _TCP man page: http://man7.org/linux/man-pages/man7/tcp.7.html
				428
				429	2. the orphan socket count is higher than net.ipv4.tcp_max_orphans
				430
				431
				432	* TcpExtTCPAbortOnTimeout
				433	This counter increases when any of the TCP timers expires. In such a
				434	situation, the kernel won't send a RST; it just gives up the connection.
				435
				436	* TcpExtTCPAbortOnLinger
				437	When a TCP connection enters the FIN_WAIT_2 state, instead of waiting
				438	for the FIN packet from the other side, the kernel can send a RST and
				439	delete the socket immediately. This is not the default behavior of the
				440	Linux kernel TCP stack. By configuring the TCP_LINGER2 socket option,
				441	you can let the kernel follow this behavior.
				442
				443	* TcpExtTCPAbortFailed
				444	The kernel TCP layer will send RST if the `RFC2525 2.17 section`_ is
				445	satisfied. If an internal error occurs during this process,
				446	TcpExtTCPAbortFailed will be increased.
				447
				448	.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50
				449
yupeng	712ee16	2018-11-25 23:35:46 -0800	[diff] [blame]	450	TCP Hybrid Slow Start
				451	=====================
				452	The Hybrid Slow Start algorithm is an enhancement of the traditional
				453	TCP congestion window Slow Start algorithm. It uses two pieces of
				454	information to detect whether the max bandwidth of the TCP path is
				455	approached. The two pieces of information are ACK train length and
				456	increase in packet delay. For detailed information, please refer to the
				457	`Hybrid Slow Start paper`_. When either ACK train length or packet delay
				458	hits a specific threshold, the congestion control algorithm enters
				459	the Congestion Avoidance state. As of v4.20, two congestion
				460	control algorithms use Hybrid Slow Start: cubic (the
				461	default congestion control algorithm) and cdg. Four snmp counters
				462	relate to the Hybrid Slow Start algorithm.
				463
				464	.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf
				465
				466	* TcpExtTCPHystartTrainDetect
				467	How many times the ACK train length threshold is detected.
				468
				469	* TcpExtTCPHystartTrainCwnd
				470	The sum of CWND detected by ACK train length. Dividing this value by
				471	TcpExtTCPHystartTrainDetect gives the average CWND detected by the
				472	ACK train length.
				473
				474	* TcpExtTCPHystartDelayDetect
				475	How many times the packet delay threshold is detected.
				476
				477	* TcpExtTCPHystartDelayCwnd
				478	The sum of CWND detected by packet delay. Dividing this value by
				479	TcpExtTCPHystartDelayDetect gives the average CWND detected by the
				480	packet delay.
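
The averages described for the Hystart counters are plain divisions. The helper below is a hypothetical sketch that also guards against a zero detection count; it applies equally to the Train and Delay counter pairs.

```python
from typing import Optional


def average_cwnd(cwnd_sum: int, detect_count: int) -> Optional[float]:
    """Average CWND at detection time, or None if nothing was detected.

    cwnd_sum:     TcpExtTCPHystartTrainCwnd or TcpExtTCPHystartDelayCwnd
    detect_count: TcpExtTCPHystartTrainDetect or TcpExtTCPHystartDelayDetect
    """
    if detect_count == 0:
        return None
    return cwnd_sum / detect_count
```

For example, a TcpExtTCPHystartTrainCwnd of 300 with a TcpExtTCPHystartTrainDetect of 3 gives an average CWND of 100.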
				481
yupeng	8e2ea53	2018-12-12 00:14:10 -0800	[diff] [blame^]	482	TCP retransmission and congestion control
				483	=========================================
				484	The TCP protocol has two retransmission mechanisms: SACK and fast
				485	recovery. They are mutually exclusive. When SACK is enabled,
				486	the kernel TCP stack uses SACK; otherwise it uses fast
				487	recovery. SACK is a TCP option, which is defined in `RFC2018`_.
				488	Fast recovery is defined in `RFC6582`_ and is also called
				489	'Reno'.
				490
				491	TCP congestion control is a big and complex topic. To understand
				492	the related snmp counters, we need to know the states of the congestion
				493	control state machine. There are 5 states: Open, Disorder, CWR,
				494	Recovery and Loss. For details about these states, please refer to page 5
				495	and page 6 of this document:
				496	https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf
				497
				498	.. _RFC2018: https://tools.ietf.org/html/rfc2018
				499	.. _RFC6582: https://tools.ietf.org/html/rfc6582
				500
				501	* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery
				502	When the congestion control enters the Recovery state, if SACK is
				503	used, TcpExtTCPSackRecovery increases by 1; if SACK is not used,
				504	TcpExtTCPRenoRecovery increases by 1. These two counters mean the TCP
				505	stack begins to retransmit the lost packets.
				506
				507	* TcpExtTCPSACKReneging
				508	A packet was acknowledged by SACK, but the receiver has dropped this
				509	packet, so the sender needs to retransmit this packet. In this
				510	situation, the sender adds 1 to TcpExtTCPSACKReneging. A receiver
				511	could drop a packet which has been acknowledged by SACK, although it is
				512	unusual, it is allowed by the TCP protocol. The sender doesn't really
				513	know what happened on the receiver side. The sender just waits until
				514	the RTO expires for this packet, then the sender assumes this packet
				515	has been dropped by the receiver.
				516
				517	* TcpExtTCPRenoReorder
				518	The reordered packet is detected by fast recovery. It is only used
				519	if SACK is disabled. The fast recovery algorithm detects reordering by
				520	the duplicate ACK number. E.g., if retransmission is triggered, and
				521	the original retransmitted packet is not lost but just out of
				522	order, the receiver acknowledges multiple times, once for the
				523	retransmitted packet and once for the arrival of the original out of
				524	order packet. Thus the sender finds more ACKs than it
				525	expected, and the sender knows reordering has occurred.
				526
				527	* TcpExtTCPTSReorder
				528	The reordered packet is detected when a hole is filled. E.g., assume the
				529	sender sends packets 1,2,3,4,5, and the receiving order is
				530	1,2,4,5,3. When the sender receives the ACK of packet 3 (which
				531	fills the hole), two conditions make TcpExtTCPTSReorder increase by
				532	1: (1) packet 3 has not been retransmitted yet. (2) packet
				533	3 has been retransmitted but the timestamp of packet 3's ACK is earlier
				534	than the retransmission timestamp.
				535
				536	* TcpExtTCPSACKReorder
				537	The reordered packet is detected by SACK. SACK has two methods to
				538	detect reordering: (1) A DSACK is received by the sender. It means the
				539	sender sent the same packet more than once, and the only reason
				540	is that the sender believed an out of order packet was lost, so it sent the
				541	packet again. (2) Assume packets 1,2,3,4,5 are sent by the sender, and
				542	the sender has received SACKs for packets 2 and 5. Now the sender
				543	receives a SACK for packet 4 and the sender hasn't retransmitted the
				544	packet yet, so the sender knows packet 4 is out of order. The TCP
				545	stack of the kernel increases TcpExtTCPSACKReorder for both of the
				546	above scenarios.
				547
				548
				549	DSACK
				550	=====
				551	DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
				552	duplicate packets to the sender. There are two kinds of
				553	duplications: (1) a packet which has been acknowledged is
				554	duplicated. (2) an out of order packet is duplicated. The TCP stack
				555	counts these two kinds of duplications on both the receiver side and
				556	the sender side.
				557
				558	.. _RFC2883: https://tools.ietf.org/html/rfc2883
				559
				560	* TcpExtTCPDSACKOldSent
				561	The TCP stack receives a duplicate packet which has been acked, so it
				562	sends a DSACK to the sender.
				563
				564	* TcpExtTCPDSACKOfoSent
				565	The TCP stack receives an out of order duplicate packet, so it sends a
				566	DSACK to the sender.
				567
				568	* TcpExtTCPDSACKRecv
				569	The TCP stack receives a DSACK, which indicates that an acknowledged
				570	duplicate packet was received.
				571
				572	* TcpExtTCPDSACKOfoRecv
				573	The TCP stack receives a DSACK, which indicates that an out of order
				574	duplicate packet was received.
				575
yupeng	b08794a	2018-11-10 13:38:12 -0800	[diff] [blame]	576	examples
				577	========
				578
				579	ping test
				580	---------
				581	Run the ping command against the public dns server 8.8.8.8::
				582
				583	nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
				584	PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
				585	64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms
				586
				587	--- 8.8.8.8 ping statistics ---
				588	1 packets transmitted, 1 received, 0% packet loss, time 0ms
				589	rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms
				590
				591	The nstat result::
				592
				593	nstatuser@nstat-a:~$ nstat
				594	#kernel
				595	IpInReceives 1 0.0
				596	IpInDelivers 1 0.0
				597	IpOutRequests 1 0.0
				598	IcmpInMsgs 1 0.0
				599	IcmpInEchoReps 1 0.0
				600	IcmpOutMsgs 1 0.0
				601	IcmpOutEchos 1 0.0
				602	IcmpMsgInType0 1 0.0
				603	IcmpMsgOutType8 1 0.0
				604	IpExtInOctets 84 0.0
				605	IpExtOutOctets 84 0.0
				606	IpExtInNoECTPkts 1 0.0
				607
The Linux server sent an ICMP Echo packet, so IpOutRequests,
IcmpOutMsgs, IcmpOutEchos and IcmpMsgOutType8 were increased by 1. The
server got an ICMP Echo Reply from 8.8.8.8, so IpInReceives, IcmpInMsgs,
IcmpInEchoReps and IcmpMsgInType0 were increased by 1. The ICMP Echo Reply
was passed to the ICMP layer via the IP layer, so IpInDelivers was
increased by 1. The default ping data size is 56 bytes (ping reports it
as "56(84) bytes of data"), so an ICMP Echo packet and its corresponding
Echo Reply packet consist of:

* 14 bytes MAC header (not counted by the IP layer)
* 20 bytes IP header
* 8 bytes ICMP header
* 56 bytes data (default value of the ping command)

So the IpExtInOctets and IpExtOutOctets are 20+8+56=84.
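
The byte count can be checked with a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and it assumes an IPv4 header without options, the 8-byte ICMP echo header and the 56-byte payload that ping reports as "56(84) bytes of data".

```python
# Sizes in bytes. These are the usual values, assumed here for illustration.
IPV4_HEADER = 20            # IPv4 header without options
ICMP_ECHO_HEADER = 8        # type, code, checksum, identifier, sequence
PING_DEFAULT_PAYLOAD = 56   # payload size ping uses when -s is not given


def ip_octets(l4_header, payload, ip_header=IPV4_HEADER):
    """Return the IP datagram length, which is what IpExt*Octets count."""
    return ip_header + l4_header + payload


echo_octets = ip_octets(ICMP_ECHO_HEADER, PING_DEFAULT_PAYLOAD)
print(echo_octets)  # 84
```

The MAC header is left out on purpose, because the counters are maintained by the IP layer.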

TCP 3-way handshake
-------------------
				625	On server side, we run::
				626
				627	nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
				628	Listening on [0.0.0.0] (family 0, port 9000)
				629
				630	On client side, we run::
				631
				632	nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
				633	Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
				634
The server listened on TCP port 9000, the client connected to it, and they
completed the 3-way handshake.

On the server side, nstat shows::

  nstatuser@nstat-b:~$ nstat | grep -i tcp
  TcpPassiveOpens 1 0.0
  TcpInSegs 2 0.0
  TcpOutSegs 1 0.0
  TcpExtTCPPureAcks 1 0.0

On the client side, nstat shows::

  nstatuser@nstat-a:~$ nstat | grep -i tcp
  TcpActiveOpens 1 0.0
  TcpInSegs 1 0.0
  TcpOutSegs 2 0.0
				652
When the server received the first SYN, it replied with a SYN+ACK and
entered the SYN-RCVD state, so TcpPassiveOpens increased by 1. The server
received a SYN, sent a SYN+ACK and received an ACK, so it sent 1 packet and
received 2 packets: TcpInSegs increased by 2 and TcpOutSegs increased by 1.
The last ACK of the 3-way handshake is a pure ACK without data, so
TcpExtTCPPureAcks increased by 1.

When the client sent the SYN, it entered the SYN-SENT state, so
TcpActiveOpens increased by 1. The client sent a SYN, received a SYN+ACK
and sent an ACK, so it sent 2 packets and received 1 packet: TcpInSegs
increased by 1 and TcpOutSegs increased by 2.
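
The segment counts above can be reproduced with a small bookkeeping model. It is a sketch for illustration only: it tallies who sends and who receives each handshake segment and does not touch the kernel. The names `HANDSHAKE` and `count_segments` are hypothetical.

```python
from collections import Counter

# Segments of a three-way handshake as (sender, flags) pairs.
HANDSHAKE = [("client", "SYN"), ("server", "SYN+ACK"), ("client", "ACK")]


def count_segments(segments):
    """Tally TcpInSegs/TcpOutSegs per host for a two-party exchange."""
    stats = {"client": Counter(), "server": Counter()}
    for sender, _flags in segments:
        receiver = "server" if sender == "client" else "client"
        stats[sender]["TcpOutSegs"] += 1
        stats[receiver]["TcpInSegs"] += 1
    return stats


stats = count_segments(HANDSHAKE)
print(dict(stats["server"]))
print(dict(stats["client"]))
```

The server ends up with 2 segments in and 1 out, the client with 1 in and 2 out, matching the nstat outputs of the test.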
				664
TCP normal traffic
------------------
				667	Run nc on server::
				668
				669	nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
				670	Listening on [0.0.0.0] (family 0, port 9000)
				671
				672	Run nc on client::
				673
				674	nstatuser@nstat-a:~$ nc -v nstat-b 9000
				675	Connection to nstat-b 9000 port [tcp/*] succeeded!
				676
				677	Input a string in the nc client ('hello' in our example)::
				678
				679	nstatuser@nstat-a:~$ nc -v nstat-b 9000
				680	Connection to nstat-b 9000 port [tcp/*] succeeded!
				681	hello
				682
				683	The client side nstat output::
				684
				685	nstatuser@nstat-a:~$ nstat
				686	#kernel
				687	IpInReceives 1 0.0
				688	IpInDelivers 1 0.0
				689	IpOutRequests 1 0.0
				690	TcpInSegs 1 0.0
				691	TcpOutSegs 1 0.0
				692	TcpExtTCPPureAcks 1 0.0
				693	TcpExtTCPOrigDataSent 1 0.0
				694	IpExtInOctets 52 0.0
				695	IpExtOutOctets 58 0.0
				696	IpExtInNoECTPkts 1 0.0
				697
				698	The server side nstat output::
				699
				700	nstatuser@nstat-b:~$ nstat
				701	#kernel
				702	IpInReceives 1 0.0
				703	IpInDelivers 1 0.0
				704	IpOutRequests 1 0.0
				705	TcpInSegs 1 0.0
				706	TcpOutSegs 1 0.0
				707	IpExtInOctets 58 0.0
				708	IpExtOutOctets 52 0.0
				709	IpExtInNoECTPkts 1 0.0
				710
Type another string into the nc client ('world' in our example)::
				712
				713	nstatuser@nstat-a:~$ nc -v nstat-b 9000
				714	Connection to nstat-b 9000 port [tcp/*] succeeded!
				715	hello
				716	world
				717
				718	Client side nstat output::
				719
				720	nstatuser@nstat-a:~$ nstat
				721	#kernel
				722	IpInReceives 1 0.0
				723	IpInDelivers 1 0.0
				724	IpOutRequests 1 0.0
				725	TcpInSegs 1 0.0
				726	TcpOutSegs 1 0.0
				727	TcpExtTCPHPAcks 1 0.0
				728	TcpExtTCPOrigDataSent 1 0.0
				729	IpExtInOctets 52 0.0
				730	IpExtOutOctets 58 0.0
				731	IpExtInNoECTPkts 1 0.0
				732
				733
				734	Server side nstat output::
				735
				736	nstatuser@nstat-b:~$ nstat
				737	#kernel
				738	IpInReceives 1 0.0
				739	IpInDelivers 1 0.0
				740	IpOutRequests 1 0.0
				741	TcpInSegs 1 0.0
				742	TcpOutSegs 1 0.0
				743	TcpExtTCPHPHits 1 0.0
				744	IpExtInOctets 58 0.0
				745	IpExtOutOctets 52 0.0
				746	IpExtInNoECTPkts 1 0.0
				747
Comparing the first and the second client-side nstat outputs, we find
one difference: the first one had a 'TcpExtTCPPureAcks', but the second
one had a 'TcpExtTCPHPAcks'. The server-side outputs differed too: the
second one had a 'TcpExtTCPHPHits', but the first one didn't. The
network traffic patterns were exactly the same: the client sent a
packet to the server and the server replied with an ACK. But the kernel
handled them in different ways. When the TCP window scale option is not
used, the kernel tries to enable the fast path immediately when the
connection enters the established state. If the TCP window scale option
is used, the kernel disables the fast path at first and tries to enable
it after it receives packets. We can use the 'ss' command to verify
whether the window scale option is used, e.g. run the command below on
either the server or the client::
				763
  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
  Netid Recv-Q Send-Q Local Address:Port Peer Address:Port
  tcp 0 0 192.168.122.250:40654 192.168.122.251:9000
  ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98
				768
The 'wscale:7,7' means that both the server and the client set the
window scale option to 7. Now we can explain the nstat output in our test:

In the first client-side nstat output, the client sent a packet and the
server replied with an ACK. When the kernel handled this ACK, the fast
path was not enabled, so the ACK was counted into 'TcpExtTCPPureAcks'.

In the second client-side nstat output, the client sent a packet again
and received another ACK from the server. This time the fast path was
enabled and the ACK qualified for it, so the ACK was handled by the fast
path and counted into 'TcpExtTCPHPAcks'.

In the first server-side nstat output, the fast path was not enabled,
so there was no 'TcpExtTCPHPHits'.

In the second server-side nstat output, the fast path was enabled and
the packet received from the client qualified for it, so it was counted
into 'TcpExtTCPHPHits'.
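
The window scale check can also be scripted. The helper below is a hypothetical sketch, not part of the ss tool: it looks for the `wscale:` field in one line of `ss -i` output, such as the line shown in the test, and treats a missing field as the option not being in use.

```python
import re

# Matches the "wscale:<a>,<b>" field as printed by `ss -i`.
WSCALE_RE = re.compile(r"\bwscale:(\d+),(\d+)")


def parse_wscale(ss_info_line):
    """Return the two wscale values as a tuple, or None if the field is absent."""
    match = WSCALE_RE.search(ss_info_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448"
print(parse_wscale(sample))                   # (7, 7)
print(parse_wscale("ts sack cubic rto:204"))  # None
```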
				787
TcpExtTCPAbortOnClose
---------------------
On the server side, we run the Python script below::
				791
  import socket
  import time

  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  # Accept one connection, but never read from it.
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)
				803
This Python script listens on port 9000, but doesn't read anything from
the connection.

On the client side, we send the string "hello" with nc::

  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000
				810
Then we come back to the server side. The server has received the "hello"
packet and the TCP layer has acked it, but the application hasn't read it
yet. We type Ctrl-C to terminate the server script. Then we find that
TcpExtTCPAbortOnClose increased by 1 on the server side::

  nstatuser@nstat-b:~$ nstat | grep -i abort
  TcpExtTCPAbortOnClose 1 0.0

If we run tcpdump on the server side, we can see that the server sent a
RST after we typed Ctrl-C.
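
The counter can also be read without nstat. On Linux, /proc/net/netstat lists the TcpExt and IpExt counters as pairs of lines, one with the field names and one with the values, and nstat names a counter by joining the prefix and the field name. The parser below is a minimal sketch under that assumption; `parse_netstat` is a hypothetical helper and the sample text is a shortened, made-up excerpt.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {nstat_name: value}."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Lines come in pairs: a header line with names, then a line with values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            raise ValueError("header and value lines are out of sync")
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix + name] = int(number)
    return counters


# Shortened, made-up excerpt in the /proc/net/netstat layout.
sample = (
    "TcpExt: TCPAbortOnData TCPAbortOnClose TCPAbortOnMemory\n"
    "TcpExt: 0 1 0\n"
)
print(parse_netstat(sample)["TcpExtTCPAbortOnClose"])  # 1
```

On a live system the text would come from reading /proc/net/netstat. The values there are absolute totals, while nstat prints the change since its previous run.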
				821
TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
--------------------------------------------------
Below is an example which lets the orphan socket count exceed
net.ipv4.tcp_max_orphans.
Change tcp_max_orphans to a smaller value on the client::
				827
				828	sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"
				829
Client code (create 64 connections to the server)::

  nstatuser@nstat-a:~$ cat client_orphan.py
  import socket
  import time

  server = 'nstat-b' # server address
  port = 9000

  count = 64

  connection_list = []

  for i in range(count):
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.connect((server, port))
      connection_list.append(s)
      print("connection_count: %d" % len(connection_list))

  # Keep the connections open until the script is interrupted.
  while True:
      time.sleep(99999)
				851
Server code (accept 64 connections from the client)::

  nstatuser@nstat-b:~$ cat server_orphan.py
  import socket
  import time

  port = 9000
  count = 64

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(count)
  connection_list = []
  # Accept connections and keep them open without reading any data.
  while True:
      sock, addr = s.accept()
      connection_list.append((sock, addr))
      print("connection_count: %d" % len(connection_list))
				869
				870	Run the python scripts on server and client.
				871
				872	On server::
				873
				874	python3 server_orphan.py
				875
				876	On client::
				877
				878	python3 client_orphan.py
				879
				880	Run iptables on server::
				881
				882	sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP
				883
Type Ctrl-C on the client to stop client_orphan.py.

Check TcpExtTCPAbortOnMemory on the client::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnMemory 54 0.0

Check the orphan socket count on the client::
				892
				893	nstatuser@nstat-a:~$ ss -s
				894	Total: 131 (kernel 0)
				895	TCP: 14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0
				896
				897	Transport Total IP IPv6
				898	* 0 - -
				899	RAW 1 0 1
				900	UDP 1 1 0
				901	TCP 14 13 1
				902	INET 16 14 2
				903	FRAG 0 0 0
				904
The explanation of the test: after running server_orphan.py and
client_orphan.py, we have 64 connections between the server and the
client. After the iptables command, the server drops all packets from
the client. When we type Ctrl-C on client_orphan.py, the client system
tries to close these connections, and before they are closed gracefully
they become orphan sockets. As iptables on the server blocks packets
from the client, the server never receives the FIN from the client, so
all connections on the client are stuck in the FIN_WAIT_1 state and stay
orphan sockets until they time out. We have written 10 to
/proc/sys/net/ipv4/tcp_max_orphans, so the client system keeps only 10
orphan sockets; for all other orphan sockets, the client system sends a
RST and deletes them. We have 64 connections, so the 'ss -s' command
shows that the system has 10 orphan sockets, and the value of
TcpExtTCPAbortOnMemory is 54.
				919
An additional explanation about the orphan socket count: you can find
the exact orphan socket count with the 'ss -s' command, but when the
kernel decides whether to increase TcpExtTCPAbortOnMemory and send a
RST, it doesn't always check the exact orphan socket count. To improve
performance, the kernel checks an approximate count first, and only if
the approximate count is more than tcp_max_orphans does it check the
exact count. So if the approximate count is less than tcp_max_orphans
but the exact count is more than tcp_max_orphans, TcpExtTCPAbortOnMemory
is not increased at all. If tcp_max_orphans is large enough, this won't
occur, but if you decrease tcp_max_orphans to a small value as in our
test, you might hit this issue. That is why the client in our test set
up 64 connections although tcp_max_orphans is 10. If the client set up
only 11 connections, we wouldn't see any change of
TcpExtTCPAbortOnMemory.
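
The two-stage check can be illustrated with a toy model. The sketch below assumes that the approximate count comes from a batched per-CPU counter, where each CPU folds its local delta into a shared total only once the delta reaches a batch size. The class, the function names and the batch size of 32 are assumptions chosen for illustration, not the kernel's code.

```python
class BatchedCounter:
    """Toy counter: a CPU folds its local delta into the shared total
    only once the delta reaches `batch` in magnitude."""

    def __init__(self, cpus, batch):
        self.batch = batch
        self.shared = 0
        self.local = [0] * cpus

    def add(self, cpu, amount=1):
        self.local[cpu] += amount
        if abs(self.local[cpu]) >= self.batch:
            self.shared += self.local[cpu]
            self.local[cpu] = 0

    def read_approximate(self):
        # Cheap read: ignores deltas still held by the CPUs.
        return max(self.shared, 0)

    def read_exact(self):
        # Expensive read: adds every per-CPU delta.
        return max(self.shared + sum(self.local), 0)


def too_many_orphans(counter, max_orphans):
    """Approximate read first; the exact sum is taken only if that fails."""
    if counter.read_approximate() > max_orphans:
        return counter.read_exact() > max_orphans
    return False


def orphan_sockets(count, cpus=2, batch=32):
    """Build a counter after `count` orphans were spread over the CPUs."""
    counter = BatchedCounter(cpus, batch)
    for i in range(count):
        counter.add(cpu=i % cpus)
    return counter


few = orphan_sockets(11)
many = orphan_sockets(64)
print(few.read_exact(), few.read_approximate(), too_many_orphans(few, 10))
print(many.read_exact(), many.read_approximate(), too_many_orphans(many, 10))
```

With 11 orphans the exact count is above the limit of 10, but the approximate count is still 0, so the check does not fire. With 64 orphans both reads exceed the limit.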
				934
Continuing the previous test, we wait for several minutes. Because
iptables on the server blocked the traffic, the server wouldn't receive
the FIN, and all the client's orphan sockets would finally time out in
the FIN_WAIT_1 state. So after a few minutes, we find 10 timeouts on the
client::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnTimeout 10 0.0
				943
TcpExtTCPAbortOnLinger
----------------------
The server side code::

  nstatuser@nstat-b:~$ cat server_linger.py
  import socket
  import time

  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  # Accept one connection and keep it open without reading or closing.
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)
				960
The client side code::

  nstatuser@nstat-a:~$ cat client_linger.py
  import socket
  import struct

  server = 'nstat-b' # server address
  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  # struct linger: l_onoff=1 (enabled), l_linger=10 (seconds)
  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
  # Negative TCP_LINGER2: don't wait in FIN_WAIT_2 for the peer's FIN
  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
  s.connect((server, port))
  s.close()
				975
				976	Run server_linger.py on server::
				977
				978	nstatuser@nstat-b:~$ python3 server_linger.py
				979
				980	Run client_linger.py on client::
				981
				982	nstatuser@nstat-a:~$ python3 client_linger.py
				983
After running client_linger.py, check the output of nstat::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnLinger 1 0.0

TcpExtTCPRcvCoalesce
--------------------
On the server, we run a program which listens on TCP port 9000, but
doesn't read any data::
				993
  import socket
  import time
  port = 9000
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  # Accept one connection, but never read from it.
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)
				1003
				1004	Save the above code as server_coalesce.py, and run::
				1005
				1006	python3 server_coalesce.py
				1007
On the client, save the code below as client_coalesce.py::

  import socket
  server = 'nstat-b'
  port = 9000
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect((server, port))
				1015
				1016	Run::
				1017
				1018	nstatuser@nstat-a:~$ python3 -i client_coalesce.py
				1019
We use '-i' to enter the interactive mode, then send a packet::
				1021
				1022	>>> s.send(b'foo')
				1023	3
				1024
				1025	Send a packet again::
				1026
				1027	>>> s.send(b'bar')
				1028	3
				1029
				1030	On the server, run nstat::
				1031
				1032	ubuntu@nstat-b:~$ nstat
				1033	#kernel
				1034	IpInReceives 2 0.0
				1035	IpInDelivers 2 0.0
				1036	IpOutRequests 2 0.0
				1037	TcpInSegs 2 0.0
				1038	TcpOutSegs 2 0.0
				1039	TcpExtTCPRcvCoalesce 1 0.0
				1040	IpExtInOctets 110 0.0
				1041	IpExtOutOctets 104 0.0
				1042	IpExtInNoECTPkts 2 0.0
				1043
The client sent two packets and the server didn't read any data. When
the second packet arrived at the server, the first packet was still in
the receive queue, so the TCP layer merged the two packets and
TcpExtTCPRcvCoalesce increased by 1.
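
The merge can be pictured with a toy receive queue. This is a simplified sketch rather than the kernel's logic: it merges a new segment into the tail entry whenever the new data directly follows the tail in sequence space, and counts each merge. The class name and the sequence numbers are made up.

```python
class ReceiveQueue:
    """Toy receive queue holding (start_seq, data) entries not yet read."""

    def __init__(self):
        self.entries = []
        self.rcv_coalesce = 0  # stands in for TcpExtTCPRcvCoalesce

    def enqueue(self, seq, data):
        if self.entries:
            tail_seq, tail_data = self.entries[-1]
            # New data directly follows the tail: merge into the tail entry.
            if tail_seq + len(tail_data) == seq:
                self.entries[-1] = (tail_seq, tail_data + data)
                self.rcv_coalesce += 1
                return
        self.entries.append((seq, data))


queue = ReceiveQueue()
queue.enqueue(1, b"foo")   # first packet, nothing to merge with
queue.enqueue(4, b"bar")   # follows "foo" directly, so it is merged
print(queue.entries, queue.rcv_coalesce)
```

If the application had read "foo" before "bar" arrived, the queue would have been empty and no merge would have been counted.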
				1048
TcpExtListenOverflows and TcpExtListenDrops
-------------------------------------------
On the server, run the nc command and listen on port 9000::
				1052
				1053	nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
				1054	Listening on [0.0.0.0] (family 0, port 9000)
				1055
				1056	On client, run 3 nc commands in different terminals::
				1057
				1058	nstatuser@nstat-a:~$ nc -v nstat-b 9000
				1059	Connection to nstat-b 9000 port [tcp/*] succeeded!
				1060
The nc command accepts only 1 connection at a time, and the accept queue
length is 1. In the current Linux implementation, setting the queue
length to n means the actual queue length is n+1. Now we have created 3
connections: 1 is accepted by nc and 2 are in the accept queue, so the
accept queue is full.
				1065
				1066	Before running the 4th nc, we clean the nstat history on the server::
				1067
				1068	nstatuser@nstat-b:~$ nstat -n
				1069
				1070	Run the 4th nc on the client::
				1071
				1072	nstatuser@nstat-a:~$ nc -v nstat-b 9000
				1073
If the nc server is running on kernel 4.10 or a later version, you
won't see the "Connection to ... succeeded!" string, because the kernel
drops the SYN if the accept queue is full. If the nc server is running
on an older kernel, you would see that the connection succeeded,
because the kernel would complete the 3-way handshake and keep the
socket in the half-open queue. This test was done on kernel 4.15. Below
is the nstat output on the server::
				1081
				1082	nstatuser@nstat-b:~$ nstat
				1083	#kernel
				1084	IpInReceives 4 0.0
				1085	IpInDelivers 4 0.0
				1086	TcpInSegs 4 0.0
				1087	TcpExtListenOverflows 4 0.0
				1088	TcpExtListenDrops 4 0.0
				1089	IpExtInOctets 240 0.0
				1090	IpExtInNoECTPkts 4 0.0
				1091
Both TcpExtListenOverflows and TcpExtListenDrops were 4. If the time
between the 4th nc and the nstat was longer, the values of
TcpExtListenOverflows and TcpExtListenDrops would be larger, because
the SYN of the 4th nc was dropped and the client kept retrying.
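
The queue arithmetic of this test can be modelled in a few lines. The sketch below is a simplification with hypothetical names: it treats the queue as full only when it holds more than `backlog` entries, which gives the n+1 capacity described above, and it completes the handshake as soon as a SYN is admitted.

```python
class Listener:
    """Toy listening socket with a bounded accept queue."""

    def __init__(self, backlog):
        self.backlog = backlog
        self.accept_queue = []
        self.listen_overflows = 0  # stands in for TcpExtListenOverflows
        self.listen_drops = 0      # stands in for TcpExtListenDrops

    def is_full(self):
        # Full means strictly more than `backlog` entries, so the queue
        # can hold backlog + 1 connections.
        return len(self.accept_queue) > self.backlog

    def on_syn(self, peer):
        if self.is_full():
            self.listen_overflows += 1
            self.listen_drops += 1
            return False  # SYN dropped; the client will send it again
        self.accept_queue.append(peer)  # handshake completes (simplified)
        return True

    def accept(self):
        return self.accept_queue.pop(0)


listener = Listener(backlog=1)
listener.on_syn("client-1")
listener.accept()            # nc takes the first connection
listener.on_syn("client-2")
listener.on_syn("client-3")  # the queue now holds backlog + 1 entries
# The four SYN arrivals from the 4th client seen in the nstat output.
results = [listener.on_syn("client-4") for _ in range(4)]
print(listener.listen_overflows, listener.listen_drops)
```

Every dropped SYN raises both counters in this model, which is why they move together in the test.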

IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
-------------------------------------------------
Server A IP address: 192.168.122.250

Server B IP address: 192.168.122.251

Prepare on server A by adding a route to 8.8.8.8 via server B::
				1102
				1103	$ sudo ip route add 8.8.8.8/32 via 192.168.122.251
				1104
				1105	Prepare on server B, disable send_redirects for all interfaces::
				1106
				1107	$ sudo sysctl -w net.ipv4.conf.all.send_redirects=0
				1108	$ sudo sysctl -w net.ipv4.conf.ens3.send_redirects=0
				1109	$ sudo sysctl -w net.ipv4.conf.lo.send_redirects=0
				1110	$ sudo sysctl -w net.ipv4.conf.default.send_redirects=0
				1111
We want server A to send a packet to 8.8.8.8 and route it to server B.
When server B receives such a packet, it might send an ICMP Redirect
message to server A; setting send_redirects to 0 disables this
behavior.

First, generate IpInAddrErrors. On server B, we disable IP forwarding::
				1118
				1119	$ sudo sysctl -w net.ipv4.conf.all.forwarding=0
				1120
				1121	On server A, we send packets to 8.8.8.8::
				1122
				1123	$ nc -v 8.8.8.8 53
				1124
				1125	On server B, we check the output of nstat::
				1126
				1127	$ nstat
				1128	#kernel
				1129	IpInReceives 3 0.0
				1130	IpInAddrErrors 3 0.0
				1131	IpExtInOctets 180 0.0
				1132	IpExtInNoECTPkts 3 0.0
				1133
As we have let server A route 8.8.8.8 to server B and disabled IP
forwarding on server B, server A sent packets to server B, and server B
dropped them and increased IpInAddrErrors. As the client retransmits
the SYN packet if it doesn't receive a SYN+ACK, we find multiple
IpInAddrErrors.
				1139
				1140	Second, generate IpExtInNoRoutes. On server B, we enable IP
				1141	forwarding::
				1142
				1143	$ sudo sysctl -w net.ipv4.conf.all.forwarding=1
				1144
				1145	Check the route table of server B and remove the default route::
				1146
				1147	$ ip route show
				1148	default via 192.168.122.1 dev ens3 proto static
				1149	192.168.122.0/24 dev ens3 proto kernel scope link src 192.168.122.251
				1150	$ sudo ip route delete default via 192.168.122.1 dev ens3 proto static
				1151
				1152	On server A, we contact 8.8.8.8 again::
				1153
				1154	$ nc -v 8.8.8.8 53
				1155	nc: connect to 8.8.8.8 port 53 (tcp) failed: Network is unreachable
				1156
				1157	On server B, run nstat::
				1158
				1159	$ nstat
				1160	#kernel
				1161	IpInReceives 1 0.0
				1162	IpOutRequests 1 0.0
				1163	IcmpOutMsgs 1 0.0
				1164	IcmpOutDestUnreachs 1 0.0
				1165	IcmpMsgOutType3 1 0.0
				1166	IpExtInNoRoutes 1 0.0
				1167	IpExtInOctets 60 0.0
				1168	IpExtOutOctets 88 0.0
				1169	IpExtInNoECTPkts 1 0.0
				1170
We enabled IP forwarding on server B, so when server B received a packet
whose destination IP address is 8.8.8.8, it tried to forward the packet.
We had deleted the default route, so there was no route for 8.8.8.8;
server B increased IpExtInNoRoutes and sent the "ICMP Destination
Unreachable" message to server A.
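
The octet counters in this output fit together as well. An ICMP error message carries its own IP and ICMP headers followed by a copy of the offending datagram. The sketch below assumes the whole 60-byte packet from server A is quoted and caps the message at 576 bytes, following the recommendation in RFC 1812; the function and constant names are made up for illustration.

```python
IPV4_HEADER = 20   # IPv4 header without options (assumed)
ICMP_HEADER = 8    # fixed part of an ICMP error message


def icmp_error_octets(offending_len, max_total=576):
    """Length of an ICMP error that quotes as much of the offending
    datagram as fits within max_total bytes."""
    room = max_total - IPV4_HEADER - ICMP_HEADER
    quoted = min(offending_len, room)
    return IPV4_HEADER + ICMP_HEADER + quoted


print(icmp_error_octets(60))  # 88
```

The 60 bytes correspond to IpExtInOctets and the 88 bytes to IpExtOutOctets in the nstat output of server B.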
				1176
				1177	Third, generate IpOutNoRoutes. Run ping command on server B::
				1178
				1179	$ ping -c 1 8.8.8.8
				1180	connect: Network is unreachable
				1181
				1182	Run nstat on server B::
				1183
				1184	$ nstat
				1185	#kernel
				1186	IpOutNoRoutes 1 0.0
				1187
				1188	We have deleted the default route on server B. Server B couldn't find
				1189	a route for the 8.8.8.8 IP address, so server B increased
				1190	IpOutNoRoutes.
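
The three tests can be summarized as a decision procedure. The sketch below is a simplified model, not the kernel's routing code: it uses first-match instead of longest-prefix lookup, ignores broadcast and multicast, and all function and parameter names are hypothetical.

```python
import ipaddress


def lookup(routes, destination):
    """Return the first route prefix containing destination, else None."""
    address = ipaddress.ip_address(destination)
    for prefix in routes:
        if address in ipaddress.ip_network(prefix):
            return prefix
    return None


def classify(destination, local_address, routes, forwarding, locally_generated):
    """Name the error counter that one of the three tests would bump, or None."""
    if locally_generated:
        return None if lookup(routes, destination) else "IpOutNoRoutes"
    if destination == local_address:
        return None  # delivered locally, no error counter
    if not forwarding:
        return "IpInAddrErrors"
    return None if lookup(routes, destination) else "IpExtInNoRoutes"


# Server B after the default route was removed.
routes_b = ["192.168.122.0/24"]
b = "192.168.122.251"
print(classify("8.8.8.8", b, routes_b, forwarding=False, locally_generated=False))
print(classify("8.8.8.8", b, routes_b, forwarding=True, locally_generated=False))
print(classify("8.8.8.8", b, routes_b, forwarding=True, locally_generated=True))
```

The three calls correspond to the first, second and third test and print IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes respectively.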