Blame - Documentation/x86/tlb.rst - SHIFTPHONES/mainline/linux

blob: 82ec58ae63a84d6539fd1bb68f20bbec348ecc97 [file] [log] [blame]

Changbin Du	1715604	2019-05-08 23:21:23 +0800	[diff] [blame]	1	.. SPDX-License-Identifier: GPL-2.0
				2
				3	=======
				4	The TLB
				5	=======
				6
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	7	When the kernel unmaps or modifies the attributes of a range of
				8	memory, it has two choices:
Changbin Du	1715604	2019-05-08 23:21:23 +0800	[diff] [blame]	9
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	10	1. Flush the entire TLB with a two-instruction sequence. This is
				11	a quick operation, but it causes collateral damage: TLB entries
				12	from areas other than the one we are trying to flush will be
				13	destroyed and must be refilled later, at some cost.
				14	2. Use the invlpg instruction to invalidate a single page at a
Masanari Iida	c76a093	2016-07-01 12:46:01 +0900	[diff] [blame]	15	time. This could potentially cost many more instructions, but
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	16	it is a much more precise operation, causing no collateral
				17	damage to other TLB entries.
				18
				19	Which method to use depends on a few things:
Changbin Du	1715604	2019-05-08 23:21:23 +0800	[diff] [blame]	20
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	21	1. The size of the flush being performed. A flush of the entire
				22	address space is obviously better performed by flushing the
				23	entire TLB than doing 2^48/PAGE_SIZE individual flushes.
				24	2. The contents of the TLB. If the TLB is empty, then there will
				25	be no collateral damage caused by doing the global flush, and
				26	all of the individual flushes will have ended up being wasted
				27	work.
				28	3. The size of the TLB. The larger the TLB, the more collateral
				29	damage we do with a full flush. So, the larger the TLB, the
Masanari Iida	c76a093	2016-07-01 12:46:01 +0900	[diff] [blame]	30	more attractive an individual flush looks. Data and
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	31	instructions have separate TLBs, as do different page sizes.
				32	4. The microarchitecture. The TLB has become a multi-level
				33	cache on modern CPUs, and the global flushes have become more
				34	expensive relative to single-page flushes.
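
To put the first point in numbers, here is a rough sketch of how many individual invalidations a range would need. It assumes 4 KiB pages and a 48-bit virtual address space, matching the 2^48/PAGE_SIZE figure above; the helper name is made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages
VA_BITS = 48      # assumption: 48-bit virtual address space


def pages_in_range(start: int, end: int, page_size: int = PAGE_SIZE) -> int:
    """Count the page-sized invalidations needed to cover [start, end)."""
    first = start // page_size                  # index of the first page touched
    last = (end + page_size - 1) // page_size   # one past the last page touched
    return last - first


# Flushing the whole address space one page at a time:
full_address_space = pages_in_range(0, 1 << VA_BITS)
print(full_address_space)  # 2**36 single-page flushes
```

At that scale a full flush is the only sensible choice; the interesting cases are ranges of a few pages to a few hundred pages.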
				35
				36	There is obviously no way the kernel can know all these things,
				37	especially the contents of the TLB during a given flush. The
				38	sizes of the flush will vary greatly depending on the workload as
				39	well. There is essentially no "right" point to choose.
				40
				41	You may be doing too many individual invalidations if you see the
				42	invlpg instruction (or instructions _near_ it) show up high in
				43	profiles. If you believe that individual invalidations are being
Changbin Du	1715604	2019-05-08 23:21:23 +0800	[diff] [blame]	44	called too often, you can lower the tunable::
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	45
Jeremiah Mahler	129ea00	2014-08-08 00:49:55 -0700	[diff] [blame]	46	    /sys/kernel/debug/x86/tlb_single_page_flush_ceiling
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	47
				48	This will cause us to do the global flush for more cases.
				49	Lowering it to 0 will disable the use of the individual flushes.
				50	Setting it to 1 is a very conservative setting and it should
				51	never need to be 0 under normal circumstances.
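
The effect of the tunable can be modelled with a small sketch. This illustrates the behaviour described above and is not the kernel's code: the function name and the exact comparison (full flush once the page count exceeds the ceiling) are assumptions chosen to match the text.

```python
def choose_flush(nr_pages: int, ceiling: int) -> str:
    """Hypothetical model of the flush decision.

    Ranges larger than `ceiling` pages get a full TLB flush;
    anything at or below the ceiling is flushed page by page.
    """
    if nr_pages > ceiling:
        return "full"
    return "individual"


# Compare a few ceiling values against a few range sizes (in pages).
for ceiling in (0, 1, 33):
    decisions = [choose_flush(n, ceiling) for n in (1, 2, 33, 34)]
    print(ceiling, decisions)
```

With a ceiling of 0 even a one-page range takes the full flush, which is why that setting disables individual flushes entirely; with a ceiling of 1 only single-page ranges use invlpg.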
				52
				53	Despite the fact that a single individual flush on x86 is
Changbin Du	1715604	2019-05-08 23:21:23 +0800	[diff] [blame]	54	guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	55	flushes. THP is treated exactly the same as normal memory.
				56
				57	You might see invlpg inside of flush_tlb_mm_range() show up in
				58	profiles, or you can use the trace_tlb_flush() tracepoints to
				59	determine how long the flush operations are taking.
				60
				61	Essentially, you are balancing the cycles you spend doing invlpg
				62	with the cycles that you spend refilling the TLB later.
				63
				64	You can measure how expensive TLB refills are by using
Changbin Du	1715604	2019-05-08 23:21:23 +0800	[diff] [blame]	65	performance counters and 'perf stat', like this::
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	66
Changbin Du	1715604	2019-05-08 23:21:23 +0800	[diff] [blame]	67	    perf stat -e
				68	      cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
				69	      cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
				70	      cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
				71	      cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
				72	      cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
				73	      cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	74
				75	That works on an IvyBridge-era CPU (i5-3320M). Different CPUs
				76	may have differently-named counters, but they should at least
				77	be there in some form. You can use pmu-tools 'ocperf list'
				78	(https://github.com/andikleen/pmu-tools) to find the right
				79	counters for a given CPU.
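
One way to read these counters is to divide each walk_duration count by the matching walk_completed count, which gives a rough average cost in cycles per completed page walk. The sketch below assumes the counts were copied by hand from the 'perf stat' output into a dictionary keyed by the event names used above; the helper name and the sample numbers are made up for illustration and are not measurements.

```python
def avg_walk_cycles(counts: dict, prefix: str) -> float:
    """Average cycles per completed page walk for one counter pair.

    `prefix` is the event name without the _walk_duration or
    _walk_completed suffix, e.g. "dtlb_load_misses".
    """
    completed = counts[f"{prefix}_walk_completed"]
    if completed == 0:
        return 0.0  # no walks completed, so there is nothing to average
    return counts[f"{prefix}_walk_duration"] / completed


# Made-up sample values, only to show the calculation.
sample = {
    "dtlb_load_misses_walk_duration": 3000,
    "dtlb_load_misses_walk_completed": 100,
    "dtlb_store_misses_walk_duration": 500,
    "dtlb_store_misses_walk_completed": 20,
    "itlb_misses_walk_duration": 0,
    "itlb_misses_walk_completed": 0,
}

for prefix in ("dtlb_load_misses", "dtlb_store_misses", "itlb_misses"):
    print(prefix, avg_walk_cycles(sample, prefix))
```

Comparing these averages before and after changing the flush ceiling gives a feel for whether the refill cost or the invlpg cost dominates for a given workload.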
				80
Changbin Du	1715604	2019-05-08 23:21:23 +0800	[diff] [blame]	81	.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
Dave Hansen	2d040a1	2014-07-31 08:41:01 -0700	[diff] [blame]	82	says: "One execution of INVLPG is sufficient even for a page
				83	with size greater than 4 KBytes."