.. SPDX-License-Identifier: GPL-2.0

=======
The TLB
=======

When the kernel unmaps or modifies the attributes of a range of
memory, it has two choices:

 1. Flush the entire TLB with a two-instruction sequence. This is
    a quick operation, but it causes collateral damage: TLB entries
    from areas other than the one we are trying to flush will be
    destroyed and must be refilled later, at some cost.
 2. Use the invlpg instruction to invalidate a single page at a
    time. This could potentially cost many more instructions, but
    it is a much more precise operation, causing no collateral
    damage to other TLB entries.

Which method to use depends on a few things:

 1. The size of the flush being performed. A flush of the entire
    address space is obviously better performed by flushing the
    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
 2. The contents of the TLB. If the TLB is empty, then there will
    be no collateral damage caused by doing the global flush, and
    all of the individual flushes will have ended up being wasted
    work.
 3. The size of the TLB. The larger the TLB, the more collateral
    damage we do with a full flush. So, the larger the TLB, the
    more attractive an individual flush looks. Data and
    instructions have separate TLBs, as do different page sizes.
 4. The microarchitecture. The TLB has become a multi-level
    cache on modern CPUs, and the global flushes have become more
    expensive relative to single-page flushes.
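To put the 2^48/PAGE_SIZE figure from the first item in perspective, here
is a small back-of-the-envelope sketch. It assumes 4 KiB base pages; the
constant names and the helper function are illustrative only and do not
come from the kernel.

```python
# Assumption for illustration: 4 KiB base pages and the 48-bit
# virtual address space mentioned in the list above.
ADDRESS_SPACE_BITS = 48
PAGE_SIZE = 4096


def individual_flushes_for_full_address_space(bits=ADDRESS_SPACE_BITS,
                                              page_size=PAGE_SIZE):
    """Number of single-page invalidations needed to cover 2^bits bytes."""
    return (1 << bits) // page_size


count = individual_flushes_for_full_address_space()
print(count)             # 68719476736
print(count == 2 ** 36)  # True
```

Under these assumptions a per-page approach would need 2^36 invalidations,
against a two-instruction sequence for the full flush.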

There is obviously no way the kernel can know all these things,
especially the contents of the TLB during a given flush. The
sizes of the flush will vary greatly depending on the workload as
well. There is essentially no "right" point to choose.

You may be doing too many individual invalidations if you see the
invlpg instruction (or instructions _near_ it) show up high in
profiles. If you believe that individual invalidations are being
called too often, you can lower the tunable::

        /sys/kernel/debug/x86/tlb_single_page_flush_ceiling

This will cause us to do the global flush for more cases.
Lowering it to 0 will disable the use of the individual flushes.
Setting it to 1 is a very conservative setting and it should
never need to be 0 under normal circumstances.
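The effect of the ceiling can be pictured with a simplified model. This
is only a sketch, not the kernel's code: it assumes the decision is a
plain comparison between the number of pages in the range and the
ceiling, and it assumes 4 KiB pages and page-aligned addresses. The
function name ``choose_flush`` is hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def choose_flush(start, end, ceiling, page_shift=PAGE_SHIFT):
    """Pick a flush method for the page-aligned range [start, end).

    Simplified model: ranges longer than `ceiling` pages fall back
    to a full TLB flush, everything else is flushed page by page.
    """
    nr_pages = (end - start) >> page_shift
    if nr_pages > ceiling:
        return "full"
    return "individual"


one_page = 1 << PAGE_SHIFT
print(choose_flush(0, one_page, ceiling=0))       # full
print(choose_flush(0, one_page, ceiling=1))       # individual
print(choose_flush(0, 2 * one_page, ceiling=1))   # full
```

In this model a ceiling of 0 sends every non-empty range down the
full-flush path, which matches the statement that 0 disables individual
flushes.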

Despite the fact that a single individual flush on x86 is
guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full
flushes. THP is treated exactly the same as normal memory.

You might see invlpg inside of flush_tlb_mm_range() show up in
profiles, or you can use the trace_tlb_flush() tracepoints to
determine how long the flush operations are taking.

Essentially, you are balancing the cycles you spend doing invlpg
with the cycles that you spend refilling the TLB later.
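That balance can be written down as a toy cost model. Everything here is
an assumption for illustration: the function name, the parameters, and
especially the cycle counts in the example calls are made-up
placeholders, not measurements from any CPU.

```python
def cheaper_method(nr_pages, invlpg_cycles, full_flush_cycles,
                   entries_lost, refill_cycles):
    """Compare the two strategies under a toy cycle-count model.

    individual: one invlpg per page, no collateral damage.
    full:       one cheap flush, plus a later refill for every
                unrelated TLB entry that was thrown away.
    """
    individual_cost = nr_pages * invlpg_cycles
    full_cost = full_flush_cycles + entries_lost * refill_cycles
    return "individual" if individual_cost <= full_cost else "full"


# Hypothetical numbers only.
print(cheaper_method(4, 200, 500, entries_lost=64, refill_cycles=30))    # individual
print(cheaper_method(512, 200, 500, entries_lost=64, refill_cycles=30))  # full
print(cheaper_method(4, 200, 500, entries_lost=0, refill_cycles=30))     # full
```

The last call models an empty TLB: with no entries to lose, the full
flush has no collateral cost and wins even for a small range.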

You can measure how expensive TLB refills are by using
performance counters and 'perf stat', like this::

  perf stat \
    -e cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/ \
    -e cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/ \
    -e cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/ \
    -e cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/ \
    -e cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/ \
    -e cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/

That works on an IvyBridge-era CPU (i5-3320M). Different CPUs
may have differently-named counters, but they should at least
be there in some form. You can use pmu-tools 'ocperf list'
(https://github.com/andikleen/pmu-tools) to find the right
counters for a given CPU.
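One way to turn such counts into a refill cost is to divide each
duration counter by the matching completed counter. The sketch below
assumes, as the event names suggest, that the ``*_walk_duration`` events
count cycles spent in page walks and the ``*_walk_completed`` events
count finished walks. The helper name and the sample numbers are
hypothetical.

```python
def average_walk_cycles(walk_duration, walk_completed):
    """Average cycles per completed page walk (0.0 if none completed)."""
    if walk_completed == 0:
        return 0.0
    return walk_duration / walk_completed


# Hypothetical counter values, not taken from a real run.
print(average_walk_cycles(3_000_000, 100_000))  # 30.0
print(average_walk_cycles(0, 0))                # 0.0
```

Computing this separately for the load, store and instruction counter
pairs gives a rough per-miss refill cost for each kind of access.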

.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
   says: "One execution of INVLPG is sufficient even for a page
   with size greater than 4 KBytes."