.. SPDX-License-Identifier: GPL-2.0

=======
The TLB
=======

When the kernel unmaps or modifies the attributes of a range of
memory, it has two choices:

 1. Flush the entire TLB with a two-instruction sequence. This is
    a quick operation, but it causes collateral damage: TLB entries
    from areas other than the one we are trying to flush will be
    destroyed and must be refilled later, at some cost.
 2. Use the invlpg instruction to invalidate a single page at a
    time. This could potentially cost many more instructions, but
    it is a much more precise operation, causing no collateral
    damage to other TLB entries.

Which method to use depends on a few things:

 1. The size of the flush being performed. A flush of the entire
    address space is obviously better performed by flushing the
    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
 2. The contents of the TLB. If the TLB is empty, then there will
    be no collateral damage caused by doing the global flush, and
    all of the individual flushes will have ended up being wasted
    work.
 3. The size of the TLB. The larger the TLB, the more collateral
    damage we do with a full flush. So, the larger the TLB, the
    more attractive an individual flush looks. Data and
    instructions have separate TLBs, as do different page sizes.
 4. The microarchitecture. The TLB has become a multi-level
    cache on modern CPUs, and the global flushes have become more
    expensive relative to single-page flushes.

There is obviously no way the kernel can know all these things,
especially the contents of the TLB during a given flush. The
sizes of the flushes will vary greatly depending on the workload as
well. There is essentially no "right" point to choose.

You may be doing too many individual invalidations if you see the
invlpg instruction (or instructions *near* it) show up high in
profiles. If you believe that individual invalidations are being
called too often, you can lower the tunable::

    /sys/kernel/debug/x86/tlb_single_page_flush_ceiling

This will cause us to do the global flush for more cases.
Lowering it to 0 will disable the use of the individual flushes.
Setting it to 1 is a very conservative setting and it should
never need to be 0 under normal circumstances.

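The effect of the ceiling can be sketched as a simple comparison.
The Python model below is an illustration only, not the kernel's
code. The function name ``choose_flush`` and its parameters are
hypothetical, and it assumes that a full flush is used whenever the
number of pages in the range exceeds the ceiling, which is consistent
with the behaviour of the values 0 and 1 described above.

```python
PAGE_SIZE = 4096  # assumption: 4 KB base pages


def choose_flush(start, end, ceiling):
    """Return "full" or "single" for a flush of the range [start, end).

    Hypothetical model: a range spanning more pages than `ceiling`
    is handled with one full TLB flush; a smaller range is
    invalidated one page at a time.
    """
    pages = (end - start + PAGE_SIZE - 1) // PAGE_SIZE  # round up
    return "full" if pages > ceiling else "single"
```

Under this model a ceiling of 0 sends even a one-page range to the
full flush, while a ceiling of 1 keeps individual invalidation for
one-page ranges only.
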
Despite the fact that a single individual flush on x86 is
guaranteed to flush a full 2MB [1]_, hugetlbfs always uses
full flushes. THP is treated exactly the same as normal memory.

You might see invlpg inside of flush_tlb_mm_range() show up in
profiles, or you can use the trace_tlb_flush() tracepoints to
determine how long the flush operations are taking.

Essentially, you are balancing the cycles you spend doing invlpg
with the cycles that you spend refilling the TLB later.

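This balance can be written down as a rough cost comparison. The
sketch below is a simplified model, not a measurement: the function
name ``flush_costs`` and every parameter are hypothetical, and the
caller has to supply cycle counts measured on the CPU in question.

```python
def flush_costs(pages, invlpg_cycles, full_flush_cycles,
                live_entries, refill_cycles):
    """Return (individual_cost, full_cost) in cycles.

    All inputs are assumptions supplied by the caller:
      pages              pages in the range being flushed
      invlpg_cycles      cost of one invlpg
      full_flush_cycles  cost of the full-flush sequence itself
      live_entries       unrelated TLB entries destroyed by a full
                         flush that will be needed again
      refill_cycles      cost of one later TLB refill
    """
    individual = pages * invlpg_cycles
    full = full_flush_cycles + live_entries * refill_cycles
    return individual, full
```

With ``live_entries`` set to 0 the full flush carries no collateral
cost, which corresponds to the empty-TLB case in the list of factors
above.
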
You can measure how expensive TLB refills are by using
performance counters and 'perf stat', like this::

    perf stat -e \
    cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,\
    cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,\
    cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,\
    cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,\
    cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,\
    cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/

That works on an IvyBridge-era CPU (i5-3320M). Different CPUs
may have differently-named counters, but they should at least
be there in some form. You can use pmu-tools 'ocperf list'
(https://github.com/andikleen/pmu-tools) to find the right
counters for a given CPU.

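One way to turn these counters into a refill cost is to divide each
walk_duration count by the matching walk_completed count. The helper
below is a minimal sketch of that arithmetic. The function name
``average_walk_cycles`` is hypothetical, and it assumes the duration
counter reports cycles spent in page walks while the completed
counter reports the number of finished walks.

```python
def average_walk_cycles(walk_duration, walk_completed):
    """Average cycles per completed page walk.

    Returns 0.0 when no walks completed, to avoid dividing by zero.
    Both arguments are raw counter values read from 'perf stat'.
    """
    if walk_completed == 0:
        return 0.0
    return walk_duration / walk_completed
```

The result is only an average over the measured interval. Individual
walks can be much cheaper or more expensive depending on where the
page tables are cached.
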
.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
   says: "One execution of INVLPG is sufficient even for a page
   with size greater than 4 KBytes."