.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (pti, previously known as KAISER [1]_) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications. When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy. When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel, such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT). A few things that are not strictly necessary are also mapped,
such as the first C function called when entering an interrupt (see
the comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled. PTI is
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
Once enabled at compile time, it can be disabled at boot with the
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
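
Whether PTI ended up active on a running system can be checked from
userspace. The sketch below reads the Meltdown status file that the
kernel exposes in sysfs. The function name 'pti_state' and the words it
prints are made up for this example; only the sysfs path and the
status strings it matches come from the kernel, and any other status
string is simply reported as "other"::

  # Hypothetical helper: report PTI state from the sysfs Meltdown file.
  # $1 optionally overrides the file path (useful for testing).
  pti_state() {
      f=${1:-/sys/devices/system/cpu/vulnerabilities/meltdown}
      [ -r "$f" ] || { echo unknown; return 0; }
      case "$(cat "$f")" in
          "Mitigation: PTI"*) echo enabled ;;
          "Not affected"*)    echo not-affected ;;
          "Vulnerable"*)      echo disabled ;;
          *)                  echo other ;;
      esac
  }

  pti_state

A result of "disabled" on a kernel built with
CONFIG_PAGE_TABLE_ISOLATION=y suggests that PTI was turned off on the
kernel command line.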

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI. This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although *complete*, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level. This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.

The userspace page tables map only the kernel data needed to enter
and exit the kernel. This data is entirely contained in the 'struct
cpu_entry_area' structure, which is placed in the fixmap. The fixmap
gives each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables as usual. The only difference is when the kernel
makes entries in the top (PGD) level. In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables. This leaves a single, shared set of
userspace page tables to manage: one PTE to lock, one set of
accessed bits, one set of dirty bits, and so on.
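
The relationship between the two PGDs can be pictured with a little
address arithmetic. The sketch below ASSUMES that the kernel and
userspace PGDs are two adjacent 4k pages in one 8k-aligned block, with
the kernel copy first, so that the two addresses differ only in bit 12.
That layout is an assumption made for illustration and is not stated in
this document; the function names are hypothetical and the example
address is made up::

  # Model only: ASSUMES kernel PGD and user PGD are adjacent 4k pages
  # in one 8k-aligned block, so they differ only in bit PAGE_SHIFT.
  PAGE_SHIFT=12

  # $1: address of the kernel PGD; prints the paired user PGD address.
  kernel_to_user_pgd() {
      printf '0x%x\n' $(( $1 | (1 << PAGE_SHIFT) ))
  }

  # $1: address of either PGD; prints the kernel PGD address.
  user_to_kernel_pgd() {
      printf '0x%x\n' $(( $1 & ~(1 << PAGE_SHIFT) ))
  }

  kernel_to_user_pgd 0x12346000    # prints 0x12347000

Under this assumption, mirroring a PGD entry means writing the same
value at the same index in both pages, and switching copies means
changing one address bit.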

Overhead
========

Protection against side-channel attacks is important, but
this protection comes at a cost:

1. Increased Memory Use

   a. Each process now needs an order-1 PGD instead of order-0.
      (Consumes an additional 4k per process).
   b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
      aligned so that it can be mapped by setting a single PMD
      entry. This consumes nearly 2MB of RAM once the kernel
      is decompressed, but no space in the kernel image itself.

2. Runtime Cost

   a. CR3 manipulation to switch between the page table copies
      must be done at interrupt, syscall, and exception entry
      and exit (though it can be skipped when the kernel itself
      is interrupted). Moves to CR3 are on the order of a hundred
      cycles, and are required at every entry and exit.
   b. A "trampoline" must be used for SYSCALL entry. This
      trampoline depends on a smaller set of resources than the
      non-PTI SYSCALL entry code, so requires mapping fewer
      things into the userspace page tables. The downside is
      that stacks must be switched at entry time.
   c. Global pages are disabled for all kernel structures not
      mapped into both kernel and userspace page tables. This
      feature of the MMU allows different processes to share TLB
      entries mapping the kernel. Losing the feature means more
      TLB misses after a context switch. The actual loss of
      performance is very small, however, never exceeding 1%.
   d. Process Context IDentifiers (PCID) is a CPU feature that
      allows us to skip flushing the entire TLB when switching page
      tables by setting a special bit in CR3 when the page tables
      are changed. This makes switching the page tables (at context
      switch, or kernel entry/exit) cheaper. However, on systems
      with PCID support, the context switch code must flush both the
      user and kernel entries out of the TLB. The user PCID TLB flush
      is deferred until the exit to userspace, minimizing the cost.
      See intel.com/sdm for the gory PCID/INVPCID details.
   e. The userspace page tables must be populated for each new
      process. Even without PTI, the shared kernel mappings
      are created by copying top-level (PGD) entries into each
      new process. But, with PTI, there are now *two* kernel
      mappings: one in the kernel page tables that maps everything
      and one for the entry/exit structures. At fork(), we need to
      copy both.
   f. In addition to the fork()-time copying, there must also
      be an update to the userspace PGD any time a set_pgd() is done
      on a PGD used to map userspace. This ensures that the kernel
      and userspace copies always map the same userspace
      memory.
   g. On systems without PCID support, each CR3 write flushes
      the entire TLB. That means that each syscall, interrupt
      or exception flushes the TLB.
   h. INVPCID is a TLB-flushing instruction which allows flushing
      of TLB entries for non-current PCIDs. Some systems support
      PCIDs, but do not support INVPCID. On these systems, addresses
      can only be flushed from the TLB for the current PCID. When
      flushing a kernel address, we need to flush all PCIDs, so a
      single kernel address flush will require a TLB-flushing CR3
      write upon the next use of every PCID.
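
Which of the cases in items d, g and h applies depends on whether the
CPU reports PCID and INVPCID. A minimal sketch for checking this,
assuming the usual 'pcid' and 'invpcid' flag names in /proc/cpuinfo;
the function name 'tlb_features' is hypothetical::

  # Hypothetical helper: report PCID/INVPCID support from CPU flags.
  # $1 optionally overrides the cpuinfo path (useful for testing).
  tlb_features() {
      flags=$(grep -m1 '^flags' "${1:-/proc/cpuinfo}")
      pcid=no
      invpcid=no
      # Pad with spaces so only whole flag names match.
      case " $flags " in *" pcid "*)    pcid=yes ;; esac
      case " $flags " in *" invpcid "*) invpcid=yes ;; esac
      echo "pcid=$pcid invpcid=$invpcid"
  }

  tlb_features

A result of "pcid=no" corresponds to item g, and "pcid=yes invpcid=no"
corresponds to the situation described in item h.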

Possible Future Work
====================

1. Avoid writing to CR3 unless its value actually changes.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
=======

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes. These tests frequently uncover corner cases in the
   kernel entry code. In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts). This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs. Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.
   ::

     while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.
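
Steps 2 and 3 can be partly scripted. The helpers below are a sketch,
not a supported tool: 'run_in_parallel_for' and 'nmi_total' are
hypothetical names, and the kselftest 'make' invocation in the comment
is only one possible way to run the x86 tests (it does not exclude MPX
or protection_keys, which is left to the reader)::

  # Run COPIES background loops of a command for about SECS seconds.
  # $1: number of copies, $2: duration in seconds, rest: the command.
  # Failures of the command are ignored so the loops keep going.
  run_in_parallel_for() {
      copies=$1
      secs=$2
      shift 2
      end=$(( $(date +%s) + secs ))
      i=0
      while [ "$i" -lt "$copies" ]; do
          ( while [ "$(date +%s)" -lt "$end" ]; do "$@"; done ) &
          i=$(( i + 1 ))
      done
      wait
  }

  # Sum the per-CPU NMI counts; $1 optionally overrides the file path.
  nmi_total() {
      awk '$1 == "NMI:" { for (i = 2; i <= NF; i++)
                              if ($i ~ /^[0-9]+$/) s += $i }
           END { print s + 0 }' "${1:-/proc/interrupts}"
  }

  # Example, from the top of an already-built kernel tree:
  #   before=$(nmi_total)
  #   run_in_parallel_for 4 300 \
  #       make -C tools/testing/selftests TARGETS=x86 run_tests
  #   echo "NMIs during run: $(( $(nmi_total) - before ))"

If the NMI count barely moves while the perf loop from step 3 is
running, the perf invocation is probably not generating NMIs on this
machine and the NMI entry/exit paths are not being exercised.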

Debugging
=========

Bugs in PTI cause a few distinct crash signatures that are worth
noting here.

* Failures of the selftests/x86 code. Usually a bug in one of the
  more obscure corners of entry_64.S.
* Crashes in early boot, especially around CPU bringup. Bugs
  in the trampoline code or mappings cause these.
* Crashes at the first interrupt. Caused by bugs in entry_64.S,
  like screwing up a page table switch. Also caused by
  incorrectly mapping the IRQ handler entry code.
* Crashes at the first NMI. The NMI code is separate from main
  interrupt handlers and can have bugs that do not affect
  normal interrupts. Also caused by incorrectly mapping NMI
  code. NMIs that interrupt the entry code must be very
  careful and can be the cause of crashes that show up when
  running perf.
* Kernel crashes at the first exit to userspace. entry_64.S
  bugs, or failing to map some of the exit code.
* Crashes at first interrupt that interrupts userspace. The paths
  in entry_64.S that return to userspace are sometimes separate
  from the ones that return to the kernel.
* Double faults: overflowing the kernel stack because of page
  faults upon page faults. Caused by touching non-pti-mapped
  data in the entry code, or forgetting to switch to kernel
  CR3 before calling into C functions which are not pti-mapped.
* Userspace segfaults early in boot, sometimes manifesting
  as mount(8) failing to mount the rootfs. These have
  tended to be TLB invalidation issues. Usually invalidating
  the wrong PCID, or otherwise missing an invalidation.

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf