======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This includes
limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, and minimizing the areas of writable kernel
memory.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ``ARCH_OPTIONAL_KERNEL_RWX`` to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ``ARCH_OPTIONAL_KERNEL_RWX`` is enabled.

Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

Variables that are initialized once at ``__init`` time can be marked
with the ``__ro_after_init`` attribute.
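
As a rough illustration of the two markings described above, the sketch
below declares an operations table as "const" and a set-once variable as
``__ro_after_init``. All ``demo_*`` names are hypothetical, and
``__ro_after_init`` is defined away so that the example builds in
userspace, where it has no effect.

```c
#include <stddef.h>

/* Hypothetical operations table, loosely modelled on the kernel's
 * file/network operation structures. */
struct demo_ops {
    int (*open)(int id);
    int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/* "const" places the table in .rodata rather than .data, so its function
 * pointers cannot be rewritten once strict permissions are applied. */
static const struct demo_ops demo_default_ops = {
    .open  = demo_open,
    .close = demo_close,
};

/* Assumption: outside the kernel the attribute is not available, so it
 * is defined to nothing here purely to keep the sketch compilable. */
#ifndef __ro_after_init
#define __ro_after_init
#endif

/* Written once during initialization and only read afterwards. */
static int demo_limit __ro_after_init;

void demo_init(int limit)
{
    demo_limit = limit;
}

int demo_call_open(int id)
{
    if (id > demo_limit)
        return -1;
    /* Callers only ever read the table. */
    return demo_default_ops.open(id);
}
```

In a real kernel build the attribute moves the variable into a section
that is made read-only once initialization has finished, so a later stray
write to ``demo_limit`` would fault instead of silently succeeding.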

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allow them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)
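
A userspace analogy of "read-only except during an update" can be sketched
with ``mmap()`` and ``mprotect()``. This is only an approximation under
stated assumptions: the ``ro_slot_*`` helpers are hypothetical, and the
write window here is visible to the whole process, whereas the scheme
described above would limit it to the CPU thread performing the update.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate one page, store an initial value, then drop write permission. */
long *ro_slot_create(long initial)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    long *slot = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (slot == MAP_FAILED)
        return NULL;
    *slot = initial;
    if (mprotect(slot, len, PROT_READ) != 0) {
        munmap(slot, len);
        return NULL;
    }
    return slot;
}

/* Open a short write window, update the value, and restore read-only. */
int ro_slot_update(long *slot, long value)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    if (mprotect(slot, len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    *slot = value;
    return mprotect(slot, len, PROT_READ);
}
```

Any write to the slot outside ``ro_slot_update()`` faults, which is the
property the text asks for: the value spends almost all of its lifetime
unwritable.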

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.
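
The opt-in nature of seccomp can be seen from userspace with its simplest
mode. The sketch below is Linux-only and assumes the kernel was built with
seccomp support; the function name is hypothetical. A child process enters
strict mode and then attempts a syscall outside the small allowed set.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for syscall() on glibc */
#endif
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/*
 * Returns 1 if a child in seccomp strict mode was killed for calling
 * getpid(), 0 if it was not (for example, seccomp is unavailable),
 * and -1 if the experiment itself could not be run.
 */
int strict_mode_blocks_getpid(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: after this call only read, write, _exit and
         * sigreturn remain available. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);
        syscall(SYS_getpid);  /* not allowed: the kernel sends SIGKILL */
        syscall(SYS_exit, 1); /* only reached if the call was permitted */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

Real deployments normally use filter mode with a BPF program rather than
strict mode, since that allows a per-application list of permitted
syscalls; strict mode is used here only because it needs no filter.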

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set of entry points normally available to
unprivileged userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.
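
The canary check itself is inserted by the compiler, but the idea can be
imitated by hand. The following is a simulation only: ``struct demo_frame``
stands in for a real stack frame, and the function and its secret parameter
are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated stack frame: a local buffer followed by a canary that sits
 * between the buffer and the (imaginary) saved return address. */
struct demo_frame {
    char buf[16];
    uintptr_t canary;
};

/*
 * Copy len bytes into the frame without checking against the buffer
 * size, then verify the canary as a function epilogue would.
 * Returns 0 if the canary is intact, -1 if it was clobbered.
 */
int demo_copy_and_check(const void *src, size_t len, uintptr_t secret)
{
    struct demo_frame frame;

    if (len > sizeof(frame))
        len = sizeof(frame); /* keep the simulation itself in bounds */

    memset(&frame, 0, sizeof(frame));
    frame.canary = secret;

    /* The "bug": len is never compared with sizeof(frame.buf). */
    memcpy(&frame, src, len);

    return frame.canary == secret ? 0 : -1;
}
```

A copy of 16 bytes leaves the canary untouched, while a longer copy
overwrites it and is detected before the (imaginary) return address would
be used.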

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.
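
One common form of such a sanity check is verifying that a list entry's
neighbours still point back at it before unlinking. The sketch below shows
that check on a plain doubly linked list; the ``demo_*`` names are
hypothetical and this is not the kernel's allocator code.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_node {
    struct demo_node *next, *prev;
};

void demo_list_init(struct demo_node *head)
{
    head->next = head->prev = head;
}

/* Insert n directly after head. */
void demo_list_add(struct demo_node *head, struct demo_node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/*
 * Unlink n only if both neighbours still point back at it. A corrupted
 * entry would otherwise turn the two pointer writes below into writes
 * to attacker-chosen addresses.
 */
bool demo_list_del_checked(struct demo_node *n)
{
    if (n->prev->next != n || n->next->prev != n)
        return false; /* links are inconsistent: refuse to write */

    n->next->prev = n->prev;
    n->prev->next = n->next;
    n->next = n->prev = NULL;
    return true;
}
```

The check costs two comparisons per removal, which is why this style of
validation is cheap enough to apply on allocation and free paths.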

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.
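
A minimal userspace sketch of a wrap-resistant reference counter follows,
using C11 atomics. It saturates and reports failure rather than trapping,
which is a simplification of what the text describes; the ``demo_ref``
type and helpers are hypothetical.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_REF_SATURATED UINT_MAX

struct demo_ref {
    atomic_uint count;
};

void demo_ref_init(struct demo_ref *r, unsigned int n)
{
    atomic_init(&r->count, n);
}

/* Take a reference. Never resurrects a zero count, and sticks at the
 * saturated value instead of wrapping back to zero. */
bool demo_ref_get(struct demo_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0)
            return false; /* object is already being freed */
        if (old == DEMO_REF_SATURATED)
            return true;  /* pinned: leak rather than wrap */
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));

    return true;
}

/* Drop a reference. Returns true only when the caller released the
 * last reference and should free the object. */
bool demo_ref_put(struct demo_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == DEMO_REF_SATURATED)
            return false; /* underflow attempt, or pinned forever */
    } while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));

    return old == 1;
}
```

The design choice is that a saturated counter leaks its object forever: a
memory leak is a far smaller problem than a use-after-free.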

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
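
For explicit size calculations, the check can be written with the
overflow-detecting builtins provided by GCC and Clang. The sketch below
assumes one of those compilers; the ``demo_*`` helpers are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Multiply n by elem, returning SIZE_MAX if the product does not fit. */
size_t demo_array_size(size_t n, size_t elem)
{
    size_t bytes;

    if (__builtin_mul_overflow(n, elem, &bytes))
        return SIZE_MAX;
    return bytes;
}

/* Allocate an array of n elements, failing cleanly on overflow instead
 * of returning a buffer that is smaller than the caller believes. */
void *demo_alloc_array(size_t n, size_t elem)
{
    size_t bytes = demo_array_size(n, elem);

    if (bytes == SIZE_MAX)
        return NULL;
    return malloc(bytes);
}
```

Without the check, ``n * elem`` would silently wrap and the caller would
later write ``n`` elements into an undersized allocation.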


Probabilistic defenses
======================

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values be separate (e.g. a different
canary per stack) and high entropy (e.g. is the RNG actually working?)
in order to maximize their success.
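
Constant blinding can be reduced to a single XOR. In the sketch below, the
two stored values stand for what a JIT would emit in place of one
attacker-chosen immediate. The names are hypothetical, and the key is a
plain parameter here, whereas a real implementation would draw a fresh
value from the RNG for each constant.

```c
#include <stdint.h>

/* What gets emitted into executable memory instead of the raw constant. */
struct demo_blinded {
    uint32_t masked; /* imm XOR key */
    uint32_t key;    /* per-constant secret */
};

struct demo_blinded demo_blind(uint32_t imm, uint32_t key)
{
    struct demo_blinded b = { .masked = imm ^ key, .key = key };
    return b;
}

/* What the emitted instructions compute at run time. */
uint32_t demo_unblind(struct demo_blinded b)
{
    return b.masked ^ b.key;
}
```

Because the attacker cannot predict the key, the bytes that land in
executable memory no longer match the constant they supplied, even though
the program still computes with the original value.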

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address: currently ``%px`` and ``%p[ad]``
(and ``%p[sSb]`` in certain circumstances [*]). Any file written to using
one of these specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using ``%p``. As of 4.15-rc1
addresses printed with the specifier ``%p`` are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
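
The atomic-counter option is the simplest of the three. A minimal sketch
in userspace C11 is shown here; the object type and helper are
hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Monotonic source of identifiers; carries no layout information. */
static atomic_uint_fast64_t demo_next_id = 1;

struct demo_object {
    uint64_t id; /* safe to report to userspace */
};

void demo_object_init(struct demo_object *obj)
{
    /* atomic_fetch_add returns the previous value, so concurrent
     * callers each receive a distinct identifier. */
    obj->id = atomic_fetch_add(&demo_next_id, 1);
}
```

The identifier is unique for the lifetime of the counter yet reveals
nothing about where the object lives in memory.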

Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If the
memory is not explicitly cleared with memset(), this will require changes
to the compiler to make sure structure holes are cleared.
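
The explicit ``memset()`` approach looks like the sketch below. The reply
structure is hypothetical; it is laid out so that most ABIs insert padding
between its two members, which is exactly the kind of hole that
member-by-member initialization leaves untouched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* On most ABIs, 7 bytes of padding follow "kind" so that "value" is
 * 8-byte aligned. */
struct demo_reply {
    uint8_t  kind;
    uint64_t value;
};

void demo_fill_reply(struct demo_reply *out, uint8_t kind, uint64_t value)
{
    /* Zero the whole object first: this covers the padding, which
     * assigning the members one by one would not. */
    memset(out, 0, sizeof(*out));
    out->kind = kind;
    out->value = value;
}
```

Initializing with ``= { .kind = kind, .value = value }`` instead is not
equivalent: the language does not require the padding bytes to be zeroed
in that case, so stale stack contents could still be copied out.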

Memory poisoning
----------------

When releasing memory, it is best to poison the contents, to avoid reuse
attacks that rely on the old contents of memory. E.g., clear stack on a
syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a
free. This frustrates many uninitialized variable attacks, stack content
exposures, heap content exposures, and use-after-free attacks.
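
Wiping heap memory on free can be sketched in userspace as a wrapper
around ``free()``. The helper names and the poison byte are illustrative
assumptions, and the caller has to pass the allocation size because plain
``malloc()`` does not expose it.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_POISON_BYTE 0x6b /* arbitrary, easily recognized pattern */

/* Overwrite a buffer with the poison pattern. The volatile pointer
 * keeps the compiler from discarding stores to memory that is about
 * to be released. */
void demo_poison(void *p, size_t len)
{
    volatile unsigned char *b = p;

    while (len--)
        *b++ = DEMO_POISON_BYTE;
}

/* Poison and then release a heap allocation of known size. */
void demo_free_poisoned(void *p, size_t len)
{
    if (p == NULL)
        return;
    demo_poison(p, len);
    free(p);
}
```

A non-zero pattern has a debugging benefit over zeroing: a stale pointer
read from poisoned memory is obviously invalid rather than looking like a
legitimate NULL.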

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
sensitive values written to it should be censored automatically.