======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the APIs exposed to userspace, to making in-kernel APIs
hard to use incorrectly, to minimizing the areas of writable kernel
memory.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.

Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.
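
As a minimal sketch of the "const" approach, the userspace C fragment
below declares a table of function pointers as const. The names
(``demo_ops``, ``demo_fops`` and friends) are hypothetical and only loosely
modeled on kernel operation structures; placement in .rodata is what a
typical toolchain does with such an object, not something this snippet
verifies.

```c
#include <stddef.h>

/* Hypothetical operations table (illustration only, not a kernel type). */
struct demo_ops {
    int (*open)(int id);
    int (*release)(int id);
};

static int demo_open(int id)    { return id + 1; }
static int demo_release(int id) { return id - 1; }

/*
 * Because the table is const, a direct assignment such as
 * "demo_fops.open = NULL;" is rejected at compile time, and the
 * toolchain is free to place the object in a read-only section.
 */
static const struct demo_ops demo_fops = {
    .open    = demo_open,
    .release = demo_release,
};

/* Callers only ever see a pointer-to-const, so they cannot retarget it. */
int demo_call_open(const struct demo_ops *ops, int id)
{
    return ops->open ? ops->open(id) : -1;
}
```

Passing ``const struct demo_ops *`` through every interface is what keeps
the table const-qualified end to end; a single non-const pointer in the
chain would force the table back into writable data.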

Variables that are initialized once at ``__init`` time can be marked
with the (new and under development) ``__ro_after_init`` attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.
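
The canary check can be modeled in plain userspace C. The sketch below is
not how the compiler implements ``CONFIG_STACKPROTECTOR``; it only mimics
the layout (buffer, then canary) with a byte array so the overflow stays
well defined. All names (``demo_frame`` and so on) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BUF_LEN 16

/* Model of a stack frame: a local buffer followed directly by a canary. */
struct demo_frame {
    unsigned char bytes[DEMO_BUF_LEN + sizeof(uint64_t)];
};

/* "Function entry": store the secret canary just past the buffer. */
void demo_frame_enter(struct demo_frame *f, uint64_t canary)
{
    memset(f->bytes, 0, DEMO_BUF_LEN);
    memcpy(f->bytes + DEMO_BUF_LEN, &canary, sizeof(canary));
}

/*
 * Deliberately buggy copy: it does not check len against DEMO_BUF_LEN.
 * It is clamped to the modeled frame only so that the demo itself never
 * writes outside the array.
 */
void demo_unchecked_copy(struct demo_frame *f, const void *src, size_t len)
{
    if (len > sizeof(f->bytes))
        len = sizeof(f->bytes);
    memcpy(f->bytes, src, len);
}

/* "Function exit": the return is only trusted if the canary is intact. */
bool demo_frame_exit_ok(const struct demo_frame *f, uint64_t canary)
{
    return memcmp(f->bytes + DEMO_BUF_LEN, &canary, sizeof(canary)) == 0;
}
```

A linear overflow has to cross the canary before it can reach anything
stored after it, which is why even a one-byte overrun is detected at
"exit" in this model.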

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.
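
One common form of such a sanity check is verifying that a list entry's
neighbours still point back at it before unlinking. The following is a
generic userspace sketch with hypothetical names (``demo_node``,
``demo_list_del_checked``); it is not the kernel's allocator code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list node. */
struct demo_node {
    struct demo_node *next, *prev;
};

void demo_list_init(struct demo_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
void demo_list_add(struct demo_node *entry, struct demo_node *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Unlink entry only if both neighbours still point back at it.
 * A corrupted (possibly attacker-controlled) prev/next pair fails this
 * check, so the unlink cannot be turned into a write to other memory.
 */
bool demo_list_del_checked(struct demo_node *entry)
{
    struct demo_node *prev = entry->prev;
    struct demo_node *next = entry->next;

    if (!prev || !next || prev->next != entry || next->prev != entry)
        return false;

    next->prev = prev;
    prev->next = next;
    entry->next = NULL;
    entry->prev = NULL;
    return true;
}
```

Returning ``false`` leaves the list untouched; a real implementation would
also have to decide whether to warn, leak the entry, or stop the system.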

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.
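
A rough illustration of trapping wraparound is a reference counter that
saturates instead of wrapping. The sketch below uses C11 atomics in
userspace; ``demo_ref`` and its helpers are hypothetical names, and the
policy shown (pin at the maximum, refuse to go below zero) is one
possible choice rather than a description of any particular kernel API.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_REF_SATURATED UINT_MAX

struct demo_ref {
    atomic_uint count;
};

void demo_ref_init(struct demo_ref *r, unsigned int n)
{
    atomic_init(&r->count, n);
}

/*
 * Take a reference. Fails if the count is 0 (object already released)
 * or already pinned at the saturated value, so it can never wrap to 0.
 */
bool demo_ref_get(struct demo_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == DEMO_REF_SATURATED)
            return false;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
    return true;
}

/*
 * Drop a reference. Returns true only when the last reference was
 * dropped and the caller may free the object. An underflow attempt or
 * a saturated counter never reports "free".
 */
bool demo_ref_put(struct demo_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == DEMO_REF_SATURATED)
            return false;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
    return old == 1;
}
```

The trade-off is that a saturated object is leaked rather than freed,
which turns a potential use-after-free into a bounded memory leak.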

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
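
As an illustration, a size calculation can be routed through a helper
that reports overflow instead of silently wrapping. This sketch assumes a
GCC or Clang toolchain (for the ``__builtin_*_overflow`` builtins);
``demo_array_size`` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute header + count * elem_size.
 * Returns false (and leaves *out untouched) if any step would overflow
 * size_t, so the caller never allocates a buffer that is too small.
 */
bool demo_array_size(size_t count, size_t elem_size, size_t header,
                     size_t *out)
{
    size_t bytes;

    if (__builtin_mul_overflow(count, elem_size, &bytes))
        return false;
    if (__builtin_add_overflow(bytes, header, &bytes))
        return false;

    *out = bytes;
    return true;
}
```

A caller would treat a ``false`` return as an allocation failure rather
than continuing with a truncated size.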


Probabilistic defenses
======================

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values used be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address, currently %px, %p[ad], (and %p[sSb]
in certain circumstances [*]). Any file written to using one of these
specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
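
The atomic-counter option might look like the following userspace sketch,
which hands out IDs from a monotonically increasing C11 atomic instead of
exposing the object's address. ``demo_object`` and ``demo_next_id`` are
hypothetical names, and ID exhaustion or reuse is ignored here.

```c
#include <stdatomic.h>
#include <stdint.h>

struct demo_object {
    uint64_t id;      /* safe to report to userspace */
    int payload;
};

/* Global ID source; starts at 1 so that 0 can mean "no object". */
static atomic_uint_fast64_t demo_next_id = 1;

void demo_object_init(struct demo_object *obj, int payload)
{
    /* The ID says nothing about where obj lives in memory. */
    obj->id = atomic_fetch_add(&demo_next_id, 1);
    obj->payload = payload;
}
```

Unlike a pointer value, such an ID does not help an attacker locate the
object or defeat address randomization.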

Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If the
memory is not explicitly cleared with memset(), this will require
changes to the compiler to make sure structure holes are cleared.
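
A small sketch of the explicit memset() approach: the whole structure is
cleared before individual fields are filled in, so padding and unused
tail bytes do not carry stale data. ``demo_reply`` and
``demo_fill_reply`` are hypothetical, and the actual copy to userspace is
left out.

```c
#include <stdint.h>
#include <string.h>

/* Reply structure; most ABIs insert padding between kind and value. */
struct demo_reply {
    uint8_t  kind;
    uint64_t value;
    char     name[8];
};

void demo_fill_reply(struct demo_reply *out, uint8_t kind, uint64_t value,
                     const char *name)
{
    size_t n = strlen(name);

    /* Clear everything first, including holes and the unused name tail. */
    memset(out, 0, sizeof(*out));

    out->kind = kind;
    out->value = value;

    /* Truncate so that at least one trailing NUL byte always remains. */
    if (n > sizeof(out->name) - 1)
        n = sizeof(out->name) - 1;
    memcpy(out->name, name, n);
}
```

Initializing field by field without the memset() would leave the bytes
after a short ``name`` and any padding holding whatever was previously
in that memory.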

Memory poisoning
----------------

When releasing memory, it is best to poison the contents, to avoid reuse
attacks that rely on the old contents of memory. E.g., clear stack on a
syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a
free. This frustrates many uninitialized variable attacks, stack content
exposures, heap content exposures, and use-after-free attacks.
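
To illustrate wiping on free, here is a toy fixed-slot allocator that
overwrites a slot with a poison byte when it is released. Everything in
it (``demo_alloc``, ``demo_free``, the slot sizes, the poison value) is
an assumption made for this sketch and unrelated to the kernel's real
allocators.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SLOT_SIZE 32
#define DEMO_SLOTS     4
#define DEMO_POISON    0x6b   /* arbitrary pattern chosen for this demo */

static unsigned char demo_pool[DEMO_SLOTS][DEMO_SLOT_SIZE];
static bool demo_used[DEMO_SLOTS];

/* Hand out the first free slot, or NULL if the pool is exhausted. */
void *demo_alloc(void)
{
    for (size_t i = 0; i < DEMO_SLOTS; i++) {
        if (!demo_used[i]) {
            demo_used[i] = true;
            return demo_pool[i];
        }
    }
    return NULL;
}

/* Release a slot, overwriting its old contents before it can be reused. */
void demo_free(void *p)
{
    for (size_t i = 0; i < DEMO_SLOTS; i++) {
        if (p == (void *)demo_pool[i] && demo_used[i]) {
            memset(demo_pool[i], DEMO_POISON, DEMO_SLOT_SIZE);
            demo_used[i] = false;
            return;
        }
    }
}
```

The next user of the slot sees only the poison pattern, so neither a
missing initialization nor a stale pointer can recover the previous
contents.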

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.