.. SPDX-License-Identifier: GPL-2.0
.. Copyright (C) 2020, Google LLC.

Kernel Electric-Fence (KFENCE)
==============================

Kernel Electric-Fence (KFENCE) is a low-overhead sampling-based memory safety
error detector. KFENCE detects heap out-of-bounds access, use-after-free, and
invalid-free errors.

KFENCE is designed to be enabled in production kernels, and has near zero
performance overhead. Compared to KASAN, KFENCE gives up precision in exchange
for performance. The main motivation behind KFENCE's design is that with enough
total uptime KFENCE will detect bugs in code paths not typically exercised by
non-production test workloads. One way to quickly achieve a large enough total
uptime is to deploy the tool across a large fleet of machines.

Usage
-----

To enable KFENCE, configure the kernel with::

    CONFIG_KFENCE=y

To build a kernel with KFENCE support, but disabled by default (to enable, set
``kfence.sample_interval`` to a non-zero value), configure the kernel with::

    CONFIG_KFENCE=y
    CONFIG_KFENCE_SAMPLE_INTERVAL=0

KFENCE provides several other configuration options to customize behaviour (see
the respective help text in ``lib/Kconfig.kfence`` for more info).

Tuning performance
~~~~~~~~~~~~~~~~~~

The most important parameter is KFENCE's sample interval, which can be set via
the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The
sample interval determines the frequency with which heap allocations will be
guarded by KFENCE. The default is configurable via the Kconfig option
``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
disables KFENCE.

The KFENCE memory pool is of fixed size, and if the pool is exhausted, no
further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default
255), the number of available guarded objects can be controlled. Each object
requires 2 pages, one for the object itself and the other one used as a guard
page; object pages are interleaved with guard pages, and every object page is
therefore surrounded by two guard pages.

The total memory dedicated to the KFENCE memory pool can be computed as::

    ( #objects + 1 ) * 2 * PAGE_SIZE

Using the default config, and assuming a page size of 4 KiB, results in
dedicating 2 MiB to the KFENCE memory pool.

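As a quick check of the formula above, the following user-space C sketch
computes the pool size. The helper name ``kfence_pool_bytes`` is hypothetical
and exists only for this illustration; it is not a kernel function.

```c
#include <stddef.h>

/*
 * Illustrative helper (not part of the kernel): total bytes reserved for
 * the KFENCE pool, i.e. (#objects + 1) * 2 * PAGE_SIZE.
 */
static inline size_t kfence_pool_bytes(size_t num_objects, size_t page_size)
{
	return (num_objects + 1) * 2 * page_size;
}
```

With the default of 255 objects and 4 KiB pages,
``kfence_pool_bytes(255, 4096)`` evaluates to 2097152 bytes, which is the
2 MiB stated above.
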
Note: On architectures that support huge pages, KFENCE will ensure that the
pool is using pages of size ``PAGE_SIZE``. This will result in additional page
tables being allocated.

Error reports
~~~~~~~~~~~~~

A typical out-of-bounds access looks like this::

    ==================================================================
    BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa3/0x22b

    Out-of-bounds read at 0xffffffffb672efff (1B left of kfence-#17):
     test_out_of_bounds_read+0xa3/0x22b
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    kfence-#17 [0xffffffffb672f000-0xffffffffb672f01f, size=32, cache=kmalloc-32] allocated by task 507:
     test_alloc+0xf3/0x25b
     test_out_of_bounds_read+0x98/0x22b
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    CPU: 4 PID: 107 Comm: kunit_try_catch Not tainted 5.8.0-rc6+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    ==================================================================

The header of the report provides a short summary of the function involved in
the access. It is followed by more detailed information about the access and
its origin. Note that real kernel addresses are only shown for
``CONFIG_DEBUG_KERNEL=y`` builds.

Use-after-free accesses are reported as::

    ==================================================================
    BUG: KFENCE: use-after-free read in test_use_after_free_read+0xb3/0x143

    Use-after-free read at 0xffffffffb673dfe0 (in kfence-#24):
     test_use_after_free_read+0xb3/0x143
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    kfence-#24 [0xffffffffb673dfe0-0xffffffffb673dfff, size=32, cache=kmalloc-32] allocated by task 507:
     test_alloc+0xf3/0x25b
     test_use_after_free_read+0x76/0x143
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    freed by task 507:
     test_use_after_free_read+0xa8/0x143
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    CPU: 4 PID: 109 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    ==================================================================

KFENCE also reports on invalid frees, such as double-frees::

    ==================================================================
    BUG: KFENCE: invalid free in test_double_free+0xdc/0x171

    Invalid free of 0xffffffffb6741000:
     test_double_free+0xdc/0x171
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    kfence-#26 [0xffffffffb6741000-0xffffffffb674101f, size=32, cache=kmalloc-32] allocated by task 507:
     test_alloc+0xf3/0x25b
     test_double_free+0x76/0x171
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    freed by task 507:
     test_double_free+0xa8/0x171
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    CPU: 4 PID: 111 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    ==================================================================

KFENCE also uses pattern-based redzones on the other side of an object's guard
page, to detect out-of-bounds writes on the unprotected side of the object.
These are reported on frees::

    ==================================================================
    BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184

    Corrupted memory at 0xffffffffb6797ff9 [ 0xac . . . . . . ] (in kfence-#69):
     test_kmalloc_aligned_oob_write+0xef/0x184
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    kfence-#69 [0xffffffffb6797fb0-0xffffffffb6797ff8, size=73, cache=kmalloc-96] allocated by task 507:
     test_alloc+0xf3/0x25b
     test_kmalloc_aligned_oob_write+0x57/0x184
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    CPU: 4 PID: 120 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    ==================================================================

For such errors, the address where the corruption occurred as well as the
invalidly written bytes (offset from the address) are shown; in this
representation, '.' denotes an untouched byte. In the example above, ``0xac``
is the value written to the invalid address at offset 0, and the remaining '.'
denote that no following bytes have been touched. Note that real values are
only shown for ``CONFIG_DEBUG_KERNEL=y`` builds; to avoid information
disclosure for non-debug builds, '!' is used instead to denote invalidly
written bytes.

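The byte representation described above can be modelled with a short
user-space C sketch. The function names ``canary_for`` and ``format_redzone``
are hypothetical. The canary pattern used here (``0xaa`` XOR the low three
address bits) is an assumption made for this illustration, and the output
format only approximates the report.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed canary for this sketch: 0xaa XOR the low three address bits. */
static inline uint8_t canary_for(uint64_t addr)
{
	return (uint8_t)(0xaa ^ (addr & 0x7));
}

/*
 * Render `len` redzone bytes starting at `addr` into `out`:
 *   "."     the byte still holds its canary (untouched)
 *   "0x.."  the byte was overwritten and real values may be shown
 *   "!"     the byte was overwritten but values must be hidden
 */
static inline void format_redzone(char *out, size_t out_len,
				  const uint8_t *bytes, uint64_t addr,
				  size_t len, int show_values)
{
	size_t used = 0;

	if (out_len == 0)
		return;
	out[0] = '\0';

	for (size_t i = 0; i < len && used < out_len; i++) {
		int n;

		if (bytes[i] == canary_for(addr + i))
			n = snprintf(out + used, out_len - used, ". ");
		else if (show_values)
			n = snprintf(out + used, out_len - used, "0x%02x ", bytes[i]);
		else
			n = snprintf(out + used, out_len - used, "! ");
		if (n < 0)
			break;
		used += (size_t)n;
	}
}
```

Fed with seven redzone bytes of which only the first was overwritten with
``0xac``, the sketch produces ``0xac . . . . . .`` when values are shown and
``! . . . . . .`` when they are hidden.
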
And finally, KFENCE may also report on invalid accesses to any protected page
where it was not possible to determine an associated object, e.g. if adjacent
object pages had not yet been allocated::

    ==================================================================
    BUG: KFENCE: invalid read in test_invalid_access+0x26/0xe0

    Invalid read at 0xffffffffb670b00a:
     test_invalid_access+0x26/0xe0
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    CPU: 4 PID: 124 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    ==================================================================

DebugFS interface
~~~~~~~~~~~~~~~~~

Some debugging information is exposed via debugfs:

* The file ``/sys/kernel/debug/kfence/stats`` provides runtime statistics.

* The file ``/sys/kernel/debug/kfence/objects`` provides a list of objects
  allocated via KFENCE, including those already freed but protected.

Implementation Details
----------------------

Guarded allocations are set up based on the sample interval. After expiration
of the sample interval, the next allocation through the main allocator (SLAB or
SLUB) returns a guarded allocation from the KFENCE object pool (allocation
sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and
the next allocation is set up after the expiration of the interval. To "gate" a
KFENCE allocation through the main allocator's fast-path without overhead,
KFENCE relies on static branches via the static keys infrastructure. The static
branch is toggled to redirect the allocation to KFENCE.

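The gating logic can be approximated in user space as follows. This is only a
sketch: it swaps the kernel's static branch for a C11 atomic flag, which costs
a memory access on every call. Static keys exist to avoid that cost. The names
``sample_gate_open``, ``sample_timer_expired`` and
``should_sample_allocation`` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the static branch: true means the next allocation is sampled. */
static atomic_bool sample_gate_open;

/* Would be called by the timer when the sample interval expires. */
static inline void sample_timer_expired(void)
{
	atomic_store(&sample_gate_open, true);
}

/*
 * Called on the allocation path. Returns true for exactly one allocation
 * per timer expiry and closes the gate again, so later allocations take
 * the regular allocator path until the next expiry.
 */
static inline bool should_sample_allocation(void)
{
	return atomic_exchange(&sample_gate_open, false);
}
```

An allocator wrapper would call ``should_sample_allocation()`` first and fall
back to the regular allocator whenever it returns false.
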
KFENCE objects each reside on a dedicated page, at either the left or right
page boundaries selected at random. The pages to the left and right of the
object page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access. Such page faults are then
intercepted by KFENCE, which handles the fault gracefully by reporting an
out-of-bounds access, and marking the page as accessible so that the faulting
code can (wrongly) continue executing (set ``panic_on_warn`` to panic instead).

To detect out-of-bounds writes to memory within the object's page itself,
KFENCE also uses pattern-based redzones. For each object page, a redzone is set
up for all non-object memory. For typical alignments, the redzone is only
required on the unguarded side of an object. Because KFENCE must honor the
cache's requested alignment, special alignments may result in unprotected gaps
on either side of an object, all of which are redzoned.

The following figure illustrates the page layout::

    ---+-----------+-----------+-----------+-----------+-----------+---
       | xxxxxxxxx | O :       | xxxxxxxxx |       : O | xxxxxxxxx |
       | xxxxxxxxx | B :       | xxxxxxxxx |       : B | xxxxxxxxx |
       | x GUARD x | J : RED-  | x GUARD x | RED-  : J | x GUARD x |
       | xxxxxxxxx | E :  ZONE | xxxxxxxxx |  ZONE : E | xxxxxxxxx |
       | xxxxxxxxx | C :       | xxxxxxxxx |       : C | xxxxxxxxx |
       | xxxxxxxxx | T :       | xxxxxxxxx |       : T | xxxxxxxxx |
    ---+-----------+-----------+-----------+-----------+-----------+---

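The placement of an object against the right page boundary, and the gap that
alignment can leave, can be sketched in user-space C as below. The helper
names ``place_right`` and ``right_gap`` are hypothetical, and the 8-byte
alignment used in the example is an assumption, not a value taken from the
reports.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Address of an object of `size` bytes pushed against the right edge of
 * its page, then rounded down to `align` (which must be a power of two).
 */
static inline uint64_t place_right(uint64_t page_start, size_t page_size,
				   size_t size, size_t align)
{
	uint64_t addr = page_start + page_size - size;

	return addr & ~((uint64_t)align - 1);
}

/* Bytes between the end of the object and the end of its page. */
static inline size_t right_gap(uint64_t page_start, size_t page_size,
			       uint64_t obj_addr, size_t size)
{
	return (size_t)(page_start + page_size - (obj_addr + size));
}
```

For a 73-byte object on the page starting at ``0xffffffffb6797000``, with the
assumed 8-byte alignment, ``place_right`` yields ``0xffffffffb6797fb0``. This
matches the object range in the memory corruption report above. The remaining
7 bytes up to the page end are not covered by the guard page and therefore
have to be redzoned.
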
Upon deallocation of a KFENCE object, the object's page is again protected and
the object is marked as freed. Any further access to the object causes a fault
and KFENCE reports a use-after-free access. Freed objects are inserted at the
tail of KFENCE's freelist, so that the least recently freed objects are reused
first, and the chance of detecting use-after-frees of recently freed objects
is increased.

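The reuse order described above is first-in, first-out. The following
user-space sketch models it with a small ring buffer rather than the kernel's
own list handling. ``POOL_OBJECTS``, ``struct freelist`` and the two helpers
are hypothetical names chosen for this illustration.

```c
#include <stddef.h>

#define POOL_OBJECTS 4	/* tiny pool, for illustration only */

struct freelist {
	int slots[POOL_OBJECTS];	/* object indices, oldest at head */
	size_t head;
	size_t count;
};

/* Append a freed object at the tail. Returns 0, or -1 if the list is full. */
static inline int freelist_push_tail(struct freelist *fl, int obj)
{
	if (fl->count == POOL_OBJECTS)
		return -1;
	fl->slots[(fl->head + fl->count) % POOL_OBJECTS] = obj;
	fl->count++;
	return 0;
}

/* Take the least recently freed object from the head, or -1 if empty. */
static inline int freelist_pop_head(struct freelist *fl)
{
	int obj;

	if (fl->count == 0)
		return -1;
	obj = fl->slots[fl->head];
	fl->head = (fl->head + 1) % POOL_OBJECTS;
	fl->count--;
	return obj;
}
```

Because a just-freed object goes to the tail, every object freed earlier is
handed out before it. The freed object's page therefore stays protected for
longer.
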
Interface
---------

The following describes the functions which are used by allocators as well as
page handling code to set up and deal with KFENCE allocations.

.. kernel-doc:: include/linux/kfence.h
   :functions: is_kfence_address
               kfence_shutdown_cache
               kfence_alloc kfence_free __kfence_free
               kfence_ksize kfence_object_start
               kfence_handle_page_fault

Related Tools
-------------

In userspace, a similar approach is taken by `GWP-ASan
<http://llvm.org/docs/GwpAsan.html>`_. GWP-ASan also relies on guard pages and
a sampling strategy to detect memory unsafety bugs at scale. KFENCE's design is
directly influenced by GWP-ASan, and can be seen as its kernel sibling. Another
similar but non-sampling approach, which also inspired the name "KFENCE", can
be found in the userspace `Electric Fence Malloc Debugger
<https://linux.die.net/man/3/efence>`_.

In the kernel, several tools exist to debug memory access errors, and in
particular KASAN can detect all bug classes that KFENCE can detect. While KASAN
is more precise, relying on compiler instrumentation, this comes at a
performance cost.

It is worth highlighting that KASAN and KFENCE are complementary, with
different target environments. For instance, KASAN is the better debugging aid
where test cases or reproducers exist: because KFENCE has a lower chance of
detecting the error, debugging with KFENCE would require more effort.
Deployments at scale that cannot afford to enable KASAN, however, would benefit
from using KFENCE to discover bugs due to code paths not exercised by test
cases or fuzzers.