.. SPDX-License-Identifier: GPL-2.0

=====================================
Virtually Mapped Kernel Stack Support
=====================================

:Author: Shuah Khan <skhan@linuxfoundation.org>

.. contents:: :local:

Overview
--------

This is a compilation of information from the code and the original patch
series that introduced the `Virtually Mapped Kernel Stacks feature
<https://lwn.net/Articles/694348/>`_.

Introduction
------------

Kernel stack overflows are often hard to debug and make the kernel
susceptible to exploits. Problems may show up only at a later time, making
them difficult to isolate and root-cause.

Virtually mapped kernel stacks with guard pages cause kernel stack
overflows to be caught immediately rather than causing
difficult-to-diagnose corruption.

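The guard page idea can be tried out in user space. The following is a
minimal sketch of an analogy, not kernel code: it maps a region with one
inaccessible page below it and checks that a write just below the usable
area faults immediately. The helper names alloc_guarded() and
write_faults() are hypothetical and exist only for this illustration.

::

        #define _GNU_SOURCE
        #include <setjmp.h>
        #include <signal.h>
        #include <stddef.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static sigjmp_buf fault_env;

        static void on_fault(int sig)
        {
                (void)sig;
                siglongjmp(fault_env, 1);
        }

        /* Map usable_len bytes preceded by one inaccessible guard page. */
        static char *alloc_guarded(size_t usable_len)
        {
                size_t page = (size_t)sysconf(_SC_PAGESIZE);
                char *base = mmap(NULL, page + usable_len,
                                  PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (base == MAP_FAILED)
                        return NULL;
                if (mprotect(base, page, PROT_NONE) != 0) {
                        munmap(base, page + usable_len);
                        return NULL;
                }
                return base + page;     /* lowest usable address */
        }

        /* Return 1 if a one-byte write to addr faults, 0 if it succeeds. */
        static int write_faults(volatile char *addr)
        {
                struct sigaction sa, old_segv, old_bus;
                int faulted;

                memset(&sa, 0, sizeof(sa));
                sa.sa_handler = on_fault;
                sigemptyset(&sa.sa_mask);
                sigaction(SIGSEGV, &sa, &old_segv);
                sigaction(SIGBUS, &sa, &old_bus);

                if (sigsetjmp(fault_env, 1) == 0) {
                        *addr = 0;
                        faulted = 0;
                } else {
                        faulted = 1;
                }

                sigaction(SIGSEGV, &old_segv, NULL);
                sigaction(SIGBUS, &old_bus, NULL);
                return faulted;
        }

A write to the address returned by alloc_guarded() succeeds, while a write
one byte below it lands in the guard page and faults at once instead of
silently corrupting a neighbouring allocation.
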
The HAVE_ARCH_VMAP_STACK and VMAP_STACK configuration options enable
support for virtually mapped stacks with guard pages. This feature
causes reliable faults when the stack overflows. The usability of
the stack trace after an overflow and the response to the overflow itself
are architecture dependent.

.. note::
        As of this writing, arm64, powerpc, riscv, s390, um, and x86 have
        support for VMAP_STACK.

HAVE_ARCH_VMAP_STACK
--------------------

Architectures that can support Virtually Mapped Kernel Stacks should
enable this bool configuration option. The requirements are:

- vmalloc space must be large enough to hold many kernel stacks. This
  may rule out many 32-bit architectures.
- Stacks in vmalloc space need to work reliably. For example, if
  vmap page tables are created on demand, either this mechanism
  needs to work while the stack points to a virtual address with
  unpopulated page tables or arch code (switch_to() and switch_mm(),
  most likely) needs to ensure that the stack's page table entries
  are populated before running on a possibly unpopulated stack.
- If the stack overflows into a guard page, something reasonable
  should happen. The definition of "reasonable" is flexible, but
  instantly rebooting without logging anything would be unfriendly.

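The first requirement comes down to simple arithmetic. The sketch below
estimates how many stacks fit into a given amount of vmalloc address space
when every stack also consumes guard pages. The function name and all of
the numbers used with it are illustrative assumptions, not values taken
from any particular architecture.

::

        #include <stddef.h>

        /*
         * Upper bound on the number of stacks that fit in vmalloc_bytes of
         * address space when each stack occupies thread_size bytes plus
         * guard_pages pages of page_size bytes.
         */
        static size_t max_vmap_stacks(size_t vmalloc_bytes, size_t thread_size,
                                      size_t page_size, size_t guard_pages)
        {
                size_t per_stack = thread_size + guard_pages * page_size;

                return per_stack ? vmalloc_bytes / per_stack : 0;
        }

With an assumed 128 MiB of vmalloc space, 16 KiB stacks, 4 KiB pages and
one guard page per stack, the bound is 6553 stacks. The practical limit is
lower, because vmalloc space is shared with other users.
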
VMAP_STACK
----------

When enabled, the VMAP_STACK bool configuration option allocates virtually
mapped task stacks. This option depends on HAVE_ARCH_VMAP_STACK.

- Enable this if you want to use virtually mapped kernel stacks
  with guard pages. This causes kernel stack overflows to be caught
  immediately rather than causing difficult-to-diagnose corruption.

.. note::

        Using this feature with KASAN requires architecture support
        for backing virtual mappings with real shadow memory, and
        KASAN_VMALLOC must be enabled.

.. note::

        When VMAP_STACK is enabled, it is not possible to run DMA on
        stack-allocated data.

Kernel configuration options and dependencies keep changing. Refer to
the latest code base:

`Kconfig <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/Kconfig>`_

Allocation
----------

When a new kernel thread is created, its stack is allocated from
virtually contiguous memory pages obtained from the page-level allocator.
These pages are mapped into contiguous kernel virtual space with
PAGE_KERNEL protections.

alloc_thread_stack_node() calls __vmalloc_node_range() to allocate the
stack with PAGE_KERNEL protections.

- Allocated stacks are cached and later reused by new threads, so memcg
  accounting is performed manually on assigning/releasing stacks to tasks.
  Hence, __vmalloc_node_range() is called without __GFP_ACCOUNT.
- The vm_struct is cached so that it can be found when freeing the thread
  stack is initiated in interrupt context. free_thread_stack() can be
  called in interrupt context.
- On arm64, all VMAP'd stacks need to have the same alignment to ensure
  that VMAP'd stack overflow detection works correctly. The arch-specific
  vmap stack allocator takes care of this detail.
- According to the original patch, this does not address interrupt stacks.

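The caching behaviour described in the list above can be modelled in a few
lines of user-space C. This is a single-threaded sketch under stated
assumptions, not the code in kernel/fork.c: malloc() stands in for the
vmalloc call, the cache depth and stack size are arbitrary values chosen
for the illustration, and the helper names are hypothetical. Memcg
accounting and interrupt-context handling are left out.

::

        #include <stdlib.h>

        #define NR_CACHED_STACKS 2      /* assumed cache depth for this sketch */
        #define STACK_SIZE (16 * 1024)  /* assumed stack size for this sketch */

        static void *cached_stacks[NR_CACHED_STACKS];

        /* Reuse a cached stack if one is available, else allocate a new one. */
        static void *alloc_stack(void)
        {
                for (int i = 0; i < NR_CACHED_STACKS; i++) {
                        void *s = cached_stacks[i];

                        if (s) {
                                cached_stacks[i] = NULL;
                                return s;
                        }
                }
                return malloc(STACK_SIZE);      /* stands in for vmalloc */
        }

        /* Park the stack in an empty slot; free it only if the cache is full. */
        static void free_stack(void *stack)
        {
                for (int i = 0; i < NR_CACHED_STACKS; i++) {
                        if (!cached_stacks[i]) {
                                cached_stacks[i] = stack;
                                return;
                        }
                }
                free(stack);
        }

Freeing a stack and then allocating again returns the same memory without
going back to the allocator, which is why accounting has to happen when a
stack is assigned to or released from a task rather than at allocation
time.
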
Thread stack allocation is initiated from clone(), fork(), vfork(), and
kernel_thread() via kernel_clone(). These names are useful starting points
when searching the code base to understand when and how the thread stack
is allocated.

The bulk of the code is in
`kernel/fork.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c>`_.

The stack_vm_area pointer in task_struct keeps track of the virtually
allocated stack, and a non-null stack_vm_area pointer serves as an
indication that virtually mapped kernel stacks are enabled.

::

        struct vm_struct *stack_vm_area;

Stack overflow handling
-----------------------

Leading and trailing guard pages help detect stack overflows. When the
stack overflows into the guard pages, handlers have to be careful not to
overflow the stack again. When handlers are called, it is likely that very
little stack space is left.

On x86, this is done by handling the page fault indicating the kernel
stack overflow on the double-fault stack.

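Before a handler can report a stack overflow, it has to decide whether the
faulting address actually lies in a guard page. The following is a minimal
sketch of such a range check in plain C. The function name and parameters
are hypothetical, and this is not a copy of any architecture's handler.

::

        #include <stdbool.h>
        #include <stdint.h>

        /*
         * True if fault_addr lies in the page_size bytes immediately below
         * stack_base, the lowest usable stack address. Unsigned wraparound
         * turns addresses at or above stack_base into a huge value, so a
         * single comparison covers both bounds.
         */
        static bool hit_leading_guard(uintptr_t stack_base, uintptr_t fault_addr,
                                      uintptr_t page_size)
        {
                return stack_base - 1 - fault_addr < page_size;
        }

For a stack whose lowest usable address is 0x10000 and a 4 KiB guard page,
faults at 0xf000 through 0xffff are classified as overflows, while 0xefff
and 0x10000 are not.
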
Testing VMAP allocation with guard pages
----------------------------------------

How do we ensure that VMAP_STACK is actually allocating with a leading
and trailing guard page? The following lkdtm tests can help detect any
regressions.

::

        void lkdtm_STACK_GUARD_PAGE_LEADING()
        void lkdtm_STACK_GUARD_PAGE_TRAILING()

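These tests are normally triggered by writing the crash type name to
lkdtm's debugfs trigger file. The sketch below shows one way to do that
from a small C program. It assumes that lkdtm is built in or loaded, that
debugfs is mounted at /sys/kernel/debug, and that the program runs as
root. The helper name lkdtm_trigger() is hypothetical.

::

        #include <stdio.h>

        /*
         * Write an lkdtm crash type name to the given trigger file.
         * Returns 0 on success, -1 on error.
         */
        static int lkdtm_trigger(const char *trigger_path, const char *crashtype)
        {
                FILE *f = fopen(trigger_path, "w");

                if (!f)
                        return -1;
                if (fputs(crashtype, f) == EOF) {
                        fclose(f);
                        return -1;
                }
                return fclose(f) == 0 ? 0 : -1;
        }

Under those assumptions, calling
lkdtm_trigger("/sys/kernel/debug/provoke-crash/DIRECT",
"STACK_GUARD_PAGE_LEADING") deliberately touches the guard page and is
expected to produce a kernel fault report, so it should only be run on a
test machine.
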
Conclusions
-----------

- A percpu cache of vmalloced stacks appears to be a bit faster than a
  high-order stack allocation, at least when the cache hits.
- THREAD_INFO_IN_TASK gets rid of arch-specific thread_info entirely and
  simply embeds the thread_info (containing only flags) and 'int cpu' into
  task_struct.
- The thread stack can be freed as soon as the task is dead (without
  waiting for RCU) and, if vmapped stacks are in use, the entire stack can
  then be cached for reuse on the same CPU.