.. SPDX-License-Identifier: GPL-2.0

=====================================
Virtually Mapped Kernel Stack Support
=====================================

:Author: Shuah Khan <skhan@linuxfoundation.org>

.. contents:: :local:

Overview
--------

This is a compilation of information from the code and the original patch
series that introduced the `Virtually Mapped Kernel Stacks feature
<https://lwn.net/Articles/694348/>`_.

Introduction
------------

Kernel stack overflows are often hard to debug and make the kernel
susceptible to exploits. Problems can show up long after the overflow,
making them difficult to isolate and root-cause.

Virtually mapped kernel stacks with guard pages cause kernel stack
overflows to be caught immediately rather than causing
difficult-to-diagnose corruption.

The HAVE_ARCH_VMAP_STACK and VMAP_STACK configuration options enable
support for virtually mapped stacks with guard pages. This feature
causes reliable faults when the stack overflows. The usability of
the stack trace after an overflow and the response to the overflow
itself are architecture dependent.

.. note::
        As of this writing, arm64, powerpc, riscv, s390, um, and x86 have
        support for VMAP_STACK.

HAVE_ARCH_VMAP_STACK
--------------------

Architectures that can support Virtually Mapped Kernel Stacks should
enable this bool configuration option. The requirements are:

- vmalloc space must be large enough to hold many kernel stacks. This
  may rule out many 32-bit architectures.
- Stacks in vmalloc space need to work reliably. For example, if
  vmap page tables are created on demand, either this mechanism
  needs to work while the stack points to a virtual address with
  unpopulated page tables or arch code (switch_to() and switch_mm(),
  most likely) needs to ensure that the stack's page table entries
  are populated before running on a possibly unpopulated stack.
- If the stack overflows into a guard page, something reasonable
  should happen. The definition of "reasonable" is flexible, but
  instantly rebooting without logging anything would be unfriendly.

VMAP_STACK
----------

When enabled, the VMAP_STACK bool configuration option allocates virtually
mapped task stacks. This option depends on HAVE_ARCH_VMAP_STACK.

- Enable this if you want to use virtually mapped kernel stacks
  with guard pages. This causes kernel stack overflows to be caught
  immediately rather than causing difficult-to-diagnose corruption.

.. note::

        Using this feature with KASAN requires architecture support
        for backing virtual mappings with real shadow memory, and
        KASAN_VMALLOC must be enabled.

.. note::

        When VMAP_STACK is enabled, it is not possible to run DMA on
        stack-allocated data.

Kernel configuration options and dependencies keep changing. Refer to
the latest code base:

`Kconfig <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/Kconfig>`_

Allocation
----------

When a new kernel thread is created, a thread stack is allocated from
virtually contiguous memory pages obtained from the page-level allocator.
These pages are mapped into contiguous kernel virtual space with
PAGE_KERNEL protections.

alloc_thread_stack_node() calls __vmalloc_node_range() to allocate the
stack with PAGE_KERNEL protections.

- Allocated stacks are cached and later reused by new threads, so memcg
  accounting is performed manually on assigning/releasing stacks to tasks.
  Hence, __vmalloc_node_range() is called without __GFP_ACCOUNT.
- The vm_struct is cached so that it can be found when thread free is
  initiated in interrupt context. free_thread_stack() can be called in
  interrupt context.
- On arm64, all VMAP'd stacks need to have the same alignment to ensure
  that VMAP'd stack overflow detection works correctly. The arch-specific
  vmap stack allocator takes care of this detail.
- According to the original patch, this does not address interrupt stacks.

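The arm64 alignment requirement in the list above can be illustrated with a
small userspace model. This is only a sketch under stated assumptions: it
assumes stacks are aligned to twice their size so that a single address bit
distinguishes "inside the stack" from "just below it", and it assumes 16 KiB
stacks. The helper names ``sp_overflowed()``, ``align_stack_base()`` and
``check_one_stack()`` are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration: 16 KiB stacks (THREAD_SHIFT = 14). */
#define THREAD_SHIFT 14
#define THREAD_SIZE  ((uintptr_t)1 << THREAD_SHIFT)
#define THREAD_ALIGN (2 * THREAD_SIZE)

/*
 * With every stack base aligned to THREAD_ALIGN, bit THREAD_SHIFT is clear
 * for any address in [base, base + THREAD_SIZE) and set for any address in
 * the THREAD_SIZE bytes just below base, where an overflow would land.
 */
static bool sp_overflowed(uintptr_t sp)
{
	return (sp & THREAD_SIZE) != 0;
}

/* Hypothetical helper: round an address up to a THREAD_ALIGN boundary. */
static uintptr_t align_stack_base(uintptr_t addr)
{
	return (addr + THREAD_ALIGN - 1) & ~(THREAD_ALIGN - 1);
}

/* Self-check for one modelled stack; base must be THREAD_ALIGN-aligned. */
static void check_one_stack(uintptr_t base)
{
	assert((base & (THREAD_ALIGN - 1)) == 0);
	assert(!sp_overflowed(base));                   /* lowest valid byte */
	assert(!sp_overflowed(base + THREAD_SIZE - 1)); /* highest valid byte */
	assert(sp_overflowed(base - 1));                /* first byte below */
}
```

If stacks had differing alignments, the same bit test would misclassify some
valid stack addresses, which is why a uniform alignment matters for this kind
of cheap check.
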
Thread stack allocation is initiated from clone(), fork(), vfork(), and
kernel_thread() via kernel_clone(). These names are useful hints for
searching the code base to understand when and how a thread stack is
allocated.

The bulk of the code is in
`kernel/fork.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c>`_.

The stack_vm_area pointer in task_struct keeps track of the virtually
allocated stack, and a non-null stack_vm_area pointer serves as an
indication that virtually mapped kernel stacks are enabled.

::

        struct vm_struct *stack_vm_area;

Stack overflow handling
-----------------------

Leading and trailing guard pages help detect stack overflows. When the
stack overflows into a guard page, handlers have to be careful not to
overflow the stack again. When handlers are called, it is likely that very
little stack space is left.

On x86, this is done by handling the page fault that indicates the kernel
stack overflow on the double-fault stack.

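The guard-page idea can be tried out in userspace. The sketch below is not
kernel code and does not use the vmalloc allocator: it is a model that
reserves an inaccessible region with mmap() and makes only the middle pages
readable and writable, so that any access just before or just after the
usable range faults immediately. The ``struct guarded_stack`` type and both
function names are hypothetical.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on strict C standards */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical model of a stack with leading and trailing guard pages. */
struct guarded_stack {
	void *map;         /* start of the whole mapping, guards included */
	size_t map_len;    /* total length of the mapping */
	char *usable;      /* first usable byte */
	size_t usable_len; /* number of usable bytes */
};

/* Reserve guard + usable + guard with no access, then open up the middle. */
static int guarded_stack_alloc(struct guarded_stack *s, size_t npages)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	assert(s && npages > 0);
	s->map_len = (npages + 2) * page;
	s->map = mmap(NULL, s->map_len, PROT_NONE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (s->map == MAP_FAILED)
		return -1;

	s->usable = (char *)s->map + page;
	s->usable_len = npages * page;
	if (mprotect(s->usable, s->usable_len, PROT_READ | PROT_WRITE) != 0) {
		munmap(s->map, s->map_len);
		return -1;
	}
	return 0;
}

/* Release the whole mapping, guard pages included. */
static void guarded_stack_free(struct guarded_stack *s)
{
	munmap(s->map, s->map_len);
}
```

In this model, touching ``usable[-1]`` or ``usable[usable_len]`` raises
SIGSEGV at the faulting access, which mirrors the "caught immediately"
behaviour described above.
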
Testing VMAP allocation with guard pages
----------------------------------------

How do we ensure that VMAP_STACK is actually allocating with a leading
and trailing guard page? The following lkdtm tests can help detect any
regressions.

::

        void lkdtm_STACK_GUARD_PAGE_LEADING()
        void lkdtm_STACK_GUARD_PAGE_TRAILING()

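lkdtm tests are normally triggered by writing the crash type name to a
debugfs file. The sketch below assumes that debugfs is mounted at
``/sys/kernel/debug``, that the kernel was built with lkdtm support, and that
the crash type names match the function names above without the ``lkdtm_``
prefix. The ``lkdtm_trigger()`` helper is hypothetical. Triggering either
test is expected to crash or oops the running kernel, so this belongs on a
disposable test system only.

```c
#include <assert.h>
#include <stdio.h>

/* Usual lkdtm trigger file, assuming debugfs is at /sys/kernel/debug. */
#define LKDTM_DIRECT "/sys/kernel/debug/provoke-crash/DIRECT"

/* Write a crash type name to an lkdtm trigger file. Returns 0 on success. */
static int lkdtm_trigger(const char *path, const char *crash_type)
{
	FILE *f;
	int ret = 0;

	assert(path && crash_type);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(crash_type, f) == EOF)
		ret = -1;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```

A call such as ``lkdtm_trigger(LKDTM_DIRECT, "STACK_GUARD_PAGE_LEADING")``,
run as root, would then exercise the leading guard page. The path is a
parameter so that the helper can be tried against an ordinary file first.
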
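Conclusions
-----------

- A percpu cache of vmalloced stacks appears to be a bit faster than a
  high-order stack allocation, at least when the cache hits.
- THREAD_INFO_IN_TASK gets rid of arch-specific thread_info entirely and
  simply embeds the thread_info (containing only flags) and 'int cpu' into
  task_struct.
- The thread stack can be freed as soon as the task is dead (without
  waiting for RCU) and then, if vmapped stacks are in use, the
  entire stack can be cached for reuse on the same cpu.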
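
The caching behaviour mentioned in the first and last points can be modelled
in a few lines of userspace C. This is a simplified sketch, not the code in
kernel/fork.c: one ``struct stack_cache`` stands in for a single CPU's cache,
malloc() stands in for the vmalloc-based allocation, and the cache depth and
stack size are assumed values. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CACHED_STACKS 2          /* assumed small per-CPU cache depth */
#define STACK_BYTES (16 * 1024)     /* assumed stack size for the model */

/* One cache instance stands in for a single CPU's cache. */
struct stack_cache {
	void *slot[NR_CACHED_STACKS];
};

/* Take a cached stack if one is available, else allocate a fresh one. */
static void *stack_get(struct stack_cache *c)
{
	assert(c);
	for (size_t i = 0; i < NR_CACHED_STACKS; i++) {
		if (c->slot[i]) {
			void *s = c->slot[i];

			c->slot[i] = NULL;
			return s;
		}
	}
	return malloc(STACK_BYTES);
}

/* Park a dead task's stack in a free slot, or release it if the cache is full. */
static void stack_put(struct stack_cache *c, void *stack)
{
	assert(c);
	for (size_t i = 0; i < NR_CACHED_STACKS; i++) {
		if (!c->slot[i]) {
			c->slot[i] = stack;
			return;
		}
	}
	free(stack);
}
```

A cache hit in ``stack_get()`` skips the allocator entirely, which is the
effect the first conclusion refers to.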
				150	task_struct.
				151	- The thread stack can be free'ed as soon as the task is dead (without
				152	waiting for RCU) and then, if vmapped stacks are in use, cache the
				153	entire stack for reuse on the same cpu.