Blame - Documentation/x86/kernel-stacks.rst - SHIFTPHONES/mainline/linux

blob: 6b0bcf027ff1eda7736ba0965a81a2ab614d86ab

Changbin Du	ac2b468	2019-05-08 23:21:19 +0800	[diff] [blame]	1	.. SPDX-License-Identifier: GPL-2.0
				2
				3	=============
				4	Kernel Stacks
				5	=============
				6
Borislav Petkov	d724a9a	2015-05-26 10:28:19 +0200	[diff] [blame]	7	Kernel stacks on x86-64
Changbin Du	ac2b468	2019-05-08 23:21:19 +0800	[diff] [blame]	8	===========================
Borislav Petkov	d724a9a	2015-05-26 10:28:19 +0200	[diff] [blame]	9
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	10	Most of the text from Keith Owens, hacked by AK
				11
				12	x86_64 page size (PAGE_SIZE) is 4K.
				13
				14	Like all other architectures, x86_64 has a kernel stack for every
				15	active thread. These thread stacks are THREAD_SIZE (4*PAGE_SIZE) big.
				16	These stacks contain useful data as long as a thread is alive or a
				17	zombie. While the thread is in user space the kernel stack is empty
				18	except for the thread_info structure at the bottom.
				19
				20	In addition to the per thread stacks, there are specialized stacks
Randy Dunlap	57d3077	2007-02-13 13:26:23 +0100	[diff] [blame]	21	associated with each CPU. These stacks are only used while the kernel
				22	is in control on that CPU; when a CPU returns to user space the
				23	specialized stacks contain no useful data. The main CPU stacks are:
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	24
Alexander Kuleshov	0fe0965	2015-08-21 15:19:06 +0600	[diff] [blame]	25	* Interrupt stack. IRQ_STACK_SIZE
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	26
				27	Used for external hardware interrupts. If this is the first external
				28	hardware interrupt (i.e. not a nested hardware interrupt) then the
				29	kernel switches from the current task to the interrupt stack. Like
Christoph Hellwig	7974891	2010-06-28 14:15:54 +0200	[diff] [blame]	30	the split thread and interrupt stacks on i386, this gives more room
				31	for kernel interrupt processing without having to increase the size
				32	of every per thread stack.
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	33
				34	The interrupt stack is also used when processing a softirq.
				35
				36	Switching to the kernel interrupt stack is done by software based on a
				37	per CPU interrupt nest counter. This is needed because x86-64 "IST"
				38	hardware stacks cannot nest without races.
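
The counter-based switch described above can be modelled in a few lines of
user-space C. This is only a sketch, not the kernel's entry code: the names
cpu_model, cpu_init(), enter_irq() and exit_irq() are hypothetical, and the
convention that -1 means "not in an interrupt" is an assumption made for the
example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU; not the kernel's real data layout. */
enum stack_id { TASK_STACK, IRQ_STACK };

struct cpu_model {
    int irq_count;          /* nesting depth; -1 means "not in an interrupt" */
    enum stack_id current;  /* stack the CPU is currently running on */
};

static void cpu_init(struct cpu_model *cpu)
{
    cpu->irq_count = -1;
    cpu->current = TASK_STACK;
}

/* Returns true if this entry switched stacks (first, non-nested interrupt). */
static bool enter_irq(struct cpu_model *cpu)
{
    cpu->irq_count++;
    if (cpu->irq_count == 0) {
        cpu->current = IRQ_STACK;   /* leave the task stack */
        return true;
    }
    return false;                   /* nested: already on the interrupt stack */
}

/* Undo one level of nesting; only the outermost exit switches back. */
static void exit_irq(struct cpu_model *cpu)
{
    assert(cpu->irq_count >= 0);
    if (cpu->irq_count == 0)
        cpu->current = TASK_STACK;
    cpu->irq_count--;
}
```

Because the decision is made by software after looking at the counter, a
nested interrupt simply keeps pushing onto the interrupt stack it is already
on, instead of being reset to the top of that stack.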
				39
				40	x86_64 also has a feature which is not available on i386, the ability
				41	to automatically switch to a new stack for designated events such as
				42	double fault or NMI, which makes it easier to handle these unusual
				43	events on x86_64. This feature is called the Interrupt Stack Table
Randy Dunlap	57d3077	2007-02-13 13:26:23 +0100	[diff] [blame]	44	(IST). There can be up to 7 IST entries per CPU. The IST code is an
				45	index into the Task State Segment (TSS). The IST entries in the TSS
				46	point to dedicated stacks; each stack can be a different size.
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	47
Randy Dunlap	57d3077	2007-02-13 13:26:23 +0100	[diff] [blame]	48	An IST is selected by a non-zero value in the IST field of an
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	49	interrupt-gate descriptor. When an interrupt occurs and the hardware
				50	loads such a descriptor, the hardware automatically sets the new stack
				51	pointer based on the IST value, then invokes the interrupt handler. If
Andy Lutomirski	48e08d0	2014-11-11 12:49:41 -0800	[diff] [blame]	52	the interrupt came from user mode, then the interrupt handler prologue
				53	will switch back to the per-thread stack. If software wants to allow
				54	nested IST interrupts then the handler must adjust the IST values on
				55	entry to and exit from the interrupt handler. (This is occasionally
				56	done, e.g. for debug exceptions.)
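
The selection rule above can be sketched as a small C model. The structures
tss_model and gate_model and the function pick_stack_pointer() are
hypothetical simplifications: a real TSS and gate descriptor carry more
fields, and with an IST value of 0 the hardware may still switch stacks on a
privilege change, which this model ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IST_ENTRIES 7   /* up to 7 IST entries per CPU */

/* Simplified TSS: only the IST slots. ist[0] holds IST1, ist[6] holds IST7. */
struct tss_model {
    uint64_t ist[IST_ENTRIES];
};

/* Simplified interrupt gate: only the 3-bit IST field. 0 means "no IST". */
struct gate_model {
    uint8_t ist;
};

/*
 * Return the stack pointer the handler starts on: the TSS entry named by a
 * non-zero IST field, or the current stack pointer when the field is 0.
 */
static uint64_t pick_stack_pointer(const struct tss_model *tss,
                                   const struct gate_model *gate,
                                   uint64_t current_rsp)
{
    assert(tss != NULL && gate != NULL);

    uint8_t idx = gate->ist & 0x7;
    if (idx == 0)
        return current_rsp;
    return tss->ist[idx - 1];
}
```

Note that the result for a non-zero IST value does not depend on
current_rsp. A second event with the same IST value therefore starts at the
same address as the first and overwrites its frame, which is why the handler
has to adjust the IST entry if it wants to allow nesting.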
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	57
				58	Events with different IST codes (i.e. with different stacks) can be
				59	nested. For example, a debug interrupt can safely be interrupted by an
				60	NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
				61	pointers on entry to and exit from all IST events, in theory allowing
				62	IST events with the same code to be nested. However in most cases, the
				63	stack size allocated to an IST assumes no nesting for the same code.
				64	If that assumption is ever broken then the stacks will become corrupt.
				65
Changbin Du	ac2b468	2019-05-08 23:21:19 +0800	[diff] [blame]	66	The currently assigned IST stacks are:
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	67
Thomas Gleixner	8f34c5b	2019-04-14 17:59:45 +0200	[diff] [blame]	68	* ESTACK_DF. EXCEPTION_STKSZ (PAGE_SIZE).
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	69
				70	Used for interrupt 8 - Double Fault Exception (#DF).
				71
Randy Dunlap	57d3077	2007-02-13 13:26:23 +0100	[diff] [blame]	72	Invoked when handling one exception causes another exception. Happens
				73	when the kernel is very confused (e.g. kernel stack pointer corrupt).
				74	Using a separate stack allows the kernel to recover from it well enough
				75	in many cases to still output an oops.
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	76
Thomas Gleixner	8f34c5b	2019-04-14 17:59:45 +0200	[diff] [blame]	77	* ESTACK_NMI. EXCEPTION_STKSZ (PAGE_SIZE).
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	78
				79	Used for non-maskable interrupts (NMI).
				80
				81	NMI can be delivered at any time, including when the kernel is in the
				82	middle of switching stacks. Using IST for NMI events avoids making
				83	assumptions about the previous state of the kernel stack.
				84
Thomas Gleixner	2a594d4	2019-04-14 17:59:57 +0200	[diff] [blame]	85	* ESTACK_DB. EXCEPTION_STKSZ (PAGE_SIZE).
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	86
				87	Used for hardware debug interrupts (interrupt 1) and for software
				88	debug interrupts (INT3).
				89
				90	When debugging a kernel, debug interrupts (both hardware and
				91	software) can occur at any time. Using IST for these interrupts
				92	avoids making assumptions about the previous state of the kernel
				93	stack.
				94
Thomas Gleixner	2a594d4	2019-04-14 17:59:57 +0200	[diff] [blame]	95	To handle nested #DB correctly there exist two instances of DB stacks. On
				96	#DB entry the IST stack pointer for #DB is switched to the second instance
				97	so a nested #DB starts from a clean stack. The nested #DB switches
				98	the IST stack pointer to a guard hole to catch triple nesting.
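
The nested #DB scheme can be pictured as a three-state machine. The sketch
below is a user-space model under stated assumptions, not the kernel's
implementation: db_slot, db_ist_model, db_enter() and db_exit() are
hypothetical names, and the stacks are represented by enum values rather
than by addresses.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the #DB IST entry currently points (hypothetical model). */
enum db_slot { DB_FIRST = 0, DB_SECOND = 1, DB_GUARD = 2 };

struct db_ist_model {
    enum db_slot next;  /* slot the hardware would load on the next #DB */
};

/*
 * Model one #DB entry. *used receives the slot the event landed on.
 * Returns false if it landed in the guard hole (triple nesting), in which
 * case the IST entry is left unchanged.
 */
static bool db_enter(struct db_ist_model *m, enum db_slot *used)
{
    *used = m->next;
    if (m->next == DB_GUARD)
        return false;
    m->next = (enum db_slot)(m->next + 1);  /* shift IST for a nested #DB */
    return true;
}

/* Model one #DB exit: move the IST entry back by one slot. */
static void db_exit(struct db_ist_model *m)
{
    assert(m->next != DB_FIRST);
    m->next = (enum db_slot)(m->next - 1);
}
```

In this model the first #DB runs on the first instance, a nested #DB runs on
the second, and a third level is reported as a failure instead of silently
reusing a stack that is still live.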
				99
Thomas Gleixner	8f34c5b	2019-04-14 17:59:45 +0200	[diff] [blame]	100	* ESTACK_MCE. EXCEPTION_STKSZ (PAGE_SIZE).
Andi Kleen	352f7ba	2006-09-26 10:52:31 +0200	[diff] [blame]	101
				102	Used for interrupt 18 - Machine Check Exception (#MC).
				103
				104	MCE can be delivered at any time, including when the kernel is in the
				105	middle of switching stacks. Using IST for MCE events avoids making
				106	assumptions about the previous state of the kernel stack.
				107
				108	For more details see the Intel IA32 or AMD AMD64 architecture manuals.
Borislav Petkov	113b5e3	2015-05-26 10:28:20 +0200	[diff] [blame]	109
				110
				111	Printing backtraces on x86
Changbin Du	ac2b468	2019-05-08 23:21:19 +0800	[diff] [blame]	112	==========================
Borislav Petkov	113b5e3	2015-05-26 10:28:20 +0200	[diff] [blame]	113
				114	The question about the '?' preceding function names in an x86 stacktrace
				115	keeps popping up, so here is an in-depth explanation. It helps if the reader
				116	stares at print_context_stack() and the whole machinery in and around
				117	arch/x86/kernel/dumpstack.c.
				118
				119	Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>:
				120
				121	We always scan the full kernel stack for return addresses stored on
Changbin Du	ac2b468	2019-05-08 23:21:19 +0800	[diff] [blame]	122	the kernel stack(s) [1]_, from stack top to stack bottom, and print out
Borislav Petkov	113b5e3	2015-05-26 10:28:20 +0200	[diff] [blame]	123	anything that 'looks like' a kernel text address.
				124
				125	If it fits into the frame pointer chain, we print it without a question
				126	mark, knowing that it's part of the real backtrace.
				127
				128	If the address does not fit into our expected frame pointer chain we
				129	still print it, but we print a '?'. It can mean two things:
				130
				131	- either the address is not part of the call chain: it's just stale
				132	values on the kernel stack, from earlier function calls. This is
				133	the common case.
				134
				135	- or it is part of the call chain, but the frame pointer was not set
				136	up properly within the function, so we don't recognize it.
				137
				138	This way we will always print out the real call chain (plus a few more
				139	entries), regardless of whether the frame pointer was set up correctly
				140	or not - but in most cases we'll get the call chain right as well. The
				141	entries printed are strictly in stack order, so you can deduce more
				142	information from that as well.
				143
				144	The most important property of this method is that we _never_ lose
				145	information: we always strive to print _all_ addresses on the stack(s)
				146	that look like kernel text addresses, so if debug information is wrong,
				147	we still print out the real call chain as well - just with more question
				148	marks than ideal.
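
The scanning rule can be illustrated with a toy model in C. Everything here
is an assumption made for the example and does not mirror dumpstack.c: the
stack is an array of words with index 0 as the stack top, "kernel text" is
the range TEXT_START..TEXT_END, frame pointers are stored as array indexes,
and scan_stack() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TEXT_START 0x1000u        /* assumed kernel text range for the model */
#define TEXT_END   0x2000u
#define NO_FRAME   ((size_t)-1)   /* frame pointer chain has ended */

static bool looks_like_text(uint64_t word)
{
    return word >= TEXT_START && word < TEXT_END;
}

/*
 * Scan stack[0..n) from top (index 0) to bottom. fp is the index of the
 * slot holding the innermost frame's saved frame pointer; slot fp + 1 holds
 * that frame's return address. Every word that looks like a text address is
 * appended to out, prefixed with "? " unless it sits where the frame
 * pointer chain expects a return address. Returns the number of entries
 * printed without a question mark.
 */
static size_t scan_stack(const uint64_t *stack, size_t n, size_t fp,
                         char *out, size_t out_len)
{
    assert(stack != NULL && out != NULL && out_len > 0);

    size_t reliable = 0, used = 0;
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        if (!looks_like_text(stack[i]))
            continue;

        bool in_chain = (fp != NO_FRAME && i == fp + 1);
        int w = snprintf(out + used, out_len - used, "%s0x%llx\n",
                         in_chain ? "" : "? ",
                         (unsigned long long)stack[i]);
        if (w < 0 || (size_t)w >= out_len - used)
            break;              /* output buffer full */
        used += (size_t)w;

        if (in_chain) {
            reliable++;
            uint64_t next = stack[fp];  /* follow the saved frame pointer */
            fp = (next < n && next > fp) ? (size_t)next : NO_FRAME;
        }
    }
    return reliable;
}
```

Stale return addresses left over from earlier calls are still printed, in
stack order, but carry a '?'. If a frame pointer is wrong, the chain stops
matching and later entries are also printed with a '?', so no candidate
address is dropped.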
				149
Changbin Du	ac2b468	2019-05-08 23:21:19 +0800	[diff] [blame]	150	.. [1] For things like IRQ and IST stacks, we also scan those stacks, in
				151	the right order, and try to cross from one stack into another
				152	reconstructing the call chain. This works most of the time.