.. SPDX-License-Identifier: GPL-2.0

=============
Kernel Stacks
=============

Kernel stacks on x86-64 bit
===========================

Most of the text from Keith Owens, hacked by AK

x86_64 page size (PAGE_SIZE) is 4K.

Like all other architectures, x86_64 has a kernel stack for every
active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
These stacks contain useful data as long as a thread is alive or a
zombie. While the thread is in user space the kernel stack is empty
except for the thread_info structure at the bottom.

In addition to the per-thread stacks, there are specialized stacks
associated with each CPU. These stacks are only used while the kernel
is in control on that CPU; when a CPU returns to user space the
specialized stacks contain no useful data. The main CPU stacks are:

* Interrupt stack. IRQ_STACK_SIZE

  Used for external hardware interrupts. If this is the first external
  hardware interrupt (i.e. not a nested hardware interrupt) then the
  kernel switches from the current task to the interrupt stack. Like
  the split thread and interrupt stacks on i386, this gives more room
  for kernel interrupt processing without having to increase the size
  of every per-thread stack.

  The interrupt stack is also used when processing a softirq.

Switching to the kernel interrupt stack is done by software based on a
per-CPU interrupt nest counter. This is needed because x86-64 "IST"
hardware stacks cannot nest without races.
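The counter-based switch described above can be modelled in user space
roughly as follows. This is only a sketch: the names cpu_model,
irq_enter_model() and irq_exit_model() are made up for the example and are
not kernel identifiers, and the real entry code does this in assembly on
per-CPU data.

```c
#include <assert.h>

/* Hypothetical user-space model: none of these names exist in the kernel. */
enum stack_kind { TASK_STACK, IRQ_STACK };

struct cpu_model {
	int irq_nest;              /* number of hardware interrupts in progress */
	enum stack_kind current;   /* stack this CPU is running on */
};

/* Interrupt entry: only the first (non-nested) interrupt switches stacks. */
void irq_enter_model(struct cpu_model *cpu)
{
	if (cpu->irq_nest++ == 0)
		cpu->current = IRQ_STACK;
}

/* Interrupt exit: only the outermost interrupt returns to the task stack. */
void irq_exit_model(struct cpu_model *cpu)
{
	assert(cpu->irq_nest > 0);
	if (--cpu->irq_nest == 0)
		cpu->current = TASK_STACK;
}
```

A nested interrupt only bumps the counter and keeps running on the
interrupt stack, so the task stack is left untouched until the outermost
interrupt returns.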

x86_64 also has a feature which is not available on i386, the ability
to automatically switch to a new stack for designated events such as
double fault or NMI, which makes it easier to handle these unusual
events on x86_64. This feature is called the Interrupt Stack Table
(IST). There can be up to 7 IST entries per CPU. The IST code is an
index into the Task State Segment (TSS). The IST entries in the TSS
point to dedicated stacks; each stack can be a different size.

An IST is selected by a non-zero value in the IST field of an
interrupt-gate descriptor. When an interrupt occurs and the hardware
loads such a descriptor, the hardware automatically sets the new stack
pointer based on the IST value, then invokes the interrupt handler. If
the interrupt came from user mode, then the interrupt handler prologue
will switch back to the per-thread stack. If software wants to allow
nested IST interrupts then the handler must adjust the IST values on
entry to and exit from the interrupt handler. (This is occasionally
done, e.g. for debug exceptions.)
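As an illustration of the selection rule above, the following sketch picks
the starting stack pointer from a gate's IST field. The types tss_model and
gate_model and the function select_stack() are hypothetical simplifications,
not the hardware layout or any kernel structure, and the sketch ignores the
privilege-level stack switch that can happen when the IST field is zero.

```c
#include <assert.h>
#include <stdint.h>

#define IST_ENTRIES 7   /* up to 7 IST entries per CPU */

/* Hypothetical, simplified stand-ins for the TSS and a gate descriptor. */
struct tss_model {
	uint64_t ist[IST_ENTRIES];   /* top-of-stack address for IST 1..7 */
};

struct gate_model {
	unsigned int ist;            /* 0 = no IST, 1..7 = index into the TSS */
};

/*
 * Return the stack pointer the handler would start on: the IST stack
 * when the gate's IST field is non-zero, otherwise the current stack.
 */
uint64_t select_stack(const struct tss_model *tss,
		      const struct gate_model *gate, uint64_t current_sp)
{
	assert(gate->ist <= IST_ENTRIES);
	if (gate->ist == 0)
		return current_sp;
	return tss->ist[gate->ist - 1];
}
```

Because the lookup depends only on the gate and the TSS, the handler starts
at the same address every time, whatever state the interrupted stack was in.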

Events with different IST codes (i.e. with different stacks) can be
nested. For example, a debug interrupt can safely be interrupted by an
NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
pointers on entry to and exit from all IST events, in theory allowing
IST events with the same code to be nested. However, in most cases the
stack size allocated to an IST assumes no nesting for the same code.
If that assumption is ever broken then the stacks will become corrupt.
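Why same-code nesting corrupts the stack can be seen in a small model: if
the IST value is not adjusted, every entry starts from the same fixed top
of stack, so a nested event overwrites the frame of the event it
interrupted. The structure ist_stack_model and the function
ist_entry_push() are invented for this sketch and stand in for what the
hardware does on entry.

```c
#include <assert.h>
#include <stddef.h>

#define IST_STACK_WORDS 8

/* Hypothetical model of one IST stack; it grows down from the top. */
struct ist_stack_model {
	unsigned long words[IST_STACK_WORDS];
};

/*
 * Model entry through an IST gate whose IST value is never adjusted:
 * the stack pointer is always reset to the fixed top of the stack,
 * then one frame word is pushed. Returns the index of the pushed word.
 */
size_t ist_entry_push(struct ist_stack_model *stk, unsigned long frame)
{
	size_t sp = IST_STACK_WORDS;   /* fixed top, regardless of nesting */

	assert(stk != NULL);
	stk->words[--sp] = frame;
	return sp;
}
```

Calling ist_entry_push() twice without an intervening return puts the
second frame in the slot the first one occupied, which is the corruption
the paragraph above warns about.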

The currently assigned IST stacks are:

* ESTACK_DF. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 8 - Double Fault Exception (#DF).

  Invoked when handling one exception causes another exception. Happens
  when the kernel is very confused (e.g. kernel stack pointer corrupt).
  Using a separate stack allows the kernel to recover from it well enough
  in many cases to still output an oops.

* ESTACK_NMI. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for non-maskable interrupts (NMI).

  NMI can be delivered at any time, including when the kernel is in the
  middle of switching stacks. Using IST for NMI events avoids making
  assumptions about the previous state of the kernel stack.

* ESTACK_DB. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for hardware debug interrupts (interrupt 1) and for software
  debug interrupts (INT3).

  When debugging a kernel, debug interrupts (both hardware and
  software) can occur at any time. Using IST for these interrupts
  avoids making assumptions about the previous state of the kernel
  stack.

  To handle nested #DB correctly there exist two instances of DB stacks. On
  #DB entry the IST stack pointer for #DB is switched to the second instance
  so a nested #DB starts from a clean stack. The nested #DB switches
  the IST stack pointer to a guard hole to catch triple nesting.

* ESTACK_MCE. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 18 - Machine Check Exception (#MC).

  MCE can be delivered at any time, including when the kernel is in the
  middle of switching stacks. Using IST for MCE events avoids making
  assumptions about the previous state of the kernel stack.

For more details see the Intel IA32 or AMD AMD64 architecture manuals.
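The nested #DB handling described for ESTACK_DB above can be pictured as a
three-state model of where the #DB IST entry points. The enum values and
the functions db_entry() and db_exit() are hypothetical names for this
sketch. In the model, landing on DB_GUARD_HOLE stands for the fault that
real hardware would raise when it touches the guard hole.

```c
#include <assert.h>

/* Hypothetical labels for where the #DB IST entry can point. */
enum db_target { DB_FIRST, DB_SECOND, DB_GUARD_HOLE };

struct db_ist_model {
	enum db_target ist;   /* where the next #DB would start */
};

/*
 * Model #DB entry: the handler runs on the stack the IST entry pointed
 * to, and the entry is advanced so that a nested #DB lands elsewhere.
 * Returns the stack this invocation runs on.
 */
enum db_target db_entry(struct db_ist_model *m)
{
	enum db_target running_on = m->ist;

	if (m->ist != DB_GUARD_HOLE)
		m->ist = (enum db_target)(m->ist + 1);
	return running_on;
}

/* Model #DB exit: point the IST entry back at the stack being left. */
void db_exit(struct db_ist_model *m, enum db_target running_on)
{
	assert(running_on != DB_GUARD_HOLE);
	m->ist = running_on;
}
```

The first #DB runs on the first instance, a nested #DB runs on the second,
and a third level of nesting is directed at the guard hole instead of
silently reusing a stack that is still in use.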


Printing backtraces on x86
==========================

The question about the '?' preceding function names in an x86 stacktrace
keeps popping up; here's an in-depth explanation. It helps if the reader
stares at print_context_stack() and the whole machinery in and around
arch/x86/kernel/dumpstack.c.

Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>:

We always scan the full kernel stack for return addresses stored on
the kernel stack(s) [1]_, from stack top to stack bottom, and print out
anything that 'looks like' a kernel text address.

If it fits into the frame pointer chain, we print it without a question
mark, knowing that it's part of the real backtrace.

If the address does not fit into our expected frame pointer chain we
still print it, but we print a '?'. It can mean two things:

- either the address is not part of the call chain: it's just stale
  values on the kernel stack, from earlier function calls. This is
  the common case.

- or it is part of the call chain, but the frame pointer was not set
  up properly within the function, so we don't recognize it.

This way we will always print out the real call chain (plus a few more
entries), regardless of whether the frame pointer was set up correctly
or not - but in most cases we'll get the call chain right as well. The
entries printed are strictly in stack order, so you can deduce more
information from that as well.

The most important property of this method is that we _never_ lose
information: we always strive to print _all_ addresses on the stack(s)
that look like kernel text addresses, so if debug information is wrong,
we still print out the real call chain as well - just with more question
marks than ideal.

.. [1] For things like IRQ and IST stacks, we also scan those stacks, in
       the right order, and try to cross from one stack into another
       reconstructing the call chain. This works most of the time.
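The scan-and-mark rule described above can be sketched in user space as
follows. This is not the code in arch/x86/kernel/dumpstack.c: the text
range, looks_like_kernel_text() and print_stack_model() are assumptions
made up for the example, and the frame pointer chain is passed in as a
precomputed list of stack indices rather than being walked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical kernel text range, chosen only for this example. */
#define TEXT_START 0x1000UL
#define TEXT_END   0x2000UL

bool looks_like_kernel_text(unsigned long addr)
{
	return addr >= TEXT_START && addr < TEXT_END;
}

/*
 * Scan every stack word from top to bottom. chain[] holds the indices of
 * the words that the frame pointer chain identifies as return addresses,
 * in the same top-to-bottom order. Every word that looks like kernel text
 * is printed; those not confirmed by the chain get a '?'. Returns the
 * number of entries printed with '?'.
 */
size_t print_stack_model(FILE *out, const unsigned long *stack, size_t nwords,
			 const size_t *chain, size_t nchain)
{
	size_t next = 0;         /* next expected entry in chain[] */
	size_t unreliable = 0;

	assert(stack != NULL || nwords == 0);
	for (size_t i = 0; i < nwords; i++) {
		/* Skip chain entries that the scan has already passed. */
		while (next < nchain && chain[next] < i)
			next++;
		if (!looks_like_kernel_text(stack[i]))
			continue;
		if (next < nchain && chain[next] == i) {
			fprintf(out, " %#lx\n", stack[i]);
			next++;
		} else {
			fprintf(out, " ? %#lx\n", stack[i]);
			unreliable++;
		}
	}
	return unreliable;
}
```

Nothing that looks like a text address is ever dropped. An entry the chain
does not confirm is only demoted to a '?' line, which is why a broken frame
pointer costs extra question marks rather than lost call chain entries.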