                   Firmware-Assisted Dump
                  ------------------------
                         July 2011

The goal of firmware-assisted dump is to enable the dump of
a crashed system, and to do so from a fully-reset system, and
to minimize the total elapsed time until the system is back
in production use.

- Firmware-assisted dump (fadump) infrastructure is intended to replace
  the existing phyp assisted dump.
- Fadump uses the same firmware interfaces and memory reservation model
  as phyp assisted dump.
- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
  in the ELF format in the same way as kdump. This helps us reuse the
  kdump infrastructure for dump capture and filtering.
- Unlike phyp dump, user-space tools do not need to refer to any sysfs
  interface while reading /proc/vmcore.
- Unlike phyp dump, fadump allows the user to release all the memory
  reserved for dump with a single operation:
  echo 1 > /sys/kernel/fadump_release_mem.
- Once enabled through a kernel boot parameter, fadump can be
  started/stopped through the /sys/kernel/fadump_registered interface
  (see the sysfs files section below) and can be easily integrated with
  kdump service start/stop init scripts.

Compared with kdump or other strategies, firmware-assisted
dump offers several strong, practical advantages:

-- Unlike kdump, the system has been reset, and loaded
   with a fresh copy of the kernel. In particular,
   PCI and I/O devices have been reinitialized and are
   in a clean, consistent state.
-- Once the dump is copied out, the memory that held the dump
   is immediately available to the running kernel. Therefore,
   unlike kdump, fadump doesn't need a 2nd reboot to get the
   system back to the production configuration.

The above can only be accomplished by coordination with,
and assistance from, the Power firmware. The procedure is
as follows:

-- The first kernel registers the sections of memory with the
   Power firmware for dump preservation during OS initialization.
   These registered sections of memory are reserved by the first
   kernel during early boot.

-- When a system crashes, the Power firmware will save
   the low memory (boot memory, whose size is the larger of 5% of
   system RAM or 256MB) to the previously registered region. It
   will also save system registers and hardware PTEs.

   NOTE: The term 'boot memory' means the size of the low memory
         chunk that is required for a kernel to boot successfully
         when booted with restricted memory. By default, the boot
         memory size will be the larger of 5% of system RAM or 256MB.
         Alternatively, the user can specify the boot memory size
         through the boot parameter 'crashkernel=', which overrides
         the default calculated size. Use this option if the default
         boot memory size is not sufficient for the second kernel to
         boot successfully. For the syntax of the crashkernel=
         parameter, refer to Documentation/kdump/kdump.txt. If any
         offset is provided in the crashkernel= parameter, it will be
         ignored, as fadump uses a predefined offset to reserve memory
         for boot memory dump preservation in case of a crash.

-- After the low memory (boot memory) area has been saved, the
   firmware will reset PCI and other hardware state. It will
   *not* clear the RAM. It will then launch the bootloader, as
   normal.

-- The freshly booted kernel will notice that there is a new
   node (ibm,dump-kernel) in the device tree, indicating that
   there is crash data available from a previous boot. During
   early boot, the OS will reserve the rest of the memory above
   the boot memory size, effectively booting with a restricted
   memory size. This makes sure that the second kernel will not
   touch any of the dump memory area.

-- User-space tools will read /proc/vmcore to obtain the contents
   of memory, which holds the previous crashed kernel dump in ELF
   format. The user-space tools may copy this info to disk, or to
   network, NAS, SAN, iSCSI, etc. as desired.

-- Once the user-space tool is done saving the dump, it will echo
   '1' to /sys/kernel/fadump_release_mem to release the reserved
   memory back to general use, except the memory required for the
   next firmware-assisted dump registration.

   e.g.
     # echo 1 > /sys/kernel/fadump_release_mem

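The last two steps of the procedure above can be sketched as a small
shell helper. This is only an illustration: the function name and its
argument order are made up for this example, and plain cp stands in for
whatever capture or filtering tool the kdump scripts actually use. The
default paths are the ones named in this document.

```shell
# Sketch: save the dump, then hand the reserved memory back.
# fadump_save_and_release is a hypothetical helper, not an existing tool.
#   $1 = destination file for the dump
#   $2 = dump source (defaults to /proc/vmcore)
#   $3 = release control file (defaults to /sys/kernel/fadump_release_mem)
fadump_save_and_release() {
    local dest="$1"
    local vmcore="${2:-/proc/vmcore}"
    local release="${3:-/sys/kernel/fadump_release_mem}"

    # Both files exist only when an active dump is waiting.
    [ -r "$vmcore" ] && [ -e "$release" ] || return 1

    # If the copy fails, keep the memory reserved so the dump is not lost.
    cp "$vmcore" "$dest" || return 1

    # Release the reserved memory back to general use.
    echo 1 > "$release"
}
```

On a system booted after a crash this could be called as
fadump_save_and_release /var/crash/vmcore (the destination path is an
arbitrary choice). Releasing only after a successful copy is the point
of the ordering: once the memory is released the dump is gone.
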
Please note that the firmware-assisted dump feature
is only available on Power6 and above systems with recent
firmware versions.

Implementation details:
----------------------

During boot, a check is made to see if firmware supports
this feature on that particular machine. If it does, then
we check to see if an active dump is waiting for us. If yes,
then everything but boot memory size of RAM is reserved during
early boot (see Fig. 2). This area is released once we finish
collecting the dump from the user-land scripts (e.g. kdump
scripts) that are run. If there is dump data, then the
/sys/kernel/fadump_release_mem file is created, and the reserved
memory is held.

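To make the sizes concrete, the default boot memory rule stated earlier
(the larger of 5% of system RAM or 256MB) and the resulting reservation
in the second kernel can be worked out as below. This is a rough sketch
of the rule as worded in this document; the function names are made up
for the example, and any rounding or alignment the kernel applies is
ignored.

```shell
# Default boot memory size in MB: the larger of 5% of RAM or 256MB.
# $1 = total system RAM in MB. Hypothetical helper for illustration.
default_boot_mem_mb() {
    local total_mb="$1"
    local five_percent=$(( total_mb * 5 / 100 ))
    if [ "$five_percent" -gt 256 ]; then
        echo "$five_percent"
    else
        echo 256
    fi
}

# Memory (in MB) held back by the second kernel while a dump is waiting:
# everything except the boot memory.
second_kernel_reserved_mb() {
    local total_mb="$1"
    echo $(( total_mb - $(default_boot_mem_mb "$total_mb") ))
}
```

For example, with 4096MB of RAM 5% is only 204MB, so the 256MB floor
applies and 3840MB stays reserved until the dump is released. With
16384MB the 5% figure (819MB) wins.
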
If there is no waiting dump data, then only the memory required
to hold CPU state, HPTE region, boot memory dump and elfcore
header is reserved, usually at an offset greater than boot memory
size (see Fig. 1). This area is *not* released: this region will
be kept permanently reserved, so that it can act as a receptacle
for a copy of the boot memory content in addition to CPU state
and HPTE region, in case a crash does occur. Since this reserved
memory area is used only after a system crash, there is no point in
blocking this significant chunk of memory from the production kernel.
Hence, the implementation uses the Linux kernel's Contiguous Memory
Allocator (CMA) for memory reservation if CMA is configured for the
kernel. With CMA reservation, this memory is available for applications
to use, while the kernel is prevented from using it. With this, fadump
will still be able to capture all of the kernel memory and most of the
user space memory, except the user pages that were present in the CMA
region.

  o Memory Reservation during first kernel

  Low memory                                        Top of memory
  0      boot memory size                                       |
  |           |                |<--Reserved dump area -->|      |
  V           V                |   Permanent Reservation |      V
  +-----------+----------/ /---+---+----+-----------+----+------+
  |           |                |CPU|HPTE|  DUMP     |ELF |      |
  +-----------+----------/ /---+---+----+-----------+----+------+
    |                                           ^
    |                                           |
    \                                           /
     -------------------------------------------
      Boot memory content gets transferred to
      reserved area by firmware at the time of
      crash
                   Fig. 1

  o Memory Reservation during second kernel after crash

  Low memory                                        Top of memory
  0      boot memory size                                       |
  |           |<------------- Reserved dump area ----------- -->|
  V           V                                                 V
  +-----------+----------/ /---+---+----+-----------+----+------+
  |           |                |CPU|HPTE|  DUMP     |ELF |      |
  +-----------+----------/ /---+---+----+-----------+----+------+
        |                                              |
        V                                              V
   Used by second                                /proc/vmcore
   kernel to boot
                   Fig. 2

Currently the dump will be copied from /proc/vmcore to
a new file upon user intervention. The dump data available through
/proc/vmcore will be in ELF format. Hence the existing kdump
infrastructure (kdump scripts) to save the dump works fine with
minor modifications.

The tools to examine the dump will be the same as the ones
used for kdump.

How to enable firmware-assisted dump (fadump):
-------------------------------------

1. Set config option CONFIG_FA_DUMP=y and build the kernel.
2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option.
   By default, fadump reserved memory will be initialized as a CMA area.
   Alternatively, the user can boot the Linux kernel with 'fadump=nocma'
   to prevent fadump from using CMA.
3. Optionally, the user can also set the 'crashkernel=' kernel cmdline
   option to specify the size of the memory to reserve for boot memory
   dump preservation.

NOTE: 1. The 'fadump_reserve_mem=' parameter has been deprecated. Instead
         use 'crashkernel=' to specify the size of the memory to reserve
         for boot memory dump preservation.
      2. If firmware-assisted dump fails to reserve memory, then it
         will fall back to the existing kdump mechanism if the
         'crashkernel=' option is set on the kernel cmdline.
      3. If the user wants to capture all of user space memory and is
         OK with the reserved memory not being available to the
         production system, then the 'fadump=nocma' kernel parameter
         can be used to fall back to the old behaviour.

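As a quick sanity check after editing the boot configuration, a script
can report which fadump= value was passed on the kernel command line.
The helper below is a sketch with a made-up name; it takes the command
line as a string, reports the first fadump= word it finds, and says
nothing about whether the kernel actually honoured the option (use
/sys/kernel/fadump_enabled for that).

```shell
# Sketch: extract the value of fadump= from a kernel command line string.
# fadump_cmdline_mode is a hypothetical helper; prints "unset" when the
# option is absent. Only the first occurrence is reported.
fadump_cmdline_mode() {
    local cmdline="$1" word
    for word in $cmdline; do
        case "$word" in
            fadump=*)
                echo "${word#fadump=}"
                return 0
                ;;
        esac
    done
    echo "unset"
}
```

On a running system it would typically be fed from /proc/cmdline:
fadump_cmdline_mode "$(cat /proc/cmdline)".
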
Sysfs/debugfs files:
------------

The firmware-assisted dump feature uses the sysfs file system to hold
the control files and a debugfs file to display the reserved memory
regions.

Here is the list of files under kernel sysfs:

 /sys/kernel/fadump_enabled

    This is used to display the fadump status.
    0 = fadump is disabled
    1 = fadump is enabled

    This interface can be used by kdump init scripts to identify if
    fadump is enabled in the kernel and act accordingly.

 /sys/kernel/fadump_registered

    This is used to display the fadump registration status as well
    as to control (start/stop) the fadump registration.
    0 = fadump is not registered.
    1 = fadump is registered and ready to handle system crash.

    To register fadump, echo 1 > /sys/kernel/fadump_registered. To
    un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered.
    Once fadump is un-registered, a system crash will not be handled
    and vmcore will not be captured. This interface can be easily
    integrated with kdump service start/stop.

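The two files above are enough for a minimal start/stop/status wrapper
of the kind a kdump service script might use. The following is a sketch
under assumptions: the function name, the action names and the optional
directory argument (handy for testing against a scratch directory) are
all invented for this example.

```shell
# Sketch: start/stop/status for fadump registration.
# fadump_service is a hypothetical helper.
#   $1 = start | stop | status
#   $2 = directory holding the control files (defaults to /sys/kernel)
fadump_service() {
    local action="$1" sysfs="${2:-/sys/kernel}"

    # fadump_enabled must read 1, otherwise there is nothing to control.
    if [ "$(cat "$sysfs/fadump_enabled" 2>/dev/null)" != "1" ]; then
        echo "fadump is not enabled" >&2
        return 1
    fi

    case "$action" in
        start)  echo 1 > "$sysfs/fadump_registered" ;;
        stop)   echo 0 > "$sysfs/fadump_registered" ;;
        status)
            if [ "$(cat "$sysfs/fadump_registered")" = "1" ]; then
                echo "registered"
            else
                echo "not registered"
            fi
            ;;
        *)
            echo "usage: fadump_service start|stop|status [dir]" >&2
            return 2
            ;;
    esac
}
```

Checking fadump_enabled first mirrors the advice above that init scripts
should identify whether fadump is enabled before acting.
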
 /sys/kernel/fadump_release_mem

    This file is available only when fadump is active during the
    second kernel. It is used to release the reserved memory
    regions that are held for saving the crash dump. To release the
    reserved memory, echo 1 to it:

    echo 1 > /sys/kernel/fadump_release_mem

    After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
    file will change to reflect the new memory reservations.

    The existing user-space tools (kdump infrastructure) can be easily
    enhanced to use this interface to release the memory reserved for
    dump and continue without a 2nd reboot.

Here is the list of files under powerpc debugfs:
(Assuming debugfs is mounted on /sys/kernel/debug directory.)

 /sys/kernel/debug/powerpc/fadump_region

    This file shows the reserved memory regions if fadump is
    enabled; otherwise this file is empty. The output format
    is:
    <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>

    e.g.
    Contents when fadump is registered during first kernel

    # cat /sys/kernel/debug/powerpc/fadump_region
    CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
    HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
    DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0

    Contents when fadump is active during second kernel

    # cat /sys/kernel/debug/powerpc/fadump_region
    CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
    HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
    DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
        : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000

NOTE: Please refer to Documentation/filesystems/debugfs.txt on
      how to mount the debugfs filesystem.


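Because the fadump_region output has a fixed line format, a script can
total the reserved sizes with a few lines of shell. The helper below is
a sketch with a made-up name; it relies only on the
"<reserved-size> bytes," field shown in the format above and assumes
the sizes are always printed in hex with a 0x prefix, as in the sample
output.

```shell
# Sketch: sum the <reserved-size> fields of a fadump_region listing.
# fadump_reserved_bytes is a hypothetical helper.
#   $1 = file to read (defaults to /sys/kernel/debug/powerpc/fadump_region)
fadump_reserved_bytes() {
    local file="${1:-/sys/kernel/debug/powerpc/fadump_region}"
    local size total=0

    # Pick the hex number that sits between "] " and " bytes," on each line.
    for size in $(sed -n 's/.*\] \(0x[0-9a-fA-F]*\) bytes,.*/\1/p' "$file"); do
        # Shell arithmetic accepts 0x-prefixed constants directly.
        total=$(( total + $size ))
    done
    echo "$total"
}
```

For the first-kernel sample above this gives
0x40020 + 0x1000 + 0x10000000 = 268701728 bytes, i.e. a little over
256MB held for the CPU, HPTE and DUMP regions.
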
TODO:
-----
 o Need to come up with a better approach to find a more
   accurate boot memory size that is required for a kernel to
   boot successfully when booted with restricted memory.
 o The fadump implementation introduces a fadump crash info structure
   in the scratch area before the ELF core header. The idea of introducing
   this structure is to pass some important crash info data to the second
   kernel, which will help the second kernel populate the ELF core header
   with correct data before it gets exported through /proc/vmcore. The
   current design does not address the possibility of introducing
   additional fields (in future) to this structure without affecting
   compatibility. Need to come up with a better approach to address this.
   The possible approaches are:
	1. Introduce a version field for version tracking, and bump up the
	version whenever a new field is added to the structure in future.
	The version field can be used to find out what fields are valid for
	the current version of the structure.
	2. Reserve an area of predefined size (say PAGE_SIZE) for this
	structure and have the unused area as reserved (initialized to zero)
	for future field additions.
   The advantage of approach 1 over 2 is that we don't need to reserve
   extra space.
---
Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
This document is based on the original documentation written for phyp
assisted dump by Linas Vepstas and Manish Ahuja.