Debugging hibernation and suspend
(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL

1. Testing hibernation (aka suspend to disk or STD)

To check if hibernation works, you can try to hibernate in the "reboot" mode:

# echo reboot > /sys/power/disk
# echo disk > /sys/power/state

and the system should create a hibernation image, reboot, resume and get back to
the command prompt where you have started the transition. If that happens,
hibernation is most likely to work correctly. Still, you need to repeat the
test at least a couple of times in a row for confidence. [This is necessary,
because some problems only show up on a second attempt at suspending and
resuming the system.] Moreover, hibernating in the "reboot" and "shutdown"
modes causes the PM core to skip some platform-related callbacks which on ACPI
systems might be necessary to make hibernation work. Thus, if your machine fails
to hibernate or resume in the "reboot" mode, you should try the "platform" mode:

# echo platform > /sys/power/disk
# echo disk > /sys/power/state

which is the default and recommended mode of hibernation.

Unfortunately, the "platform" mode of hibernation does not work on some systems
with broken BIOSes. In such cases the "shutdown" mode of hibernation might
work:

# echo shutdown > /sys/power/disk
# echo disk > /sys/power/state

(it is similar to the "reboot" mode, but it requires you to press the power
button to make the system resume).

If neither "platform" nor "shutdown" hibernation mode works, you will need to
identify what goes wrong.

a) Test modes of hibernation

To find out why hibernation fails on your system, you can use a special testing
facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then,
there is the file /sys/power/pm_test that can be used to make the hibernation
core run in a test mode. There are 5 test modes available:

freezer
- test the freezing of processes

devices
- test the freezing of processes and suspending of devices

platform
- test the freezing of processes, suspending of devices and platform
  global control methods(*)

processors
- test the freezing of processes, suspending of devices, platform
  global control methods(*) and the disabling of nonboot CPUs

core
- test the freezing of processes, suspending of devices, platform global
  control methods(*), the disabling of nonboot CPUs and suspending of
  platform/system devices

(*) the platform global control methods are only available on ACPI systems
    and are only tested if the hibernation mode is set to "platform"

To use one of them it is necessary to write the corresponding string to
/sys/power/pm_test (e.g. "devices" to test the freezing of processes and
suspending of devices) and issue the standard hibernation commands. For example,
to use the "devices" test mode along with the "platform" mode of hibernation,
you should do the following:

# echo devices > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo disk > /sys/power/state

Then, the kernel will try to freeze processes, suspend devices, wait a few
seconds (5 by default, but configurable by the suspend.pm_test_delay module
parameter), resume devices and thaw processes. If "platform" is written to
/sys/power/pm_test, then after suspending devices the kernel will additionally
invoke the global control methods (e.g. ACPI global control methods) used to
prepare the platform firmware for hibernation. Next, it will wait a
configurable number of seconds and invoke the platform (e.g. ACPI) global
methods used to cancel hibernation etc.

Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal
hibernation/suspend operations. Also, when read, /sys/power/pm_test contains a
space-separated list of all available tests (including "none" that represents
the normal functionality) in which the current test level is indicated by
square brackets.

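To check which test level is currently selected, you can pick the bracketed
entry out of that list. The following is only a minimal sketch: the function
name current_pm_test and its optional file argument are made up for
illustration and are not part of any kernel interface.

```shell
# Print the currently selected pm_test level, i.e. the entry in square brackets.
# The optional argument (an assumption for illustration) names the file to read;
# it defaults to the real sysfs attribute.
current_pm_test() {
    file="${1:-/sys/power/pm_test}"
    # Put each entry on its own line, then keep only the bracketed one
    # and strip the brackets.
    tr ' ' '\n' < "$file" | sed -n 's/^\[\(.*\)\]$/\1/p'
}
```

Calling current_pm_test without arguments reads /sys/power/pm_test, so it only
works on kernels compiled with CONFIG_PM_DEBUG set.
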
Generally, as you can see, each test level is more "invasive" than the previous
one and the "core" level tests the hardware and drivers as deeply as possible
without creating a hibernation image. Obviously, if the "devices" test fails,
the "platform" test will fail as well and so on. Thus, as a rule of thumb, you
should try the test modes starting from "freezer", through "devices", "platform"
and "processors" up to "core" (repeat the test on each level a couple of times
to rule out random factors).

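The rule of thumb above can be scripted. The sketch below is not a definitive
tool: the function name run_pm_tests and the PM_ROOT and REPEAT variables are
illustrative assumptions, and it assumes that the write to the "state" file
fails when the test transition fails. It does not touch /sys/power/disk, so
select the hibernation mode you want before running it.

```shell
# Walk the pm_test levels from the least to the most invasive one, repeating
# each level REPEAT times, and stop at the first level that fails.
# PM_ROOT and REPEAT are illustrative knobs, not kernel interfaces.
run_pm_tests() {
    pm_root="${PM_ROOT:-/sys/power}"
    repeat="${REPEAT:-2}"
    for level in freezer devices platform processors core; do
        i=0
        while [ "$i" -lt "$repeat" ]; do
            echo "$level" > "$pm_root/pm_test" || return 1
            # Assumption: the write returns an error if the test fails.
            if ! echo disk > "$pm_root/state"; then
                echo "pm_test level '$level' failed" >&2
                echo none > "$pm_root/pm_test"
                return 1
            fi
            i=$((i + 1))
        done
    done
    # Switch back to the normal hibernation/suspend operations.
    echo none > "$pm_root/pm_test"
}
```

Pointing PM_ROOT at a scratch directory lets you dry-run the loop without
affecting the system. A hang during one of the more invasive levels cannot be
detected by the script, so watch the machine while it runs.
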
If the "freezer" test fails, there is a task that cannot be frozen (in that case
it usually is possible to identify the offending task by analysing the output of
dmesg obtained after the failing test). Failure at this level usually means
that there is a problem with the tasks freezer subsystem that should be
reported.

If the "devices" test fails, most likely there is a driver that cannot suspend
or resume its device (in the latter case the system may hang or become unstable
after the test, so please take that into consideration). To find this driver,
you can carry out a binary search according to the rules:
- if the test fails, unload half of the drivers currently loaded and repeat
  (that would probably involve rebooting the system, so always note what
  drivers have been loaded before the test),
- if the test succeeds, load half of the drivers you have unloaded most
  recently and repeat.

Once you have found the failing driver (there can be more than just one of
them), you have to unload it every time before hibernation. In that case please
make sure to report the problem with the driver.

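The bookkeeping of the binary search can be sketched as follows. This is a
simplified illustration that assumes exactly one failing driver and driver
names without whitespace. The function name bisect_drivers and the check
callback are hypothetical: the callback receives the list of suspect drivers
to keep loaded and must return 0 if the "devices" test passes with only those
suspects loaded. How it loads and unloads drivers (possibly with a reboot in
between) is up to you.

```shell
# Usage: bisect_drivers CHECK DRIVER...
# CHECK is a user-supplied command (hypothetical, for illustration) that
# returns 0 when the test succeeds with only the given suspects loaded.
# Assumes a single failing driver among DRIVER...
bisect_drivers() {
    check="$1"; shift
    while [ "$#" -gt 1 ]; do
        half=$(( $# / 2 ))
        first=""; rest=""; i=0
        # Split the current suspects into two halves.
        for m in "$@"; do
            if [ "$i" -lt "$half" ]; then
                first="$first $m"
            else
                rest="$rest $m"
            fi
            i=$((i + 1))
        done
        # Keep only the first half loaded and repeat the test.
        if "$check" $first; then
            set -- $rest     # test passed: the culprit is in the unloaded half
        else
            set -- $first    # test failed: the culprit is still loaded
        fi
    done
    echo "$1"
}
```

If more than one driver is broken, the search finds at most one of them, so
repeat it with the driver you found left unloaded.
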
It is also possible that the "devices" test will still fail after you have
unloaded all modules. In that case, you may want to look in your kernel
configuration for the drivers that can be compiled as modules (and test again
with these drivers compiled as modules). You may also try to use some special
kernel command line options such as "noapic", "noacpi" or even "acpi=off".

If the "platform" test fails, there is a problem with the handling of the
platform (e.g. ACPI) firmware on your system. In that case the "platform" mode
of hibernation is not likely to work. You can try the "shutdown" mode, but that
is rather a poor man's workaround.

If the "processors" test fails, the disabling/enabling of nonboot CPUs does not
work (of course, this may only be an issue on SMP systems) and the problem
should be reported. In that case you can also try to switch the nonboot CPUs
off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and
see if that works.

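Switching the nonboot CPUs off and on by hand can be sketched like this. The
function name cycle_nonboot_cpus and the CPU_ROOT variable are illustrative
assumptions. The sketch assumes that every CPU with an "online" attribute is
online to begin with, since it leaves each of them online afterwards.

```shell
# Take every CPU that has an "online" attribute offline and back online,
# reporting the CPUs for which either step fails.
# CPU_ROOT is an illustrative override; it defaults to the real sysfs directory.
cycle_nonboot_cpus() {
    cpu_root="${CPU_ROOT:-/sys/devices/system/cpu}"
    status=0
    for online in "$cpu_root"/cpu[0-9]*/online; do
        # Skip the unexpanded pattern if no CPU has an "online" attribute
        # (the boot CPU typically has none).
        [ -e "$online" ] || continue
        cpu="${online%/online}"
        cpu="${cpu##*/}"
        if echo 0 > "$online" && echo 1 > "$online"; then
            echo "$cpu: ok"
        else
            echo "$cpu: FAILED" >&2
            status=1
        fi
    done
    return "$status"
}
```

If a CPU fails to go offline or to come back online here as well, mention
that in your report.
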
If the "core" test fails, suspending of the system/platform devices has failed
(these devices are suspended on one CPU with interrupts off). The problem is
most probably hardware-related and serious, so it should be reported.

A failure of any of the "platform", "processors" or "core" tests may cause your
system to hang or become unstable, so please beware. Such a failure usually
indicates a serious problem that very well may be related to the hardware, but
please report it anyway.

b) Testing minimal configuration

If all of the hibernation test modes work, you can boot the system with the
"init=/bin/bash" command line parameter and attempt to hibernate in the
"reboot", "shutdown" and "platform" modes. If that does not work, there
probably is a problem with a driver statically compiled into the kernel and you
can try to compile more drivers as modules, so that they can be tested
individually. Otherwise, there is a problem with a modular driver and you can
find it by loading half of the modules you normally use and binary searching
in accordance with the algorithm:
- if there are n modules loaded and the attempt to suspend and resume fails,
  unload n/2 of the modules and try again (that would probably involve
  rebooting the system),
- if there are n modules loaded and the attempt to suspend and resume succeeds,
  load n/2 modules more and try again.

Again, if you find the offending module or modules, they must be unloaded every
time before hibernation, and please report the problem with them.

c) Using the "test_resume" hibernation option

/sys/power/disk generally tells the kernel what to do after creating a
hibernation image. One of the available options is "test_resume" which
causes the just created image to be used for immediate restoration. Namely,
after doing:

# echo test_resume > /sys/power/disk
# echo disk > /sys/power/state

a hibernation image will be created and a resume from it will be triggered
immediately without involving the platform firmware in any way.

That test can be used to check if failures to resume from hibernation are
related to bad interactions with the platform firmware. That is, if the above
works every time, but resume from actual hibernation does not work or is
unreliable, the platform firmware may be responsible for the failures.

On architectures and platforms that support using different kernels to restore
hibernation images (that is, the kernel used to read the image from storage and
load it into memory is different from the one included in the image) or support
kernel address space randomization, it also can be used to check if failures
to resume may be related to the differences between the restore and image
kernels.

d) Advanced debugging

If hibernation does not work on your system even in the minimal configuration
and compiling more drivers as modules is not practical or some modules cannot
be unloaded, you can use one of the more advanced debugging techniques to find
the problem. First, if there is a serial port in your box, you can boot the
kernel with the 'no_console_suspend' parameter and try to log kernel messages
using the serial console. This may provide you with some information about the
reasons for the suspend (resume) failure. Alternatively, it may be possible to
use a FireWire port for debugging with firescope
(http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to use the
PM_TRACE mechanism documented in Documentation/power/s2ram.txt.

2. Testing suspend to RAM (STR)

To verify that the STR works, it is generally more convenient to use the s2ram
tool available from http://suspend.sf.net and documented at
http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK).

The /sys/power/pm_test facility can be used for STR as well. Namely, after
writing "freezer", "devices", "platform", "processors", or "core" into
/sys/power/pm_test (available if the kernel is compiled with CONFIG_PM_DEBUG
set) the suspend code will work in the test mode corresponding to the given
string. The STR test modes are defined in the same way as for hibernation, so
please refer to Section 1 for more information about them. In particular, the
"core" test allows you to test everything except for the actual invocation of
the platform firmware in order to put the system into the sleep state.

Among other things, the testing with the help of /sys/power/pm_test may allow
you to identify drivers that fail to suspend or resume their devices. They
should be unloaded every time before an STR transition.

Next, you can follow the instructions at S2RAM_LINK to test the system, but if
it does not work "out of the box", you may need to boot it with
"init=/bin/bash" and test s2ram in the minimal configuration. In that case,
you may be able to search for failing drivers by following the procedure
analogous to the one described in Section 1. If you find some failing drivers,
you will have to unload them every time before an STR transition (i.e. before
you run s2ram), and please report the problems with them.

There is a debugfs entry which shows the suspend to RAM statistics. Here is an
example of its output:

    # mount -t debugfs none /sys/kernel/debug
    # cat /sys/kernel/debug/suspend_stats
    success: 20
    fail: 5
    failed_freeze: 0
    failed_prepare: 0
    failed_suspend: 5
    failed_suspend_noirq: 0
    failed_resume: 0
    failed_resume_noirq: 0
    failures:
      last_failed_dev:      alarm
                            adc
      last_failed_errno:    -16
                            -16
      last_failed_step:     suspend
                            suspend

The "success" field is the number of successful suspend to RAM transitions and
the "fail" field is the number of failed ones. The other counters give the
number of failures at the individual steps of suspend to RAM. suspend_stats
only lists the last 2 failed devices, error numbers and failed steps of
suspend.
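
To see at a glance which steps have failed, you can filter the per-step
counters. The following is a minimal sketch that assumes the "name: value"
layout shown in the example above. The function name nonzero_failures and its
optional file argument are made up for illustration.

```shell
# Print every failed_* counter from suspend_stats whose value is non-zero,
# in the form name=value.
# The optional argument (an assumption for illustration) names the file to
# read; it defaults to the debugfs entry.
nonzero_failures() {
    file="${1:-/sys/kernel/debug/suspend_stats}"
    # Only the counters start at the beginning of a line with "failed_";
    # the indented last_failed_* entries are skipped.
    awk -F':[ \t]*' '/^failed_/ && $2 > 0 { print $1 "=" $2 }' "$file"
}
```

For the example output above this prints a single line for failed_suspend,
which matches the "suspend" step reported under last_failed_step.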