====================================================================
Interaction of Suspend code (S3) with the CPU hotplug infrastructure
====================================================================

(C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>


I. Differences between CPU hotplug and Suspend-to-RAM
======================================================

How does the regular CPU hotplug code differ from how the Suspend-to-RAM
infrastructure uses it internally? And where do they share common code?

Well, a picture is worth a thousand words... So ASCII art follows :-)

[This depicts the current design in the kernel and focuses only on the
interactions involving the freezer and CPU hotplug. It also tries to explain
the locking involved and outlines the notifications that are sent.
Note that only the call paths are illustrated here, with the aim
of describing where they take different paths and where they share code.
What happens when regular CPU hotplug and Suspend-to-RAM race with each other
is not depicted here.]

On a high level, the suspend-resume cycle goes like this::

  |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
  |tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|


More details follow::

                                Suspend call path
                                -----------------

                                  Write 'mem' to
                                /sys/power/state
                                    sysfs file
                                        |
                                        v
                       Acquire system_transition_mutex lock
                                        |
                                        v
                             Send PM_SUSPEND_PREPARE
                                   notifications
                                        |
                                        v
                                   Freeze tasks
                                        |
                                        |
                                        v
                              freeze_secondary_cpus()
                                   /* start */
                                        |
                                        v
                            Acquire cpu_add_remove_lock
                                        |
                                        v
                             Iterate over CURRENTLY
                                   online CPUs
                                        |
                                        |
                                        |                ----------
                                        v                          | L
             ======>               _cpu_down()                     |
            |              [This takes cpuhotplug.lock             |
  Common    |               before taking down the CPU             |
   code     |               and releases it when done]             | O
            |            While it is at it, notifications          |
            |            are sent when notable events occur,       |
             ======>     by running all registered callbacks.      |
                                        |                          | O
                                        |                          |
                                        |                          |
                                        v                          |
                            Note down these cpus in                | P
                                frozen_cpus mask         ----------
                                        |
                                        v
                           Disable regular cpu hotplug
                        by increasing cpu_hotplug_disabled
                                        |
                                        v
                            Release cpu_add_remove_lock
                                        |
                                        v
                       /* freeze_secondary_cpus() complete */
                                        |
                                        v
                                   Do suspend



Resume follows the same pattern, with the counterparts being (in the order of
execution during resume):

* thaw_secondary_cpus() which involves::

   |  Acquire cpu_add_remove_lock
   |  Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug
   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
   |  Release cpu_add_remove_lock
   v

* thaw tasks
* send PM_POST_SUSPEND notifications
* Release system_transition_mutex lock.


Note that the system_transition_mutex lock is acquired at the very beginning,
when we are just starting out to suspend, and is released only after the
entire cycle is complete (i.e., suspend + resume).

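The suspend-side bookkeeping described above can be sketched as a small
userspace model. This is not the code in kernel/cpu.c: the function and
variable names are borrowed from the diagram, while the bitmask
representation, the ``NR_CPUS`` value and the boolean standing in for
cpu_add_remove_lock are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 4                       /* assumed CPU count for the model */

static uint32_t cpu_online_mask = 0xF;  /* bit n set: CPU n is online */
static uint32_t frozen_cpus;            /* CPUs taken down for suspend */
static int cpu_hotplug_disabled;        /* > 0: regular hotplug is refused */
static bool cpu_add_remove_locked;      /* stands in for cpu_add_remove_lock */

/* Stand-ins for the common code; the suspend path passes tasks_frozen = 1. */
static int _cpu_down(int cpu, int tasks_frozen)
{
    (void)tasks_frozen;
    cpu_online_mask &= ~(1u << cpu);
    return 0;
}

static int _cpu_up(int cpu, int tasks_frozen)
{
    (void)tasks_frozen;
    cpu_online_mask |= 1u << cpu;
    return 0;
}

static int freeze_secondary_cpus(int primary)
{
    assert(!cpu_add_remove_locked);
    cpu_add_remove_locked = true;       /* acquire cpu_add_remove_lock */
    frozen_cpus = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu == primary || !(cpu_online_mask & (1u << cpu)))
            continue;                   /* only CURRENTLY online CPUs */
        if (_cpu_down(cpu, 1) == 0)
            frozen_cpus |= 1u << cpu;   /* note it down in frozen_cpus */
    }
    cpu_hotplug_disabled++;             /* disable regular cpu hotplug */
    cpu_add_remove_locked = false;      /* release cpu_add_remove_lock */
    return 0;
}

static void thaw_secondary_cpus(void)
{
    assert(!cpu_add_remove_locked);
    cpu_add_remove_locked = true;       /* acquire cpu_add_remove_lock */
    cpu_hotplug_disabled--;             /* re-enable regular cpu hotplug */
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (frozen_cpus & (1u << cpu))
            _cpu_up(cpu, 1);            /* only CPUs that suspend took down */
    frozen_cpus = 0;
    cpu_add_remove_locked = false;      /* release cpu_add_remove_lock */
}
```

The point of the frozen_cpus mask shows up when a CPU was already offline
before suspend: it is skipped on the way down, so it is not brought up on the
way back either.
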
::



                          Regular CPU hotplug call path
                          -----------------------------

                                Write 0 (or 1) to
                       /sys/devices/system/cpu/cpu*/online
                                    sysfs file
                                        |
                                        |
                                        v
                                    cpu_down()
                                        |
                                        v
                           Acquire cpu_add_remove_lock
                                        |
                                        v
                          If cpu_hotplug_disabled > 0
                                return gracefully
                                        |
                                        |
                                        v
             ======>               _cpu_down()
            |              [This takes cpuhotplug.lock
  Common    |               before taking down the CPU
   code     |               and releases it when done]
            |            While it is at it, notifications
            |            are sent when notable events occur,
             ======>     by running all registered callbacks.
                                        |
                                        |
                                        v
                          Release cpu_add_remove_lock
                               [That's it!, for
                              regular CPU hotplug]



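From userspace, both call paths start with a plain write to a sysfs file.
A minimal sketch in C, using only the standard library, is shown here; the
helper names ``write_sysfs()`` and ``set_cpu_online()`` are made up for this
example and are not part of any kernel or libc interface.

```c
#include <stdio.h>

/* Write a short string to a sysfs attribute; 0 on success, -1 on error. */
int write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int rc = (fputs(value, f) == EOF) ? -1 : 0;
    if (fclose(f) == EOF)
        rc = -1;        /* a rejected write may only show up on flush */
    return rc;
}

/* Ask the kernel to offline (online = 0) or online (online = 1) a CPU. */
int set_cpu_online(unsigned int cpu, int online)
{
    char path[64];

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%u/online", cpu);
    return write_sysfs(path, online ? "1" : "0");
}
```

With these helpers, ``set_cpu_online(1, 0)`` enters the regular CPU hotplug
path for CPU 1, and ``write_sysfs("/sys/power/state", "mem")`` enters the
suspend path. Both writes normally need root privileges, and on many systems
the boot CPU has no ``online`` file at all.
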
So, as can be seen from the two diagrams (the parts marked as "Common code"),
regular CPU hotplug and the suspend code path converge at the _cpu_down() and
_cpu_up() functions. They differ in the arguments passed to these functions:
during regular CPU hotplug, 0 is passed for the 'tasks_frozen' argument,
whereas during suspend, since the tasks are already frozen by the time
the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
with the 'tasks_frozen' argument set to 1.
[See below for some known issues regarding this.]


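The difference between the two callers of the common code can be modelled in a
few lines of C. This is an illustrative sketch, not the kernel source:
``suspend_cpu_down()`` is a hypothetical name for the suspend-side caller, the
bitmask is an assumed representation, and returning ``-EBUSY`` is this model's
reading of "return gracefully".

```c
#include <errno.h>

static int cpu_hotplug_disabled;        /* raised by the suspend path */
static int cpuhp_tasks_frozen;          /* what the hotplug callbacks see */
static unsigned int online_mask = 0x3;  /* CPUs 0 and 1 online */

/* Common code: record the tasks_frozen argument, then offline the CPU. */
static int _cpu_down(unsigned int cpu, int tasks_frozen)
{
    cpuhp_tasks_frozen = tasks_frozen;
    online_mask &= ~(1u << cpu);
    return 0;
}

/* Regular hotplug entry point: always passes tasks_frozen = 0. */
int cpu_down(unsigned int cpu)
{
    /* cpu_add_remove_lock would be held across this function */
    if (cpu_hotplug_disabled > 0)
        return -EBUSY;                  /* "return gracefully" */
    return _cpu_down(cpu, 0);
}

/* Suspend-side caller (hypothetical name): tasks are already frozen. */
int suspend_cpu_down(unsigned int cpu)
{
    return _cpu_down(cpu, 1);
}
```

Callbacks that look at cpuhp_tasks_frozen therefore see 1 only when the
request came through the suspend path, regardless of whether tasks happen to
be frozen at that moment.
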
Important files and functions/entry points:
-------------------------------------------

- kernel/power/process.c : freeze_processes(), thaw_processes()
- kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](),
  [freeze|thaw]_secondary_cpus()



II. What are the issues involved in CPU hotplug?
================================================

There are some interesting situations involving CPU hotplug and microcode
update on the CPUs, as discussed below:

[Please bear in mind that the kernel requests the microcode images from
userspace, using the request_firmware() function defined in
drivers/base/firmware_loader/main.c]


a. When all the CPUs are identical:

   This is the most common situation and it is quite straightforward: we want
   to apply the same microcode revision to each of the CPUs.
   To give an example of x86, the collect_cpu_info() function defined in
   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
   and thereby in applying the correct microcode revision to it.
   But note that the kernel does not maintain a common microcode image for
   all the CPUs, in order to handle case 'b' described below.


b. When some of the CPUs are different from the rest:

   In this case, since we probably need to apply different microcode revisions
   to different CPUs, the kernel maintains a copy of the correct microcode
   image for each CPU (after appropriate CPU type/model discovery using
   functions such as collect_cpu_info()).


c. When a CPU is physically hot-unplugged and a new (and possibly different
   type of) CPU is hot-plugged into the system:

   In the current design of the kernel, whenever a CPU is taken offline during
   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
   (which is sent by the CPU hotplug code), the microcode update driver's
   callback for that event reacts by freeing the kernel's copy of the
   microcode image for that CPU.

   Hence, when a new CPU is brought online, since the kernel finds that it
   doesn't have the microcode image, it does the CPU type/model discovery
   afresh and then requests the appropriate microcode image for that CPU
   from userspace, which is subsequently applied.

   For example, in x86, the mc_cpu_callback() function (which is the microcode
   update driver's callback registered for CPU hotplug events) calls
   microcode_update_cpu(), which in this case calls microcode_init_cpu()
   instead of microcode_resume_cpu(), because it finds that the kernel doesn't
   have a valid microcode image. This ensures that the CPU type/model
   discovery is performed and the right microcode is applied to the CPU after
   getting it from userspace.


d. Handling microcode update during suspend/hibernate:

   Strictly speaking, during a CPU hotplug operation which does not involve
   physically removing or inserting CPUs, the CPUs are not actually powered
   off during a CPU offline. They are just put into the lowest-power C-state
   possible. Hence, in such a case, it is not really necessary to re-apply
   microcode when the CPUs are brought back online, since they wouldn't have
   lost the image during the CPU offline operation.

   This is the usual scenario encountered during a resume after a suspend.
   However, in the case of hibernation, since all the CPUs are completely
   powered off, during restore it becomes necessary to apply the microcode
   images to all the CPUs.

   [Note that we don't expect someone to physically pull out nodes and insert
   nodes with a different type of CPUs in-between a suspend-resume or a
   hibernate/restore cycle.]

   In the current design of the kernel however, during a CPU offline operation
   as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set),
   the existing copy of the microcode image in the kernel is not freed up.
   And during the CPU online operations (during resume/restore), since the
   kernel finds that it already has copies of the microcode images for all the
   CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
   type/model and the need for validating whether the microcode revisions are
   right for the CPUs or not (due to the above assumption that physical CPU
   hotplug will not be done in-between suspend/resume or hibernate/restore
   cycles).


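Cases 'c' and 'd' boil down to one rule: the cached image is dropped on
offline unless the tasks are frozen. The following C sketch models that rule
only. It is not the x86 microcode driver; ``ucode_cpu_dead()``,
``ucode_cpu_online()`` and ``request_image_from_userspace()`` are hypothetical
names, and the image is a dummy string standing in for what
request_firmware() would return.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS 2                    /* assumed CPU count for the model */

static char *ucode_image[NR_CPUS];   /* per-CPU cached image, NULL if none */
static int firmware_requests;        /* how often userspace was asked */

/* Stand-in for request_firmware(): pretend userspace handed us an image. */
static char *request_image_from_userspace(void)
{
    char *img = malloc(8);

    firmware_requests++;
    if (img)
        strcpy(img, "ucode");
    return img;
}

/* Offline callback: free the cached image unless tasks are frozen. */
void ucode_cpu_dead(int cpu, bool tasks_frozen)
{
    if (tasks_frozen)
        return;                      /* suspend/hibernate: keep the copy */
    free(ucode_image[cpu]);
    ucode_image[cpu] = NULL;
}

/* Online callback: returns true if a cached image was simply re-applied. */
bool ucode_cpu_online(int cpu)
{
    if (ucode_image[cpu])
        return true;                 /* "resume" path: no rediscovery */
    ucode_image[cpu] = request_image_from_userspace();  /* "init" path */
    return false;
}
```

An offline/online pair with tasks frozen never reaches userspace, which is
what lets resume work while userspace is still frozen; a regular offline
followed by an online issues a fresh request.
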
III. Known problems
===================

Are there any known problems when regular CPU hotplug and suspend race
with each other?

Yes, they are listed below:

1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
   the _cpu_down() and _cpu_up() functions is *always* 0.
   This might not reflect the true current state of the system, since the
   tasks could have been frozen by an out-of-band event such as a suspend
   operation in progress. Hence, the cpuhp_tasks_frozen variable will not
   reflect the frozen state and the CPU hotplug callbacks which evaluate
   that variable might execute the wrong code path.

2. If a regular CPU hotplug stress test happens to race with the freezer due
   to a suspend operation in progress at the same time, then we could hit the
   situation described below:

    * A regular cpu online operation continues its journey from userspace
      into the kernel, since the freezing has not yet begun.
    * Then freezer gets to work and freezes userspace.
    * If cpu online has not yet completed the microcode update stuff by now,
      it will now start waiting on the frozen userspace in the
      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
    * Now the freezer continues and tries to freeze the remaining tasks. But
      due to this wait mentioned above, the freezer won't be able to freeze
      the cpu online hotplug task and hence freezing of tasks fails.

   As a result of this task freezing failure, the suspend operation gets
   aborted.
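
The ordering problem in item 2 can be reproduced with a toy model of the
freezer. This is a deliberately simplified sketch and not how
kernel/power/process.c is written: the task states, the ``waits_on`` pointer
and the single-pass ``try_to_freeze_all()`` function are all assumptions made
for illustration.

```c
#include <stddef.h>

enum task_state { RUNNING, WAITING_UNINTERRUPTIBLE, FROZEN };

struct task {
    enum task_state state;
    const struct task *waits_on;    /* NULL if not waiting on anyone */
};

/* A waiter makes progress only while the task it waits on still runs. */
static void run_once(struct task *t)
{
    if (t->state == WAITING_UNINTERRUPTIBLE &&
        t->waits_on && t->waits_on->state == RUNNING) {
        t->state = RUNNING;         /* request answered, wait is over */
        t->waits_on = NULL;
    }
}

/* One pass in array order; returns how many tasks could not be frozen. */
size_t try_to_freeze_all(struct task *tasks, size_t n)
{
    size_t stuck = 0;

    for (size_t i = 0; i < n; i++) {
        run_once(&tasks[i]);
        if (tasks[i].state == RUNNING)
            tasks[i].state = FROZEN;
        else if (tasks[i].state == WAITING_UNINTERRUPTIBLE)
            stuck++;                /* cannot be frozen: freezing fails */
    }
    return stuck;
}
```

If the userspace helper that supplies the microcode image is frozen before
the cpu online task gets its answer, the function reports one stuck task,
which corresponds to the aborted suspend described above. If the cpu online
task is served first, every task freezes.
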