====================
PCI Power Management
====================

Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.

An overview of concepts and the Linux kernel's interfaces related to PCI power
management. Based on previous work by Patrick Mochel <mochel@transmeta.com>
(and others).

This document only covers the aspects of power management specific to PCI
devices. For a general description of the kernel's interfaces related to device
power management refer to Documentation/driver-api/pm/devices.rst and
Documentation/power/runtime_pm.rst.

.. contents:

   1. Hardware and Platform Support for PCI Power Management
   2. PCI Subsystem and Device Power Management
   3. PCI Device Drivers and Power Management
   4. Resources


1. Hardware and Platform Support for PCI Power Management
=========================================================

1.1. Native and Platform-Based Power Management
-----------------------------------------------

In general, power management is a feature allowing one to save energy by putting
devices into states in which they draw less power (low-power states) at the
price of reduced functionality or performance.

Usually, a device is put into a low-power state when it is underutilized or
completely inactive. However, when it is necessary to use the device once
again, it has to be put back into the "fully functional" state (full-power
state). This may happen when there are some data for the device to handle or
as a result of an external event requiring the device to be active, which may
be signaled by the device itself.

PCI devices may be put into low-power states in two ways: by using the device
capabilities introduced by the PCI Bus Power Management Interface Specification,
or with the help of platform firmware, such as an ACPI BIOS. In the first
approach, referred to as native PCI power management (native PCI PM) in what
follows, the device power state is changed as a result of writing a
specific value into one of its standard configuration registers. The second
approach requires the platform firmware to provide special methods that may be
used by the kernel to change the device's power state.

Devices supporting the native PCI PM usually can generate wakeup signals called
Power Management Events (PMEs) to let the kernel know about external events
requiring the device to be active. After receiving a PME the kernel is supposed
to put the device that sent it into the full-power state. However, the PCI Bus
Power Management Interface Specification doesn't define any standard method of
delivering the PME from the device to the CPU and the operating system kernel.
It is assumed that the platform firmware will perform this task and therefore,
even though a PCI device is set up to generate PMEs, it also may be necessary to
prepare the platform firmware for notifying the CPU of the PMEs coming from the
device (e.g. by generating interrupts).

In turn, if the methods provided by the platform firmware are used for changing
the power state of a device, usually the platform also provides a method for
preparing the device to generate wakeup signals. In that case, however, it
often also is necessary to prepare the device for generating PMEs using the
native PCI PM mechanism, because the method provided by the platform depends on
that.

Thus in many situations both the native and the platform-based power management
mechanisms have to be used simultaneously to obtain the desired result.

1.2. Native PCI Power Management
--------------------------------

The PCI Bus Power Management Interface Specification (PCI PM Spec) was
introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a
standard interface for performing various operations related to power
management.

The implementation of the PCI PM Spec is optional for conventional PCI devices,
but it is mandatory for PCI Express devices. If a device supports the PCI PM
Spec, it has an 8 byte power management capability field in its PCI
configuration space. This field is used to describe and control the standard
features related to the native PCI power management.
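
As a rough illustration of how that capability field is used to request a
power state, the sketch below computes the value to write into the Power
Management Control/Status Register (PMCSR). The offset and mask follow the
constants in the kernel's include/uapi/linux/pci_regs.h; the two helper
functions are hypothetical and only model the bit manipulation, not an actual
configuration space access.

```c
#include <stdint.h>

/* Layout of the PM capability; names mirror include/uapi/linux/pci_regs.h. */
#define PCI_PM_CTRL            4       /* PMCSR offset within the capability */
#define PCI_PM_CTRL_STATE_MASK 0x0003  /* PowerState field, bits 1:0 */

/*
 * Hypothetical helper: return the PMCSR value that requests power state
 * 'state' (0 = D0, 1 = D1, 2 = D2, 3 = D3hot), leaving other bits unchanged.
 */
static inline uint16_t pmcsr_with_state(uint16_t pmcsr, unsigned int state)
{
	pmcsr &= (uint16_t)~PCI_PM_CTRL_STATE_MASK;
	pmcsr |= (uint16_t)(state & PCI_PM_CTRL_STATE_MASK);
	return pmcsr;
}

/* Hypothetical helper: decode the power state from a PMCSR value. */
static inline unsigned int pmcsr_state(uint16_t pmcsr)
{
	return pmcsr & PCI_PM_CTRL_STATE_MASK;
}
```

In a real driver or in the PCI core the resulting value would be written at
offset pm_cap + PCI_PM_CTRL with a configuration space accessor; that part is
deliberately left out here.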

The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
(B0-B3). The higher the number, the less power is drawn by the device or bus
in that state. However, the higher the number, the longer the latency for
the device or bus to return to the full-power state (D0 or B0, respectively).

There are two variants of the D3 state defined by the specification. The first
one is D3hot, referred to as the software accessible D3, because devices can be
programmed to go into it. The second one, D3cold, is the state that PCI devices
are in when the supply voltage (Vcc) is removed from them. It is not possible
to program a PCI device to go into D3cold, although there may be a programmable
interface for putting the bus the device is on into a state in which Vcc is
removed from all devices on the bus.

PCI bus power management, however, is not supported by the Linux kernel at the
time of this writing and therefore it is not covered by this document.

Note that every PCI device can be in the full-power state (D0) or in D3cold,
regardless of whether or not it implements the PCI PM Spec. In addition to
that, if the PCI PM Spec is implemented by the device, it must support D3hot
as well as D0. The support for the D1 and D2 power states is optional.

PCI devices supporting the PCI PM Spec can be programmed to go to any of the
supported low-power states (except for D3cold). While in D1-D3hot the
standard configuration registers of the device must be accessible to software
(i.e. the device is required to respond to PCI configuration accesses), although
its I/O and memory spaces are then disabled. This allows the device to be
programmatically put into D0. Thus the kernel can switch the device back and
forth between D0 and the supported low-power states (except for D3cold) and the
possible power state transitions the device can undergo are the following:

  +----------------------------+
  | Current State | New State  |
  +----------------------------+
  | D0            | D1, D2, D3 |
  +----------------------------+
  | D1            | D2, D3     |
  +----------------------------+
  | D2            | D3         |
  +----------------------------+
  | D1, D2, D3    | D0         |
  +----------------------------+
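
The table above can be expressed as a small predicate. This is only a sketch:
the enum and the function name are made up for illustration, D3 stands for
D3hot, and D3cold is left out because it cannot be entered by programming the
device.

```c
#include <stdbool.h>

/* Hypothetical names for the software-accessible states (D3 means D3hot). */
enum pm_state { PM_D0, PM_D1, PM_D2, PM_D3HOT };

/*
 * Return true if the transition appears in the table above: any low-power
 * state may go back to D0, otherwise only moves to a deeper state are allowed.
 */
static inline bool pm_transition_allowed(enum pm_state from, enum pm_state to)
{
	if (from == to)
		return false;	/* not a transition at all */
	if (to == PM_D0)
		return true;	/* D1, D2, D3 -> D0 */
	return to > from;	/* D0 -> D1/D2/D3, D1 -> D2/D3, D2 -> D3 */
}
```

For example, going from D2 to D1 is rejected; the device would have to return
to D0 first.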

The transition from D3cold to D0 occurs when the supply voltage is provided to
the device (i.e. power is restored). In that case the device returns to D0 with
a full power-on reset sequence and the power-on defaults are restored to the
device by hardware just as at initial power up.

PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
while in any power state (D0-D3), but they are not required to be capable
of generating PMEs from all supported power states. In particular, the
capability of generating PMEs from D3cold is optional and depends on the
presence of additional voltage (3.3Vaux) allowing the device to remain
sufficiently active to generate a wakeup signal.
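
The set of states from which a device can generate PMEs is advertised in the
Power Management Capabilities (PMC) register as a 5-bit field. The sketch below
decodes it; the mask and shift match the kernel's pci_regs.h, while the helper
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* PME_Support field of the PMC register (bits 15:11), as in pci_regs.h. */
#define PCI_PM_CAP_PME_MASK  0xF800
#define PCI_PM_CAP_PME_SHIFT 11

/*
 * Hypothetical helper: extract the PME support bitmask from a PMC value.
 * Bit 0 = D0, bit 1 = D1, bit 2 = D2, bit 3 = D3hot, bit 4 = D3cold.
 */
static inline unsigned int pmc_pme_support(uint16_t pmc)
{
	return (pmc & PCI_PM_CAP_PME_MASK) >> PCI_PM_CAP_PME_SHIFT;
}

/* Hypothetical helper: can a PME be generated from 'state' (0..4)? */
static inline bool pme_capable(unsigned int pme_support, unsigned int state)
{
	return state <= 4 && (pme_support & (1u << state)) != 0;
}
```

A PMC value of 0xC800, for instance, describes a device that can signal PMEs
from D0, D3hot and D3cold, but not from D1 or D2.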

1.3. ACPI Device Power Management
---------------------------------

The platform firmware support for the power management of PCI devices is
system-specific. However, if the system in question is compliant with the
Advanced Configuration and Power Interface (ACPI) Specification, like the
majority of x86-based systems, it is supposed to implement device power
management interfaces defined by the ACPI standard.

For this purpose the ACPI BIOS provides special functions called "control
methods" that may be executed by the kernel to perform specific tasks, such as
putting a device into a low-power state. These control methods are encoded
using a special byte-code language called the ACPI Machine Language (AML) and
stored in the machine's BIOS. The kernel loads them from the BIOS and executes
them as needed using an AML interpreter that translates the AML byte code into
computations and memory or I/O space accesses. This way, in theory, a BIOS
writer can provide the kernel with a means to perform actions depending
on the system design in a system-specific fashion.

ACPI control methods may be divided into global control methods, which are not
associated with any particular devices, and device control methods, which have
to be defined separately for each device supposed to be handled with the help of
the platform. This means, in particular, that ACPI device control methods can
only be used to handle devices that the BIOS writer knew about in advance. The
ACPI methods used for device power management fall into that category.

The ACPI specification assumes that devices can be in one of four power states
labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
D0-D3 states (although the difference between D3hot and D3cold is not taken
into account by ACPI). Moreover, for each power state of a device there is a
set of power resources that have to be enabled for the device to be put into
that state. These power resources are controlled (i.e. enabled or disabled)
with the help of their own control methods, _ON and _OFF, that have to be
defined individually for each of them.

To put a device into the ACPI power state Dx (where x is a number between 0 and
3 inclusive) the kernel is supposed to (1) enable the power resources required
by the device in this state using their _ON control methods and (2) execute the
_PSx control method defined for the device. In addition to that, if the device
is going to be put into a low-power state (D1-D3) and is supposed to generate
wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
3.0) control method defined for it has to be executed before _PSx. Power
resources that are not required by the device in the target power state and are
not required any more by any other device should be disabled (by executing their
_OFF control methods). If the current power state of the device is D3, it can
only be put into D0 this way.
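
The ordering constraints from the paragraph above can be summarized in a short
sketch. It does not evaluate any ACPI methods; it only lists the steps in
order. All names are hypothetical, and the relative order of _ON and _DSW is an
assumption, since the text only requires that both precede _PSx.

```c
#include <stdbool.h>

/* Hypothetical labels for the kinds of control methods involved. */
enum acpi_step { ACPI_STEP_ON, ACPI_STEP_DSW, ACPI_STEP_PSX, ACPI_STEP_OFF };

#define ACPI_MAX_STEPS 4

/*
 * Fill 'steps' with the order of operations for moving a device from ACPI
 * state D<cur> to D<target>.  Returns the number of steps, or -1 if the
 * request is not valid (for example, leaving D3 for anything other than D0).
 */
static inline int acpi_dx_steps(unsigned int cur, unsigned int target,
				bool wakeup,
				enum acpi_step steps[ACPI_MAX_STEPS])
{
	int n = 0;

	if (cur > 3 || target > 3)
		return -1;
	if (cur == 3 && target != 0)
		return -1;		/* from D3 only D0 is reachable */

	steps[n++] = ACPI_STEP_ON;	/* _ON for resources needed in Dx */
	if (target > 0 && wakeup)
		steps[n++] = ACPI_STEP_DSW;	/* _DSW (or _PSW) before _PSx */
	steps[n++] = ACPI_STEP_PSX;	/* _PSx for the device itself */
	steps[n++] = ACPI_STEP_OFF;	/* _OFF for resources no longer used */
	return n;
}
```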

However, quite often the power states of devices are changed during a
system-wide transition into a sleep state or back into the working state. ACPI
defines four system sleep states, S1, S2, S3, and S4, and denotes the system
working state as S0. In general, the target system sleep (or working) state
determines the highest power (lowest number) state the device can be put
into and the kernel is supposed to obtain this information by executing the
device's _SxD control method (where x is a number between 0 and 4 inclusive).
If the device is required to wake up the system from the target sleep state, the
lowest power (highest number) state it can be put into is also determined by the
target state of the system. The kernel is then supposed to use the device's
_SxW control method to obtain the number of that state. It also is supposed to
use the device's _PRW control method to learn which power resources need to be
enabled for the device to be able to generate wakeup signals.
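
Put differently, _SxD gives a lower bound and, when wakeup is required, _SxW
gives an upper bound on the device state number. The sketch below clamps a
desired state into that range. The function and its parameters are hypothetical
and take the two values as plain integers instead of evaluating ACPI methods;
returning -1 for an empty range is likewise an assumption made for the example.

```c
/*
 * Clamp the desired D-state number 'wanted' into the range allowed for the
 * target system state:
 *   sxd    - shallowest allowed state (result of _SxD)
 *   sxw    - deepest state that still allows wakeup (result of _SxW)
 *   wakeup - nonzero if the device has to wake up the system
 * Returns the state to use, or -1 if wakeup is required but no state
 * satisfies both limits.
 */
static inline int acpi_clamp_dstate(int wanted, int sxd, int sxw, int wakeup)
{
	int state = wanted;

	if (wakeup && sxw < sxd)
		return -1;
	if (state < sxd)
		state = sxd;
	if (wakeup && state > sxw)
		state = sxw;
	return state;
}
```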

1.4. Wakeup Signaling
---------------------

Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
a result of the execution of the _DSW (or _PSW) ACPI control method before
putting the device into a low-power state, have to be caught and handled as
appropriate. If they are sent while the system is in the working state
(ACPI S0), they should be translated into interrupts so that the kernel can
put the devices generating them into the full-power state and take care of the
events that triggered them. In turn, if they are sent while the system is
sleeping, they should cause the system's core logic to trigger wakeup.

On ACPI-based systems wakeup signals sent by conventional PCI devices are
converted into ACPI General-Purpose Events (GPEs) which are hardware signals
from the system core logic generated in response to various events that need to
be acted upon. Every GPE is associated with one or more sources of potentially
interesting events. In particular, a GPE may be associated with a PCI device
capable of signaling wakeup. The information on the connections between GPEs
and event sources is recorded in the system's ACPI BIOS from where it can be
read by the kernel.

If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
associated with it (if there is one) is triggered. The GPEs associated with PCI
bridges may also be triggered in response to a wakeup signal from one of the
devices below the bridge (this also is the case for root bridges) and, for
example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
handled this way.

A GPE may be triggered when the system is sleeping (i.e. when it is in one of
the ACPI S1-S4 states), in which case system wakeup is started by its core logic
(the device that was the source of the signal causing the system wakeup to occur
may be identified later). The GPEs used in such situations are referred to as
wakeup GPEs.

Usually, however, GPEs are also triggered when the system is in the working
state (ACPI S0) and in that case the system's core logic generates a System
Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI
handler identifies the GPE that caused the interrupt to be generated which,
in turn, allows the kernel to identify the source of the event (that may be
a PCI device signaling wakeup). The GPEs used for notifying the kernel of
events occurring while the system is in the working state are referred to as
runtime GPEs.

Unfortunately, there is no standard way of handling wakeup signals sent by
conventional PCI devices on systems that are not ACPI-based, but there is one
for PCI Express devices. Namely, the PCI Express Base Specification introduced
a native mechanism for converting native PCI PMEs into interrupts generated by
root ports. For conventional PCI devices native PMEs are out-of-band, so they
are routed separately and they need not pass through bridges (in principle they
may be routed directly to the system's core logic), but for PCI Express devices
they are in-band messages that have to pass through the PCI Express hierarchy,
including the root port on the path from the device to the Root Complex. Thus
it was possible to introduce a mechanism by which a root port generates an
interrupt whenever it receives a PME message from one of the devices below it.
The PCI Express Requester ID of the device that sent the PME message is then
recorded in one of the root port's configuration registers from where it may be
read by the interrupt handler allowing the device to be identified. [PME
messages sent by PCI Express endpoints integrated with the Root Complex don't
pass through root ports, but instead they cause a Root Complex Event Collector
(if there is one) to generate interrupts.]
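
To make the identification step more concrete, the sketch below decodes a value
read from the root port's Root Status register, where the PCI Express
specification places the PME Requester ID in bits 15:0 and the PME Status flag
in bit 16. PCI_EXP_RTSTA_PME is the name used in the kernel's pci_regs.h; the
structure and the helper function are hypothetical and no register access is
performed.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_EXP_RTSTA_PME 0x00010000	/* PME Status, as in pci_regs.h */

/* Hypothetical container for a decoded Requester ID. */
struct pme_source {
	uint8_t bus;	/* bus number, Requester ID bits 15:8 */
	uint8_t dev;	/* device number, bits 7:3 */
	uint8_t fn;	/* function number, bits 2:0 */
};

/*
 * Decode the sender of a PME from a Root Status value.  Returns false if the
 * PME Status bit is clear, in which case 'src' is left untouched.
 */
static inline bool root_status_pme_source(uint32_t rtsta,
					  struct pme_source *src)
{
	uint16_t req_id = (uint16_t)(rtsta & 0xffff);

	if (!(rtsta & PCI_EXP_RTSTA_PME))
		return false;

	src->bus = (uint8_t)(req_id >> 8);
	src->dev = (uint8_t)((req_id >> 3) & 0x1f);
	src->fn = (uint8_t)(req_id & 0x07);
	return true;
}
```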

In principle the native PCI Express PME signaling may also be used on ACPI-based
systems along with the GPEs, but to use it the kernel has to ask the system's
ACPI BIOS to release control of root port configuration registers. The ACPI
BIOS, however, is not required to allow the kernel to control these registers
and if it doesn't do that, the kernel must not modify their contents. Of course
the native PCI Express PME signaling cannot be used by the kernel in that case.


2. PCI Subsystem and Device Power Management
============================================

2.1. Device Power Management Callbacks
--------------------------------------

The PCI Subsystem participates in the power management of PCI devices in a
number of ways. First of all, it provides an intermediate code layer between
the device power management core (PM core) and PCI device drivers.
Specifically, the pm field of the PCI subsystem's struct bus_type object,
pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
pointers to several device power management callbacks::

  const struct dev_pm_ops pci_dev_pm_ops = {
        .prepare = pci_pm_prepare,
        .complete = pci_pm_complete,
        .suspend = pci_pm_suspend,
        .resume = pci_pm_resume,
        .freeze = pci_pm_freeze,
        .thaw = pci_pm_thaw,
        .poweroff = pci_pm_poweroff,
        .restore = pci_pm_restore,
        .suspend_noirq = pci_pm_suspend_noirq,
        .resume_noirq = pci_pm_resume_noirq,
        .freeze_noirq = pci_pm_freeze_noirq,
        .thaw_noirq = pci_pm_thaw_noirq,
        .poweroff_noirq = pci_pm_poweroff_noirq,
        .restore_noirq = pci_pm_restore_noirq,
        .runtime_suspend = pci_pm_runtime_suspend,
        .runtime_resume = pci_pm_runtime_resume,
        .runtime_idle = pci_pm_runtime_idle,
  };

These callbacks are executed by the PM core in various situations related to
device power management and they, in turn, execute power management callbacks
provided by PCI device drivers. They also perform power management operations
involving some standard configuration registers of PCI devices that device
drivers need not know or care about.

The structure representing a PCI device, struct pci_dev, contains several fields
that these callbacks operate on::

  struct pci_dev {
        ...
        pci_power_t     current_state;  /* Current operating state. */
        int             pm_cap;         /* PM capability offset in the
                                           configuration space */
        unsigned int    pme_support:5;  /* Bitmask of states from which PME#
                                           can be generated */
        unsigned int    pme_interrupt:1;/* Is native PCIe PME signaling used? */
        unsigned int    d1_support:1;   /* Low power state D1 is supported */
        unsigned int    d2_support:1;   /* Low power state D2 is supported */
        unsigned int    no_d1d2:1;      /* D1 and D2 are forbidden */
        unsigned int    wakeup_prepared:1;  /* Device prepared for wake up */
        unsigned int    d3hot_delay;    /* D3hot->D0 transition time in ms */
        ...
  };

They also indirectly use some fields of the struct device that is embedded in
struct pci_dev.

2.2. Device Initialization
--------------------------

The PCI subsystem's first task related to device power management is to
prepare the device for power management and initialize the fields of struct
pci_dev used for this purpose. This happens in two functions defined in
drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().

The first of these functions checks if the device supports native PCI PM
and if that's the case the offset of its power management capability structure
in the configuration space is stored in the pm_cap field of the device's struct
pci_dev object. Next, the function checks which PCI low-power states are
supported by the device and from which low-power states the device can generate
native PCI PMEs. The power management fields of the device's struct pci_dev and
the struct device embedded in it are updated accordingly and the generation of
PMEs by the device is disabled.

The second function checks if the device can be prepared to signal wakeup with
the help of the platform firmware, such as the ACPI BIOS. If that is the case,
the function updates the wakeup fields in struct device embedded in the
device's struct pci_dev and uses the firmware-provided method to prevent the
device from signaling wakeup.

At this point the device is ready for power management. For driverless devices,
however, this functionality is limited to a few basic operations carried out
during system-wide transitions to a sleep state and back to the working state.

2.3. Runtime Device Power Management
------------------------------------

The PCI subsystem plays a vital role in the runtime power management of PCI
devices. For this purpose it uses the general runtime power management
(runtime PM) framework described in Documentation/power/runtime_pm.rst.
Namely, it provides subsystem-level callbacks::

  pci_pm_runtime_suspend()
  pci_pm_runtime_resume()
  pci_pm_runtime_idle()

that are executed by the core runtime PM routines. It also implements the
entire mechanics necessary for handling runtime wakeup signals from PCI devices
in low-power states, which at the time of this writing works for both the native
PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
Section 1.

First, a PCI device is put into a low-power state, or suspended, with the help
of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
pci_pm_runtime_suspend() to do the actual job. For this to work, the device's
driver has to provide a pm->runtime_suspend() callback (see below), which is
run by pci_pm_runtime_suspend() as the first action. If the driver's callback
returns successfully, the device's standard configuration registers are saved,
the device is prepared to generate wakeup signals and, finally, it is put into
the target low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup. The exact method of signaling wakeup is
system-dependent and is determined by the PCI subsystem on the basis of the
reported capabilities of the device and the platform firmware. To prepare the
device for signaling wakeup and put it into the selected low-power state, the
PCI subsystem can use the platform firmware as well as the device's native PCI
PM capabilities, if supported.
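
For a device that relies on native PCI PM alone, the selection rule can be
sketched as a scan of the pme_support bitmask from the deepest
software-accessible state upwards. This is a simplified model under stated
assumptions: the function name is hypothetical, and it ignores the platform
firmware, D3cold and the no_d1d2 restriction that the PCI subsystem also takes
into account.

```c
/*
 * Return the number of the deepest state among D3hot, D2 and D1 from which
 * the device can generate a PME, given its pme_support bitmask (bit n set
 * means "PME can be generated from Dn", with bit 3 standing for D3hot).
 * Returns 0 (D0) if wakeup is not possible from any low-power state.
 */
static inline int deepest_wakeup_state(unsigned int pme_support)
{
	int state;

	for (state = 3; state > 0; state--) {
		if (pme_support & (1u << state))
			return state;
	}
	return 0;
}
```

With pme_support equal to 0x19 (D0, D3hot and D3cold) the function picks
D3hot; with 0x03 (D0 and D1) it picks D1.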
| 391 | |
| 392 | It is expected that the device driver's pm->runtime_suspend() callback will |
| 393 | not attempt to prepare the device for signaling wakeup or to put it into a |
| 394 | low-power state. The driver ought to leave these tasks to the PCI subsystem |
| 395 | that has all of the information necessary to perform them. |
| 396 | |
A suspended device is brought back into the "active" state, or resumed,
with the help of pm_request_resume() or pm_runtime_resume() which both call
pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's
driver provides a pm->runtime_resume() callback (see below). However, before
the driver's callback is executed, pci_pm_runtime_resume() brings the device
back into the full-power state, prevents it from signaling wakeup while in that
state and restores its standard configuration registers. Thus the driver's
callback need not worry about the PCI-specific aspects of the device resume.

Note that generally pci_pm_runtime_resume() may be called in two different
situations. First, it may be called at the request of the device's driver, for
example if there is some data for it to process. Second, it may be called
as a result of a wakeup signal from the device itself (this is sometimes
referred to as "remote wakeup"). Of course, for this purpose the wakeup signal
is handled in one of the ways described in Section 1 and finally converted into
a notification for the PCI subsystem after the source device has been
identified.

The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
and pm_request_idle(), executes the device driver's pm->runtime_idle()
callback, if defined, and if that callback doesn't return an error code (or is
not present at all), suspends the device with the help of pm_runtime_suspend().
Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
example, it is called right after the device has just been resumed), in which
case it is expected to suspend the device if that makes sense. Usually,
however, the PCI subsystem doesn't know whether the device really can be
suspended, so it lets the device's driver decide by running its
pm->runtime_idle() callback.
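
The idle decision described above can be sketched as a small user-space model.
This is not kernel code: struct model_dev, struct model_pm_ops and every
model_*() function are hypothetical stand-ins, and model_runtime_suspend() only
records that a suspend was requested. The sketch follows the description in
this section rather than any particular kernel version.

```c
#include <assert.h>
#include <stddef.h>

struct model_dev;

/* Hypothetical stand-in for the runtime_idle member of struct dev_pm_ops. */
struct model_pm_ops {
    int (*runtime_idle)(struct model_dev *dev);
};

struct model_dev {
    const struct model_pm_ops *pm;  /* NULL: driver provides no PM ops */
    int suspended;                  /* set once a suspend was requested */
};

/* Stand-in for pm_runtime_suspend(); it only records the request. */
static int model_runtime_suspend(struct model_dev *dev)
{
    dev->suspended = 1;
    return 0;
}

/* Ask the driver first; suspend unless its callback returned an error. */
int model_pci_runtime_idle(struct model_dev *dev)
{
    assert(dev != NULL);

    if (dev->pm && dev->pm->runtime_idle) {
        int ret = dev->pm->runtime_idle(dev);

        if (ret)
            return ret;     /* driver objected, device stays active */
    }
    return model_runtime_suspend(dev);
}

/* Example driver callbacks: one objects (arbitrary error value), one agrees. */
int model_idle_busy(struct model_dev *dev) { (void)dev; return -1; }
int model_idle_ok(struct model_dev *dev) { (void)dev; return 0; }
```

A device whose driver has no runtime_idle() callback is suspended, exactly like
one whose callback returns zero; only a nonzero return value keeps it active.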

2.4. System-Wide Power Transitions
----------------------------------
There are a few different types of system-wide power transitions, described in
Documentation/driver-api/pm/devices.rst. Each of them requires devices to be
handled in a specific way and the PM core executes subsystem-level power
management callbacks for this purpose. They are executed in phases such that
each phase involves executing the same subsystem-level callback for every device
belonging to the given subsystem before the next phase begins. These phases
always run after tasks have been frozen.
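
The phase ordering can be illustrated with a minimal user-space sketch. It
only models the ordering rule stated above (a phase is run for every device
before the next phase starts); the model_log type and model_run_transition()
are hypothetical names, and phases and devices are reduced to plain indices.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_LOG_MAX 32

/* One executed callback: which phase ran for which device. */
struct model_step {
    size_t phase;
    size_t dev;
};

struct model_log {
    struct model_step step[MODEL_LOG_MAX];
    size_t len;
};

/* Run every phase for all devices before moving on to the next phase. */
void model_run_transition(size_t nr_phases, size_t nr_devs,
                          struct model_log *log)
{
    assert(log != NULL);
    log->len = 0;

    for (size_t phase = 0; phase < nr_phases; phase++) {
        for (size_t dev = 0; dev < nr_devs; dev++) {
            if (log->len == MODEL_LOG_MAX)
                return;     /* log full, stop recording */
            log->step[log->len].phase = phase;
            log->step[log->len].dev = dev;
            log->len++;
        }
    }
}
```

With three phases and two devices the log holds six entries, and both devices
appear under phase 0 before either of them appears under phase 1.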

2.4.1. System Suspend
^^^^^^^^^^^^^^^^^^^^^

When the system is going into a sleep state in which the contents of memory will
be preserved, such as one of the ACPI sleep states S1-S3, the phases are:

	prepare, suspend, suspend_noirq.

The following PCI bus type's callbacks, respectively, are used in these phases::

	pci_pm_prepare()
	pci_pm_suspend()
	pci_pm_suspend_noirq()

The pci_pm_prepare() routine first puts the device into the "fully functional"
state with the help of pm_runtime_resume(). Then, it executes the device
driver's pm->prepare() callback if defined (i.e. if the driver's struct
dev_pm_ops object is present and the prepare pointer in that object is valid).

The pci_pm_suspend() routine first checks if the device's driver implements
legacy PCI suspend routines (see Section 3), in which case the driver's legacy
suspend callback is executed, if present, and its result is returned. Next, if
the device's driver doesn't provide a struct dev_pm_ops object (containing
pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
simply turns off the device's bus master capability and runs
pcibios_disable_device() to disable it, unless the device is a bridge (PCI
bridges are ignored by this routine). Next, the device driver's pm->suspend()
callback is executed, if defined, and its result is returned if it fails.
Finally, pci_fixup_device() is called to apply hardware suspend quirks related
to the device if necessary.

Note that the suspend phase is carried out asynchronously for PCI devices, so
the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
devices that don't depend on each other in a known way (i.e. none of the paths
in the device tree from the root bridge to a leaf device contains both of them).
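
The independence condition in the note above amounts to "neither device is an
ancestor of the other". A minimal sketch of that check follows, assuming a
hypothetical struct model_node that holds nothing but a parent pointer; it
covers only the tree-path rule stated here, not any other dependency the PM
core may track.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device-tree node; parent is NULL for the root bridge. */
struct model_node {
    const struct model_node *parent;
};

/* True if @anc lies on the path from the root down to @dev (excluding @dev). */
static bool model_is_ancestor(const struct model_node *anc,
                              const struct model_node *dev)
{
    for (dev = dev->parent; dev; dev = dev->parent)
        if (dev == anc)
            return true;
    return false;
}

/* Two distinct devices are independent if no root-to-leaf path has both. */
bool model_may_suspend_in_parallel(const struct model_node *a,
                                   const struct model_node *b)
{
    assert(a != NULL && b != NULL);
    return a != b && !model_is_ancestor(a, b) && !model_is_ancestor(b, a);
}
```

Two endpoints behind the same bridge pass the check, while a bridge and any
device below it do not.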

The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
been called, which means that the device driver's interrupt handler won't be
invoked while this routine is running. It first checks if the device's driver
implements legacy PCI suspend routines (see Section 3), in which case the legacy
late suspend routine is called and its result is returned (the standard
configuration registers of the device are saved if the driver's callback hasn't
done that). Second, if the device driver's struct dev_pm_ops object is not
present, the device's standard configuration registers are saved and the routine
returns success. Otherwise the device driver's pm->suspend_noirq() callback is
executed, if present, and its result is returned if it fails. Next, if the
device's standard configuration registers haven't been saved yet (one of the
device driver's callbacks executed before might do that), pci_pm_suspend_noirq()
saves them, prepares the device to signal wakeup (if necessary) and puts it into
a low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup while the system is in the target sleep
state. Just like in the runtime PM case described above, the mechanism of
signaling wakeup is system-dependent and determined by the PCI subsystem, which
is also responsible for preparing the device to signal wakeup from the system's
target sleep state as appropriate.

PCI device drivers (that don't implement legacy power management callbacks) are
generally not expected to prepare devices for signaling wakeup or to put them
into low-power states. However, if one of the driver's suspend callbacks
(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
registers, pci_pm_suspend_noirq() will assume that the device has been prepared
to signal wakeup and put into a low-power state by the driver (the driver is
then assumed to have used the helper functions provided by the PCI subsystem for
this purpose). PCI device drivers are not encouraged to do that, but in some
rare cases doing that in the driver may be the optimum approach.
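
The consequence of a driver saving the configuration registers itself can be
shown with a user-space sketch. Everything here is a hypothetical model: the
three flags in struct model_pci_dev stand for "registers saved", "wakeup
prepared" and "in a low-power state", and model_pci_suspend_noirq() covers only
the non-legacy path described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_pci_dev {
    bool state_saved;       /* standard configuration registers saved */
    bool wakeup_prepared;   /* device set up to signal wakeup */
    bool low_power;         /* device put into a low-power state */
};

typedef int (*model_suspend_cb)(struct model_pci_dev *dev);

/* Run the driver's callback, then finish the job only if it saved nothing. */
int model_pci_suspend_noirq(struct model_pci_dev *dev,
                            model_suspend_cb driver_cb)
{
    assert(dev != NULL);

    if (driver_cb) {
        int err = driver_cb(dev);

        if (err)
            return err;
    }

    if (!dev->state_saved) {
        dev->state_saved = true;
        dev->wakeup_prepared = true;
        dev->low_power = true;
    }
    /* Otherwise the driver is assumed to have done all three steps. */
    return 0;
}

/* Example driver callback that saves the registers but does nothing else. */
int model_cb_saves_state_only(struct model_pci_dev *dev)
{
    dev->state_saved = true;
    return 0;
}
```

In this model a callback that only saves the registers leaves the device
neither prepared for wakeup nor in a low-power state, which is why a driver
that saves the state is expected to carry out the remaining steps too.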

2.4.2. System Resume
^^^^^^^^^^^^^^^^^^^^

When the system is undergoing a transition from a sleep state in which the
contents of memory have been preserved, such as one of the ACPI sleep states
S1-S3, into the working state (ACPI S0), the phases are:

	resume_noirq, resume, complete.

The following PCI bus type's callbacks, respectively, are executed in these
phases::

	pci_pm_resume_noirq()
	pci_pm_resume()
	pci_pm_complete()

The pci_pm_resume_noirq() routine first puts the device into the full-power
state, restores its standard configuration registers and applies early resume
hardware quirks related to the device, if necessary. This is done
unconditionally, regardless of whether or not the device's driver implements
legacy PCI power management callbacks (this way all PCI devices are in the
full-power state and their standard configuration registers have been restored
when their interrupt handlers are invoked for the first time during resume,
which allows the kernel to avoid problems with the handling of shared interrupts
by drivers whose devices are still suspended). If legacy PCI power management
callbacks (see Section 3) are implemented by the device's driver, the legacy
early resume callback is executed and its result is returned. Otherwise, the
device driver's pm->resume_noirq() callback is executed, if defined, and its
result is returned.

The pci_pm_resume() routine first checks if the device's standard configuration
registers have been restored and restores them if that's not the case (this
is only necessary in the error path during a failing suspend). Next, resume
hardware quirks related to the device are applied, if necessary, and if the
device's driver implements legacy PCI power management callbacks (see
Section 3), the driver's legacy resume callback is executed and its result is
returned. Otherwise, the device's wakeup signaling mechanisms are blocked and
its driver's pm->resume() callback is executed, if defined (the callback's
result is then returned).

The resume phase is carried out asynchronously for PCI devices, like the
suspend phase described above, which means that if two PCI devices don't depend
on each other in a known way, the pci_pm_resume() routine may be executed for
both of them in parallel.

The pci_pm_complete() routine only executes the device driver's pm->complete()
callback, if defined.

2.4.3. System Hibernation
^^^^^^^^^^^^^^^^^^^^^^^^^

System hibernation is more complicated than system suspend, because it requires
a system image to be created and written into a persistent storage medium. The
image is created atomically and all devices are quiesced, or frozen, before that
happens.

The freezing of devices is carried out after enough memory has been freed (at
the time of this writing the image creation requires at least 50% of system RAM
to be free) in the following three phases:

	prepare, freeze, freeze_noirq

that correspond to the PCI bus type's callbacks::

	pci_pm_prepare()
	pci_pm_freeze()
	pci_pm_freeze_noirq()

This means that the prepare phase is exactly the same as for system suspend.
The other two phases, however, are different.

The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
and it doesn't apply the suspend-related hardware quirks. It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The pci_pm_freeze_noirq() routine, in turn, is similar to
pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the
device for signaling wakeup and put it into a low-power state. Still, it saves
the device's standard configuration registers if they haven't been saved by one
of the driver's callbacks.

Once the image has been created, it has to be saved. However, at this point all
devices are frozen and they cannot handle I/O, while their ability to handle
I/O is obviously necessary for the image saving. Thus they have to be brought
back to the fully functional state and this is done in the following phases:

	thaw_noirq, thaw, complete

using the following PCI bus type's callbacks::

	pci_pm_thaw_noirq()
	pci_pm_thaw()
	pci_pm_complete()

respectively.

The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq().
It puts the device into the full power state and restores its standard
configuration registers. It also executes the device driver's pm->thaw_noirq()
callback, if defined, instead of pm->resume_noirq().

The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
driver's pm->thaw() callback instead of pm->resume(). It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The complete phase is the same as for system resume.

After saving the image, devices need to be powered down before the system can
enter the target sleep state (ACPI S4 for ACPI-based systems). This is done in
three phases:

	prepare, poweroff, poweroff_noirq

where the prepare phase is exactly the same as for system suspend. The other
two phases are analogous to the suspend and suspend_noirq phases, respectively.
The PCI subsystem-level callbacks they correspond to::

	pci_pm_poweroff()
	pci_pm_poweroff_noirq()

work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively,
although they don't attempt to save the device's standard configuration
registers.

2.4.4. System Restore
^^^^^^^^^^^^^^^^^^^^^

System restore requires a hibernation image to be loaded into memory and the
pre-hibernation memory contents to be restored before the pre-hibernation system
activity can be resumed.

As described in Documentation/driver-api/pm/devices.rst, the hibernation image
is loaded into memory by a fresh instance of the kernel, called the boot kernel,
which in turn is loaded and run by a boot loader in the usual way. After the
boot kernel has loaded the image, it needs to replace its own code and data with
the code and data of the "hibernated" kernel stored within the image, called the
image kernel. For this purpose all devices are frozen just like before creating
the image during hibernation, in the

	prepare, freeze, freeze_noirq

phases described above. However, the devices affected by these phases are only
those having drivers in the boot kernel; other devices will still be in whatever
state the boot loader left them.

Should the restoration of the pre-hibernation memory contents fail, the boot
kernel would go through the "thawing" procedure described above, using the
thaw_noirq, thaw, and complete phases (that will only affect the devices having
drivers in the boot kernel), and then continue running normally.

If the pre-hibernation memory contents are restored successfully, which is the
usual situation, control is passed to the image kernel, which then becomes
responsible for bringing the system back to the working state. To achieve this,
it must restore the devices' pre-hibernation functionality, which is done much
like waking up from the memory sleep state, although it involves different
phases:

	restore_noirq, restore, complete

The first two of these are analogous to the resume_noirq and resume phases
described above, respectively, and correspond to the following PCI subsystem
callbacks::

	pci_pm_restore_noirq()
	pci_pm_restore()

These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(),
respectively, but they execute the device driver's pm->restore_noirq() and
pm->restore() callbacks, if available.

The complete phase is carried out in exactly the same way as during system
resume.
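
As a recap of Section 2.4, the phase sequences listed above can be collected
into one lookup table. The sketch below is a user-space illustration only: the
model_transition enumeration and model_phase_index() are hypothetical names,
the phase strings are the ones used in this document, and MODEL_RESTORE lists
the phases run by the image kernel (the boot kernel first goes through the
freeze sequence).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum model_transition {
    MODEL_SUSPEND,
    MODEL_RESUME,
    MODEL_HIBERNATE_FREEZE,     /* before creating the image */
    MODEL_HIBERNATE_THAW,       /* before saving the image */
    MODEL_HIBERNATE_POWEROFF,   /* after saving the image */
    MODEL_RESTORE,              /* image kernel, after a successful restore */
    MODEL_NR_TRANSITIONS
};

#define MODEL_NR_PHASES 3

static const char *const model_phases[MODEL_NR_TRANSITIONS][MODEL_NR_PHASES] = {
    [MODEL_SUSPEND]            = { "prepare", "suspend", "suspend_noirq" },
    [MODEL_RESUME]             = { "resume_noirq", "resume", "complete" },
    [MODEL_HIBERNATE_FREEZE]   = { "prepare", "freeze", "freeze_noirq" },
    [MODEL_HIBERNATE_THAW]     = { "thaw_noirq", "thaw", "complete" },
    [MODEL_HIBERNATE_POWEROFF] = { "prepare", "poweroff", "poweroff_noirq" },
    [MODEL_RESTORE]            = { "restore_noirq", "restore", "complete" },
};

/* Position of @phase within @t (0 is first), or -1 if @t does not use it. */
int model_phase_index(enum model_transition t, const char *phase)
{
    assert(phase != NULL);

    if ((int)t < 0 || t >= MODEL_NR_TRANSITIONS)
        return -1;

    for (int i = 0; i < MODEL_NR_PHASES; i++)
        if (strcmp(model_phases[t][i], phase) == 0)
            return i;
    return -1;
}
```

The table makes the symmetry visible: every "down" sequence ends with a _noirq
phase, and every "up" sequence starts with one.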


3. PCI Device Drivers and Power Management
==========================================

3.1. Power Management Callbacks
-------------------------------

PCI device drivers participate in power management by providing callbacks to be
executed by the PCI subsystem's power management routines described above and by
controlling the runtime power management of their devices.

At the time of this writing there are two ways to define power management
callbacks for a PCI device driver: the recommended one, based on using a
dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and
the "legacy" one, in which the .suspend() and .resume() callbacks from struct
pci_driver are used. The legacy approach, however, doesn't allow one to define
runtime power management callbacks and is not really suitable for any new
drivers. Therefore it is not covered by this document (refer to the source code
to learn more about it).

It is recommended that all PCI device drivers define a struct dev_pm_ops object
containing pointers to power management (PM) callbacks that will be executed by
the PCI subsystem's PM routines in various circumstances. A pointer to the
driver's struct dev_pm_ops object has to be assigned to the driver.pm field in
its struct pci_driver object. Once that has happened, the "legacy" PM callbacks
in struct pci_driver are ignored (even if they are not NULL).

The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI
subsystem will handle the device in a simplified default manner. If they are
defined, though, they are expected to behave as described in the following
subsections.

3.1.1. prepare()
^^^^^^^^^^^^^^^^

The prepare() callback is executed during system suspend, during hibernation
(when a hibernation image is about to be created), during power-off after
saving a hibernation image and during system restore, when a hibernation image
has just been loaded into memory.

This callback is only necessary if the driver's device has children that in
general may be registered at any time. In that case the role of the prepare()
callback is to prevent new children of the device from being registered until
one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.

In addition to that the prepare() callback may carry out some operations
preparing the device to be suspended, although it should not allocate memory
(if additional memory is required to suspend the device, it has to be
preallocated earlier, for example in a suspend/hibernate notifier as described
in Documentation/driver-api/pm/notifiers.rst).

3.1.2. suspend()
^^^^^^^^^^^^^^^^

The suspend() callback is only executed during system suspend, after prepare()
callbacks have been executed for all devices in the system.

This callback is expected to quiesce the device and prepare it to be put into a
low-power state by the PCI subsystem. It is not required (in fact it is not
even recommended) that a PCI driver's suspend() callback save the standard
configuration registers of the device, prepare it for waking up the system, or
put it into a low-power state. All of these operations can very well be taken
care of by the PCI subsystem, without the driver's participation.

However, in some rare cases it is convenient to carry out these operations in
a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and
pci_set_power_state() should be used to save the device's standard configuration
registers, to prepare it for system wakeup (if necessary), and to put it into a
low-power state, respectively. Moreover, if the driver calls pci_save_state(),
the PCI subsystem will not execute either pci_prepare_to_sleep() or
pci_set_power_state() for its device, so the driver is then responsible for
handling the device as appropriate.

While the suspend() callback is being executed, the driver's interrupt handler
can be invoked to handle an interrupt from the device, so all suspend-related
operations relying on the driver's ability to handle interrupts should be
carried out in this callback.

3.1.3. suspend_noirq()
^^^^^^^^^^^^^^^^^^^^^^

The suspend_noirq() callback is only executed during system suspend, after
suspend() callbacks have been executed for all devices in the system and
after device interrupts have been disabled by the PM core.

The difference between suspend_noirq() and suspend() is that the driver's
interrupt handler will not be invoked while suspend_noirq() is running. Thus
suspend_noirq() can carry out operations that would cause race conditions to
arise if they were performed in suspend().

3.1.4. freeze()
^^^^^^^^^^^^^^^

The freeze() callback is hibernation-specific and is executed in two situations:
during hibernation, after prepare() callbacks have been executed for all devices
in preparation for the creation of a system image, and during restore,
after a system image has been loaded into memory from persistent storage and the
prepare() callbacks have been executed for all devices.

The role of this callback is analogous to the role of the suspend() callback
described above. In fact, they only need to be different in the rare cases when
the driver takes the responsibility for putting the device into a low-power
state.

In those cases the freeze() callback should not prepare the device for system
wakeup or put it into a low-power state. Still, either it or freeze_noirq()
should save the device's standard configuration registers using
pci_save_state().

3.1.5. freeze_noirq()
^^^^^^^^^^^^^^^^^^^^^

The freeze_noirq() callback is hibernation-specific. It is executed during
hibernation, after prepare() and freeze() callbacks have been executed for all
devices in preparation for the creation of a system image, and during restore,
after a system image has been loaded into memory and after prepare() and
freeze() callbacks have been executed for all devices. It is always executed
after device interrupts have been disabled by the PM core.

The role of this callback is analogous to the role of the suspend_noirq()
callback described above and it is very rarely necessary to define
freeze_noirq().

The difference between freeze_noirq() and freeze() is analogous to the
difference between suspend_noirq() and suspend().

3.1.6. poweroff()
^^^^^^^^^^^^^^^^^

The poweroff() callback is hibernation-specific. It is executed when the system
is about to be powered off after saving a hibernation image to persistent
storage. prepare() callbacks are executed for all devices before poweroff() is
called.

The role of this callback is analogous to the role of the suspend() and freeze()
callbacks described above, although it does not need to save the contents of
the device's registers. In particular, if the driver wants to put the device
into a low-power state itself instead of allowing the PCI subsystem to do that,
the poweroff() callback should use pci_prepare_to_sleep() and
pci_set_power_state() to prepare the device for system wakeup and to put it
into a low-power state, respectively, but it need not save the device's standard
configuration registers.

3.1.7. poweroff_noirq()
^^^^^^^^^^^^^^^^^^^^^^^

The poweroff_noirq() callback is hibernation-specific. It is executed after
poweroff() callbacks have been executed for all devices in the system.

The role of this callback is analogous to the role of the suspend_noirq() and
freeze_noirq() callbacks described above, but it does not need to save the
contents of the device's registers.

The difference between poweroff_noirq() and poweroff() is analogous to the
difference between suspend_noirq() and suspend().

3.1.8. resume_noirq()
^^^^^^^^^^^^^^^^^^^^^

The resume_noirq() callback is only executed during system resume, after the
PM core has enabled the non-boot CPUs. The driver's interrupt handler will not
be invoked while resume_noirq() is running, so this callback can carry out
operations that might race with the interrupt handler.

Since the PCI subsystem unconditionally puts all devices into the full power
state in the resume_noirq phase of system resume and restores their standard
configuration registers, resume_noirq() is usually not necessary. In general
it should only be used for performing operations that would lead to race
conditions if carried out by resume().

3.1.9. resume()
^^^^^^^^^^^^^^^

The resume() callback is only executed during system resume, after
resume_noirq() callbacks have been executed for all devices in the system and
device interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-suspend configuration of the
device and bringing it back to the fully functional state. The device should be
able to process I/O in a usual way after resume() has returned.

3.1.10. thaw_noirq()
^^^^^^^^^^^^^^^^^^^^

The thaw_noirq() callback is hibernation-specific. It is executed after a
system image has been created and the non-boot CPUs have been enabled by the PM
core, in the thaw_noirq phase of hibernation. It also may be executed if the
loading of a hibernation image fails during system restore (it is then executed
after enabling the non-boot CPUs). The driver's interrupt handler will not be
invoked while thaw_noirq() is running.

The role of this callback is analogous to the role of resume_noirq(). The
difference between these two callbacks is that thaw_noirq() is executed after
freeze() and freeze_noirq(), so in general it does not need to modify the
contents of the device's registers.

3.1.11. thaw()
^^^^^^^^^^^^^^

The thaw() callback is hibernation-specific. It is executed after thaw_noirq()
callbacks have been executed for all devices in the system and after device
interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-freeze configuration of
the device, so that it will work in the usual way after thaw() has returned.

3.1.12. restore_noirq()
^^^^^^^^^^^^^^^^^^^^^^^

The restore_noirq() callback is hibernation-specific. It is executed in the
restore_noirq phase of hibernation, when the boot kernel has passed control to
the image kernel and the non-boot CPUs have been enabled by the image kernel's
PM core.

This callback is analogous to resume_noirq() with the exception that it cannot
make any assumption about the previous state of the device, even if the BIOS
(or generally the platform firmware) is known to preserve that state over a
suspend-resume cycle.

For the vast majority of PCI device drivers there is no difference between
resume_noirq() and restore_noirq().

3.1.13. restore()
^^^^^^^^^^^^^^^^^

The restore() callback is hibernation-specific. It is executed after
restore_noirq() callbacks have been executed for all devices in the system and
after the PM core has enabled device drivers' interrupt handlers to be invoked.

This callback is analogous to resume(), just like restore_noirq() is analogous
to resume_noirq(). Consequently, the difference between restore_noirq() and
restore() is analogous to the difference between resume_noirq() and resume().

For the vast majority of PCI device drivers there is no difference between
resume() and restore().

3.1.14. complete()
^^^^^^^^^^^^^^^^^^

The complete() callback is executed in the following situations:

- during system resume, after resume() callbacks have been executed for all
  devices,
- during hibernation, before saving the system image, after thaw() callbacks
  have been executed for all devices,
- during system restore, when the system is going back to its pre-hibernation
  state, after restore() callbacks have been executed for all devices.

It also may be executed if the loading of a hibernation image into memory fails
(in that case it is run after thaw() callbacks have been executed for all
devices that have drivers in the boot kernel).

This callback is entirely optional, although it may be necessary if the
prepare() callback performs operations that need to be reversed.

3.1.15. runtime_suspend()
^^^^^^^^^^^^^^^^^^^^^^^^^

The runtime_suspend() callback is specific to device runtime power management
(runtime PM). It is executed by the PM core's runtime PM framework when the
device is about to be suspended (i.e. quiesced and put into a low-power state)
at run time.

This callback is responsible for freezing the device and preparing it to be
put into a low-power state, but it must allow the PCI subsystem to perform all
of the PCI-specific actions necessary for suspending the device.

3.1.16. runtime_resume()
^^^^^^^^^^^^^^^^^^^^^^^^

The runtime_resume() callback is specific to device runtime PM. It is executed
by the PM core's runtime PM framework when the device is about to be resumed
(i.e. put into the full-power state and programmed to process I/O normally) at
run time.

This callback is responsible for restoring the normal functionality of the
device after it has been put into the full-power state by the PCI subsystem.
The device is expected to be able to process I/O in the usual way after
runtime_resume() has returned.

3.1.17. runtime_idle()
^^^^^^^^^^^^^^^^^^^^^^

The runtime_idle() callback is specific to device runtime PM. It is executed
by the PM core's runtime PM framework whenever it may be desirable to suspend
the device according to the PM core's information. In particular, it is
automatically executed right after runtime_resume() has returned in case the
resume of the device has happened as a result of a spurious event.

This callback is optional, but if it is not implemented or if it returns 0, the
PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
cause the driver's runtime_suspend() callback to be executed.
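
The decision described above can be modelled in plain userspace C. This is
only a sketch under stated assumptions: struct model_dev, model_rpm_idle() and
the demo_* callbacks are hypothetical stand-ins written for this example, not
the kernel's types or functions, and the real logic lives in the PM core and
the PCI bus type.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; not the kernel's struct device. */
struct model_dev {
	int (*runtime_idle)(struct model_dev *dev);    /* optional */
	int (*runtime_suspend)(struct model_dev *dev); /* required here */
	int busy;       /* driver-private "still in use" indicator */
	int suspended;  /* set to 1 once runtime_suspend() has succeeded */
};

/*
 * Model of the idle check: if runtime_idle() is missing or returns 0,
 * go on to suspend the device; any other return value vetoes the suspend.
 */
static int model_rpm_idle(struct model_dev *dev)
{
	int ret = 0;

	assert(dev->runtime_suspend != NULL);

	if (dev->runtime_idle)
		ret = dev->runtime_idle(dev);
	if (ret)
		return ret;

	ret = dev->runtime_suspend(dev);
	if (!ret)
		dev->suspended = 1;
	return ret;
}

/* Example driver callback: refuse to go idle while the device is in use. */
static int demo_runtime_idle(struct model_dev *dev)
{
	return dev->busy ? -EBUSY : 0;
}

/* Example driver callback: nothing to quiesce in this model. */
static int demo_runtime_suspend(struct model_dev *dev)
{
	(void)dev;
	return 0;
}
```

A driver that leaves runtime_idle() unset in this model is suspended as soon as
the idle check runs, which matches the default behaviour stated above.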

3.1.18. Pointing Multiple Callback Pointers to One Routine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although in principle each of the callbacks described in the previous
subsections can be defined as a separate function, it often is convenient to
point two or more members of struct dev_pm_ops to the same routine. There are
a few convenience macros that can be used for this purpose.

The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one
suspend routine pointed to by the .suspend(), .freeze(), and .poweroff()
members and one resume routine pointed to by the .resume(), .thaw(), and
.restore() members. The other function pointers in this struct dev_pm_ops are
unset.

The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it
additionally sets the .runtime_resume() pointer to the same value as
.resume() (and .thaw() and .restore()) and the .runtime_suspend() pointer to
the same value as .suspend() (and .freeze() and .poweroff()).

The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside a declaration of struct
dev_pm_ops to indicate that one suspend routine is to be pointed to by the
.suspend(), .freeze(), and .poweroff() members and one resume routine is to
be pointed to by the .resume(), .thaw(), and .restore() members.
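
As a rough userspace illustration of what these macros arrange, the sketch
below fills in a cut-down structure in the same way. struct model_pm_ops and
the MODEL_* macros are hypothetical stand-ins written for this example; the
kernel's own definitions are simplified away here and differ in detail.

```c
#include <assert.h>
#include <stddef.h>

struct model_dev;
typedef int (*model_pm_cb)(struct model_dev *dev);

/* Cut-down, hypothetical model of struct dev_pm_ops. */
struct model_pm_ops {
	model_pm_cb suspend, resume;
	model_pm_cb freeze, thaw;
	model_pm_cb poweroff, restore;
	model_pm_cb runtime_suspend, runtime_resume, runtime_idle;
};

/* One suspend routine and one resume routine for all system sleep slots. */
#define MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	.suspend = suspend_fn, .resume = resume_fn, \
	.freeze = suspend_fn, .thaw = resume_fn, \
	.poweroff = suspend_fn, .restore = resume_fn,

/* Declares an object; every member not named above stays NULL. */
#define MODEL_SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct model_pm_ops name = { \
		MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	}

/* Same as above, plus the two runtime PM slots. */
#define MODEL_UNIVERSAL_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct model_pm_ops name = { \
		MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
		.runtime_suspend = suspend_fn, \
		.runtime_resume = resume_fn, \
	}

/* Example driver routines shared by several callback slots. */
static int demo_suspend(struct model_dev *dev) { (void)dev; return 0; }
static int demo_resume(struct model_dev *dev) { (void)dev; return 0; }

static MODEL_SIMPLE_DEV_PM_OPS(demo_simple_ops, demo_suspend, demo_resume);
static MODEL_UNIVERSAL_DEV_PM_OPS(demo_universal_ops, demo_suspend, demo_resume);
```

Sharing one routine across slots only makes sense when the driver really does
the same work for suspend, freeze and poweroff; otherwise the members should be
set individually.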

3.1.19. Driver Flags for Power Management
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The PM core allows device drivers to set flags that influence the handling of
power management for the devices by the core itself and by middle layer code
including the PCI bus type. The flags should be set once at the driver probe
time with the help of the dev_pm_set_driver_flags() function and they should not
be updated directly afterwards.

The DPM_FLAG_NO_DIRECT_COMPLETE flag prevents the PM core from using the
direct-complete mechanism allowing device suspend/resume callbacks to be skipped
if the device is in runtime suspend when the system suspend starts. That also
affects all of the ancestors of the device, so this flag should only be used if
absolutely necessary.

The DPM_FLAG_SMART_PREPARE flag causes the PCI bus type to return a positive
value from pci_pm_prepare() only if the ->prepare callback provided by the
driver of the device returns a positive value. That allows the driver to opt
out from using the direct-complete mechanism dynamically (whereas setting
DPM_FLAG_NO_DIRECT_COMPLETE means permanent opt-out).
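
The difference between the permanent and the dynamic opt-out can be sketched as
a userspace model. Everything named model_* or MODEL_* below is hypothetical
and exists only for this example, including the flag bit values. The model also
folds the PM core's and the bus type's checks into one function and leaves out
the PCI-specific reasons for resuming a device.

```c
#include <assert.h>

/* Flag bits for this model only; the kernel defines its own values. */
#define MODEL_FLAG_NO_DIRECT_COMPLETE (1u << 0)
#define MODEL_FLAG_SMART_PREPARE      (1u << 1)

/* Hypothetical stand-in for a device and its driver. */
struct model_dev {
	unsigned int driver_flags; /* set once at probe time */
	int has_prepare;           /* driver provides ->prepare */
	int prepare_ret;           /* value ->prepare would return */
};

/*
 * Returns a negative error code, 0 if the suspend/resume callbacks must
 * run, or 1 if they may be skipped for a runtime-suspended device.
 */
static int model_direct_complete(const struct model_dev *dev)
{
	if (dev->has_prepare) {
		if (dev->prepare_ret < 0)
			return dev->prepare_ret;  /* error from the driver */

		/* Dynamic opt-out: ->prepare did not return a positive value. */
		if (dev->prepare_ret == 0 &&
		    (dev->driver_flags & MODEL_FLAG_SMART_PREPARE))
			return 0;
	}

	/* Permanent opt-out, regardless of what ->prepare returned. */
	if (dev->driver_flags & MODEL_FLAG_NO_DIRECT_COMPLETE)
		return 0;

	return 1;
}
```

With MODEL_FLAG_SMART_PREPARE set, the driver decides on every transition by
choosing the return value of its ->prepare callback; with
MODEL_FLAG_NO_DIRECT_COMPLETE set, the answer is always 0.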

The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's
perspective the device can be safely left in runtime suspend during system
suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff()
to avoid resuming the device from runtime suspend unless there are PCI-specific
reasons for doing that. Also, it causes pci_pm_suspend_late/noirq() and
pci_pm_poweroff_late/noirq() to return early if the device remains in runtime
suspend during the "late" phase of the system-wide transition under way.
Moreover, if the device is in runtime suspend in pci_pm_resume_noirq() or
pci_pm_restore_noirq(), its runtime PM status will be changed to "active" (as it
is going to be put into D0 going forward).

Setting the DPM_FLAG_MAY_SKIP_RESUME flag means that the driver allows its
"noirq" and "early" resume callbacks to be skipped if the device can be left
in suspend after a system-wide transition into the working state. This flag is
taken into consideration by the PM core along with the power.may_skip_resume
status bit of the device which is set by pci_pm_suspend_noirq() in certain
situations. If the PM core determines that the driver's "noirq" and "early"
resume callbacks should be skipped, the dev_pm_skip_resume() helper function
will return "true" and that will cause pci_pm_resume_noirq() and
pci_pm_resume_early() to return upfront without touching the device and
executing the driver callbacks.

3.2. Device Runtime Power Management
------------------------------------

In addition to providing device power management callbacks, PCI device drivers
are responsible for controlling the runtime power management (runtime PM) of
their devices.

The PCI device runtime PM is optional, but it is recommended that PCI device
drivers implement it at least in the cases where there is a reliable way of
verifying that the device is not used (like when the network cable is detached
from an Ethernet adapter or there are no devices attached to a USB controller).

To support the PCI runtime PM, the driver first needs to implement the
runtime_suspend() and runtime_resume() callbacks. It also may need to implement
the runtime_idle() callback to prevent the device from being suspended again
every time right after the runtime_resume() callback has returned
(alternatively, the runtime_suspend() callback will have to check if the
device should really be suspended and return -EAGAIN if that is not the case).

The runtime PM of PCI devices is enabled by default by the PCI core. PCI
device drivers do not need to enable it and should not attempt to do so.
However, it is blocked by pci_pm_init(), which runs the pm_runtime_forbid()
helper function. In addition to that, the runtime PM usage counter of
each PCI device is incremented by local_pci_probe() before executing the
probe callback provided by the device's driver.

If a PCI driver implements the runtime PM callbacks and intends to use the
runtime PM framework provided by the PM core and the PCI subsystem, it needs
to decrement the device's runtime PM usage counter in its probe callback
function. If it doesn't do that, the counter will always be different from
zero for the device and it will never be runtime-suspended. The simplest
way to do that is by calling pm_runtime_put_noidle(), but if the driver
wants to schedule an autosuspend right away, for example, it may call
pm_runtime_put_autosuspend() instead for this purpose. Generally, it
just needs to call a function that decrements the device's usage counter
from its probe routine to make runtime PM work for the device.

It is important to remember that the driver's runtime_suspend() callback
may be executed right after the usage counter has been decremented, because
user space may already have caused the pm_runtime_allow() helper function,
which unblocks the runtime PM of the device, to run via sysfs, so the driver
must be prepared to cope with that.

The driver itself should not call pm_runtime_allow(), though. Instead, it
should let user space or some platform-specific code do that (user space can
do it via sysfs as stated above), but it must be prepared to handle the
runtime PM of the device correctly as soon as pm_runtime_allow() is called
(which may happen at any time, even before the driver is loaded).

When the driver's remove callback runs, it has to balance the decrement of
the device's runtime PM usage counter done at probe time. For this reason,
if it has decremented the counter in its probe callback, it must run
pm_runtime_get_noresume() in its remove callback. [Since the core carries
out a runtime resume of the device and bumps up the device's usage counter
before running the driver's remove callback, the runtime PM of the device
is effectively disabled for the duration of the remove execution and all
runtime PM helper functions incrementing the device's usage counter are
then effectively equivalent to pm_runtime_get_noresume().]
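
The counter bookkeeping around probe and remove can be followed in a small
userspace model. It is a sketch only: the model_* helpers merely imitate the
counting effect of the PM core functions they are named after, the demo_*
callbacks stand for a driver, and the exact sequence of calls made by the PCI
core around the driver callbacks is an assumption of this example.

```c
#include <assert.h>

/* Hypothetical stand-in for a device; only the usage counter is modelled. */
struct model_dev {
	int usage_count;
};

/* Increment the usage counter without resuming anything. */
static void model_get_noresume(struct model_dev *dev)
{
	dev->usage_count++;
}

/* Decrement the usage counter without triggering an idle check. */
static void model_put_noidle(struct model_dev *dev)
{
	assert(dev->usage_count > 0);
	dev->usage_count--;
}

/* Driver probe: drop the reference taken by the bus before probe. */
static int demo_probe(struct model_dev *dev)
{
	model_put_noidle(dev);
	return 0;
}

/* Driver remove: balance the decrement done in demo_probe(). */
static void demo_remove(struct model_dev *dev)
{
	model_get_noresume(dev);
}

/* Bus side: the counter is bumped before the driver's probe runs. */
static int model_bus_probe(struct model_dev *dev)
{
	model_get_noresume(dev);
	return demo_probe(dev);
}

/* Bus side: hold a reference across remove, then undo the probe-time one. */
static void model_bus_remove(struct model_dev *dev)
{
	model_get_noresume(dev);
	demo_remove(dev);
	model_put_noidle(dev);
	model_put_noidle(dev);
}

/* Runtime suspend is only possible while nobody holds a reference. */
static int model_may_runtime_suspend(const struct model_dev *dev)
{
	return dev->usage_count == 0;
}
```

If demo_remove() skipped its increment, the last decrement in
model_bus_remove() would hit the assertion, which is the model's way of
showing the imbalance the text warns about.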

The runtime PM framework works by processing requests to suspend or resume
devices, or to check if they are idle (in which case it is reasonable to
subsequently request that they be suspended). These requests are represented
by work items put into the power management workqueue, pm_wq. Although there
are a few situations in which power management requests are automatically
queued by the PM core (for example, after processing a request to resume a
device the PM core automatically queues a request to check if the device is
idle), device drivers are generally responsible for queuing power management
requests for their devices. For this purpose they should use the runtime PM
helper functions provided by the PM core, discussed in
Documentation/power/runtime_pm.rst.

Devices can also be suspended and resumed synchronously, without placing a
request into pm_wq. In the majority of cases this is also done by their
drivers, which use helper functions provided by the PM core for this purpose.

For more information on the runtime PM of devices refer to
Documentation/power/runtime_pm.rst.


4. Resources
============

PCI Local Bus Specification, Rev. 3.0

PCI Bus Power Management Interface Specification, Rev. 1.2

Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b

PCI Express Base Specification, Rev. 2.0

Documentation/driver-api/pm/devices.rst

Documentation/power/runtime_pm.rst