.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================================
The PCI Express Advanced Error Reporting Driver Guide HOWTO
===========================================================

:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
          - Yanmin Zhang <yanmin.zhang@intel.com>

:Copyright: |copy| 2006 Intel Corporation

Overview
========

About this guide
----------------

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.


What is the PCI Express AER Driver?
-----------------------------------

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express Advanced Error Reporting
extended capability structure and provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:

  - Gathers comprehensive error information when errors occur.
  - Reports errors to the users.
  - Performs error recovery actions.

The AER driver only attaches to root ports which support the PCI Express
AER capability.


User Guide
==========

Include the PCI Express AER Root Driver into the Linux Kernel
-------------------------------------------------------------

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. To use it, the driver must be
compiled into the kernel. Option CONFIG_PCIEAER supports this capability.
It depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and
CONFIG_PCIEAER=y.

Load PCI Express AER Root Driver
--------------------------------

Some systems have AER support in firmware. Enabling Linux AER support at
the same time the firmware handles AER may result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
Specification for details regarding _OSC usage.

AER error output
----------------

When a PCIe AER error is captured, an error message is output to the
console. If it is a correctable error, it is output as a warning.
Otherwise, it is printed as an error. Users can therefore choose a
log level that filters out correctable error messages.

Below is an example::

  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
  0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
  0000:50:00.0: [20] Unsupported Request (First)
  0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the root port. Please refer to the PCI Express
specifications for the other fields.
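
To make the ``id`` and ``status`` fields of the example concrete, the
following user-space sketch splits a Requester ID into bus, device and
function numbers and maps a few bit positions of the uncorrectable error
status register to names. The function and struct names are hypothetical
and exist only for this illustration. The bit table is deliberately
partial and reflects one reading of the PCI Express specification, so
check it against the specification before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

/* A Requester ID packs bus (8 bits), device (5 bits) and function (3 bits). */
struct requester_id {
    uint8_t bus;
    uint8_t dev;
    uint8_t fn;
};

/* Split a 16-bit Requester ID such as the 0x0500 in the log above. */
static struct requester_id decode_requester_id(uint16_t id)
{
    struct requester_id r = {
        .bus = (uint8_t)(id >> 8),
        .dev = (uint8_t)((id >> 3) & 0x1f),
        .fn  = (uint8_t)(id & 0x7),
    };
    return r;
}

/* Return the index of the lowest set bit, or -1 if no bit is set. */
static int lowest_set_bit(uint32_t value)
{
    for (int bit = 0; bit < 32; bit++)
        if (value & (UINT32_C(1) << bit))
            return bit;
    return -1;
}

/*
 * Partial, illustrative table of uncorrectable error status bits.
 * Returns NULL for bits this sketch does not cover.
 */
static const char *uncor_error_name(int bit)
{
    switch (bit) {
    case 4:  return "Data Link Protocol Error";
    case 12: return "Poisoned TLP";
    case 14: return "Completion Timeout";
    case 15: return "Completer Abort";
    case 16: return "Unexpected Completion";
    case 18: return "Malformed TLP";
    case 20: return "Unsupported Request";
    default: return NULL;
    }
}
```

Applied to the example, ``decode_requester_id(0x0500)`` yields bus 5,
device 0, function 0, and the status value ``0x00100000`` has only bit 20
set, which matches the ``[20] Unsupported Request`` line in the log.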

AER Statistics / Counters
-------------------------

When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes which are documented in
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.
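
As a rough illustration of consuming such counters from user space, the
sketch below looks up one counter in a text buffer. It assumes that an
attribute prints one ``<name> <count>`` pair per line and that counter
names may themselves contain spaces. That layout, the helper name
``aer_stat_lookup`` and the sample counter names used with it are
assumptions made for this example, so consult the ABI document above for
the authoritative format.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find the line "<name> <count>" in text and store the count.
 * Returns 0 on success, -1 if no matching line is found.
 */
static int aer_stat_lookup(const char *text, const char *name,
                           unsigned long long *count)
{
    size_t name_len = strlen(name);
    const char *line = text;

    while (line && *line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        /* The name must be followed by one space and then a number. */
        if (len > name_len + 1 &&
            strncmp(line, name, name_len) == 0 &&
            line[name_len] == ' ' &&
            sscanf(line + name_len + 1, "%llu", count) == 1)
            return 0;

        line = eol ? eol + 1 : NULL;
    }
    return -1;
}
```

A caller would read the attribute file into a buffer first and then call,
for example, ``aer_stat_lookup(buf, "Bad TLP", &n)``. Keeping the parser
separate from the file I/O makes it easy to exercise with canned text.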

Developer Guide
===============

Enabling AER-aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.

To support AER well, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device automatically sends an
error message to the PCIe root port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its AER
capability structure. The logged error information includes the
error reporting agent's Requester ID, stored in the Error Source
Identification Registers, and the corresponding error bits set in the
Root Error Status Register. If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt when an
error is detected.
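
The information a Root Port logs can be pictured with a small user-space
decoder. This is a sketch, not the AER driver's code: the macro, struct
and function names are local to the example, and the bit positions and
the split of the Error Source Identification value into a correctable
half (low 16 bits) and an uncorrectable half (high 16 bits) follow one
reading of the PCI Express specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected Root Error Status bits (names are local to this sketch). */
#define ROOT_ERR_COR_RCV   0x00000001u  /* ERR_COR message received */
#define ROOT_ERR_UNCOR_RCV 0x00000004u  /* ERR_FATAL or ERR_NONFATAL received */
#define ROOT_ERR_FATAL_RCV 0x00000040u  /* at least one ERR_FATAL received */

struct root_error {
    bool correctable;       /* a correctable error was reported */
    bool uncorrectable;     /* a fatal or non-fatal error was reported */
    bool fatal;             /* at least one reported error was fatal */
    uint16_t cor_source;    /* Requester ID of the correctable source */
    uint16_t uncor_source;  /* Requester ID of the uncorrectable source */
};

/* Decode raw Root Error Status and Error Source Identification values. */
static struct root_error decode_root_error(uint32_t status, uint32_t source_id)
{
    struct root_error e = {
        .correctable   = (status & ROOT_ERR_COR_RCV) != 0,
        .uncorrectable = (status & ROOT_ERR_UNCOR_RCV) != 0,
        .fatal         = (status & ROOT_ERR_FATAL_RCV) != 0,
        .cor_source    = (uint16_t)(source_id & 0xffff),
        .uncor_source  = (uint16_t)(source_id >> 16),
    };
    return e;
}
```

With a status of ``0x44`` and a source value of ``0x05000000``, the
decoder reports a fatal uncorrectable error from Requester ID ``0x0500``.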

Note that the errors described above are related to the PCI Express
hierarchy and links. They do not include any device-specific
errors, because device-specific errors are still sent directly to
the device driver.

Configure the AER capability structure
--------------------------------------

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They may also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See
the "helper functions" section below.

Provide callbacks
-----------------

callback reset_link to reset PCI Express link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This callback is used to reset the PCI Express physical link when a
fatal error happens. The root port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications for resetting the PCI Express link, so
all upstream ports should provide their own reset_link functions.

The "Non-correctable (non-fatal and fatal) errors" section below
provides more detailed information on when reset_link is called.

PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PCI Express AER Root driver uses error callbacks to coordinate
with the downstream device drivers associated with the hierarchy in
question when performing error recovery actions.

struct pci_driver has a pointer, err_handler, that points to a
struct pci_error_handlers, which consists of several callback function
pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when the error callback functions are called.

Correctable errors
~~~~~~~~~~~~~~~~~~

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an error message indicates a non-fatal error, performing a link reset
upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) for all drivers associated with the hierarchy in
question. For example::

  EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort

If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover. If it returns PCI_ERS_RESULT_CAN_RECOVER, the
AER driver calls mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Then, performing a link reset upstream is
necessary. As different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide the
function that resets the link via the callback parameter of the
pcie_do_recovery() function. If reset_link is not NULL, the recovery
function uses it to reset the link. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns PCI_ERS_RESULT_RECOVERED,
error handling proceeds to mmio_enabled.
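
The decisions described in this section can be summarized as a small
user-space model. It is a simplification for illustration only: the enum
and function names are hypothetical stand-ins, not the kernel's
``pci_channel_state_t`` or ``pci_ers_result_t`` definitions, and the model
covers just the rules stated above rather than the full logic of
pcie_do_recovery().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's types, local to this model. */
enum severity { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };
enum channel_state { CH_NONE, CH_IO_NORMAL, CH_IO_FROZEN };
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_DISCONNECT,
                  ERS_RECOVERED };

struct recovery_plan {
    enum channel_state state;  /* state passed to error_detected */
    bool reset_link;           /* whether an upstream link reset follows */
};

/* Map an error severity to the first recovery steps described above. */
static struct recovery_plan plan_recovery(enum severity sev)
{
    struct recovery_plan p = { CH_NONE, false };

    switch (sev) {
    case SEV_NONFATAL:
        p.state = CH_IO_NORMAL;   /* link is still functional */
        break;
    case SEV_FATAL:
        p.state = CH_IO_FROZEN;   /* link is unreliable */
        p.reset_link = true;
        break;
    case SEV_CORRECTABLE:
        break;                    /* only logged and cleared, no callbacks */
    }
    return p;
}

/* Decide whether handling proceeds to mmio_enabled. */
static bool goes_to_mmio_enabled(enum severity sev, enum ers_result detected,
                                 enum ers_result link_reset)
{
    if (detected != ERS_CAN_RECOVER)
        return false;
    if (sev == SEV_NONFATAL)
        return true;              /* no link reset is involved */
    if (sev == SEV_FATAL)
        return link_reset == ERS_RECOVERED;
    return false;
}
```

For a fatal error, ``plan_recovery(SEV_FATAL)`` selects the frozen channel
state and a link reset, and mmio_enabled is reached only when the reset
itself reports success.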

helper functions
----------------
::

  int pci_enable_pcie_error_reporting(struct pci_dev *dev);

pci_enable_pcie_error_reporting enables the device to send error
messages to the root port when an error is detected. Note that devices
do not enable error reporting by default, so device drivers need to
call this function to enable it.

::

  int pci_disable_pcie_error_reporting(struct pci_dev *dev);

pci_disable_pcie_error_reporting stops the device from sending error
messages to the root port when an error is detected.

::

  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);

pci_aer_clear_nonfatal_status clears non-fatal errors in the uncorrectable
error status register.

Frequently Asked Questions
--------------------------

Q:
  What happens if a PCI Express device driver does not provide an
  error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
  The devices attached to the driver won't be recovered. If the
  error is fatal, the kernel will print warning messages. Please refer
  to the Developer Guide section for more information.

Q:
  What happens if an upstream port service driver does not provide
  the reset_link callback?

A:
  Fatal error recovery will fail if the errors are reported by the
  upstream ports to which the service driver is attached.

Q:
  How does this infrastructure deal with a driver that is not PCI
  Express aware?

A:
  This infrastructure calls the error callback functions of the
  driver when an error happens. But if the driver is not aware of
  PCI Express, the device might not report its own errors to the root
  port.

Q:
  What modifications will that driver need to make it compatible
  with the PCI Express AER Root driver?

A:
  It could call the helper functions to enable AER in devices and
  clean up the uncorrectable status register. Please refer to the
  "helper functions" section.


Software error injection
========================

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software-based error
injection can be used to fake various kinds of PCIe errors.

First, enable PCIe AER software error injection in the kernel
configuration; that is, the following item should be in your .config.

  CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which can be obtained
from:

  https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/

More information about aer-inject can be found in the documentation that
comes with its source code.