.. SPDX-License-Identifier: GPL-2.0

====================
APEI Error INJection
====================

EINJ provides a hardware error injection mechanism. It is very useful
for debugging and testing APEI and RAS features in general.

You need to check whether your BIOS supports EINJ first. For that, look
for early boot messages similar to this one::

  ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL 00000001 INTL 00000001)

which shows that the BIOS is exposing an EINJ table - it is the
mechanism through which the injection is done.

Alternatively, look in /sys/firmware/acpi/tables for an "EINJ" file,
which is a different representation of the same thing.

If neither of the above exists, that does not necessarily mean EINJ is
unsupported: before you give up, go into BIOS setup to see if the BIOS
has an option to enable error injection. Look for something called WHEA
or similar. Often, you first need to enable an ACPI5 support option in
order to see the APEI, EINJ, ... functionality supported and exposed by
the BIOS menu.

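The sysfs check described above can be scripted. The following is only a
sketch: the function name einj_table_present and its optional directory
argument are made up for illustration, and the default path is the one
named above.

```shell
# Succeed if the firmware exposes an EINJ table.
# $1 (optional) overrides the ACPI tables directory; this argument is a
# hypothetical convenience, not part of any kernel interface.
einj_table_present() {
    tables_dir="${1:-/sys/firmware/acpi/tables}"
    [ -e "$tables_dir/EINJ" ]
}

if einj_table_present; then
    echo "EINJ table found"
else
    echo "no EINJ table; check BIOS setup for a WHEA or error injection option"
fi
```

The boot message can be searched for in a similar way, for example with
dmesg piped through grep EINJ, provided the early boot messages are still
in the kernel log buffer.
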
To use EINJ, make sure the following options are enabled in your kernel
configuration::

  CONFIG_DEBUG_FS
  CONFIG_ACPI_APEI
  CONFIG_ACPI_APEI_EINJ

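One way to check these options on a running system is to search the
kernel configuration file. This is a sketch under assumptions: the helper
name einj_missing_options is made up, and /boot/config-$(uname -r) is
where many distributions install the configuration, but yours may keep it
elsewhere.

```shell
# Print each required option that is neither built in (=y) nor modular (=m)
# in the kernel config file given as $1.
einj_missing_options() {
    config_file="$1"
    for opt in CONFIG_DEBUG_FS CONFIG_ACPI_APEI CONFIG_ACPI_APEI_EINJ; do
        grep -Eq "^${opt}=(y|m)$" "$config_file" || echo "$opt"
    done
}

# Assumed location of the running kernel's configuration.
config="/boot/config-$(uname -r)"
if [ -r "$config" ]; then
    einj_missing_options "$config"
fi
```

No output means all three options are enabled in the file that was
checked.
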
The EINJ user interface is in <debugfs mount point>/apei/einj.

The following files belong to it:

- available_error_type

  This file shows which error types are supported:

  ================  ===================================
  Error Type Value  Error Description
  ================  ===================================
  0x00000001        Processor Correctable
  0x00000002        Processor Uncorrectable non-fatal
  0x00000004        Processor Uncorrectable fatal
  0x00000008        Memory Correctable
  0x00000010        Memory Uncorrectable non-fatal
  0x00000020        Memory Uncorrectable fatal
  0x00000040        PCI Express Correctable
  0x00000080        PCI Express Uncorrectable non-fatal
  0x00000100        PCI Express Uncorrectable fatal
  0x00000200        Platform Correctable
  0x00000400        Platform Uncorrectable non-fatal
  0x00000800        Platform Uncorrectable fatal
  ================  ===================================

  The file contents use the format shown above, but list only the
  available error types.

- error_type

  Set the value of the error type being injected. Possible error types
  are defined in the file available_error_type above.

- error_inject

  Write any integer to this file to trigger the error injection. Make
  sure you have specified all necessary error parameters, i.e. this
  write should be the last step when injecting errors.

- flags

  Present for kernel versions 3.13 and above. Used to specify which
  of param{1..4} are valid and should be used by the firmware during
  injection. Value is a bitmask as specified in ACPI5.0 spec for the
  SET_ERROR_TYPE_WITH_ADDRESS data structure:

    Bit 0
      Processor APIC field valid (see param3 below).
    Bit 1
      Memory address and mask valid (param1 and param2).
    Bit 2
      PCIe (seg,bus,dev,fn) valid (see param4 below).

  If set to zero, legacy behavior is mimicked where the type of
  injection specifies just one bit set, and param1 is multiplexed.

- param1

  This file is used to set the first error parameter value. Its effect
  depends on the error type specified in error_type. For example, if
  the error type is memory related, param1 should be a valid physical
  memory address. [Unless "flags" is set - see above]

- param2

  Same use as param1 above. For example, if the error type is memory
  related, then param2 should be a physical memory address mask.
  Linux requires page or narrower granularity, say, 0xfffffffffffff000.

- param3

  Used when the 0x1 bit is set in "flags" to specify the APIC id.

- param4

  Used when the 0x4 bit is set in "flags" to specify the target PCIe
  device.

- notrigger

  The error injection mechanism is a two-step process. First inject the
  error, then perform some actions to trigger it. Setting "notrigger"
  to 1 skips the trigger phase, which *may* allow the user to cause the
  error in some other context by a simple access to the CPU, memory
  location, or device that is the target of the error injection. Whether
  this actually works depends on what operations the BIOS actually
  includes in the trigger phase.

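Before writing error_type, a script may want to confirm that the platform
lists the type in available_error_type. A minimal sketch, assuming debugfs
is mounted at /sys/kernel/debug; the function name einj_type_available and
its optional file argument are made up for illustration:

```shell
# einj_type_available TYPE [FILE]
# Succeed if TYPE (e.g. 0x8) appears in the first column of FILE, which
# defaults to the assumed debugfs location of available_error_type.
einj_type_available() {
    wanted=$(( $1 ))
    list="${2:-/sys/kernel/debug/apei/einj/available_error_type}"
    while read -r value _description; do
        [ -n "$value" ] || continue
        # Compare numerically so 0x8 matches 0x00000008.
        [ $(( $value )) -eq "$wanted" ] && return 0
    done < "$list"
    return 1
}
```

For example, "einj_type_available 0x8 && echo supported" reports whether
correctable memory errors can be injected.
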
BIOS versions based on the ACPI 4.0 specification have limited options
in controlling where the errors are injected. Your BIOS may support an
extension (enabled with the param_extension=1 module parameter, or boot
command line einj.param_extension=1). This allows the address and mask
for memory injections to be specified by the param1 and param2 files in
apei/einj.

BIOS versions based on the ACPI 5.0 specification have more control over
the target of the injection. For processor-related errors (type 0x1, 0x2
and 0x4), you can set flags to 0x3 (param3 for bit 0, and param1 and
param2 for bit 1) so that you have more information added to the error
signature being injected. The actual data passed is this::

  memory_address = param1;
  memory_address_range = param2;
  apicid = param3;
  pcie_sbdf = param4;

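The flags value 0x3 mentioned above is simply bit 0 combined with bit 1.
As an illustration, the bitmask can be built from named bits; the variable
names below are made up, only the bit positions come from the flags
description:

```shell
# Bit values from the "flags" file description.
EINJ_FLAG_APIC=$(( 1 << 0 ))     # param3 holds a valid APIC id
EINJ_FLAG_MEM=$(( 1 << 1 ))      # param1/param2 hold a valid address and mask
EINJ_FLAG_PCIE=$(( 1 << 2 ))     # param4 holds a valid PCIe SBDF

# Processor error with APIC id, memory address and mask supplied.
flags=$(( EINJ_FLAG_APIC | EINJ_FLAG_MEM ))
printf '0x%x\n' "$flags"         # prints 0x3
```
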
For memory errors (type 0x8, 0x10 and 0x20) the address is set using
param1 with a mask in param2 (0x0 is equivalent to all ones). For PCI
express errors (type 0x40, 0x80 and 0x100) the segment, bus, device and
function are specified using param1::

  31     24 23    16 15    11 10       8  7         0
  +-------------------------------------------------+
  | segment |   bus  | device | function | reserved |
  +-------------------------------------------------+

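Packing the four fields by hand is error prone, so a small helper can do
the shifting. This is a sketch based only on the bit layout in the diagram
above; the function name einj_pcie_sbdf is made up for illustration:

```shell
# einj_pcie_sbdf SEGMENT BUS DEVICE FUNCTION
# Pack the fields into the layout shown above: segment in bits 31-24,
# bus in 23-16, device in 15-11, function in 10-8, low byte reserved.
einj_pcie_sbdf() {
    seg=$(( $1 )); bus=$(( $2 )); dev=$(( $3 )); fn=$(( $4 ))
    printf '0x%x\n' $(( (seg & 0xff) << 24 | (bus & 0xff) << 16 | (dev & 0x1f) << 11 | (fn & 0x7) << 8 ))
}

einj_pcie_sbdf 0 0x03 0x00 0     # device 0000:03:00.0, prints 0x30000
```
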
Anyway, you get the idea; if in doubt, just take a look at the code
in drivers/acpi/apei/einj.c.

An ACPI 5.0 BIOS may also allow vendor-specific errors to be injected.
In this case a file named vendor will contain identifying information
from the BIOS that hopefully will allow an application wishing to use
the vendor-specific extension to tell that it is running on a BIOS
that supports it. All vendor extensions have the 0x80000000 bit set in
error_type. A file vendor_flags controls the interpretation of param1
and param2 (1 = PROCESSOR, 2 = MEMORY, 4 = PCI). See your BIOS vendor
documentation for details (and expect changes to this API if vendors'
creativity in using this feature expands beyond our expectations).

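Since every vendor extension has the 0x80000000 bit set, a script can tell
vendor-specific types apart from the standard ones with a single mask
test. A minimal sketch; the function name einj_is_vendor_type is made up
for illustration:

```shell
# Succeed if the error type given as $1 has the vendor bit (0x80000000) set.
einj_is_vendor_type() {
    [ $(( $1 & 0x80000000 )) -ne 0 ]
}
```
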
An error injection example::

  # cd /sys/kernel/debug/apei/einj
  # cat available_error_type          # See which errors can be injected
  0x00000002  Processor Uncorrectable non-fatal
  0x00000008  Memory Correctable
  0x00000010  Memory Uncorrectable non-fatal
  # echo 0x12345000 > param1          # Set memory address for injection
  # echo $((-1 << 12)) > param2       # Mask 0xfffffffffffff000 - anywhere in this page
  # echo 0x8 > error_type             # Choose correctable memory error
  # echo 1 > error_inject             # Inject now

You should see something like this in dmesg::

  [22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR
  [22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
  [22715.834759] EDAC sbridge MC3: TSC 0
  [22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86
  [22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0
  [22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)

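For repeated testing, the memory injection sequence from the example can
be wrapped in a function that stops at the first failed write and always
writes error_inject last. This is only a sketch: the function name
einj_inject_memory and the EINJ_DIR variable are made up, and the default
path assumes debugfs is mounted at /sys/kernel/debug.

```shell
# EINJ_DIR is overridable so the sequence can be tried against a scratch
# directory; the default assumes debugfs is mounted at /sys/kernel/debug.
EINJ_DIR="${EINJ_DIR:-/sys/kernel/debug/apei/einj}"

# einj_inject_memory ADDRESS MASK TYPE
einj_inject_memory() {
    echo "$1" > "$EINJ_DIR/param1" &&        # physical address
    echo "$2" > "$EINJ_DIR/param2" &&        # address mask
    echo "$3" > "$EINJ_DIR/error_type" &&    # e.g. 0x8, memory correctable
    echo 1    > "$EINJ_DIR/error_inject"     # trigger; always the last write
}
```

Run as root with the values from the example, this would be
"einj_inject_memory 0x12345000 $((-1 << 12)) 0x8".
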
For more information about EINJ, please refer to the ACPI specification
version 4.0, section 17.5 and ACPI 5.0, section 18.6.