.. hwpoison:

========
hwpoison
========

What is hwpoison?
=================

Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery``). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment::

    High level machine check handler. Handles pages reported by the
    hardware as being corrupted usually due to a 2bit ECC memory or cache
    failure.

    This focusses on pages detected as corrupted in the background.
    When the current CPU tries to consume corruption the currently
    running process can just be killed directly instead. This implies
    that if the error cannot be handled for some reason it's safe to
    just ignore it because no corruption has been consumed yet. Instead
    when that happens another machine check will happen.

    Handles page cache pages in various states. The tricky part
    here is that we can access any page asynchronous to other VM
    users, because memory failures could happen anytime and anywhere,
    possibly violating some of their assumptions. This is why this code
    has to be extremely careful. Generally it tries to use normal locking
    rules, as in get the standard locks, even if that means the
    error handling takes potentially a long time.

    Some of the operations here are somewhat inefficient and have non
    linear algorithmic complexity, because the data structures have not
    been optimized for this case. This is in particular the case
    for the mapping from a vma to a process. Since this case is expected
    to be rare we hope we can get away with this.

The code consists of the high-level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications. KVM support requires a recent qemu-kvm release.

For KVM use, a new signal type was needed so that KVM can inject the
machine check into the guest with the proper address. In theory this
allows other applications to handle memory failures too. The
expectation is that nearly all applications won't do that, but some
very specialized ones might.

Failure recovery modes
======================

There are two (actually three) modes memory failure recovery can be in:

vm.memory_failure_recovery sysctl set to zero:
    All memory failures cause a panic. Do not attempt recovery.
    (On x86 this can also be affected by the tolerant level of the
    MCE subsystem.)

early kill
    (can be controlled globally and per process)
    Send SIGBUS to the application as soon as the error is detected.
    This allows applications that can process memory errors gracefully
    (e.g. by dropping the affected object) to do so.
    This is the mode used by KVM qemu.

late kill
    Send SIGBUS when the application runs into the corrupted page.
    This is best for applications that are unaware of memory errors
    and is the default.
    Note that some pages are always handled as late kill.
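The relationship between the two sysctls and the three modes can be
sketched as follows. This is only an illustration: the function and enum
names are hypothetical, the ``/proc/sys/vm/`` paths are the usual procfs
view of the ``vm.*`` sysctls, and the result is the system-wide default
only (a per-process setting can still override early or late kill).

```c
#include <stdio.h>

enum mf_mode { MF_PANIC, MF_EARLY_KILL, MF_LATE_KILL };

/* Map the two sysctl values to the global mode described above. */
static enum mf_mode global_mode(int recovery, int early_kill)
{
    if (recovery == 0)
        return MF_PANIC;            /* no recovery attempted */
    return early_kill ? MF_EARLY_KILL : MF_LATE_KILL;
}

static const char *mode_name(enum mf_mode mode)
{
    switch (mode) {
    case MF_PANIC:
        return "panic";
    case MF_EARLY_KILL:
        return "early kill";
    default:
        return "late kill";
    }
}

/* Read one integer from a procfs file; returns -1 if it cannot be read. */
static int read_sysctl_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Print the system-wide default; returns -1 if the sysctls are missing. */
static int print_global_mode(void)
{
    int recovery = read_sysctl_int("/proc/sys/vm/memory_failure_recovery");
    int early = read_sysctl_int("/proc/sys/vm/memory_failure_early_kill");

    if (recovery < 0 || early < 0) {
        fprintf(stderr, "memory failure sysctls not available\n");
        return -1;
    }
    printf("global mode: %s\n", mode_name(global_mode(recovery, early)));
    return 0;
}
```

Calling ``print_global_mode()`` on a kernel built without memory failure
support is expected to take the error path, since the sysctl files are
then absent.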

User control
============

vm.memory_failure_recovery
    See sysctl.txt

vm.memory_failure_early_kill
    Enable early kill mode globally

PR_MCE_KILL
    Set early/late kill mode or revert to the system default.
    arg1 and arg2 below are the arguments that follow PR_MCE_KILL
    in the prctl() call.

    arg1: PR_MCE_KILL_CLEAR:
        Revert to system default
    arg1: PR_MCE_KILL_SET:
        arg2 defines thread specific mode

        PR_MCE_KILL_EARLY:
            Early kill
        PR_MCE_KILL_LATE:
            Late kill
        PR_MCE_KILL_DEFAULT:
            Use system global default

    Note that if you want to have a dedicated thread which handles
    the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
    call prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0)
    on the designated thread. Otherwise, the SIGBUS is sent to the
    main thread.

PR_MCE_KILL_GET
    Return current mode
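A minimal sketch of the per-thread controls listed above, using
``prctl()`` from ``<sys/prctl.h>``. The wrapper names (``mce_kill_set``,
``mce_kill_clear``, ``mce_kill_get``, ``mce_kill_name``) are hypothetical
and exist only for this example; the ``PR_MCE_KILL*`` constants are the
ones named in this section.

```c
#include <sys/prctl.h>

/*
 * Set the calling thread's policy to PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE
 * or PR_MCE_KILL_DEFAULT. Returns 0 on success, -1 with errno set.
 */
static int mce_kill_set(int policy)
{
    return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET,
                 (unsigned long)policy, 0UL, 0UL);
}

/* Revert the calling thread to the system default. */
static int mce_kill_clear(void)
{
    return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_CLEAR,
                 0UL, 0UL, 0UL);
}

/* Returns the current PR_MCE_KILL_* policy, or -1 with errno set. */
static int mce_kill_get(void)
{
    return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}

static const char *mce_kill_name(int policy)
{
    switch (policy) {
    case PR_MCE_KILL_EARLY:
        return "early kill";
    case PR_MCE_KILL_LATE:
        return "late kill";
    case PR_MCE_KILL_DEFAULT:
        return "system default";
    default:
        return "unknown";
    }
}
```

A dedicated handler thread would call ``mce_kill_set(PR_MCE_KILL_EARLY)``
from its own thread function before waiting for ``SIGBUS``. The trailing
arguments are passed as ``unsigned long`` zeros because ``prctl()`` is
variadic and the unused arguments must be zero.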

Testing
=======

* madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the
  process for testing

* hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``

  corrupt-pfn
      Inject hwpoison fault at PFN echoed into this file. This does
      some early filtering to avoid corrupting unintended pages in
      test suites.

  unpoison-pfn
      Software-unpoison page at PFN echoed into this file. This way
      a page can be reused again. This only works for Linux-injected
      failures, not for real memory failures.

  Note these injection interfaces are not stable and might change between
  kernel versions.

  corrupt-filter-dev-major, corrupt-filter-dev-minor
      Only handle memory failures to pages associated with the file
      system defined by block device major/minor. -1U is the
      wildcard value. This should be only used for testing with
      artificial injection.

  corrupt-filter-memcg
      Limit injection to pages owned by a memory cgroup (memcg),
      specified by the inode number of the memcg.

      Example::

          mkdir /sys/fs/cgroup/mem/hwpoison

          usemem -m 100 -s 1000 &
          echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks

          memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
          echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg

          page-types -p `pidof init`   --hwpoison  # shall do nothing
          page-types -p `pidof usemem` --hwpoison  # poison its pages

  corrupt-filter-flags-mask, corrupt-filter-flags-value
      When specified, only poison pages if ((page_flags & mask) ==
      value). This allows stress testing of many kinds of
      pages. The page_flags are the same as in /proc/kpageflags. The
      flag bits are defined in include/linux/kernel-page-flags.h and
      documented in Documentation/admin-guide/mm/pagemap.rst

* Architecture specific MCE injector

  x86 has mce-inject, mce-test

  Some portable hwpoison test programs in mce-test, see below.
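The ``corrupt-filter-flags-mask`` / ``corrupt-filter-flags-value`` test
described above can be illustrated in user space. The sketch below is not
the kernel's code: ``hwpoison_filter_match`` is a hypothetical helper that
only restates the ``(page_flags & mask) == value`` condition, and the three
``KPF_*`` bit numbers are copied from the kernel-page-flags header for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as used by /proc/kpageflags (copied for illustration). */
#define KPF_DIRTY 4
#define KPF_LRU   5
#define KPF_ANON  12

#define KPF_BIT(n) (UINT64_C(1) << (n))

/* True if a page with these flags would pass the mask/value filter. */
static bool hwpoison_filter_match(uint64_t page_flags, uint64_t mask,
                                  uint64_t value)
{
    return (page_flags & mask) == value;
}

/*
 * Example filter: dirty, non-anonymous pages on the LRU. The mask selects
 * the three bits of interest; the value says LRU and DIRTY must be set and
 * ANON must be clear. Bits outside the mask are ignored.
 */
static const uint64_t example_mask =
    KPF_BIT(KPF_LRU) | KPF_BIT(KPF_DIRTY) | KPF_BIT(KPF_ANON);
static const uint64_t example_value =
    KPF_BIT(KPF_LRU) | KPF_BIT(KPF_DIRTY);
```

With these definitions, a page whose flags are LRU and DIRTY matches,
while a page that is additionally ANON, or one that is LRU but clean,
does not. The same mask and value pair would be written to the two
debugfs files to restrict injection accordingly.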

References
==========

http://halobates.de/mce-lc09-2.pdf
    Overview presentation from LinuxCon 09

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
    Test suite (hwpoison specific portable tests in tsrc)

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
    x86 specific injector


Limitations
===========

- Not all page types are supported and never will be. Most kernel internal
  objects cannot be recovered, only LRU pages for now.
- Right now hugepage support is missing.

---
Andi Kleen, Oct 2009