L1TF - L1 Terminal Fault
========================

L1 Terminal Fault is a hardware vulnerability which allows unprivileged
speculative access to data which is available in the Level 1 Data Cache
when the page table entry controlling the virtual address, which is used
for the access, has the Present bit cleared or other reserved bits set.

Affected processors
-------------------

This vulnerability affects a wide range of Intel processors. The
vulnerability is not present on:

   - Processors from AMD, Centaur and other non-Intel vendors

   - Older processor models, where the CPU family is < 6

   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
     Penwell, Pineview, Silvermont, Airmont, Merrifield)

   - The Intel XEON PHI family

   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
     by the Meltdown vulnerability either. These CPUs should become
     available by end of 2018.

Whether a processor is affected or not can be read out from the L1TF
vulnerability file in sysfs. See :ref:`l1tf_sys_info`.

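The ARCH_CAP_RDCL_NO criterion above can be checked with a single bit
test. The following is a minimal sketch, not a recommended interface: the
sysfs file remains the supported way to query the L1TF status. It assumes
that ARCH_CAP_RDCL_NO is bit 0 of IA32_ARCH_CAPABILITIES (MSR 0x10a) and
that the ``msr`` driver is loaded so that ``/dev/cpu/N/msr`` exists. The
function names are illustrative only.

```python
import struct

MSR_IA32_ARCH_CAPABILITIES = 0x10A  # MSR index (assumption stated above)
ARCH_CAP_RDCL_NO = 1 << 0           # bit 0 of that MSR


def rdcl_no_set(arch_capabilities: int) -> bool:
    """Return True if the RDCL_NO bit is set in the given MSR value."""
    return bool(arch_capabilities & ARCH_CAP_RDCL_NO)


def read_arch_capabilities(cpu: int = 0) -> int:
    """Read the MSR via the msr driver; requires root and a loaded driver."""
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(MSR_IA32_ARCH_CAPABILITIES)
        # An MSR is 64 bits wide and is returned in little-endian order.
        return struct.unpack("<Q", f.read(8))[0]
```

A caller would use ``rdcl_no_set(read_arch_capabilities())`` and treat any
read error as "bit not available" rather than as "not affected".
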
Related CVEs
------------

The following CVE entries are related to the L1TF vulnerability:

   =============  =================  ==============================
   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
   =============  =================  ==============================

Problem
-------

If an instruction accesses a virtual address for which the relevant page
table entry (PTE) has the Present bit cleared or other reserved bits set,
then speculative execution ignores the invalid PTE and loads the referenced
data if it is present in the Level 1 Data Cache, as if the page referenced
by the address bits in the PTE was still present and accessible.

While this is a purely speculative mechanism and the instruction will raise
a page fault when it is eventually retired, the mere act of loading the
data and making it available to other speculative instructions opens up the
opportunity for side channel attacks by unprivileged malicious code,
similar to the Meltdown attack.

While Meltdown breaks the user space to kernel space protection, L1TF
allows attacks on any physical memory address in the system, and the attack
works across all protection domains. It allows attacks on SGX and also
works from inside virtual machines because the speculation bypasses the
extended page table (EPT) protection mechanism.


Attack scenarios
----------------

1. Malicious user space
^^^^^^^^^^^^^^^^^^^^^^^

   Operating Systems store arbitrary information in the address bits of a
   PTE which is marked non present. This allows a malicious user space
   application to attack the physical memory to which these PTEs resolve.
   In some cases user-space can maliciously influence the information
   encoded in the address bits of the PTE, thus making attacks more
   deterministic and more practical.

   The Linux kernel contains a mitigation for this attack vector, PTE
   inversion, which is permanently enabled and has no performance
   impact. The kernel ensures that the address bits of PTEs, which are not
   marked present, never point to cacheable physical memory space.

   A system with an up-to-date kernel is protected against attacks from
   malicious user space applications.

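The idea behind PTE inversion can be illustrated with a few lines of bit
arithmetic. This is a simplified sketch, not the kernel's implementation:
it assumes 4 KiB pages, 52 physical address bits and the Present bit at
bit 0, and all names are hypothetical. Flipping the address bits of a
non-present entry makes it refer to a very high physical address, which
is assumed here not to be backed by cacheable memory.

```python
PAGE_SHIFT = 12           # assumption: 4 KiB pages
PHYS_ADDR_BITS = 52       # assumption: architectural maximum on x86-64
PTE_PRESENT = 1 << 0      # assumption: Present bit is bit 0

# Address bits of a PTE: bits PAGE_SHIFT .. PHYS_ADDR_BITS - 1.
PTE_ADDR_MASK = ((1 << PHYS_ADDR_BITS) - 1) & ~((1 << PAGE_SHIFT) - 1)


def invert_if_not_present(pte: int) -> int:
    """Flip the address bits of a non-empty PTE whose Present bit is clear."""
    if pte == 0 or (pte & PTE_PRESENT):
        return pte  # empty or present entries are left untouched
    return pte ^ PTE_ADDR_MASK


def pte_phys_addr(pte: int) -> int:
    """Return the physical address encoded in the PTE address bits."""
    return pte & PTE_ADDR_MASK
```

Applying the function twice yields the original value, so software that
stores data such as swap offsets in a non-present PTE can recover it by
inverting the bits again when reading the entry.
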
2. Malicious guest in a virtual machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The fact that L1TF breaks all domain protections allows malicious guest
   OSes, which can control the PTEs directly, and malicious guest user
   space applications, which run on an unprotected guest kernel lacking the
   PTE inversion mitigation for L1TF, to attack physical host memory.

   A special aspect of L1TF in the context of virtualization is
   simultaneous multithreading (SMT). The Intel implementation of SMT is
   called HyperThreading. The fact that Hyperthreads on the affected
   processors share the L1 Data Cache (L1D) is important for this. As the
   flaw only allows attacks on data which is present in L1D, a malicious
   guest running on one Hyperthread can attack the data which is brought
   into the L1D by the context which runs on the sibling Hyperthread of the
   same physical core. This context can be host OS, host user space or a
   different guest.

   If the processor does not support Extended Page Tables, the attack is
   only possible when the hypervisor does not sanitize the content of the
   effective (shadow) page tables.

   While solutions exist to mitigate these attack vectors fully, these
   mitigations are not enabled by default in the Linux kernel because they
   can affect performance significantly. The kernel provides several
   mechanisms which can be utilized to address the problem depending on the
   deployment scenario. The mitigations, their protection scope and impact
   are described in the next sections.

   The default mitigations and the rationale for choosing them are explained
   at the end of this document. See :ref:`default_mitigations`.

.. _l1tf_sys_info:

L1TF system information
-----------------------

The Linux kernel provides a sysfs interface to enumerate the current L1TF
status of the system: whether the system is vulnerable, and which
mitigations are active. The relevant sysfs file is:

/sys/devices/system/cpu/vulnerabilities/l1tf

The possible values in this file are:

  ===========================  ===============================
  'Not affected'               The processor is not vulnerable
  'Mitigation: PTE Inversion'  The host protection is active
  ===========================  ===============================

If KVM/VMX is enabled and the processor is vulnerable then the following
information is appended to the 'Mitigation: PTE Inversion' part:

  - SMT status:

    =====================  ================
    'VMX: SMT vulnerable'  SMT is enabled
    'VMX: SMT disabled'    SMT is disabled
    =====================  ================

  - L1D Flush mode:

    ================================  ====================================
    'L1D vulnerable'                  L1D flushing is disabled

    'L1D conditional cache flushes'   L1D flush is conditionally enabled

    'L1D cache flushes'               L1D flush is unconditionally enabled
    ================================  ====================================

The resulting grade of protection is discussed in the following sections.


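The status strings listed above lend themselves to a small parser. The
sketch below is only an illustration: it matches on the substrings from
the tables above and makes no assumption about the exact punctuation the
kernel uses to join them, which may differ between kernel versions. The
function names and the returned keys are hypothetical.

```python
from pathlib import Path

L1TF_FILE = Path("/sys/devices/system/cpu/vulnerabilities/l1tf")


def parse_l1tf_status(text: str) -> dict:
    """Classify the content of the l1tf sysfs file by substring matching."""
    text = text.strip()
    status = {
        "affected": text != "Not affected",
        "pte_inversion": "PTE Inversion" in text,
        "smt": None,        # "enabled", "disabled" or None if not reported
        "l1d_flush": None,  # "cond", "always", "never" or None
    }
    if "SMT vulnerable" in text:
        status["smt"] = "enabled"
    elif "SMT disabled" in text:
        status["smt"] = "disabled"
    # Test the longer string first: "cache flushes" is a substring of it.
    if "conditional cache flushes" in text:
        status["l1d_flush"] = "cond"
    elif "cache flushes" in text:
        status["l1d_flush"] = "always"
    elif "L1D vulnerable" in text:
        status["l1d_flush"] = "never"
    return status


def read_l1tf_status() -> dict:
    """Read and classify the sysfs file of the running system."""
    return parse_l1tf_status(L1TF_FILE.read_text())
```

A monitoring script could, for example, raise an alert when ``smt`` is
``"enabled"`` on a host that is known to run untrusted guests.
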
Host mitigation mechanism
-------------------------

The kernel is unconditionally protected against L1TF attacks from malicious
user space running on the host.


Guest mitigation mechanisms
---------------------------

.. _l1d_flush:

1. L1D flush on VMENTER
^^^^^^^^^^^^^^^^^^^^^^^

   To make sure that a guest cannot attack data which is present in the L1D
   the hypervisor flushes the L1D before entering the guest.

   Flushing the L1D evicts not only the data which should not be accessed
   by a potentially malicious guest, it also flushes the guest
   data. Flushing the L1D has a performance impact as the processor has to
   bring the flushed guest data back into the L1D. Depending on the
   frequency of VMEXIT/VMENTER and the type of computations in the guest,
   performance degradation in the range of 1% to 50% has been observed. For
   scenarios where guest VMEXIT/VMENTER are rare the performance impact is
   minimal. Virtio and mechanisms like posted interrupts are designed to
   confine the VMEXITs to a bare minimum, but specific configurations and
   application scenarios might still suffer from a high VMEXIT rate.

   The kernel provides two L1D flush modes:

    - conditional ('cond')
    - unconditional ('always')

   The conditional mode avoids L1D flushing after VMEXITs which execute
   only audited code paths before the corresponding VMENTER. These code
   paths have been verified not to expose secrets or other interesting
   data to an attacker, but they can leak information about the address
   space layout of the hypervisor.

   Unconditional mode flushes L1D on all VMENTER invocations and provides
   maximum protection. It has a higher overhead than the conditional
   mode. The overhead cannot be quantified precisely as it depends on the
   workload scenario and the resulting number of VMEXITs.

   The general recommendation is to enable L1D flush on VMENTER. The kernel
   defaults to conditional mode on affected processors.

   **Note** that L1D flush does not prevent the SMT problem because the
   sibling thread will also bring back its data into the L1D which makes it
   attackable again.

   L1D flush can be controlled by the administrator via the kernel command
   line and sysfs control files. See :ref:`mitigation_control_command_line`
   and :ref:`mitigation_control_kvm`.

.. _guest_confinement:

2. Guest VCPU confinement to dedicated physical cores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   To address the SMT problem, it is possible to make a guest or a group of
   guests affine to one or more physical cores. The proper mechanism for
   that is to utilize exclusive cpusets to ensure that no other guest or
   host tasks can run on these cores.

   If only a single guest or related guests run on sibling SMT threads on
   the same physical core then they can only attack their own memory and
   restricted parts of the host memory.

   Host memory is attackable when one of the sibling SMT threads runs in
   host OS (hypervisor) context and the other in guest context. The amount
   of valuable information from the host OS context depends on the context
   which the host OS executes, i.e. interrupts, soft interrupts and kernel
   threads. The amount of valuable data from these contexts cannot be
   declared as non-interesting for an attacker without deep inspection of
   the code.

   **Note** that assigning guests to a fixed set of physical cores affects
   the ability of the scheduler to do load balancing and might have
   negative effects on CPU utilization depending on the hosting
   scenario. Disabling SMT might be a viable alternative for particular
   scenarios.

   For further information about confining guests to a single or to a group
   of cores consult the cpusets documentation:

   https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v1/cpusets.rst

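Confinement only helps when a guest owns complete physical cores, i.e. all
sibling threads of each core it runs on. The sketch below shows one way to
expand a set of logical CPUs to whole cores. It assumes that the sibling
threads of a CPU can be read from the
``/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`` file in the
usual kernel CPU list format. The helper names are hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> set:
    """Parse a kernel CPU list such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def thread_siblings(cpu: int) -> set:
    """Return all logical CPUs which share a physical core with 'cpu'."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return parse_cpu_list(path.read_text())


def whole_cores(cpus, siblings=thread_siblings) -> set:
    """Expand 'cpus' so that it only contains complete physical cores."""
    result = set()
    for cpu in cpus:
        result |= siblings(cpu)
    return result
```

The resulting set is what would be assigned to an exclusive cpuset for the
guest, as described in the cpusets documentation referenced above.
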
.. _interrupt_isolation:

3. Interrupt affinity
^^^^^^^^^^^^^^^^^^^^^

   Interrupts can be made affine to logical CPUs. This is not universally
   true because there are types of interrupts which are truly per CPU
   interrupts, e.g. the local timer interrupt. Apart from that, multi queue
   devices affine their interrupts to single CPUs or groups of CPUs per
   queue without allowing the administrator to control the affinities.

   Moving the interrupts, which can be affinity controlled, away from CPUs
   which run untrusted guests, reduces the attack vector space.

   Whether the interrupts which are affine to CPUs that run untrusted
   guests provide interesting data for an attacker depends on the system
   configuration and the scenarios which run on the system. While for some
   of the interrupts it can be assumed that they won't expose interesting
   information beyond exposing hints about the host OS memory layout, there
   is no way to make general assumptions.

   Interrupt affinity can be controlled by the administrator via the
   /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
   available at:

   https://www.kernel.org/doc/Documentation/IRQ-affinity.txt

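Moving interrupts away from guest CPUs amounts to computing the set of
remaining host CPUs and writing it to the per-interrupt affinity file. The
following sketch assumes the ``smp_affinity_list`` variant, which takes a
CPU list such as ``0-1,4-5``. The function names are hypothetical, and the
write is expected to fail for interrupts whose affinity cannot be
controlled by the administrator.

```python
def format_cpu_list(cpus) -> str:
    """Format CPU numbers as a kernel CPU list, e.g. {0,1,2,3,8} -> "0-3,8"."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current range
        else:
            ranges.append([cpu, cpu])    # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def irq_affinity_for_host(all_cpus, guest_cpus) -> str:
    """Return the CPU list of all CPUs which do not run untrusted guests."""
    host = set(all_cpus) - set(guest_cpus)
    if not host:
        raise ValueError("no CPU left to handle interrupts")
    return format_cpu_list(host)


def set_irq_affinity(irq: int, cpu_list: str) -> None:
    """Write the CPU list to /proc/irq/<irq>/smp_affinity_list (needs root)."""
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(cpu_list)
```

On an eight CPU system where CPUs 2, 3, 6 and 7 are reserved for guests,
``irq_affinity_for_host(range(8), {2, 3, 6, 7})`` yields ``0-1,4-5``.
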
.. _smt_control:

4. SMT control
^^^^^^^^^^^^^^

   To prevent the SMT issues of L1TF it might be necessary to disable SMT
   completely. Disabling SMT can have a significant performance impact, but
   the impact depends on the hosting scenario and the type of workloads.
   The impact of disabling SMT also needs to be weighed against the impact
   of other mitigation solutions like confining guests to dedicated cores.

   The kernel provides a sysfs interface to retrieve the status of SMT and
   to control it. It also provides a kernel command line interface to
   control SMT.

   The kernel command line interface consists of the following options:

     ===========  ==========================================================
     nosmt        Affects the bring up of the secondary CPUs during boot. The
                  kernel tries to bring all present CPUs online during the
                  boot process. "nosmt" makes sure that from each physical
                  core only one thread, the so-called primary (hyper)
                  thread, is activated. Due to a design flaw of Intel
                  processors related to Machine Check Exceptions the
                  non-primary siblings have to be brought up at least
                  partially and are then shut down again. "nosmt" can be
                  undone via the sysfs interface.

     nosmt=force  Has the same effect as "nosmt" but it does not allow the
                  SMT disable to be undone via the sysfs interface.
     ===========  ==========================================================

   The sysfs interface provides two files:

   - /sys/devices/system/cpu/smt/control
   - /sys/devices/system/cpu/smt/active

   /sys/devices/system/cpu/smt/control:

     This file exposes the SMT control state and provides the ability to
     disable or (re)enable SMT. The possible states are:

        ==============  ===================================================
        on              SMT is supported by the CPU and enabled. All
                        logical CPUs can be onlined and offlined without
                        restrictions.

        off             SMT is supported by the CPU and disabled. Only
                        the so-called primary SMT threads can be onlined
                        and offlined without restrictions. An attempt to
                        online a non-primary sibling is rejected.

        forceoff        Same as 'off' but the state cannot be controlled.
                        Attempts to write to the control file are rejected.

        notsupported    The processor does not support SMT. It's therefore
                        not affected by the SMT implications of L1TF.
                        Attempts to write to the control file are rejected.
        ==============  ===================================================

     The possible states which can be written into this file to control SMT
     state are:

     - on
     - off
     - forceoff

   /sys/devices/system/cpu/smt/active:

     This file reports whether SMT is enabled and active, i.e. if on any
     physical core two or more sibling threads are online.

   SMT control is also possible at boot time via the l1tf kernel command
   line parameter in combination with L1D flush control. See
   :ref:`mitigation_control_command_line`.

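The rules for the SMT control file can be expressed as a small helper. This
is a sketch under stated assumptions: the write rules follow the state
table above, the ``active`` file is assumed to contain ``1`` when SMT is
active, and the function names are hypothetical. The directory is a
parameter so that the logic can be exercised without touching the real
sysfs files.

```python
from pathlib import Path

SMT_DIR = Path("/sys/devices/system/cpu/smt")
WRITABLE_STATES = {"on", "off", "forceoff"}
LOCKED_STATES = {"forceoff", "notsupported"}  # writes are rejected


def write_allowed(current: str, requested: str) -> bool:
    """Return True if 'requested' may be written while in state 'current'."""
    return requested in WRITABLE_STATES and current not in LOCKED_STATES


def set_smt(requested: str, smt_dir: Path = SMT_DIR) -> None:
    """Change the SMT control state; raises if the change is not permitted."""
    control = smt_dir / "control"
    current = control.read_text().strip()
    if not write_allowed(current, requested):
        raise PermissionError(
            f"cannot change SMT state from {current!r} to {requested!r}")
    control.write_text(requested)


def smt_active(smt_dir: Path = SMT_DIR) -> bool:
    """Assumption: the 'active' file contains "1" when SMT is active."""
    return (smt_dir / "active").read_text().strip() == "1"
```

Checking the current state before writing gives a clear error message in
the ``forceoff`` and ``notsupported`` cases instead of a failed write.
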
5. Disabling EPT
^^^^^^^^^^^^^^^^

  Disabling EPT for virtual machines provides full mitigation for L1TF even
  with SMT enabled, because the effective page tables for guests are
  managed and sanitized by the hypervisor. However, disabling EPT has a
  significant performance impact, especially when the Meltdown mitigation
  KPTI is enabled.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

There is ongoing research and development for new mitigation mechanisms to
address the performance impact of disabling SMT or EPT.

.. _mitigation_control_command_line:

Mitigation control on the kernel command line
---------------------------------------------

The kernel command line allows controlling the L1TF mitigations at boot
time with the option "l1tf=". The valid arguments for this option are:

  ============  =============================================================
  full          Provides all available mitigations for the L1TF
                vulnerability. Disables SMT and enables all mitigations in
                the hypervisors, i.e. unconditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot. Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  full,force    Same as 'full', but disables SMT and L1D flush runtime
                control. Implies the 'nosmt=force' command line option.
                (i.e. sysfs control of SMT is disabled.)

  flush         Leaves SMT enabled and enables the default hypervisor
                mitigation, i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot. Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nosmt   Disables SMT and enables the default hypervisor mitigation,
                i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot. Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nowarn  Same as 'flush', but hypervisors will not warn when a VM is
                started in a potentially insecure configuration.

  off           Disables hypervisor mitigations and doesn't emit any
                warnings.
                It also drops the swap size and available RAM limit
                restrictions on both hypervisor and bare metal.
  ============  =============================================================

The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.


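The option table above can be captured in a lookup structure, for example
to audit the boot configuration of a fleet of hosts. The sketch below is
illustrative only: the type and function names are hypothetical, the
"last occurrence wins" rule for repeated options is an assumption, and
rejecting unknown arguments is a choice of this example, not documented
kernel behaviour.

```python
from typing import NamedTuple


class L1tfMode(NamedTuple):
    smt_disabled: bool  # SMT is disabled at boot
    l1d_flush: str      # hypervisor L1D flush mode: always / cond / never
    warn: bool          # hypervisors warn about insecure configurations


L1TF_OPTIONS = {
    "full":         L1tfMode(True,  "always", True),
    "full,force":   L1tfMode(True,  "always", True),
    "flush":        L1tfMode(False, "cond",   True),
    "flush,nosmt":  L1tfMode(True,  "cond",   True),
    "flush,nowarn": L1tfMode(False, "cond",   False),
    "off":          L1tfMode(False, "never",  False),
}

# Options which disable runtime control of SMT and L1D flushing.
RUNTIME_CONTROL_DISABLED = {"full,force"}


def l1tf_option(cmdline: str, default: str = "flush") -> str:
    """Extract the l1tf= argument from a kernel command line string."""
    option = default
    for word in cmdline.split():
        if word.startswith("l1tf="):
            option = word[len("l1tf="):]  # assumption: last occurrence wins
    if option not in L1TF_OPTIONS:
        raise ValueError(f"unknown l1tf argument: {option!r}")
    return option
```

The command line of the running kernel is available in ``/proc/cmdline``,
so ``L1TF_OPTIONS[l1tf_option(open("/proc/cmdline").read())]`` describes
the boot-time selection.
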
.. _mitigation_control_kvm:

Mitigation control for KVM - module parameter
-------------------------------------------------------------

The KVM hypervisor mitigation mechanism, flushing the L1D cache when
entering a guest, can be controlled with a module parameter.

The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
following arguments:

  ============  ==============================================================
  always        L1D cache flush on every VMENTER.

  cond          Flush L1D on VMENTER only when the code between VMEXIT and
                VMENTER can leak host memory which is considered
                interesting for an attacker. This can still leak host memory
                which allows, e.g., determining the host's address space
                layout.

  never         Disables the mitigation.
  ============  ==============================================================

The parameter can be provided on the kernel command line, as a module
parameter when loading the modules, and it can be modified at runtime via
the sysfs file:

/sys/module/kvm_intel/parameters/vmentry_l1d_flush

The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
line, then 'always' is enforced, the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.

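How the "l1tf=" option and the module parameter interact can be summarized
in a few lines. This is a sketch, not the kernel's logic: the text above
only states that 'cond' is the default and that 'l1tf=full,force' enforces
'always'. That an explicitly given module parameter otherwise takes
precedence over the mode implied by "l1tf=" is an assumption of this
example, and the names are hypothetical.

```python
from typing import Optional

VALID_FLUSH_MODES = ("always", "cond", "never")

# L1D flush mode implied by the l1tf= option when no module parameter
# is given (taken from the l1tf= option descriptions).
FLUSH_MODE_FROM_L1TF = {
    "full": "always",
    "full,force": "always",
    "flush": "cond",
    "flush,nosmt": "cond",
    "flush,nowarn": "cond",
    "off": "never",
}


def effective_l1d_flush(l1tf: str = "flush",
                        module_param: Optional[str] = None) -> str:
    """Return the L1D flush mode resulting from both settings."""
    if l1tf == "full,force":
        return "always"  # enforced; the module parameter is ignored
    if module_param is not None:
        if module_param not in VALID_FLUSH_MODES:
            raise ValueError(f"invalid flush mode: {module_param!r}")
        return module_param  # assumption: explicit parameter wins
    return FLUSH_MODE_FROM_L1TF[l1tf]
```

The mode which is actually in effect on a running system can be read from
the sysfs file named above.
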
.. _mitigation_selection:

Mitigation selection guide
--------------------------

1. No virtualization in use
^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The system is protected by the kernel unconditionally and no further
   action is required.

2. Virtualization with trusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   If the guest comes from a trusted source and the guest OS kernel is
   guaranteed to have the L1TF mitigations in place the system is fully
   protected against L1TF and no further action is required.

   To avoid the overhead of the default L1D flushing on VMENTER the
   administrator can disable the flushing via the kernel command line and
   sysfs control files. See :ref:`mitigation_control_command_line` and
   :ref:`mitigation_control_kvm`.


3. Virtualization with untrusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3.1. SMT not supported or disabled
""""""""""""""""""""""""""""""""""

  If SMT is not supported by the processor or disabled in the BIOS or by
  the kernel, it's only required to enforce L1D flushing on VMENTER.

  Conditional L1D flushing is the default behaviour and can be tuned. See
  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

3.2. EPT not supported or disabled
""""""""""""""""""""""""""""""""""

  If EPT is not supported by the processor or disabled in the hypervisor,
  the system is fully protected. SMT can stay enabled and L1D flushing on
  VMENTER is not required.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

3.3. SMT and EPT supported and active
"""""""""""""""""""""""""""""""""""""

  If SMT and EPT are supported and active then various degrees of
  mitigations can be employed:

  - L1D flushing on VMENTER:

    L1D flushing on VMENTER is the minimal protection requirement, but it
    is only effective in combination with other mitigation methods.

    Conditional L1D flushing is the default behaviour and can be tuned. See
    :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

  - Guest confinement:

    Confinement of guests to a single or a group of physical cores which
    are not running any other processes can reduce the attack surface
    significantly, but interrupts, soft interrupts and kernel threads can
    still expose valuable data to a potential attacker. See
    :ref:`guest_confinement`.

  - Interrupt isolation:

    Isolating the guest CPUs from interrupts can reduce the attack surface
    further, but still allows a malicious guest to explore a limited amount
    of host physical memory. This can at least be used to gain knowledge
    about the host address space layout. The interrupts which have a fixed
    affinity to the CPUs which run the untrusted guests can, depending on
    the scenario, still trigger soft interrupts and schedule kernel threads
    which might expose valuable information. See
    :ref:`interrupt_isolation`.

The above three mitigation methods combined can provide protection to a
certain degree, but the risk of the remaining attack surface has to be
carefully analyzed. For full protection the following methods are
available:

  - Disabling SMT:

    Disabling SMT and enforcing the L1D flushing provides the maximum
    amount of protection. This mitigation does not depend on any of the
    above mitigation methods.

    SMT control and L1D flushing can be tuned by the command line
    parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
    time with the matching sysfs control files. See :ref:`smt_control`,
    :ref:`mitigation_control_command_line` and
    :ref:`mitigation_control_kvm`.

  - Disabling EPT:

    Disabling EPT provides the maximum amount of protection as well. It
    does not depend on any of the above mitigation methods. SMT can stay
    enabled and L1D flushing is not required, but the performance impact is
    significant.

    EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
    parameter.

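The selection guide up to this point can be condensed into a small decision
helper. The sketch below only restates the cases described in this
document. The function name, the parameters and the wording of the
returned strings are hypothetical, and the result is no substitute for a
risk analysis of the actual deployment.

```python
def recommend(virtualization: bool, guests_trusted: bool = True,
              smt_active: bool = False, ept_enabled: bool = True) -> dict:
    """Map a deployment scenario to the mitigations named in the guide."""
    if not virtualization:
        # Case 1: PTE inversion protects the host unconditionally.
        return {"full": ["no action required"], "partial": []}
    if guests_trusted:
        # Case 2: trusted guests with mitigated guest kernels.
        return {"full": ["no action required (L1D flush may be disabled)"],
                "partial": []}
    if not ept_enabled:
        # Case 3.2: shadow page tables are sanitized by the hypervisor.
        return {"full": ["no action required"], "partial": []}
    if not smt_active:
        # Case 3.1: only the L1D flush is needed.
        return {"full": ["L1D flush on VMENTER"], "partial": []}
    # Case 3.3: SMT and EPT are both active.
    return {
        "full": ["disable SMT and enforce L1D flush", "disable EPT"],
        "partial": ["L1D flush on VMENTER", "guest confinement",
                    "interrupt isolation"],
    }
```

The entries under ``full`` are alternatives, each sufficient on its own,
while the entries under ``partial`` are meant to be combined and still
leave a residual attack surface.
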
3.4. Nested virtual machines
""""""""""""""""""""""""""""

When nested virtualization is in use, three operating systems are involved:
the bare metal hypervisor, the nested hypervisor and the nested virtual
machine. VMENTER operations from the nested hypervisor into the nested
guest will always be processed by the bare metal hypervisor. If KVM is the
bare metal hypervisor it will:

 - Flush the L1D cache on every switch from the nested hypervisor to the
   nested virtual machine, so that the nested hypervisor's secrets are not
   exposed to the nested virtual machine;

 - Flush the L1D cache on every switch from the nested virtual machine to
   the nested hypervisor; this is a complex operation, and flushing the L1D
   cache prevents the bare metal hypervisor's secrets from being exposed to
   the nested virtual machine;

 - Instruct the nested hypervisor to not perform any L1D cache flush. This
   is an optimization to avoid double L1D flushing.


.. _default_mitigations:

Default mitigations
-------------------

  The kernel default mitigations for vulnerable processors are:

  - PTE inversion to protect against malicious user space. This is done
    unconditionally and cannot be controlled. The swap storage is limited
    to ~16TB.

  - L1D conditional flushing on VMENTER when EPT is enabled for
    a guest.

  The kernel does not by default enforce the disabling of SMT, which leaves
  SMT systems vulnerable when running untrusted guests with EPT enabled.

  The rationale for this choice is:

  - Forcibly disabling SMT can break existing setups, especially with
    unattended updates.

  - If regular users run untrusted guests on their machine, then L1TF is
    just an add-on to other malware which might be embedded in an untrusted
    guest, e.g. spam-bots or attacks on the local network.

    There is no technical way to prevent a user from blindly running
    untrusted code on their machines.

  - It's technically extremely unlikely and from today's knowledge even
    impossible that L1TF can be exploited via the most popular attack
    mechanisms like JavaScript because these mechanisms have no way to
    control PTEs. If this were possible and no other mitigation were
    available, then the default might be different.

  - The administrators of cloud and hosting setups have to carefully
    analyze the risk for their scenarios and make the appropriate
    mitigation choices, which might even vary across their deployed
    machines and also result in other changes of their overall setup.
    There is no way for the kernel to provide a sensible default for
    these kinds of scenarios.