.. SPDX-License-Identifier: GPL-2.0

==========
Nested VMX
==========

Overview
--------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability: running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

The theory behind the nested VMX feature, its implementation and its
performance characteristics are described in much greater detail in the
OSDI 2010 paper "The Turtles Project: Design and Implementation of Nested
Virtualization", available at:

    https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels - the host (KVM) and the guests.
In nested virtualization, we have three levels: The host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.


Running nested VMX
------------------

The nested VMX feature is disabled by default. It can be enabled by giving
the "nested=1" option to the kvm-intel module.
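
As a rough illustration of checking this setting from user space, the sketch
below reads the module parameter back through sysfs. It assumes that the
parameter is exported as /sys/module/kvm_intel/parameters/nested and that it
prints as "Y"/"N" or "1"/"0" depending on the kernel version. Both helper
names are hypothetical and are not part of KVM.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: interpret the text of the "nested" parameter.
 * Assumption: an enabled value starts with 'Y', 'y' or '1'. */
static bool nested_value_enabled(const char *text)
{
    return text && (text[0] == 'Y' || text[0] == 'y' || text[0] == '1');
}

/* Hypothetical helper: read the parameter file at 'path'.
 * Returns 1 if nesting is enabled, 0 if it is disabled, and -1 if the
 * file cannot be opened (e.g. the kvm-intel module is not loaded). */
static int kvm_intel_nested_enabled(const char *path)
{
    char buf[8] = "";
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f))
        buf[0] = '\0';	/* empty file: treat as disabled */
    fclose(f);
    return nested_value_enabled(buf) ? 1 : 0;
}
```

Calling kvm_intel_nested_enabled("/sys/module/kvm_intel/parameters/nested")
on the host would then report whether the "nested=1" option took effect.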

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

- ``-cpu host`` (emulated CPU has all features of the real CPU)

- ``-cpu qemu64,+vmx`` (add just the vmx feature to a named CPU type)
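
Whether the "VMX" feature actually reached the guest can be checked from
inside L1 with CPUID: leaf 1 reports VMX support in bit 5 of ECX. The
following is a minimal sketch, assuming a GCC or Clang toolchain that
provides <cpuid.h>. The helper names are hypothetical, and on non-x86 builds
the query simply reports false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/Clang helper header providing __get_cpuid() */
#endif

/* CPUID.1:ECX bit 5 is the architectural VMX feature flag. */
#define CPUID_1_ECX_VMX (UINT32_C(1) << 5)

/* Hypothetical helper: decode the VMX bit from a CPUID.1:ECX value. */
static bool ecx_has_vmx(uint32_t ecx)
{
    return (ecx & CPUID_1_ECX_VMX) != 0;
}

/* Hypothetical helper: ask the CPU this code runs on.
 * Returns false if CPUID leaf 1 is unavailable or on non-x86 builds. */
static bool cpu_reports_vmx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ecx_has_vmx(ecx);
#else
    return false;
#endif
}
```

Run inside L1, cpu_reports_vmx() is expected to return true only when one of
the qemu options above (or an equivalent) exposed the feature to the guest.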


ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure; it is struct vmcs12 from arch/x86/kvm/vmx.c.

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 builds for L1, and "vmcs02", the VMCS
which L0 builds to actually run L2. How this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, live migration across KVM versions can break.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

::

  typedef u64 natural_width;
  struct __packed vmcs12 {
        /* According to the Intel spec, a VMCS region must start with
         * these two user-visible fields */
        u32 revision_id;
        u32 abort;

        u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
        u32 padding[7]; /* room for future expansion */

        u64 io_bitmap_a;
        u64 io_bitmap_b;
        u64 msr_bitmap;
        u64 vm_exit_msr_store_addr;
        u64 vm_exit_msr_load_addr;
        u64 vm_entry_msr_load_addr;
        u64 tsc_offset;
        u64 virtual_apic_page_addr;
        u64 apic_access_addr;
        u64 ept_pointer;
        u64 guest_physical_address;
        u64 vmcs_link_pointer;
        u64 guest_ia32_debugctl;
        u64 guest_ia32_pat;
        u64 guest_ia32_efer;
        u64 guest_pdptr0;
        u64 guest_pdptr1;
        u64 guest_pdptr2;
        u64 guest_pdptr3;
        u64 host_ia32_pat;
        u64 host_ia32_efer;
        u64 padding64[8]; /* room for future expansion */
        natural_width cr0_guest_host_mask;
        natural_width cr4_guest_host_mask;
        natural_width cr0_read_shadow;
        natural_width cr4_read_shadow;
        natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
        natural_width exit_qualification;
        natural_width guest_linear_address;
        natural_width guest_cr0;
        natural_width guest_cr3;
        natural_width guest_cr4;
        natural_width guest_es_base;
        natural_width guest_cs_base;
        natural_width guest_ss_base;
        natural_width guest_ds_base;
        natural_width guest_fs_base;
        natural_width guest_gs_base;
        natural_width guest_ldtr_base;
        natural_width guest_tr_base;
        natural_width guest_gdtr_base;
        natural_width guest_idtr_base;
        natural_width guest_dr7;
        natural_width guest_rsp;
        natural_width guest_rip;
        natural_width guest_rflags;
        natural_width guest_pending_dbg_exceptions;
        natural_width guest_sysenter_esp;
        natural_width guest_sysenter_eip;
        natural_width host_cr0;
        natural_width host_cr3;
        natural_width host_cr4;
        natural_width host_fs_base;
        natural_width host_gs_base;
        natural_width host_tr_base;
        natural_width host_gdtr_base;
        natural_width host_idtr_base;
        natural_width host_ia32_sysenter_esp;
        natural_width host_ia32_sysenter_eip;
        natural_width host_rsp;
        natural_width host_rip;
        natural_width paddingl[8]; /* room for future expansion */
        u32 pin_based_vm_exec_control;
        u32 cpu_based_vm_exec_control;
        u32 exception_bitmap;
        u32 page_fault_error_code_mask;
        u32 page_fault_error_code_match;
        u32 cr3_target_count;
        u32 vm_exit_controls;
        u32 vm_exit_msr_store_count;
        u32 vm_exit_msr_load_count;
        u32 vm_entry_controls;
        u32 vm_entry_msr_load_count;
        u32 vm_entry_intr_info_field;
        u32 vm_entry_exception_error_code;
        u32 vm_entry_instruction_len;
        u32 tpr_threshold;
        u32 secondary_vm_exec_control;
        u32 vm_instruction_error;
        u32 vm_exit_reason;
        u32 vm_exit_intr_info;
        u32 vm_exit_intr_error_code;
        u32 idt_vectoring_info_field;
        u32 idt_vectoring_error_code;
        u32 vm_exit_instruction_len;
        u32 vmx_instruction_info;
        u32 guest_es_limit;
        u32 guest_cs_limit;
        u32 guest_ss_limit;
        u32 guest_ds_limit;
        u32 guest_fs_limit;
        u32 guest_gs_limit;
        u32 guest_ldtr_limit;
        u32 guest_tr_limit;
        u32 guest_gdtr_limit;
        u32 guest_idtr_limit;
        u32 guest_es_ar_bytes;
        u32 guest_cs_ar_bytes;
        u32 guest_ss_ar_bytes;
        u32 guest_ds_ar_bytes;
        u32 guest_fs_ar_bytes;
        u32 guest_gs_ar_bytes;
        u32 guest_ldtr_ar_bytes;
        u32 guest_tr_ar_bytes;
        u32 guest_interruptibility_info;
        u32 guest_activity_state;
        u32 guest_sysenter_cs;
        u32 host_ia32_sysenter_cs;
        u32 padding32[8]; /* room for future expansion */
        u16 virtual_processor_id;
        u16 guest_es_selector;
        u16 guest_cs_selector;
        u16 guest_ss_selector;
        u16 guest_ds_selector;
        u16 guest_fs_selector;
        u16 guest_gs_selector;
        u16 guest_ldtr_selector;
        u16 guest_tr_selector;
        u16 host_es_selector;
        u16 host_cs_selector;
        u16 host_ss_selector;
        u16 host_ds_selector;
        u16 host_fs_selector;
        u16 host_gs_selector;
        u16 host_tr_selector;
  };
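
Because a layout change in struct vmcs12 can break live migration, it may
help to pin field offsets at compile time so that an accidental change fails
the build. The sketch below is only an illustration of that idea. It uses a
shortened stand-in, struct vmcs12_prefix, that copies just the first few
fields listed above, and a CHECK_OFFSET macro defined here for the example.
Neither name is taken from the text of this document, and the packed
attribute assumes a GCC or Clang compiler.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* Illustrative prefix only: NOT the full struct vmcs12. */
struct __attribute__((packed)) vmcs12_prefix {
    u32 revision_id;
    u32 abort;
    u32 launch_state;
    u32 padding[7];
    u64 io_bitmap_a;
    u64 io_bitmap_b;
};

/* Hypothetical macro for this sketch: break the build if a field moves. */
#define CHECK_OFFSET(field, off) \
    _Static_assert(offsetof(struct vmcs12_prefix, field) == (off), \
                   "offset of " #field " changed")

CHECK_OFFSET(revision_id, 0);
CHECK_OFFSET(abort, 4);
CHECK_OFFSET(launch_state, 8);
CHECK_OFFSET(padding, 12);
CHECK_OFFSET(io_bitmap_a, 40);	/* 3 * 4 bytes + 7 * 4 bytes of padding */
CHECK_OFFSET(io_bitmap_b, 48);
```

With checks of this kind in place, inserting or resizing a field ahead of
io_bitmap_a would stop compilation instead of silently changing the layout
that a VMCS12_REVISION value is meant to describe.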


Authors
-------

These patches were written by:
    - Abel Gordon, abelg <at> il.ibm.com
    - Nadav Har'El, nyh <at> il.ibm.com
    - Orit Wasserman, oritw <at> il.ibm.com
    - Ben-Ami Yassor, benami <at> il.ibm.com
    - Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
    - Anthony Liguori, aliguori <at> us.ibm.com
    - Mike Day, mdday <at> us.ibm.com
    - Michael Factor, factor <at> il.ibm.com
    - Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
    - Avi Kivity, avi <at> redhat.com
    - Gleb Natapov, gleb <at> redhat.com
    - Marcelo Tosatti, mtosatti <at> redhat.com
    - Kevin Tian, kevin.tian <at> intel.com
    - and others.