.. SPDX-License-Identifier: GPL-2.0

==========
Nested VMX
==========

Overview
---------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability - of running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

	https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels - the host (KVM) and the guests.
In nested virtualization, we have three levels: The host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.


Running nested VMX
------------------

The nested VMX feature is disabled by default. It can be enabled by giving
the "nested=1" option to the kvm-intel module.
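
To confirm which value of the option is in effect on a running host, the
module parameter can be read back from sysfs. The following is a minimal
sketch, not part of KVM: it assumes the parameter is exported at
/sys/module/kvm_intel/parameters/nested (the usual location for module
parameters) and that its text starts with "Y" or "1" when enabled. The
function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Interpret the text of the "nested" parameter file.
 * Assumption: an enabled boolean parameter reads back as "Y" or "1". */
bool nested_param_enabled(const char *text)
{
    return text != NULL &&
           (text[0] == 'Y' || text[0] == 'y' || text[0] == '1');
}

/* Returns 1 if nested VMX is enabled, 0 if it is disabled, and -1 if the
 * file cannot be opened (for example, because kvm-intel is not loaded). */
int nested_vmx_enabled(const char *path)
{
    char buf[8] = "";
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) == NULL)
        buf[0] = '\0';          /* empty file: treat as disabled */
    fclose(f);
    return nested_param_enabled(buf) ? 1 : 0;
}
```

A caller would typically pass "/sys/module/kvm_intel/parameters/nested" as
the path and treat a return value of -1 as "module not loaded".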

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

 - "-cpu host" (emulated CPU has all features of the real CPU)

 - "-cpu qemu64,+vmx" (add just the vmx feature to a named CPU type)
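
Whether the option took effect can be checked from inside L1: a CPU that
offers VMX sets bit 5 of ECX in CPUID leaf 1. The sketch below illustrates
that check. It assumes a GCC- or Clang-compatible compiler that provides
<cpuid.h> on x86; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              /* __get_cpuid(), GCC and Clang */
#endif

/* CPUID leaf 1 reports VMX support in bit 5 of ECX. */
#define CPUID_1_ECX_VMX (UINT32_C(1) << 5)

/* Pure helper: test the VMX bit in an ECX value returned by CPUID leaf 1. */
bool ecx_has_vmx(uint32_t ecx)
{
    return (ecx & CPUID_1_ECX_VMX) != 0;
}

/* Returns 1 if the current CPU advertises VMX, 0 if it does not, and -1 if
 * this cannot be determined (non-x86 build, or leaf 1 unavailable). */
int cpu_advertises_vmx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return ecx_has_vmx(ecx) ? 1 : 0;
#else
    return -1;
#endif
}
```

Run inside a guest started with the default qemu64 CPU type,
cpu_advertises_vmx() would be expected to return 0; with either option
above it would be expected to return 1, provided nested VMX is enabled
in L0.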


ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02" is the VMCS
which L0 builds to actually run L2 - how this is done is explained in the
aforementioned paper.
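
The naming rule can be stated compactly: "vmcsXY" is the VMCS that level X
builds for level Y. The helper below is purely illustrative and does not
exist in KVM; it only spells out that convention.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the conventional name of the VMCS that level
 * `builder` prepares for level `target` into buf, e.g. (1, 2) -> "vmcs12".
 * Returns the value of snprintf(), i.e. the length of the full name. */
int vmcs_name(char *buf, size_t len, unsigned int builder, unsigned int target)
{
    return snprintf(buf, len, "vmcs%u%u", builder, target);
}
```

With this rule, (0, 1) names the VMCS L0 uses to run L1, (1, 2) the one L1
prepares for L2, and (0, 2) the one L0 actually loads to run L2.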

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

::

    typedef u64 natural_width;
    struct __packed vmcs12 {
        /* According to the Intel spec, a VMCS region must start with
         * these two user-visible fields */
        u32 revision_id;
        u32 abort;

        u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
        u32 padding[7]; /* room for future expansion */

        u64 io_bitmap_a;
        u64 io_bitmap_b;
        u64 msr_bitmap;
        u64 vm_exit_msr_store_addr;
        u64 vm_exit_msr_load_addr;
        u64 vm_entry_msr_load_addr;
        u64 tsc_offset;
        u64 virtual_apic_page_addr;
        u64 apic_access_addr;
        u64 ept_pointer;
        u64 guest_physical_address;
        u64 vmcs_link_pointer;
        u64 guest_ia32_debugctl;
        u64 guest_ia32_pat;
        u64 guest_ia32_efer;
        u64 guest_pdptr0;
        u64 guest_pdptr1;
        u64 guest_pdptr2;
        u64 guest_pdptr3;
        u64 host_ia32_pat;
        u64 host_ia32_efer;
        u64 padding64[8]; /* room for future expansion */
        natural_width cr0_guest_host_mask;
        natural_width cr4_guest_host_mask;
        natural_width cr0_read_shadow;
        natural_width cr4_read_shadow;
        natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
        natural_width exit_qualification;
        natural_width guest_linear_address;
        natural_width guest_cr0;
        natural_width guest_cr3;
        natural_width guest_cr4;
        natural_width guest_es_base;
        natural_width guest_cs_base;
        natural_width guest_ss_base;
        natural_width guest_ds_base;
        natural_width guest_fs_base;
        natural_width guest_gs_base;
        natural_width guest_ldtr_base;
        natural_width guest_tr_base;
        natural_width guest_gdtr_base;
        natural_width guest_idtr_base;
        natural_width guest_dr7;
        natural_width guest_rsp;
        natural_width guest_rip;
        natural_width guest_rflags;
        natural_width guest_pending_dbg_exceptions;
        natural_width guest_sysenter_esp;
        natural_width guest_sysenter_eip;
        natural_width host_cr0;
        natural_width host_cr3;
        natural_width host_cr4;
        natural_width host_fs_base;
        natural_width host_gs_base;
        natural_width host_tr_base;
        natural_width host_gdtr_base;
        natural_width host_idtr_base;
        natural_width host_ia32_sysenter_esp;
        natural_width host_ia32_sysenter_eip;
        natural_width host_rsp;
        natural_width host_rip;
        natural_width paddingl[8]; /* room for future expansion */
        u32 pin_based_vm_exec_control;
        u32 cpu_based_vm_exec_control;
        u32 exception_bitmap;
        u32 page_fault_error_code_mask;
        u32 page_fault_error_code_match;
        u32 cr3_target_count;
        u32 vm_exit_controls;
        u32 vm_exit_msr_store_count;
        u32 vm_exit_msr_load_count;
        u32 vm_entry_controls;
        u32 vm_entry_msr_load_count;
        u32 vm_entry_intr_info_field;
        u32 vm_entry_exception_error_code;
        u32 vm_entry_instruction_len;
        u32 tpr_threshold;
        u32 secondary_vm_exec_control;
        u32 vm_instruction_error;
        u32 vm_exit_reason;
        u32 vm_exit_intr_info;
        u32 vm_exit_intr_error_code;
        u32 idt_vectoring_info_field;
        u32 idt_vectoring_error_code;
        u32 vm_exit_instruction_len;
        u32 vmx_instruction_info;
        u32 guest_es_limit;
        u32 guest_cs_limit;
        u32 guest_ss_limit;
        u32 guest_ds_limit;
        u32 guest_fs_limit;
        u32 guest_gs_limit;
        u32 guest_ldtr_limit;
        u32 guest_tr_limit;
        u32 guest_gdtr_limit;
        u32 guest_idtr_limit;
        u32 guest_es_ar_bytes;
        u32 guest_cs_ar_bytes;
        u32 guest_ss_ar_bytes;
        u32 guest_ds_ar_bytes;
        u32 guest_fs_ar_bytes;
        u32 guest_gs_ar_bytes;
        u32 guest_ldtr_ar_bytes;
        u32 guest_tr_ar_bytes;
        u32 guest_interruptibility_info;
        u32 guest_activity_state;
        u32 guest_sysenter_cs;
        u32 host_ia32_sysenter_cs;
        u32 padding32[8]; /* room for future expansion */
        u16 virtual_processor_id;
        u16 guest_es_selector;
        u16 guest_cs_selector;
        u16 guest_ss_selector;
        u16 guest_ds_selector;
        u16 guest_fs_selector;
        u16 guest_gs_selector;
        u16 guest_ldtr_selector;
        u16 guest_tr_selector;
        u16 host_es_selector;
        u16 host_cs_selector;
        u16 host_ss_selector;
        u16 host_ds_selector;
        u16 host_fs_selector;
        u16 host_gs_selector;
        u16 host_tr_selector;
    };
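
Because a layout change can break live migration, it can help to pin field
offsets down at compile time so that an accidental change fails the build.
The following is only a sketch of that idea: struct vmcs12_head is a
hypothetical, truncated copy of the first few fields listed above, with
fixed-width standard types standing in for the kernel's u32 and u64, and the
GCC/Clang packed attribute standing in for __packed.

```c
#include <assert.h>     /* static_assert (C11) */
#include <stddef.h>     /* offsetof, size_t */
#include <stdint.h>

/* Hypothetical, truncated copy of the start of struct vmcs12. */
struct __attribute__((packed)) vmcs12_head {
    uint32_t revision_id;       /* user-visible, per the Intel spec */
    uint32_t abort;             /* user-visible, per the Intel spec */
    uint32_t launch_state;
    uint32_t padding[7];        /* room for future expansion */
    uint64_t io_bitmap_a;
    uint64_t io_bitmap_b;
};

/* Any change that moves one of these fields now fails to compile. */
static_assert(offsetof(struct vmcs12_head, revision_id) == 0,
              "revision_id must be the first field");
static_assert(offsetof(struct vmcs12_head, abort) == 4,
              "abort must directly follow revision_id");
static_assert(offsetof(struct vmcs12_head, launch_state) == 8,
              "launch_state moved");
static_assert(offsetof(struct vmcs12_head, io_bitmap_a) == 40,
              "io_bitmap_a moved");
static_assert(offsetof(struct vmcs12_head, io_bitmap_b) == 48,
              "io_bitmap_b moved");

/* Size of the truncated header, for run-time inspection. */
size_t vmcs12_head_size(void)
{
    return sizeof(struct vmcs12_head);
}
```

The same pattern extends to every field of the full structure. The "room for
future expansion" padding arrays are what allow new fields to be added
without moving the existing ones.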


Authors
-------

These patches were written by:
    - Abel Gordon, abelg <at> il.ibm.com
    - Nadav Har'El, nyh <at> il.ibm.com
    - Orit Wasserman, oritw <at> il.ibm.com
    - Ben-Ami Yassor, benami <at> il.ibm.com
    - Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
    - Anthony Liguori, aliguori <at> us.ibm.com
    - Mike Day, mdday <at> us.ibm.com
    - Michael Factor, factor <at> il.ibm.com
    - Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
    - Avi Kivity, avi <at> redhat.com
    - Gleb Natapov, gleb <at> redhat.com
    - Marcelo Tosatti, mtosatti <at> redhat.com
    - Kevin Tian, kevin.tian <at> intel.com
    - and others.