KVM-specific MSRs.
Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
=====================================================

KVM makes use of some custom MSRs to service some requests.

Custom MSRs have a range reserved for them, which goes from
0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
but they are deprecated and their use is discouraged.

Custom MSR list
---------------

The currently supported custom MSRs are:

MSR_KVM_WALL_CLOCK_NEW:   0x4b564d00

	data: 4-byte aligned physical address of a memory area which must be
	in guest RAM. This memory is expected to hold a copy of the following
	structure:

	struct pvclock_wall_clock {
		u32   version;
		u32   sec;
		u32   nsec;
	} __attribute__((__packed__));

	whose data will be filled in by the hypervisor. The hypervisor is only
	guaranteed to update this data at the moment of MSR write.
	Users that want to reliably query this information more than once have
	to write more than once to this MSR. Fields have the following meanings:

		version: guest has to check version before and after grabbing
		time information and check that they are both equal and even.
		An odd version indicates an in-progress update.

		sec: number of seconds for wallclock at time of boot.

		nsec: number of nanoseconds for wallclock at time of boot.

	In order to get the current wallclock time, the system_time from
	MSR_KVM_SYSTEM_TIME_NEW needs to be added.

	Note that although MSRs are per-CPU entities, the effect of this
	particular MSR is global.

	Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
	leaf prior to usage.
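
	The version check described above can be sketched as a retry loop.
	This is only an illustration: wall_clock_read() is a hypothetical
	helper name, the C11 fences stand in for whatever read barriers a
	real guest would use, and the MSR write that triggers the update is
	not shown.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Layout copied from the structure documented above. */
struct pvclock_wall_clock {
	uint32_t version;
	uint32_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

/*
 * Hypothetical helper: copy out a consistent (sec, nsec) pair.
 * Retries while the version is odd (update in progress) or while it
 * changes between the first and the second read.
 */
static void wall_clock_read(const volatile struct pvclock_wall_clock *wc,
			    uint32_t *sec, uint32_t *nsec)
{
	uint32_t before, after;

	do {
		before = wc->version;
		/* assumption: a fence keeps the payload reads after this load */
		atomic_thread_fence(memory_order_acquire);
		*sec = wc->sec;
		*nsec = wc->nsec;
		/* assumption: a fence keeps the payload reads before the re-check */
		atomic_thread_fence(memory_order_acquire);
		after = wc->version;
	} while ((before & 1) || before != after);
}
```

	The result is the wallclock at boot; the system_time value still has
	to be added to obtain the current wallclock time.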

MSR_KVM_SYSTEM_TIME_NEW:  0x4b564d01

	data: 4-byte aligned physical address of a memory area which must be in
	guest RAM, plus an enable bit in bit 0. This memory is expected to hold
	a copy of the following structure:

	struct pvclock_vcpu_time_info {
		u32   version;
		u32   pad0;
		u64   tsc_timestamp;
		u64   system_time;
		u32   tsc_to_system_mul;
		s8    tsc_shift;
		u8    flags;
		u8    pad[2];
	} __attribute__((__packed__)); /* 32 bytes */

	whose data will be filled in by the hypervisor periodically. Only one
	write, or registration, is needed for each VCPU. The interval between
	updates of this structure is arbitrary and implementation-dependent.
	The hypervisor may update this structure at any time it sees fit until
	a value with bit 0 == 0 is written to the MSR.

	Fields have the following meanings:

		version: guest has to check version before and after grabbing
		time information and check that they are both equal and even.
		An odd version indicates an in-progress update.

		tsc_timestamp: the tsc value at the current VCPU at the time
		of the update of this structure. Guests can subtract this value
		from current tsc to derive a notion of elapsed time since the
		structure update.

		system_time: a host notion of monotonic time, including sleep
		time at the time this structure was last updated. Unit is
		nanoseconds.

		tsc_to_system_mul: multiplier to be used when converting
		tsc-related quantity to nanoseconds.

		tsc_shift: shift to be used when converting tsc-related
		quantity to nanoseconds. This shift will ensure that
		multiplication with tsc_to_system_mul does not overflow.
		A positive value denotes a left shift, a negative value
		a right shift.

		The conversion from tsc to nanoseconds involves an additional
		right shift by 32 bits. With this information, guests can
		derive per-CPU time by doing:

			time = (current_tsc - tsc_timestamp);
			if (tsc_shift >= 0)
				time <<= tsc_shift;
			else
				time >>= -tsc_shift;
			time = (time * tsc_to_system_mul) >> 32;
			time = time + system_time;

		flags: bits in this field indicate extended capabilities
		coordinated between the guest and the hypervisor. Availability
		of specific flags has to be checked in 0x40000001 cpuid leaf.
		Current flags are:

		 flag bit   | cpuid bit    | meaning
		-------------------------------------------------------------
		            |              | time measures taken across
		     0      |      24      | multiple cpus are guaranteed to
		            |              | be monotonic
		-------------------------------------------------------------
		            |              | guest vcpu has been paused by
		     1      |     N/A      | the host
		            |              | See 4.70 in api.txt
		-------------------------------------------------------------

	Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
	leaf prior to usage.
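
	The tsc-to-nanoseconds conversion given for this structure can be
	written as a small C sketch. The function names are hypothetical,
	and the 128-bit intermediate (a GCC/Clang extension) is a
	conservative choice made here so that the product can never wrap;
	it is not something this document mandates. The version check is
	omitted for brevity.

```c
#include <stdint.h>

/* Scale a tsc delta to nanoseconds using tsc_shift and tsc_to_system_mul. */
static uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift >= 0)
		delta <<= shift;	/* positive value: left shift */
	else
		delta >>= -shift;	/* negative value: right shift */

	/* multiply, then apply the additional right shift by 32 bits */
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Per-CPU time in nanoseconds for a given tsc reading (hypothetical helper). */
static uint64_t pvclock_ns(uint64_t current_tsc, uint64_t tsc_timestamp,
			   uint64_t system_time, uint32_t mul, int8_t shift)
{
	return system_time +
	       pvclock_scale_delta(current_tsc - tsc_timestamp, mul, shift);
}
```

	For example, with tsc_to_system_mul = 0x80000000 and tsc_shift = 0,
	each tsc tick counts as half a nanosecond, so a delta of 1000 ticks
	adds 500 ns to system_time.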


MSR_KVM_WALL_CLOCK:  0x11

	data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.

	This MSR falls outside the reserved KVM range and may be removed in the
	future. Its usage is deprecated.

	Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
	leaf prior to usage.

MSR_KVM_SYSTEM_TIME: 0x12

	data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.

	This MSR falls outside the reserved KVM range and may be removed in the
	future. Its usage is deprecated.

	Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
	leaf prior to usage.

	The suggested algorithm for detecting kvmclock presence is then:

		if (!kvm_para_available())    /* refer to cpuid.txt */
			return NON_PRESENT;

		flags = cpuid_eax(0x40000001);
		if (flags & (1 << 3)) {		/* bit 3: new MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
			return PRESENT;
		} else if (flags & (1 << 0)) {	/* bit 0: deprecated MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
			return PRESENT;
		} else
			return NON_PRESENT;

MSR_KVM_ASYNC_PF_EN: 0x4b564d02
	data: Bits 63-6 hold a 64-byte aligned physical address of a
	64 byte memory area which must be in guest RAM and must be
	zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
	when asynchronous page faults are enabled on the vcpu, 0 when
	disabled. Bit 1 is 1 if asynchronous page faults can be injected
	when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
	are delivered to L1 as #PF vmexits. Bit 2 can be set only if
	KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
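
	As an illustration of the bit layout above, the value written to
	this MSR could be assembled as follows. The macro and function
	names are made up for this sketch and are not taken from any
	header; the caller is assumed to have checked
	KVM_FEATURE_ASYNC_PF_VMEXIT before asking for bit 2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names for the bits described above. */
#define APF_ENABLED		(1ULL << 0)	/* bit 0: APF enabled on this vcpu */
#define APF_SEND_ALWAYS		(1ULL << 1)	/* bit 1: may inject at cpl == 0 */
#define APF_DELIVERY_VMEXIT	(1ULL << 2)	/* bit 2: deliver to L1 as #PF vmexit */

/*
 * Build the 64-bit value that enables APF for a given memory area.
 * Returns false if the physical address is not 64-byte aligned, which
 * also guarantees that reserved bits 5-3 stay zero.
 */
static bool apf_msr_value(uint64_t area_pa, bool cpl0, bool vmexit,
			  uint64_t *val)
{
	if (area_pa & 63)
		return false;

	*val = area_pa | APF_ENABLED;
	if (cpl0)
		*val |= APF_SEND_ALWAYS;
	if (vmexit)
		*val |= APF_DELIVERY_VMEXIT;
	return true;
}
```

	Writing 0 to the MSR (bit 0 clear) disables the feature again.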

	The first 4 bytes of the 64 byte memory location will be written
	to by the hypervisor at the time of asynchronous page fault (APF)
	injection to indicate the type of asynchronous page fault. A value
	of 1 means that the page referred to by the page fault is not
	present. A value of 2 means that the page is now available.
	Disabling interrupts inhibits APFs. The guest must not enable
	interrupts before the reason is read, or it may be overwritten by
	another APF. Since APF uses the same exception vector as a regular
	page fault, the guest must reset the reason to 0 before it does
	something that can generate a normal page fault. If the APF reason
	is 0 during a page fault, it is a regular page fault.

	During delivery of a type 1 APF, cr2 contains a token that will
	be used to notify the guest when the missing page becomes
	available. When the page becomes available, a type 2 APF is sent
	with cr2 set to the token associated with the page. There is a
	special token, 0xffffffff, which tells the vcpu that it should wake
	up all processes waiting for APFs and that no individual type 2
	APFs will be sent.

	If APF is disabled while there are outstanding APFs, they will
	not be delivered.

	Currently a type 2 APF is always delivered on the same vcpu as
	the type 1 APF was, but the guest should not rely on that.
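
	The reason/token protocol above might be handled along these
	lines. This is a sketch only: the enum, the macro names and
	apf_classify() are hypothetical, the token is assumed to have been
	read from cr2 by the caller, and unknown reason values are simply
	treated as a regular page fault here.

```c
#include <stdint.h>

/* Illustrative names for the values described above. */
#define APF_REASON_NOT_PRESENT	1u
#define APF_REASON_PAGE_READY	2u
#define APF_TOKEN_WAKE_ALL	0xffffffffu

enum apf_action {
	APF_REGULAR_FAULT,	/* reason 0: handle as an ordinary #PF */
	APF_WAIT,		/* type 1: wait until the token is signalled */
	APF_WAKE_ONE,		/* type 2: wake the waiter for this token */
	APF_WAKE_ALL,		/* type 2 with the special token */
};

/*
 * Meant to run with interrupts still disabled: read the reason word,
 * reset it to 0 so that a later ordinary page fault is not misread,
 * then decide what to do.
 */
static enum apf_action apf_classify(volatile uint32_t *reason, uint32_t token)
{
	uint32_t r = *reason;

	*reason = 0;

	switch (r) {
	case APF_REASON_NOT_PRESENT:
		return APF_WAIT;
	case APF_REASON_PAGE_READY:
		return token == APF_TOKEN_WAKE_ALL ? APF_WAKE_ALL : APF_WAKE_ONE;
	default:
		return APF_REGULAR_FAULT;
	}
}
```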

MSR_KVM_STEAL_TIME: 0x4b564d03

	data: 64-byte aligned physical address of a memory area which must be
	in guest RAM, plus an enable bit in bit 0. This memory is expected to
	hold a copy of the following structure:

	struct kvm_steal_time {
		__u64 steal;
		__u32 version;
		__u32 flags;
		__u8  preempted;
		__u8  u8_pad[3];
		__u32 pad[11];
	};

	whose data will be filled in by the hypervisor periodically. Only one
	write, or registration, is needed for each VCPU. The interval between
	updates of this structure is arbitrary and implementation-dependent.
	The hypervisor may update this structure at any time it sees fit until
	a value with bit 0 == 0 is written to the MSR. The guest is required
	to make sure this structure is initialized to zero.

	Fields have the following meanings:

		version: a sequence counter. In other words, guest has to check
		this field before and after grabbing time information and make
		sure they are both equal and even. An odd version indicates an
		in-progress update.

		flags: At this point, always zero. May be used to indicate
		changes in this structure in the future.

		steal: the amount of time in which this vCPU did not run, in
		nanoseconds. Time during which the vcpu is idle will not be
		reported as steal time.

		preempted: indicates whether the vCPU that owns this struct is
		running or not. Non-zero values mean the vCPU has been
		preempted. Zero means the vCPU is not preempted. NOTE: it is
		always zero if the hypervisor doesn't support this field.
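
	A guest-side reader for this structure might look like the sketch
	below. The helper names are hypothetical, the fixed-width types
	stand in for the __u64/__u32/__u8 types above, and the C11 fences
	are an assumption standing in for the guest's own read barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Same layout as the structure documented above. */
struct kvm_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* The area is 64 bytes, matching the 64-byte alignment of its address. */
_Static_assert(sizeof(struct kvm_steal_time) == 64, "unexpected layout");

/* Return a consistent snapshot of the steal counter, in nanoseconds. */
static uint64_t steal_time_read(const volatile struct kvm_steal_time *st)
{
	uint32_t before, after;
	uint64_t steal;

	do {
		before = st->version;
		atomic_thread_fence(memory_order_acquire);
		steal = st->steal;
		atomic_thread_fence(memory_order_acquire);
		after = st->version;
	} while ((before & 1) || before != after);	/* odd or changed: retry */

	return steal;
}

/* Non-zero preempted means the owning vCPU has been preempted. */
static bool vcpu_is_preempted(const volatile struct kvm_steal_time *st)
{
	return st->preempted != 0;
}
```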

MSR_KVM_EOI_EN: 0x4b564d04
	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
	when disabled. Bit 1 is reserved and must be zero. When PV end of
	interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
	physical address of a 4 byte memory area which must be in guest RAM and
	must be zeroed.

	The least significant bit of the 4 byte memory location will be
	written to by the hypervisor, typically at the time of interrupt
	injection. A value of 1 means that the guest can skip writing EOI to
	the APIC (using MSR or MMIO write); instead, it is sufficient to
	signal EOI by clearing the bit in guest memory. This location will
	later be polled by the hypervisor.
	A value of 0 means that the EOI write is required.

	It is always safe for the guest to ignore the optimization and perform
	the APIC EOI write anyway.

	The hypervisor is guaranteed to only modify this least
	significant bit while in the current VCPU context. This means that
	the guest does not need to use either a lock prefix or memory
	ordering primitives to synchronise with the hypervisor.

	However, the hypervisor can set and clear this memory bit at any
	time. Therefore, to make sure the hypervisor does not interrupt the
	guest and clear the least significant bit in the memory area
	in the window between the guest testing it to detect
	whether it can skip the EOI APIC write and the guest
	clearing it to signal EOI to the hypervisor,
	the guest must both read the least significant bit in the memory
	area and clear it using a single CPU instruction, such as test and
	clear, or compare and exchange.
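
	The combined test and clear can be approximated in portable C11 as
	shown below. Note the assumptions: pv_eoi_try_signal() is a
	hypothetical name, and a C11 atomic read-modify-write is stronger
	than what the text above requires, since the compiler may emit a
	locked instruction or a compare-and-exchange loop. A real guest
	would more likely use a non-locked bit test-and-reset instruction.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if EOI was signalled by clearing bit 0 in guest memory,
 * false if the caller still has to perform the APIC EOI write.
 */
static bool pv_eoi_try_signal(_Atomic uint32_t *pv_eoi)
{
	/* one read-modify-write: fetch the old word and clear bit 0 */
	uint32_t old = atomic_fetch_and_explicit(pv_eoi, ~UINT32_C(1),
						 memory_order_relaxed);

	return (old & 1) != 0;
}
```

	A caller would fall back to the normal APIC EOI write whenever the
	function returns false.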