KVM Lock Overview
=================

1. Acquisition Orders
---------------------

(to be written)

2. Exception
------------

Fast page fault:

Fast page fault is the fast path that fixes a guest page fault without
holding the mmu-lock on x86. Currently, a page fault can take the fast path
only if the shadow page table entry is present and the fault is caused by
write-protection, which means we only need to change the W bit of the spte.

Two bits on the spte are used to avoid all the races, SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE:
- SPTE_HOST_WRITEABLE means the gfn is writable on the host.
- SPTE_MMU_WRITEABLE means the gfn is writable on the mmu. The bit is set when
  the gfn is writable on the guest mmu and it is not write-protected by shadow
  page write-protection.

On the fast page fault path, we use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1. This
is safe because any change to these bits is detected by cmpxchg.
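
The lockless update can be sketched in userspace C11 as follows. This is an
illustration only, not the kernel code: the bit positions and the helper names
spte_can_fast_fix() and fast_fix_write_fault() are hypothetical, and the sketch
assumes the two bits checked are the SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE bits described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions (assumption); the real layout lives in the KVM MMU code. */
#define SPTE_W               (1ull << 1)
#define SPTE_HOST_WRITEABLE  (1ull << 10)
#define SPTE_MMU_WRITEABLE   (1ull << 11)

/* True if both software-writable bits are set, i.e. only W is missing. */
static bool spte_can_fast_fix(uint64_t spte)
{
    uint64_t mask = SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE;

    return (spte & mask) == mask;
}

/*
 * Try to set W without holding any lock. Returns true on success, false if
 * the fast path does not apply or the spte changed after it was read.
 */
static bool fast_fix_write_fault(_Atomic uint64_t *sptep)
{
    assert(sptep);

    uint64_t old = atomic_load(sptep);

    if (!spte_can_fast_fix(old))
        return false;

    /* Succeeds only if *sptep still holds the value inspected above. */
    return atomic_compare_exchange_strong(sptep, &old, old | SPTE_W);
}
```

If a concurrent update clears either bit between the load and the
compare-and-exchange, the exchange fails and the fault falls back to the slow
path under mmu-lock.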

But we need to carefully check these cases:
1): The mapping from gfn to pfn
The mapping from gfn to pfn may change, since cmpxchg can only ensure that the
pfn in the spte is unchanged. This is an ABA problem; for example, the
following case can happen:

At the beginning:
gpte = gfn1
gfn1 is mapped to pfn1 on host
spte is the shadow page table entry corresponding with gpte and
spte = pfn1

   VCPU 0                           VCPU 1
on fast page fault path:

   old_spte = *spte;
                                 pfn1 is swapped out:
                                    spte = 0;

                                 pfn1 is re-allocated for gfn2.

                                 gpte is changed to point to
                                 gfn2 by the guest:
                                    spte = pfn1;

   if (cmpxchg(spte, old_spte, old_spte+W))
        mark_page_dirty(vcpu->kvm, gfn1)
             OOPS!!!

We dirty-log gfn1, which means the write to gfn2 is lost in the dirty bitmap.
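
The interleaving above can be replayed in program order in userspace C11 to
show why cmpxchg alone cannot detect the change. This is a sketch under stated
assumptions: the spte layout, the pfn value, and the names make_spte() and
aba_replay() are hypothetical, and the second CPU is simulated by plain stores
in the same thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_W     (1ull << 1)   /* illustrative W bit (assumption) */
#define PFN_SHIFT  12            /* illustrative pfn position (assumption) */

/* Build a read-only spte that points at the given pfn. */
static uint64_t make_spte(uint64_t pfn)
{
    return pfn << PFN_SHIFT;
}

/*
 * Replay the race in program order. Returns true if the cmpxchg succeeds
 * even though the spte now maps a different gfn.
 */
static bool aba_replay(void)
{
    const uint64_t pfn1 = 0x1234;
    _Atomic uint64_t spte = make_spte(pfn1);   /* maps gfn1 -> pfn1 */

    /* Fast path: snapshot the spte. */
    uint64_t old_spte = atomic_load(&spte);

    /* Other CPU: pfn1 is swapped out, so the spte is zapped. */
    atomic_store(&spte, 0);

    /* Other CPU: pfn1 is reused for gfn2 and the spte is rebuilt. */
    atomic_store(&spte, make_spte(pfn1));      /* now maps gfn2 -> pfn1 */

    /* Fast path: the bit pattern matches again, so cmpxchg cannot tell. */
    return atomic_compare_exchange_strong(&spte, &old_spte,
                                          old_spte | SPTE_W);
}
```

The exchange succeeds because the spte holds the same bit pattern before and
after, although the gfn it stands for has changed in between.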

For a direct sp, we can easily avoid this since the spte of a direct sp is
fixed to the gfn. For an indirect sp, before we do cmpxchg, we call
gfn_to_pfn_atomic() to pin the gfn to the pfn, because after
gfn_to_pfn_atomic():
- We hold a reference on the pfn, which means the pfn cannot be freed and
  reused for another gfn.
- The pfn is writable, which means it cannot be shared between different gfns
  by KSM.

Then we can ensure the dirty bitmap is correctly set for the gfn.

Currently, to keep things simple, we disable fast page fault for
indirect shadow pages.

2): Dirty bit tracking
In the original code, the spte can be updated quickly (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since in that
case the Accessed bit and Dirty bit cannot be lost.

But this no longer holds with fast page fault, since the spte can be marked
writable between reading the spte and updating it, as in the case below:

At the beginning:
spte.W = 0
spte.Accessed = 1

   VCPU 0                                       VCPU 1
In mmu_spte_clear_track_bits():

   old_spte = *spte;

   /* 'if' condition is satisfied. */
   if (old_spte.Accessed == 1 &&
        old_spte.W == 0)
      spte = 0ull;
                                         on fast page fault path:
                                             spte.W = 1
                                         memory write on the spte:
                                             spte.Dirty = 1


   else
      old_spte = xchg(spte, 0ull)


   if (old_spte.Accessed == 1)
      kvm_set_pfn_accessed(spte.pfn);
   if (old_spte.Dirty == 1)
      kvm_set_pfn_dirty(spte.pfn);
      OOPS!!!

The Dirty bit is lost in this case.

To avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock (see spte_has_volatile_bits()), which
means the spte is always updated atomically in this case.
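
A simplified userspace sketch of the idea, in C11: when the spte may be
modified without mmu-lock, it is cleared with an atomic exchange so that the
value used for Accessed/Dirty tracking is the one that was really in memory.
The bit positions and the names spte_is_volatile() and clear_track_bits() are
hypothetical, and the volatility test here covers only the lockless-writable
case; it is not a copy of spte_has_volatile_bits().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions (assumption). */
#define SPTE_W               (1ull << 1)
#define SPTE_ACCESSED        (1ull << 5)
#define SPTE_DIRTY           (1ull << 6)
#define SPTE_HOST_WRITEABLE  (1ull << 10)
#define SPTE_MMU_WRITEABLE   (1ull << 11)

/* An spte that the lockless fast path may make writable at any moment. */
static bool spte_is_volatile(uint64_t spte)
{
    uint64_t mask = SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE;

    return (spte & mask) == mask;
}

/*
 * Clear the spte and return the value it held, so the caller can propagate
 * the Accessed and Dirty bits from the returned value.
 */
static uint64_t clear_track_bits(_Atomic uint64_t *sptep)
{
    assert(sptep);

    uint64_t old = atomic_load(sptep);

    if (!spte_is_volatile(old)) {
        /* Nobody can change it behind our back: the snapshot is accurate. */
        atomic_store(sptep, 0);
        return old;
    }

    /* Volatile: fetch and clear in one atomic step so no bit is lost. */
    return atomic_exchange(sptep, 0);
}
```

With the exchange, a W or Dirty bit set by another CPU just before the clear
is still visible in the returned value.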

3): Flush TLBs due to spte updates
If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached in a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path. To make the path easy to audit, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since this
is a common function to update the spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
update the spte atomically, so the race caused by fast page fault is avoided.
See the comments in spte_has_volatile_bits() and mmu_spte_update().
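
The flush decision can be sketched in userspace C11 as below. This is an
assumption-laden illustration rather than the body of mmu_spte_update(): the
bit positions and the name spte_update_needs_flush() are hypothetical. The
point is that the old value is obtained by an atomic exchange, so a W bit set
by the fast path just before the update is still seen.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions (assumption). */
#define SPTE_PRESENT  (1ull << 0)
#define SPTE_W        (1ull << 1)

/*
 * Install new_spte (present -> present) and report whether the caller must
 * flush TLBs, i.e. whether the entry went from writable to read-only.
 */
static bool spte_update_needs_flush(_Atomic uint64_t *sptep, uint64_t new_spte)
{
    assert(sptep);

    /* The exchange returns the value that was really in the spte. */
    uint64_t old = atomic_exchange(sptep, new_spte);

    return (old & SPTE_W) && !(new_spte & SPTE_W);
}
```

A caller would flush remote TLBs whenever the function returns true, since a
writable translation may still be cached on another CPU.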

3. Reference
------------

Name:           kvm_lock
Type:           spinlock_t
Arch:           any
Protects:       - vm_list

Name:           kvm_count_lock
Type:           raw_spinlock_t
Arch:           any
Protects:       - hardware virtualization enable/disable
Comment:        'raw' because hardware enabling/disabling must be atomic
                with respect to migration.

Name:           kvm_arch::tsc_write_lock
Type:           raw_spinlock_t
Arch:           x86
Protects:       - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
                - tsc offset in vmcb
Comment:        'raw' because updating the tsc offsets must not be preempted.

Name:           kvm->mmu_lock
Type:           spinlock_t
Arch:           any
Protects:       - shadow page/shadow tlb entry
Comment:        It is a spinlock since it is used in the mmu notifier.

Name:           kvm->srcu
Type:           srcu lock
Arch:           any
Protects:       - kvm->memslots
                - kvm->buses
Comment:        The srcu read lock must be held while accessing memslots (e.g.
                when using gfn_to_* functions) and while accessing the
                in-kernel MMIO/PIO address->device structure mapping
                (kvm->buses). The srcu index can be stored in
                kvm_vcpu->srcu_idx per vcpu if it is needed by multiple
                functions.