.. _mmu_notifier:

When do you need to notify inside the page table lock?
======================================================

When clearing a pte/pmd, we can choose to notify the event under the page
table lock (the notify versions of \*_clear_flush call
mmu_notifier_invalidate_range). That notification is not necessary in all
cases.

For secondary TLBs (non-CPU TLBs) such as an IOMMU TLB or a device TLB (when a
device uses something like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space), there are only 2 cases when
you need to notify those secondary TLBs while holding the page table lock when
clearing a pte/pmd:

  A) the page backing the address is freed before
     mmu_notifier_invalidate_range_end()
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious: you do not want to take the risk of the device writing to
a page that might now be used by some completely different task.

Case B is more subtle. For correctness it requires the following sequence to
happen:

  - take page table lock
  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
  - set page table entry to point to new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value, then you can break a memory model like C11 or C++11 for
the device.
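
The sequence above can be sketched as a small userspace toy model. This is
not kernel code: ``struct toy_mm``, ``toy_dev_translate()`` and
``toy_replace_page()`` are hypothetical names invented for illustration, a
boolean stands in for the page table lock, and a single integer stands in for
the device TLB entry. The sketch only assumes that a device reuses a cached
translation until it is told to drop it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, NOT kernel code: one PTE and one device TLB entry. */
#define TOY_NONE (-1)

struct toy_mm {
	bool ptl_held;	/* stands in for the page table lock */
	int pte;	/* page the PTE points to, TOY_NONE when cleared */
	int dev_tlb;	/* translation cached by the device, TOY_NONE if empty */
};

static void toy_mm_init(struct toy_mm *mm, int page)
{
	mm->ptl_held = false;
	mm->pte = page;
	mm->dev_tlb = TOY_NONE;
}

/* Device access: reuse the cached translation, else walk the page table. */
static int toy_dev_translate(struct toy_mm *mm)
{
	if (mm->dev_tlb == TOY_NONE)
		mm->dev_tlb = mm->pte;
	return mm->dev_tlb;
}

/* Replace the mapped page, optionally notifying the device under the lock. */
static void toy_replace_page(struct toy_mm *mm, int new_page, bool notify)
{
	assert(!mm->ptl_held);
	mm->ptl_held = true;		/* take page table lock */

	mm->pte = TOY_NONE;		/* clear page table entry ... */
	if (notify)
		mm->dev_tlb = TOY_NONE;	/* ... and notify the secondary TLB */

	mm->pte = new_page;		/* point the entry at the new page */

	mm->ptl_held = false;		/* release page table lock */
}
```

In this model, calling ``toy_replace_page()`` with ``notify`` set makes the
next ``toy_dev_translate()`` return the new page, while calling it with
``notify`` cleared leaves the device translating to the old page until
something else invalidates its cached entry.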

Consider the following scenario (the device uses a feature similar to
ATS/PASID):

Two addresses addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE; we
assume they are write protected for COW (the other cases of B apply too).

::

 [Time N] --------------------------------------------------------------------
 CPU-thread-0  {try to write to addrA}
 CPU-thread-1  {try to write to addrB}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA and populate device TLB}
 DEV-thread-2  {read addrB and populate device TLB}
 [Time N+1] ------------------------------------------------------------------
 CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
 CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+2] ------------------------------------------------------------------
 CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
 CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {write to addrA which is a write to new page}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+4] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {}
 CPU-thread-3  {write to addrB which is a write to new page}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+5] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+6] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA from old page}
 DEV-thread-2  {read addrB from new page}

So here, because at time N+2 the clearing of the page table entry was not
paired with a notification to invalidate the secondary TLB, the device sees
the new value for addrB before seeing the new value for addrA. This breaks
total memory ordering for the device.
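
The timeline above can be replayed in a single-threaded userspace toy model
to make the ordering problem concrete. Everything in this sketch is
hypothetical and chosen for illustration (``struct toy_state``,
``run_timeline()``, the value 0 for old data and 1 for new data); it is not
kernel code. It assumes that the device keeps using a cached translation
until it is invalidated, and that only the thread handling addrB reaches
mmu_notifier_invalidate_range_end() before the device reads again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of the timeline, NOT kernel code. */
enum { ADDR_A, ADDR_B, NR_ADDR };
enum { OLD_PAGE, NEW_PAGE, NR_PAGE };
#define NO_ENTRY (-1)

struct toy_state {
	int pte[NR_ADDR];		/* page each address currently maps to */
	int dev_tlb[NR_ADDR];		/* device cached translation or NO_ENTRY */
	int data[NR_ADDR][NR_PAGE];	/* content of each page, per address */
};

struct toy_result {
	int a;	/* value the device finally reads at addrA */
	int b;	/* value the device finally reads at addrB */
};

/* Device read: reuse the cached translation, else walk the page table. */
static int dev_read(struct toy_state *s, int addr)
{
	assert(addr >= 0 && addr < NR_ADDR);
	if (s->dev_tlb[addr] == NO_ENTRY)
		s->dev_tlb[addr] = s->pte[addr];
	return s->data[addr][s->dev_tlb[addr]];
}

/* COW_step1: copy the page and repoint the PTE, with or without notify. */
static void cow_step1(struct toy_state *s, int addr, bool notify_under_lock)
{
	s->data[addr][NEW_PAGE] = s->data[addr][OLD_PAGE];
	if (notify_under_lock)
		s->dev_tlb[addr] = NO_ENTRY;	/* invalidate secondary TLB */
	s->pte[addr] = NEW_PAGE;
}

/* CPU write goes through the CPU page table, so it hits the new page. */
static void cpu_write(struct toy_state *s, int addr, int value)
{
	s->data[addr][s->pte[addr]] = value;
}

static struct toy_result run_timeline(bool notify_under_lock)
{
	struct toy_state s = {
		.pte = { OLD_PAGE, OLD_PAGE },
		.dev_tlb = { NO_ENTRY, NO_ENTRY },
		.data = { { 0, 0 }, { 0, 0 } },
	};
	struct toy_result r;

	/* Device reads both addresses and populates its TLB. */
	(void)dev_read(&s, ADDR_A);
	(void)dev_read(&s, ADDR_B);

	/* Both CPU threads update their page table entry. */
	cow_step1(&s, ADDR_A, notify_under_lock);
	cow_step1(&s, ADDR_B, notify_under_lock);

	/* Other CPU threads write addrA first, then addrB. */
	cpu_write(&s, ADDR_A, 1);
	cpu_write(&s, ADDR_B, 1);

	/* Only the addrB thread reaches range_end; the addrA one is preempted. */
	s.dev_tlb[ADDR_B] = NO_ENTRY;

	/* Device reads again. */
	r.a = dev_read(&s, ADDR_A);
	r.b = dev_read(&s, ADDR_B);
	return r;
}
```

In this model ``run_timeline(false)`` ends with the device reading the old
value at addrA and the new value at addrB, even though addrA was written
first. ``run_timeline(true)`` drops both cached translations while the page
table entries are updated, so the device reads the new value at both
addresses.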

When changing a pte to write protect or to point to a new write protected page
with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range
call to mmu_notifier_invalidate_range_end() outside the page table lock. This
is true even if the thread doing the page table update is preempted right after
releasing the page table lock but before calling
mmu_notifier_invalidate_range_end().