.. SPDX-License-Identifier: GPL-2.0

==========================
PAT (Page Attribute Table)
==========================

The x86 Page Attribute Table (PAT) allows memory attributes to be set at page
granularity. PAT is complementary to the MTRR settings, which allow memory
types to be set over physical address ranges. However, PAT is more flexible
than MTRR because it sets attributes at the page level and because there is
no hardware limit on the number of such attribute settings. This added
flexibility comes with a guideline: the same physical memory must not be
mapped with conflicting memory types through multiple virtual addresses
(memory type aliasing).

PAT allows for different types of memory attributes. The most commonly used
ones, which are supported at this time, are:

=== ==============
WB  Write-back
UC  Uncached
WC  Write-combined
WT  Write-through
UC- Uncached Minus
=== ==============


PAT APIs
========

There are many different APIs in the kernel that allow setting memory
attributes at the page level. To avoid aliasing, these interfaces should be
used thoughtfully. The table below lists the available interfaces, their
intended usage, and their memory attribute relationships. Internally, these
APIs use a reserve_memtype()/free_memtype() interface on the physical
address range to avoid any aliasing.

+------------------------+----------+--------------+------------------+
| API                    | RAM      | ACPI,...     | Reserved/Holes   |
+------------------------+----------+--------------+------------------+
| ioremap                | --       | UC-          | UC-              |
+------------------------+----------+--------------+------------------+
| ioremap_cache          | --       | WB           | WB               |
+------------------------+----------+--------------+------------------+
| ioremap_uc             | --       | UC           | UC               |
+------------------------+----------+--------------+------------------+
| ioremap_wc             | --       | --           | WC               |
+------------------------+----------+--------------+------------------+
| ioremap_wt             | --       | --           | WT               |
+------------------------+----------+--------------+------------------+
| set_memory_uc,         | UC-      | --           | --               |
| set_memory_wb          |          |              |                  |
+------------------------+----------+--------------+------------------+
| set_memory_wc,         | WC       | --           | --               |
| set_memory_wb          |          |              |                  |
+------------------------+----------+--------------+------------------+
| set_memory_wt,         | WT       | --           | --               |
| set_memory_wb          |          |              |                  |
+------------------------+----------+--------------+------------------+
| pci sysfs resource     | --       | --           | UC-              |
+------------------------+----------+--------------+------------------+
| pci sysfs resource_wc  | --       | --           | WC               |
| is IORESOURCE_PREFETCH |          |              |                  |
+------------------------+----------+--------------+------------------+
| pci proc               | --       | --           | UC-              |
| !PCIIOC_WRITE_COMBINE  |          |              |                  |
+------------------------+----------+--------------+------------------+
| pci proc               | --       | --           | WC               |
| PCIIOC_WRITE_COMBINE   |          |              |                  |
+------------------------+----------+--------------+------------------+
| /dev/mem               | --       | WB/WC/UC-    | WB/WC/UC-        |
| read-write             |          |              |                  |
+------------------------+----------+--------------+------------------+
| /dev/mem               | --       | UC-          | UC-              |
| mmap SYNC flag         |          |              |                  |
+------------------------+----------+--------------+------------------+
| /dev/mem               | --       |WB/WC/UC-     | WB/WC/UC-        |
| mmap !SYNC flag        |          |              |                  |
| and                    |          |(from existing| (from existing   |
| any alias to this area |          |alias)        | alias)           |
+------------------------+----------+--------------+------------------+
| /dev/mem               | --       | WB           | WB               |
| mmap !SYNC flag        |          |              |                  |
| no alias to this area  |          |              |                  |
| and                    |          |              |                  |
| MTRR says WB           |          |              |                  |
+------------------------+----------+--------------+------------------+
| /dev/mem               | --       | --           | UC-              |
| mmap !SYNC flag        |          |              |                  |
| no alias to this area  |          |              |                  |
| and                    |          |              |                  |
| MTRR says !WB          |          |              |                  |
+------------------------+----------+--------------+------------------+


Advanced APIs for drivers
=========================

A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
vmf_insert_pfn.

Drivers that want to export pages to userspace do so through the mmap
interface and a combination of:

  1) pgprot_noncached()
  2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()

With PAT support, pgprot_writecombine() is available as well. Drivers can
continue to use the above sequence, with either pgprot_noncached() or
pgprot_writecombine() in step 1, followed by step 2.

In addition, step 2 internally tracks the region as UC or WC in the memtype
list in order to ensure there is no conflicting mapping.

Note that this set of APIs only works with IO (non-RAM) regions. If a driver
wants to export a RAM region, it has to call set_memory_uc() or
set_memory_wc() as step 0 before the sequence above, track the usage of those
pages, and call set_memory_wb() before the pages are returned to the free
pool.

MTRR effects on PAT / non-PAT systems
=====================================

The following table provides the effects of using write-combining MTRRs when
using ioremap*() calls on x86 for both non-PAT and PAT systems. Ideally,
mtrr_add() usage will be phased out in favor of arch_phys_wc_add(), which will
be a no-op on PAT-enabled systems. The region over which an arch_phys_wc_add()
call is made should already have been mapped with WC attributes or PAT
entries; this can be done with ioremap_wc() / set_memory_wc(). Devices that
combine areas of IO memory that must remain uncacheable with areas where
write-combining is desirable should consider using ioremap_uc() followed by
set_memory_wc() to whitelist the effective write-combined areas. Such use is
nevertheless discouraged because the effective memory type is considered
implementation defined. Still, this strategy can be used as a last resort on
devices with size-constrained regions where MTRR write-combining would
otherwise not be effective.
::

  ==== ======= === ========================= =====================
  MTRR Non-PAT PAT Linux ioremap value       Effective memory type
  ==== ======= === ========================= =====================
       PAT                                   Non-PAT | PAT
       |PCD
       ||PWT
       |||
  WC   000     WB  _PAGE_CACHE_MODE_WB       WC      | WC
  WC   001     WC  _PAGE_CACHE_MODE_WC       WC*     | WC
  WC   010     UC- _PAGE_CACHE_MODE_UC_MINUS WC*     | UC
  WC   011     UC  _PAGE_CACHE_MODE_UC       UC      | UC
  ==== ======= === ========================= =====================

  (*) denotes implementation defined and is discouraged

.. note:: -- in the PAT APIs table means "Not suggested usage for the API".
   Some of the --'s are strictly enforced by the kernel. Others are not
   enforced today, but may be enforced in the future.

For ioremap and PCI access through /sys or /proc, the actual type returned
can be more restrictive if there is any existing aliasing for that address.
For example, if there is an existing uncached mapping, a new ioremap_wc can
return an uncached mapping in place of the requested write-combined one.

set_memory_[uc|wc|wt] and set_memory_wb should be used in pairs: a driver
first makes a region uc, wc or wt and switches it back to wb after use.

Over time, writes to /proc/mtrr will be deprecated in favor of PAT-based
interfaces. Users writing to /proc/mtrr are encouraged to use the interfaces
above.

Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access
types.

Drivers should use set_memory_[uc|wc|wt] to set access type for RAM ranges.


PAT debugging
=============

With CONFIG_DEBUG_FS enabled, the PAT memtype list can be examined by::

  # mount -t debugfs debugfs /sys/kernel/debug
  # cat /sys/kernel/debug/x86/pat_memtype_list
  PAT memtype list:
  uncached-minus @ 0x7fadf000-0x7fae0000
  uncached-minus @ 0x7fb19000-0x7fb1a000
  uncached-minus @ 0x7fb1a000-0x7fb1b000
  uncached-minus @ 0x7fb1b000-0x7fb1c000
  uncached-minus @ 0x7fb1c000-0x7fb1d000
  uncached-minus @ 0x7fb1d000-0x7fb1e000
  uncached-minus @ 0x7fb1e000-0x7fb25000
  uncached-minus @ 0x7fb25000-0x7fb26000
  uncached-minus @ 0x7fb26000-0x7fb27000
  uncached-minus @ 0x7fb27000-0x7fb28000
  uncached-minus @ 0x7fb28000-0x7fb2e000
  uncached-minus @ 0x7fb2e000-0x7fb2f000
  uncached-minus @ 0x7fb2f000-0x7fb30000
  uncached-minus @ 0x7fb31000-0x7fb32000
  uncached-minus @ 0x80000000-0x90000000

This list shows physical address ranges and various PAT settings used to
access those physical address ranges.

Another, more verbose way of getting PAT-related debug messages is the
"debugpat" boot parameter. With this parameter, various debug messages are
printed to the dmesg log.

PAT Initialization
==================

The following table describes how PAT is initialized under various
configurations. The PAT MSR must be updated by Linux in order to support WC
and WT attributes. Otherwise, the PAT MSR keeps the value programmed into it
by the firmware. Note that Xen enables the WC attribute in the PAT MSR for
guests.

==== ===== ========================== ========= =======
MTRR PAT   Call Sequence              PAT State PAT MSR
==== ===== ========================== ========= =======
E    E     MTRR -> PAT init           Enabled   OS
E    D     MTRR -> PAT init           Disabled  -
D    E     MTRR -> PAT disable        Disabled  BIOS
D    D     MTRR -> PAT disable        Disabled  -
-    np/E  PAT -> PAT disable         Disabled  BIOS
-    np/D  PAT -> PAT disable         Disabled  -
E    !P/E  MTRR -> PAT init           Disabled  BIOS
D    !P/E  MTRR -> PAT disable        Disabled  BIOS
!M   !P/E  MTRR stub -> PAT disable   Disabled  BIOS
==== ===== ========================== ========= =======

Legend

========= =======================================
E         Feature enabled in CPU
D         Feature disabled/unsupported in CPU
np        "nopat" boot option specified
!P        CONFIG_X86_PAT option unset
!M        CONFIG_MTRR option unset
Enabled   PAT state set to enabled
Disabled  PAT state set to disabled
OS        PAT initializes PAT MSR with OS setting
BIOS      PAT keeps PAT MSR with BIOS setting
========= =======================================