.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
for the CPU. Then there could be several contiguous ranges at
completely distinct addresses. And, don't forget about NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of three memory models:
FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

.. note::
   At the time of this writing, DISCONTIGMEM is considered deprecated,
   although it is still in use by several architectures.

All the memory models track the status of physical page frames using
:c:type:`struct page` objects arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture specific setup code should
call the :c:func:`free_area_init` function. Yet, the mappings array is not
usable until the call to :c:func:`memblock_free_all` that hands all the
memory to the page allocator.

If an architecture enables the `CONFIG_ARCH_HAS_HOLES_MEMORYMODEL` option,
it may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the
`mem_map` array.

`ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.
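
This arithmetic can be sketched in plain C. The snippet below is an
illustrative model, not the kernel implementation: the reduced
`struct page`, the array size and the `ARCH_PFN_OFFSET` value are
hypothetical stand-ins.

```c
/* Reduced stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Hypothetical values: the first valid PFN and a small mem_map. */
#define ARCH_PFN_OFFSET	0x100UL
static struct page mem_map[1024];

/* FLATMEM: PFN - ARCH_PFN_OFFSET indexes the global mem_map array. */
static struct page *pfn_to_page(unsigned long pfn)
{
	return &mem_map[pfn - ARCH_PFN_OFFSET];
}

/* The reverse conversion is the offset in mem_map plus the base PFN. */
static unsigned long page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

Both directions are a single addition or subtraction, which is why
FLATMEM has the cheapest conversions of all the models.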

DISCONTIGMEM
============

The DISCONTIGMEM model treats the physical memory as a collection of
`nodes` similarly to how Linux NUMA support does. For each node Linux
constructs an independent memory management subsystem represented by
`struct pglist_data` (or `pg_data_t` for short). Among other
things, `pg_data_t` holds the `node_mem_map` array that maps
physical pages belonging to that node. The `node_start_pfn` field of
`pg_data_t` is the number of the first page frame belonging to that
node.

The architecture setup code should call :c:func:`free_area_init_node` for
each node in the system to initialize the `pg_data_t` object and its
`node_mem_map`.

Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
every physical page frame in a node has a `struct page` entry in the
`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
`flags` field of the `struct page` encodes the node number of the
node hosting that page.

The conversion between a PFN and the `struct page` in the
DISCONTIGMEM model is slightly more complex as it has to determine
which node hosts the physical page and which `pg_data_t` object
holds the `struct page`.

Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
to convert a PFN to the node number. The opposite conversion helper
:c:func:`page_to_nid` is generic as it uses the node number encoded in
`page->flags`.

Once the node number is known, the PFN can be used to index the
appropriate `node_mem_map` array to access the `struct page`, and
the offset of the `struct page` from the `node_mem_map` plus
`node_start_pfn` is the PFN of that page.
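
The two-step lookup can be sketched as follows. This is a toy model
under loud assumptions: two hypothetical nodes, made-up PFN ranges, and
an invented node-number encoding in the top bits of `flags` (the real
encoding and the real `pfn_to_nid` are architecture specific).

```c
/* Reduced stand-ins for the kernel's structures. */
struct page {
	unsigned long flags;	/* high bits encode the node number here */
};

typedef struct pglist_data {
	struct page *node_mem_map;
	unsigned long node_start_pfn;
} pg_data_t;

/* Two hypothetical nodes with a hole between their PFN ranges. */
static struct page node0_pages[256], node1_pages[256];
static pg_data_t node_data[2] = {
	{ node0_pages, 0x000UL },
	{ node1_pages, 0x400UL },
};

/* Architecture specific in the kernel; here a trivial range check. */
static int pfn_to_nid(unsigned long pfn)
{
	return pfn >= node_data[1].node_start_pfn ? 1 : 0;
}

/* Step 1: find the node; step 2: index its node_mem_map. */
static struct page *pfn_to_page(unsigned long pfn)
{
	pg_data_t *pgdat = &node_data[pfn_to_nid(pfn)];

	return pgdat->node_mem_map + (pfn - pgdat->node_start_pfn);
}

/* Hypothetical encoding: node number in the top 8 bits of flags. */
static int page_to_nid(const struct page *page)
{
	return (int)(page->flags >> 56);
}

static unsigned long page_to_pfn(const struct page *page)
{
	pg_data_t *pgdat = &node_data[page_to_nid(page)];

	return (unsigned long)(page - pgdat->node_mem_map) +
		pgdat->node_start_pfn;
}
```

Compared with FLATMEM, every conversion pays for an extra node lookup
before the array arithmetic.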

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with :c:type:`struct mem_section`
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids the sections management. The section size and maximal number
of sections are specified using the `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a
physical address that an architecture supports, the
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains `PAGE_SIZE` worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.
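
The relationship between these constants can be illustrated with a
small calculation. The values below (46 physical address bits, 128 MiB
sections, 4 KiB pages, an 8-byte `mem_section`) are hypothetical
examples, not any particular architecture's configuration.

```c
/* Hypothetical per-architecture constants. */
#define MAX_PHYSMEM_BITS	46	/* width of a physical address */
#define SECTION_SIZE_BITS	27	/* 128 MiB sections */
#define PAGE_SIZE		4096UL

/* Reduced stand-in for the kernel's struct mem_section. */
struct mem_section {
	unsigned long section_mem_map;
};

/* One section covers 2^SECTION_SIZE_BITS bytes, so the number of
 * sections is the addressable space divided by the section size. */
#define NR_MEM_SECTIONS \
	(1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))

/* With CONFIG_SPARSEMEM_EXTREME, each row of mem_sections holds
 * PAGE_SIZE worth of mem_section objects, and just enough rows are
 * allocated to cover all sections. */
#define SECTIONS_PER_ROW	(PAGE_SIZE / sizeof(struct mem_section))
#define NR_SECTION_ROWS		(NR_MEM_SECTIONS / SECTIONS_PER_ROW)
```

With these example values there are 2^19 = 524288 sections; a 4 KiB row
holds 512 of the 8-byte descriptors, so 1024 rows cover them all.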

The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page` - a "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in `page->flags`
and uses high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index into the array of pages.
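
The split of a PFN into a section number and an in-section index can be
sketched as below. The constants are the same hypothetical examples as
above (128 MiB sections, 4 KiB pages), not real per-architecture values.

```c
/* Hypothetical constants; real values are per-architecture. */
#define SECTION_SIZE_BITS	27	/* 128 MiB sections */
#define PAGE_SHIFT		12	/* 4 KiB pages */

/* Number of PFN bits covered by a single section. */
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)

/* The high bits of the PFN select the section... */
static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

/* ...and the low bits index the section's array of pages. */
static unsigned long pfn_index_in_section(unsigned long pfn)
{
	return pfn & (PAGES_PER_SECTION - 1);
}
```

The conversion therefore costs a section lookup on top of the array
indexing, which is what sparse vmemmap (below) optimizes away.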

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index into that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.
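
With vmemmap both conversions reduce to plain pointer arithmetic, as the
sketch below shows. The static backing array is a hypothetical stand-in
for the kernel's virtually contiguous vmemmap range; there is no page
table magic in this toy model.

```c
/* Reduced stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Plays the role of the virtually contiguous vmemmap range. */
static struct page vmemmap_backing[4096];
static struct page *vmemmap = vmemmap_backing;

/* A PFN is simply an index into the vmemmap array... */
static struct page *pfn_to_page(unsigned long pfn)
{
	return vmemmap + pfn;
}

/* ...and the offset from vmemmap is the PFN. */
static unsigned long page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - vmemmap);
}
```

No section lookup and no node lookup are needed, at the cost of
dedicating a range of virtual addresses to the memory map.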

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create the page tables for
the virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with :c:type:`struct vmem_altmap`
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========
The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of PFNs. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users need to
populate the `mem_map` at a smaller granularity. Given that
`ZONE_DEVICE` memory is never marked online, it is never subject to
its memory ranges being exposed through the sysfs memory hotplug API
on memory block boundaries. The implementation relies on this lack of
user-API constraint to allow sub-section sized memory ranges to be
specified to :c:func:`arch_add_memory`, the top-half of memory
hotplug. Sub-section support allows for 2MB as the cross-arch common
alignment granularity for :c:func:`devm_memremap_pages`.

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/vm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCIe topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.