.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. Then there could be several contiguous ranges at
completely distinct addresses. And, don't forget about NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the three memory models:
FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

.. note::
   At time of this writing, DISCONTIGMEM is considered deprecated,
   although it is still in use by several architectures.

All the memory models track the status of physical page frames using
:c:type:`struct page` arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.
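
Because these helpers hide the model-specific details, code that walks a
range of page frames looks the same under every memory model. A minimal,
hypothetical sketch (the loop and the `start_pfn`/`end_pfn` variables are
illustrative; only `pfn_to_page()` and `page_to_pfn()` come from the
kernel)::

  unsigned long pfn;

  for (pfn = start_pfn; pfn < end_pfn; pfn++) {
          struct page *page = pfn_to_page(pfn);

          /* the mapping is one-to-one, so the round trip is exact */
          WARN_ON(page_to_pfn(page) != pfn);
  }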

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture specific setup code should
call the :c:func:`free_area_init` function. However, the mappings array is
not usable until the call to :c:func:`memblock_free_all` that hands all
the memory to the page allocator.
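
A hedged sketch of how this ordering commonly looks in architecture setup
code (`paging_init()` and `mem_init()` are typical but architecture-specific
homes for these calls; only `free_area_init()` and `memblock_free_all()`
are the generic interfaces described above)::

  void __init paging_init(void)
  {
          unsigned long max_zone_pfns[MAX_NR_ZONES] = { 0 };

          max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
          /* allocates mem_map and initializes the node and zone structures */
          free_area_init(max_zone_pfns);
  }

  void __init mem_init(void)
  {
          /* hand the memory over to the page allocator */
          memblock_free_all();
  }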

If an architecture enables the `CONFIG_ARCH_HAS_HOLES_MEMORYMODEL` option,
it may free parts of the `mem_map` array that do not cover the
actual physical pages. In that case, the architecture specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.
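
This is roughly how the generic helpers in `include/asm-generic/memory_model.h`
express the FLATMEM conversion (the exact definitions may differ between
kernel versions)::

  /* FLATMEM: mem_map is a single, flat array of struct page */
  #define __pfn_to_page(pfn)   (mem_map + ((pfn) - ARCH_PFN_OFFSET))
  #define __page_to_pfn(page)  ((unsigned long)((page) - mem_map) + \
                                ARCH_PFN_OFFSET)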

DISCONTIGMEM
============

The DISCONTIGMEM model treats the physical memory as a collection of
`nodes` similarly to how Linux NUMA support does. For each node Linux
constructs an independent memory management subsystem represented by
`struct pglist_data` (or `pg_data_t` for short). Among other
things, `pg_data_t` holds the `node_mem_map` array that maps
physical pages belonging to that node. The `node_start_pfn` field of
`pg_data_t` is the number of the first page frame belonging to that
node.

The architecture setup code should call :c:func:`free_area_init_node` for
each node in the system to initialize the `pg_data_t` object and its
`node_mem_map`.

Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
every physical page frame in a node has a `struct page` entry in the
`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
`flags` field of the `struct page` encodes the node number of the
node hosting that page.

The conversion between a PFN and the `struct page` in the
DISCONTIGMEM model is slightly more complex as it has to determine
which node hosts the physical page and which `pg_data_t` object
holds the `struct page`.

Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
to convert PFN to the node number. The opposite conversion helper
:c:func:`page_to_nid` is generic as it uses the node number encoded in
page->flags.

Once the node number is known, the PFN can be used to index the
appropriate `node_mem_map` array to access the `struct page` and
the offset of the `struct page` from the `node_mem_map` plus
`node_start_pfn` is the PFN of that page.
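
Put together, the DISCONTIGMEM conversion looks roughly like the sketch
below. The function names are illustrative; the generic implementation is
a pair of macros in `include/asm-generic/memory_model.h` built on the same
`pfn_to_nid()`, `NODE_DATA()` and `node_start_pfn` pieces described above::

  static inline struct page *discontig_pfn_to_page(unsigned long pfn)
  {
          /* find the node and its pg_data_t first */
          pg_data_t *pgdat = NODE_DATA(pfn_to_nid(pfn));

          return pgdat->node_mem_map + (pfn - pgdat->node_start_pfn);
  }

  static inline unsigned long discontig_page_to_pfn(struct page *page)
  {
          /* the node number comes from page->flags via page_to_nid() */
          pg_data_t *pgdat = NODE_DATA(page_to_nid(page));

          return (page - pgdat->node_mem_map) + pgdat->node_start_pfn;
  }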

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with :c:type:`struct mem_section`
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids the management of the sections. The section size and the
maximal number of sections are specified using the `SECTION_SIZE_BITS`
and `MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a
physical address that an architecture supports, `SECTION_SIZE_BITS`
is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}
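
For example, an architecture that defines `SECTION_SIZE_BITS` as 27 and
`MAX_PHYSMEM_BITS` as 46 (values used here purely for illustration) has
128 MiB sections and 2^(46 - 27) = 2^19, i.e. 524288, possible memory
sections.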

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections (a declaration sketch follows the list below):

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.
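
A simplified sketch of the two layouts, loosely based on the declarations
in `include/linux/mmzone.h` (the backing variable there is named
`mem_section`; the helper constants and exact types vary between kernel
versions)::

  #ifdef CONFIG_SPARSEMEM_EXTREME
  /* dynamically allocated array of row pointers */
  extern struct mem_section **mem_section;
  #else
  /* static two-dimensional array covering all possible sections */
  extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
  #endif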

The architecture setup code should call :c:func:`sparse_init` to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page` - a "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses the high bits of a PFN to access the section that maps that
page frame. Inside a section, the PFN is the index to the array of pages.
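
Expressed as a hedged sketch, simplified from the generic helpers in
`include/asm-generic/memory_model.h` and `include/linux/mmzone.h` (the
function name here is illustrative)::

  static inline struct page *classic_sparse_pfn_to_page(unsigned long pfn)
  {
          /* the high PFN bits select the section */
          struct mem_section *ms = __pfn_to_section(pfn);

          /*
           * The decoded section_mem_map is biased so that adding the
           * full PFN lands on the right struct page within the section.
           */
          return __section_mem_map_addr(ms) + pfn;
  }

The reverse direction uses :c:func:`page_to_section` to read the section
number back from page->flags and subtracts the decoded `section_mem_map`
pointer from the `struct page` address to recover the PFN.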

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index to that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.
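
With vmemmap the conversion degenerates to simple pointer arithmetic. This
is roughly how the generic helpers in `include/asm-generic/memory_model.h`
express it (the exact definitions may differ between kernel versions)::

  /* the memory map is virtually contiguous, so it is plain pointer math */
  #define __pfn_to_page(pfn)   (vmemmap + (pfn))
  #define __page_to_pfn(page)  (unsigned long)((page) - vmemmap)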

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.
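
A hedged sketch of the minimal architecture hook under these assumptions
(the signatures follow the generic code at the time of writing and may
change between kernel versions)::

  int __meminit vmemmap_populate(unsigned long start, unsigned long end,
                                 int node, struct vmem_altmap *altmap)
  {
          /* no special requirements: fall back to base page mappings */
          return vmemmap_populate_basepages(start, end, node, altmap);
  }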

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with :c:type:`struct vmem_altmap`
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` services for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.
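
A hedged sketch of how a driver typically obtains these services (the
`example_*` names are illustrative, and the `struct dev_pagemap` fields
that describe the physical range and its type have changed across kernel
releases, see `include/linux/memremap.h`)::

  struct example_dev {
          struct dev_pagemap pgmap;
          void *virt;
  };

  static int example_map_device_memory(struct device *dev,
                                       struct example_dev *edev)
  {
          /* describe the physical range and its type in edev->pgmap ... */
          edev->virt = devm_memremap_pages(dev, &edev->pgmap);
          if (IS_ERR(edev->virt))
                  return PTR_ERR(edev->virt);

          /* pfn_to_page()/page_to_pfn() now work for the mapped range */
          return 0;
  }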

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, it is subsequently never
subject to its memory ranges being exposed through the sysfs memory
hotplug API on memory block boundaries. The implementation relies on
this lack of user-API constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top-half of
memory hotplug. Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device driver to coordinate memory management
  events related to device memory, typically GPU memory. See
  Documentation/vm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/PCIe topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.