.. _numa:

Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives: the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies, each of which may contain 0
or more CPUs, local memory, and/or IO buses. For brevity and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.

Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell. The cells of the NUMA system are
connected together with some sort of system interconnect--e.g., crossbars and
point-to-point links are common types of NUMA system interconnects. Both of
these types of interconnects can be aggregated to create NUMA platforms with
cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from the
cell containing the target memory. For example, access to memory by CPUs
attached to the same cell will experience faster access times and higher
bandwidths than accesses to memory on other, remote cells. NUMA platforms
can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting. Rather, this architecture is a means to provide scalable
memory bandwidth. However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory--memory on the same cell, if any--or
to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes". Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures. As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses. And, again, accesses to memory on "closer"
nodes--nodes that map to closer cells--will generally experience faster
access times and higher effective bandwidth than accesses to more remote
cells.

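Within the kernel, nodes and their relative distances can be enumerated with
helpers such as for_each_online_node() and node_distance(). The function
below is purely illustrative--it belongs to no real subsystem--and simply
dumps the distance between every pair of online nodes::

  #include <linux/nodemask.h>
  #include <linux/topology.h>
  #include <linux/printk.h>

  /* Hypothetical helper: print the NUMA distance between every pair of
   * online nodes.  Smaller values mean "closer" nodes. */
  static void print_node_distances(void)
  {
      int a, b;

      for_each_online_node(a)
          for_each_online_node(b)
              pr_info("node %d -> node %d: distance %d\n",
                      a, b, node_distance(a, b));
  }
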
For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory. Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures--again, x86 is an example--Linux supports
the emulation of additional nodes. For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access. In addition, Linux constructs, for
each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request. This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node. This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources. Linux chooses
a default node-ordered zonelist. This means it tries to fall back to other
zones on the same node before using remote nodes, which are ordered by NUMA
distance.

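To make the fallback order concrete, the illustrative function below--again,
not part of any real subsystem--walks a node's zonelist using the kernel's
node_zonelist() and for_each_zone_zonelist() helpers, printing each candidate
zone in the order the allocator would try it::

  #include <linux/gfp.h>
  #include <linux/mmzone.h>
  #include <linux/printk.h>

  /* Hypothetical helper: print, in fallback order, every populated zone
   * that a GFP_KERNEL allocation targeting node 'nid' would consider. */
  static void print_fallback_order(int nid)
  {
      struct zonelist *zonelist = node_zonelist(nid, GFP_KERNEL);
      enum zone_type highest = gfp_zone(GFP_KERNEL);
      struct zoneref *z;
      struct zone *zone;

      for_each_zone_zonelist(zone, z, zonelist, highest) {
          if (!populated_zone(zone))
              continue;
          pr_info("node %d: try zone %s on node %d\n",
                  nid, zone->name, zone_to_nid(zone));
      }
  }
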
By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned. Specifically,
Linux will attempt to allocate from the first node in the appropriate zonelist
for the node where the request originates. This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist looking for the first zone in the list
that can satisfy the request.

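For example, an ordinary kernel allocation that names no node already follows
this policy; the trivial function below is illustrative only and assumes the
default memory policy::

  #include <linux/slab.h>

  static void *grab_buffer(void)
  {
      /* With the default policy this is a "local allocation": the
       * zonelist of the executing CPU's node is tried first, and
       * remote nodes are used only when local zones cannot satisfy
       * the request. */
      return kmalloc(4096, GFP_KERNEL);
  }
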
Local allocation will tend to keep subsequent access to the allocated memory
"local" to the underlying physical resources and off the system interconnect--
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory. The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains. However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes, remote
from their initial node and kernel data structures.

System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2). Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. [see
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`].

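As a minimal userspace sketch of the sched_setaffinity(2) approach (the choice
of CPU 0 is arbitrary), a task can pin itself to one CPU so that its default
"local" allocations keep coming from that CPU's node::

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      cpu_set_t mask;

      /* Restrict the calling task to CPU 0; memory it subsequently
       * touches will, by default, be allocated on CPU 0's node. */
      CPU_ZERO(&mask);
      CPU_SET(0, &mask);

      if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
          perror("sched_setaffinity");
          return EXIT_FAILURE;
      }

      /* ... NUMA-sensitive work runs here ... */
      return EXIT_SUCCESS;
  }

Similar pinning can be done from the shell with taskset(1) or numactl(1)
without modifying the program.
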
System administrators can restrict the CPUs and nodes' memories that a non-
privileged user can specify in the scheduling or NUMA commands and functions
using control groups and CPUsets. [see Documentation/admin-guide/cgroup-v1/cpusets.rst]

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself. Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, default "local" allocations will succeed, with the kernel supplying the
closest available memory. This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations cannot tolerate, or do not want, this allocation
fallback behavior. Rather, they want to be sure they get memory from the
specified node or get notified that the node has no free memory. This is the
case, for example, when a subsystem allocates per-CPU memory resources.

A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or cpu_to_node() functions, and then request memory from only
the node id returned. When such an allocation fails, the requesting subsystem
may revert to its own fallback path. The slab kernel memory allocator is an
example of this. Or, the subsystem may choose to disable or not to enable
itself on allocation failure. The kernel profiling subsystem is an example of
this.

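A minimal, illustrative sketch of that model for a hypothetical subsystem that
sets up one buffer per CPU; the function name and the fallback policy are
invented here, and __GFP_THISNODE is one way to forbid the allocator's own
cross-node fallback::

  #include <linux/gfp.h>
  #include <linux/slab.h>
  #include <linux/topology.h>

  /* Hypothetical per-CPU buffer setup: allocate from the CPU's own node
   * only, and fall back to "any node" as this subsystem's own policy. */
  static void *alloc_cpu_buffer(int cpu, size_t size)
  {
      int nid = cpu_to_node(cpu);
      void *buf;

      /* __GFP_THISNODE keeps the allocation on 'nid'; if that node has
       * no free memory, the call fails instead of spilling elsewhere. */
      buf = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);
      if (!buf)
          buf = kmalloc(size, GFP_KERNEL);  /* subsystem's fallback */

      return buf;
  }
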
If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory. To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() function to locate the "local memory node" for the calling or
specified CPU. Again, this is the same node from which default, local page
allocations will be attempted.
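
A variant of the same illustrative sketch that tolerates memoryless nodes
simply asks for the CPU's "local memory node" instead of its own node id::

  #include <linux/gfp.h>
  #include <linux/slab.h>
  #include <linux/topology.h>

  /* Hypothetical helper: allocate from this CPU's "local memory node".
   * On a memoryless node, numa_mem_id() already names the nearest node
   * with memory, so the request does not fail merely because the CPU's
   * own node has none. */
  static void *alloc_local_buffer(size_t size)
  {
      return kmalloc_node(size, GFP_KERNEL, numa_mem_id());
  }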