==============================
Memory Layout on AArch64 Linux
==============================

Author: Catalin Marinas <catalin.marinas@arm.com>

This document describes the virtual memory layout used by the AArch64
Linux kernel. The architecture allows up to 4 levels of translation
tables with a 4KB page size and up to 3 levels with a 64KB page size.

AArch64 Linux uses either 3 levels or 4 levels of translation tables
with the 4KB page configuration, allowing 39-bit (512GB) or 48-bit
(256TB) virtual addresses, respectively, for both user and kernel. With
64KB pages, only 2 levels of translation tables are used, allowing
42-bit (4TB) virtual addresses, but the memory layout is the same.

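The address widths quoted above follow from the table geometry: a
translation table fills one page with 8-byte descriptors, so each level
resolves (page shift - 3) bits, and the in-page offset takes the
remaining page-shift bits. A minimal sketch of that arithmetic follows;
the helper name ``va_bits_for`` is made up for this example and is not a
kernel function.

```c
#include <assert.h>

/*
 * Illustrative helper, not a kernel function: virtual address width for a
 * given page shift (12 for 4KB pages, 16 for 64KB pages) and number of
 * translation table levels. Assumes 8-byte descriptors, so one page-sized
 * table resolves (page_shift - 3) bits of the address.
 */
static unsigned int va_bits_for(unsigned int page_shift, unsigned int levels)
{
	assert(page_shift > 3);
	return page_shift + levels * (page_shift - 3);
}
```

With these assumptions, ``va_bits_for(12, 3)`` gives 39, ``va_bits_for(12, 4)``
gives 48 and ``va_bits_for(16, 2)`` gives 42. For 64KB pages with 3 levels
the formula gives 55, which is more than the hardware supports: in that
case the top-level table is only partially populated.
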
ARMv8.2 adds optional support for Large Virtual Address space. This is
only available when running with a 64KB page size and expands the
number of descriptors in the first level of translation.

User addresses have bits 63:48 set to 0 while the kernel addresses have
the same bits set to 1. TTBRx selection is given by bit 63 of the
virtual address. The swapper_pg_dir contains only kernel (global)
mappings while the user pgd contains only user (non-global) mappings.
The swapper_pg_dir address is written to TTBR1 and never written to
TTBR0.


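As a rough illustration of the rule above for a 48-bit configuration,
the sketch below classifies an address by bits 63:48 and picks the
translation table base register from bit 63. All names are hypothetical
and chosen for this example only; features such as tagged addresses are
deliberately ignored.

```c
#include <assert.h>
#include <stdint.h>

enum va_kind { VA_USER, VA_KERNEL, VA_INVALID };

/* Classify a VA for a 48-bit configuration by inspecting bits 63:48. */
static enum va_kind classify_va48(uint64_t va)
{
	uint64_t top = va >> 48;

	if (top == 0x0000)
		return VA_USER;
	if (top == 0xffff)
		return VA_KERNEL;
	return VA_INVALID;	/* neither all zeros nor all ones */
}

/* Returns 0 for TTBR0 or 1 for TTBR1, as selected by bit 63. */
static int ttbr_for_va48(uint64_t va)
{
	assert(classify_va48(va) != VA_INVALID);
	return (int)(va >> 63);
}
```

For instance, an address in the vmalloc range selects TTBR1 (and thus
swapper_pg_dir), while any user address selects TTBR0.
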
AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::

  Start                 End                       Size          Use
  -----------------------------------------------------------------------
  0000000000000000      0000ffffffffffff         256TB          user
  ffff000000000000      ffff7fffffffffff         128TB          kernel logical memory map
 [ffff600000000000      ffff7fffffffffff]         32TB          [kasan shadow region]
  ffff800000000000      ffff800007ffffff         128MB          bpf jit region
  ffff800008000000      ffff80000fffffff         128MB          modules
  ffff800010000000      fffffbffefffffff         124TB          vmalloc
  fffffbfff0000000      fffffbfffdffffff         224MB          fixed mappings (top down)
  fffffbfffe000000      fffffbfffe7fffff           8MB          [guard region]
  fffffbfffe800000      fffffbffff7fffff          16MB          PCI I/O space
  fffffbffff800000      fffffbffffffffff           8MB          [guard region]
  fffffc0000000000      fffffdffffffffff           2TB          vmemmap
  fffffe0000000000      ffffffffffffffff           2TB          [guard region]


AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support)::

  Start                 End                       Size          Use
  -----------------------------------------------------------------------
  0000000000000000      000fffffffffffff           4PB          user
  fff0000000000000      ffff7fffffffffff          ~4PB          kernel logical memory map
 [fffd800000000000      ffff7fffffffffff]        512TB          [kasan shadow region]
  ffff800000000000      ffff800007ffffff         128MB          bpf jit region
  ffff800008000000      ffff80000fffffff         128MB          modules
  ffff800010000000      fffffbffefffffff         124TB          vmalloc
  fffffbfff0000000      fffffbfffdffffff         224MB          fixed mappings (top down)
  fffffbfffe000000      fffffbfffe7fffff           8MB          [guard region]
  fffffbfffe800000      fffffbffff7fffff          16MB          PCI I/O space
  fffffbffff800000      fffffbffffffffff           8MB          [guard region]
  fffffc0000000000      ffffffdfffffffff          ~4TB          vmemmap
  ffffffe000000000      ffffffffffffffff         128GB          [guard region]


Translation table lookup with 4KB pages::

  +--------+--------+--------+--------+--------+--------+--------+--------+
  |63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
  +--------+--------+--------+--------+--------+--------+--------+--------+
   |                 |         |         |         |         |
   |                 |         |         |         |         v
   |                 |         |         |         |   [11:0]  in-page offset
   |                 |         |         |         +-> [20:12] L3 index
   |                 |         |         +-----------> [29:21] L2 index
   |                 |         +---------------------> [38:30] L1 index
   |                 +-------------------------------> [47:39] L0 index
   +-------------------------------------------------> [63] TTBR0/1


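The field boundaries in the 4KB diagram above translate directly into
shifts and masks. The following is a minimal sketch of that decoding;
the struct and function names are invented for illustration and do not
correspond to kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Four 9-bit indices plus a 12-bit offset cover a 48-bit address. */
static_assert(12 + 4 * 9 == 48, "4KB geometry must add up to 48 bits");

/* Hypothetical container for the fields shown in the 4KB diagram. */
struct va_fields_4k {
	unsigned int l0, l1, l2, l3;	/* 9-bit table indices */
	unsigned int offset;		/* 12-bit in-page offset */
	unsigned int ttbr;		/* bit 63: 0 = TTBR0, 1 = TTBR1 */
};

static struct va_fields_4k split_va_4k(uint64_t va)
{
	struct va_fields_4k f;

	f.offset = va & 0xfff;		/* [11:0]  */
	f.l3 = (va >> 12) & 0x1ff;	/* [20:12] */
	f.l2 = (va >> 21) & 0x1ff;	/* [29:21] */
	f.l1 = (va >> 30) & 0x1ff;	/* [38:30] */
	f.l0 = (va >> 39) & 0x1ff;	/* [47:39] */
	f.ttbr = (unsigned int)(va >> 63);
	return f;
}
```

Each index selects one of 512 descriptors in the table at that level.
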
Translation table lookup with 64KB pages::

  +--------+--------+--------+--------+--------+--------+--------+--------+
  |63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
  +--------+--------+--------+--------+--------+--------+--------+--------+
   |                 |    |               |              |
   |                 |    |               |              v
   |                 |    |               |            [15:0]  in-page offset
   |                 |    |               +----------> [28:16] L3 index
   |                 |    +--------------------------> [41:29] L2 index
   |                 +-------------------------------> [47:42] L1 index (48-bit)
   |                                                   [51:42] L1 index (52-bit)
   +-------------------------------------------------> [63] TTBR0/1


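For 64KB pages the only field whose width depends on the VA size is the
L1 index: 6 bits for 48-bit VAs and 10 bits for 52-bit VAs, which is the
larger first-level table that the Large Virtual Address extension
provides. A sketch of the decoding, using hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields shown in the 64KB diagram. */
struct va_fields_64k {
	unsigned int l1, l2, l3;
	unsigned int offset;		/* 16-bit in-page offset */
};

/* va_bits must be 48 or 52; it sets the width of the L1 index. */
static struct va_fields_64k split_va_64k(uint64_t va, unsigned int va_bits)
{
	struct va_fields_64k f;
	unsigned int l1_bits;

	assert(va_bits == 48 || va_bits == 52);
	l1_bits = va_bits - 42;		/* 6 bits or 10 bits */

	f.offset = va & 0xffff;		/* [15:0]  */
	f.l3 = (va >> 16) & 0x1fff;	/* [28:16] */
	f.l2 = (va >> 29) & 0x1fff;	/* [41:29] */
	f.l1 = (va >> 42) & ((1u << l1_bits) - 1);
	return f;
}
```

The first-level table therefore holds 64 descriptors in the 48-bit case
and 1024 in the 52-bit case.
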
When using KVM without the Virtualization Host Extensions, the
hypervisor maps kernel pages in EL2 at a fixed (and potentially
random) offset from the linear mapping. See the kern_hyp_va macro and
kvm_update_va_mask function for more details. MMIO devices such as
GICv2 get mapped next to the HYP idmap page, as do vectors when
ARM64_SPECTRE_V3A is enabled for particular CPUs.

When using KVM with the Virtualization Host Extensions, no additional
mappings are created, since the host kernel runs directly in EL2.

52-bit VA support in the kernel
-------------------------------
If the ARMv8.2-LVA optional feature is present and the kernel is running
with a 64KB page size, then it is possible to use 52 bits of address
space for both userspace and kernel addresses. However, any kernel
binary that supports 52-bit must also be able to fall back to 48-bit
at early boot time if the hardware feature is not present.

This fallback mechanism requires the kernel .text to be placed in the
higher addresses, so that its addresses are invariant to 48/52-bit VAs.
Because the kasan shadow is a fraction of the entire kernel VA space,
the end of the kasan shadow must also be in the higher half of the
kernel VA space for both 48/52-bit. (When switching from 48-bit to
52-bit, the end of the kasan shadow is invariant and dependent on ~0UL,
whilst the start address will "grow" towards the lower addresses.)

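The "growing" start address can be checked against the layout tables
earlier in this document. The sketch below assumes that one shadow byte
covers 8 bytes of kernel VA space (a scale shift of 3); that assumption
is inferred from the 32TB and 512TB shadow sizes in the tables, and the
macro and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Exclusive end of the shadow region, as listed in both layout tables. */
#define SHADOW_END		0xffff800000000000ULL

/* Assumption for this sketch: 1 shadow byte per 8 bytes of kernel VA. */
#define SHADOW_SCALE_SHIFT	3

/* Start of the shadow region for a kernel VA space of 2^va_bits bytes. */
static uint64_t shadow_start(unsigned int va_bits)
{
	assert(va_bits > SHADOW_SCALE_SHIFT && va_bits < 64);
	return SHADOW_END - (1ULL << (va_bits - SHADOW_SCALE_SHIFT));
}
```

Under these assumptions ``shadow_start(48)`` is 0xffff600000000000 and
``shadow_start(52)`` is 0xfffd800000000000, matching the tables: the end
stays fixed while the start moves down as the VA space grows.
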
In order to optimise phys_to_virt and virt_to_phys, the PAGE_OFFSET
is kept constant at 0xFFF0000000000000 (corresponding to 52-bit);
this obviates the need for an extra variable read. The physvirt
offset and vmemmap offsets are computed at early boot to enable
this logic.

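The constant quoted above is the lowest address whose top 12 bits are
all ones. As a small sketch of that relationship (the helper name is
hypothetical, and this is not necessarily how the kernel spells its own
definition):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: lowest kernel address for a given VA width, i.e. the
 * address whose top (64 - va_bits) bits are all ones.
 */
static uint64_t page_offset_for(unsigned int va_bits)
{
	assert(va_bits > 0 && va_bits < 64);
	return ~UINT64_C(0) << va_bits;
}
```

``page_offset_for(52)`` yields 0xfff0000000000000, the fixed value
described above, while ``page_offset_for(48)`` yields 0xffff000000000000,
the start of the kernel range in the 48-bit layout table.
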
As a single binary will need to support both 48-bit and 52-bit VA
spaces, the VMEMMAP must be sized large enough for 52-bit VAs and
also must be sized large enough to accommodate a fixed PAGE_OFFSET.

Most code in the kernel should not need to consider the VA_BITS. For
code that does need to know the VA size, the variables are
defined as follows:

        VA_BITS         constant        the *maximum* VA space size

        VA_BITS_MIN     constant        the *minimum* VA space size

        vabits_actual   variable        the *actual* VA space size


Maximum and minimum sizes can be useful to ensure that buffers are
sized large enough or that addresses are positioned close enough for
the "worst" case.

52-bit userspace VAs
--------------------
To maintain compatibility with software that relies on the ARMv8.0
VA space maximum size of 48 bits, the kernel will, by default,
return virtual addresses to userspace from a 48-bit range.

Software can "opt-in" to receiving VAs from a 52-bit space by
specifying an mmap hint parameter that is larger than 48 bits.

For example:

.. code-block:: c

   maybe_high_address = mmap(~0UL, size, prot, flags,...);

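A fuller, compilable variant of the call above might look like the
following sketch. The helper names are hypothetical, and the choice of a
private anonymous read/write mapping is an assumption made only to keep
the example self-contained.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Map `size` bytes of anonymous memory, passing an all-ones hint so that a
 * 52-bit capable kernel may place the mapping above the 48-bit boundary.
 * Returns NULL on failure.
 */
static void *map_with_high_hint(size_t size)
{
	void *p = mmap((void *)~0UL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* True if the address does not fit in 48 bits. */
static int is_above_48bit(const void *p)
{
	assert(p != NULL);
	return ((uintptr_t)p >> 48) != 0;
}
```

The hint is only a request: on hardware or kernels without 52-bit
support the mapping still succeeds but is placed within the 48-bit
range, so callers should check the returned address rather than assume
it is high.
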
It is also possible to build a debug kernel that returns addresses
from a 52-bit space by enabling the following kernel config options:

.. code-block:: sh

   CONFIG_EXPERT=y && CONFIG_ARM64_FORCE_52BIT=y

Note that this option is only intended for debugging applications
and should not be used in production.