.. _unevictable_lru:

==============================
Unevictable LRU Infrastructure
==============================

.. contents:: :local:


Introduction
============

This document describes the Linux memory manager's "Unevictable LRU"
infrastructure and the use of this to manage several types of "unevictable"
pages.

The document attempts to provide the overall rationale behind this mechanism
and the rationale for some of the design decisions that drove the
implementation. The latter design rationale is discussed in the context of an
implementation description. Admittedly, one can obtain the implementation
details - the "what does it do?" - by reading the code. One hopes that the
descriptions below add value by providing the answer to "why does it do that?".



The Unevictable LRU
===================

The Unevictable LRU facility adds an additional LRU list to track unevictable
pages and to hide these pages from vmscan. This mechanism is based on a patch
by Larry Woodman of Red Hat to address several scalability problems with page
reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.

To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
main memory will have over 32 million 4k pages in a single zone. When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable. This can result in a situation where all CPUs are
spending 100% of their time in vmscan for hours or days on end, with the system
completely unresponsive.
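
The arithmetic behind the example above can be sketched in a few lines of
userspace C. This is only an illustration: the helper names are made up for
this document, and the 90% figure used in the note below is a hypothetical
workload, not a measurement.

```c
#include <stdint.h>

/* Number of base pages needed to cover mem_bytes of memory. */
uint64_t pages_in_memory(uint64_t mem_bytes, uint64_t page_bytes)
{
    return mem_bytes / page_bytes;
}

/*
 * Number of pages a full LRU pass would examine without any chance of
 * reclaiming them, given the percentage of pages that are unevictable.
 * The percentage is an input assumption, not something the kernel reports
 * in this form.
 */
uint64_t unevictable_pages(uint64_t total_pages, unsigned int percent)
{
    return total_pages * percent / 100;
}
```

With 128GB of memory and 4k pages, pages_in_memory(128ULL << 30, 4096) gives
33554432 pages. If 90% of them were unevictable, a full pass over the regular
LRU lists would visit roughly 30 million pages that can never be reclaimed,
which is the work the unevictable list is meant to avoid.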

The unevictable list addresses the following classes of unevictable pages:

 * Those owned by ramfs.

 * Those mapped into SHM_LOCK'd shared memory regions.

 * Those mapped into VM_LOCKED [mlock()ed] VMAs.

The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.


The Unevictable Page List
-------------------------

The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list.

The PG_unevictable flag is analogous to, and mutually exclusive with, the
PG_active flag in that it indicates on which LRU list a page resides when
PG_lru is set.

The Unevictable LRU infrastructure maintains unevictable pages on an additional
LRU list for a few reasons:

 (1) We get to "treat unevictable pages just like we treat other pages in the
     system - which means we get to use the same code to manipulate them, the
     same code to isolate them (for migrate, etc.), the same code to keep track
     of the statistics, etc..." [Rik van Riel]

 (2) We want to be able to migrate unevictable pages between nodes for memory
     defragmentation, workload management and memory hotplug. The Linux kernel
     can only migrate pages that it can successfully isolate from the LRU
     lists. If we were to maintain pages elsewhere than on an LRU-like list,
     where they can be found by isolate_lru_page(), we would prevent their
     migration, unless we reworked migration code to find the unevictable pages
     itself.


The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages. This differentiation is only important while the pages are,
in fact, evictable.

The unevictable list benefits from the "arrayification" of the per-zone LRU
lists and statistics originally proposed and posted by Christoph Lameter.
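
The way the page flags select a list can be sketched as a small userspace
model. It is not kernel code: the enum, the struct and the function below are
hypothetical stand-ins, loosely modelled on the lru_list enum mentioned in this
document, and they only show that the unevictable state takes precedence and
ignores the file-backed/anonymous distinction.

```c
#include <stdbool.h>

/* Illustrative list indices; one array slot per list ("arrayification"). */
enum model_lru_list {
    MODEL_LRU_INACTIVE_ANON,
    MODEL_LRU_ACTIVE_ANON,
    MODEL_LRU_INACTIVE_FILE,
    MODEL_LRU_ACTIVE_FILE,
    MODEL_LRU_UNEVICTABLE,
    MODEL_NR_LRU_LISTS
};

/* Hypothetical page model; the real struct page uses flag bits. */
struct model_page {
    bool active;      /* stands in for PG_active */
    bool unevictable; /* stands in for PG_unevictable */
    bool file_backed; /* file-backed vs. anonymous, swap-backed */
};

/* Pick the list a page belongs on, given its flags. */
enum model_lru_list model_page_lru(const struct model_page *page)
{
    /* Unevictable wins: file vs. anon and active vs. inactive are ignored. */
    if (page->unevictable)
        return MODEL_LRU_UNEVICTABLE;
    if (page->file_backed)
        return page->active ? MODEL_LRU_ACTIVE_FILE
                            : MODEL_LRU_INACTIVE_FILE;
    return page->active ? MODEL_LRU_ACTIVE_ANON : MODEL_LRU_INACTIVE_ANON;
}
```

Because every list is just one more index in the same enum, per-list arrays of
list heads and statistics sized by MODEL_NR_LRU_LISTS gain an unevictable slot
without any special-case code.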

The unevictable list does not use the LRU pagevec mechanism. Rather,
unevictable pages are placed directly on the page's zone's unevictable list
under the zone lru_lock. This allows us to prevent the stranding of pages on
the unevictable list when one task has the page isolated from the LRU and other
tasks are changing the "evictability" state of the page.


Memory Control Group Interaction
--------------------------------

The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/cgroup-v1/memory.txt] by extending the
lru_list enum.

The memory controller data structure automatically gets a per-zone unevictable
list as a result of the "arrayification" of the per-zone LRU lists (one per
lru_list enum element). The memory controller tracks the movement of pages to
and from the unevictable list.

When a memory control group comes under memory pressure, the controller will
not attempt to reclaim pages on the unevictable list. This has a couple of
effects:

 (1) Because the pages are "hidden" from reclaim on the unevictable list, the
     reclaim process can be more efficient, dealing only with pages that have a
     chance of being reclaimed.

 (2) On the other hand, if too many of the pages charged to the control group
     are unevictable, the evictable portion of the working set of the tasks in
     the control group may not fit into the available memory. This can cause
     the control group to thrash or to OOM-kill tasks.


.. _mark_addr_space_unevict:

Marking Address Spaces Unevictable
----------------------------------

For facilities such as ramfs none of the pages attached to the address space
may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE
address space flag is provided, and this can be manipulated by a filesystem
using a number of wrapper functions:

 * ``void mapping_set_unevictable(struct address_space *mapping);``

   Mark the address space as being completely unevictable.

 * ``void mapping_clear_unevictable(struct address_space *mapping);``

   Mark the address space as being evictable.

 * ``int mapping_unevictable(struct address_space *mapping);``

   Query the address space, and return true if it is completely
   unevictable.

These are currently used in two places in the kernel:

 (1) By ramfs to mark the address spaces of its inodes when they are created,
     and this mark remains for the life of the inode.

 (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.

     Note that SHM_LOCK is not required to page in the locked pages if they're
     swapped out; the application must touch the pages manually if it wants to
     ensure they're in memory.
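
The behaviour of the three wrappers can be mimicked with a single flag bit. The
sketch below is a userspace model under stated assumptions: the struct, the bit
position and the model_ prefixed names are invented here and are not the
kernel's definitions; only the set/clear/query semantics follow the
descriptions above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit position, chosen for this sketch only. */
#define MODEL_AS_UNEVICTABLE (1UL << 0)

/* Hypothetical stand-in for struct address_space. */
struct model_address_space {
    unsigned long flags;
};

/* Mark the whole address space as unevictable. */
void model_mapping_set_unevictable(struct model_address_space *mapping)
{
    mapping->flags |= MODEL_AS_UNEVICTABLE;
}

/* Mark the address space as evictable again. */
void model_mapping_clear_unevictable(struct model_address_space *mapping)
{
    mapping->flags &= ~MODEL_AS_UNEVICTABLE;
}

/* Query; a page with no mapping at all is treated as not so marked. */
bool model_mapping_unevictable(const struct model_address_space *mapping)
{
    return mapping != NULL && (mapping->flags & MODEL_AS_UNEVICTABLE) != 0;
}
```

In this model a ramfs-like user would call the set function once when the
mapping is created and never clear it, while a SHM-like user would pair set
with clear around the lock and unlock operations.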


Detecting Unevictable Pages
---------------------------

The function page_evictable() in vmscan.c determines whether a page is
evictable or not using the query function outlined above [see section
:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag.

For address spaces that are so marked after being populated (as SHM regions
might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate
the page tables for the region as does, for example, mlock(), nor need it make
any special effort to push any pages in the SHM_LOCK'd area to the unevictable
list. Instead, vmscan will do this if and when it encounters the pages during
a reclamation scan.

On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan
the pages in the region and "rescue" them from the unevictable list if no other
condition is keeping them unevictable. If an unevictable region is destroyed,
the pages are also "rescued" from the unevictable list in the process of
freeing them.

page_evictable() also checks for mlocked pages by testing an additional page
flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED.
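
The two checks described in this section combine into a simple predicate. The
following is a simplified userspace model, not the function from vmscan.c: the
structs and the model_page_evictable() name are hypothetical, and the boolean
fields stand in for the AS_UNEVICTABLE and PG_mlocked flags.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an address space with its AS_UNEVICTABLE flag. */
struct model_mapping {
    bool unevictable;
};

/* Hypothetical stand-in for a page; mapping may be NULL. */
struct model_page {
    const struct model_mapping *mapping;
    bool mlocked; /* stands in for PG_mlocked */
};

/* A page is evictable only if neither condition pins it. */
bool model_page_evictable(const struct model_page *page)
{
    if (page->mapping != NULL && page->mapping->unevictable)
        return false; /* whole address space marked unevictable */
    if (page->mlocked)
        return false; /* page noticed in a VM_LOCKED VMA */
    return true;
}
```

The model also shows why an unlocker has to check for "other conditions":
clearing the mapping flag alone does not make a page evictable while its
mlocked flag is still set.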


Vmscan's Handling of Unevictable Pages
--------------------------------------

If unevictable pages are culled in the fault path, or moved to the unevictable
list at mlock() or mmap() time, vmscan will not encounter the pages until they
have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list. However, there may be situations where we decide,
for the sake of expediency, to leave an unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
unevictable list for the zone being scanned.

There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked. Such pages will make it all the way to
shrink_page_list() where they will be detected when vmscan walks the reverse
map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK,
shrink_page_list() will cull the page at that point.

To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
using putback_lru_page() - the inverse operation to isolate_lru_page() - after
dropping the page lock. Because the condition which makes the page unevictable
may change once the page is unlocked, putback_lru_page() will recheck the
unevictable state of a page that it places on the unevictable list. If the
page has become evictable, putback_lru_page() removes it from the list and
retries, including the page_evictable() test. Because such a race is a rare
event and movement of pages onto the unevictable list should be rare, these
extra evictability checks should not occur in the majority of calls to
putback_lru_page().
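
The recheck-and-retry behaviour described for putback_lru_page() can be
illustrated with a single-threaded userspace model. Everything here is an
assumption made for the sketch: the types, the model_putback() function and the
callback that simulates another task changing the page's state are invented,
and no locking is modelled.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_list { MODEL_LIST_EVICTABLE, MODEL_LIST_UNEVICTABLE };

/* Hypothetical page model with its current evictability and list. */
struct model_page {
    bool evictable;
    enum model_list list;
};

/* Stand-in for a concurrent task, run once after the page is placed. */
typedef void (*model_race_fn)(struct model_page *page);

/* Sample racing change, e.g. the effect of a munlock() elsewhere. */
void model_make_evictable(struct model_page *page)
{
    page->evictable = true;
}

/* Put a page back on a list; returns the number of placement attempts. */
int model_putback(struct model_page *page, model_race_fn race)
{
    int attempts = 0;

    for (;;) {
        bool was_evictable = page->evictable;

        attempts++;
        page->list = was_evictable ? MODEL_LIST_EVICTABLE
                                   : MODEL_LIST_UNEVICTABLE;

        if (race != NULL) {
            race(page); /* simulate one state change after placement */
            race = NULL;
        }

        /* Only pages just placed on the unevictable list are rechecked. */
        if (!was_evictable && page->evictable)
            continue; /* became evictable: take it off and retry */
        return attempts;
    }
}
```

Without a racing change the loop body runs once, which matches the expectation
above that the extra checks stay off the common path.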


MLOCKED Pages
=============

The unevictable page list is also useful for mlock(), in addition to ramfs and
SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked.


History
-------

The "Unevictable mlocked Pages" infrastructure is based on work originally
posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
Nick posted his patch as an alternative to a patch posted by Christoph Lameter
to achieve the same objective: hiding mlocked pages from vmscan.

In Nick's patch, he used one of the struct page LRU list link fields as a count
of VM_LOCKED VMAs that map the page. This use of the link field for a count
prevented the management of the pages on an LRU list, and thus mlocked pages
were not migratable as isolate_lru_page() could not find them, and the LRU list
link field was not available to the migration subsystem.

Nick resolved this by putting mlocked pages back on the LRU list before
attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When
Nick's patch was integrated with the Unevictable LRU work, the count was
replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
mapped the page. More on this below.


Basic Management
----------------

mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
pages. When such a page has been "noticed" by the memory management subsystem,
the page is marked with the PG_mlocked flag. This can be manipulated using the
PageMlocked() functions.

A PG_mlocked page will be placed on the unevictable list when it is added to
the LRU. Such pages can be "noticed" by memory management in several places:

 (1) in the mlock()/mlockall() system call handlers;

 (2) in the mmap() system call handler when mmapping a region with the
     MAP_LOCKED flag;

 (3) when mmapping a region in a task that has called mlockall() with the
     MCL_FUTURE flag;

 (4) in the fault path, if mlocked pages are "culled" in the fault path,
     and when a VM_LOCKED stack segment is expanded; or

 (5) as mentioned above, in vmscan:shrink_page_list() when attempting to
     reclaim a page in a VM_LOCKED VMA via try_to_unmap(),

all of which result in the VM_LOCKED flag being set for the VMA if it doesn't
already have it set.

mlocked pages become unlocked and rescued from the unevictable list when:

 (1) mapped in a range unlocked via the munlock()/munlockall() system calls;

 (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
     unmapping at task exit;

 (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
     or

 (4) before a page is COW'd in a VM_LOCKED VMA.


mlock()/mlockall() System Call Handling
---------------------------------------

Both [do\_]mlock() and [do\_]mlockall() system call handlers call mlock_fixup()
for each VMA in the range specified by the call. In the case of mlockall(),
this is the entire active address space of the task. Note that mlock_fixup()
is used for both mlocking and munlocking a range of memory. A call to mlock()
an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED, is
treated as a no-op, and mlock_fixup() simply returns.

If the VMA passes some filtering as described in "Filtering Special VMAs"
below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
off a subset of the VMA if the range does not cover the entire VMA. Once the
VMA has been merged or split or neither, mlock_fixup() will call
populate_vma_page_range() to fault in the pages via get_user_pages() and to
mark the pages as mlocked via mlock_vma_page().

Note that the VMA being mlocked might be mapped with PROT_NONE. In this case,
get_user_pages() will be unable to fault in the pages. That's okay. If pages
do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the
fault path or in vmscan.

Also note that a page returned by get_user_pages() could be truncated or
migrated out from under us, while we're trying to mlock it. To detect this,
populate_vma_page_range() checks page_mapping() after acquiring the page lock.
If the page is still associated with its mapping, we'll go ahead and call
mlock_vma_page(). If the mapping is gone, we just unlock the page and move on.
In the worst case, this will result in a page mapped in a VM_LOCKED VMA
remaining on a normal LRU list without being PageMlocked(). Again, vmscan will
detect and cull such pages.

mlock_vma_page() will call TestSetPageMlocked() for each page returned by
get_user_pages(). We use TestSetPageMlocked() because the page might already
be mlocked by another task/VMA and we don't want to do extra work. We
especially do not want to count an mlocked page more than once in the
statistics. If the page was already mlocked, mlock_vma_page() need do nothing
more.
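
Why a test-and-set operation keeps the statistics honest can be shown with a
small model. This is a single-threaded userspace sketch with invented names;
the real TestSetPageMlocked() is an atomic page-flag operation, which this
model does not attempt to reproduce.

```c
#include <stdbool.h>

/* Hypothetical page model; mlocked stands in for PG_mlocked. */
struct model_page {
    bool mlocked;
};

/* Hypothetical per-zone counter of mlocked pages. */
struct model_zone_stats {
    long nr_mlock;
};

/* Set the flag and return its previous value (not atomic in this model). */
bool model_test_set_mlocked(struct model_page *page)
{
    bool old = page->mlocked;

    page->mlocked = true;
    return old;
}

/* Mark a page mlocked, counting it only on the first transition. */
void model_mlock_page(struct model_page *page, struct model_zone_stats *stats)
{
    if (!model_test_set_mlocked(page))
        stats->nr_mlock++; /* only the first locker updates the count */
}
```

Calling model_mlock_page() twice on the same page, as two tasks mapping it into
VM_LOCKED VMAs would, leaves the counter at one.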

If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
back the page - by calling putback_lru_page() - which will notice that the page
is now mlocked and divert the page to the zone's unevictable list. If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.


Filtering Special VMAs
----------------------

mlock_fixup() filters several classes of "special" VMAs:

1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind
   these mappings are inherently pinned, so we don't need to mark them as
   mlocked. In any case, most of the pages have no struct page in which to so
   mark the page. Because of this, get_user_pages() will fail for these VMAs,
   so there is no sense in attempting to visit them.

2) VMAs mapping hugetlbfs pages are already effectively pinned into memory. We
   neither need nor want to mlock() these pages. However, to preserve the
   prior behavior of mlock() - before the unevictable/mlock changes -
   mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
   allocate the huge pages and populate the ptes.

3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
   such as the VDSO page, relay channel pages, etc. These pages
   are inherently unevictable and are not managed on the LRU lists.
   mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls
   make_pages_present() to populate the ptes.

Note that for all of these special VMAs, mlock_fixup() does not set the
VM_LOCKED flag. Therefore, we won't have to deal with them later during
munlock(), munmap() or task exit. Neither does mlock_fixup() account these
VMAs against the task's "locked_vm".
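
The three filter classes above amount to a small decision function over the VMA
flags. The sketch below is a userspace model only: the MODEL_ flag values are
arbitrary bits chosen for the example, MODEL_VM_HUGETLB is a stand-in for "this
VMA maps hugetlbfs pages", and the function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bit values, chosen for this sketch only. */
#define MODEL_VM_IO         (1UL << 0)
#define MODEL_VM_PFNMAP     (1UL << 1)
#define MODEL_VM_HUGETLB    (1UL << 2)
#define MODEL_VM_DONTEXPAND (1UL << 3)

enum model_mlock_action {
    MODEL_SKIP,          /* class 1: do not visit the VMA at all */
    MODEL_POPULATE_ONLY, /* classes 2 and 3: populate ptes, no VM_LOCKED */
    MODEL_LOCK           /* ordinary VMA: set VM_LOCKED and mlock pages */
};

/* Decide how an mlock request treats a VMA with the given flags. */
enum model_mlock_action model_classify_vma(unsigned long vm_flags)
{
    if (vm_flags & (MODEL_VM_IO | MODEL_VM_PFNMAP))
        return MODEL_SKIP;
    if (vm_flags & (MODEL_VM_HUGETLB | MODEL_VM_DONTEXPAND))
        return MODEL_POPULATE_ONLY;
    return MODEL_LOCK;
}

/* Only ordinary VMAs get VM_LOCKED and are charged to locked_vm. */
bool model_sets_vm_locked(unsigned long vm_flags)
{
    return model_classify_vma(vm_flags) == MODEL_LOCK;
}
```

Since none of the special classes ever reaches MODEL_LOCK, a later unlock pass
that keys off VM_LOCKED skips them without any extra bookkeeping.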
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 359 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 360 | .. _munlock_munlockall_handling: |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 361 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 362 | munlock()/munlockall() System Call Handling |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 363 | ------------------------------------------- |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 364 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 365 | The munlock() and munlockall() system calls are handled by the same functions - |
| 366 | do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs |
| 367 | lock operation indicated by an argument. So, these system calls are also |
| 368 | handled by mlock_fixup(). Again, if called for an already munlocked VMA, |
| 369 | mlock_fixup() simply returns. Because of the VMA filtering discussed above, |
| 370 | VM_LOCKED will not be set in any "special" VMAs. So, these VMAs will be |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 371 | ignored for munlock. |
| 372 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 373 | If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the |
| 374 | specified range. The range is then munlocked via the function |
Kirill A. Shutemov | fc05f56 | 2015-04-14 15:44:39 -0700 | [diff] [blame] | 375 | populate_vma_page_range() - the same function used to mlock a VMA range - |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 376 | passing a flag to indicate that munlock() is being performed. |
| 377 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 378 | Because the VMA access protections could have been changed to PROT_NONE after |
Hugh Dickins | 63d6c5a | 2009-01-06 14:39:38 -0800 | [diff] [blame] | 379 | faulting in and mlocking pages, get_user_pages() was unreliable for visiting |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 380 | these pages for munlocking. Because we don't want to leave pages mlocked, |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 381 | get_user_pages() was enhanced to accept a flag to ignore the permissions when |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 382 | fetching the pages - all of which should be resident as a result of previous |
| 383 | mlocking. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 384 | |
Kirill A. Shutemov | fc05f56 | 2015-04-14 15:44:39 -0700 | [diff] [blame] | 385 | For munlock(), populate_vma_page_range() unlocks individual pages by calling |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 386 | munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 387 | flag using TestClearPageMlocked(). As with mlock_vma_page(), |
| 388 | munlock_vma_page() uses the Test*PageMlocked() function to handle the case where
| 389 | the page might have already been unlocked by another task. If the page was
| 390 | mlocked, munlock_vma_page() updates the zone statistics for the number of
| 391 | mlocked pages. Note, however, that at this point we haven't checked whether |
| 392 | the page is mapped by other VM_LOCKED VMAs. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 393 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 394 | We can't call try_to_munlock(), the function that walks the reverse map to |
| 395 | check for other VM_LOCKED VMAs, without first isolating the page from the LRU. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 396 | try_to_munlock() is a variant of try_to_unmap() and thus requires that the page |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 397 | not be on an LRU list [more on these below]. However, the call to |
| 398 | isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So, |
| 399 | we go ahead and clear PG_mlocked up front, as this might be the only chance we |
| 400 | have. If we can successfully isolate the page, we go ahead and |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 401 | try_to_munlock(), which will restore the PG_mlocked flag and update the zone |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 402 | page statistics if it finds another VMA holding the page mlocked. If we fail |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 403 | to isolate the page, we'll have left a potentially mlocked page on the LRU. |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 404 | This is fine, because we'll catch it later if and when vmscan tries to reclaim
| 405 | the page. This should be relatively rare. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 406 | |
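The ordering described above (clear PG_mlocked first, then try to isolate,
then re-check the reverse map) can be illustrated with a small userspace toy
model. This is only a sketch: struct toy_page and every toy_* function are
hypothetical stand-ins invented for illustration and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only what the sketch needs. */
struct toy_page {
	bool mlocked;          /* models PG_mlocked */
	bool on_lru;           /* page currently sits on an LRU list */
	bool can_isolate;      /* whether the modelled isolation succeeds */
	int locked_vma_count;  /* VM_LOCKED VMAs that still map the page */
};

/* Models isolate_lru_page(), which is allowed to fail. */
static bool toy_isolate(struct toy_page *p)
{
	if (!p->on_lru || !p->can_isolate)
		return false;
	p->on_lru = false;
	return true;
}

/* Models try_to_munlock(): restore the flag if a VM_LOCKED VMA remains. */
static void toy_try_to_munlock(struct toy_page *p)
{
	if (p->locked_vma_count > 0)
		p->mlocked = true;
}

/*
 * Models munlock_vma_page(). Returns true if the reverse map re-check
 * ran, false if the page was not mlocked or could not be isolated.
 */
static bool toy_munlock_vma_page(struct toy_page *p)
{
	bool was_mlocked;

	assert(p != NULL);
	was_mlocked = p->mlocked;
	p->mlocked = false;        /* clear up front: may be our only chance */
	if (!was_mlocked)
		return false;      /* another task already munlocked it */

	if (!toy_isolate(p))
		return false;      /* stays on the LRU; vmscan catches it later */

	toy_try_to_munlock(p);     /* may set the flag again */
	p->on_lru = true;          /* put the page back on an LRU list */
	return true;
}
```

In this model, a page that cannot be isolated ends up with the flag clear
while still mapped by a locked VMA, which is the "potentially mlocked page on
the LRU" case that the text leaves for vmscan to resolve.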
| 407 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 408 | Migrating MLOCKED Pages |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 409 | ----------------------- |
| 410 | |
| 411 | A page that is being migrated has been isolated from the LRU lists and is held |
| 412 | locked across unmapping of the page, updating the page's address space entry |
| 413 | and copying the contents and state, until the page table entry has been |
| 414 | replaced with an entry that refers to the new page. Linux supports migration |
| 415 | of mlocked pages and other unevictable pages. This involves simply moving the |
| 416 | PG_mlocked and PG_unevictable states from the old page to the new page. |
| 417 | |
| 418 | Note that page migration can race with mlocking or munlocking of the same page. |
| 419 | This has been discussed from the mlock/munlock perspective in the respective |
| 420 | sections above. Both processes (migration and m[un]locking) hold the page |
| 421 | locked. This provides the first level of synchronization. Page migration |
| 422 | zeros out the page_mapping of the old page before unlocking it, so m[un]lock |
| 423 | can skip these pages by testing the page mapping under page lock. |
| 424 | |
| 425 | To complete page migration, we place the new and old pages back onto the LRU |
| 426 | after dropping the page lock. The "unneeded" page - old page on success, new |
| 427 | page on failure - will be freed when the reference count held by the migration |
| 428 | process is released. To ensure that we don't strand pages on the unevictable |
| 429 | list because of a race between munlock and migration, page migration uses the |
| 430 | putback_lru_page() function to add migrated pages back to the LRU. |
| 431 | |
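As a rough illustration of "moving the PG_mlocked and PG_unevictable states
from the old page to the new page", the sketch below transfers two flag bits
between plain integers. The TOY_PG_* constants and toy_migrate_lock_state()
are hypothetical names made up for this example; the real page flag handling
lives in the migration code and is more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits, not the kernel's actual page flag values. */
#define TOY_PG_MLOCKED     (1u << 0)
#define TOY_PG_UNEVICTABLE (1u << 1)
#define TOY_PG_OTHER       (1u << 2)

/*
 * Move the mlocked/unevictable state from the old page's flags to the
 * new page's flags, leaving every other bit untouched on both sides.
 */
void toy_migrate_lock_state(unsigned int *old_flags, unsigned int *new_flags)
{
	const unsigned int mask = TOY_PG_MLOCKED | TOY_PG_UNEVICTABLE;

	assert(old_flags != NULL && new_flags != NULL);
	*new_flags = (*new_flags & ~mask) | (*old_flags & mask);
	*old_flags &= ~mask;
}
```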
| 432 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 433 | Compacting MLOCKED Pages |
Eric B Munson | 922c055 | 2015-04-15 16:13:23 -0700 | [diff] [blame] | 434 | ------------------------ |
| 435 | |
| 436 | The unevictable LRU can be scanned for compactable regions and the default |
| 437 | behavior is to do so. /proc/sys/vm/compact_unevictable_allowed controls |
| 438 | this behavior (see Documentation/sysctl/vm.txt). Once scanning of the |
| 439 | unevictable LRU is enabled, the work of compaction is mostly handled by |
| 440 | the page migration code and the same work flow as described in "Migrating
| 441 | MLOCKED Pages" will apply.
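
A userspace program can check the current setting by reading the sysctl file
named above. The helper below is a minimal sketch; its name,
read_compact_unevictable_allowed(), is an assumption for illustration, and it
returns -1 when the file is absent or unreadable (for example, if the running
kernel does not expose this sysctl).

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return the value stored in compact_unevictable_allowed, or -1 if the
 * file cannot be opened or does not contain an integer.
 */
int read_compact_unevictable_allowed(void)
{
	const char *path = "/proc/sys/vm/compact_unevictable_allowed";
	FILE *f = fopen(path, "r");
	int value = -1;

	assert(path != NULL);
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```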
| 442 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 443 | MLOCKING Transparent Huge Pages |
Kirill A. Shutemov | 6fb8ddf | 2016-07-26 15:25:15 -0700 | [diff] [blame] | 444 | ------------------------------- |
| 445 | |
| 446 | A transparent huge page is represented by a single entry on an LRU list. |
| 447 | Therefore, we can only make unevictable an entire compound page, not |
| 448 | individual subpages. |
| 449 | |
| 450 | If a user tries to mlock() part of a huge page, we want the rest of the |
| 451 | page to be reclaimable. |
| 452 | |
| 453 | We cannot just split the page on partial mlock() as split_huge_page() can |
| 454 | fail and a new intermittent failure mode for the syscall is undesirable.
| 455 | |
| 456 | We handle this by keeping PTE-mapped huge pages on normal LRU lists: the |
| 457 | PMD on the border of a VM_LOCKED VMA will be split into a PTE table.
| 458 | |
| 459 | This way the huge page is accessible for vmscan. Under memory pressure the |
| 460 | page will be split, subpages which belong to VM_LOCKED VMAs will be moved |
| 461 | to the unevictable LRU and the rest can be reclaimed.
| 462 | |
| 463 | See also the comment in follow_trans_huge_pmd().
Eric B Munson | 922c055 | 2015-04-15 16:13:23 -0700 | [diff] [blame] | 464 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 465 | mmap(MAP_LOCKED) System Call Handling |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 466 | ------------------------------------- |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 467 | |
Masanari Iida | df5cbb2 | 2014-03-21 10:04:30 +0900 | [diff] [blame] | 468 | In addition to the mlock()/mlockall() system calls, an application can request
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 469 | that a region of memory be mlocked by supplying the MAP_LOCKED flag to the mmap()
Michal Hocko | 9b012a2 | 2015-06-24 16:57:50 -0700 | [diff] [blame] | 470 | call. There is one important and subtle difference here, though. mmap() + mlock()
| 471 | will fail if the range cannot be faulted in (e.g. because mm_populate fails)
| 472 | and return ENOMEM, while mmap(MAP_LOCKED) will not fail. The mmaped
| 473 | area will still have the properties of a locked area - i.e. pages will not get
| 474 | swapped out - but major page faults to fault memory in might still happen.
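
From an application's point of view, the two approaches might look like the
sketch below. The helper names map_then_mlock() and map_locked() are
hypothetical; only mmap(), mlock() and munmap() are real interfaces. Either
call can also fail for unrelated reasons, such as the RLIMIT_MEMLOCK limit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory, then mlock() it explicitly.
 * Returns NULL with errno set if either step fails, so a failure to
 * fault the range in is visible to the caller.
 */
void *map_then_mlock(size_t len)
{
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	if (mlock(p, len) != 0) {
		int saved = errno;   /* munmap() may overwrite errno */

		munmap(p, len);
		errno = saved;
		return NULL;
	}
	return p;
}

/*
 * Map len bytes with MAP_LOCKED. A non-NULL result does not prove that
 * every page was faulted in; later accesses may still take page faults.
 */
void *map_locked(size_t len)
{
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

A caller that needs certainty that the whole range is resident would
therefore prefer the explicit mlock() variant and check its result.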
| 475 | |
| 476 | Furthermore, any mmap() call or brk() call that expands the heap by a |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 477 | task that has previously called mlockall() with the MCL_FUTURE flag will result |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 478 | in the newly mapped memory being mlocked. Before the unevictable/mlock |
| 479 | changes, the kernel simply called make_pages_present() to allocate pages and |
| 480 | populate the page table. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 481 | |
| 482 | To mlock a range of memory under the unevictable/mlock infrastructure, the |
| 483 | mmap() handler and task address space expansion functions call |
Kirill A. Shutemov | fc05f56 | 2015-04-14 15:44:39 -0700 | [diff] [blame] | 484 | populate_vma_page_range() specifying the vma and the address range to mlock. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 485 | |
Kirill A. Shutemov | fc05f56 | 2015-04-14 15:44:39 -0700 | [diff] [blame] | 486 | The callers of populate_vma_page_range() will have already added the memory range |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 487 | to be mlocked to the task's "locked_vm". To account for filtered VMAs, |
Kirill A. Shutemov | fc05f56 | 2015-04-14 15:44:39 -0700 | [diff] [blame] | 488 | populate_vma_page_range() returns the number of pages NOT mlocked. All of the |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 489 | callers then subtract a non-negative return value from the task's locked_vm. A |
| 490 | negative return value represents an error - for example, from get_user_pages()
| 491 | attempting to fault in a VMA with PROT_NONE access. In this case, we leave the |
| 492 | memory range accounted as locked_vm, as the protections could be changed later |
| 493 | and pages allocated into that region. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 494 | |
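The accounting rule in the paragraph above can be written down as a small
pure function. This is a toy model under the stated convention only:
settle_locked_vm() is a hypothetical name, and the ret argument merely models
the documented return value (non-negative: pages NOT mlocked; negative:
error).

```c
#include <assert.h>

/*
 * locked_vm: the task's count after the caller has already added the
 *            whole range to it.
 * ret:       modelled return value of the populate step.
 * Returns the locked_vm value the caller should end up with.
 */
long settle_locked_vm(long locked_vm, long ret)
{
	assert(locked_vm >= 0);
	if (ret >= 0) {
		assert(ret <= locked_vm);
		return locked_vm - ret;  /* drop pages that were filtered out */
	}
	return locked_vm;                /* error: keep the range accounted */
}
```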
| 495 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 496 | munmap()/exit()/exec() System Call Handling |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 497 | ------------------------------------------- |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 498 | |
| 499 | When unmapping an mlocked region of memory, whether by an explicit call to |
| 500 | munmap() or via an internal unmap from exit() or exec() processing, we must |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 501 | munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages. |
Hugh Dickins | 63d6c5a | 2009-01-06 14:39:38 -0800 | [diff] [blame] | 502 | Before the unevictable/mlock changes, mlocking did not mark the pages in any |
| 503 | way, so unmapping them required no processing. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 504 | |
| 505 | To munlock a range of memory under the unevictable/mlock infrastructure, the |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 506 | munmap() handler and task address space tear down functions call
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 507 | munlock_vma_pages_all(). The name reflects the observation that one always |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 508 | specifies the entire VMA range when munlock()ing during unmap of a region. |
| 509 | Because of the VMA filtering when mlock()ing regions, only "normal" VMAs that
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 510 | actually contain mlocked pages will be passed to munlock_vma_pages_all(). |
| 511 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 512 | munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup() |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 513 | for the munlock case, calls __munlock_vma_pages_range() to walk the page table |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 514 | for the VMA's memory range and munlock_vma_page() each resident page mapped by |
| 515 | the VMA. This effectively munlocks the page, only if this is the last |
| 516 | VM_LOCKED VMA that maps the page. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 517 | |
| 518 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 519 | try_to_unmap() |
| 520 | -------------- |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 521 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 522 | Pages can, of course, be mapped into multiple VMAs. Some of these VMAs may |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 523 | have the VM_LOCKED flag set. It is possible for a page mapped into one or more
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 524 | VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one |
| 525 | of the active or inactive LRU lists. This could happen if, for example, a task |
| 526 | in the process of munlocking the page could not isolate the page from the LRU. |
| 527 | As a result, vmscan/shrink_page_list() might encounter such a page as described |
| 528 | in section "vmscan's handling of unevictable pages". To handle this situation, |
| 529 | try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse |
| 530 | map. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 531 | |
| 532 | try_to_unmap() is always called, by either vmscan for reclaim or for page |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 533 | migration, with the argument page locked and isolated from the LRU. Separate |
Hugh Dickins | b87537d9e | 2015-11-05 18:49:33 -0800 | [diff] [blame] | 534 | functions handle anonymous and mapped file and KSM pages, as these types of |
| 535 | pages have different reverse map lookup mechanisms, with different locking. |
| 536 | In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(), |
| 537 | it will call try_to_unmap_one() for every VMA which might contain the page. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 538 | |
Hugh Dickins | b87537d9e | 2015-11-05 18:49:33 -0800 | [diff] [blame] | 539 | When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED |
| 540 | VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it, |
| 541 | and return SWAP_MLOCK to indicate that the page is unevictable: and the scan |
| 542 | stops there. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 543 | |
Hugh Dickins | b87537d9e | 2015-11-05 18:49:33 -0800 | [diff] [blame] | 544 | mlock_vma_page() is called while holding the page table's lock (in addition |
| 545 | to the page lock, and the rmap lock): to serialize against concurrent mlock or |
| 546 | munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim, |
| 547 | holepunching, and truncation of file pages and their anonymous COWed pages. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 548 | |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 549 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 550 | try_to_munlock() Reverse Map Scan |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 551 | --------------------------------- |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 552 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 553 | .. warning:: |
| 554 | [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the |
| 555 | page_referenced() reverse map walker. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 556 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 557 | When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call |
| 558 | Handling <munlock_munlockall_handling>` above] tries to munlock a |
| 559 | page, it needs to determine whether or not the page is mapped by any |
| 560 | VM_LOCKED VMA without actually attempting to unmap all PTEs from the |
| 561 | page. For this purpose, the unevictable/mlock infrastructure |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 562 | introduced a variant of try_to_unmap() called try_to_munlock(). |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 563 | |
| 564 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and |
Hugh Dickins | b87537d9e | 2015-11-05 18:49:33 -0800 | [diff] [blame] | 565 | mapped file and KSM pages with a flag argument specifying unlock versus unmap |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 566 | processing. Again, these functions walk the respective reverse maps looking |
Hugh Dickins | 7a14239 | 2015-11-05 18:49:30 -0800 | [diff] [blame] | 567 | for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case, |
Hugh Dickins | b87537d9e | 2015-11-05 18:49:33 -0800 | [diff] [blame] | 568 | the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK. This |
| 569 | undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page().
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 570 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 571 | Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's |
| 572 | reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. |
Hugh Dickins | b87537d9e | 2015-11-05 18:49:33 -0800 | [diff] [blame] | 573 | However, the scan can terminate when it encounters a VM_LOCKED VMA. |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 574 | Although try_to_munlock() might be called a great many times when munlocking a |
| 575 | large region or tearing down a large address space that has been mlocked via |
| 576 | mlockall(), overall this is a fairly rare event. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 577 | |
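The asymmetry noted above (every VMA must be visited to prove a page is not
locked, but the walk can stop at the first VM_LOCKED VMA) is shown by the
following toy walk over an array. struct toy_vma and
toy_page_mapped_locked() are hypothetical names; the real reverse map is not
an array and needs locking that this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: only the VM_LOCKED bit is modelled. */
struct toy_vma {
	bool vm_locked;
};

/*
 * Return true if any of the n VMAs is VM_LOCKED. *visited receives the
 * number of VMAs examined before the walk ended.
 */
bool toy_page_mapped_locked(const struct toy_vma *vmas, size_t n,
			    size_t *visited)
{
	size_t i;

	assert(visited != NULL);
	assert(vmas != NULL || n == 0);
	*visited = 0;
	for (i = 0; i < n; i++) {
		(*visited)++;
		if (vmas[i].vm_locked)
			return true;   /* first hit ends the walk */
	}
	return false;                  /* needed to examine every VMA */
}
```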
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 578 | |
Mike Rapoport | a5e4da9 | 2018-03-21 21:22:42 +0200 | [diff] [blame^] | 579 | Page Reclaim in shrink_*_list() |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 580 | ------------------------------- |
| 581 | |
| 582 | shrink_active_list() culls any obviously unevictable pages - i.e. |
Hugh Dickins | 39b5f29 | 2012-10-08 16:33:18 -0700 | [diff] [blame] | 583 | !page_evictable(page) - diverting these to the unevictable list. |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 584 | However, shrink_active_list() only sees unevictable pages that made it onto the |
| 585 | active/inactive LRU lists. Note that these pages do not have PageUnevictable
| 586 | set - otherwise they would be on the unevictable list and shrink_active_list()
| 587 | would never see them. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 588 | |
| 589 | Some examples of these unevictable pages on the LRU lists are: |
| 590 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 591 | (1) ramfs pages that have been placed on the LRU lists when first allocated. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 592 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 593 | (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to |
| 594 | allocate or fault in the pages in the shared memory region. This happens |
| 595 | when an application accesses the page the first time after SHM_LOCK'ing |
| 596 | the segment. |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 597 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 598 | (3) mlocked pages that could not be isolated from the LRU and moved to the |
| 599 | unevictable list in mlock_vma_page(). |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 600 | |
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 601 | shrink_inactive_list() also diverts any unevictable pages that it finds on the |
| 602 | inactive lists to the appropriate zone's unevictable list. |
| 603 | |
| 604 | shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd |
| 605 | after shrink_active_list() had moved them to the inactive list, or pages mapped |
| 606 | into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to |
| 607 | recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter, |
| 608 | but will pass on to shrink_page_list(). |
Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 609 | |
| 610 | shrink_page_list() again culls obviously unevictable pages that it could |
Hugh Dickins | 63d6c5a | 2009-01-06 14:39:38 -0800 | [diff] [blame] | 611 | encounter for similar reasons to shrink_inactive_list(). Pages mapped into
David Howells | c24b720 | 2009-04-13 14:40:01 -0700 | [diff] [blame] | 612 | VM_LOCKED VMAs but without PG_mlocked set will make it all the way to |
Hugh Dickins | 63d6c5a | 2009-01-06 14:39:38 -0800 | [diff] [blame] | 613 | try_to_unmap(). shrink_page_list() will divert them to the unevictable list |
| 614 | when try_to_unmap() returns SWAP_MLOCK, as discussed above. |