Lee Schermerhorn | fa07e78 | 2008-10-18 20:26:47 -0700 | [diff] [blame] | 1 | |
| 2 | This document describes the Linux memory management "Unevictable LRU" |
| 3 | infrastructure and the use of this infrastructure to manage several types |
| 4 | of "unevictable" pages. The document attempts to provide the overall |
| 5 | rationale behind this mechanism and the rationale for some of the design |
| 6 | decisions that drove the implementation. The latter design rationale is |
| 7 | discussed in the context of an implementation description. Admittedly, one |
| 8 | can obtain the implementation details--the "what does it do?"--by reading the |
| 9 | code. One hopes that the descriptions below add value by providing the answer |
| 10 | to "why does it do that?". |
| 11 | |
| 12 | Unevictable LRU Infrastructure: |
| 13 | |
| 14 | The Unevictable LRU adds an additional LRU list to track unevictable pages |
| 15 | and to hide these pages from vmscan. This mechanism is based on a patch by |
| 16 | Larry Woodman of Red Hat to address several scalability problems with page |
| 17 | reclaim in Linux. The problems have been observed at customer sites on large |
| 18 | memory x86_64 systems. For example, a non-NUMA x86_64 platform with 128GB |
| 19 | of main memory will have over 32 million 4k pages in a single zone. When a |
| 20 | large fraction of these pages are not evictable for any reason [see below], |
| 21 | vmscan will spend a lot of time scanning the LRU lists looking for the small |
| 22 | fraction of pages that are evictable. This can result in a situation where |
| 23 | all cpus are spending 100% of their time in vmscan for hours or days on end, |
| 24 | with the system completely unresponsive. |
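The page count quoted above, and the cost of scanning past unevictable pages, can be checked with a little arithmetic. The sketch below is an illustrative userspace calculation, not kernel code. Both function names are hypothetical, and the second function assumes evictable pages are spread uniformly through the lists.

```c
#include <assert.h>
#include <stdint.h>

/* Number of base pages needed to cover mem_bytes of memory. */
static uint64_t pages_in_memory(uint64_t mem_bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0);
    return mem_bytes / page_bytes;
}

/*
 * Average number of pages a scan must examine to find one evictable
 * page, ASSUMING evictable pages are uniformly distributed.
 */
static uint64_t scans_per_evictable_page(uint64_t total, uint64_t unevictable)
{
    assert(unevictable < total);
    return total / (total - unevictable);
}
```

With 128GB of memory and 4k pages, pages_in_memory() yields 33554432 pages. If 99 of every 100 pages are unevictable, a scan examines about 100 pages for each one it can reclaim.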
| 25 | |
| 26 | The Unevictable LRU infrastructure addresses the following classes of |
| 27 | unevictable pages: |
| 28 | |
| 29 | + pages owned by ramfs |
| 30 | + pages mapped into SHM_LOCKed shared memory regions |
| 31 | + pages mapped into VM_LOCKED [mlock()ed] vmas |
| 32 | |
| 33 | The infrastructure might be able to handle other conditions that make pages |
| 34 | unevictable, either by definition or by circumstance, in the future. |
| 35 | |
| 36 | |
| 37 | The Unevictable LRU List |
| 38 | |
| 39 | The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list |
| 40 | called the "unevictable" list and an associated page flag, PG_unevictable, to |
| 41 | indicate that the page is being managed on the unevictable list. The |
| 42 | PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active |
| 43 | flag in that it indicates on which LRU list a page resides when PG_lru is set. |
| 44 | The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU |
| 45 | Kconfig option. |
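The relationship between PG_lru, PG_active and PG_unevictable can be sketched as a small userspace model. The struct, enum and function names below are hypothetical and much simpler than the kernel's own definitions. They only illustrate that the two flags are mutually exclusive and together select a list.

```c
#include <assert.h>
#include <stdbool.h>

enum lru_kind { LRU_KIND_INACTIVE, LRU_KIND_ACTIVE, LRU_KIND_UNEVICTABLE };

struct model_page {
    bool lru;          /* models PG_lru: page is on some LRU list */
    bool active;       /* models PG_active */
    bool unevictable;  /* models PG_unevictable */
};

/* Which list holds a page that has PG_lru set. */
static enum lru_kind model_page_lru(const struct model_page *p)
{
    assert(p->lru);
    assert(!(p->active && p->unevictable));  /* mutually exclusive flags */
    if (p->unevictable)
        return LRU_KIND_UNEVICTABLE;
    return p->active ? LRU_KIND_ACTIVE : LRU_KIND_INACTIVE;
}
```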
| 46 | |
| 47 | The Unevictable LRU infrastructure maintains unevictable pages on an additional |
| 48 | LRU list for a few reasons: |
| 49 | |
| 50 | 1) We get to "treat unevictable pages just like we treat other pages in the |
| 51 | system, which means we get to use the same code to manipulate them, the |
| 52 | same code to isolate them (for migrate, etc.), the same code to keep track |
| 53 | of the statistics, etc..." [Rik van Riel] |
| 54 | |
| 55 | 2) We want to be able to migrate unevictable pages between nodes--for memory |
| 56 | defragmentation, workload management and memory hotplug. The Linux kernel |
| 57 | can only migrate pages that it can successfully isolate from the lru lists. |
| 58 | If we were to maintain pages anywhere other than on an lru-like list, where they |
| 59 | can be found by isolate_lru_page(), we would prevent their migration, unless |
| 60 | we reworked migration code to find the unevictable pages. |
| 61 | |
| 62 | |
| 63 | The unevictable LRU list does not differentiate between file backed and swap |
| 64 | backed [anon] pages. This differentiation is only important while the pages |
| 65 | are, in fact, evictable. |
| 66 | |
| 67 | The unevictable LRU list benefits from the "arrayification" of the per-zone |
| 68 | LRU lists and statistics originally proposed and posted by Christoph Lameter. |
| 69 | |
| 70 | The unevictable list does not use the lru pagevec mechanism. Rather, |
| 71 | unevictable pages are placed directly on the page's zone's unevictable |
| 72 | list under the zone lru_lock. The reason for this is to prevent stranding |
| 73 | of pages on the unevictable list when one task has the page isolated from the |
| 74 | lru and other tasks are changing the "evictability" state of the page. |
| 75 | |
| 76 | |
| 77 | Unevictable LRU and Memory Controller Interaction |
| 78 | |
| 79 | The memory controller data structure automatically gets a per zone unevictable |
| 80 | lru list as a result of the "arrayification" of the per-zone LRU lists. The |
| 81 | memory controller tracks the movement of pages to and from the unevictable list. |
| 82 | When a memory control group comes under memory pressure, the controller will |
| 83 | not attempt to reclaim pages on the unevictable list. This has a couple of |
| 84 | effects. Because the pages are "hidden" from reclaim on the unevictable list, |
| 85 | the reclaim process can be more efficient, dealing only with pages that have |
| 86 | a chance of being reclaimed. On the other hand, if too many of the pages |
| 87 | charged to the control group are unevictable, the evictable portion of the |
| 88 | working set of the tasks in the control group may not fit into the available |
| 89 | memory. This can cause the control group to thrash or to oom-kill tasks. |
| 90 | |
| 91 | |
| 92 | Unevictable LRU: Detecting Unevictable Pages |
| 93 | |
| 94 | The function page_evictable(page, vma) in vmscan.c determines whether a |
| 95 | page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions, |
| 96 | page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's |
| 97 | address space using a wrapper function. Wrapper functions are used to set, |
| 98 | clear and test the flag to reduce the requirement for #ifdef's throughout the |
| 99 | source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created. |
| 100 | This flag remains for the life of the inode. |
| 101 | |
| 102 | For shared memory regions, AS_UNEVICTABLE is set when an application |
| 103 | successfully SHM_LOCKs the region and is removed when the region is |
| 104 | SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page |
| 105 | tables for the region as does, for example, mlock(). So, we make no special |
| 106 | effort to push any pages in the SHM_LOCKed region to the unevictable list. |
| 107 | Vmscan will do this when/if it encounters the pages during reclaim. On |
| 108 | SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the |
| 109 | unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed |
| 110 | region is destroyed, the pages are also "rescued" from the unevictable list in |
| 111 | the process of freeing them. |
| 112 | |
| 113 | page_evictable() detects mlock()ed pages by testing an additional page flag, |
| 114 | PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a |
| 115 | non-NULL vma is supplied, page_evictable() will check whether the vma is |
| 116 | VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and |
| 117 | update the appropriate statistics if the vma is VM_LOCKED. This method allows |
| 118 | efficient "culling" of pages in the fault path that are being faulted in to |
| 119 | VM_LOCKED vmas. |
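The order of the tests described in this section can be sketched as a userspace model. All types and names here are simplified stand-ins prefixed with model_, not the kernel's definitions, and the single counter stands in for "the appropriate statistics".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_mapping { bool unevictable; };  /* models AS_UNEVICTABLE */
struct model_vma     { bool vm_locked; };    /* models VM_LOCKED */
struct model_page {
    struct model_mapping *mapping;           /* NULL for anonymous pages */
    bool mlocked;                            /* models PG_mlocked */
};

static unsigned long nr_mlock;               /* models the mlocked-page count */

/* If the vma is VM_LOCKED, mark the page mlocked (counting it once). */
static bool model_is_mlocked_vma(const struct model_vma *vma,
                                 struct model_page *page)
{
    if (!vma->vm_locked)
        return false;
    if (!page->mlocked) {
        page->mlocked = true;
        nr_mlock++;
    }
    return true;
}

static bool model_page_evictable(struct model_page *page,
                                 const struct model_vma *vma)
{
    if (page->mapping && page->mapping->unevictable)
        return false;                        /* ramfs or SHM_LOCKed region */
    if (page->mlocked)
        return false;                        /* already noticed as mlocked */
    if (vma && model_is_mlocked_vma(vma, page))
        return false;                        /* culled in the fault path */
    return true;
}
```

Passing a NULL vma skips the VM_LOCKED check, so an unnoticed page in a locked vma still looks evictable until some caller supplies the vma.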
| 120 | |
| 121 | |
| 122 | Unevictable Pages and Vmscan [shrink_*_list()] |
| 123 | |
| 124 | If unevictable pages are culled in the fault path, or moved to the unevictable |
| 125 | list at mlock() or mmap() time, vmscan will never encounter the pages until |
| 126 | they have become evictable again, for example, via munlock() and have been |
| 127 | "rescued" from the unevictable list. However, there may be situations where we |
| 128 | decide, for the sake of expediency, to leave an unevictable page on one of the |
| 129 | regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for |
| 130 | such pages in all of the shrink_{active|inactive|page}_list() functions and |
| 131 | will "cull" such pages that it encounters--that is, it diverts those pages to |
| 132 | the unevictable list for the zone being scanned. |
| 133 | |
| 134 | There may be situations where a page is mapped into a VM_LOCKED vma, but the |
| 135 | page is not marked as PageMlocked. Such pages will make it all the way to |
| 136 | shrink_page_list() where they will be detected when vmscan walks the reverse |
| 137 | map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() |
| 138 | will cull the page at that point. |
| 139 | |
| 140 | To "cull" an unevictable page, vmscan simply puts the page back on the lru |
| 141 | list using putback_lru_page()--the inverse operation to isolate_lru_page()-- |
| 142 | after dropping the page lock. Because the condition which makes the page |
| 143 | unevictable may change once the page is unlocked, putback_lru_page() will |
| 144 | recheck the unevictable state of a page that it places on the unevictable lru |
| 145 | list. If the page has become evictable, putback_lru_page() removes it from |
| 146 | the list and retries, including the page_evictable() test. Because such a |
| 147 | race is a rare event and movement of pages onto the unevictable list should be |
| 148 | rare, these extra evictability checks should not occur in the majority of calls |
| 149 | to putback_lru_page(). |
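The recheck-and-retry behavior can be sketched as a userspace model. It assumes the recheck looks for a page that became evictable after it was placed on the unevictable list. The unlock_races field is a hypothetical knob that simulates a racing task, and none of these names are kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

enum model_list { ON_NONE, ON_NORMAL, ON_UNEVICTABLE };

struct model_page {
    bool evictable;        /* current result of the evictability test */
    enum model_list list;  /* where the page currently sits */
    int unlock_races;      /* hypothetical: times a racing task makes the
                              page evictable right after placement */
};

/* Returns the number of placement attempts that were needed. */
static int model_putback_lru_page(struct model_page *p)
{
    int attempts = 0;

    assert(p->list == ON_NONE);            /* caller isolated the page */
    for (;;) {
        attempts++;
        p->list = p->evictable ? ON_NORMAL : ON_UNEVICTABLE;

        /* Simulated race window: the state changes after placement. */
        if (p->unlock_races > 0) {
            p->unlock_races--;
            p->evictable = true;
        }

        /* The recheck only applies to pages put on the unevictable list. */
        if (p->list == ON_UNEVICTABLE && p->evictable) {
            p->list = ON_NONE;             /* isolate again and retry */
            continue;
        }
        return attempts;
    }
}
```

Without a race the loop runs once. A page stranded by a racing unlock costs one extra pass and ends up on a normal list.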
| 150 | |
| 151 | |
| 152 | Mlocked Pages: Prior Work |
| 153 | |
| 154 | The "Unevictable Mlocked Pages" infrastructure is based on work originally |
| 155 | posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". |
| 156 | Nick posted his patch as an alternative to a patch posted by Christoph |
| 157 | Lameter to achieve the same objective--hiding mlocked pages from vmscan. |
| 158 | In Nick's patch, he used one of the struct page lru list link fields as a count |
| 159 | of VM_LOCKED vmas that map the page. This use of the link field for a count |
| 160 | prevented the management of the pages on an LRU list. Thus, mlocked pages were |
| 161 | not migratable as isolate_lru_page() could not find them and the lru list link |
| 162 | field was not available to the migration subsystem. Nick resolved this by |
| 163 | putting mlocked pages back on the lru list before attempting to isolate them, |
| 164 | thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated |
| 165 | with the Unevictable LRU work, the count was replaced by walking the reverse |
| 166 | map to determine whether any VM_LOCKED vmas mapped the page. More on this |
| 167 | below. |
| 168 | |
| 169 | |
| 170 | Mlocked Pages: Basic Management |
| 171 | |
| 172 | Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of |
| 173 | unevictable pages. When such a page has been "noticed" by the memory |
| 174 | management subsystem, the page is marked with the PG_mlocked [PageMlocked()] |
| 175 | flag. A PageMlocked() page will be placed on the unevictable LRU list when |
| 176 | it is added to the LRU. Pages can be "noticed" by memory management in |
| 177 | several places: |
| 178 | |
| 179 | 1) in the mlock()/mlockall() system call handlers. |
| 180 | 2) in the mmap() system call handler when mmap()ing a region with the |
| 181 | MAP_LOCKED flag, or mmap()ing a region in a task that has called |
| 182 | mlockall() with the MCL_FUTURE flag. Both of these conditions result |
| 183 | in the VM_LOCKED flag being set for the vma. |
| 184 | 3) in the fault path, if mlocked pages are "culled" in the fault path, |
| 185 | and when a VM_LOCKED stack segment is expanded. |
Hugh Dickins | 63d6c5a | 2009-01-06 14:39:38 -0800 | [diff] [blame^] | 186 | 4) as mentioned above, in vmscan:shrink_page_list() when attempting to |
| 187 | reclaim a page in a VM_LOCKED vma via try_to_unmap(). |
| 188 | |
| 189 | Mlocked pages become unlocked and rescued from the unevictable list when: |
| 190 | |
| 191 | 1) mapped in a range unlocked via the munlock()/munlockall() system calls. |
| 192 | 2) munmap()ed out of the last VM_LOCKED vma that maps the page, including |
| 193 | unmapping at task exit. |
| 194 | 3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file. |
| 195 | 4) before a page is COWed in a VM_LOCKED vma. |
| 196 | |
| 197 | |
| 198 | Mlocked Pages: mlock()/mlockall() System Call Handling |
| 199 | |
| 200 | Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() |
| 201 | for each vma in the range specified by the call. In the case of mlockall(), |
| 202 | this is the entire active address space of the task. Note that mlock_fixup() |
| 203 | is used for both mlock()ing and munlock()ing a range of memory. A call to |
| 204 | mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED |
| 205 | is treated as a no-op--mlock_fixup() simply returns. |
| 206 | |
| 207 | If the vma passes some filtering described in "Mlocked Pages: Filtering Special Vmas" |
| 208 | below, mlock_fixup() will attempt to merge the vma with its neighbors or split |
| 209 | off a subset of the vma if the range does not cover the entire vma. Once the |
| 210 | vma has been merged or split or neither, mlock_fixup() will call |
| 211 | __mlock_vma_pages_range() to fault in the pages via get_user_pages() and |
| 212 | to mark the pages as mlocked via mlock_vma_page(). |
| 213 | |
| 214 | Note that the vma being mlocked might be mapped with PROT_NONE. In this case, |
| 215 | get_user_pages() will be unable to fault in the pages. That's OK. If pages |
| 216 | do end up getting faulted into this VM_LOCKED vma, we'll handle them in the |
| 217 | fault path or in vmscan. |
| 218 | |
| 219 | Also note that a page returned by get_user_pages() could be truncated or |
| 220 | migrated out from under us, while we're trying to mlock it. To detect |
| 221 | this, __mlock_vma_pages_range() tests the page_mapping after acquiring |
| 222 | the page lock. If the page is still associated with its mapping, we'll |
| 223 | go ahead and call mlock_vma_page(). If the mapping is gone, we just |
| 224 | unlock the page and move on. Worst case, this results in a page mapped |
| 225 | in a VM_LOCKED vma remaining on a normal LRU list without being |
| 226 | PageMlocked(). Again, vmscan will detect and cull such pages. |
| 227 | |
| 228 | mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will |
| 229 | TestSetPageMlocked() for each page returned by get_user_pages(). We use |
| 230 | TestSetPageMlocked() because the page might already be mlocked by another |
| 231 | task/vma and we don't want to do extra work. We especially do not want to |
| 232 | count an mlocked page more than once in the statistics. If the page was |
| 233 | already mlocked, mlock_vma_page() is done. |
| 234 | |
| 235 | If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the |
| 236 | page from the LRU, as it is likely on the appropriate active or inactive list |
| 237 | at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will |
| 238 | put the page back--putback_lru_page()--which will notice that the page is now |
| 239 | mlocked and divert the page to the zone's unevictable LRU list. If |
| 240 | mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle |
| 241 | it later if/when it attempts to reclaim the page. |
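The test-and-set step and the isolate/putback sequence can be sketched as a userspace model. The isolate_ok field is a hypothetical knob standing in for the outcome of isolate_lru_page(). The names are illustrative, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

enum model_list { ON_NORMAL, ON_UNEVICTABLE };

struct model_page {
    bool mlocked;          /* models PG_mlocked */
    enum model_list list;
    bool isolate_ok;       /* hypothetical: can the page be isolated? */
};

static unsigned long nr_mlock;  /* models the zone's mlocked-page count */

/* Test-and-set: sets the flag and returns its previous value. */
static bool model_test_set_mlocked(struct model_page *p)
{
    bool old = p->mlocked;
    p->mlocked = true;
    return old;
}

static void model_mlock_vma_page(struct model_page *p)
{
    if (model_test_set_mlocked(p))
        return;                    /* already mlocked: nothing to do */
    nr_mlock++;                    /* counted exactly once */
    if (p->list == ON_NORMAL && p->isolate_ok)
        p->list = ON_UNEVICTABLE;  /* isolate, then putback diverts it */
    /* otherwise the page stays put; vmscan will cull it later */
}
```

Calling model_mlock_vma_page() twice on the same page bumps the counter once, which is the point of using the test-and-set form.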
| 242 | |
| 243 | |
| 244 | Mlocked Pages: Filtering Special Vmas |
| 245 | |
| 246 | mlock_fixup() filters several classes of "special" vmas: |
| 247 | |
| 248 | 1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind |
| 249 | these mappings are inherently pinned, so we don't need to mark them as |
| 250 | mlocked. In any case, most of the pages have no struct page in which to |
| 251 | so mark the page. Because of this, get_user_pages() will fail for these |
| 252 | vmas, so there is no sense in attempting to visit them. |
| 253 | |
| 254 | 2) vmas mapping hugetlbfs pages are already effectively pinned into memory. |
| 255 | We neither need nor want to mlock() these pages. However, to preserve the |
| 256 | prior behavior of mlock()--before the unevictable/mlock changes-- |
| 257 | mlock_fixup() will call make_pages_present() in the hugetlbfs vma range |
| 258 | to allocate the huge pages and populate the ptes. |
| 259 | |
| 260 | 3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of |
| 261 | kernel pages, such as the vdso page, relay channel pages, etc. These pages |
| 262 | are inherently unevictable and are not managed on the LRU lists. |
| 263 | mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls |
| 264 | make_pages_present() to populate the ptes. |
| 265 | |
| 266 | Note that for all of these special vmas, mlock_fixup() does not set the |
| 267 | VM_LOCKED flag. Therefore, we won't have to deal with them later during |
| 268 | munlock() or munmap()--for example, at task exit. Neither does mlock_fixup() |
| 269 | account these vmas against the task's "locked_vm". |
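The three filtering classes can be sketched as a small classifier. The flag bits below are illustrative values, NOT the kernel's actual values. M_VM_HUGETLB is an assumed stand-in for "this vma maps hugetlbfs pages", and the function names are hypothetical.

```c
#include <assert.h>

/* Illustrative flag bits; NOT the kernel's actual values. */
#define M_VM_LOCKED     0x01u
#define M_VM_IO         0x02u
#define M_VM_PFNMAP     0x04u
#define M_VM_HUGETLB    0x08u   /* assumed marker for hugetlbfs vmas */
#define M_VM_DONTEXPAND 0x10u
#define M_VM_RESERVED   0x20u

enum mlock_action { SKIP, POPULATE_ONLY, MLOCK_PAGES };

static enum mlock_action model_filter_vma(unsigned int flags)
{
    if (flags & (M_VM_IO | M_VM_PFNMAP))
        return SKIP;            /* inherently pinned, often no struct page */
    if (flags & (M_VM_HUGETLB | M_VM_DONTEXPAND | M_VM_RESERVED))
        return POPULATE_ONLY;   /* make pages present, but do not mlock */
    return MLOCK_PAGES;
}

/* Only "normal" vmas end up with the locked flag set. */
static unsigned int model_apply_mlock(unsigned int flags)
{
    if (model_filter_vma(flags) == MLOCK_PAGES)
        flags |= M_VM_LOCKED;
    return flags;
}
```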
| 270 | |
| 271 | Mlocked Pages: Downgrading the Mmap Semaphore |
| 272 | |
| 273 | mlock_fixup() must be called with the mmap semaphore held for write, because |
| 274 | it may have to merge or split vmas. However, mlocking a large region of |
| 275 | memory can take a long time--especially if vmscan must reclaim pages to |
| 276 | satisfy the region's requirements. Faulting in a large region with the mmap |
| 277 | semaphore held for write can hold off other faults on the address space, in |
| 278 | the case of a multi-threaded task. It can also hold off scans of the task's |
| 279 | address space via /proc. While testing under heavy load, it was observed that |
| 280 | the ps(1) command could be held off for many minutes while a large segment was |
| 281 | mlock()ed down. |
| 282 | |
| 283 | To address this issue, and to make the system more responsive during mlock()ing |
| 284 | of large segments, mlock_fixup() downgrades the mmap semaphore to read mode |
| 285 | during the call to __mlock_vma_pages_range(). This works fine. However, the |
| 286 | callers of mlock_fixup() expect the semaphore to be returned in write mode. |
| 287 | So, mlock_fixup() "upgrades" the semaphore to write mode. Linux does not |
| 288 | support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore |
| 289 | and reacquire it in write mode. In a multi-threaded task, it is possible for |
| 290 | the task memory map to change while the semaphore is dropped. Therefore, |
| 291 | mlock_fixup() looks up the vma at the range start address after reacquiring |
| 292 | the semaphore in write mode and verifies that it still covers the original |
| 293 | range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of |
| 294 | mlock_fixup() have been changed to deal with this new error condition. |
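The revalidation step after reacquiring the semaphore can be sketched as follows. This userspace model represents the address space as a plain array of ranges. The struct and function names are hypothetical and the lookup is a linear search, unlike the kernel's vma lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_vma { unsigned long start, end; };  /* covers [start, end) */

/* Find the vma containing addr, or NULL if none does. */
static const struct model_vma *model_find_vma(const struct model_vma *vmas,
                                              size_t n, unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= vmas[i].start && addr < vmas[i].end)
            return &vmas[i];
    return NULL;
}

/*
 * Called after the semaphore was dropped and retaken in write mode:
 * does a single vma still cover the original [start, end) range?
 */
static int model_recheck_range(const struct model_vma *vmas, size_t n,
                               unsigned long start, unsigned long end)
{
    const struct model_vma *vma = model_find_vma(vmas, n, start);

    if (!vma || vma->end < end)
        return -EAGAIN;   /* the map changed while the lock was dropped */
    return 0;
}
```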
| 295 | |
| 296 | Note: when munlocking a region, all of the pages should already be resident-- |
| 297 | unless we have racing threads mlocking() and munlocking() regions. So, |
| 298 | unlocking should not have to wait for page allocations nor faults of any kind. |
| 299 | Therefore mlock_fixup() does not downgrade the semaphore for munlock(). |
| 300 | |
| 301 | |
| 302 | Mlocked Pages: munlock()/munlockall() System Call Handling |
| 303 | |
| 304 | The munlock() and munlockall() system calls are handled by the same functions-- |
| 305 | do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock |
| 306 | vs lock operation indicated by an argument. So, these system calls are also |
| 307 | handled by mlock_fixup(). Again, if called for an already munlock()ed vma, |
| 308 | mlock_fixup() simply returns. Because of the vma filtering discussed above, |
| 309 | VM_LOCKED will not be set in any "special" vmas. So, these vmas will be |
| 310 | ignored for munlock. |
| 311 | |
| 312 | If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off |
| 313 | the specified range. The range is then munlocked via the function |
| 314 | __mlock_vma_pages_range()--the same function used to mlock a vma range-- |
| 315 | passing a flag to indicate that munlock() is being performed. |
| 316 | |
| 317 | Because the vma access protections could have been changed to PROT_NONE after |
| 318 | faulting in and mlocking pages, get_user_pages() was unreliable for visiting |
| 319 | these pages for munlocking. Because we don't want to leave pages mlocked, |
| 320 | get_user_pages() was enhanced to accept a flag to ignore the permissions when |
| 321 | fetching the pages--all of which should be resident as a result of previous |
| 322 | mlock()ing. |
| 323 | |
| 324 | For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling |
| 325 | munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked |
| 326 | flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() |
| 327 | uses the Test*PageMlocked() function to handle the case where the page might |
| 328 | have already been unlocked by another task. If the page was mlocked, |
| 329 | munlock_vma_page() updates the zone statistics for the number of mlocked |
| 330 | pages. Note, however, that at this point we haven't checked whether the page |
| 331 | is mapped by other VM_LOCKED vmas. |
| 332 | |
| 333 | We can't call try_to_munlock(), the function that walks the reverse map to check |
| 334 | for other VM_LOCKED vmas, without first isolating the page from the LRU. |
| 335 | try_to_munlock() is a variant of try_to_unmap() and thus requires that the page |
| 336 | not be on an lru list. [More on these below.] However, the call to |
| 337 | isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). |
| 338 | So, we go ahead and clear PG_mlocked up front, as this might be the only chance |
| 339 | we have. If we can successfully isolate the page, we go ahead and |
| 340 | try_to_munlock(), which will restore the PG_mlocked flag and update the zone |
| 341 | page statistics if it finds another vma holding the page mlocked. If we fail |
| 342 | to isolate the page, we'll have left a potentially mlocked page on the LRU. |
| 343 | This is fine, because we'll catch it later when/if vmscan tries to reclaim the |
| 344 | page. This should be relatively rare. |
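The "clear first, then check the reverse map" sequence can be sketched as a userspace model. The other_locked_vmas field stands in for what the reverse-map walk would find, and isolate_ok is a hypothetical knob for the outcome of isolate_lru_page(). None of these names are kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

struct model_page {
    bool mlocked;            /* models PG_mlocked */
    int other_locked_vmas;   /* VM_LOCKED vmas that still map the page */
    bool isolate_ok;         /* hypothetical: can the page be isolated? */
    bool on_unevictable;     /* which list the page sits on */
};

static long nr_mlock;        /* models the zone's mlocked-page count */

static void model_munlock_vma_page(struct model_page *p)
{
    if (!p->mlocked)
        return;                        /* another task already unlocked it */
    p->mlocked = false;                /* clear up front: may be our only chance */
    nr_mlock--;
    if (!p->isolate_ok)
        return;                        /* list unchanged; vmscan handles it later */
    if (p->other_locked_vmas > 0) {    /* reverse-map walk found another holder */
        p->mlocked = true;
        nr_mlock++;
    }
    p->on_unevictable = p->mlocked;    /* putback chooses the list */
}
```

When isolation fails, the flag stays clear even though another locked vma may map the page. This is the case the text leaves for vmscan to catch.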
| 345 | |
| 346 | Mlocked Pages: Migrating Them... |
| 347 | |
| 348 | A page that is being migrated has been isolated from the lru lists and is |
| 349 | held locked across unmapping of the page, updating the page's mapping |
| 350 | [address_space] entry and copying the contents and state, until the |
| 351 | page table entry has been replaced with an entry that refers to the new |
| 352 | page. Linux supports migration of mlocked pages and other unevictable |
| 353 | pages. This involves simply moving the PageMlocked and PageUnevictable states |
| 354 | from the old page to the new page. |
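Moving the two states from the old page to the new one can be sketched in a few lines. This is a deliberately minimal userspace model with hypothetical names. It ignores statistics and every other piece of page state that real migration copies.

```c
#include <assert.h>
#include <stdbool.h>

struct model_page {
    bool mlocked;      /* models PageMlocked */
    bool unevictable;  /* models PageUnevictable */
};

/* Move (not copy) both states from the old page to the new page. */
static void model_migrate_flags(struct model_page *newp,
                                struct model_page *oldp)
{
    newp->mlocked = oldp->mlocked;
    newp->unevictable = oldp->unevictable;
    oldp->mlocked = false;
    oldp->unevictable = false;
}
```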
| 355 | |
| 356 | Note that page migration can race with mlocking or munlocking of the same |
| 357 | page. This has been discussed from the mlock/munlock perspective in the |
| 358 | respective sections above. Both processes [migration, m[un]locking], hold |
| 359 | the page locked. This provides the first level of synchronization. Page |
| 360 | migration zeros out the page_mapping of the old page before unlocking it, |
| 361 | so m[un]lock can skip these pages by testing the page mapping under page |
| 362 | lock. |
| 363 | |
| 364 | When completing page migration, we place the new and old pages back onto the |
| 365 | lru after dropping the page lock. The "unneeded" page--old page on success, |
| 366 | new page on failure--will be freed when the reference count held by the |
| 367 | migration process is released. To ensure that we don't strand pages on the |
| 368 | unevictable list because of a race between munlock and migration, page |
| 369 | migration uses the putback_lru_page() function to add migrated pages back to |
| 370 | the lru. |
| 371 | |
| 372 | |
| 373 | Mlocked Pages: mmap(MAP_LOCKED) System Call Handling |
| 374 | |
| 375 | In addition to the mlock()/mlockall() system calls, an application can request |
| 376 | that a region of memory be mlocked using the MAP_LOCKED flag with the mmap() |
| 377 | call. Furthermore, any mmap() call or brk() call that expands the heap by a |
| 378 | task that has previously called mlockall() with the MCL_FUTURE flag will result |
| 379 | in the newly mapped memory being mlocked. Before the unevictable/mlock changes, |
| 380 | the kernel simply called make_pages_present() to allocate pages and populate |
| 381 | the page table. |
| 382 | |
| 383 | To mlock a range of memory under the unevictable/mlock infrastructure, the |
| 384 | mmap() handler and task address space expansion functions call |
| 385 | mlock_vma_pages_range() specifying the vma and the address range to mlock. |
| 386 | mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in |
| 387 | "Mlocked Pages: Filtering Special Vmas". It will clear the VM_LOCKED flag, which will |
| 388 | have already been set by the caller, in filtered vmas. Thus these vmas need |
| 389 | not be visited for munlock when the region is unmapped. |
| 390 | |
| 391 | For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to |
| 392 | fault/allocate the pages and mlock them. Again, like mlock_fixup(), |
| 393 | mlock_vma_pages_range() downgrades the mmap semaphore to read mode before |
| 394 | attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore |
| 395 | back to write mode before returning. |
| 396 | |
| 397 | The callers of mlock_vma_pages_range() will have already added the memory |
| 398 | range to be mlocked to the task's "locked_vm". To account for filtered vmas, |
| 399 | mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the |
| 400 | callers then subtract a non-negative return value from the task's locked_vm. |
| 401 | A negative return value represents an error--for example, from get_user_pages() |
| 402 | attempting to fault in a vma with PROT_NONE access. In this case, we leave |
| 403 | the memory range accounted as locked_vm, as the protections could be changed |
| 404 | later and pages allocated into that region. |
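The caller-side accounting rule can be sketched as a single function. The name model_adjust_locked_vm() is hypothetical. Its ret parameter stands in for the value returned by mlock_vma_pages_range() as described above: the number of pages NOT mlocked, or a negative error.

```c
#include <assert.h>

/*
 * locked_vm already includes the whole range.  A non-negative ret is
 * subtracted; a negative ret (an error) leaves the accounting alone.
 */
static unsigned long model_adjust_locked_vm(unsigned long locked_vm, long ret)
{
    if (ret < 0)
        return locked_vm;                /* error: keep the range accounted */
    assert((unsigned long)ret <= locked_vm);
    return locked_vm - (unsigned long)ret;
}
```

For example, a task with 100 pages accounted whose new vma was filtered for 16 of them ends up with 84, while an error return leaves the count at 100.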
| 405 | |
| 406 | |
| 407 | Mlocked Pages: munmap()/exit()/exec() System Call Handling |
| 408 | |
| 409 | When unmapping an mlocked region of memory, whether by an explicit call to |
| 410 | munmap() or via an internal unmap from exit() or exec() processing, we must |
| 411 | munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. |
| 412 | Before the unevictable/mlock changes, mlocking did not mark the pages in any |
| 413 | way, so unmapping them required no processing. |
| 414 | |
| 415 | To munlock a range of memory under the unevictable/mlock infrastructure, the |
| 416 | munmap() handler and task address space tear down functions call |
| 417 | munlock_vma_pages_all(). The name reflects the observation that one always |
| 418 | specifies the entire vma range when munlock()ing during unmap of a region. |
| 419 | Because of the vma filtering when mlocking() regions, only "normal" vmas that |
| 420 | actually contain mlocked pages will be passed to munlock_vma_pages_all(). |
| 421 | |
| 422 | munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup() |
| 423 | for the munlock case, calls __munlock_vma_pages_range() to walk the page table |
| 424 | for the vma's memory range and munlock_vma_page() each resident page mapped by |
| 425 | the vma. This effectively munlocks the page, only if this is the last |
| 426 | VM_LOCKED vma that maps the page. |
| 427 | |
| 428 | |
| 429 | Mlocked Pages: try_to_unmap() |
| 430 | |
| 431 | [Note: the code changes represented by this section are really quite small |
| 432 | compared to the text to describe what is happening and why, and to discuss the |
| 433 | implications.] |
| 434 | |
| 435 | Pages can, of course, be mapped into multiple vmas. Some of these vmas may |
| 436 | have the VM_LOCKED flag set. It is possible for a page mapped into one or more |
| 437 | VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one |
| 438 | of the active or inactive LRU lists. This could happen if, for example, a |
| 439 | task in the process of munlock()ing the page could not isolate the page from |
| 440 | the LRU. As a result, vmscan/shrink_page_list() might encounter such a page |
| 441 | as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To |
| 442 | handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED |
| 443 | vmas while it is walking a page's reverse map. |
| 444 | |
| 445 | try_to_unmap() is always called, by either vmscan for reclaim or for page |
| 446 | migration, with the argument page locked and isolated from the LRU. BUG_ON() |
| 447 | assertions enforce this requirement. Separate functions handle anonymous and |
| 448 | mapped file pages, as these types of pages have different reverse map |
| 449 | mechanisms. |
| 450 | |
| 451 | try_to_unmap_anon() |
| 452 | |
| 453 | To unmap anonymous pages, each vma in the list anchored in the anon_vma must be |
| 454 | visited--at least until a VM_LOCKED vma is encountered. If the page is being |
| 455 | unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked |
| 456 | pages are migratable. However, for reclaim, if the page is mapped into a |
| 457 | VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap |
| 458 | semaphore of the mm_struct to which the vma belongs in read mode. If this is |
| 459 | successful, try_to_unmap() will mlock the page via mlock_vma_page()--we |
| 460 | wouldn't have gotten to try_to_unmap() if the page were already mlocked--and |
| 461 | will return SWAP_MLOCK, indicating that the page is unevictable. If the |
| 462 | mmap semaphore cannot be acquired, we are not sure whether the page is really |
| 463 | unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN. |
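The decision made at each vma during the anonymous reverse-map walk can be sketched as a userspace model. The mmap_sem_free field is a hypothetical knob for whether the read-mode acquisition would succeed. The M_SWAP_* constants and other names are illustrative stand-ins for the kernel's return codes, not its definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum swap_result { M_SWAP_SUCCESS, M_SWAP_AGAIN, M_SWAP_MLOCK };

struct model_vma {
    bool vm_locked;        /* models VM_LOCKED */
    bool mmap_sem_free;    /* hypothetical: can mmap_sem be taken for read? */
};

static enum swap_result model_try_to_unmap_anon(const struct model_vma *vmas,
                                                size_t n, bool migration,
                                                bool *page_mlocked)
{
    for (size_t i = 0; i < n; i++) {
        if (vmas[i].vm_locked && !migration) {
            if (!vmas[i].mmap_sem_free)
                return M_SWAP_AGAIN;   /* cannot tell: let vmscan retry */
            *page_mlocked = true;      /* stands in for mlock_vma_page() */
            return M_SWAP_MLOCK;       /* page is unevictable */
        }
        /* otherwise the page would be unmapped from this vma */
    }
    return M_SWAP_SUCCESS;
}
```

For migration the VM_LOCKED check is bypassed, so the walk visits every vma and the page's mlocked state is left untouched.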
| 464 | |
| 465 | try_to_unmap_file() -- linear mappings |
| 466 | |
| 467 | Unmapping of a mapped file page works the same, except that the scan visits |
| 468 | all vmas that map the page's index/page offset in the page's mapping's |
| 469 | reverse map priority search tree. It must also visit each vma in the page's |
| 470 | mapping's non-linear list, if the list is non-empty. As for anonymous pages, |
| 471 | on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will |
| 472 | attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, |
| 473 | returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. |
| 474 | |
| 475 | try_to_unmap_file() -- non-linear mappings |
| 476 | |
| 477 | If a page's mapping contains a non-empty non-linear mapping vma list, then |
| 478 | try_to_un{map|lock}() must also visit each vma in that list to determine |
| 479 | whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit |
| 480 | all vmas in the non-linear list to ensure that the page is not/should not be |
| 481 | mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate. |
| 482 | However, there is no easy way to determine whether the page is actually mapped |
| 483 | in a given vma--either for unmapping or testing whether the VM_LOCKED vma |
| 484 | actually pins the page. |
| 485 | |
| 486 | So, try_to_unmap_file() handles non-linear mappings by scanning a certain |
| 487 | number of pages--a "cluster"--in each non-linear vma associated with the page's |
| 488 | mapping, for each file mapped page that vmscan tries to unmap. If this happens |
| 489 | to unmap the page we're trying to unmap, try_to_unmap() will notice this on |
| 490 | return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it |
| 491 | will return SWAP_AGAIN, causing vmscan to recirculate this page. We take |
| 492 | advantage of the cluster scan in try_to_unmap_cluster() as follows: |
| 493 | |
| 494 | For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap |
| 495 | semaphore of the associated mm_struct for read without blocking. If this |
| 496 | attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will |
| 497 | retain the mmap semaphore for the scan; otherwise it drops it here. Then, |
| 498 | for each page in the cluster, if we're holding the mmap semaphore for a locked |
| 499 | vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This |
| 500 | call is a no-op if the page is already locked, but will mlock any pages in |
| 501 | the non-linear mapping that happen to be unlocked. If one of the pages so |
| 502 | mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will |
| 503 | return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan |
| 504 | to cull the page, rather than recirculating it on the inactive list. Again, |
| 505 | if try_to_unmap_cluster() cannot acquire the vma's mmap semaphore, it returns
| 506 | SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but |
| 507 | couldn't be mlocked. |
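
The cluster scan for a single non-linear vma can be sketched as follows.
This is a simplified user-space model under stated assumptions: the struct
and function names are hypothetical, the non-blocking read acquisition of
the mmap semaphore is reduced to a boolean, and the unmapping done for
unlocked vmas is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Names follow the text; the numeric values are arbitrary in this model. */
enum swap_result { SWAP_SUCCESS, SWAP_AGAIN, SWAP_MLOCK };

/* Hypothetical stand-in for struct page. */
struct model_page {
	bool mlocked;        /* models PG_mlocked */
};

/* Hypothetical stand-in for one non-linear vma. */
struct model_vma {
	bool vm_locked;      /* models VM_LOCKED */
	bool mmap_sem_free;  /* would the non-blocking read acquire succeed? */
};

enum swap_result model_unmap_cluster(const struct model_vma *vma,
				     struct model_page **cluster, size_t n,
				     const struct model_page *check_page)
{
	enum swap_result ret = SWAP_AGAIN;  /* the default result */
	bool locked_vma;
	size_t i;

	assert(vma != NULL && (cluster != NULL || n == 0));

	/* Keep the semaphore only if we got it and the vma is VM_LOCKED. */
	locked_vma = vma->mmap_sem_free && vma->vm_locked;

	for (i = 0; i < n; i++) {
		if (locked_vma) {
			/* No-op for pages that are already mlocked. */
			cluster[i]->mlocked = true;
			if (cluster[i] == check_page)
				ret = SWAP_MLOCK;  /* vmscan can cull it */
			continue;
		}
		/* Unmapping of pages in unlocked vmas is not modelled. */
	}
	return ret;
}
```

Note that pages other than check_page are mlocked as a side effect of the
scan; only the return value depends on whether check_page itself was among
them.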
| 508 | |
| 509 | |
| 510 | Mlocked pages: try_to_munlock() Reverse Map Scan |
| 511 | |
| 512 | TODO/FIXME: a better name might be page_mlocked()--analogous to the |
Hugh Dickins | 63d6c5a | 2009-01-06 14:39:38 -0800 | [diff] [blame^] | 513 | page_referenced() reverse map walker. |
| 514 |
| 515 | When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall()
| 516 | System Call Handling" above--tries to munlock a page, it needs to
| 517 | determine whether or not the page is mapped by any VM_LOCKED vma, without
| 518 | actually attempting to unmap all ptes from the page. For this purpose, the |
| 519 | unevictable/mlock infrastructure introduced a variant of try_to_unmap() called |
| 520 | try_to_munlock(). |
| 521 | |
| 522 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and |
| 523 | mapped file pages with an additional argument specifying unlock versus unmap
| 524 | processing. Again, these functions walk the respective reverse maps looking |
| 525 | for VM_LOCKED vmas. When such a vma is found for anonymous pages and file |
| 526 | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions |
| 527 | attempt to acquire the associated mmap semaphore, mlock the page via
| 528 | mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
| 529 | pre-clearing of the page's PG_mlocked done by munlock_vma_page().
| 530 |
| 531 | If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap |
| 532 | semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() |
| 533 | to recycle the page on the inactive list and hope that it has better luck |
| 534 | with the page next time. |
| 535 | |
| 536 | For file pages mapped into non-linear vmas, the try_to_munlock() logic works |
| 537 | slightly differently. On encountering a VM_LOCKED non-linear vma that might |
| 538 | map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking |
| 539 | the page. munlock_vma_page() will just leave the page unlocked and let |
| 540 | vmscan deal with it--the usual fallback position. |
| 541 | |
| 542 | Note that try_to_munlock()'s reverse map walk must visit every vma in a page's
| 543 | reverse map to determine that a page is NOT mapped into any VM_LOCKED vma.
| 544 | However, the scan can terminate when it encounters a VM_LOCKED vma and can
| 545 | successfully acquire the vma's mmap semaphore for read and mlock the page.
| 546 | Although try_to_munlock() can be called many [very many!] times when |
| 547 | munlock()ing a large region or tearing down a large address space that has been |
| 548 | mlocked via mlockall(), overall this is a fairly rare event.
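
The termination rule for anonymous pages and linear file mappings can be
illustrated with a small user-space model. This is a sketch under stated
assumptions rather than the kernel's implementation: all struct and function
names are hypothetical, the semaphore trylock is a boolean, and
WALK_NO_LOCKED_VMA is a label invented here for the case where the walk
finds no VM_LOCKED vma. The non-linear case, which behaves differently, is
not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* SWAP_* names follow the text; WALK_NO_LOCKED_VMA is this sketch's own. */
enum munlock_result { WALK_NO_LOCKED_VMA, SWAP_AGAIN, SWAP_MLOCK };

/* Hypothetical stand-in for a vma in the page's reverse map. */
struct model_vma {
	bool vm_locked;      /* models VM_LOCKED */
	bool mmap_sem_free;  /* would a read trylock of the mmap semaphore succeed? */
};

/* Hypothetical stand-in for struct page. */
struct model_page {
	bool mlocked;        /* models PG_mlocked */
};

enum munlock_result model_munlock_walk(struct model_page *page,
				       const struct model_vma *vmas, size_t n,
				       size_t *visited)
{
	enum munlock_result ret = WALK_NO_LOCKED_VMA;
	size_t i;

	assert(page != NULL && visited != NULL);
	*visited = 0;

	for (i = 0; i < n; i++) {
		(*visited)++;
		if (!vmas[i].vm_locked)
			continue;             /* must keep looking */
		if (vmas[i].mmap_sem_free) {
			page->mlocked = true; /* undo the pre-clearing */
			return SWAP_MLOCK;    /* early termination */
		}
		ret = SWAP_AGAIN;  /* locked vma seen, but not confirmed */
	}
	return ret;
}
```

The visited counter exists only to make the cost visible: proving that no
VM_LOCKED vma maps the page costs a full walk, whereas finding one whose
semaphore can be taken ends the walk at that vma.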
| 549 |
| 550 | Mlocked Page: Page Reclaim in shrink_*_list() |
| 551 | |
| 552 | shrink_active_list() culls any obviously unevictable pages--i.e., |
| 553 | !page_evictable(page, NULL)--diverting these to the unevictable lru |
| 554 | list. However, shrink_active_list() only sees unevictable pages that |
| 555 | made it onto the active/inactive lru lists. Note that these pages do not |
| 556 | have PageUnevictable set--otherwise, they would be on the unevictable list and |
| 557 | shrink_active_list() would never see them.
| 558 | |
| 559 | Some examples of these unevictable pages on the LRU lists are: |
| 560 | |
| 561 | 1) ramfs pages that have been placed on the lru lists when first allocated. |
| 562 | |
| 563 | 2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to |
| 564 | allocate or fault in the pages in the shared memory region. This happens |
| 565 | when an application accesses the page the first time after SHM_LOCKing |
| 566 | the segment. |
| 567 | |
| 568 | 3) Mlocked pages that could not be isolated from the lru and moved to the |
| 569 | unevictable list in mlock_vma_page(). |
| 570 | |
| 571 | 4) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't
| 572 | acquire the vma's mmap semaphore to test the flags and set PageMlocked. |
| 573 | munlock_vma_page() was forced to let the page back on to the normal |
| 574 | LRU list for vmscan to handle. |
| 575 | |
| 576 | shrink_inactive_list() also culls any unevictable pages that it finds on
| 577 | the inactive lists, again diverting them to the appropriate zone's unevictable
| 578 | lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became
| 579 | SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or |
| 580 | pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from |
| 581 | the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice |
| 582 | the latter, but will pass them on to shrink_page_list().
| 583 | |
| 584 | shrink_page_list() again culls obviously unevictable pages that it could |
| 585 | encounter for reasons similar to shrink_inactive_list(). Pages mapped into
| 586 | VM_LOCKED vmas but without PG_mlocked set will make it all the way to
| 587 | try_to_unmap(). shrink_page_list() will divert them to the unevictable list
| 588 | when try_to_unmap() returns SWAP_MLOCK, as discussed above. |
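
How shrink_page_list() disposes of a page, as described in this document,
can be summarised in a few lines. The sketch below is a user-space model
only: enum page_fate and model_shrink_page() are hypothetical names, the
evictable flag stands in for the result of page_evictable(page, NULL), and
only the three return codes discussed here are handled.

```c
#include <assert.h>
#include <stdbool.h>

/* Names follow the text; the numeric values are arbitrary in this model. */
enum swap_result { SWAP_SUCCESS, SWAP_AGAIN, SWAP_MLOCK };

/* Hypothetical labels for where this sketch sends a page. */
enum page_fate { FATE_RECLAIM, FATE_KEEP_INACTIVE, FATE_UNEVICTABLE };

enum page_fate model_shrink_page(bool evictable, enum swap_result unmap_ret)
{
	/* Obviously unevictable pages are culled before any unmap attempt. */
	if (!evictable)
		return FATE_UNEVICTABLE;

	switch (unmap_ret) {
	case SWAP_MLOCK:
		return FATE_UNEVICTABLE;   /* mlocked during the rmap walk */
	case SWAP_AGAIN:
		return FATE_KEEP_INACTIVE; /* recirculate and retry later */
	case SWAP_SUCCESS:
		return FATE_RECLAIM;       /* fully unmapped; reclaim goes on */
	}
	assert(0);                         /* no other codes in this model */
	return FATE_KEEP_INACTIVE;
}
```

The point of the model is that a page can reach the unevictable list by two
routes: the early cull, or a SWAP_MLOCK result from try_to_unmap() for pages
mapped into VM_LOCKED vmas without PG_mlocked set.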