Memory Resource Controller (Memcg) Implementation Memo.
Last Updated: 2010/2
Base Kernel Version: based on 2.6.33-rc7-mm (candidate for 34).

Because VM is getting complex (one of the reasons is memcg...), memcg's
behavior is complex. This is a document for memcg's internal behavior.
Please note that implementation details can be changed.

(*) Topics on API should be in Documentation/cgroups/memory.txt

0. How to record usage ?
        Two objects are used.

        page_cgroup ... an object per page.
                Allocated at boot or memory hotplug. Freed at memory hot removal.

        swap_cgroup ... an entry per swp_entry.
                Allocated at swapon(). Freed at swapoff().

        The page_cgroup has a USED bit, so double counting against a
        page_cgroup never occurs. swap_cgroup is used only when a charged
        page is swapped out.

1. Charge

        A page/swp_entry may be charged (usage += PAGE_SIZE) at

        mem_cgroup_try_charge()

2. Uncharge
        A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

        mem_cgroup_uncharge()
                Called when a page's refcount goes down to 0.

        mem_cgroup_uncharge_swap()
                Called when swp_entry's refcnt goes down to 0. A charge against swap
                disappears.

3. charge-commit-cancel
        Memcg pages are charged in two steps:
                mem_cgroup_try_charge()
                mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()

        At try_charge(), there are no flags to say "this page is charged".
        At this point, usage += PAGE_SIZE.

        At commit(), the page is associated with the memcg.

        At cancel(), simply usage -= PAGE_SIZE.
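
        The accounting described above can be sketched as a toy model in
        shell. This is only an illustration of how usage changes at each
        step, not kernel code: PAGE_SIZE=4096, the two counters and the
        three function names are assumptions made for the example.

```shell
#!/bin/sh
# Toy model of the two-step charge accounting; this is not kernel code.
# PAGE_SIZE, the counters and the function names are illustrative only.
PAGE_SIZE=4096
usage=0          # bytes charged to the single, imaginary memcg
committed=0      # pages that have been associated with the memcg

try_charge()    { usage=$((usage + PAGE_SIZE)); }     # usage += PAGE_SIZE
commit_charge() { committed=$((committed + 1)); }     # page <-> memcg
cancel_charge() { usage=$((usage - PAGE_SIZE)); }     # usage -= PAGE_SIZE

# Page 1: the charge succeeds and is committed.
try_charge; commit_charge
# Page 2: charged, but the caller backs out before commit.
try_charge; cancel_charge

echo "usage=$usage committed=$committed"
```

        Only the committed page stays accounted; the cancelled one leaves
        usage unchanged.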

        In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

4. Anonymous
        An anonymous page is newly allocated at
                  - page fault into MAP_ANONYMOUS mapping.
                  - Copy-On-Write.

        4.1 Swap-in.
        At swap-in, the page is taken from swap-cache. There are 2 cases.

        (a) If the SwapCache is newly allocated and read, it has no charges.
        (b) If the SwapCache has been mapped by processes, it has been
            charged already.

        4.2 Swap-out.
        At swap-out, the typical state transition is as follows.

        (a) add to swap cache. (marked as SwapCache)
            swp_entry's refcnt += 1.
        (b) fully unmapped.
            swp_entry's refcnt += # of ptes.
        (c) write back to swap.
        (d) delete from swap cache. (remove from SwapCache)
            swp_entry's refcnt -= 1.


        Finally, at task exit,
        (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
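
        The reference counting in (a)-(e) can be followed with a small
        arithmetic trace. This is a sketch, not kernel code: the function
        name is hypothetical, and it assumes that step (e) happens once
        per pte when the page was mapped by more than one pte.

```shell
#!/bin/sh
# Toy trace of swp_entry's refcnt through steps (a)-(e); not kernel code.
# $1 is the assumed number of ptes that mapped the page (default 1).
swap_refcnt_trace()
{
        nptes=${1:-1}
        refcnt=0

        refcnt=$((refcnt + 1))          # (a) add to swap cache
        echo "a $refcnt"
        refcnt=$((refcnt + nptes))      # (b) fully unmapped
        echo "b $refcnt"
                                        # (c) write back: refcnt unchanged
        refcnt=$((refcnt - 1))          # (d) delete from swap cache
        echo "d $refcnt"

        i=0
        while [ "$i" -lt "$nptes" ]; do # (e) assumed: one decrement per pte
                refcnt=$((refcnt - 1))
                i=$((i + 1))
        done
        echo "e $refcnt"
}

swap_refcnt_trace 1
```

        With one pte the count goes 1, 2, 1, 0. Reaching 0 is the point
        where mem_cgroup_uncharge_swap() is called.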

5. Page Cache
        Page Cache is charged at
        - add_to_page_cache_locked().

        The logic is very clear. (About migration, see below)
        Note: __remove_from_page_cache() is called by remove_from_page_cache()
        and __remove_mapping().

6. Shmem(tmpfs) Page Cache
        The best way to understand shmem's page state transition is to read
        mm/shmem.c.
        But a brief explanation of memcg's behavior around shmem helps to
        understand the logic.

        Shmem's page (just leaf page, not direct/indirect block) can be on
                - radix-tree of shmem's inode.
                - SwapCache.
                - Both on radix-tree and SwapCache. This happens at swap-in
                  and swap-out.

        It's charged when...
        - A new page is added to shmem's radix-tree.
        - A swp page is read. (move a charge from swap_cgroup to page_cgroup)

7. Page Migration

        mem_cgroup_migrate()

8. LRU
        Each memcg has its own private LRU. Now, its handling is under global
        VM's control (meaning that it's handled under the global zone_lru_lock).
        Almost all routines around memcg's LRU are called by the global LRU's
        list management functions under zone_lru_lock().

        A special function is mem_cgroup_isolate_pages(). This scans
        memcg's private LRU and calls __isolate_lru_page() to extract a page
        from LRU.
        (By __isolate_lru_page(), the page is removed from both the global and
         private LRU.)


9. Typical Tests.

 Tests for racy cases.

 9.1 Small limit to memcg.
        When testing racy cases, it's a good test to set memcg's limit
        very small rather than in GB. Many races were found in tests under
        xKB or xxMB limits.
        (Memory behavior under a GB limit and under an MB limit shows very
         different situations.)
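
        Setting such a small limit can be wrapped in a helper like the one
        below. It is a sketch only: the function name and the example path
        are hypothetical, while memory.limit_in_bytes is the same file
        used in the swapoff and OOM tests later in this document.

```shell
#!/bin/sh
# Write a limit into a memcg directory.
# $1: group directory (hypothetical example: /cgroup/test)
# $2: limit, e.g. 512K or 4M
set_memcg_limit()
{
        dir=$1
        limit=$2
        if [ ! -d "$dir" ]; then
                echo "no such group: $dir" >&2
                return 1
        fi
        echo "$limit" > "$dir/memory.limit_in_bytes"
}

# Example (needs root and a mounted memory cgroup):
#   set_memcg_limit /cgroup/test 512K
```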

 9.2 Shmem
        Historically, memcg's shmem handling was poor and we saw a fair amount
        of trouble here. This is because shmem is page-cache but can be
        SwapCache. A test with shmem/tmpfs is always a good test.

 9.3 Migration
        For NUMA, migration is another special case. For an easy test, cpuset
        is useful. The following is a sample script to do migration.

        mount -t cgroup -o cpuset none /opt/cpuset

        mkdir /opt/cpuset/01
        echo 1 > /opt/cpuset/01/cpuset.cpus
        echo 0 > /opt/cpuset/01/cpuset.mems
        echo 1 > /opt/cpuset/01/cpuset.memory_migrate
        mkdir /opt/cpuset/02
        echo 1 > /opt/cpuset/02/cpuset.cpus
        echo 1 > /opt/cpuset/02/cpuset.mems
        echo 1 > /opt/cpuset/02/cpuset.memory_migrate

        In the above setup, when you move a task from 01 to 02, page migration
        from node 0 to node 1 will occur. The following is a script to migrate
        all tasks under a cpuset.
        --
        # G1/G2 are assumed here to be the two cpusets created above.
        G1=/opt/cpuset/01
        G2=/opt/cpuset/02

        # move_task <pids> <dest>: write each pid to <dest>/tasks and echo it.
        move_task()
        {
        for pid in $1
        do
                /bin/echo $pid >"$2/tasks" 2>/dev/null
                printf '%s ' "$pid"
        done
        echo END
        }

        G1_TASK=`cat ${G1}/tasks`
        G2_TASK=`cat ${G2}/tasks`
        move_task "${G1_TASK}" ${G2} &
        --
 9.4 Memory hotplug.
        A memory hotplug test is also a good test.
        To offline memory, do the following.
        # echo offline > /sys/devices/system/memory/memoryXXX/state
        (XXX is the index of the memory block)
        This is an easy way to test page migration, too.
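
        The offline step can be scripted so that the result is visible.
        The helper below is a sketch: the function name is hypothetical,
        and the optional second argument exists only so the function can
        be pointed at a directory other than the sysfs path shown above.

```shell
#!/bin/sh
# Try to offline one memory block and print the state it ends up in.
# $1: block directory name, e.g. memoryXXX with XXX replaced by an index
# $2: base directory; defaults to the sysfs path used above
offline_memory_block()
{
        block=$1
        base=${2:-/sys/devices/system/memory}
        state="$base/$block/state"

        if [ ! -w "$state" ]; then
                echo "cannot write $state" >&2
                return 1
        fi
        # Offlining may fail; ignore the write error and report the state
        # that can actually be read back afterwards.
        echo offline > "$state" 2>/dev/null
        cat "$state"
}
```

        If the block still reads "online" afterwards, the offline request
        was refused and no migration of its pages completed.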

 9.5 mkdir/rmdir
        When using hierarchy, a mkdir/rmdir test should be done.
        Use tests like the following.

        echo 1 >/opt/cgroup/01/memory.use_hierarchy
        mkdir /opt/cgroup/01/child_a
        mkdir /opt/cgroup/01/child_b

        set limit to 01.
        add limit to 01/child_b
        run jobs under child_a and child_b

        create/delete the following groups at random while jobs are running.
        /opt/cgroup/01/child_a/child_aa
        /opt/cgroup/01/child_b/child_bb
        /opt/cgroup/01/child_c

        running new jobs in a new group is also good.
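
        The random create/delete step can be driven by a loop such as the
        following. It is a minimal sketch: the function name is
        hypothetical, /dev/urandom is assumed to be available, and
        child_a and child_b must already exist under the parent group.

```shell
#!/bin/sh
# Randomly create/delete the three groups listed above.
# $1: parent group (e.g. /opt/cgroup/01), $2: number of iterations.
churn_groups()
{
        parent=$1
        count=$2
        i=0
        while [ "$i" -lt "$count" ]; do
                # One random byte from /dev/urandom picks the target group.
                r=$(od -An -N1 -tu1 /dev/urandom | tr -d ' ')
                case $((r % 3)) in
                0) g="$parent/child_a/child_aa" ;;
                1) g="$parent/child_b/child_bb" ;;
                2) g="$parent/child_c" ;;
                esac
                # Toggle: remove the group if it exists, else create it.
                # rmdir fails while the group is in use; that is ignored.
                if [ -d "$g" ]; then
                        rmdir "$g" 2>/dev/null
                else
                        mkdir "$g" 2>/dev/null
                fi
                i=$((i + 1))
        done
        echo "done $i"
}

# Example (needs root): churn_groups /opt/cgroup/01 1000 &
```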

 9.6 Mount with other subsystems.
        Mounting with other subsystems is a good test because there is a
        race and lock dependency with other cgroup subsystems.

        example)
        # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

        and do task move, mkdir, rmdir etc...under this.

 9.7 swapoff.
        Management of swap is one of the complicated parts of memcg, and
        the call path of swap-in at swapoff is not the same as the usual
        swap-in path. It's worth testing explicitly.

        For example, a test like the following is good.
        (Shell-A)
        # mount -t cgroup none /cgroup -o memory
        # mkdir /cgroup/test
        # echo 40M > /cgroup/test/memory.limit_in_bytes
        # echo 0 > /cgroup/test/tasks
        Run malloc(100M) program under this. You'll see 60M of swaps.
        (Shell-B)
        # move all tasks in /cgroup/test to /cgroup
        # /sbin/swapoff -a
        # rmdir /cgroup/test
        # kill malloc task.

        Of course, tmpfs v.s. swapoff test should be tested, too.
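
        The "move all tasks" step in Shell-B is not a single command. One
        way to do it is sketched below; the function name is hypothetical,
        and the tasks files are the same ones written to in Shell-A.

```shell
#!/bin/sh
# Move every task listed in $1/tasks into the group $2.
# Prints the number of pids that were written successfully.
move_all_tasks()
{
        src=$1
        dst=$2
        moved=0
        # Snapshot the list first; it changes as tasks are moved.
        pids=$(cat "$src/tasks") || return 1
        for pid in $pids; do
                # A task may exit before we reach it; skip such errors.
                if echo "$pid" > "$dst/tasks" 2>/dev/null; then
                        moved=$((moved + 1))
                fi
        done
        echo "$moved"
}

# Example (needs root): move_all_tasks /cgroup/test /cgroup
```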

 9.8 OOM-Killer
        Out-of-memory caused by memcg's limit will kill tasks under
        the memcg. When hierarchy is used, a task under hierarchy
        will be killed by the kernel.
        In this case, panic_on_oom shouldn't be invoked and tasks
        in other groups shouldn't be killed.

        It's not difficult to cause OOM under memcg as follows.
        Case A) when you can swapoff
        #swapoff -a
        #echo 50M > /memory.limit_in_bytes
        run 51M of malloc

        Case B) when you use mem+swap limitation.
        #echo 50M > memory.limit_in_bytes
        #echo 50M > memory.memsw.limit_in_bytes
        run 51M of malloc

 9.9 Move charges at task migration
        Charges associated with a task can be moved along with task migration.

        (Shell-A)
        #mkdir /cgroup/A
        #echo $$ >/cgroup/A/tasks
        run some programs which use some amount of memory in /cgroup/A.

        (Shell-B)
        #mkdir /cgroup/B
        #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
        #echo "pid of the program running in group A" >/cgroup/B/tasks

        You can see charges have been moved by reading *.usage_in_bytes or
        memory.stat of both A and B.
        See 8.2 of Documentation/cgroups/memory.txt to see what value should be
        written to move_charge_at_immigrate.
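
        To compare usage before and after the move, a small helper such as
        the following can be run in either shell. It is a sketch with a
        hypothetical function name; it only reads memory.usage_in_bytes
        from each group directory passed to it.

```shell
#!/bin/sh
# Print "<group> <usage in bytes>" for each group directory given.
# Run it before and after the move in Shell-B and compare the numbers.
show_usage()
{
        for grp in "$@"; do
                printf '%s %s\n' "$grp" "$(cat "$grp/memory.usage_in_bytes")"
        done
}

# Example: show_usage /cgroup/A /cgroup/B
```

        After a successful move, A's value should drop and B's should rise
        by roughly the same amount.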

 9.10 Memory thresholds
        Memory controller implements memory thresholds using the cgroups
        notification API. You can use tools/cgroup/cgroup_event_listener.c
        to test it.

        (Shell-A) Create cgroup and run event listener
        # mkdir /cgroup/A
        # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

        (Shell-B) Add task to cgroup and try to allocate and free memory
        # echo $$ >/cgroup/A/tasks
        # a="$(dd if=/dev/zero bs=1M count=10)"
        # a=

        You will see a message from cgroup_event_listener every time you
        cross the thresholds.

        Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

        It's a good idea to test the root cgroup as well.