==================
HugeTLB Controller
==================

The HugeTLB controller is enabled by mounting the cgroup filesystem with the
hugetlb option::

  # mount -t cgroup -o hugetlb none /sys/fs/cgroup

With the above step, the initial or the parent HugeTLB group becomes
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
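
Membership of the current shell in this root group can be verified by looking
for its PID in the tasks file (a minimal check; the PID list varies per
system)::

  # grep -w $$ /sys/fs/cgroup/tasks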

New groups can be created under the parent group /sys/fs/cgroup::

  # cd /sys/fs/cgroup
  # mkdir g1
  # echo $$ > g1/tasks

The above steps create a new group g1 and move the current shell
process (bash) into it.
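
A usage limit can then be set on the new group (a minimal sketch, assuming the
system provides 2MB hugepages; adjust the size to match your system). The
limit accepts K/M/G suffixes and is rounded down to a multiple of the hugepage
size::

  # echo 100M > g1/hugetlb.2MB.limit_in_bytes
  # cat g1/hugetlb.2MB.limit_in_bytes
  104857600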

Brief summary of control files::

  hugetlb.<hugepagesize>.rsvd.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb reservations
  hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes # show max "hugepagesize" hugetlb reservations and no-reserve faults
  hugetlb.<hugepagesize>.rsvd.usage_in_bytes     # show current reservations and no-reserve faults for "hugepagesize" hugetlb
  hugetlb.<hugepagesize>.rsvd.failcnt            # show the number of allocation failures due to the HugeTLB reservation limit
  hugetlb.<hugepagesize>.limit_in_bytes          # set/show limit of "hugepagesize" hugetlb faults
  hugetlb.<hugepagesize>.max_usage_in_bytes      # show max "hugepagesize" hugetlb usage recorded
  hugetlb.<hugepagesize>.usage_in_bytes          # show current usage for "hugepagesize" hugetlb
  hugetlb.<hugepagesize>.failcnt                 # show the number of allocation failures due to the HugeTLB usage limit
  hugetlb.<hugepagesize>.numa_stat               # show the NUMA information of the hugetlb memory charged to this cgroup

For a system supporting three hugepage sizes (64KB, 32MB and 1GB), the control
files include::

  hugetlb.1GB.limit_in_bytes
  hugetlb.1GB.max_usage_in_bytes
  hugetlb.1GB.numa_stat
  hugetlb.1GB.usage_in_bytes
  hugetlb.1GB.failcnt
  hugetlb.1GB.rsvd.limit_in_bytes
  hugetlb.1GB.rsvd.max_usage_in_bytes
  hugetlb.1GB.rsvd.usage_in_bytes
  hugetlb.1GB.rsvd.failcnt
  hugetlb.64KB.limit_in_bytes
  hugetlb.64KB.max_usage_in_bytes
  hugetlb.64KB.numa_stat
  hugetlb.64KB.usage_in_bytes
  hugetlb.64KB.failcnt
  hugetlb.64KB.rsvd.limit_in_bytes
  hugetlb.64KB.rsvd.max_usage_in_bytes
  hugetlb.64KB.rsvd.usage_in_bytes
  hugetlb.64KB.rsvd.failcnt
  hugetlb.32MB.limit_in_bytes
  hugetlb.32MB.max_usage_in_bytes
  hugetlb.32MB.numa_stat
  hugetlb.32MB.usage_in_bytes
  hugetlb.32MB.failcnt
  hugetlb.32MB.rsvd.limit_in_bytes
  hugetlb.32MB.rsvd.max_usage_in_bytes
  hugetlb.32MB.rsvd.usage_in_bytes
  hugetlb.32MB.rsvd.failcnt


1. Page fault accounting

hugetlb.<hugepagesize>.limit_in_bytes
hugetlb.<hugepagesize>.max_usage_in_bytes
hugetlb.<hugepagesize>.usage_in_bytes
hugetlb.<hugepagesize>.failcnt

The HugeTLB controller allows users to limit HugeTLB usage (page faults) per
control group and enforces the limit at page fault time. Since HugeTLB
doesn't support page reclaim, enforcing the limit at page fault time implies
that the application will get a SIGBUS signal if it tries to fault in HugeTLB
pages beyond its limit. Therefore the application needs to know exactly how
many HugeTLB pages it uses beforehand, and the sysadmin needs to make sure
that there are enough available on the machine for all the users to avoid
processes getting SIGBUS.
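
For example, faults in the g1 group created above can be capped, and rejected
faults observed via failcnt (a minimal sketch, assuming 2MB hugepages; failcnt
increments each time the limit rejects a fault)::

  # echo 10M > /sys/fs/cgroup/g1/hugetlb.2MB.limit_in_bytes
  # cat /sys/fs/cgroup/g1/hugetlb.2MB.failcnt
  0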


2. Reservation accounting

hugetlb.<hugepagesize>.rsvd.limit_in_bytes
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
hugetlb.<hugepagesize>.rsvd.usage_in_bytes
hugetlb.<hugepagesize>.rsvd.failcnt

The HugeTLB controller allows users to limit HugeTLB reservations per control
group and enforces the controller limit at reservation time and at the fault
of HugeTLB memory for which no reservation exists. Since reservation limits
are enforced at reservation time (on mmap or shmget), they never cause the
application to get a SIGBUS signal if the memory was reserved beforehand. For
MAP_NORESERVE allocations, the reservation limit behaves the same as the fault
limit, enforcing memory usage at fault time and causing the application to
receive a SIGBUS if it crosses its limit.
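
For example, a reservation limit can be set on g1 (a minimal sketch, assuming
2MB hugepages). A hugetlb mapping created with mmap() is charged against this
limit when mmap() is called, before any page is touched::

  # echo 20M > /sys/fs/cgroup/g1/hugetlb.2MB.rsvd.limit_in_bytes
  # cat /sys/fs/cgroup/g1/hugetlb.2MB.rsvd.failcnt
  0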

Reservation limits are superior to the page fault limits described above,
since reservation limits are enforced at reservation time (on mmap or shmget)
and never cause the application to get a SIGBUS signal if the memory was
reserved beforehand. This allows for easier fallback to alternatives such as
non-HugeTLB memory. In the case of page fault accounting, it is very hard to
avoid processes getting SIGBUS since the sysadmin needs to know precisely the
HugeTLB usage of all the tasks in the system and make sure there are enough
pages to satisfy all requests. Avoiding tasks getting SIGBUS on overcommitted
systems is practically impossible with page fault accounting.


3. Caveats with shared memory

For shared HugeTLB memory, both HugeTLB reservations and page faults are
charged to the first task that causes the memory to be reserved or faulted,
and all subsequent uses of this reserved or faulted memory are done without
charging.

Shared HugeTLB memory is only uncharged when it is unreserved or deallocated.
This is usually when the HugeTLB file is deleted, and not when the task that
caused the reservation or fault has exited.
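
This can be observed with a file on a hugetlbfs mount (a minimal sketch; the
mount point /mnt/huge and the 2MB hugepage size are assumptions, and whether
fallocate pre-allocates pages depends on kernel support for fallocate on
hugetlbfs)::

  # mount -t hugetlbfs none /mnt/huge
  # fallocate -l 20M /mnt/huge/file   # pages charged to the caller's cgroup
  # cat /sys/fs/cgroup/g1/hugetlb.2MB.usage_in_bytes
  20971520
  # rm /mnt/huge/file                 # deleting the file uncharges the cgroup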


4. Caveats with HugeTLB cgroup offline

When a HugeTLB cgroup goes offline with some reservations or faults still
charged to it, the behavior is as follows:

- The fault charges are charged to the parent HugeTLB cgroup (reparented).
- The reservation charges remain on the offline HugeTLB cgroup.

This means that if a HugeTLB cgroup gets offlined while there are still
HugeTLB reservations charged to it, that cgroup persists as a zombie until
all HugeTLB reservations are uncharged. HugeTLB reservations behave in this
manner to match the memory controller, whose cgroups also persist as zombies
until all charged memory is uncharged. Also, the tracking of HugeTLB
reservations is more complex than the tracking of HugeTLB faults, so it is
significantly harder to reparent reservations at offline time.
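
Whether a group will persist as a zombie after removal can be checked ahead of
time by reading its remaining reservation usage (a minimal sketch, assuming
2MB hugepages; a non-zero value indicates outstanding charges)::

  # cat /sys/fs/cgroup/g1/hugetlb.2MB.rsvd.usage_in_bytes
  0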