.. _ksm:

=======================
Kernel Samepage Merging
=======================

Overview
========

KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
added to the Linux kernel in 2.6.32.  See ``mm/ksm.c`` for its implementation,
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
for background.

KSM was originally developed for use with KVM (where it was known as
Kernel Shared Memory), to fit more virtual machines into physical memory,
by sharing the data common between them.  But it can be useful to any
application which generates many instances of the same data.

The KSM daemon ksmd periodically scans those areas of user memory
which have been registered with it, looking for pages of identical
content which can be replaced by a single write-protected page (which
is automatically copied if a process later wants to update its
content). The number of pages that the KSM daemon scans in a single
pass and the time between the passes are configured using the
:ref:`sysfs interface <ksm_sysfs>`.

KSM only merges anonymous (private) pages, never pagecache (file) pages.
KSM's merged pages were originally locked into kernel memory, but can now
be swapped out just like other user pages (but sharing is broken when they
are swapped back in: ksmd must rediscover their identity and merge again).

Controlling KSM with madvise
============================

KSM only operates on those areas of address space which an application
has advised to be likely candidates for merging, by using the madvise(2)
system call::

        int madvise(addr, length, MADV_MERGEABLE)

The application may call

::

        int madvise(addr, length, MADV_UNMERGEABLE)

to cancel that advice and restore unshared pages: whereupon KSM
unmerges whatever it merged in that range.  Note: this unmerging call
may suddenly require more memory than is available, possibly failing
with EAGAIN, but more probably arousing the Out-Of-Memory killer.

If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
and MADV_UNMERGEABLE simply fail with EINVAL.  If the running kernel was
built with CONFIG_KSM=y, those calls will normally succeed: even if the
KSM daemon is not currently running, MADV_MERGEABLE still registers
the range for whenever the KSM daemon is started; even if the range
cannot contain any pages which KSM could actually merge; even if
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
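
As a rough illustration of the calls described above, the following Python
sketch registers a private anonymous mapping with KSM and reports the errno if
the kernel rejects the advice. It assumes Python 3.8 or later on Linux, where
the ``mmap`` module exposes ``madvise()`` and ``MADV_MERGEABLE``; the helper
names ``make_private_anonymous`` and ``advise_mergeable`` are invented for this
example.

```python
import errno
import mmap


def make_private_anonymous(length: int) -> mmap.mmap:
    """Create a private anonymous mapping, the only kind of memory KSM merges."""
    return mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)


def advise_mergeable(region: mmap.mmap) -> int:
    """Apply MADV_MERGEABLE to the whole mapping.

    Returns 0 on success, otherwise the errno reported by madvise(2).
    """
    advice = getattr(mmap, "MADV_MERGEABLE", None)
    if advice is None:
        # The constant is absent when the platform headers do not define it.
        return errno.EINVAL
    try:
        region.madvise(advice)
    except OSError as exc:
        return exc.errno or errno.EINVAL
    return 0


if __name__ == "__main__":
    region = make_private_anonymous(64 * mmap.PAGESIZE)
    region.write(b"\x5a" * len(region))  # identical content in every page
    status = advise_mergeable(region)
    if status == 0:
        print("range registered; ksmd merges it only while run is set to 1")
    elif status == errno.EINVAL:
        print("MADV_MERGEABLE rejected: KSM is probably not configured")
    else:
        print("madvise failed:", errno.errorcode.get(status, status))
    region.close()
```

A return value of 0 only means that the range was registered. Whether any
pages are actually merged depends on ksmd running and on the page contents.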

If a region of memory must be split into at least one new MADV_MERGEABLE
or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt).

Like other madvise calls, they are intended for use on mapped areas of
the user address space: they will report ENOMEM if the specified range
includes unmapped gaps (though working on the intervening mapped areas),
and might fail with EAGAIN if there is not enough memory for internal
structures.

Applications should be considerate in their use of MADV_MERGEABLE,
restricting its use to areas likely to benefit.  KSM's scans may use a lot
of processing power: some installations will disable KSM for that reason.

.. _ksm_sysfs:

KSM daemon sysfs interface
==========================

The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``,
readable by all but writable only by root:

pages_to_scan
        how many pages to scan before ksmd goes to sleep
        e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.

        Default: 100 (chosen for demonstration purposes)

sleep_millisecs
        how many milliseconds ksmd should sleep before next scan
        e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``

        Default: 20 (chosen for demonstration purposes)

merge_across_nodes
        specifies if pages from different NUMA nodes can be merged.
        When set to 0, KSM merges only pages which physically reside
        in the memory area of the same NUMA node. That brings lower
        latency to access of shared pages. Systems with more nodes, at
        significant NUMA distances, are likely to benefit from the
        lower latency of setting 0. Smaller systems, which need to
        minimize memory usage, are likely to benefit from the greater
        sharing of setting 1 (default). You may wish to compare how
        your system performs under each setting, before deciding on
        which to use. The ``merge_across_nodes`` setting can be changed
        only when there are no KSM shared pages in the system: set
        ``run`` to 2 to unmerge pages first, then to 1 after changing
        ``merge_across_nodes``, to remerge according to the new setting.

        Default: 1 (merging across nodes as in earlier releases)

run
        * set to 0 to stop ksmd from running but keep merged pages,
        * set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``,
        * set to 2 to stop ksmd and unmerge all pages currently merged, but
          leave mergeable areas registered for next run.

        Default: 0 (must be changed to 1 to activate KSM, except if
        CONFIG_SYSFS is disabled)

use_zero_pages
        specifies whether empty pages (i.e. allocated pages that only
        contain zeroes) should be treated specially.  When set to 1,
        empty pages are merged with the kernel zero page(s) instead of
        with each other as would happen normally. This can improve
        the performance on architectures with coloured zero pages,
        depending on the workload. Care should be taken when enabling
        this setting, as it can potentially degrade the performance of
        KSM for some workloads, for example if the checksums of pages
        that are candidates for merging match the checksum of an empty
        page. This setting can be changed at any time; it is only
        effective for pages merged after the change.

        Default: 0 (normal KSM behaviour as in earlier releases)

max_page_sharing
        Maximum sharing allowed for each KSM page. This enforces a
        deduplication limit to avoid high latency for virtual memory
        operations that involve traversal of the virtual mappings that
        share the KSM page. The minimum value is 2 as a newly created
        KSM page will have at least two sharers. The higher this value,
        the faster KSM will merge the memory and the higher the
        deduplication factor will be, but the slower the worst-case
        virtual mappings traversal could be for any given KSM
        page. Slowing down this traversal means there will be higher
        latency for certain virtual memory operations happening during
        swapping, compaction, NUMA balancing and page migration, in
        turn decreasing responsiveness for the caller of those virtual
        memory operations. The scheduler latency of other tasks not
        involved with the VM operations doing the virtual mappings
        traversal is not affected by this parameter as these
        traversals are always schedule friendly themselves.

stable_node_chains_prune_millisecs
        specifies how frequently KSM checks the metadata of the pages
        that hit the deduplication limit for stale information.
        Smaller millisecs values will free up the KSM metadata with
        lower latency, but they will make ksmd use more CPU during the
        scan. It is a no-op if no KSM page has hit the
        ``max_page_sharing`` limit yet.
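
The tunables above are plain integer files, so they are easy to script. The
following Python sketch reads and writes them and walks through the
``merge_across_nodes`` procedure (set ``run`` to 2, change the setting, set
``run`` back to 1). The helper names and the ``base`` parameter are assumptions
made for this example; ``base`` exists only so that the code can be tried
against a scratch directory instead of the real sysfs tree, which is writable
only by root.

```python
from pathlib import Path

# Location of the KSM control files, as given in this document.
KSM_SYSFS = Path("/sys/kernel/mm/ksm")


def read_ksm_value(name: str, base=KSM_SYSFS) -> int:
    """Return the integer stored in one KSM sysfs file."""
    return int((Path(base) / name).read_text().strip())


def write_ksm_value(name: str, value: int, base=KSM_SYSFS) -> None:
    """Write an integer to one KSM sysfs file (needs root on a real system)."""
    (Path(base) / name).write_text(f"{value}\n")


def change_merge_across_nodes(value: int, base=KSM_SYSFS) -> None:
    """Switch merge_across_nodes following the documented sequence."""
    if value not in (0, 1):
        raise ValueError("merge_across_nodes must be 0 or 1")
    write_ksm_value("run", 2, base)  # stop ksmd and unmerge all pages
    write_ksm_value("merge_across_nodes", value, base)
    write_ksm_value("run", 1, base)  # restart ksmd so it remerges
```

For example, ``change_merge_across_nodes(0)`` run as root would restrict
merging to pages on the same NUMA node, at the cost of unmerging and rescanning
everything that was merged before.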

The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:

pages_shared
        how many shared pages are being used
pages_sharing
        how many more sites are sharing them, i.e. how much is saved
pages_unshared
        how many pages are unique but repeatedly checked for merging
pages_volatile
        how many pages are changing too fast to be placed in a tree
full_scans
        how many times all mergeable areas have been scanned
stable_node_chains
        number of stable node chains allocated; this is effectively
        the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
        number of stable node dups queued into the stable_node chains

A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
indicates wasted effort.  ``pages_volatile`` embraces several
different kinds of activity, but a high proportion there would also
indicate poor use of madvise MADV_MERGEABLE.
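
The two ratios can be computed directly from the counters. The sketch below is
one possible way to do it in Python; the function name ``ksm_summary``, the
dictionary keys of the result and the 4096-byte default page size are
assumptions of this example, and the byte figure is only an estimate that
ignores KSM's own metadata.

```python
def ksm_summary(stats: dict, page_size: int = 4096) -> dict:
    """Derive sharing indicators from the KSM counters.

    `stats` maps counter names (pages_shared, pages_sharing,
    pages_unshared) to their integer values.
    """
    shared = stats["pages_shared"]
    sharing = stats["pages_sharing"]
    unshared = stats["pages_unshared"]

    # High values indicate good sharing.
    sharing_ratio = sharing / shared if shared else 0.0

    # High values indicate wasted scanning effort.
    if sharing:
        wasted_ratio = unshared / sharing
    elif unshared:
        wasted_ratio = float("inf")
    else:
        wasted_ratio = 0.0

    return {
        "sharing_ratio": sharing_ratio,
        "wasted_ratio": wasted_ratio,
        # Each "sharing" site is a page that did not need its own copy.
        "bytes_saved": sharing * page_size,
    }
```

With ``pages_shared`` at 100, ``pages_sharing`` at 900 and ``pages_unshared``
at 450, this gives a sharing ratio of 9.0 and a wasted-effort ratio of 0.5.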

The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio, ``max_page_sharing`` must
be increased accordingly.

The ``stable_node_dups/stable_node_chains`` ratio is also affected by the
``max_page_sharing`` tunable, and a high ratio may indicate fragmentation
in the stable_node dups, which could be solved by introducing
defragmentation algorithms in ksmd which would refile rmap_items from
one stable_node dup to another stable_node dup, in order to free up
stable_node "dups" with few rmap_items in them, but that may increase
the ksmd CPU usage and possibly slow down the readonly computations on
the KSM pages of the applications.

Design
======

Overview
--------

.. kernel-doc:: mm/ksm.c
   :DOC: Overview

Reverse mapping
---------------
KSM maintains reverse mapping information for KSM pages in the stable
tree.

If a KSM page is shared between fewer than ``max_page_sharing`` VMAs,
the node of the stable tree that represents such a KSM page points to a
list of :c:type:`struct rmap_item` and the ``page->mapping`` of the
KSM page points to the stable tree node.

When the sharing passes this threshold, KSM adds a second dimension to
the stable tree. The tree node becomes a "chain" that links one or
more "dups". Each "dup" keeps reverse mapping information for a KSM
page with ``page->mapping`` pointing to that "dup".

Every "chain" and all "dups" linked into a "chain" enforce the
invariant that they represent the same write protected memory content,
even if each "dup" is pointed to by a different KSM page copy of
that content.

This way the stable tree lookup computational complexity is unaffected
if compared to an unlimited list of reverse mappings. It is still
enforced that there cannot be KSM page content duplicates in the
stable tree itself.

The deduplication limit enforced by ``max_page_sharing`` is required
to keep the virtual memory rmap lists from growing too large. The rmap
walk has O(N) complexity where N is the number of rmap_items
(i.e. virtual mappings) that are sharing the page, which is in turn
capped by ``max_page_sharing``. So this effectively spreads the linear
O(N) computational complexity from rmap walk context over different
KSM pages. The ksmd walk over the stable_node "chains" is also O(N),
but N is the number of stable_node "dups", not the number of
rmap_items, so it has no significant impact on ksmd performance. In
practice the best stable_node "dup" candidate will be kept and found
at the head of the "dups" list.

High values of ``max_page_sharing`` result in faster memory merging
(because there will be fewer stable_node dups queued into the
stable_node chain->hlist to check for pruning) and a higher
deduplication factor, at the expense of a slower worst case for rmap
walks for any KSM page, which can happen during swapping, compaction,
NUMA balancing and page migration.
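
The trade-off can be made concrete with a small back-of-the-envelope model. The
Python sketch below is not how ``mm/ksm.c`` is implemented; it is a toy
calculation, under the assumption that "dups" are filled as densely as
possible, of the minimum number of "dups" (and therefore KSM page copies)
needed for a given number of sharers and of the longest rmap walk that results.
The function name ``chain_model`` and the value 256 used in the usage note are
illustrative only.

```python
def chain_model(sharers: int, max_page_sharing: int) -> tuple:
    """Toy model of one stable tree entry.

    Returns (min_dups, worst_rmap_walk):
      min_dups         lower bound on the "dups" holding this content,
                       assuming every dup is filled up to the limit
      worst_rmap_walk  largest number of rmap_items on a single dup
    """
    if max_page_sharing < 2:
        raise ValueError("max_page_sharing has a minimum value of 2")
    if sharers < 0:
        raise ValueError("sharers cannot be negative")
    # Ceiling division without floating point.
    min_dups = (sharers + max_page_sharing - 1) // max_page_sharing
    worst_rmap_walk = min(sharers, max_page_sharing)
    return min_dups, worst_rmap_walk
```

For one million identical pages and a limit of 256, the model yields at least
3907 "dups" with rmap walks of at most 256 items each. Raising the limit to
1024 cuts that to 977 "dups" but quadruples the worst-case walk.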

The whole list of stable_node "dups" linked in the stable_node
"chains" is scanned periodically in order to prune stale stable_nodes.
The frequency of such scans is defined by the
``stable_node_chains_prune_millisecs`` sysfs tunable.

Reference
---------
.. kernel-doc:: mm/ksm.c
   :functions: mm_slot ksm_scan stable_node rmap_item

--
Izik Eidus,
Hugh Dickins, 17 Nov 2009