.. _frontswap:

=========
Frontswap
=========

Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance gains may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of on a swap disk.

(Note: frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends"
and the only necessary changes to the core kernel for transcendent memory;
all other supporting code -- the "backends" -- is implemented as drivers.
See the LWN.net article `Transcendent memory in a nutshell`_
for a detailed overview of frontswap and related kernel parts.)

.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device. The storage is assumed to be
a synchronous, concurrency-safe, page-oriented "pseudo-RAM device" conforming
to the requirements of transcendent memory (such as Xen's "tmem", or
in-kernel compressed memory, aka "zcache", or future RAM-like devices);
this pseudo-RAM device is not directly accessible or addressable by the
kernel and is of unknown and possibly time-varying size. The driver
links itself to frontswap by calling frontswap_register_ops to set the
frontswap_ops functions appropriately, and the functions it provides must
conform to the following policies:

An "init" prepares the device to receive frontswap pages associated
with the specified swap device number (aka "type"). A "store" will
copy the page to transcendent memory and associate it with the type and
offset of the page. A "load" will copy the page, if found,
from transcendent memory into kernel memory, but will NOT remove the page
from transcendent memory. An "invalidate_page" will remove the page
from transcendent memory and an "invalidate_area" will remove ALL pages
associated with the swap type (e.g., like swapoff) and notify the "device"
to refuse further stores with that swap type.
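
The five operations can be modelled in user space to make their contracts
concrete. The following is only a sketch under stated assumptions: every
``toy_*`` name, the fixed table sizes, and the return convention (0 for
success, -1 for failure) are hypothetical and are not the kernel's
frontswap_ops interface. It keeps a plain copy of each page, indexed by type
and offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE   4096  /* assumed page size */
#define TOY_MAX_TYPES   4     /* hypothetical number of swap devices */
#define TOY_MAX_OFFSETS 8     /* hypothetical pages per swap device */

struct toy_area {
    bool enabled;                           /* set by init, cleared by invalidate_area */
    unsigned char *pages[TOY_MAX_OFFSETS];  /* NULL means nothing is stored */
};

static struct toy_area toy_areas[TOY_MAX_TYPES];

/* Return the area for (type, offset), or NULL if either is out of range. */
static struct toy_area *toy_area_get(unsigned type, unsigned long offset)
{
    if (type >= TOY_MAX_TYPES || offset >= TOY_MAX_OFFSETS)
        return NULL;
    return &toy_areas[type];
}

/* "init": start accepting pages for this swap type. */
void toy_init(unsigned type)
{
    assert(type < TOY_MAX_TYPES);
    toy_areas[type].enabled = true;
}

/* "store": copy the page in; 0 on success, -1 if the page is rejected. */
int toy_store(unsigned type, unsigned long offset, const void *page)
{
    struct toy_area *a = toy_area_get(type, offset);

    if (!a || !a->enabled)
        return -1;
    if (!a->pages[offset]) {
        a->pages[offset] = malloc(TOY_PAGE_SIZE);
        if (!a->pages[offset])
            return -1;
    }
    memcpy(a->pages[offset], page, TOY_PAGE_SIZE);
    return 0;
}

/* "load": copy the page out if found; the stored copy is NOT removed. */
int toy_load(unsigned type, unsigned long offset, void *page)
{
    struct toy_area *a = toy_area_get(type, offset);

    if (!a || !a->pages[offset])
        return -1;
    memcpy(page, a->pages[offset], TOY_PAGE_SIZE);
    return 0;
}

/* "invalidate_page": drop one page. */
void toy_invalidate_page(unsigned type, unsigned long offset)
{
    struct toy_area *a = toy_area_get(type, offset);

    if (!a)
        return;
    free(a->pages[offset]);
    a->pages[offset] = NULL;
}

/* "invalidate_area": drop every page of the type and refuse further stores. */
void toy_invalidate_area(unsigned type)
{
    unsigned long offset;

    for (offset = 0; offset < TOY_MAX_OFFSETS; offset++)
        toy_invalidate_page(type, offset);
    if (type < TOY_MAX_TYPES)
        toy_areas[type].enabled = false;
}
```

In this model a second ``toy_load()`` of the same offset still succeeds, and a
``toy_store()`` after ``toy_invalidate_area()`` fails until ``toy_init()`` is
called again for that type.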

Once a page is successfully stored, a matching load on the page will normally
succeed. So when the kernel needs to swap out a page, it first attempts to
use frontswap. If the store returns success, the data has been saved to
transcendent memory, so a disk write is avoided and, if the data is later
read back, so is a disk read.
If a store returns failure, transcendent memory has rejected the data, and the
page can be written to swap as usual.

If a backend chooses, frontswap can be configured as a "writethrough
cache" by calling frontswap_writethrough(). In this mode, the reduction
in swap device writes is lost (and with it a non-trivial performance advantage)
in exchange for allowing the backend to arbitrarily "reclaim" space used to
store frontswap pages, so that it can manage its memory usage more completely.

Note that if a page is stored and the page already exists in transcendent memory
(a "duplicate" store), either the store succeeds and the data is overwritten,
or the store fails AND the page is invalidated. This ensures that stale data can
never be obtained from frontswap.

If properly configured, frontswap can be monitored via debugfs in
the `/sys/kernel/debug/frontswap` directory. The effectiveness of
frontswap can be measured (across all swap devices) with:

``failed_stores``
        how many store attempts have failed

``loads``
        how many loads were attempted (all should succeed)

``succ_stores``
        how many store attempts have succeeded

``invalidates``
        how many invalidates were attempted

A backend implementation may provide additional metrics.
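
One way to turn these counters into a single figure is the share of store
attempts that the backend accepted. The sketch below is an illustration, not
an existing tool: the helper names are hypothetical, and it assumes each
debugfs file holds a single unsigned decimal number.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one unsigned decimal counter from a file such as
 * /sys/kernel/debug/frontswap/succ_stores (file format is an assumption).
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int demo_read_counter(const char *path, unsigned long long *value)
{
    FILE *f;
    int ok;

    assert(path && value);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%llu", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/*
 * Percentage of store attempts that succeeded, rounded down.
 * Returns 0 when no stores were attempted. Overflow of succ * 100 is
 * ignored in this sketch.
 */
unsigned demo_store_accept_percent(unsigned long long succ,
                                   unsigned long long failed)
{
    unsigned long long total = succ + failed;

    if (total == 0)
        return 0;
    return (unsigned)(succ * 100 / total);
}
```

A caller would read ``succ_stores`` and ``failed_stores`` with
``demo_read_counter()`` and pass both values to
``demo_store_accept_percent()``; a low percentage suggests the backend is
rejecting most pages and swap traffic is still reaching the disk.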

FAQ
===

* Where's the value?

When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by
providing a clean, dynamic interface to read and write swap pages to
"transcendent memory" that is otherwise not directly addressable by the kernel.
This interface is ideal when data is transformed to a different form
and size (such as with compression) or secretly moved (as might be
useful for write-balancing for some RAM-like devices). Swap pages (and
evicted page-cache pages) are a great use for this kind of slower-than-RAM-
but-much-faster-than-disk "pseudo-RAM device", and the frontswap (and
cleancache) interface to transcendent memory provides a nice way to read
and write -- and indirectly "name" -- the pages.

Frontswap -- and cleancache -- with a fairly small impact on the kernel,
provide a great deal of flexibility for more dynamic RAM
utilization in various system configurations:

In the single kernel case, aka "zcache", pages are compressed and
stored in local memory, thus increasing the total number of anonymous pages
that can be safely kept in RAM. Zcache essentially trades off CPU
cycles used in compression/decompression for better memory utilization.
Benchmarks have shown little or no impact when memory pressure is
low while providing a significant performance improvement (25%+)
on some workloads under high memory pressure.

"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
support for clustered systems. Frontswap pages are locally compressed
as in zcache, but then "remotified" to another system's RAM. This
allows RAM to be dynamically load-balanced back-and-forth as needed,
i.e. when system A is overcommitted, it can swap to system B, and
vice versa. RAMster can also be configured as a memory server so
many servers in a cluster can swap, dynamically as needed, to a single
server configured with a large amount of RAM... without pre-configuring
how much of the RAM is available for each of the clients!

In the virtual case, the whole point of virtualization is to statistically
multiplex physical resources across the varying demands of multiple
virtual machines. This is really hard to do with RAM and efforts to do
it well with no kernel changes have essentially failed (except in some
well-publicized special-case workloads).
Specifically, the Xen Transcendent Memory backend allows otherwise
"fallow" hypervisor-owned RAM not only to be "time-shared" between multiple
virtual machines, but also to be compressed and deduplicated to
optimize RAM utilization. And when guest OSes are induced to surrender
underutilized RAM (e.g. with "selfballooning"), sudden unexpected
memory pressure may result in swapping; frontswap allows those pages
to be swapped to and from hypervisor RAM (if overall host system memory
conditions allow), thus mitigating the potentially awful performance impact
of unplanned swapping.

A KVM implementation is underway and has been RFC'ed to lkml. And,
using frontswap, investigation is also underway on the use of NVM as
a memory extension technology.

* Sure there may be performance advantages in some situations, but
  what's the space/time overhead of frontswap?

If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'ed
swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
registers, the only added cost is one comparison of a global variable
against zero for every swap page read or written. If CONFIG_FRONTSWAP is enabled
AND a frontswap backend registers AND the backend fails every "store"
request (i.e. provides no memory despite claiming it might),
CPU overhead is still negligible -- and since every frontswap fail
precedes a swap page write-to-disk, the system is highly likely
to be I/O bound and using a small fraction of a percent of a CPU
will be irrelevant anyway.

As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
registers, one bit is allocated for every swap page for every swap
device that is swapon'd. This is added to the EIGHT bits (it
was sixteen until about 2.6.34) that the kernel already allocates
for every swap page for every swap device that is swapon'd. (Hugh
Dickins has observed that frontswap could probably steal one of
the existing eight bits, but let's worry about that minor optimization
later.) For very large swap disks (which are rare) with a standard
4K page size, this is 1MB per 32GB of swap.
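
The 1MB figure follows from one bit per swap page: 32GB divided by 4K pages
gives 8M pages, hence 8M bits, which is 1MB. A small helper, with a
hypothetical name and taking "GB" and "MB" as binary units, makes the
arithmetic checkable:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for a one-bit-per-page map covering swap_bytes of swap.
 * Partial trailing pages are ignored; the bit count is rounded up to a
 * whole number of bytes.
 */
uint64_t demo_frontswap_map_bytes(uint64_t swap_bytes, uint64_t page_size)
{
    uint64_t pages;

    assert(page_size > 0);
    pages = swap_bytes / page_size;
    return (pages + 7) / 8;
}
```

For example, ``demo_frontswap_map_bytes(32ULL << 30, 4096)`` evaluates to
``1ULL << 20``, i.e. 1MB of map for 32GB of swap.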

When swap pages are stored in transcendent memory instead of written
out to disk, there is a side effect that this may create more memory
pressure that can potentially outweigh the other advantages. A
backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen.

* OK, how about a quick overview of what this frontswap patch does
  in terms that a kernel hacker can grok?

Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this
frontswap backend has access to some "memory" that is not directly
accessible by the kernel. Exactly how much memory it provides is
entirely dynamic and unpredictable.

Whenever a swap device is swapon'd, frontswap_init() is called,
passing the swap device number (aka "type") as a parameter.
This notifies frontswap to expect attempts to "store" swap pages
associated with that number.

Whenever the swap subsystem is readying a page to write to a swap
device (cf. swap_writepage()), frontswap_store() is called. Frontswap
consults with the frontswap backend and if the backend says it does NOT
have room, frontswap_store() returns -1 and the kernel swaps the page
to the swap device as normal. Note that the response from the frontswap
backend is unpredictable to the kernel; it may choose to never accept a
page, it could accept every ninth page, or it might accept every
page. But if the backend does accept a page, the data from the page
has already been copied and associated with the type and offset,
and the backend guarantees the persistence of the data. In this case,
frontswap sets a bit in the "frontswap_map" for the swap device
corresponding to the page offset on the swap device to which it would
otherwise have written the data.

When the swap subsystem needs to swap in a page (cf. swap_readpage()),
it first calls frontswap_load(), which checks the frontswap_map to
see if the page was earlier accepted by the frontswap backend. If
it was, the page of data is filled from the frontswap backend and
the swap-in is complete. If not, the normal swap-in code is
executed to obtain the page of data from the real swap device.
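
The swap-out and swap-in decisions described above can be sketched with a
plain bitmap. This is a user-space illustration, not the kernel code: the
``demo_*`` names, the fixed slot count, and the callback type are all
hypothetical, page data is not modelled, and only a single swap device is
tracked. The "every ninth page" policy is taken from the example in the text.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_SLOTS 64                              /* hypothetical swap device size in pages */
#define DEMO_BITS  (sizeof(unsigned long) * CHAR_BIT)

/* One bit per swap slot: set when the backend holds the page. */
static unsigned long demo_map[(DEMO_SLOTS + DEMO_BITS - 1) / DEMO_BITS];

enum demo_path { DEMO_FRONTSWAP, DEMO_DISK };

/* Backend store callback: 0 means accepted, -1 means rejected. */
typedef int (*demo_store_fn)(unsigned long offset);

static void demo_set(unsigned long offset)
{
    demo_map[offset / DEMO_BITS] |= 1UL << (offset % DEMO_BITS);
}

static void demo_clear(unsigned long offset)
{
    demo_map[offset / DEMO_BITS] &= ~(1UL << (offset % DEMO_BITS));
}

static int demo_test(unsigned long offset)
{
    return (demo_map[offset / DEMO_BITS] >> (offset % DEMO_BITS)) & 1UL;
}

/* Swap-out: try the backend first, fall back to the real swap device. */
enum demo_path demo_swap_out(unsigned long offset, demo_store_fn store)
{
    assert(offset < DEMO_SLOTS);
    if (store(offset) == 0) {
        demo_set(offset);
        return DEMO_FRONTSWAP;
    }
    demo_clear(offset);  /* rejected: the slot's data will live on disk */
    return DEMO_DISK;
}

/* Swap-in: the bitmap alone decides where the page is read from. */
enum demo_path demo_swap_in(unsigned long offset)
{
    assert(offset < DEMO_SLOTS);
    return demo_test(offset) ? DEMO_FRONTSWAP : DEMO_DISK;
}

/* Example policy from the text: accept every ninth page. */
int demo_every_ninth(unsigned long offset)
{
    return (offset % 9 == 0) ? 0 : -1;
}
```

With ``demo_every_ninth`` as the policy, swapping out offset 9 takes the
frontswap path and a later swap-in of offset 9 is served from the backend,
while offset 10 goes to disk in both directions.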

So every time the frontswap backend accepts a page, a swap device write
and (potentially) a swap device read are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend load", which are presumably much
faster.

* Can't frontswap be configured as a "special" swap device that is
  just higher priority than any real swap device (e.g. like zswap,
  or maybe swap-over-nbd/NFS)?

No. First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
but this would require fairly drastic changes. Even if it were
rewritten, the existing swap subsystem uses the block I/O layer, which
assumes a swap device is fixed size and any page in it is linearly
addressable. Frontswap barely touches the existing swap subsystem,
and works around the constraints of the block I/O subsystem to provide
a great deal of flexibility and dynamicity.

For example, the acceptance of any swap page by the frontswap backend is
entirely unpredictable. This is critical to the definition of frontswap
backends because it grants completely dynamic discretion to the
backend. In zcache, one cannot know a priori how compressible a page is.
"Poorly" compressible pages can be rejected, and "poorly" can itself be
defined dynamically depending on current memory constraints.

Further, frontswap is entirely synchronous whereas a real swap
device is, by definition, asynchronous and uses block I/O. The
block I/O layer is not only unnecessary, but may perform "optimizations"
that are inappropriate for a RAM-oriented device, including delaying
the write of some pages for a significant amount of time. Synchrony is
required to ensure the dynamicity of the backend and to avoid thorny race
conditions that would unnecessarily and greatly complicate frontswap
and/or the block I/O subsystem. That said, only the initial "store"
and "load" operations need be synchronous. A separate asynchronous thread
is free to manipulate the pages stored by frontswap. For example,
the "remotification" thread in RAMster uses standard asynchronous
kernel sockets to move compressed frontswap pages to a remote machine.
Similarly, a KVM guest-side implementation could do in-guest compression
and use "batched" hypercalls.

In a virtualized environment, the dynamicity allows the hypervisor
(or host OS) to do "intelligent overcommit". For example, it can
choose to accept pages only until host-swapping might be imminent,
then force guests to do their own swapping.

There is a downside to the transcendent memory specifications for
frontswap: since any "store" might fail, there must always be a real
slot on a real swap device to which the page can be swapped. Thus frontswap must be
implemented as a "shadow" to every swapon'd device with the potential
capability of holding every page that the swap device might have held
and the possibility that it might hold no pages at all. This means
that frontswap cannot contain more pages than the total of swapon'd
swap devices. For example, if NO swap device is configured on some
installation, frontswap is useless. Swapless portable devices
can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used.

* Why this weird definition about "duplicate stores"? If a page
  has been previously successfully stored, can't it always be
  successfully overwritten?

Nearly always it can, but no, sometimes it cannot. Consider an example
where data is compressed and the original 4K page has been compressed
to 1K. Now an attempt is made to overwrite the page with data that
is non-compressible and so would take the entire 4K. But the backend
has no more space. In this case, the store must be rejected. Whenever
frontswap rejects a store that would overwrite, it also must invalidate
the old data and ensure that it is no longer accessible. Since the
swap subsystem then writes the new data to the real swap device,
this is the correct course of action to ensure coherency.
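
The 1K-to-4K example can be reproduced with a toy accounting model. The
sketch below makes several assumptions that are not taken from any real
backend: the ``dup_*`` names, a fixed byte budget, and tracking only the
compressed size of each slot rather than its data. What it illustrates is the
rule itself: a rejected overwrite also drops the old copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DUP_SLOTS 4  /* hypothetical number of page offsets */

struct dup_backend {
    size_t capacity;              /* total bytes the backend may hold */
    size_t used;                  /* bytes currently held */
    size_t slot_size[DUP_SLOTS];  /* compressed size per offset, 0 = absent */
};

/*
 * Store a page whose compressed size is compressed_size bytes.
 * Returns 0 on success. On failure returns -1 AND invalidates any older
 * copy at the same offset, so stale data can never be loaded later.
 */
int dup_store(struct dup_backend *b, unsigned offset, size_t compressed_size)
{
    size_t others;

    assert(b && offset < DUP_SLOTS && compressed_size > 0);
    others = b->used - b->slot_size[offset];  /* space held by other pages */

    if (compressed_size > b->capacity - others) {
        b->slot_size[offset] = 0;             /* reject and invalidate */
        b->used = others;
        return -1;
    }
    b->slot_size[offset] = compressed_size;   /* store or overwrite */
    b->used = others + compressed_size;
    return 0;
}

/* True if a load of this offset would find data. */
bool dup_present(const struct dup_backend *b, unsigned offset)
{
    assert(b && offset < DUP_SLOTS);
    return b->slot_size[offset] != 0;
}
```

With a 4096-byte budget and two pages stored at 1024 bytes each, overwriting
the first with an incompressible 4096-byte page fails, and afterwards
``dup_present()`` reports the first page as gone while the second is
untouched.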

* What is frontswap_shrink for?

When the (non-frontswap) swap subsystem swaps out a page to a real
swap device, that page is only taking up low-value pre-allocated disk
space. But if frontswap has placed a page in transcendent memory, that
page may be taking up valuable real estate. The frontswap_shrink
routine allows code outside of the swap subsystem to force pages out
of the memory managed by frontswap and back into kernel-addressable memory.
For example, in RAMster, a "suction driver" thread will attempt
to "repatriate" pages sent to a remote machine back to the local machine;
this is driven using the frontswap_shrink mechanism when memory pressure
subsides.

* Why does the frontswap patch create the new include file swapfile.h?

The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between
static and global. This seemed a reasonable compromise: define
them as global but declare them in a new include file that isn't
included by the large number of source files that include swap.h.

Dan Magenheimer, last updated April 9, 2012