.. _balance:

================
Memory Balancing
================

Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>

Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
well as for non-__GFP_IO allocations.

The first reason why a caller may avoid reclaim is that the caller cannot
sleep because it holds a spinlock or is in interrupt context. The second may
be that the caller is willing to fail the allocation without incurring the
overhead of page reclaim. This may happen for opportunistic high-order
allocation requests that have order-0 fallback options. In such cases,
the caller may also wish to avoid waking kswapd.

Allocation requests that omit __GFP_IO are made to prevent file system
deadlocks.

In the absence of non-sleepable allocation requests, it seems detrimental
to be doing balancing. Page reclamation can be kicked off lazily, that
is, only when needed (when zone free memory is 0), instead of making it
a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
OTOH, if there are a lot of free dma pages, it is preferable to satisfy
regular memory requests by allocating from the dma pool, instead
of incurring the overhead of regular zone balancing.

In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running production machines of varying memory sizes, and seems to be
doing fine even with the presence of this problem. In 2.3, due to
HIGHMEM, this problem is aggravated.
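
To make the 2.2 problem concrete, here is a minimal userspace sketch of a
global 1/64th check. The struct and function names are hypothetical and only
illustrate the arithmetic; they are not the kernel's data structures. The
zones array is assumed to hold every zone in the system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone descriptor; not the kernel's struct zone. */
struct sketch_zone {
	unsigned long total_pages;
	unsigned long free_pages;
};

/*
 * 2.2-style global check: balance only when the free pages summed over
 * all zones drop below 1/64th of the total pages summed over all zones.
 */
static bool sketch_global_needs_balance(const struct sketch_zone *zones,
					size_t nr_zones)
{
	unsigned long total = 0, free = 0;
	size_t i;

	assert(zones != NULL);

	for (i = 0; i < nr_zones; i++) {
		total += zones[i].total_pages;
		free += zones[i].free_pages;
	}
	return free < total / 64;
}
```

With an empty 4096-page dma zone and a 61440-page regular zone that has 2000
free pages, the threshold is 65536 / 64 = 1024 pages, so this check does not
trigger balancing even though the dma zone has nothing left.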

In 2.3, zone balancing can be done in one of two ways: depending on the
zone size (and possibly on the size of lower class zones), we can decide
at init time how many free pages we should aim for while balancing any
zone. The good part is that, while balancing, we do not need to look at sizes
of lower class zones; the bad part is that we might balance too frequently
because we ignore possibly lower usage in the lower class zones. Also,
with a slight change in the allocation routine, it is possible to reduce
the memclass() macro to be a simple equality.
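
A rough userspace sketch of this first approach follows. The text does not
fix a ratio for the init-time target, so the divisor parameter here is an
assumption, as are all the names; the point is only that the later check
looks at a single zone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified zone descriptor; not the kernel's struct zone. */
struct sketch_zone {
	unsigned long total_pages;
	unsigned long free_pages;
	unsigned long free_target;	/* decided once, at init time */
};

/* Pick the free-page target from the zone size (divisor is an assumption). */
static void sketch_init_target(struct sketch_zone *z, unsigned long divisor)
{
	assert(divisor != 0);
	z->free_target = z->total_pages / divisor;
}

/* The balancing decision consults only this zone, never lower class zones. */
static bool sketch_zone_needs_balance(const struct sketch_zone *z)
{
	return z->free_pages < z->free_target;
}
```

Because the target is precomputed, the check is a single comparison, at the
cost of ignoring free pages that may be available in lower class zones.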

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.
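
The second approach can be sketched in userspace C as below. The names are
hypothetical, and the array is assumed to be ordered from the lowest class
zone (dma) upward, so that "a zone and all its lower class zones" is simply a
prefix of the array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone descriptor; not the kernel's struct zone. */
struct sketch_zone {
	unsigned long total_pages;
	unsigned long free_pages;
};

/*
 * Balance zone 'class' when the free pages in zones[0..class] fall below
 * 1/64th of the total pages in zones[0..class].
 */
static bool sketch_class_needs_balance(const struct sketch_zone *zones,
				       size_t nr_zones, size_t class)
{
	unsigned long total = 0, free = 0;
	size_t i;

	assert(class < nr_zones);

	for (i = 0; i <= class; i++) {
		total += zones[i].total_pages;
		free += zones[i].free_pages;
	}
	return free < total / 64;
}
```

With an empty 4096-page dma zone and a 61440-page regular zone holding 2000
free pages, the dma zone (class 0) is now flagged because 0 < 4096 / 64,
while the regular zone (class 1) is not because 2000 >= 65536 / 64.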

Note that if the size of the regular zone is huge compared to the dma zone,
it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into the regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing cannot be done,
probably because all allocation requests are coming from intr context
and all process contexts are sleeping. For 2.3, kswapd does not really
need to balance the highmem zone, since intr context does not request
highmem pages. kswapd looks at the zone_wake_kswapd field in the zone
structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.

watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
are per-zone fields, used to determine when a zone needs to be balanced. When
the number of free pages falls below watermark[WMARK_MIN], the hysteresis field
low_on_memory gets set. This stays set until the number of free pages reaches
watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
try to free some pages in the zone (provided GFP_WAIT is set in the request).
Orthogonal to this is the decision to poke kswapd to free some zone pages.
That decision is not hysteresis based, and is made when the number of free
pages is below watermark[WMARK_LOW], in which case zone_wake_kswapd is also set.
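
The interplay of the two flags can be sketched in userspace C as follows.
The struct and helper are hypothetical stand-ins that reuse the field names
from the text. Clearing zone_wake_kswapd once free pages are no longer below
watermark[WMARK_LOW] is an assumption made here; the text only says when the
flag is set.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_wmark { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Hypothetical, simplified zone descriptor; not the kernel's struct zone. */
struct sketch_zone {
	unsigned long free_pages;
	unsigned long watermark[NR_WMARK];
	bool low_on_memory;	/* hysteresis: set below MIN, cleared at HIGH */
	bool zone_wake_kswapd;	/* plain threshold: free pages below LOW */
};

/* Recompute both flags after free_pages has changed. */
static void sketch_update_flags(struct sketch_zone *z)
{
	assert(z->watermark[WMARK_MIN] <= z->watermark[WMARK_LOW]);
	assert(z->watermark[WMARK_LOW] <= z->watermark[WMARK_HIGH]);

	if (z->free_pages < z->watermark[WMARK_MIN])
		z->low_on_memory = true;
	else if (z->free_pages >= z->watermark[WMARK_HIGH])
		z->low_on_memory = false;
	/* Between MIN and HIGH, low_on_memory keeps its previous value. */

	/* Assumption: the flag simply tracks the LOW threshold. */
	z->zone_wake_kswapd = z->free_pages < z->watermark[WMARK_LOW];
}
```

Note how a zone that dipped below WMARK_MIN and then recovered to a level
between WMARK_LOW and WMARK_HIGH still has low_on_memory set, while
zone_wake_kswapd depends only on the current free page count.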


(Good) Ideas that I have heard:

1. Dynamic experience should influence balancing: number of failed requests
   for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
   dma pages. (lkd@tantalophile.demon.co.uk)