.. _swap_numa:

===========================================
Automatically bind swap device to NUMA node
===========================================

If the system has more than one swap device and the swap devices have node
information, we can use that information in get_swap_pages() to decide which
swap device to use and so get better performance.


How to use this feature
=======================

Each swap device has a priority, which decides the order in which devices
are used. To make use of automatic binding, there is no need to manipulate
the priority settings of the swap devices. For example, on a 2 node machine,
assume two swap devices, swapA and swapB, are going to be swapped on, with
swapA attached to node 0 and swapB attached to node 1. Simply swap them on
by doing::

    # swapon /dev/swapA
    # swapon /dev/swapB

Then node 0 will use the two swap devices in the order swapA then swapB, and
node 1 will use them in the order swapB then swapA. Note that the order in
which they are swapped on doesn't matter.

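The claim that the swapon order does not matter in this two node case can be
checked with a small model. The following is a hypothetical Python simulation,
not kernel code: the names are made up for illustration, and it assumes the
policy described under "Implementation details" (auto-assigned priorities
start at -2 and decrease in swapon order, and a device is promoted to -1 on
its own node).

```python
from itertools import permutations

# Hypothetical model of the policy; names are illustrative, not kernel symbols.
DEV_NODE = {"swapA": 0, "swapB": 1}   # device -> NUMA node it is attached to


def order_seen_by(node, swapon_order):
    """Return the devices in the order `node` would use them."""
    # Auto-assigned priorities: -2 for the first swapon, -3 for the next, ...
    prio = {dev: -2 - i for i, dev in enumerate(swapon_order)}
    # A device is promoted to -1 on the list of its own node.
    seen = {dev: -1 if DEV_NODE[dev] == node else p for dev, p in prio.items()}
    # A higher priority value is used first.
    return sorted(seen, key=seen.get, reverse=True)


# Try both swapon orders; each node sees the same result either way.
for swapon_order in permutations(DEV_NODE):
    print(swapon_order, order_seen_by(0, swapon_order), order_seen_by(1, swapon_order))
```

In this model the local device always lands on -1 and the remote one on -2 or
lower, so swapping the two swapon commands cannot change either node's order.
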
A more complex example on a 4 node machine: assume 6 swap devices are going
to be swapped on. swapA and swapB are attached to node 0, swapC is attached
to node 1, swapD and swapE are attached to node 2 and swapF is attached to
node 3. The way to swap them on is the same as above::

    # swapon /dev/swapA
    # swapon /dev/swapB
    # swapon /dev/swapC
    # swapon /dev/swapD
    # swapon /dev/swapE
    # swapon /dev/swapF

Then node 0 will use them in the order of::

    swapA/swapB -> swapC -> swapD -> swapE -> swapF

swapA and swapB will be used in a round robin mode before any other swap device.

Node 1 will use them in the order of::

    swapC -> swapA -> swapB -> swapD -> swapE -> swapF

Node 2 will use them in the order of::

    swapD/swapE -> swapA -> swapB -> swapC -> swapF

Similarly, swapD and swapE will be used in a round robin mode before any
other swap devices.

Node 3 will use them in the order of::

    swapF -> swapA -> swapB -> swapC -> swapD -> swapE

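The four per-node orders listed above can be reproduced with a short model.
This is a sketch in Python under stated assumptions, not the kernel
implementation: ``node_order`` and the constants are hypothetical names, no
device has a user-set priority, auto-assigned priorities run -2, -3, ... in
swapon order, and devices sharing a priority form one round robin group.

```python
from itertools import groupby

# Hypothetical model, not kernel code: names and structure are illustrative.
PROMOTED = -1     # priority reserved for a device on its own node
FIRST_AUTO = -2   # first auto-assigned priority, then -3, -4, ...

# Device -> NUMA node, listed in swapon order (the 4 node example).
DEV_NODE = {"swapA": 0, "swapB": 0, "swapC": 1,
            "swapD": 2, "swapE": 2, "swapF": 3}


def node_order(devices, dev_node, node):
    """Return round robin groups of devices, first-used group first."""
    prio = {dev: FIRST_AUTO - i for i, dev in enumerate(devices)}
    effective = {dev: PROMOTED if dev_node[dev] == node else prio[dev]
                 for dev in devices}
    # Stable sort, highest priority first; ties keep swapon order.
    ranked = sorted(devices, key=lambda dev: -effective[dev])
    # Consecutive devices with equal priority share one round robin group.
    return [list(group) for _, group in groupby(ranked, key=effective.get)]


for node in range(4):
    groups = node_order(list(DEV_NODE), DEV_NODE, node)
    print(f"node {node}:", " -> ".join("/".join(g) for g in groups))
```

Unlike the two node case, the swapon order does influence how the non-local
devices are ranked here, because their auto-assigned priorities follow it.
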

Implementation details
======================

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use, and if multiple swap devices share the same
priority, they are used round robin. This change replaces the single
global swap_avail_list with a per-NUMA-node list, i.e. each NUMA node
sees its own priority based list of available swap devices. A swap
device's priority can be promoted on its matching node's swap_avail_list.

The current swap device's priority is set as follows: the user can set a
value >= 0, or the system will pick one starting from -1 and going
downwards. The priority value in the swap_avail_list is the negated value of
the swap device's priority, because a plist is sorted from low to high. The
new policy doesn't change the semantics for the priority >= 0 case. The
previous "starting from -1 then downwards" now becomes "starting from -2
then downwards", and -1 is reserved as the promoted value. So if multiple
swap devices are attached to the same node, they will all be promoted to
priority -1 on that node's plist and will be used round robin before any
other swap devices.
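
The interplay of user-set priorities, auto-assigned priorities and the negated
list key can be sketched as follows. This is a hypothetical Python model, not
the kernel code: ``PriorityAllocator`` and ``list_key`` are invented names, and
the sketch assumes that only auto-assigned (negative) priorities are promoted,
which is one reading of "doesn't change the semantics for the priority >= 0
case".

```python
# Hypothetical sketch; names are illustrative and are not kernel symbols.
PROMOTED = -1   # priority reserved for a device on its own node's list


class PriorityAllocator:
    """Hands out device priorities the way the text describes."""

    def __init__(self):
        self.least = -1          # -1 is reserved, so auto values begin at -2

    def assign(self, user_prio=None):
        if user_prio is not None and user_prio >= 0:
            return user_prio     # user-set values are used unchanged
        self.least -= 1
        return self.least        # -2, -3, ...


def list_key(prio, dev_node, list_node):
    """Sort key of a device on the list of `list_node`; lower key is used first."""
    if prio < 0 and dev_node == list_node:
        prio = PROMOTED          # assumption: only auto priorities are promoted
    return -prio                 # negated because the list sorts low to high


alloc = PriorityAllocator()
devices = {                      # name -> (priority, node); example values
    "swapX": (alloc.assign(5), 1),
    "swapA": (alloc.assign(), 0),
    "swapB": (alloc.assign(), 1),
}
for node in (0, 1):
    keys = {name: list_key(p, n, node) for name, (p, n) in devices.items()}
    print(f"node {node}:", sorted(keys, key=keys.get), keys)
```

In this model a device with a user-set priority keeps the same key on every
node, while an auto-prioritised device gets key 1 on its own node and a larger
key elsewhere.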