=================
Scheduler Domains
=================

Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
MUST be NULL terminated, and domain structures should be per-CPU as they are
locklessly updated.

Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of its child's span (this restriction could
be relaxed if the need arises), and a base domain for CPU i MUST span at least
i. The top domain for each CPU will generally span all CPUs in the system,
although strictly it doesn't have to; if it doesn't, some CPUs may never be
given tasks to run unless the CPUs allowed mask is explicitly set. A sched
domain's span means "balance process load among these CPUs".

Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular one-way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The group pointed to by the ->groups pointer MUST contain the
CPU to which the domain belongs. Groups may be shared among CPUs as they
contain read-only data after they have been set up. The intersection of
cpumasks from any two of these groups may be non-empty. If this is the case,
the SD_OVERLAP flag is set on the corresponding scheduling domain and its
groups may not be shared between CPUs.

Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.

In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU
through scheduler_tick(). It raises a softirq after the next regularly scheduled
rebalancing event for the current runqueue has arrived. The actual load
balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
in softirq context (SCHED_SOFTIRQ).

The latter function takes two arguments: the current CPU and whether it was
idle at the time the scheduler_tick() happened. It iterates over all sched
domains our CPU is on, starting from its base domain and going up the ->parent
chain. While doing that, it checks to see if the current domain has exhausted
its rebalance interval. If so, it runs load_balance() on that domain. It then
checks the parent sched_domain (if it exists), and the parent of the parent,
and so forth.

Initially, load_balance() finds the busiest group in the current sched domain.
If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
that group. If it manages to find such a runqueue, it locks both our initial
CPU's runqueue and the newly found busiest one and starts moving tasks from it
to our runqueue. The exact number of tasks amounts to an imbalance previously
computed while iterating over this sched domain's groups.

Implementing sched domains
==========================

The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.

In SMP, the parent of the base domain will span all physical CPUs in the
node, with each group being a single physical CPU. Then with NUMA, the parent
of the SMP domain will span the entire machine, with each group having the
cpumask of a node. You could also build multi-level NUMA domains; an Opteron,
for example, might have just one domain covering its single NUMA level.

The implementor should read the comments in include/linux/sched/sd_flags.h
to get an idea of the specifics of the SD_* flags and what to tune for the
SD flags of a sched_domain.

Architectures may override the generic domain builder and the default SD flags
for a given topology level by creating a sched_domain_topology_level array and
calling set_sched_topology() with this array as the parameter.
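A minimal sketch of what such an override can look like follows. The helpers
shown (cpu_smt_mask, cpu_smt_flags, cpu_cpu_mask, SD_INIT_NAME) exist in the
kernel, but the array name is invented here, and which levels an architecture
actually needs is hardware-specific, so treat this as an illustration rather
than a drop-in implementation:

```c
/* Kernel-side sketch: a custom topology table handed to the scheduler. */
static struct sched_domain_topology_level my_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

/* Called during early architecture setup, before the domains are built: */
set_sched_topology(my_topology);
```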

The sched-domains debugging infrastructure can be enabled by enabling
CONFIG_SCHED_DEBUG and adding 'sched_verbose' to your cmdline. If you
forgot to tweak your cmdline, you can also flip the
/sys/kernel/debug/sched/verbose knob. This enables an error-checking parse of
the sched domains which should catch most possible errors (described above). It
also prints out the domain structure in a visual format.