NOTE: all this assumes a linear relation between frequency and work capacity;
we know this is flawed, but it is the best workable approximation.


PELT (Per Entity Load Tracking)
-------------------------------

With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
we use an Exponentially Weighted Moving Average (EWMA); each period (1024us)
is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
half, while the rest of history contributes the other half.

Specifically:

  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

  ewma(u) = ewma_sum(u) / ewma_sum(1)

Since this is essentially a progression of an infinite geometric series, the
results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
is key, since it gives the ability to recompose the averages when tasks move
around.
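
To make the decay concrete, here is a minimal user-space sketch (not the
kernel's fixed-point implementation, see kernel/sched/pelt.c) that computes
y from y^32 = 0.5 and checks that the most recent 32 periods carry roughly
half of the total weight:

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
          double y = pow(0.5, 1.0 / 32.0);        /* y^32 = 0.5 */
          double recent = 0.0, total = 0.0;

          /* ewma_sum(1): every 1024us period fully busy (u_n = 1). */
          for (int n = 0; n < 1000; n++) {
                  double w = pow(y, n);

                  total += w;
                  if (n < 32)                     /* the newest 32 periods */
                          recent += w;
          }

          printf("y = %.5f\n", y);                         /* ~0.97857 */
          printf("recent/total = %.3f\n", recent / total); /* ~0.500 */
          return 0;
  }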

Note that blocked tasks still contribute to the aggregates (task-group slices
and CPU runqueues), which reflects their expected contribution when they
resume running.

Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
reflects the time an entity spends on the CPU, while 'runnable' reflects the
time an entity spends on the runqueue. When there is only a single task these
two metrics are the same, but once there is contention for the CPU 'running'
will decrease to reflect the fraction of time each task spends on the CPU
while 'runnable' will increase to reflect the amount of contention.
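
For example, with two always-runnable tasks sharing one CPU, each task runs
half the time: each task's 'running' converges to 0.5 while its 'runnable'
approaches 1.0 (it occupies the runqueue the whole time), and the runqueue
aggregates converge to running = 1.0 and runnable = 2.0.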

For more detail see: kernel/sched/pelt.c


Frequency- / CPU Invariance
---------------------------

Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.

For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as:

            f_cur
  r_dvfs := -----
            f_max
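
For example, a task that is busy 50% of wall time on a CPU running at 1GHz
whose f_max is 2GHz has r_dvfs = 0.5 and thus accrues 0.5 * 0.5 = 0.25 of
invariant utilization; the same 50% at 2GHz would accrue the full 0.5.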

For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
For Intel specifically, we use:

           APERF
  f_cur := ----- * P0
           MPERF

             4C-turbo;  if available and turbo enabled
  f_max := { 1C-turbo;  if turbo enabled
             P0;        otherwise

                    f_cur
  r_dvfs := min( 1, ----- )
                    f_max

We pick 4C turbo over 1C turbo to make it slightly more sustainable.
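
As a sketch of this computation with made-up numbers (the real code keeps
per-CPU MSR snapshots, see arch/x86/kernel/smpboot.c; the frequencies and
helper name below are illustrative only), using the kernel's 1024 fixed-point
capacity scale:

  #include <stdint.h>

  #define SCHED_CAPACITY_SCALE  1024

  /* Assumed frequencies, in kHz, for illustration only. */
  static const uint64_t f_max_khz = 3600000;    /* 4C-turbo */
  static const uint64_t p0_khz    = 2400000;    /* P0 (guaranteed) */

  /* r_dvfs from the APERF/MPERF deltas since the last update. */
  static uint64_t r_dvfs(uint64_t aperf_delta, uint64_t mperf_delta)
  {
          /* f_cur = APERF/MPERF * P0 */
          uint64_t f_cur = aperf_delta * p0_khz / mperf_delta;

          /* r_dvfs = min(1, f_cur / f_max), as 1024 fixed point */
          uint64_t r = f_cur * SCHED_CAPACITY_SCALE / f_max_khz;

          return r > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : r;
  }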

r_cpu is determined as the ratio of highest performance level of the current
CPU vs the highest performance level of any other CPU in the system.

  r_tot = r_dvfs * r_cpu
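
Conceptually, the time delta fed into the EWMA is scaled by both ratios before
it is accumulated. A sketch of that step (the kernel does this in
kernel/sched/pelt.h:update_rq_clock_pelt(); the helper below is illustrative):

  #include <stdint.h>

  /* Both ratios are in the kernel's 1024 fixed-point capacity scale. */
  static inline uint64_t scale_delta(uint64_t delta_us,
                                     unsigned long r_dvfs,
                                     unsigned long r_cpu)
  {
          /* r_tot = r_dvfs * r_cpu */
          delta_us = delta_us * r_dvfs >> 10;   /* * r_dvfs / 1024 */
          delta_us = delta_us * r_cpu  >> 10;   /* * r_cpu  / 1024 */
          return delta_us;
  }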

The result is that the above 'running' and 'runnable' metrics become invariant
of DVFS and CPU type. IOW, we can transfer and compare them between CPUs.

For more detail see:

- kernel/sched/pelt.h:update_rq_clock_pelt()
- arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


UTIL_EST / UTIL_EST_FASTUP
--------------------------

Because periodic tasks have their averages decayed while they sleep, even
though their expected utilization is the same when they run again, they
suffer a (DVFS) ramp-up after waking.

To alleviate this, UTIL_EST (a default-enabled option) drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default-enabled option, UTIL_EST_FASTUP, modifies the IIR
filter to instantly increase and only decay on decrease.

A further runqueue-wide sum (over the runnable tasks) is maintained:

  util_est := \Sum_t max( t_running, t_util_est_ewma )
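
A sketch of the dequeue-time filter described above (the weight and the
helper name are assumptions for illustration; the kernel's version lives in
kernel/sched/fair.c):

  #define UTIL_EST_WEIGHT_SHIFT   2       /* assumed IIR weight: 1/4 */

  static unsigned long util_est_update(unsigned long ewma,
                                       unsigned long running)
  {
          /* UTIL_EST_FASTUP: increases take effect instantly... */
          if (running >= ewma)
                  return running;

          /* ...while decreases decay through the IIR filter:
             ewma = ewma - ewma/4 + running/4 */
          ewma -= ewma >> UTIL_EST_WEIGHT_SHIFT;
          ewma += running >> UTIL_EST_WEIGHT_SHIFT;
          return ewma;
  }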

For more detail see: kernel/sched/fair.c:util_est_dequeue()


UCLAMP
------

It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps a max aggregate of these clamps for all running tasks.
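
For example, user space can set the clamps with sched_setattr(2); a minimal
sketch (glibc provides no wrapper, so the raw syscall is used; values are in
the kernel's [0, 1024] capacity scale):

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/sched.h>
  #include <linux/sched/types.h>        /* struct sched_attr */

  int set_uclamp(pid_t pid, uint32_t umin, uint32_t umax)
  {
          struct sched_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          /* Keep policy/params, only change the clamps. */
          attr.sched_flags = SCHED_FLAG_KEEP_ALL |
                             SCHED_FLAG_UTIL_CLAMP_MIN |
                             SCHED_FLAG_UTIL_CLAMP_MAX;
          attr.sched_util_min = umin;   /* e.g. 256: at least 25% capacity */
          attr.sched_util_max = umax;   /* e.g. 512: at most 50% capacity */

          return syscall(SYS_sched_setattr, pid, &attr, 0);
  }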

For more detail see: include/uapi/linux/sched/types.h


Schedutil / DVFS
----------------

Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.

The basis is the CPU runqueue's 'running' metric, which per the above is the
frequency-invariant utilization estimate of the CPU. From this we compute a
desired frequency like:

             max( running, util_est );  if UTIL_EST
  u_cfs := { running;                   otherwise

               clamp( u_cfs + u_rt , u_min, u_max );  if UCLAMP_TASK
  u_clamp := { u_cfs + u_rt;                          otherwise

  u := u_clamp + u_irq + u_dl;          [approx. see source for more detail]

  f_des := min( f_max, 1.25 u * f_max )
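
A sketch of that last step, mirroring the shape of schedutil's frequency
selection in kernel/sched/cpufreq_schedutil.c ('max' is the capacity that the
utilization is measured against; the helper name is illustrative):

  static unsigned long f_des(unsigned long util, unsigned long max,
                             unsigned long f_max)
  {
          /* 1.25 * u * f_max, with 1.25 computed as 1 + 1/4 */
          unsigned long f = (f_max + (f_max >> 2)) * util / max;

          return f < f_max ? f : f_max;   /* min( f_max, ... ) */
  }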

XXX IO-wait: when the update is due to a task wakeup from IO-completion we
boost 'u' above.

This frequency is then used to select a P-state/OPP or directly munged into a
CPPC-style request to the hardware.

XXX: deadline tasks (Sporadic Task Model) allow us to calculate a hard f_min
required to satisfy the workload.

Because these callbacks are directly from the scheduler, the DVFS hardware
interaction should be 'fast' and non-blocking. Schedutil supports
rate-limiting DVFS requests for when hardware interaction is slow and
expensive; this reduces effectiveness.

For more information see: kernel/sched/cpufreq_schedutil.c


NOTES
-----

- On low-load scenarios, where DVFS is most relevant, the 'running' numbers
  will closely reflect utilization.

- In saturated scenarios task movement will cause some transient dips:
  suppose we have a CPU saturated with 4 tasks, then when we migrate a task
  to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
  new CPU will gain 0.25. This is inevitable and time progression will
  correct this. XXX do we still guarantee f_max due to no idle-time?

- Much of the above is about avoiding DVFS dips, and independent DVFS domains
  having to re-learn / ramp-up when load shifts.