Valentin Schneider | 65065fd | 2020-07-31 20:20:15 +0100 | [diff] [blame^] | 1 | ========================= |
| 2 | Capacity Aware Scheduling |
| 3 | ========================= |
| 4 | |
| 5 | 1. CPU Capacity |
| 6 | =============== |
| 7 | |
| 8 | 1.1 Introduction |
| 9 | ---------------- |
| 10 | |
| 11 | Conventional, homogeneous SMP platforms are composed of purely identical |
| 12 | CPUs. Heterogeneous platforms on the other hand are composed of CPUs with |
| 13 | different performance characteristics - on such platforms, not all CPUs can be |
| 14 | considered equal. |
| 15 | |
| 16 | CPU capacity is a measure of the performance a CPU can reach, normalized against |
| 17 | the most performant CPU in the system. Heterogeneous systems are also called |
| 18 | asymmetric CPU capacity systems, as they contain CPUs of different capacities. |
| 19 | |
| 20 | Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems |
| 21 | from two factors: |
| 22 | |
| 23 | - not all CPUs may have the same microarchitecture (µarch). |
| 24 | - with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be |
| 25 | physically able to attain the higher Operating Performance Points (OPP). |
| 26 | |
| 27 | Arm big.LITTLE systems are an example of both. The big CPUs are more |
| 28 | performance-oriented than the LITTLE ones (more pipeline stages, bigger caches, |
| 29 | smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones |
| 30 | can. |
| 31 | |
| 32 | CPU performance is usually expressed in Millions of Instructions Per Second |
| 33 | (MIPS), which can also be expressed as a given amount of instructions attainable |
| 34 | per Hz, leading to:: |
| 35 | |
| 36 | capacity(cpu) = work_per_hz(cpu) * max_freq(cpu) |
| 37 | |
| 38 | 1.2 Scheduler terms |
| 39 | ------------------- |
| 40 | |
| 41 | Two different capacity values are used within the scheduler. A CPU's |
| 42 | ``capacity_orig`` is its maximum attainable capacity, i.e. its maximum |
| 43 | attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to |
| 44 | which some loss of available performance (e.g. time spent handling IRQs) is |
| 45 | subtracted. |
| 46 | |
| 47 | Note that a CPU's ``capacity`` is solely intended to be used by the CFS class, |
| 48 | while ``capacity_orig`` is class-agnostic. The rest of this document will use |
| 49 | the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of |
| 50 | brevity. |
| 51 | |
| 52 | 1.3 Platform examples |
| 53 | --------------------- |
| 54 | |
| 55 | 1.3.1 Identical OPPs |
| 56 | ~~~~~~~~~~~~~~~~~~~~ |
| 57 | |
| 58 | Consider an hypothetical dual-core asymmetric CPU capacity system where |
| 59 | |
| 60 | - work_per_hz(CPU0) = W |
| 61 | - work_per_hz(CPU1) = W/2 |
| 62 | - all CPUs are running at the same fixed frequency |
| 63 | |
| 64 | By the above definition of capacity: |
| 65 | |
| 66 | - capacity(CPU0) = C |
| 67 | - capacity(CPU1) = C/2 |
| 68 | |
| 69 | To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would |
| 70 | be a LITTLE. |
| 71 | |
| 72 | With a workload that periodically does a fixed amount of work, you will get an |
| 73 | execution trace like so:: |
| 74 | |
| 75 | CPU0 work ^ |
| 76 | | ____ ____ ____ |
| 77 | | | | | | | | |
| 78 | +----+----+----+----+----+----+----+----+----+----+-> time |
| 79 | |
| 80 | CPU1 work ^ |
| 81 | | _________ _________ ____ |
| 82 | | | | | | | |
| 83 | +----+----+----+----+----+----+----+----+----+----+-> time |
| 84 | |
| 85 | CPU0 has the highest capacity in the system (C), and completes a fixed amount of |
| 86 | work W in T units of time. On the other hand, CPU1 has half the capacity of |
| 87 | CPU0, and thus only completes W/2 in T. |
| 88 | |
| 89 | 1.3.2 Different max OPPs |
| 90 | ~~~~~~~~~~~~~~~~~~~~~~~~ |
| 91 | |
| 92 | Usually, CPUs of different capacity values also have different maximum |
| 93 | OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with: |
| 94 | |
| 95 | - max_freq(CPU0) = F |
| 96 | - max_freq(CPU1) = 2/3 * F |
| 97 | |
| 98 | This yields: |
| 99 | |
| 100 | - capacity(CPU0) = C |
| 101 | - capacity(CPU1) = C/3 |
| 102 | |
| 103 | Executing the same workload as described in 1.3.1, which each CPU running at its |
| 104 | maximum frequency results in:: |
| 105 | |
| 106 | CPU0 work ^ |
| 107 | | ____ ____ ____ |
| 108 | | | | | | | | |
| 109 | +----+----+----+----+----+----+----+----+----+----+-> time |
| 110 | |
| 111 | workload on CPU1 |
| 112 | CPU1 work ^ |
| 113 | | ______________ ______________ ____ |
| 114 | | | | | | | |
| 115 | +----+----+----+----+----+----+----+----+----+----+-> time |
| 116 | |
| 117 | 1.4 Representation caveat |
| 118 | ------------------------- |
| 119 | |
| 120 | It should be noted that having a *single* value to represent differences in CPU |
| 121 | performance is somewhat of a contentious point. The relative performance |
| 122 | difference between two different µarchs could be X% on integer operations, Y% on |
| 123 | floating point operations, Z% on branches, and so on. Still, results using this |
| 124 | simple approach have been satisfactory for now. |
| 125 | |
| 126 | 2. Task utilization |
| 127 | =================== |
| 128 | |
| 129 | 2.1 Introduction |
| 130 | ---------------- |
| 131 | |
| 132 | Capacity aware scheduling requires an expression of a task's requirements with |
| 133 | regards to CPU capacity. Each scheduler class can express this differently, and |
| 134 | while task utilization is specific to CFS, it is convenient to describe it here |
| 135 | in order to introduce more generic concepts. |
| 136 | |
| 137 | Task utilization is a percentage meant to represent the throughput requirements |
| 138 | of a task. A simple approximation of it is the task's duty cycle, i.e.:: |
| 139 | |
| 140 | task_util(p) = duty_cycle(p) |
| 141 | |
| 142 | On an SMP system with fixed frequencies, 100% utilization suggests the task is a |
| 143 | busy loop. Conversely, 10% utilization hints it is a small periodic task that |
| 144 | spends more time sleeping than executing. Variable CPU frequencies and |
| 145 | asymmetric CPU capacities complexify this somewhat; the following sections will |
| 146 | expand on these. |
| 147 | |
| 148 | 2.2 Frequency invariance |
| 149 | ------------------------ |
| 150 | |
| 151 | One issue that needs to be taken into account is that a workload's duty cycle is |
| 152 | directly impacted by the current OPP the CPU is running at. Consider running a |
| 153 | periodic workload at a given frequency F:: |
| 154 | |
| 155 | CPU work ^ |
| 156 | | ____ ____ ____ |
| 157 | | | | | | | | |
| 158 | +----+----+----+----+----+----+----+----+----+----+-> time |
| 159 | |
| 160 | This yields duty_cycle(p) == 25%. |
| 161 | |
| 162 | Now, consider running the *same* workload at frequency F/2:: |
| 163 | |
| 164 | CPU work ^ |
| 165 | | _________ _________ ____ |
| 166 | | | | | | | |
| 167 | +----+----+----+----+----+----+----+----+----+----+-> time |
| 168 | |
| 169 | This yields duty_cycle(p) == 50%, despite the task having the exact same |
| 170 | behaviour (i.e. executing the same amount of work) in both executions. |
| 171 | |
| 172 | The task utilization signal can be made frequency invariant using the following |
| 173 | formula:: |
| 174 | |
| 175 | task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu)) |
| 176 | |
| 177 | Applying this formula to the two examples above yields a frequency invariant |
| 178 | task utilization of 25%. |
| 179 | |
| 180 | 2.3 CPU invariance |
| 181 | ------------------ |
| 182 | |
| 183 | CPU capacity has a similar effect on task utilization in that running an |
| 184 | identical workload on CPUs of different capacity values will yield different |
| 185 | duty cycles. |
| 186 | |
| 187 | Consider the system described in 1.3.2., i.e.:: |
| 188 | |
| 189 | - capacity(CPU0) = C |
| 190 | - capacity(CPU1) = C/3 |
| 191 | |
| 192 | Executing a given periodic workload on each CPU at their maximum frequency would |
| 193 | result in:: |
| 194 | |
| 195 | CPU0 work ^ |
| 196 | | ____ ____ ____ |
| 197 | | | | | | | | |
| 198 | +----+----+----+----+----+----+----+----+----+----+-> time |
| 199 | |
| 200 | CPU1 work ^ |
| 201 | | ______________ ______________ ____ |
| 202 | | | | | | | |
| 203 | +----+----+----+----+----+----+----+----+----+----+-> time |
| 204 | |
| 205 | IOW, |
| 206 | |
| 207 | - duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency |
| 208 | - duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency |
| 209 | |
| 210 | The task utilization signal can be made CPU invariant using the following |
| 211 | formula:: |
| 212 | |
| 213 | task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity) |
| 214 | |
| 215 | with ``max_capacity`` being the highest CPU capacity value in the |
| 216 | system. Applying this formula to the above example above yields a CPU |
| 217 | invariant task utilization of 25%. |
| 218 | |
| 219 | 2.4 Invariant task utilization |
| 220 | ------------------------------ |
| 221 | |
| 222 | Both frequency and CPU invariance need to be applied to task utilization in |
| 223 | order to obtain a truly invariant signal. The pseudo-formula for a task |
| 224 | utilization that is both CPU and frequency invariant is thus, for a given |
| 225 | task p:: |
| 226 | |
| 227 | curr_frequency(cpu) capacity(cpu) |
| 228 | task_util_inv(p) = duty_cycle(p) * ------------------- * ------------- |
| 229 | max_frequency(cpu) max_capacity |
| 230 | |
| 231 | In other words, invariant task utilization describes the behaviour of a task as |
| 232 | if it were running on the highest-capacity CPU in the system, running at its |
| 233 | maximum frequency. |
| 234 | |
| 235 | Any mention of task utilization in the following sections will imply its |
| 236 | invariant form. |
| 237 | |
| 238 | 2.5 Utilization estimation |
| 239 | -------------------------- |
| 240 | |
| 241 | Without a crystal ball, task behaviour (and thus task utilization) cannot |
| 242 | accurately be predicted the moment a task first becomes runnable. The CFS class |
| 243 | maintains a handful of CPU and task signals based on the Per-Entity Load |
| 244 | Tracking (PELT) mechanism, one of those yielding an *average* utilization (as |
| 245 | opposed to instantaneous). |
| 246 | |
| 247 | This means that while the capacity aware scheduling criteria will be written |
| 248 | considering a "true" task utilization (using a crystal ball), the implementation |
| 249 | will only ever be able to use an estimator thereof. |
| 250 | |
| 251 | 3. Capacity aware scheduling requirements |
| 252 | ========================================= |
| 253 | |
| 254 | 3.1 CPU capacity |
| 255 | ---------------- |
| 256 | |
| 257 | Linux cannot currently figure out CPU capacity on its own, this information thus |
| 258 | needs to be handed to it. Architectures must define arch_scale_cpu_capacity() |
| 259 | for that purpose. |
| 260 | |
| 261 | The arm and arm64 architectures directly map this to the arch_topology driver |
| 262 | CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see |
| 263 | Documentation/devicetree/bindings/arm/cpu-capacity.txt. |
| 264 | |
| 265 | 3.2 Frequency invariance |
| 266 | ------------------------ |
| 267 | |
| 268 | As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task |
| 269 | utilization. Architectures must define arch_scale_freq_capacity(cpu) for that |
| 270 | purpose. |
| 271 | |
| 272 | Implementing this function requires figuring out at which frequency each CPU |
| 273 | have been running at. One way to implement this is to leverage hardware counters |
| 274 | whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86, |
| 275 | AMU on arm64). Another is to directly hook into cpufreq frequency transitions, |
| 276 | when the kernel is aware of the switched-to frequency (also employed by |
| 277 | arm/arm64). |
| 278 | |
| 279 | 4. Scheduler topology |
| 280 | ===================== |
| 281 | |
| 282 | During the construction of the sched domains, the scheduler will figure out |
| 283 | whether the system exhibits asymmetric CPU capacities. Should that be the |
| 284 | case: |
| 285 | |
| 286 | - The sched_asym_cpucapacity static key will be enabled. |
| 287 | - The SD_ASYM_CPUCAPACITY flag will be set at the lowest sched_domain level that |
| 288 | spans all unique CPU capacity values. |
| 289 | |
| 290 | The sched_asym_cpucapacity static key is intended to guard sections of code that |
| 291 | cater to asymmetric CPU capacity systems. Do note however that said key is |
| 292 | *system-wide*. Imagine the following setup using cpusets:: |
| 293 | |
| 294 | capacity C/2 C |
| 295 | ________ ________ |
| 296 | / \ / \ |
| 297 | CPUs 0 1 2 3 4 5 6 7 |
| 298 | \__/ \______________/ |
| 299 | cpusets cs0 cs1 |
| 300 | |
| 301 | Which could be created via: |
| 302 | |
| 303 | .. code-block:: sh |
| 304 | |
| 305 | mkdir /sys/fs/cgroup/cpuset/cs0 |
| 306 | echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus |
| 307 | echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems |
| 308 | |
| 309 | mkdir /sys/fs/cgroup/cpuset/cs1 |
| 310 | echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus |
| 311 | echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems |
| 312 | |
| 313 | echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance |
| 314 | |
| 315 | Since there *is* CPU capacity asymmetry in the system, the |
| 316 | sched_asym_cpucapacity static key will be enabled. However, the sched_domain |
| 317 | hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't |
| 318 | set in that hierarchy, it describes an SMP island and should be treated as such. |
| 319 | |
| 320 | Therefore, the 'canonical' pattern for protecting codepaths that cater to |
| 321 | asymmetric CPU capacities is to: |
| 322 | |
| 323 | - Check the sched_asym_cpucapacity static key |
| 324 | - If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in |
| 325 | the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific |
| 326 | CPU or group thereof) |
| 327 | |
| 328 | 5. Capacity aware scheduling implementation |
| 329 | =========================================== |
| 330 | |
| 331 | 5.1 CFS |
| 332 | ------- |
| 333 | |
| 334 | 5.1.1 Capacity fitness |
| 335 | ~~~~~~~~~~~~~~~~~~~~~~ |
| 336 | |
| 337 | The main capacity scheduling criterion of CFS is:: |
| 338 | |
| 339 | task_util(p) < capacity(task_cpu(p)) |
| 340 | |
| 341 | This is commonly called the capacity fitness criterion, i.e. CFS must ensure a |
| 342 | task "fits" on its CPU. If it is violated, the task will need to achieve more |
| 343 | work than what its CPU can provide: it will be CPU-bound. |
| 344 | |
| 345 | Furthermore, uclamp lets userspace specify a minimum and a maximum utilization |
| 346 | value for a task, either via sched_setattr() or via the cgroup interface (see |
| 347 | Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to |
| 348 | clamp task_util() in the previous criterion. |
| 349 | |
| 350 | 5.1.2 Wakeup CPU selection |
| 351 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 352 | |
| 353 | CFS task wakeup CPU selection follows the capacity fitness criterion described |
| 354 | above. On top of that, uclamp is used to clamp the task utilization values, |
| 355 | which lets userspace have more leverage over the CPU selection of CFS |
| 356 | tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies:: |
| 357 | |
| 358 | clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu) |
| 359 | |
| 360 | By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run |
| 361 | on any CPU by giving it a low uclamp.max value. Conversely, it can force a small |
| 362 | periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by |
| 363 | giving it a high uclamp.min value. |
| 364 | |
| 365 | .. note:: |
| 366 | |
| 367 | Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling |
| 368 | (EAS), which is described in Documentation/scheduling/sched-energy.rst. |
| 369 | |
| 370 | 5.1.3 Load balancing |
| 371 | ~~~~~~~~~~~~~~~~~~~~ |
| 372 | |
| 373 | A pathological case in the wakeup CPU selection occurs when a task rarely |
| 374 | sleeps, if at all - it thus rarely wakes up, if at all. Consider:: |
| 375 | |
| 376 | w == wakeup event |
| 377 | |
| 378 | capacity(CPU0) = C |
| 379 | capacity(CPU1) = C / 3 |
| 380 | |
| 381 | workload on CPU0 |
| 382 | CPU work ^ |
| 383 | | _________ _________ ____ |
| 384 | | | | | | | |
| 385 | +----+----+----+----+----+----+----+----+----+----+-> time |
| 386 | w w w |
| 387 | |
| 388 | workload on CPU1 |
| 389 | CPU work ^ |
| 390 | | ____________________________________________ |
| 391 | | | |
| 392 | +----+----+----+----+----+----+----+----+----+----+-> |
| 393 | w |
| 394 | |
| 395 | This workload should run on CPU0, but if the task either: |
| 396 | |
| 397 | - was improperly scheduled from the start (inaccurate initial |
| 398 | utilization estimation) |
| 399 | - was properly scheduled from the start, but suddenly needs more |
| 400 | processing power |
| 401 | |
| 402 | then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``; |
| 403 | the CPU capacity scheduling criterion is violated, and there may not be any more |
| 404 | wakeup event to fix this up via wakeup CPU selection. |
| 405 | |
| 406 | Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism |
| 407 | put in place to handle this shares the same name. Misfit task migration |
| 408 | leverages the CFS load balancer, more specifically the active load balance part |
| 409 | (which caters to migrating currently running tasks). When load balance happens, |
| 410 | a misfit active load balance will be triggered if a misfit task can be migrated |
| 411 | to a CPU with more capacity than its current one. |
| 412 | |
| 413 | 5.2 RT |
| 414 | ------ |
| 415 | |
| 416 | 5.2.1 Wakeup CPU selection |
| 417 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 418 | |
| 419 | RT task wakeup CPU selection searches for a CPU that satisfies:: |
| 420 | |
| 421 | task_uclamp_min(p) <= capacity(task_cpu(cpu)) |
| 422 | |
| 423 | while still following the usual priority constraints. If none of the candidate |
| 424 | CPUs can satisfy this capacity criterion, then strict priority based scheduling |
| 425 | is followed and CPU capacities are ignored. |
| 426 | |
| 427 | 5.3 DL |
| 428 | ------ |
| 429 | |
| 430 | 5.3.1 Wakeup CPU selection |
| 431 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 432 | |
| 433 | DL task wakeup CPU selection searches for a CPU that satisfies:: |
| 434 | |
| 435 | task_bandwidth(p) < capacity(task_cpu(p)) |
| 436 | |
| 437 | while still respecting the usual bandwidth and deadline constraints. If |
| 438 | none of the candidate CPUs can satisfy this capacity criterion, then the |
| 439 | task will remain on its current CPU. |