Central, scheduler-driven, power-performance control
(EXPERIMENTAL)

Abstract
========

The topic of a single, simple power-performance tunable that is wholly
scheduler-centric and has well-defined, predictable properties has come up
on several occasions in the past [1,2]. With techniques such as
scheduler-driven DVFS [3], we now have a good framework for implementing such
a tunable. This document describes the overall ideas behind its design and
implementation.


Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
7. References


1. Motivation
=============

Sched-DVFS [3] is a new event-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU. The introduction of sched-DVFS enables running workloads
at the most energy efficient OPPs.

However, sometimes it may be desired to intentionally boost the performance of
a workload even if that could imply a reasonable increase in energy
consumption. For example, in order to reduce the response time of a task, we
may want to run the task at a higher OPP than the one that is actually required
by its CPU bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the sched-DVFS component is to replace all currently available
CPUFreq policies. Since sched-DVFS is event-based, as opposed to the sampling
driven governors we currently have, it is already more responsive at selecting
the optimal OPP to run tasks allocated to a CPU. However, just tracking the
actual task load demand may not be enough from a performance standpoint. For
example, it is not possible to get behaviors similar to those provided by the
"performance" and "interactive" CPUFreq governors.

This document describes an implementation of a tunable, stacked on top of
sched-DVFS, which extends its functionality to support task performance
boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same workload
for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
that task must complete each of its activations in less than 5[s].

A previous attempt [5] to introduce such a boosting feature has not been
successful, mainly because of the complexity of the proposed solution. The
approach described in this document exposes a single simple interface to
user-space. This single tunable knob allows the tuning of system wide
scheduler behaviours, ranging from energy efficiency at one end through to
incremental performance boosting at the other end. This first tunable affects
all tasks. However, a more advanced extension of the concept is also provided
which uses CGroups to boost the performance of only selected tasks while using
the energy efficient default for all others.

The rest of this document introduces in more detail the proposed solution,
which has been named SchedTune.


2. Introduction
===============

SchedTune exposes a simple user-space interface with a single power-performance
tunable:

  /proc/sys/kernel/sched_cfs_boost

This permits expressing a boost value as an integer in the range [0..100].

A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
required to satisfy their workload demand.
A value of 100 configures the scheduler for maximum performance, which
translates to the selection of the maximum OPP on that CPU.

Values in between can be used to suit other scenarios, for example to satisfy
interactive response requirements or to adapt to other system events (battery
level, etc.).
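
For example, a moderate system-wide boost can be requested by writing a value
into the tunable (the value below is purely illustrative):

  root@linaro-nano:~# echo 25 > /proc/sys/kernel/sched_cfs_boost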

A CGroup based extension is also provided, which permits further user-space
defined task classification to tune the scheduler for different goals depending
on the specific nature of the task, e.g. background vs interactive vs
low-priority.

The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
Performance Point (OPP) selection.
Each time a task is allocated on a CPU, sched-DVFS has the opportunity to tune
the operating frequency of that CPU to better match the workload demand. The
selection of the actual OPP being activated is influenced by the global boost
value, or the boost value for the task's CGroup when in use.

This simple biasing approach leverages existing frameworks, which means minimal
modifications to the scheduler, and yet it makes it possible to achieve a range
of different behaviours, all from a single simple tunable knob.
The only new concept introduced is that of signal boosting.


3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking
signals which basically track the CPU bandwidth requirements for tasks and the
capacity of CPUs. The basic idea behind the SchedTune knob is to artificially
inflate some of these load tracking signals to make a task or RQ appear more
demanding than it actually is.

Which signals have to be inflated depends on the specific "consumer". However,
independently of the specific (signal, consumer) pair, it is important to
define a simple and possibly consistent strategy for the concept of boosting a
signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be added
to a signal to get its inflated value:

  margin         := boosting_strategy(sched_cfs_boost, signal)
  boosted_signal := signal + margin

Different boosting strategies were identified and analyzed before selecting the
one found to be most effective.

Signal Proportional Compensation (SPC)
--------------------------------------

In this boosting strategy the sched_cfs_boost value is used to compute a
margin which is proportional to the complement of the original signal.
When a signal has a maximum possible value, its complement is defined as
the delta between the actual value and its possible maximum.

Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
the maximum possible value, the margin becomes:

  margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)

where sched_cfs_boost is interpreted as a percentage, i.e. the product is
scaled down by 100.

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost effectively inflates the signal
  in question by a quantity which is proportional to its headroom, i.e. the
  distance between its current value and the maximum
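
The following minimal C sketch illustrates the SPC computation; the function
and constant names here are illustrative, not the exact kernel code. For
example, with SCHED_LOAD_SCALE = 1024, a signal of 300 and a 50% boost, the
margin is (1024 - 300) * 50 / 100 = 362, giving a boosted signal of 662.

  /*
   * Minimal sketch of the SPC boosting strategy described above.
   * Names are illustrative; the in-kernel implementation differs
   * in details (e.g. how the division by 100 is performed).
   */
  #define SCHED_LOAD_SCALE 1024UL

  static unsigned long spc_margin(unsigned long signal,
                                  unsigned int boost_pct)
  {
          unsigned long margin;

          if (signal >= SCHED_LOAD_SCALE)
                  return 0;

          /* boost_pct% of the signal's headroom up to the maximum */
          margin  = SCHED_LOAD_SCALE - signal;
          margin *= boost_pct;
          margin /= 100;

          return margin;
  }

  static unsigned long spc_boost(unsigned long signal,
                                 unsigned int boost_pct)
  {
          return signal + spc_margin(signal, boost_pct);
  }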

For example, by applying the SPC boosting strategy to the selection of the OPP
to run a task it is possible to achieve these behaviors:

-   0% boosting: run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
-  50% boosting: run at the half-way OPP between minimum and maximum

This means that, at 50% boosting, a task will be scheduled to run at an OPP
midway between the one its workload demands and the maximum theoretically
achievable performance on the specific target platform.

A graphical representation of an SPC boosted signal is shown in the following
figure, where:
  a) "-" represents the original signal
  b) "b" represents a  50% boosted signal
  c) "p" represents a 100% boosted signal


   ^
   |  SCHED_LOAD_SCALE
   +-----------------------------------------------------------------+
   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
   |
   |                                             boosted_signal
   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
   |
   |                                            original signal
   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
   |                                          |
   |bbbbbbbbbbbbbbbbbb                        |
   |                                          |
   |                                          |
   |                                          |
   |                  +-----------------------+
   |                  |
   |                  |
   |                  |
   |------------------+
   |
   |
   +----------------------------------------------------------------------->

The plot above shows a ramped load signal (titled 'original signal') and its
boosted equivalent. For each step of the original signal the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.


4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new load
signals. Instead, it provides an API to tune existing signals. This tuning is
done on demand and only in scheduler code paths where it is sensible to do so.
The new API calls are defined to return either the default signal or a boosted
one, depending on the value of sched_cfs_boost. This is a clean and
non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
(i.e. a CFS run-queue) to appear more utilized than it actually is.

Thus, with sched_cfs_boost enabled we have the following main functions to
get the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to the first but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where sched-DVFS needs to
decide the OPP to run a CPU at. For example, this allows selecting the highest
OPP for a CPU which has the boost value set to 100%.
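
As a rough sketch, boosted_cpu_util() can be thought of as a thin wrapper
around cpu_util() which applies the SPC margin from the previous section. The
get_boost() accessor below is hypothetical and stands in for however the
current sched_cfs_boost value is retrieved:

  /*
   * Illustrative only: boost the CPU utilization signal seen by
   * sched-DVFS, reusing spc_margin() from the SPC sketch above.
   */
  static unsigned long boosted_cpu_util(int cpu)
  {
          unsigned long util  = cpu_util(cpu); /* current utilization */
          unsigned int  boost = get_boost();   /* hypothetical accessor */

          return util + spc_margin(util, boost);
  }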


5. Per task group boosting
==========================

The availability of a single knob which is used to boost all tasks in the
system is certainly a simple solution, but it quite likely doesn't fit many
utilization scenarios, especially in the mobile device space.

For example, on battery powered devices there usually are many background
services which are long running and need energy efficient scheduling. On the
other hand, some applications are more performance sensitive and require an
interactive response and/or maximum performance, regardless of the energy cost.
To better service such scenarios, the SchedTune implementation has an extension
that provides a more fine grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled, which allows task
groups with different boosting values to be defined and configured.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in such a group can be
specified using a single knob exposed by the CGroup controller:

  schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has these
main characteristics:

1) It is only possible to create hierarchies that are one level deep.

   The root control group defines the system-wide boost value to be applied
   by default to all tasks. Its direct subgroups are named "boost groups" and
   they define the boost value for a specific set of tasks.
   Further nested subgroups are not allowed since they do not have a sensible
   meaning from a user-space standpoint.

2) It is possible to define only a limited number of "boost groups".

   This number is defined at compile time and by default configured to 16.
   This is a design decision motivated by two main reasons:
   a) In a real system we do not expect utilization scenarios with more than a
      few boost groups. For example, a reasonable collection of groups could
      be just "background", "interactive" and "performance".
   b) It simplifies the implementation considerably, especially for the code
      which has to compute the per CPU boosting once there are multiple
      RUNNABLE tasks with different boost values.

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times (e.g.
Android, ChromeOS) to implement a QoS solution for task boosting based on task
classification, which has been a long standing requirement.

Setup and usage
---------------

0. Use a kernel with CGROUP_SCHEDTUNE support enabled.

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name	hierarchy	num_cgroups	enabled
   cpuset   	0		1		1
   cpu      	0		1		1
   schedtune	0		1		1

2. Mount a tmpfs to create the CGroups mount point (Optional):

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Set up the system-wide boost value (Optional):

   If not configured, the root control group has a 0% boost value, which
   basically disables boosting for all tasks in the system, thus running them
   in an energy-efficient mode.

   root@linaro-nano:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost

5. Create task groups and configure their specific boost value (Optional):

   For example, here we create a "performance" boost group configured to boost
   all its tasks to 100%:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

6. Move tasks into the boost group:

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group:

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP on the most capable CPU of the system.
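
The move can be verified by reading back the group's cgroup.procs file, which
lists the PIDs of the member tasks:

   root@linaro-nano:~# cat /sys/fs/cgroup/stune/performance/cgroup.procs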


6. Questions and Answers
========================

What about "auto" mode?
-----------------------

The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
with some suitable user-space element. This element could use the exposed
system-wide or cgroup based interface.

How are multiple groups of tasks with different boost values managed?
----------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
on a CPU. Once sched-DVFS selects the OPP to run a CPU at, the CPU utilization
is boosted with a value which is the maximum of the boost values of the
currently RUNNABLE tasks in its RQ.

This allows sched-DVFS to boost a CPU only while there are boosted tasks ready
to run and switch back to the energy efficient mode as soon as the last boosted
task is dequeued.
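
A minimal sketch of this per-CPU aggregation follows; the data layout is
illustrative rather than the actual kernel structures, but it captures the
"maximum boost among RUNNABLE tasks" rule described above:

  /*
   * Illustrative per-CPU accounting: one slot per boost group, each
   * tracking how many of its tasks are RUNNABLE on this CPU. The
   * effective CPU boost is the maximum boost among non-empty slots.
   */
  #define BOOSTGROUPS_COUNT 16     /* compile-time limit, see section 5 */

  struct boost_group {
          int boost;               /* boost value of this group */
          unsigned int tasks;      /* its RUNNABLE tasks on this CPU */
  };

  static int cpu_boost(struct boost_group bg[BOOSTGROUPS_COUNT])
  {
          int i, max_boost = 0;

          for (i = 0; i < BOOSTGROUPS_COUNT; i++) {
                  if (bg[i].tasks > 0 && bg[i].boost > max_boost)
                          max_boost = bg[i].boost;
          }

          return max_boost;
  }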


7. References
=============
[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] http://lkml.org/lkml/2015/6/26/620