Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 1 | Real-Time group scheduling |
| 2 | -------------------------- |
| 3 | |
| 4 | CONTENTS |
| 5 | ======== |
| 6 | |
Peter Zijlstra | 60aa605 | 2009-05-05 17:50:21 +0200 | [diff] [blame^] | 7 | 0. WARNING |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 8 | 1. Overview |
| 9 | 1.1 The problem |
| 10 | 1.2 The solution |
| 11 | 2. The interface |
| 12 | 2.1 System-wide settings |
| 13 | 2.2 Default behaviour |
| 14 | 2.3 Basis for grouping tasks |
| 15 | 3. Future plans |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 16 | |
| 17 | |
Peter Zijlstra | 60aa605 | 2009-05-05 17:50:21 +0200 | [diff] [blame^] | 18 | 0. WARNING |
| 19 | ========== |
| 20 | |
| 21 | Fiddling with these settings can result in an unstable system, the knobs are |
| 22 | root only and assumes root knows what he is doing. |
| 23 | |
| 24 | Most notable: |
| 25 | |
| 26 | * very small values in sched_rt_period_us can result in an unstable |
| 27 | system when the period is smaller than either the available hrtimer |
| 28 | resolution, or the time it takes to handle the budget refresh itself. |
| 29 | |
| 30 | * very small values in sched_rt_runtime_us can result in an unstable |
| 31 | system when the runtime is so small the system has difficulty making |
| 32 | forward progress (NOTE: the migration thread and kstopmachine both |
| 33 | are real-time processes). |
| 34 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 35 | 1. Overview |
| 36 | =========== |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 37 | |
| 38 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 39 | 1.1 The problem |
| 40 | --------------- |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 41 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 42 | Realtime scheduling is all about determinism, a group has to be able to rely on |
| 43 | the amount of bandwidth (eg. CPU time) being constant. In order to schedule |
| 44 | multiple groups of realtime tasks, each group must be assigned a fixed portion |
| 45 | of the CPU time available. Without a minimum guarantee a realtime group can |
| 46 | obviously fall short. A fuzzy upper limit is of no use since it cannot be |
| 47 | relied upon. Which leaves us with just the single fixed portion. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 48 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 49 | 1.2 The solution |
| 50 | ---------------- |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 51 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 52 | CPU time is divided by means of specifying how much time can be spent running |
| 53 | in a given period. We allocate this "run time" for each realtime group which |
| 54 | the other realtime groups will not be permitted to use. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 55 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 56 | Any time not allocated to a realtime group will be used to run normal priority |
| 57 | tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by |
| 58 | SCHED_OTHER. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 59 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 60 | Let's consider an example: a frame fixed realtime renderer must deliver 25 |
| 61 | frames a second, which yields a period of 0.04s per frame. Now say it will also |
| 62 | have to play some music and respond to input, leaving it with around 80% CPU |
| 63 | time dedicated for the graphics. We can then give this group a run time of 0.8 |
| 64 | * 0.04s = 0.032s. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 65 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 66 | This way the graphics group will have a 0.04s period with a 0.032s run time |
| 67 | limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but |
| 68 | needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s = |
| 69 | 0.00015s. So this group can be scheduled with a period of 0.005s and a run time |
| 70 | of 0.00015s. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 71 | |
Hiroshi Shimamoto | f7d62364 | 2008-06-10 20:29:19 -0700 | [diff] [blame] | 72 | The remaining CPU time will be used for user input and other tasks. Because |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 73 | realtime tasks have explicitly allocated the CPU time they need to perform |
Hiroshi Shimamoto | f7d62364 | 2008-06-10 20:29:19 -0700 | [diff] [blame] | 74 | their tasks, buffer underruns in the graphics or audio can be eliminated. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 75 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 76 | NOTE: the above example is not fully implemented as of yet (2.6.25). We still |
| 77 | lack an EDF scheduler to make non-uniform periods usable. |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 78 | |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 79 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 80 | 2. The Interface |
| 81 | ================ |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 82 | |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 83 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 84 | 2.1 System wide settings |
| 85 | ------------------------ |
| 86 | |
| 87 | The system wide settings are configured under the /proc virtual file system: |
| 88 | |
| 89 | /proc/sys/kernel/sched_rt_period_us: |
| 90 | The scheduling period that is equivalent to 100% CPU bandwidth |
| 91 | |
| 92 | /proc/sys/kernel/sched_rt_runtime_us: |
| 93 | A global limit on how much time realtime scheduling may use. Even without |
| 94 | CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime |
| 95 | processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth |
| 96 | available to all realtime groups. |
| 97 | |
| 98 | * Time is specified in us because the interface is s32. This gives an |
| 99 | operating range from 1us to about 35 minutes. |
| 100 | * sched_rt_period_us takes values from 1 to INT_MAX. |
| 101 | * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1). |
| 102 | * A run time of -1 specifies runtime == period, ie. no limit. |
| 103 | |
| 104 | |
| 105 | 2.2 Default behaviour |
| 106 | --------------------- |
| 107 | |
| 108 | The default values for sched_rt_period_us (1000000 or 1s) and |
| 109 | sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by |
| 110 | SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away |
| 111 | realtime tasks will not lock up the machine but leave a little time to recover |
| 112 | it. By setting runtime to -1 you'd get the old behaviour back. |
| 113 | |
| 114 | By default all bandwidth is assigned to the root group and new groups get the |
| 115 | period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you |
| 116 | want to assign bandwidth to another group, reduce the root group's bandwidth |
| 117 | and assign some or all of the difference to another group. |
| 118 | |
| 119 | Realtime group scheduling means you have to assign a portion of total CPU |
| 120 | bandwidth to the group before it will accept realtime tasks. Therefore you will |
| 121 | not be able to run realtime tasks as any user other than root until you have |
| 122 | done that, even if the user has the rights to run processes with realtime |
| 123 | priority! |
| 124 | |
| 125 | |
| 126 | 2.3 Basis for grouping tasks |
| 127 | ---------------------------- |
| 128 | |
| 129 | There are two compile-time settings for allocating CPU bandwidth. These are |
| 130 | configured using the "Basis for grouping tasks" multiple choice menu under |
| 131 | General setup > Group CPU Scheduler: |
| 132 | |
| 133 | a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id") |
| 134 | |
| 135 | This lets you use the virtual files under |
| 136 | "/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for |
| 137 | each user . |
| 138 | |
| 139 | The other option is: |
| 140 | |
| 141 | .o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups") |
| 142 | |
| 143 | This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us" |
| 144 | to control the CPU time reserved for each control group instead. |
| 145 | |
| 146 | For more information on working with control groups, you should read |
Thadeu Lima de Souza Cascardo | 21acb9c | 2009-02-04 10:12:08 +0100 | [diff] [blame] | 147 | Documentation/cgroups/cgroups.txt as well. |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 148 | |
| 149 | Group settings are checked against the following limits in order to keep the configuration |
| 150 | schedulable: |
Peter Zijlstra | 9f0c1e5 | 2008-02-13 15:45:39 +0100 | [diff] [blame] | 151 | |
| 152 | \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period |
| 153 | |
Viktor Radnai | b9b158f | 2008-04-19 19:45:01 +0200 | [diff] [blame] | 154 | For now, this can be simplified to just the following (but see Future plans): |
| 155 | |
| 156 | \Sum_{i} runtime_{i} <= global_runtime |
| 157 | |
| 158 | |
| 159 | 3. Future plans |
| 160 | =============== |
| 161 | |
| 162 | There is work in progress to make the scheduling period for each group |
| 163 | ("/sys/kernel/uids/<uid>/cpu_rt_period_us" or |
| 164 | "/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well. |
| 165 | |
| 166 | The constraint on the period is that a subgroup must have a smaller or |
| 167 | equal period to its parent. But realistically its not very useful _yet_ |
| 168 | as its prone to starvation without deadline scheduling. |
| 169 | |
| 170 | Consider two sibling groups A and B; both have 50% bandwidth, but A's |
| 171 | period is twice the length of B's. |
| 172 | |
| 173 | * group A: period=100000us, runtime=10000us |
| 174 | - this runs for 0.01s once every 0.1s |
| 175 | |
| 176 | * group B: period= 50000us, runtime=10000us |
| 177 | - this runs for 0.01s twice every 0.1s (or once every 0.05 sec). |
| 178 | |
| 179 | This means that currently a while (1) loop in A will run for the full period of |
| 180 | B and can starve B's tasks (assuming they are of lower priority) for a whole |
| 181 | period. |
| 182 | |
| 183 | The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring |
| 184 | full deadline scheduling to the linux kernel. Deadline scheduling the above |
| 185 | groups and treating end of the period as a deadline will ensure that they both |
| 186 | get their allocated time. |
| 187 | |
| 188 | Implementing SCHED_EDF might take a while to complete. Priority Inheritance is |
| 189 | the biggest challenge as the current linux PI infrastructure is geared towards |
| 190 | the limited static priority levels 0-139. With deadline scheduling you need to |
| 191 | do deadline inheritance (since priority is inversely proportional to the |
| 192 | deadline delta (deadline - now). |
| 193 | |
| 194 | This means the whole PI machinery will have to be reworked - and that is one of |
| 195 | the most complex pieces of code we have. |