Namjae Jeon | 2792d87 | 2012-08-09 15:27:29 +0200 | [diff] [blame] | 1 | CFQ (Complete Fairness Queueing) |
| 2 | =============================== |
| 3 | |
| 4 | The main aim of CFQ scheduler is to provide a fair allocation of the disk |
| 5 | I/O bandwidth for all the processes which requests an I/O operation. |
| 6 | |
| 7 | CFQ maintains the per process queue for the processes which request I/O |
Namjae Jeon | fdc6fdc | 2013-04-09 14:57:06 +0200 | [diff] [blame^] | 8 | operation(synchronous requests). In case of asynchronous requests, all the |
Namjae Jeon | 2792d87 | 2012-08-09 15:27:29 +0200 | [diff] [blame] | 9 | requests from all the processes are batched together according to their |
| 10 | process's I/O priority. |
| 11 | |
Vivek Goyal | 6d6ac1c | 2010-08-23 12:25:29 +0200 | [diff] [blame] | 12 | CFQ ioscheduler tunables |
| 13 | ======================== |
| 14 | |
| 15 | slice_idle |
| 16 | ---------- |
| 17 | This specifies how long CFQ should idle for next request on certain cfq queues |
| 18 | (for sequential workloads) and service trees (for random workloads) before |
| 19 | queue is expired and CFQ selects next queue to dispatch from. |
| 20 | |
| 21 | By default slice_idle is a non-zero value. That means by default we idle on |
| 22 | queues/service trees. This can be very helpful on highly seeky media like |
| 23 | single spindle SATA/SAS disks where we can cut down on overall number of |
| 24 | seeks and see improved throughput. |
| 25 | |
| 26 | Setting slice_idle to 0 will remove all the idling on queues/service tree |
| 27 | level and one should see an overall improved throughput on faster storage |
| 28 | devices like multiple SATA/SAS disks in hardware RAID configuration. The down |
| 29 | side is that isolation provided from WRITES also goes down and notion of |
| 30 | IO priority becomes weaker. |
| 31 | |
| 32 | So depending on storage and workload, it might be useful to set slice_idle=0. |
| 33 | In general I think for SATA/SAS disks and software RAID of SATA/SAS disks |
| 34 | keeping slice_idle enabled should be useful. For any configurations where |
| 35 | there are multiple spindles behind single LUN (Host based hardware RAID |
| 36 | controller or for storage arrays), setting slice_idle=0 might end up in better |
| 37 | throughput and acceptable latencies. |
| 38 | |
Namjae Jeon | 2792d87 | 2012-08-09 15:27:29 +0200 | [diff] [blame] | 39 | back_seek_max |
| 40 | ------------- |
| 41 | This specifies, given in Kbytes, the maximum "distance" for backward seeking. |
| 42 | The distance is the amount of space from the current head location to the |
| 43 | sectors that are backward in terms of distance. |
| 44 | |
| 45 | This parameter allows the scheduler to anticipate requests in the "backward" |
| 46 | direction and consider them as being the "next" if they are within this |
| 47 | distance from the current head location. |
| 48 | |
| 49 | back_seek_penalty |
| 50 | ----------------- |
| 51 | This parameter is used to compute the cost of backward seeking. If the |
| 52 | backward distance of request is just 1/back_seek_penalty from a "front" |
| 53 | request, then the seeking cost of two requests is considered equivalent. |
| 54 | |
| 55 | So scheduler will not bias toward one or the other request (otherwise scheduler |
| 56 | will bias toward front request). Default value of back_seek_penalty is 2. |
| 57 | |
| 58 | fifo_expire_async |
| 59 | ----------------- |
| 60 | This parameter is used to set the timeout of asynchronous requests. Default |
| 61 | value of this is 248ms. |
| 62 | |
| 63 | fifo_expire_sync |
| 64 | ---------------- |
| 65 | This parameter is used to set the timeout of synchronous requests. Default |
| 66 | value of this is 124ms. In case to favor synchronous requests over asynchronous |
| 67 | one, this value should be decreased relative to fifo_expire_async. |
| 68 | |
Namjae Jeon | fdc6fdc | 2013-04-09 14:57:06 +0200 | [diff] [blame^] | 69 | group_idle |
| 70 | ----------- |
| 71 | This parameter forces idling at the CFQ group level instead of CFQ |
| 72 | queue level. This was introduced after after a bottleneck was observed |
| 73 | in higher end storage due to idle on sequential queue and allow dispatch |
| 74 | from a single queue. The idea with this parameter is that it can be run with |
| 75 | slice_idle=0 and group_idle=8, so that idling does not happen on individual |
| 76 | queues in the group but happens overall on the group and thus still keeps the |
| 77 | IO controller working. |
| 78 | Not idling on individual queues in the group will dispatch requests from |
| 79 | multiple queues in the group at the same time and achieve higher throughput |
| 80 | on higher end storage. |
| 81 | |
| 82 | Default value for this parameter is 8ms. |
| 83 | |
| 84 | latency |
| 85 | ------- |
| 86 | This parameter is used to enable/disable the latency mode of the CFQ |
| 87 | scheduler. If latency mode (called low_latency) is enabled, CFQ tries |
| 88 | to recompute the slice time for each process based on the target_latency set |
| 89 | for the system. This favors fairness over throughput. Disabling low |
| 90 | latency (setting it to 0) ignores target latency, allowing each process in the |
| 91 | system to get a full time slice. |
| 92 | |
| 93 | By default low latency mode is enabled. |
| 94 | |
| 95 | target_latency |
| 96 | -------------- |
| 97 | This parameter is used to calculate the time slice for a process if cfq's |
| 98 | latency mode is enabled. It will ensure that sync requests have an estimated |
| 99 | latency. But if sequential workload is higher(e.g. sequential read), |
| 100 | then to meet the latency constraints, throughput may decrease because of less |
| 101 | time for each process to issue I/O request before the cfq queue is switched. |
| 102 | |
| 103 | Though this can be overcome by disabling the latency_mode, it may increase |
| 104 | the read latency for some applications. This parameter allows for changing |
| 105 | target_latency through the sysfs interface which can provide the balanced |
| 106 | throughput and read latency. |
| 107 | |
| 108 | Default value for target_latency is 300ms. |
| 109 | |
Namjae Jeon | 2792d87 | 2012-08-09 15:27:29 +0200 | [diff] [blame] | 110 | slice_async |
| 111 | ----------- |
| 112 | This parameter is same as of slice_sync but for asynchronous queue. The |
| 113 | default value is 40ms. |
| 114 | |
| 115 | slice_async_rq |
| 116 | -------------- |
| 117 | This parameter is used to limit the dispatching of asynchronous request to |
| 118 | device request queue in queue's slice time. The maximum number of request that |
| 119 | are allowed to be dispatched also depends upon the io priority. Default value |
| 120 | for this is 2. |
| 121 | |
| 122 | slice_sync |
| 123 | ---------- |
| 124 | When a queue is selected for execution, the queues IO requests are only |
| 125 | executed for a certain amount of time(time_slice) before switching to another |
| 126 | queue. This parameter is used to calculate the time slice of synchronous |
| 127 | queue. |
| 128 | |
| 129 | time_slice is computed using the below equation:- |
| 130 | time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the |
| 131 | time_slice of synchronous queue, increase the value of slice_sync. Default |
| 132 | value is 100ms. |
| 133 | |
| 134 | quantum |
| 135 | ------- |
| 136 | This specifies the number of request dispatched to the device queue. In a |
| 137 | queue's time slice, a request will not be dispatched if the number of request |
| 138 | in the device exceeds this parameter. This parameter is used for synchronous |
| 139 | request. |
| 140 | |
| 141 | In case of storage with several disk, this setting can limit the parallel |
Namjae Jeon | fdc6fdc | 2013-04-09 14:57:06 +0200 | [diff] [blame^] | 142 | processing of request. Therefore, increasing the value can improve the |
| 143 | performance although this can cause the latency of some I/O to increase due |
Namjae Jeon | 2792d87 | 2012-08-09 15:27:29 +0200 | [diff] [blame] | 144 | to more number of requests. |
| 145 | |
Tejun Heo | d02f7aa | 2013-01-09 08:05:11 -0800 | [diff] [blame] | 146 | CFQ Group scheduling |
| 147 | ==================== |
| 148 | |
| 149 | CFQ supports blkio cgroup and has "blkio." prefixed files in each |
| 150 | blkio cgroup directory. It is weight-based and there are four knobs |
| 151 | for configuration - weight[_device] and leaf_weight[_device]. |
| 152 | Internal cgroup nodes (the ones with children) can also have tasks in |
| 153 | them, so the former two configure how much proportion the cgroup as a |
| 154 | whole is entitled to at its parent's level while the latter two |
| 155 | configure how much proportion the tasks in the cgroup have compared to |
| 156 | its direct children. |
| 157 | |
| 158 | Another way to think about it is assuming that each internal node has |
| 159 | an implicit leaf child node which hosts all the tasks whose weight is |
| 160 | configured by leaf_weight[_device]. Let's assume a blkio hierarchy |
| 161 | composed of five cgroups - root, A, B, AA and AB - with the following |
| 162 | weights where the names represent the hierarchy. |
| 163 | |
| 164 | weight leaf_weight |
| 165 | root : 125 125 |
| 166 | A : 500 750 |
| 167 | B : 250 500 |
| 168 | AA : 500 500 |
| 169 | AB : 1000 500 |
| 170 | |
| 171 | root never has a parent making its weight is meaningless. For backward |
| 172 | compatibility, weight is always kept in sync with leaf_weight. B, AA |
| 173 | and AB have no child and thus its tasks have no children cgroup to |
| 174 | compete with. They always get 100% of what the cgroup won at the |
| 175 | parent level. Considering only the weights which matter, the hierarchy |
| 176 | looks like the following. |
| 177 | |
| 178 | root |
| 179 | / | \ |
| 180 | A B leaf |
| 181 | 500 250 125 |
| 182 | / | \ |
| 183 | AA AB leaf |
| 184 | 500 1000 750 |
| 185 | |
| 186 | If all cgroups have active IOs and competing with each other, disk |
| 187 | time will be distributed like the following. |
| 188 | |
| 189 | Distribution below root. The total active weight at this level is |
| 190 | A:500 + B:250 + C:125 = 875. |
| 191 | |
| 192 | root-leaf : 125 / 875 =~ 14% |
| 193 | A : 500 / 875 =~ 57% |
| 194 | B(-leaf) : 250 / 875 =~ 28% |
| 195 | |
| 196 | A has children and further distributes its 57% among the children and |
| 197 | the implicit leaf node. The total active weight at this level is |
| 198 | AA:500 + AB:1000 + A-leaf:750 = 2250. |
| 199 | |
| 200 | A-leaf : ( 750 / 2250) * A =~ 19% |
| 201 | AA(-leaf) : ( 500 / 2250) * A =~ 12% |
| 202 | AB(-leaf) : (1000 / 2250) * A =~ 25% |
| 203 | |
Vivek Goyal | 6d6ac1c | 2010-08-23 12:25:29 +0200 | [diff] [blame] | 204 | CFQ IOPS Mode for group scheduling |
| 205 | =================================== |
| 206 | Basic CFQ design is to provide priority based time slices. Higher priority |
| 207 | process gets bigger time slice and lower priority process gets smaller time |
| 208 | slice. Measuring time becomes harder if storage is fast and supports NCQ and |
| 209 | it would be better to dispatch multiple requests from multiple cfq queues in |
| 210 | request queue at a time. In such scenario, it is not possible to measure time |
| 211 | consumed by single queue accurately. |
| 212 | |
| 213 | What is possible though is to measure number of requests dispatched from a |
| 214 | single queue and also allow dispatch from multiple cfq queue at the same time. |
| 215 | This effectively becomes the fairness in terms of IOPS (IO operations per |
| 216 | second). |
| 217 | |
| 218 | If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches |
| 219 | to IOPS mode and starts providing fairness in terms of number of requests |
| 220 | dispatched. Note that this mode switching takes effect only for group |
| 221 | scheduling. For non-cgroup users nothing should change. |
Vivek Goyal | 4931402 | 2011-08-05 09:42:20 +0200 | [diff] [blame] | 222 | |
| 223 | CFQ IO scheduler Idling Theory |
| 224 | =============================== |
| 225 | Idling on a queue is primarily about waiting for the next request to come |
| 226 | on same queue after completion of a request. In this process CFQ will not |
| 227 | dispatch requests from other cfq queues even if requests are pending there. |
| 228 | |
| 229 | The rationale behind idling is that it can cut down on number of seeks |
| 230 | on rotational media. For example, if a process is doing dependent |
| 231 | sequential reads (next read will come on only after completion of previous |
| 232 | one), then not dispatching request from other queue should help as we |
| 233 | did not move the disk head and kept on dispatching sequential IO from |
| 234 | one queue. |
| 235 | |
| 236 | CFQ has following service trees and various queues are put on these trees. |
| 237 | |
| 238 | sync-idle sync-noidle async |
| 239 | |
| 240 | All cfq queues doing synchronous sequential IO go on to sync-idle tree. |
| 241 | On this tree we idle on each queue individually. |
| 242 | |
| 243 | All synchronous non-sequential queues go on sync-noidle tree. Also any |
| 244 | request which are marked with REQ_NOIDLE go on this service tree. On this |
| 245 | tree we do not idle on individual queues instead idle on the whole group |
| 246 | of queues or the tree. So if there are 4 queues waiting for IO to dispatch |
| 247 | we will idle only once last queue has dispatched the IO and there is |
| 248 | no more IO on this service tree. |
| 249 | |
| 250 | All async writes go on async service tree. There is no idling on async |
| 251 | queues. |
| 252 | |
| 253 | CFQ has some optimizations for SSDs and if it detects a non-rotational |
| 254 | media which can support higher queue depth (multiple requests at in |
| 255 | flight at a time), then it cuts down on idling of individual queues and |
| 256 | all the queues move to sync-noidle tree and only tree idle remains. This |
| 257 | tree idling provides isolation with buffered write queues on async tree. |
| 258 | |
| 259 | FAQ |
| 260 | === |
| 261 | Q1. Why to idle at all on queues marked with REQ_NOIDLE. |
| 262 | |
| 263 | A1. We only do tree idle (all queues on sync-noidle tree) on queues marked |
| 264 | with REQ_NOIDLE. This helps in providing isolation with all the sync-idle |
| 265 | queues. Otherwise in presence of many sequential readers, other |
| 266 | synchronous IO might not get fair share of disk. |
| 267 | |
| 268 | For example, if there are 10 sequential readers doing IO and they get |
| 269 | 100ms each. If a REQ_NOIDLE request comes in, it will be scheduled |
| 270 | roughly after 1 second. If after completion of REQ_NOIDLE request we |
| 271 | do not idle, and after a couple of milli seconds a another REQ_NOIDLE |
| 272 | request comes in, again it will be scheduled after 1second. Repeat it |
| 273 | and notice how a workload can lose its disk share and suffer due to |
| 274 | multiple sequential readers. |
| 275 | |
| 276 | fsync can generate dependent IO where bunch of data is written in the |
| 277 | context of fsync, and later some journaling data is written. Journaling |
| 278 | data comes in only after fsync has finished its IO (atleast for ext4 |
| 279 | that seemed to be the case). Now if one decides not to idle on fsync |
| 280 | thread due to REQ_NOIDLE, then next journaling write will not get |
| 281 | scheduled for another second. A process doing small fsync, will suffer |
| 282 | badly in presence of multiple sequential readers. |
| 283 | |
| 284 | Hence doing tree idling on threads using REQ_NOIDLE flag on requests |
| 285 | provides isolation from multiple sequential readers and at the same |
| 286 | time we do not idle on individual threads. |
| 287 | |
| 288 | Q2. When to specify REQ_NOIDLE |
| 289 | A2. I would think whenever one is doing synchronous write and not expecting |
| 290 | more writes to be dispatched from same context soon, should be able |
| 291 | to specify REQ_NOIDLE on writes and that probably should work well for |
| 292 | most of the cases. |