==========================
BFQ (Budget Fair Queueing)
==========================
| 4 | |
| 5 | BFQ is a proportional-share I/O scheduler, with some extra |
| 6 | low-latency capabilities. In addition to cgroups support (blkio or io |
| 7 | controllers), BFQ's main features are: |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 8 | |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 9 | - BFQ guarantees a high system and application responsiveness, and a |
| 10 | low latency for time-sensitive applications, such as audio or video |
| 11 | players; |
| 12 | - BFQ distributes bandwidth, and not just time, among processes or |
| 13 | groups (switching back to time distribution when needed to keep |
| 14 | throughput high). |
| 15 | |
In its default configuration, BFQ favors latency over throughput.
When a lower latency requires it, BFQ builds schedules that may lead
to a lower throughput. If your main or only goal, for a given device,
is to achieve the maximum-possible throughput at all times, then
switch off all low-latency heuristics for that device by setting
low_latency to 0. See Section 3 for details on how to configure BFQ
for the desired tradeoff between latency and throughput, or on how to
maximize throughput.
Paolo Valente | 43c1b3d | 2017-05-09 12:54:23 +0200 | [diff] [blame] | 24 | |
Like every I/O scheduler, BFQ adds some overhead to per-I/O-request
processing. To give an idea of this overhead, the total,
single-lock-protected, per-request processing time of BFQ (i.e., the
sum of the execution times of the request insertion, dispatch and
completion hooks) is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(a dated notebook CPU; time measured with simple code
instrumentation, using the throughput-sync.sh script of the S
suite [1] in performance-profiling mode). To put this result into
context, the total, single-lock-protected, per-request execution time
of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7
us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).
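
As a rough illustration of what these per-request times imply, the
sketch below converts them into an upper bound on IOPS. It assumes that
all requests are serialized through the single scheduler lock and that
nothing else limits the rate, so the real sustainable figure is lower.
The function name is hypothetical, and the bounds apply only to the CPU
on which the times were measured.

```python
def max_iops_bound(per_request_us: float) -> float:
    """Upper bound on requests per second when every request spends
    per_request_us microseconds inside one serializing lock."""
    return 1_000_000 / per_request_us


# Per-request times quoted above for the Intel Core i7-2760QM.
bfq_bound = max_iops_bound(1.9)
mq_deadline_bound = max_iops_bound(0.7)

print(f"BFQ:         about {bfq_bound / 1000:.0f} KIOPS at most")
print(f"mq-deadline: about {mq_deadline_bound / 1000:.0f} KIOPS at most")
```

Under these assumptions, the scheduler alone caps BFQ at roughly 526
KIOPS and mq-deadline at roughly 1429 KIOPS on that CPU.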
| 36 | |
Scheduling overhead further limits the maximum IOPS that a CPU can
process (already limited by the execution of the rest of the I/O
stack). To give an idea of these limits with BFQ on slow or average
CPUs, here are the limits for three different CPUs, found,
respectively, in an average laptop, an old desktop, and a cheap
embedded system, when full hierarchical support is enabled (i.e.,
CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not
set (Section 4-2):

- Intel i7-4850HQ: 400 KIOPS
- AMD A8-3850: 250 KIOPS
- ARM Cortex-A53 Octa-core: 80 KIOPS
| 48 | |
If CONFIG_BFQ_CGROUP_DEBUG is set (and, of course, full hierarchical
support is enabled), then the sustainable throughput with BFQ
decreases, because all blkio.bfq* statistics are created and updated
(Section 4-2). This leads to the following maximum sustainable
throughputs on the same systems as above:

- Intel i7-4850HQ: 310 KIOPS
- AMD A8-3850: 200 KIOPS
- ARM Cortex-A53 Octa-core: 56 KIOPS
Paolo Valente | 68017e5 | 2017-11-13 07:34:07 +0100 | [diff] [blame] | 57 | |
| 58 | BFQ works for multi-queue devices too. |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 59 | |
.. The table of contents follows. Impatient readers can jump straight to Section 3.
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 61 | |
.. CONTENTS

   1. When may BFQ be useful?
    1-1 Personal systems
    1-2 Server systems
   2. How does BFQ work?
   3. What are BFQ's tunables and how to properly configure BFQ?
   4. Group scheduling with BFQ
    4-1 Service guarantees provided
    4-2 Interface
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 72 | |
| 73 | 1. When may BFQ be useful? |
| 74 | ========================== |
| 75 | |
| 76 | BFQ provides the following benefits on personal and server systems. |
| 77 | |
| 78 | 1-1 Personal systems |
| 79 | -------------------- |
| 80 | |
| 81 | Low latency for interactive applications |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 82 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 83 | |
Regardless of the actual background workload, BFQ guarantees that, for
interactive tasks, the storage device is virtually as responsive as if
it were idle. For example, even if one or more of the following
background workloads are being executed:

- one or more large files are being read, written or copied,
- a tree of source files is being compiled,
- one or more virtual machines are performing I/O,
- a software update is in progress,
- indexing daemons are scanning filesystems and updating their
  databases,

starting an application or loading a file from within an application
takes about the same time as if the storage device were idle. By
comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
applications experience high latencies, or even become unresponsive
until the background workload terminates (also on SSDs).
| 101 | |
Low latency for soft real-time applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Soft real-time applications, such as audio and video
players/streamers, also enjoy a low latency and a low drop rate,
regardless of the background I/O workload. As a consequence, these
applications suffer almost no glitches due to the background workload.
| 108 | |
| 109 | Higher speed for code-development tasks |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 110 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 111 | |
| 112 | If some additional workload happens to be executed in parallel, then |
| 113 | BFQ executes the I/O-related components of typical code-development |
| 114 | tasks (compilation, checkout, merge, ...) much more quickly than CFQ, |
| 115 | NOOP or DEADLINE. |
| 116 | |
| 117 | High throughput |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 118 | ^^^^^^^^^^^^^^^ |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 119 | |
| 120 | On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and |
| 121 | up to 150% higher throughput than DEADLINE and NOOP, with all the |
| 122 | sequential workloads considered in our tests. With random workloads, |
| 123 | and with all the workloads on flash-based devices, BFQ achieves, |
| 124 | instead, about the same throughput as the other schedulers. |
| 125 | |
| 126 | Strong fairness, bandwidth and delay guarantees |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 127 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 128 | |
BFQ distributes the device throughput, and not just the device time,
among I/O-bound applications in proportion to their weights, with any
workload and regardless of the device parameters. From these bandwidth
guarantees, it is possible to compute tight per-I/O-request delay
guarantees by a simple formula. If not configured for strict service
guarantees, BFQ switches to time-based resource sharing (only) for
applications that would otherwise cause a throughput loss.
| 136 | |
| 137 | 1-2 Server systems |
| 138 | ------------------ |
| 139 | |
| 140 | Most benefits for server systems follow from the same service |
| 141 | properties as above. In particular, regardless of whether additional, |
| 142 | possibly heavy workloads are being served, BFQ guarantees: |
| 143 | |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 144 | * audio and video-streaming with zero or very low jitter and drop |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 145 | rate; |
| 146 | |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 147 | * fast retrieval of WEB pages and embedded objects; |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 148 | |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 149 | * real-time recording of data in live-dumping applications (e.g., |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 150 | packet logging); |
| 151 | |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 152 | * responsiveness in local and remote access to a server. |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 153 | |
| 154 | |
| 155 | 2. How does BFQ work? |
| 156 | ===================== |
| 157 | |
| 158 | BFQ is a proportional-share I/O scheduler, whose general structure, |
| 159 | plus a lot of code, are borrowed from CFQ. |
| 160 | |
| 161 | - Each process doing I/O on a device is associated with a weight and a |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 162 | `(bfq_)queue`. |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 163 | |
| 164 | - BFQ grants exclusive access to the device, for a while, to one queue |
| 165 | (process) at a time, and implements this service model by |
| 166 | associating every queue with a budget, measured in number of |
| 167 | sectors. |
| 168 | |
| 169 | - After a queue is granted access to the device, the budget of the |
| 170 | queue is decremented, on each request dispatch, by the size of the |
| 171 | request. |
| 172 | |
| 173 | - The in-service queue is expired, i.e., its service is suspended, |
| 174 | only if one of the following events occurs: 1) the queue finishes |
| 175 | its budget, 2) the queue empties, 3) a "budget timeout" fires. |
| 176 | |
| 177 | - The budget timeout prevents processes doing random I/O from |
| 178 | holding the device for too long and dramatically reducing |
| 179 | throughput. |
| 180 | |
- Actually, as in CFQ, a queue associated with a process issuing
  sync requests may not be expired immediately when it empties.
  Instead, BFQ may idle the device for a short time interval,
  giving the process the chance to go on being served if it issues
  a new request in time. Device idling typically boosts the
  throughput on rotational devices and on non-queueing flash-based
  devices, if processes do synchronous and sequential I/O. In
  addition, under BFQ, device idling is also instrumental in
  guaranteeing the desired throughput fraction to processes
  issuing sync requests (see the description of the slice_idle
  tunable in this document, or [1, 2], for more details).
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 192 | |
| 193 | - With respect to idling for service guarantees, if several |
| 194 | processes are competing for the device at the same time, but |
Paolo Valente | 233f0bf | 2017-08-31 20:00:30 +0200 | [diff] [blame] | 195 | all processes and groups have the same weight, then BFQ |
| 196 | guarantees the expected throughput distribution without ever |
| 197 | idling the device. Throughput is thus as high as possible in |
| 198 | this common scenario. |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 199 | |
Paolo Valente | 2670cd1 | 2017-08-31 20:00:31 +0200 | [diff] [blame] | 200 | - On flash-based storage with internal queueing of commands |
| 201 | (typically NCQ), device idling happens to be always detrimental |
| 202 | for throughput. So, with these devices, BFQ performs idling |
| 203 | only when strictly needed for service guarantees, i.e., for |
| 204 | guaranteeing low latency or fairness. In these cases, overall |
| 205 | throughput may be sub-optimal. No solution currently exists to |
| 206 | provide both strong service guarantees and optimal throughput |
| 207 | on devices with internal queueing. |
| 208 | |
- If low-latency mode is enabled (default configuration), BFQ
  executes some special heuristics to detect interactive and soft
  real-time applications (e.g., video or audio players/streamers),
  and to reduce their latency. The most important action taken to
  achieve this goal is to give the queues associated with these
  applications more than their fair share of the device
  throughput. For brevity, we simply call "weight-raising" the whole
  set of actions taken by BFQ to privilege these queues. In
  particular, BFQ provides a milder form of weight-raising for
  interactive applications, and a stronger form for soft real-time
  applications.
| 220 | |
| 221 | - BFQ automatically deactivates idling for queues born in a burst of |
| 222 | queue creations. In fact, these queues are usually associated with |
| 223 | the processes of applications and services that benefit mostly |
| 224 | from a high throughput. Examples are systemd during boot, or git |
| 225 | grep. |
| 226 | |
- Like CFQ, BFQ merges queues performing interleaved I/O, i.e.,
  performing random I/O that becomes mostly sequential if
  merged. Unlike CFQ, BFQ achieves this goal with a more
  reactive mechanism, called Early Queue Merge (EQM). EQM is so
  responsive in detecting interleaved I/O (cooperating processes)
  that it enables BFQ to achieve a high throughput, by queue
  merging, even for queues for which CFQ needs a different
  mechanism, preemption, to get a high throughput. As such, EQM is a
  unified mechanism to achieve a high throughput with interleaved
  I/O.
| 237 | |
- Queues are scheduled according to a variant of WF2Q+, named
  B-WF2Q+, and implemented using an augmented rb-tree to preserve an
  O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
  also ready for hierarchical scheduling; see Section 4 for details.
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 242 | |
| 243 | - B-WF2Q+ guarantees a tight deviation with respect to an ideal, |
| 244 | perfectly fair, and smooth service. In particular, B-WF2Q+ |
| 245 | guarantees that each queue receives a fraction of the device |
| 246 | throughput proportional to its weight, even if the throughput |
| 247 | fluctuates, and regardless of: the device parameters, the current |
| 248 | workload and the budgets assigned to the queue. |
| 249 | |
- The last property, budget independence (although probably
  counterintuitive at first), is definitely beneficial, for
  the following reasons:
| 253 | |
| 254 | - First, with any proportional-share scheduler, the maximum |
| 255 | deviation with respect to an ideal service is proportional to |
| 256 | the maximum budget (slice) assigned to queues. As a consequence, |
| 257 | BFQ can keep this deviation tight not only because of the |
| 258 | accurate service of B-WF2Q+, but also because BFQ *does not* |
| 259 | need to assign a larger budget to a queue to let the queue |
| 260 | receive a higher fraction of the device throughput. |
| 261 | |
| 262 | - Second, BFQ is free to choose, for every process (queue), the |
| 263 | budget that best fits the needs of the process, or best |
| 264 | leverages the I/O pattern of the process. In particular, BFQ |
| 265 | updates queue budgets with a simple feedback-loop algorithm that |
| 266 | allows a high throughput to be achieved, while still providing |
| 267 | tight latency guarantees to time-sensitive applications. When |
| 268 | the in-service queue expires, this algorithm computes the next |
| 269 | budget of the queue so as to: |
| 270 | |
- Let large budgets be eventually assigned to the queues
  associated with I/O-bound applications performing sequential
  I/O: in fact, the longer these applications are served once
  they have got access to the device, the higher the throughput is.
| 275 | |
| 276 | - Let small budgets be eventually assigned to the queues |
| 277 | associated with time-sensitive applications (which typically |
| 278 | perform sporadic and short I/O), because, the smaller the |
| 279 | budget assigned to a queue waiting for service is, the sooner |
| 280 | B-WF2Q+ will serve that queue (Subsec 3.3 in [2]). |
| 281 | |
| 282 | - If several processes are competing for the device at the same time, |
| 283 | but all processes and groups have the same weight, then BFQ |
| 284 | guarantees the expected throughput distribution without ever idling |
| 285 | the device. It uses preemption instead. Throughput is then much |
| 286 | higher in this common scenario. |
| 287 | |
| 288 | - ioprio classes are served in strict priority order, i.e., |
| 289 | lower-priority queues are not served as long as there are |
| 290 | higher-priority queues. Among queues in the same class, the |
| 291 | bandwidth is distributed in proportion to the weight of each |
| 292 | queue. A very thin extra bandwidth is however guaranteed to |
| 293 | the Idle class, to prevent it from starving. |
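
The budget-based service model described at the start of this section
can be sketched with a toy simulation. This is only an illustrative
model, not BFQ's code: the class and function names are hypothetical,
service time is assumed to be proportional to request size, the budget
is treated as finished when the next request no longer fits in it, and
device idling, weight-raising and B-WF2Q+ queue selection are all left
out.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyQueue:
    budget: int                                     # remaining budget, in sectors
    requests: deque = field(default_factory=deque)  # pending request sizes, in sectors


def serve(queue: ToyQueue, budget_timeout: int, time_per_sector: int = 1):
    """Dispatch requests from the in-service queue until it expires.

    Returns (expiration_reason, sectors_served).
    """
    elapsed = 0
    served = 0
    while True:
        if not queue.requests:
            return "queue empty", served
        if elapsed >= budget_timeout:
            return "budget timeout", served
        size = queue.requests[0]
        if size > queue.budget:
            # Modeling assumption: the next request does not fit in what is left.
            return "budget exhausted", served
        queue.requests.popleft()
        queue.budget -= size          # budget is decremented on each dispatch
        served += size
        elapsed += size * time_per_sector


sequential = ToyQueue(budget=16, requests=deque([8, 8, 8]))
print(serve(sequential, budget_timeout=100))
```

In this model, a queue with a 16-sector budget and three 8-sector
requests is expired after two dispatches, with one request left
pending for its next service slot.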
| 294 | |
| 295 | |
Paolo Valente | 2670cd1 | 2017-08-31 20:00:31 +0200 | [diff] [blame] | 296 | 3. What are BFQ's tunables and how to properly configure BFQ? |
| 297 | ============================================================= |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 298 | |
Most BFQ tunables affect service guarantees (basically latency and
fairness) and throughput. For full details on how to choose the
desired tradeoff between service guarantees and throughput, see the
parameters slice_idle, strict_guarantees and low_latency. For details
on how to maximize throughput, see slice_idle, timeout_sync and
max_budget. The other performance-related parameters have been
inherited from CFQ, and have been preserved mostly for compatibility
with it. So far, no performance improvement has been reported after
changing the latter parameters in BFQ.

In particular, the tunables back_seek_max, back_seek_penalty,
fifo_expire_async and fifo_expire_sync below are the same as in
CFQ. Their description is just copied from that for CFQ. Some
considerations in the description of slice_idle are copied from CFQ
too.
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 314 | |
| 315 | per-process ioprio and weight |
| 316 | ----------------------------- |
| 317 | |
Unless the cgroups interface is used (see Section 4), weights can be
assigned to processes only indirectly, through I/O priorities, and
according to the relation::

  weight = (IOPRIO_BE_NR - ioprio) * 10

Beware that, if low_latency is set, then BFQ automatically raises the
weight of the queues associated with interactive and soft real-time
applications. Unset this tunable if you need/want to control weights.
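
The relation between ioprio and weight can be tabulated with a few
lines of Python. This is only a sketch: it assumes IOPRIO_BE_NR is 8
(priority levels 0 to 7), as defined in the kernel's ioprio header;
check the value for the kernel in use. The function name is
hypothetical.

```python
IOPRIO_BE_NR = 8  # assumption: number of best-effort priority levels (0..7)


def ioprio_to_weight(ioprio: int) -> int:
    """Apply weight = (IOPRIO_BE_NR - ioprio) * 10."""
    if not 0 <= ioprio < IOPRIO_BE_NR:
        raise ValueError(f"ioprio must be in 0..{IOPRIO_BE_NR - 1}")
    return (IOPRIO_BE_NR - ioprio) * 10


for prio in range(IOPRIO_BE_NR):
    print(f"ioprio {prio} -> weight {ioprio_to_weight(prio)}")
```

Under this assumption, weights range from 80 (ioprio 0) down to 10
(ioprio 7). Two I/O-bound processes in the same class, with ioprio 0
and ioprio 4, would therefore share the device throughput in a 80:40,
i.e., 2:1, ratio, provided that weight-raising does not alter their
weights.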
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 326 | |
| 327 | slice_idle |
| 328 | ---------- |
| 329 | |
This parameter specifies how long BFQ should idle, waiting for the
next I/O request, when certain sync BFQ queues become empty. By default
slice_idle is a non-zero value. Idling has a double purpose: boosting
throughput and making sure that the desired throughput distribution is
respected (see the description of how BFQ works, and, if needed, the
papers referenced there).
| 336 | |
As for throughput, idling can be very helpful on highly seeky media
like single-spindle SATA/SAS disks, where it cuts down on the overall
number of seeks and thus improves throughput.

Setting slice_idle to 0 removes all idling on queues, and one
should see an overall improved throughput on faster storage devices
like multiple SATA/SAS disks in a hardware RAID configuration, as well
as flash-based storage with internal command queueing (and
parallelism).

So, depending on storage and workload, it might be useful to set
slice_idle=0. In general, for SATA/SAS disks and software RAID of
SATA/SAS disks, keeping slice_idle enabled should be useful. For any
configuration where there are multiple spindles behind a single LUN
(host-based hardware RAID controller or storage arrays), or with
flash-based fast storage, setting slice_idle=0 might result in better
throughput and acceptable latencies.
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 354 | |
| 355 | Idling is however necessary to have service guarantees enforced in |
| 356 | case of differentiated weights or differentiated I/O-request lengths. |
| 357 | To see why, suppose that a given BFQ queue A must get several I/O |
| 358 | requests served for each request served for another queue B. Idling |
| 359 | ensures that, if A makes a new I/O request slightly after becoming |
| 360 | empty, then no request of B is dispatched in the middle, and thus A |
| 361 | does not lose the possibility to get more than one request dispatched |
| 362 | before the next request of B is dispatched. Note that idling |
| 363 | guarantees the desired differentiated treatment of queues only in |
| 364 | terms of I/O-request dispatches. To guarantee that the actual service |
| 365 | order then corresponds to the dispatch order, the strict_guarantees |
| 366 | tunable must be set too. |
| 367 | |
There is an important flip side to idling: apart from the above cases
where it is beneficial also for throughput, idling can severely impact
throughput. One important case is a random workload. Because of this
issue, BFQ tends to avoid idling as much as possible, when it is not
beneficial also for throughput (as detailed in Section 2). As a
consequence of this behavior, and of further issues described for the
strict_guarantees tunable, short-term service guarantees may be
occasionally violated. And, in some cases, these guarantees may be
more important than guaranteeing maximum throughput. For example, in
video playing/streaming, a very low drop rate may be more important
than maximum throughput. In these cases, consider setting the
strict_guarantees parameter.
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 380 | |
John Pittman | 47cb393 | 2019-01-08 16:56:13 -0500 | [diff] [blame] | 381 | slice_idle_us |
| 382 | ------------- |
| 383 | |
| 384 | Controls the same tuning parameter as slice_idle, but in microseconds. |
| 385 | Either tunable can be used to set idling behavior. Afterwards, the |
| 386 | other tunable will reflect the newly set value in sysfs. |
| 387 | |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 388 | strict_guarantees |
| 389 | ----------------- |
| 390 | |
| 391 | If this parameter is set (default: unset), then BFQ |
| 392 | |
| 393 | - always performs idling when the in-service queue becomes empty; |
| 394 | |
| 395 | - forces the device to serve one I/O request at a time, by dispatching a |
| 396 | new request only if there is no outstanding request. |
| 397 | |
| 398 | In the presence of differentiated weights or I/O-request sizes, both |
| 399 | the above conditions are needed to guarantee that every BFQ queue |
| 400 | receives its allotted share of the bandwidth. The first condition is |
| 401 | needed for the reasons explained in the description of the slice_idle |
| 402 | tunable. The second condition is needed because all modern storage |
| 403 | devices reorder internally-queued requests, which may trivially break |
| 404 | the service guarantees enforced by the I/O scheduler. |
| 405 | |
| 406 | Setting strict_guarantees may evidently affect throughput. |
| 407 | |
| 408 | back_seek_max |
| 409 | ------------- |
| 410 | |
This parameter specifies, in Kbytes, the maximum "distance" for backward
seeking. The distance is the amount of space from the current head location
to the sectors that are backward in terms of distance.
| 414 | |
| 415 | This parameter allows the scheduler to anticipate requests in the "backward" |
| 416 | direction and consider them as being the "next" if they are within this |
| 417 | distance from the current head location. |
| 418 | |
| 419 | back_seek_penalty |
| 420 | ----------------- |
| 421 | |
This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty of the distance
of a "front" request, then the seeking cost of the two requests is
considered equivalent, and the scheduler will not bias toward one or the
other request (otherwise the scheduler will bias toward the front request).
The default value of back_seek_penalty is 2.
| 428 | |
| 429 | fifo_expire_async |
| 430 | ----------------- |
| 431 | |
This parameter is used to set the timeout of asynchronous requests. The
default value is 250 ms.

fifo_expire_sync
----------------

This parameter is used to set the timeout of synchronous requests. The
default value is 125 ms. To favor synchronous requests over asynchronous
ones, this value should be decreased relative to fifo_expire_async.
| 441 | |
| 442 | low_latency |
| 443 | ----------- |
| 444 | |
| 445 | This parameter is used to enable/disable BFQ's low latency mode. By |
| 446 | default, low latency mode is enabled. If enabled, interactive and soft |
| 447 | real-time applications are privileged and experience a lower latency, |
| 448 | as explained in more detail in the description of how BFQ works. |
| 449 | |
DISABLE this mode if you need full control over bandwidth
distribution. In fact, if it is enabled, then BFQ automatically
increases the bandwidth share of privileged applications, as the main
means to guarantee a lower latency to them.

In addition, as already highlighted at the beginning of this document,
DISABLE this mode if your only goal is to achieve a high throughput.
In fact, privileging the I/O of some applications over the rest may
entail a lower throughput. To achieve the highest-possible throughput
on a non-rotational device, setting slice_idle to 0 may be needed too
(at the cost of giving up any strong guarantee on fairness and low
latency).
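
The throughput-oriented settings just described can be applied by
writing to the scheduler's sysfs attributes. The sketch below is a
minimal illustration, not an official tool. It assumes that BFQ is the
active scheduler for the device, that its tunables are exposed under
/sys/block/<device>/queue/iosched/, and that the caller has permission
to write there. The function names and the device name used in the
note below are hypothetical; the sysfs_root parameter exists only so
that the code can be tried against a scratch directory.

```python
from pathlib import Path


def iosched_dir(device: str, sysfs_root: str = "/sys") -> Path:
    """Directory assumed to hold the I/O scheduler tunables of a device."""
    return Path(sysfs_root) / "block" / device / "queue" / "iosched"


def apply_throughput_profile(device: str, sysfs_root: str = "/sys",
                             non_rotational: bool = False) -> dict:
    """Disable low-latency heuristics; optionally disable idling too.

    Returns the settings that were written.
    """
    settings = {"low_latency": "0"}
    if non_rotational:
        # Gives up strong fairness and low-latency guarantees.
        settings["slice_idle"] = "0"
    base = iosched_dir(device, sysfs_root)
    for name, value in settings.items():
        (base / name).write_text(value)
    return settings
```

For example, calling apply_throughput_profile("sda", non_rotational=True)
as root would write 0 to both low_latency and slice_idle for the
(hypothetical) device sda.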
| 462 | |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 463 | timeout_sync |
| 464 | ------------ |
| 465 | |
Maximum amount of device time that can be given to a task (queue) once
it has been selected for service. On devices with costly seeks,
increasing this time usually increases maximum throughput. On the
other hand, increasing this time coarsens the granularity of the
short-term bandwidth and latency guarantees, especially if the
max_budget parameter, described next, is set to zero.
| 472 | |
| 473 | max_budget |
| 474 | ---------- |
| 475 | |
Maximum amount of service, measured in sectors, that can be provided
to a BFQ queue once it is set in service (of course within the limits
of the above timeout). As explained in the description of
the algorithm, larger values increase the throughput in proportion to
the percentage of sequential I/O requests issued. The price of larger
values is that they coarsen the granularity of short-term bandwidth
and latency guarantees.
| 483 | |
| 484 | The default value is 0, which enables auto-tuning: BFQ sets max_budget |
| 485 | to the maximum number of sectors that can be served during |
| 486 | timeout_sync, according to the estimated peak rate. |
| 487 | |
For specific devices, some users have occasionally reported reaching
a higher throughput by setting max_budget explicitly, i.e., by
setting max_budget to a higher value than 0. In particular, they have
set max_budget to higher values than those to which BFQ would have set
it with auto-tuning. An alternative way to achieve this goal is to
just increase the value of timeout_sync, leaving max_budget equal to 0.
| 494 | |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 495 | 4. Group scheduling with BFQ |
| 496 | ============================ |
| 497 | |
Arianna Avanzini | e21b7a0 | 2017-04-12 18:23:08 +0200 | [diff] [blame] | 498 | BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely |
| 499 | blkio and io. In particular, BFQ supports weight-based proportional |
| 500 | share. To activate cgroups support, enable CONFIG_BFQ_GROUP_IOSCHED. |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 501 | |
| 502 | 4-1 Service guarantees provided |
| 503 | ------------------------------- |
| 504 | |
| 505 | With BFQ, proportional share means true proportional share of the |
| 506 | device bandwidth, according to group weights. For example, a group |
| 507 | with weight 200 gets twice the bandwidth, and not just twice the time, |
| 508 | of a group with weight 100. |
| 509 | |
| 510 | BFQ supports hierarchies (group trees) of any depth. Bandwidth is |
| 511 | distributed among groups and processes in the expected way: for each |
| 512 | group, the children of the group share the whole bandwidth of the |
| 513 | group in proportion to their weights. In particular, this implies |
| 514 | that, for each leaf group, every process of the group receives the |
| 515 | same share of the whole group bandwidth, unless the ioprio of the |
| 516 | process is modified. |
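
The hierarchical distribution described above can be sketched as a small recursive computation. This is only an illustration of the arithmetic, assuming that every group and process is continuously backlogged; the tree representation, the function name and the sample hierarchy are hypothetical and are not part of any BFQ interface.

```python
def bandwidth_shares(tree, total=1.0):
    """Return {leaf name: fraction of the device bandwidth}.

    `tree` maps a child name to (weight, subtree); a subtree of None
    marks a leaf (a process or an empty group).
    """
    shares = {}
    weight_sum = sum(weight for weight, _ in tree.values())
    for name, (weight, subtree) in tree.items():
        # Each child gets a part of its parent's share in proportion to its weight.
        share = total * weight / weight_sum
        if subtree is None:
            shares[name] = share
        else:
            shares.update(bandwidth_shares(subtree, share))
    return shares


# Hypothetical hierarchy: group A (weight 200) and group B (weight 100);
# A contains two processes with equal weights.
hierarchy = {
    "A": (200, {"A/p1": (100, None), "A/p2": (100, None)}),
    "B": (100, None),
}
shares = bandwidth_shares(hierarchy)
```

In this example group A as a whole receives twice the bandwidth of group B, and the two processes in A split the share of A equally.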
| 517 | |
| 518 | The resource-sharing guarantee for a group may partially or totally |
| 519 | switch from bandwidth to time, if providing bandwidth guarantees to |
| 520 | the group lowers the throughput too much. This switch occurs on a |
| 521 | per-process basis: if serving a process of a leaf group in such a way |
| 522 | that it receives its share of the bandwidth causes throughput loss, then |
| 523 | BFQ switches back to just time-based proportional share for that |
| 524 | process. |
| 525 | |
| 526 | 4-2 Interface |
| 527 | ------------- |
| 528 | |
| 529 | To get proportional sharing of bandwidth with BFQ for a given device, |
| 530 | BFQ must of course be the active scheduler for that device. |
| 531 | |
| 532 | Within each group directory, the names of the files associated with |
| 533 | BFQ-specific cgroup parameters and stats begin with the "bfq." |
| 534 | prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for |
| 535 | BFQ-specific files is "blkio.bfq." or "io.bfq.", respectively. For |
| 536 | example, the group parameter to set the weight of a group with BFQ is |
| 537 | blkio.bfq.weight or io.bfq.weight. |
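
The naming scheme above can be illustrated with a minimal sketch that builds the name of a BFQ-specific file and writes a weight to it. The helper names are hypothetical, and the group directory is an assumption: on a real system it is a cgroup directory that must already exist, and writing to it normally requires root privileges and BFQ as the active scheduler.

```python
from pathlib import Path


def bfq_file_name(param: str, cgroup_version: int) -> str:
    """Name of a BFQ-specific cgroup file, e.g. 'io.bfq.weight' for cgroups-v2."""
    controller = {1: "blkio", 2: "io"}[cgroup_version]
    return f"{controller}.bfq.{param}"


def set_group_weight(group_dir: str, weight: int, cgroup_version: int = 2) -> Path:
    """Write `weight` to the BFQ weight file inside `group_dir`."""
    path = Path(group_dir) / bfq_file_name("weight", cgroup_version)
    path.write_text(f"{weight}\n")
    return path
```

For instance, `set_group_weight("/sys/fs/cgroup/mygroup", 200)` would write to `io.bfq.weight` in a hypothetical cgroups-v2 group named mygroup.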
| 538 | |
Luca Miccio | a33801e | 2017-11-13 07:34:10 +0100 | [diff] [blame] | 539 | As for cgroups-v1 (blkio controller), the exact set of stat files |
| 540 | created, and kept up-to-date by bfq, depends on whether |
Christoph Hellwig | 8060c47 | 2019-06-06 12:26:24 +0200 | [diff] [blame] | 541 | CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all |
Luca Miccio | a33801e | 2017-11-13 07:34:10 +0100 | [diff] [blame] | 542 | the stat files documented in |
Mauro Carvalho Chehab | da82c92 | 2019-06-27 13:08:35 -0300 | [diff] [blame] | 543 | Documentation/admin-guide/cgroup-v1/blkio-controller.rst. If, instead, |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 544 | CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files:: |
| 545 | |
| 546 | blkio.bfq.io_service_bytes |
| 547 | blkio.bfq.io_service_bytes_recursive |
| 548 | blkio.bfq.io_serviced |
| 549 | blkio.bfq.io_serviced_recursive |
Luca Miccio | a33801e | 2017-11-13 07:34:10 +0100 | [diff] [blame] | 550 | |
Christoph Hellwig | 8060c47 | 2019-06-06 12:26:24 +0200 | [diff] [blame] | 551 | The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum |
Luca Miccio | a33801e | 2017-11-13 07:34:10 +0100 | [diff] [blame] | 552 | throughput sustainable with bfq, because updating the blkio.bfq.* |
| 553 | stats is rather costly, especially for some of the stats enabled by |
Christoph Hellwig | 8060c47 | 2019-06-06 12:26:24 +0200 | [diff] [blame] | 554 | CONFIG_BFQ_CGROUP_DEBUG. |
Luca Miccio | a33801e | 2017-11-13 07:34:10 +0100 | [diff] [blame] | 555 | |
Kir Kolyshkin | fda0b5b | 2021-06-14 14:41:09 -0700 | [diff] [blame] | 556 | Parameters |
| 557 | ---------- |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 558 | |
Kir Kolyshkin | fda0b5b | 2021-06-14 14:41:09 -0700 | [diff] [blame] | 559 | For each group, the following parameters can be set: |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 560 | |
Kir Kolyshkin | fda0b5b | 2021-06-14 14:41:09 -0700 | [diff] [blame] | 561 | weight |
| 562 | This specifies the default weight for the cgroup inside its parent. |
| 563 | Available values: 1..1000 (default: 100). |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 564 | |
Kir Kolyshkin | fda0b5b | 2021-06-14 14:41:09 -0700 | [diff] [blame] | 565 | For cgroup v1, it is set by writing the value to `blkio.bfq.weight`. |
| 566 | |
| 567 | For cgroup v2, it is set by writing the value to `io.bfq.weight` |
| 568 | (with an optional prefix of `default` and a space). |
| 569 | |
| 570 | The linear mapping between ioprio and weights, described at the beginning |
| 571 | of the tunable section, is still valid, but all weights higher than |
| 572 | IOPRIO_BE_NR*10 are mapped to ioprio 0. |
| 573 | |
| 574 | Recall that, if low_latency is set, then BFQ automatically raises the |
| 575 | weight of the queues associated with interactive and soft real-time |
| 576 | applications. Unset this tunable if you need full control over weights. |
| 577 | |
| 578 | weight_device |
| 579 | This specifies a per-device weight for the cgroup. The syntax is |
| 580 | `major:minor weight`. A weight of `0` may be used to reset to the default |
| 581 | weight. |
| 582 | |
| 583 | For cgroup v1, it is set by writing the value to `blkio.bfq.weight_device`. |
| 584 | |
| 585 | For cgroup v2, the file name is `io.bfq.weight`. |
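
The relation between ioprio and weights mentioned for the weight parameter can be sketched as follows. The sketch assumes the linear mapping weight = (IOPRIO_BE_NR - ioprio) * 10 and a value of 8 for IOPRIO_BE_NR; the rounding and clamping used in the weight-to-ioprio direction are choices made for this example, not taken from the BFQ code.

```python
IOPRIO_BE_NR = 8  # assumed number of best-effort ioprio levels (0..7)


def ioprio_to_weight(ioprio: int) -> int:
    """Assumed linear mapping from an ioprio level to a weight."""
    if not 0 <= ioprio < IOPRIO_BE_NR:
        raise ValueError("ioprio out of range")
    return (IOPRIO_BE_NR - ioprio) * 10


def weight_to_ioprio(weight: int) -> int:
    """Inverse mapping; weights above IOPRIO_BE_NR * 10 all map to ioprio 0."""
    if not 1 <= weight <= 1000:
        raise ValueError("weight must be in 1..1000")
    ioprio = IOPRIO_BE_NR - weight // 10
    # Clamp to the valid ioprio range (a choice made for this sketch).
    return min(IOPRIO_BE_NR - 1, max(0, ioprio))
```

Under these assumptions, the default group weight of 100 lies above IOPRIO_BE_NR*10 and therefore corresponds to ioprio 0.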
Paolo Valente | 44e44a1 | 2017-04-12 18:23:12 +0200 | [diff] [blame] | 586 | |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 587 | |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 588 | [1] |
| 589 | P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 590 | Scheduler", Proceedings of the First Workshop on Mobile System |
| 591 | Technologies (MST-2015), May 2015. |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 592 | |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 593 | http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf |
| 594 | |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 595 | [2] |
| 596 | P. Valente and M. Andreolini, "Improving Application |
Paolo Valente | aee69d7 | 2017-04-19 08:29:02 -0600 | [diff] [blame] | 597 | Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of |
| 598 | the 5th Annual International Systems and Storage Conference |
| 599 | (SYSTOR '12), June 2012. |
Paolo Valente | 4438cf5 | 2019-03-12 09:59:35 +0100 | [diff] [blame] | 600 | |
Mauro Carvalho Chehab | 898bd37 | 2019-04-18 19:45:00 -0300 | [diff] [blame] | 601 | Slightly extended version: |
| 602 | |
| 603 | http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf |
| 604 | |
| 605 | [3] |
| 606 | https://github.com/Algodev-github/S |