blob: 1b87df6cd4761ab0e548d0cc0611c2c5b74725eb [file] [log] [blame]
Paolo Valenteaee69d72017-04-19 08:29:02 -06001BFQ (Budget Fair Queueing)
2==========================
3
4BFQ is a proportional-share I/O scheduler, with some extra
5low-latency capabilities. In addition to cgroups support (blkio or io
6controllers), BFQ's main features are:
7- BFQ guarantees a high system and application responsiveness, and a
8 low latency for time-sensitive applications, such as audio or video
9 players;
10- BFQ distributes bandwidth, and not just time, among processes or
11 groups (switching back to time distribution when needed to keep
12 throughput high).
13
14On average CPUs, the current version of BFQ can handle devices
15performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
16reference, 30-50 KIOPS correspond to very high bandwidths with
17sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
18to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
19multi-queue devices.
20
21The table of contents follow. Impatients can just jump to Section 3.
22
23CONTENTS
24
251. When may BFQ be useful?
26 1-1 Personal systems
27 1-2 Server systems
282. How does BFQ work?
293. What are BFQ's tunable?
304. BFQ group scheduling
31 4-1 Service guarantees provided
32 4-2 Interface
33
341. When may BFQ be useful?
35==========================
36
37BFQ provides the following benefits on personal and server systems.
38
391-1 Personal systems
40--------------------
41
42Low latency for interactive applications
43
44Regardless of the actual background workload, BFQ guarantees that, for
45interactive tasks, the storage device is virtually as responsive as if
46it was idle. For example, even if one or more of the following
47background workloads are being executed:
48- one or more large files are being read, written or copied,
49- a tree of source files is being compiled,
50- one or more virtual machines are performing I/O,
51- a software update is in progress,
52- indexing daemons are scanning filesystems and updating their
53 databases,
54starting an application or loading a file from within an application
55takes about the same time as if the storage device was idle. As a
56comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
57applications experience high latencies, or even become unresponsive
58until the background workload terminates (also on SSDs).
59
60Low latency for soft real-time applications
61
62Also soft real-time applications, such as audio and video
63players/streamers, enjoy a low latency and a low drop rate, regardless
64of the background I/O workload. As a consequence, these applications
65do not suffer from almost any glitch due to the background workload.
66
67Higher speed for code-development tasks
68
69If some additional workload happens to be executed in parallel, then
70BFQ executes the I/O-related components of typical code-development
71tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
72NOOP or DEADLINE.
73
74High throughput
75
76On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
77up to 150% higher throughput than DEADLINE and NOOP, with all the
78sequential workloads considered in our tests. With random workloads,
79and with all the workloads on flash-based devices, BFQ achieves,
80instead, about the same throughput as the other schedulers.
81
82Strong fairness, bandwidth and delay guarantees
83
84BFQ distributes the device throughput, and not just the device time,
85among I/O-bound applications in proportion their weights, with any
86workload and regardless of the device parameters. From these bandwidth
87guarantees, it is possible to compute tight per-I/O-request delay
88guarantees by a simple formula. If not configured for strict service
89guarantees, BFQ switches to time-based resource sharing (only) for
90applications that would otherwise cause a throughput loss.
91
921-2 Server systems
93------------------
94
95Most benefits for server systems follow from the same service
96properties as above. In particular, regardless of whether additional,
97possibly heavy workloads are being served, BFQ guarantees:
98
99. audio and video-streaming with zero or very low jitter and drop
100 rate;
101
102. fast retrieval of WEB pages and embedded objects;
103
104. real-time recording of data in live-dumping applications (e.g.,
105 packet logging);
106
107. responsiveness in local and remote access to a server.
108
109
1102. How does BFQ work?
111=====================
112
113BFQ is a proportional-share I/O scheduler, whose general structure,
114plus a lot of code, are borrowed from CFQ.
115
116- Each process doing I/O on a device is associated with a weight and a
117 (bfq_)queue.
118
119- BFQ grants exclusive access to the device, for a while, to one queue
120 (process) at a time, and implements this service model by
121 associating every queue with a budget, measured in number of
122 sectors.
123
124 - After a queue is granted access to the device, the budget of the
125 queue is decremented, on each request dispatch, by the size of the
126 request.
127
128 - The in-service queue is expired, i.e., its service is suspended,
129 only if one of the following events occurs: 1) the queue finishes
130 its budget, 2) the queue empties, 3) a "budget timeout" fires.
131
132 - The budget timeout prevents processes doing random I/O from
133 holding the device for too long and dramatically reducing
134 throughput.
135
136 - Actually, as in CFQ, a queue associated with a process issuing
137 sync requests may not be expired immediately when it empties. In
138 contrast, BFQ may idle the device for a short time interval,
139 giving the process the chance to go on being served if it issues
140 a new request in time. Device idling typically boosts the
141 throughput on rotational devices, if processes do synchronous
142 and sequential I/O. In addition, under BFQ, device idling is
143 also instrumental in guaranteeing the desired throughput
144 fraction to processes issuing sync requests (see the description
145 of the slice_idle tunable in this document, or [1, 2], for more
146 details).
147
148 - With respect to idling for service guarantees, if several
149 processes are competing for the device at the same time, but
150 all processes (and groups, after the following commit) have
151 the same weight, then BFQ guarantees the expected throughput
152 distribution without ever idling the device. Throughput is
153 thus as high as possible in this common scenario.
154
155 - If low-latency mode is enabled (default configuration), BFQ
156 executes some special heuristics to detect interactive and soft
157 real-time applications (e.g., video or audio players/streamers),
158 and to reduce their latency. The most important action taken to
159 achieve this goal is to give to the queues associated with these
160 applications more than their fair share of the device
161 throughput. For brevity, we call just "weight-raising" the whole
162 sets of actions taken by BFQ to privilege these queues. In
163 particular, BFQ provides a milder form of weight-raising for
164 interactive applications, and a stronger form for soft real-time
165 applications.
166
167 - BFQ automatically deactivates idling for queues born in a burst of
168 queue creations. In fact, these queues are usually associated with
169 the processes of applications and services that benefit mostly
170 from a high throughput. Examples are systemd during boot, or git
171 grep.
172
173 - As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
174 performing random I/O that becomes mostly sequential if
175 merged. Differently from CFQ, BFQ achieves this goal with a more
176 reactive mechanism, called Early Queue Merge (EQM). EQM is so
177 responsive in detecting interleaved I/O (cooperating processes),
178 that it enables BFQ to achieve a high throughput, by queue
179 merging, even for queues for which CFQ needs a different
180 mechanism, preemption, to get a high throughput. As such EQM is a
181 unified mechanism to achieve a high throughput with interleaved
182 I/O.
183
184 - Queues are scheduled according to a variant of WF2Q+, named
185 B-WF2Q+, and implemented using an augmented rb-tree to preserve an
186 O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
187 also ready for hierarchical scheduling. However, for a cleaner
188 logical breakdown, the code that enables and completes
189 hierarchical support is provided in the next commit, which focuses
190 exactly on this feature.
191
192 - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
193 perfectly fair, and smooth service. In particular, B-WF2Q+
194 guarantees that each queue receives a fraction of the device
195 throughput proportional to its weight, even if the throughput
196 fluctuates, and regardless of: the device parameters, the current
197 workload and the budgets assigned to the queue.
198
199 - The last, budget-independence, property (although probably
200 counterintuitive in the first place) is definitely beneficial, for
201 the following reasons:
202
203 - First, with any proportional-share scheduler, the maximum
204 deviation with respect to an ideal service is proportional to
205 the maximum budget (slice) assigned to queues. As a consequence,
206 BFQ can keep this deviation tight not only because of the
207 accurate service of B-WF2Q+, but also because BFQ *does not*
208 need to assign a larger budget to a queue to let the queue
209 receive a higher fraction of the device throughput.
210
211 - Second, BFQ is free to choose, for every process (queue), the
212 budget that best fits the needs of the process, or best
213 leverages the I/O pattern of the process. In particular, BFQ
214 updates queue budgets with a simple feedback-loop algorithm that
215 allows a high throughput to be achieved, while still providing
216 tight latency guarantees to time-sensitive applications. When
217 the in-service queue expires, this algorithm computes the next
218 budget of the queue so as to:
219
220 - Let large budgets be eventually assigned to the queues
221 associated with I/O-bound applications performing sequential
222 I/O: in fact, the longer these applications are served once
223 got access to the device, the higher the throughput is.
224
225 - Let small budgets be eventually assigned to the queues
226 associated with time-sensitive applications (which typically
227 perform sporadic and short I/O), because, the smaller the
228 budget assigned to a queue waiting for service is, the sooner
229 B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
230
231- If several processes are competing for the device at the same time,
232 but all processes and groups have the same weight, then BFQ
233 guarantees the expected throughput distribution without ever idling
234 the device. It uses preemption instead. Throughput is then much
235 higher in this common scenario.
236
237- ioprio classes are served in strict priority order, i.e.,
238 lower-priority queues are not served as long as there are
239 higher-priority queues. Among queues in the same class, the
240 bandwidth is distributed in proportion to the weight of each
241 queue. A very thin extra bandwidth is however guaranteed to
242 the Idle class, to prevent it from starving.
243
244
2453. What are BFQ's tunable?
246==========================
247
248The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
249fifo_expire_sync below are the same as in CFQ. Their description is
250just copied from that for CFQ. Some considerations in the description
251of slice_idle are copied from CFQ too.
252
253per-process ioprio and weight
254-----------------------------
255
Arianna Avanzinie21b7a02017-04-12 18:23:08 +0200256Unless the cgroups interface is used (see "4. BFQ group scheduling"),
257weights can be assigned to processes only indirectly, through I/O
258priorities, and according to the relation:
259weight = (IOPRIO_BE_NR - ioprio) * 10.
260
261Beware that, if low-latency is set, then BFQ automatically raises the
262weight of the queues associated with interactive and soft real-time
263applications. Unset this tunable if you need/want to control weights.
Paolo Valenteaee69d72017-04-19 08:29:02 -0600264
265slice_idle
266----------
267
268This parameter specifies how long BFQ should idle for next I/O
269request, when certain sync BFQ queues become empty. By default
270slice_idle is a non-zero value. Idling has a double purpose: boosting
271throughput and making sure that the desired throughput distribution is
272respected (see the description of how BFQ works, and, if needed, the
273papers referred there).
274
275As for throughput, idling can be very helpful on highly seeky media
276like single spindle SATA/SAS disks where we can cut down on overall
277number of seeks and see improved throughput.
278
279Setting slice_idle to 0 will remove all the idling on queues and one
280should see an overall improved throughput on faster storage devices
281like multiple SATA/SAS disks in hardware RAID configuration.
282
283So depending on storage and workload, it might be useful to set
284slice_idle=0. In general for SATA/SAS disks and software RAID of
285SATA/SAS disks keeping slice_idle enabled should be useful. For any
286configurations where there are multiple spindles behind single LUN
287(Host based hardware RAID controller or for storage arrays), setting
288slice_idle=0 might end up in better throughput and acceptable
289latencies.
290
291Idling is however necessary to have service guarantees enforced in
292case of differentiated weights or differentiated I/O-request lengths.
293To see why, suppose that a given BFQ queue A must get several I/O
294requests served for each request served for another queue B. Idling
295ensures that, if A makes a new I/O request slightly after becoming
296empty, then no request of B is dispatched in the middle, and thus A
297does not lose the possibility to get more than one request dispatched
298before the next request of B is dispatched. Note that idling
299guarantees the desired differentiated treatment of queues only in
300terms of I/O-request dispatches. To guarantee that the actual service
301order then corresponds to the dispatch order, the strict_guarantees
302tunable must be set too.
303
304There is an important flipside for idling: apart from the above cases
305where it is beneficial also for throughput, idling can severely impact
306throughput. One important case is random workload. Because of this
307issue, BFQ tends to avoid idling as much as possible, when it is not
308beneficial also for throughput. As a consequence of this behavior, and
309of further issues described for the strict_guarantees tunable,
310short-term service guarantees may be occasionally violated. And, in
311some cases, these guarantees may be more important than guaranteeing
312maximum throughput. For example, in video playing/streaming, a very
313low drop rate may be more important than maximum throughput. In these
314cases, consider setting the strict_guarantees parameter.
315
316strict_guarantees
317-----------------
318
319If this parameter is set (default: unset), then BFQ
320
321- always performs idling when the in-service queue becomes empty;
322
323- forces the device to serve one I/O request at a time, by dispatching a
324 new request only if there is no outstanding request.
325
326In the presence of differentiated weights or I/O-request sizes, both
327the above conditions are needed to guarantee that every BFQ queue
328receives its allotted share of the bandwidth. The first condition is
329needed for the reasons explained in the description of the slice_idle
330tunable. The second condition is needed because all modern storage
331devices reorder internally-queued requests, which may trivially break
332the service guarantees enforced by the I/O scheduler.
333
334Setting strict_guarantees may evidently affect throughput.
335
336back_seek_max
337-------------
338
339This specifies, given in Kbytes, the maximum "distance" for backward seeking.
340The distance is the amount of space from the current head location to the
341sectors that are backward in terms of distance.
342
343This parameter allows the scheduler to anticipate requests in the "backward"
344direction and consider them as being the "next" if they are within this
345distance from the current head location.
346
347back_seek_penalty
348-----------------
349
350This parameter is used to compute the cost of backward seeking. If the
351backward distance of request is just 1/back_seek_penalty from a "front"
352request, then the seeking cost of two requests is considered equivalent.
353
354So scheduler will not bias toward one or the other request (otherwise scheduler
355will bias toward front request). Default value of back_seek_penalty is 2.
356
357fifo_expire_async
358-----------------
359
360This parameter is used to set the timeout of asynchronous requests. Default
361value of this is 248ms.
362
363fifo_expire_sync
364----------------
365
366This parameter is used to set the timeout of synchronous requests. Default
367value of this is 124ms. In case to favor synchronous requests over asynchronous
368one, this value should be decreased relative to fifo_expire_async.
369
370low_latency
371-----------
372
373This parameter is used to enable/disable BFQ's low latency mode. By
374default, low latency mode is enabled. If enabled, interactive and soft
375real-time applications are privileged and experience a lower latency,
376as explained in more detail in the description of how BFQ works.
377
Paolo Valente44e44a12017-04-12 18:23:12 +0200378DO NOT enable this mode if you need full control on bandwidth
379distribution. In fact, if it is enabled, then BFQ automatically
380increases the bandwidth share of privileged applications, as the main
381means to guarantee a lower latency to them.
382
Paolo Valenteaee69d72017-04-19 08:29:02 -0600383timeout_sync
384------------
385
386Maximum amount of device time that can be given to a task (queue) once
387it has been selected for service. On devices with costly seeks,
388increasing this time usually increases maximum throughput. On the
389opposite end, increasing this time coarsens the granularity of the
390short-term bandwidth and latency guarantees, especially if the
391following parameter is set to zero.
392
393max_budget
394----------
395
396Maximum amount of service, measured in sectors, that can be provided
397to a BFQ queue once it is set in service (of course within the limits
398of the above timeout). According to what said in the description of
399the algorithm, larger values increase the throughput in proportion to
400the percentage of sequential I/O requests issued. The price of larger
401values is that they coarsen the granularity of short-term bandwidth
402and latency guarantees.
403
404The default value is 0, which enables auto-tuning: BFQ sets max_budget
405to the maximum number of sectors that can be served during
406timeout_sync, according to the estimated peak rate.
407
408weights
409-------
410
411Read-only parameter, used to show the weights of the currently active
412BFQ queues.
413
414
415wr_ tunables
416------------
417
418BFQ exports a few parameters to control/tune the behavior of
419low-latency heuristics.
420
421wr_coeff
422
423Factor by which the weight of a weight-raised queue is multiplied. If
424the queue is deemed soft real-time, then the weight is further
425multiplied by an additional, constant factor.
426
427wr_max_time
428
429Maximum duration of a weight-raising period for an interactive task
430(ms). If set to zero (default value), then this value is computed
431automatically, as a function of the peak rate of the device. In any
432case, when the value of this parameter is read, it always reports the
433current duration, regardless of whether it has been set manually or
434computed automatically.
435
436wr_max_softrt_rate
437
438Maximum service rate below which a queue is deemed to be associated
439with a soft real-time application, and is then weight-raised
440accordingly (sectors/sec).
441
442wr_min_idle_time
443
444Minimum idle period after which interactive weight-raising may be
445reactivated for a queue (in ms).
446
447wr_rt_max_time
448
449Maximum weight-raising duration for soft real-time queues (in ms). The
450start time from which this duration is considered is automatically
451moved forward if the queue is detected to be still soft real-time
452before the current soft real-time weight-raising period finishes.
453
454wr_min_inter_arr_async
455
456Minimum period between I/O request arrivals after which weight-raising
457may be reactivated for an already busy async queue (in ms).
458
459
4604. Group scheduling with BFQ
461============================
462
Arianna Avanzinie21b7a02017-04-12 18:23:08 +0200463BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
464blkio and io. In particular, BFQ supports weight-based proportional
465share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
Paolo Valenteaee69d72017-04-19 08:29:02 -0600466
4674-1 Service guarantees provided
468-------------------------------
469
470With BFQ, proportional share means true proportional share of the
471device bandwidth, according to group weights. For example, a group
472with weight 200 gets twice the bandwidth, and not just twice the time,
473of a group with weight 100.
474
475BFQ supports hierarchies (group trees) of any depth. Bandwidth is
476distributed among groups and processes in the expected way: for each
477group, the children of the group share the whole bandwidth of the
478group in proportion to their weights. In particular, this implies
479that, for each leaf group, every process of the group receives the
480same share of the whole group bandwidth, unless the ioprio of the
481process is modified.
482
483The resource-sharing guarantee for a group may partially or totally
484switch from bandwidth to time, if providing bandwidth guarantees to
485the group lowers the throughput too much. This switch occurs on a
486per-process basis: if a process of a leaf group causes throughput loss
487if served in such a way to receive its share of the bandwidth, then
488BFQ switches back to just time-based proportional share for that
489process.
490
4914-2 Interface
492-------------
493
494To get proportional sharing of bandwidth with BFQ for a given device,
495BFQ must of course be the active scheduler for that device.
496
497Within each group directory, the names of the files associated with
498BFQ-specific cgroup parameters and stats begin with the "bfq."
499prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
500BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
501parameter to set the weight of a group with BFQ is blkio.bfq.weight
502or io.bfq.weight.
503
504Parameters to set
505-----------------
506
507For each group, there is only the following parameter to set.
508
509weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
510group inside its parent. Available values: 1..10000 (default 100). The
511linear mapping between ioprio and weights, described at the beginning
512of the tunable section, is still valid, but all weights higher than
513IOPRIO_BE_NR*10 are mapped to ioprio 0.
514
Paolo Valente44e44a12017-04-12 18:23:12 +0200515Recall that, if low-latency is set, then BFQ automatically raises the
516weight of the queues associated with interactive and soft real-time
517applications. Unset this tunable if you need/want to control weights.
518
Paolo Valenteaee69d72017-04-19 08:29:02 -0600519
520[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
521 Scheduler", Proceedings of the First Workshop on Mobile System
522 Technologies (MST-2015), May 2015.
523 http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
524
525[2] P. Valente and M. Andreolini, "Improving Application
526 Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
527 the 5th Annual International Systems and Storage Conference
528 (SYSTOR '12), June 2012.
529 Slightly extended version:
530 http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
531 results.pdf