blob: 03ff4cc7991188e15932644c3ee6b71050a91f8d [file] [log] [blame]
Paolo Valenteaee69d72017-04-19 08:29:02 -06001BFQ (Budget Fair Queueing)
2==========================
3
4BFQ is a proportional-share I/O scheduler, with some extra
5low-latency capabilities. In addition to cgroups support (blkio or io
6controllers), BFQ's main features are:
7- BFQ guarantees a high system and application responsiveness, and a
8 low latency for time-sensitive applications, such as audio or video
9 players;
10- BFQ distributes bandwidth, and not just time, among processes or
11 groups (switching back to time distribution when needed to keep
12 throughput high).
13
Paolo Valente43c1b3d2017-05-09 12:54:23 +020014In its default configuration, BFQ privileges latency over
15throughput. So, when needed for achieving a lower latency, BFQ builds
16schedules that may lead to a lower throughput. If your main or only
17goal, for a given device, is to achieve the maximum-possible
18throughput at all times, then do switch off all low-latency heuristics
Paolo Valente233f0bf2017-08-31 20:00:30 +020019for that device, by setting low_latency to 0. See Section 3 for
20details on how to configure BFQ for the desired tradeoff between
21latency and throughput, or on how to maximize throughput.
Paolo Valente43c1b3d2017-05-09 12:54:23 +020022
Paolo Valenteaee69d72017-04-19 08:29:02 -060023On average CPUs, the current version of BFQ can handle devices
24performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
25reference, 30-50 KIOPS correspond to very high bandwidths with
26sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
Paolo Valente233f0bf2017-08-31 20:00:30 +020027to 120-200 MB/s with 4KB random I/O. BFQ is currently being tested on
28multi-queue devices too.
Paolo Valenteaee69d72017-04-19 08:29:02 -060029
30The table of contents follow. Impatients can just jump to Section 3.
31
32CONTENTS
33
341. When may BFQ be useful?
35 1-1 Personal systems
36 1-2 Server systems
372. How does BFQ work?
383. What are BFQ's tunable?
394. BFQ group scheduling
40 4-1 Service guarantees provided
41 4-2 Interface
42
431. When may BFQ be useful?
44==========================
45
46BFQ provides the following benefits on personal and server systems.
47
481-1 Personal systems
49--------------------
50
51Low latency for interactive applications
52
53Regardless of the actual background workload, BFQ guarantees that, for
54interactive tasks, the storage device is virtually as responsive as if
55it was idle. For example, even if one or more of the following
56background workloads are being executed:
57- one or more large files are being read, written or copied,
58- a tree of source files is being compiled,
59- one or more virtual machines are performing I/O,
60- a software update is in progress,
61- indexing daemons are scanning filesystems and updating their
62 databases,
63starting an application or loading a file from within an application
64takes about the same time as if the storage device was idle. As a
65comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
66applications experience high latencies, or even become unresponsive
67until the background workload terminates (also on SSDs).
68
69Low latency for soft real-time applications
70
71Also soft real-time applications, such as audio and video
72players/streamers, enjoy a low latency and a low drop rate, regardless
73of the background I/O workload. As a consequence, these applications
74do not suffer from almost any glitch due to the background workload.
75
76Higher speed for code-development tasks
77
78If some additional workload happens to be executed in parallel, then
79BFQ executes the I/O-related components of typical code-development
80tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
81NOOP or DEADLINE.
82
83High throughput
84
85On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
86up to 150% higher throughput than DEADLINE and NOOP, with all the
87sequential workloads considered in our tests. With random workloads,
88and with all the workloads on flash-based devices, BFQ achieves,
89instead, about the same throughput as the other schedulers.
90
91Strong fairness, bandwidth and delay guarantees
92
93BFQ distributes the device throughput, and not just the device time,
94among I/O-bound applications in proportion their weights, with any
95workload and regardless of the device parameters. From these bandwidth
96guarantees, it is possible to compute tight per-I/O-request delay
97guarantees by a simple formula. If not configured for strict service
98guarantees, BFQ switches to time-based resource sharing (only) for
99applications that would otherwise cause a throughput loss.
100
1011-2 Server systems
102------------------
103
104Most benefits for server systems follow from the same service
105properties as above. In particular, regardless of whether additional,
106possibly heavy workloads are being served, BFQ guarantees:
107
108. audio and video-streaming with zero or very low jitter and drop
109 rate;
110
111. fast retrieval of WEB pages and embedded objects;
112
113. real-time recording of data in live-dumping applications (e.g.,
114 packet logging);
115
116. responsiveness in local and remote access to a server.
117
118
1192. How does BFQ work?
120=====================
121
122BFQ is a proportional-share I/O scheduler, whose general structure,
123plus a lot of code, are borrowed from CFQ.
124
125- Each process doing I/O on a device is associated with a weight and a
126 (bfq_)queue.
127
128- BFQ grants exclusive access to the device, for a while, to one queue
129 (process) at a time, and implements this service model by
130 associating every queue with a budget, measured in number of
131 sectors.
132
133 - After a queue is granted access to the device, the budget of the
134 queue is decremented, on each request dispatch, by the size of the
135 request.
136
137 - The in-service queue is expired, i.e., its service is suspended,
138 only if one of the following events occurs: 1) the queue finishes
139 its budget, 2) the queue empties, 3) a "budget timeout" fires.
140
141 - The budget timeout prevents processes doing random I/O from
142 holding the device for too long and dramatically reducing
143 throughput.
144
145 - Actually, as in CFQ, a queue associated with a process issuing
146 sync requests may not be expired immediately when it empties. In
147 contrast, BFQ may idle the device for a short time interval,
148 giving the process the chance to go on being served if it issues
149 a new request in time. Device idling typically boosts the
150 throughput on rotational devices, if processes do synchronous
151 and sequential I/O. In addition, under BFQ, device idling is
152 also instrumental in guaranteeing the desired throughput
153 fraction to processes issuing sync requests (see the description
154 of the slice_idle tunable in this document, or [1, 2], for more
155 details).
156
157 - With respect to idling for service guarantees, if several
158 processes are competing for the device at the same time, but
Paolo Valente233f0bf2017-08-31 20:00:30 +0200159 all processes and groups have the same weight, then BFQ
160 guarantees the expected throughput distribution without ever
161 idling the device. Throughput is thus as high as possible in
162 this common scenario.
Paolo Valenteaee69d72017-04-19 08:29:02 -0600163
164 - If low-latency mode is enabled (default configuration), BFQ
165 executes some special heuristics to detect interactive and soft
166 real-time applications (e.g., video or audio players/streamers),
167 and to reduce their latency. The most important action taken to
168 achieve this goal is to give to the queues associated with these
169 applications more than their fair share of the device
170 throughput. For brevity, we call just "weight-raising" the whole
171 sets of actions taken by BFQ to privilege these queues. In
172 particular, BFQ provides a milder form of weight-raising for
173 interactive applications, and a stronger form for soft real-time
174 applications.
175
176 - BFQ automatically deactivates idling for queues born in a burst of
177 queue creations. In fact, these queues are usually associated with
178 the processes of applications and services that benefit mostly
179 from a high throughput. Examples are systemd during boot, or git
180 grep.
181
182 - As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
183 performing random I/O that becomes mostly sequential if
184 merged. Differently from CFQ, BFQ achieves this goal with a more
185 reactive mechanism, called Early Queue Merge (EQM). EQM is so
186 responsive in detecting interleaved I/O (cooperating processes),
187 that it enables BFQ to achieve a high throughput, by queue
188 merging, even for queues for which CFQ needs a different
189 mechanism, preemption, to get a high throughput. As such EQM is a
190 unified mechanism to achieve a high throughput with interleaved
191 I/O.
192
193 - Queues are scheduled according to a variant of WF2Q+, named
194 B-WF2Q+, and implemented using an augmented rb-tree to preserve an
195 O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
Paolo Valente233f0bf2017-08-31 20:00:30 +0200196 also ready for hierarchical scheduling, details in Section 4.
Paolo Valenteaee69d72017-04-19 08:29:02 -0600197
198 - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
199 perfectly fair, and smooth service. In particular, B-WF2Q+
200 guarantees that each queue receives a fraction of the device
201 throughput proportional to its weight, even if the throughput
202 fluctuates, and regardless of: the device parameters, the current
203 workload and the budgets assigned to the queue.
204
205 - The last, budget-independence, property (although probably
206 counterintuitive in the first place) is definitely beneficial, for
207 the following reasons:
208
209 - First, with any proportional-share scheduler, the maximum
210 deviation with respect to an ideal service is proportional to
211 the maximum budget (slice) assigned to queues. As a consequence,
212 BFQ can keep this deviation tight not only because of the
213 accurate service of B-WF2Q+, but also because BFQ *does not*
214 need to assign a larger budget to a queue to let the queue
215 receive a higher fraction of the device throughput.
216
217 - Second, BFQ is free to choose, for every process (queue), the
218 budget that best fits the needs of the process, or best
219 leverages the I/O pattern of the process. In particular, BFQ
220 updates queue budgets with a simple feedback-loop algorithm that
221 allows a high throughput to be achieved, while still providing
222 tight latency guarantees to time-sensitive applications. When
223 the in-service queue expires, this algorithm computes the next
224 budget of the queue so as to:
225
226 - Let large budgets be eventually assigned to the queues
227 associated with I/O-bound applications performing sequential
228 I/O: in fact, the longer these applications are served once
229 got access to the device, the higher the throughput is.
230
231 - Let small budgets be eventually assigned to the queues
232 associated with time-sensitive applications (which typically
233 perform sporadic and short I/O), because, the smaller the
234 budget assigned to a queue waiting for service is, the sooner
235 B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
236
237- If several processes are competing for the device at the same time,
238 but all processes and groups have the same weight, then BFQ
239 guarantees the expected throughput distribution without ever idling
240 the device. It uses preemption instead. Throughput is then much
241 higher in this common scenario.
242
243- ioprio classes are served in strict priority order, i.e.,
244 lower-priority queues are not served as long as there are
245 higher-priority queues. Among queues in the same class, the
246 bandwidth is distributed in proportion to the weight of each
247 queue. A very thin extra bandwidth is however guaranteed to
248 the Idle class, to prevent it from starving.
249
250
2513. What are BFQ's tunable?
252==========================
253
254The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
255fifo_expire_sync below are the same as in CFQ. Their description is
256just copied from that for CFQ. Some considerations in the description
257of slice_idle are copied from CFQ too.
258
259per-process ioprio and weight
260-----------------------------
261
Arianna Avanzinie21b7a02017-04-12 18:23:08 +0200262Unless the cgroups interface is used (see "4. BFQ group scheduling"),
263weights can be assigned to processes only indirectly, through I/O
264priorities, and according to the relation:
265weight = (IOPRIO_BE_NR - ioprio) * 10.
266
267Beware that, if low-latency is set, then BFQ automatically raises the
268weight of the queues associated with interactive and soft real-time
269applications. Unset this tunable if you need/want to control weights.
Paolo Valenteaee69d72017-04-19 08:29:02 -0600270
271slice_idle
272----------
273
274This parameter specifies how long BFQ should idle for next I/O
275request, when certain sync BFQ queues become empty. By default
276slice_idle is a non-zero value. Idling has a double purpose: boosting
277throughput and making sure that the desired throughput distribution is
278respected (see the description of how BFQ works, and, if needed, the
279papers referred there).
280
281As for throughput, idling can be very helpful on highly seeky media
282like single spindle SATA/SAS disks where we can cut down on overall
283number of seeks and see improved throughput.
284
285Setting slice_idle to 0 will remove all the idling on queues and one
286should see an overall improved throughput on faster storage devices
287like multiple SATA/SAS disks in hardware RAID configuration.
288
289So depending on storage and workload, it might be useful to set
290slice_idle=0. In general for SATA/SAS disks and software RAID of
291SATA/SAS disks keeping slice_idle enabled should be useful. For any
292configurations where there are multiple spindles behind single LUN
293(Host based hardware RAID controller or for storage arrays), setting
294slice_idle=0 might end up in better throughput and acceptable
295latencies.
296
297Idling is however necessary to have service guarantees enforced in
298case of differentiated weights or differentiated I/O-request lengths.
299To see why, suppose that a given BFQ queue A must get several I/O
300requests served for each request served for another queue B. Idling
301ensures that, if A makes a new I/O request slightly after becoming
302empty, then no request of B is dispatched in the middle, and thus A
303does not lose the possibility to get more than one request dispatched
304before the next request of B is dispatched. Note that idling
305guarantees the desired differentiated treatment of queues only in
306terms of I/O-request dispatches. To guarantee that the actual service
307order then corresponds to the dispatch order, the strict_guarantees
308tunable must be set too.
309
310There is an important flipside for idling: apart from the above cases
311where it is beneficial also for throughput, idling can severely impact
312throughput. One important case is random workload. Because of this
313issue, BFQ tends to avoid idling as much as possible, when it is not
314beneficial also for throughput. As a consequence of this behavior, and
315of further issues described for the strict_guarantees tunable,
316short-term service guarantees may be occasionally violated. And, in
317some cases, these guarantees may be more important than guaranteeing
318maximum throughput. For example, in video playing/streaming, a very
319low drop rate may be more important than maximum throughput. In these
320cases, consider setting the strict_guarantees parameter.
321
322strict_guarantees
323-----------------
324
325If this parameter is set (default: unset), then BFQ
326
327- always performs idling when the in-service queue becomes empty;
328
329- forces the device to serve one I/O request at a time, by dispatching a
330 new request only if there is no outstanding request.
331
332In the presence of differentiated weights or I/O-request sizes, both
333the above conditions are needed to guarantee that every BFQ queue
334receives its allotted share of the bandwidth. The first condition is
335needed for the reasons explained in the description of the slice_idle
336tunable. The second condition is needed because all modern storage
337devices reorder internally-queued requests, which may trivially break
338the service guarantees enforced by the I/O scheduler.
339
340Setting strict_guarantees may evidently affect throughput.
341
342back_seek_max
343-------------
344
345This specifies, given in Kbytes, the maximum "distance" for backward seeking.
346The distance is the amount of space from the current head location to the
347sectors that are backward in terms of distance.
348
349This parameter allows the scheduler to anticipate requests in the "backward"
350direction and consider them as being the "next" if they are within this
351distance from the current head location.
352
353back_seek_penalty
354-----------------
355
356This parameter is used to compute the cost of backward seeking. If the
357backward distance of request is just 1/back_seek_penalty from a "front"
358request, then the seeking cost of two requests is considered equivalent.
359
360So scheduler will not bias toward one or the other request (otherwise scheduler
361will bias toward front request). Default value of back_seek_penalty is 2.
362
363fifo_expire_async
364-----------------
365
366This parameter is used to set the timeout of asynchronous requests. Default
367value of this is 248ms.
368
369fifo_expire_sync
370----------------
371
372This parameter is used to set the timeout of synchronous requests. Default
373value of this is 124ms. In case to favor synchronous requests over asynchronous
374one, this value should be decreased relative to fifo_expire_async.
375
376low_latency
377-----------
378
379This parameter is used to enable/disable BFQ's low latency mode. By
380default, low latency mode is enabled. If enabled, interactive and soft
381real-time applications are privileged and experience a lower latency,
382as explained in more detail in the description of how BFQ works.
383
Paolo Valente43c1b3d2017-05-09 12:54:23 +0200384DISABLE this mode if you need full control on bandwidth
Paolo Valente44e44a12017-04-12 18:23:12 +0200385distribution. In fact, if it is enabled, then BFQ automatically
386increases the bandwidth share of privileged applications, as the main
387means to guarantee a lower latency to them.
388
Paolo Valente43c1b3d2017-05-09 12:54:23 +0200389In addition, as already highlighted at the beginning of this document,
390DISABLE this mode if your only goal is to achieve a high throughput.
391In fact, privileging the I/O of some application over the rest may
392entail a lower throughput. To achieve the highest-possible throughput
393on a non-rotational device, setting slice_idle to 0 may be needed too
394(at the cost of giving up any strong guarantee on fairness and low
395latency).
396
Paolo Valenteaee69d72017-04-19 08:29:02 -0600397timeout_sync
398------------
399
400Maximum amount of device time that can be given to a task (queue) once
401it has been selected for service. On devices with costly seeks,
402increasing this time usually increases maximum throughput. On the
403opposite end, increasing this time coarsens the granularity of the
404short-term bandwidth and latency guarantees, especially if the
405following parameter is set to zero.
406
407max_budget
408----------
409
410Maximum amount of service, measured in sectors, that can be provided
411to a BFQ queue once it is set in service (of course within the limits
412of the above timeout). According to what said in the description of
413the algorithm, larger values increase the throughput in proportion to
414the percentage of sequential I/O requests issued. The price of larger
415values is that they coarsen the granularity of short-term bandwidth
416and latency guarantees.
417
418The default value is 0, which enables auto-tuning: BFQ sets max_budget
419to the maximum number of sectors that can be served during
420timeout_sync, according to the estimated peak rate.
421
422weights
423-------
424
425Read-only parameter, used to show the weights of the currently active
426BFQ queues.
427
428
Paolo Valenteaee69d72017-04-19 08:29:02 -06004294. Group scheduling with BFQ
430============================
431
Arianna Avanzinie21b7a02017-04-12 18:23:08 +0200432BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
433blkio and io. In particular, BFQ supports weight-based proportional
434share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
Paolo Valenteaee69d72017-04-19 08:29:02 -0600435
4364-1 Service guarantees provided
437-------------------------------
438
439With BFQ, proportional share means true proportional share of the
440device bandwidth, according to group weights. For example, a group
441with weight 200 gets twice the bandwidth, and not just twice the time,
442of a group with weight 100.
443
444BFQ supports hierarchies (group trees) of any depth. Bandwidth is
445distributed among groups and processes in the expected way: for each
446group, the children of the group share the whole bandwidth of the
447group in proportion to their weights. In particular, this implies
448that, for each leaf group, every process of the group receives the
449same share of the whole group bandwidth, unless the ioprio of the
450process is modified.
451
452The resource-sharing guarantee for a group may partially or totally
453switch from bandwidth to time, if providing bandwidth guarantees to
454the group lowers the throughput too much. This switch occurs on a
455per-process basis: if a process of a leaf group causes throughput loss
456if served in such a way to receive its share of the bandwidth, then
457BFQ switches back to just time-based proportional share for that
458process.
459
4604-2 Interface
461-------------
462
463To get proportional sharing of bandwidth with BFQ for a given device,
464BFQ must of course be the active scheduler for that device.
465
466Within each group directory, the names of the files associated with
467BFQ-specific cgroup parameters and stats begin with the "bfq."
468prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
469BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
470parameter to set the weight of a group with BFQ is blkio.bfq.weight
471or io.bfq.weight.
472
473Parameters to set
474-----------------
475
476For each group, there is only the following parameter to set.
477
478weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
479group inside its parent. Available values: 1..10000 (default 100). The
480linear mapping between ioprio and weights, described at the beginning
481of the tunable section, is still valid, but all weights higher than
482IOPRIO_BE_NR*10 are mapped to ioprio 0.
483
Paolo Valente44e44a12017-04-12 18:23:12 +0200484Recall that, if low-latency is set, then BFQ automatically raises the
485weight of the queues associated with interactive and soft real-time
486applications. Unset this tunable if you need/want to control weights.
487
Paolo Valenteaee69d72017-04-19 08:29:02 -0600488
489[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
490 Scheduler", Proceedings of the First Workshop on Mobile System
491 Technologies (MST-2015), May 2015.
492 http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
493
494[2] P. Valente and M. Andreolini, "Improving Application
495 Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
496 the 5th Annual International Systems and Storage Conference
497 (SYSTOR '12), June 2012.
498 Slightly extended version:
499 http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
500 results.pdf