Blame - Documentation/block/bfq-iosched.txt - SHIFTPHONES/mainline/linux

blob: 03ff4cc7991188e15932644c3ee6b71050a91f8d [file] [log] [blame]

Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	1	BFQ (Budget Fair Queueing)
				2	==========================
				3
				4	BFQ is a proportional-share I/O scheduler, with some extra
				5	low-latency capabilities. In addition to cgroups support (blkio or io
				6	controllers), BFQ's main features are:
				7	- BFQ guarantees a high system and application responsiveness, and a
				8	low latency for time-sensitive applications, such as audio or video
				9	players;
				10	- BFQ distributes bandwidth, and not just time, among processes or
				11	groups (switching back to time distribution when needed to keep
				12	throughput high).
				13
Paolo Valente	43c1b3d	2017-05-09 12:54:23 +0200	[diff] [blame]	14	In its default configuration, BFQ privileges latency over
				15	throughput. So, when needed for achieving a lower latency, BFQ builds
				16	schedules that may lead to a lower throughput. If your main or only
				17	goal, for a given device, is to achieve the maximum-possible
				18	throughput at all times, then do switch off all low-latency heuristics
Paolo Valente	233f0bf	2017-08-31 20:00:30 +0200	[diff] [blame^]	19	for that device, by setting low_latency to 0. See Section 3 for
				20	details on how to configure BFQ for the desired tradeoff between
				21	latency and throughput, or on how to maximize throughput.
Paolo Valente	43c1b3d	2017-05-09 12:54:23 +0200	[diff] [blame]	22
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	23	On average CPUs, the current version of BFQ can handle devices
				24	performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
				25	reference, 30-50 KIOPS correspond to very high bandwidths with
				26	sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
Paolo Valente	233f0bf	2017-08-31 20:00:30 +0200	[diff] [blame^]	27	to 120-200 MB/s with 4KB random I/O. BFQ is currently being tested on
				28	multi-queue devices too.
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	29
				30	The table of contents follow. Impatients can just jump to Section 3.
				31
				32	CONTENTS
				33
				34	1. When may BFQ be useful?
				35	1-1 Personal systems
				36	1-2 Server systems
				37	2. How does BFQ work?
				38	3. What are BFQ's tunable?
				39	4. BFQ group scheduling
				40	4-1 Service guarantees provided
				41	4-2 Interface
				42
				43	1. When may BFQ be useful?
				44	==========================
				45
				46	BFQ provides the following benefits on personal and server systems.
				47
				48	1-1 Personal systems
				49	--------------------
				50
				51	Low latency for interactive applications
				52
				53	Regardless of the actual background workload, BFQ guarantees that, for
				54	interactive tasks, the storage device is virtually as responsive as if
				55	it was idle. For example, even if one or more of the following
				56	background workloads are being executed:
				57	- one or more large files are being read, written or copied,
				58	- a tree of source files is being compiled,
				59	- one or more virtual machines are performing I/O,
				60	- a software update is in progress,
				61	- indexing daemons are scanning filesystems and updating their
				62	databases,
				63	starting an application or loading a file from within an application
				64	takes about the same time as if the storage device was idle. As a
				65	comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
				66	applications experience high latencies, or even become unresponsive
				67	until the background workload terminates (also on SSDs).
				68
				69	Low latency for soft real-time applications
				70
				71	Also soft real-time applications, such as audio and video
				72	players/streamers, enjoy a low latency and a low drop rate, regardless
				73	of the background I/O workload. As a consequence, these applications
				74	do not suffer from almost any glitch due to the background workload.
				75
				76	Higher speed for code-development tasks
				77
				78	If some additional workload happens to be executed in parallel, then
				79	BFQ executes the I/O-related components of typical code-development
				80	tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
				81	NOOP or DEADLINE.
				82
				83	High throughput
				84
				85	On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
				86	up to 150% higher throughput than DEADLINE and NOOP, with all the
				87	sequential workloads considered in our tests. With random workloads,
				88	and with all the workloads on flash-based devices, BFQ achieves,
				89	instead, about the same throughput as the other schedulers.
				90
				91	Strong fairness, bandwidth and delay guarantees
				92
				93	BFQ distributes the device throughput, and not just the device time,
				94	among I/O-bound applications in proportion their weights, with any
				95	workload and regardless of the device parameters. From these bandwidth
				96	guarantees, it is possible to compute tight per-I/O-request delay
				97	guarantees by a simple formula. If not configured for strict service
				98	guarantees, BFQ switches to time-based resource sharing (only) for
				99	applications that would otherwise cause a throughput loss.
				100
				101	1-2 Server systems
				102	------------------
				103
				104	Most benefits for server systems follow from the same service
				105	properties as above. In particular, regardless of whether additional,
				106	possibly heavy workloads are being served, BFQ guarantees:
				107
				108	. audio and video-streaming with zero or very low jitter and drop
				109	rate;
				110
				111	. fast retrieval of WEB pages and embedded objects;
				112
				113	. real-time recording of data in live-dumping applications (e.g.,
				114	packet logging);
				115
				116	. responsiveness in local and remote access to a server.
				117
				118
				119	2. How does BFQ work?
				120	=====================
				121
				122	BFQ is a proportional-share I/O scheduler, whose general structure,
				123	plus a lot of code, are borrowed from CFQ.
				124
				125	- Each process doing I/O on a device is associated with a weight and a
				126	(bfq_)queue.
				127
				128	- BFQ grants exclusive access to the device, for a while, to one queue
				129	(process) at a time, and implements this service model by
				130	associating every queue with a budget, measured in number of
				131	sectors.
				132
				133	- After a queue is granted access to the device, the budget of the
				134	queue is decremented, on each request dispatch, by the size of the
				135	request.
				136
				137	- The in-service queue is expired, i.e., its service is suspended,
				138	only if one of the following events occurs: 1) the queue finishes
				139	its budget, 2) the queue empties, 3) a "budget timeout" fires.
				140
				141	- The budget timeout prevents processes doing random I/O from
				142	holding the device for too long and dramatically reducing
				143	throughput.
				144
				145	- Actually, as in CFQ, a queue associated with a process issuing
				146	sync requests may not be expired immediately when it empties. In
				147	contrast, BFQ may idle the device for a short time interval,
				148	giving the process the chance to go on being served if it issues
				149	a new request in time. Device idling typically boosts the
				150	throughput on rotational devices, if processes do synchronous
				151	and sequential I/O. In addition, under BFQ, device idling is
				152	also instrumental in guaranteeing the desired throughput
				153	fraction to processes issuing sync requests (see the description
				154	of the slice_idle tunable in this document, or [1, 2], for more
				155	details).
				156
				157	- With respect to idling for service guarantees, if several
				158	processes are competing for the device at the same time, but
Paolo Valente	233f0bf	2017-08-31 20:00:30 +0200	[diff] [blame^]	159	all processes and groups have the same weight, then BFQ
				160	guarantees the expected throughput distribution without ever
				161	idling the device. Throughput is thus as high as possible in
				162	this common scenario.
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	163
				164	- If low-latency mode is enabled (default configuration), BFQ
				165	executes some special heuristics to detect interactive and soft
				166	real-time applications (e.g., video or audio players/streamers),
				167	and to reduce their latency. The most important action taken to
				168	achieve this goal is to give to the queues associated with these
				169	applications more than their fair share of the device
				170	throughput. For brevity, we call just "weight-raising" the whole
				171	sets of actions taken by BFQ to privilege these queues. In
				172	particular, BFQ provides a milder form of weight-raising for
				173	interactive applications, and a stronger form for soft real-time
				174	applications.
				175
				176	- BFQ automatically deactivates idling for queues born in a burst of
				177	queue creations. In fact, these queues are usually associated with
				178	the processes of applications and services that benefit mostly
				179	from a high throughput. Examples are systemd during boot, or git
				180	grep.
				181
				182	- As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
				183	performing random I/O that becomes mostly sequential if
				184	merged. Differently from CFQ, BFQ achieves this goal with a more
				185	reactive mechanism, called Early Queue Merge (EQM). EQM is so
				186	responsive in detecting interleaved I/O (cooperating processes),
				187	that it enables BFQ to achieve a high throughput, by queue
				188	merging, even for queues for which CFQ needs a different
				189	mechanism, preemption, to get a high throughput. As such EQM is a
				190	unified mechanism to achieve a high throughput with interleaved
				191	I/O.
				192
				193	- Queues are scheduled according to a variant of WF2Q+, named
				194	B-WF2Q+, and implemented using an augmented rb-tree to preserve an
				195	O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
Paolo Valente	233f0bf	2017-08-31 20:00:30 +0200	[diff] [blame^]	196	also ready for hierarchical scheduling, details in Section 4.
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	197
				198	- B-WF2Q+ guarantees a tight deviation with respect to an ideal,
				199	perfectly fair, and smooth service. In particular, B-WF2Q+
				200	guarantees that each queue receives a fraction of the device
				201	throughput proportional to its weight, even if the throughput
				202	fluctuates, and regardless of: the device parameters, the current
				203	workload and the budgets assigned to the queue.
				204
				205	- The last, budget-independence, property (although probably
				206	counterintuitive in the first place) is definitely beneficial, for
				207	the following reasons:
				208
				209	- First, with any proportional-share scheduler, the maximum
				210	deviation with respect to an ideal service is proportional to
				211	the maximum budget (slice) assigned to queues. As a consequence,
				212	BFQ can keep this deviation tight not only because of the
				213	accurate service of B-WF2Q+, but also because BFQ does not
				214	need to assign a larger budget to a queue to let the queue
				215	receive a higher fraction of the device throughput.
				216
				217	- Second, BFQ is free to choose, for every process (queue), the
				218	budget that best fits the needs of the process, or best
				219	leverages the I/O pattern of the process. In particular, BFQ
				220	updates queue budgets with a simple feedback-loop algorithm that
				221	allows a high throughput to be achieved, while still providing
				222	tight latency guarantees to time-sensitive applications. When
				223	the in-service queue expires, this algorithm computes the next
				224	budget of the queue so as to:
				225
				226	- Let large budgets be eventually assigned to the queues
				227	associated with I/O-bound applications performing sequential
				228	I/O: in fact, the longer these applications are served once
				229	got access to the device, the higher the throughput is.
				230
				231	- Let small budgets be eventually assigned to the queues
				232	associated with time-sensitive applications (which typically
				233	perform sporadic and short I/O), because, the smaller the
				234	budget assigned to a queue waiting for service is, the sooner
				235	B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
				236
				237	- If several processes are competing for the device at the same time,
				238	but all processes and groups have the same weight, then BFQ
				239	guarantees the expected throughput distribution without ever idling
				240	the device. It uses preemption instead. Throughput is then much
				241	higher in this common scenario.
				242
				243	- ioprio classes are served in strict priority order, i.e.,
				244	lower-priority queues are not served as long as there are
				245	higher-priority queues. Among queues in the same class, the
				246	bandwidth is distributed in proportion to the weight of each
				247	queue. A very thin extra bandwidth is however guaranteed to
				248	the Idle class, to prevent it from starving.
				249
				250
				251	3. What are BFQ's tunable?
				252	==========================
				253
				254	The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
				255	fifo_expire_sync below are the same as in CFQ. Their description is
				256	just copied from that for CFQ. Some considerations in the description
				257	of slice_idle are copied from CFQ too.
				258
				259	per-process ioprio and weight
				260	-----------------------------
				261
Arianna Avanzini	e21b7a0	2017-04-12 18:23:08 +0200	[diff] [blame]	262	Unless the cgroups interface is used (see "4. BFQ group scheduling"),
				263	weights can be assigned to processes only indirectly, through I/O
				264	priorities, and according to the relation:
				265	weight = (IOPRIO_BE_NR - ioprio) * 10.
				266
				267	Beware that, if low-latency is set, then BFQ automatically raises the
				268	weight of the queues associated with interactive and soft real-time
				269	applications. Unset this tunable if you need/want to control weights.
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	270
				271	slice_idle
				272	----------
				273
				274	This parameter specifies how long BFQ should idle for next I/O
				275	request, when certain sync BFQ queues become empty. By default
				276	slice_idle is a non-zero value. Idling has a double purpose: boosting
				277	throughput and making sure that the desired throughput distribution is
				278	respected (see the description of how BFQ works, and, if needed, the
				279	papers referred there).
				280
				281	As for throughput, idling can be very helpful on highly seeky media
				282	like single spindle SATA/SAS disks where we can cut down on overall
				283	number of seeks and see improved throughput.
				284
				285	Setting slice_idle to 0 will remove all the idling on queues and one
				286	should see an overall improved throughput on faster storage devices
				287	like multiple SATA/SAS disks in hardware RAID configuration.
				288
				289	So depending on storage and workload, it might be useful to set
				290	slice_idle=0. In general for SATA/SAS disks and software RAID of
				291	SATA/SAS disks keeping slice_idle enabled should be useful. For any
				292	configurations where there are multiple spindles behind single LUN
				293	(Host based hardware RAID controller or for storage arrays), setting
				294	slice_idle=0 might end up in better throughput and acceptable
				295	latencies.
				296
				297	Idling is however necessary to have service guarantees enforced in
				298	case of differentiated weights or differentiated I/O-request lengths.
				299	To see why, suppose that a given BFQ queue A must get several I/O
				300	requests served for each request served for another queue B. Idling
				301	ensures that, if A makes a new I/O request slightly after becoming
				302	empty, then no request of B is dispatched in the middle, and thus A
				303	does not lose the possibility to get more than one request dispatched
				304	before the next request of B is dispatched. Note that idling
				305	guarantees the desired differentiated treatment of queues only in
				306	terms of I/O-request dispatches. To guarantee that the actual service
				307	order then corresponds to the dispatch order, the strict_guarantees
				308	tunable must be set too.
				309
				310	There is an important flipside for idling: apart from the above cases
				311	where it is beneficial also for throughput, idling can severely impact
				312	throughput. One important case is random workload. Because of this
				313	issue, BFQ tends to avoid idling as much as possible, when it is not
				314	beneficial also for throughput. As a consequence of this behavior, and
				315	of further issues described for the strict_guarantees tunable,
				316	short-term service guarantees may be occasionally violated. And, in
				317	some cases, these guarantees may be more important than guaranteeing
				318	maximum throughput. For example, in video playing/streaming, a very
				319	low drop rate may be more important than maximum throughput. In these
				320	cases, consider setting the strict_guarantees parameter.
				321
				322	strict_guarantees
				323	-----------------
				324
				325	If this parameter is set (default: unset), then BFQ
				326
				327	- always performs idling when the in-service queue becomes empty;
				328
				329	- forces the device to serve one I/O request at a time, by dispatching a
				330	new request only if there is no outstanding request.
				331
				332	In the presence of differentiated weights or I/O-request sizes, both
				333	the above conditions are needed to guarantee that every BFQ queue
				334	receives its allotted share of the bandwidth. The first condition is
				335	needed for the reasons explained in the description of the slice_idle
				336	tunable. The second condition is needed because all modern storage
				337	devices reorder internally-queued requests, which may trivially break
				338	the service guarantees enforced by the I/O scheduler.
				339
				340	Setting strict_guarantees may evidently affect throughput.
				341
				342	back_seek_max
				343	-------------
				344
				345	This specifies, given in Kbytes, the maximum "distance" for backward seeking.
				346	The distance is the amount of space from the current head location to the
				347	sectors that are backward in terms of distance.
				348
				349	This parameter allows the scheduler to anticipate requests in the "backward"
				350	direction and consider them as being the "next" if they are within this
				351	distance from the current head location.
				352
				353	back_seek_penalty
				354	-----------------
				355
				356	This parameter is used to compute the cost of backward seeking. If the
				357	backward distance of request is just 1/back_seek_penalty from a "front"
				358	request, then the seeking cost of two requests is considered equivalent.
				359
				360	So scheduler will not bias toward one or the other request (otherwise scheduler
				361	will bias toward front request). Default value of back_seek_penalty is 2.
				362
				363	fifo_expire_async
				364	-----------------
				365
				366	This parameter is used to set the timeout of asynchronous requests. Default
				367	value of this is 248ms.
				368
				369	fifo_expire_sync
				370	----------------
				371
				372	This parameter is used to set the timeout of synchronous requests. Default
				373	value of this is 124ms. In case to favor synchronous requests over asynchronous
				374	one, this value should be decreased relative to fifo_expire_async.
				375
				376	low_latency
				377	-----------
				378
				379	This parameter is used to enable/disable BFQ's low latency mode. By
				380	default, low latency mode is enabled. If enabled, interactive and soft
				381	real-time applications are privileged and experience a lower latency,
				382	as explained in more detail in the description of how BFQ works.
				383
Paolo Valente	43c1b3d	2017-05-09 12:54:23 +0200	[diff] [blame]	384	DISABLE this mode if you need full control on bandwidth
Paolo Valente	44e44a1	2017-04-12 18:23:12 +0200	[diff] [blame]	385	distribution. In fact, if it is enabled, then BFQ automatically
				386	increases the bandwidth share of privileged applications, as the main
				387	means to guarantee a lower latency to them.
				388
Paolo Valente	43c1b3d	2017-05-09 12:54:23 +0200	[diff] [blame]	389	In addition, as already highlighted at the beginning of this document,
				390	DISABLE this mode if your only goal is to achieve a high throughput.
				391	In fact, privileging the I/O of some application over the rest may
				392	entail a lower throughput. To achieve the highest-possible throughput
				393	on a non-rotational device, setting slice_idle to 0 may be needed too
				394	(at the cost of giving up any strong guarantee on fairness and low
				395	latency).
				396
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	397	timeout_sync
				398	------------
				399
				400	Maximum amount of device time that can be given to a task (queue) once
				401	it has been selected for service. On devices with costly seeks,
				402	increasing this time usually increases maximum throughput. On the
				403	opposite end, increasing this time coarsens the granularity of the
				404	short-term bandwidth and latency guarantees, especially if the
				405	following parameter is set to zero.
				406
				407	max_budget
				408	----------
				409
				410	Maximum amount of service, measured in sectors, that can be provided
				411	to a BFQ queue once it is set in service (of course within the limits
				412	of the above timeout). According to what said in the description of
				413	the algorithm, larger values increase the throughput in proportion to
				414	the percentage of sequential I/O requests issued. The price of larger
				415	values is that they coarsen the granularity of short-term bandwidth
				416	and latency guarantees.
				417
				418	The default value is 0, which enables auto-tuning: BFQ sets max_budget
				419	to the maximum number of sectors that can be served during
				420	timeout_sync, according to the estimated peak rate.
				421
				422	weights
				423	-------
				424
				425	Read-only parameter, used to show the weights of the currently active
				426	BFQ queues.
				427
				428
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	429	4. Group scheduling with BFQ
				430	============================
				431
Arianna Avanzini	e21b7a0	2017-04-12 18:23:08 +0200	[diff] [blame]	432	BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
				433	blkio and io. In particular, BFQ supports weight-based proportional
				434	share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	435
				436	4-1 Service guarantees provided
				437	-------------------------------
				438
				439	With BFQ, proportional share means true proportional share of the
				440	device bandwidth, according to group weights. For example, a group
				441	with weight 200 gets twice the bandwidth, and not just twice the time,
				442	of a group with weight 100.
				443
				444	BFQ supports hierarchies (group trees) of any depth. Bandwidth is
				445	distributed among groups and processes in the expected way: for each
				446	group, the children of the group share the whole bandwidth of the
				447	group in proportion to their weights. In particular, this implies
				448	that, for each leaf group, every process of the group receives the
				449	same share of the whole group bandwidth, unless the ioprio of the
				450	process is modified.
				451
				452	The resource-sharing guarantee for a group may partially or totally
				453	switch from bandwidth to time, if providing bandwidth guarantees to
				454	the group lowers the throughput too much. This switch occurs on a
				455	per-process basis: if a process of a leaf group causes throughput loss
				456	if served in such a way to receive its share of the bandwidth, then
				457	BFQ switches back to just time-based proportional share for that
				458	process.
				459
				460	4-2 Interface
				461	-------------
				462
				463	To get proportional sharing of bandwidth with BFQ for a given device,
				464	BFQ must of course be the active scheduler for that device.
				465
				466	Within each group directory, the names of the files associated with
				467	BFQ-specific cgroup parameters and stats begin with the "bfq."
				468	prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
				469	BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
				470	parameter to set the weight of a group with BFQ is blkio.bfq.weight
				471	or io.bfq.weight.
				472
				473	Parameters to set
				474	-----------------
				475
				476	For each group, there is only the following parameter to set.
				477
				478	weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
				479	group inside its parent. Available values: 1..10000 (default 100). The
				480	linear mapping between ioprio and weights, described at the beginning
				481	of the tunable section, is still valid, but all weights higher than
				482	IOPRIO_BE_NR*10 are mapped to ioprio 0.
				483
Paolo Valente	44e44a1	2017-04-12 18:23:12 +0200	[diff] [blame]	484	Recall that, if low-latency is set, then BFQ automatically raises the
				485	weight of the queues associated with interactive and soft real-time
				486	applications. Unset this tunable if you need/want to control weights.
				487
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	488
				489	[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
				490	Scheduler", Proceedings of the First Workshop on Mobile System
				491	Technologies (MST-2015), May 2015.
				492	http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
				493
				494	[2] P. Valente and M. Andreolini, "Improving Application
				495	Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
				496	the 5th Annual International Systems and Storage Conference
				497	(SYSTOR '12), June 2012.
				498	Slightly extended version:
				499	http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
				500	results.pdf