Blame - Documentation/block/bfq-iosched.txt - SHIFTPHONES/kernel/shift/mainline

Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	1	BFQ (Budget Fair Queueing)
				2	==========================
				3
				4	BFQ is a proportional-share I/O scheduler, with some extra
				5	low-latency capabilities. In addition to cgroups support (blkio or io
				6	controllers), BFQ's main features are:
				7	- BFQ guarantees a high system and application responsiveness, and a
				8	low latency for time-sensitive applications, such as audio or video
				9	players;
				10	- BFQ distributes bandwidth, and not just time, among processes or
				11	groups (switching back to time distribution when needed to keep
				12	throughput high).
				13
				14	On average CPUs, the current version of BFQ can handle devices
				15	performing at most ~30 KIOPS; at most ~50 KIOPS on faster CPUs. As a
				16	reference, 30-50 KIOPS correspond to very high bandwidths with
				17	sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
				18	to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
				19	multi-queue devices.
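The bandwidth figures above follow from multiplying the request rate by the request size. A minimal sketch of that arithmetic, assuming decimal units (1 KB = 1000 bytes); the function name is illustrative only:

```python
def bandwidth_bytes_per_sec(iops: int, request_bytes: int) -> int:
    """Bandwidth sustained at a given request rate and request size."""
    return iops * request_bytes


KB = 1000  # assumption: decimal kilobytes

for iops in (30_000, 50_000):
    sequential = bandwidth_bytes_per_sec(iops, 256 * KB)
    random_io = bandwidth_bytes_per_sec(iops, 4 * KB)
    print(f"{iops} IOPS: {sequential / 1e9:.2f} GB/s with 256 KB requests, "
          f"{random_io / 1e6:.0f} MB/s with 4 KB requests")
```

With these assumptions the sequential case comes out at roughly 7.7 and 12.8 GB/s, which the text rounds to 8-12 GB/s.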
				20
				21	The table of contents follows. Impatient readers can jump straight to Section 3.
				22
				23	CONTENTS
				24
				25	1. When may BFQ be useful?
				26	1-1 Personal systems
				27	1-2 Server systems
				28	2. How does BFQ work?
				29	3. What are BFQ's tunables?
				30	4. BFQ group scheduling
				31	4-1 Service guarantees provided
				32	4-2 Interface
				33
				34	1. When may BFQ be useful?
				35	==========================
				36
				37	BFQ provides the following benefits on personal and server systems.
				38
				39	1-1 Personal systems
				40	--------------------
				41
				42	Low latency for interactive applications
				43
				44	Regardless of the actual background workload, BFQ guarantees that, for
				45	interactive tasks, the storage device is virtually as responsive as if
				46	it was idle. For example, even if one or more of the following
				47	background workloads are being executed:
				48	- one or more large files are being read, written or copied,
				49	- a tree of source files is being compiled,
				50	- one or more virtual machines are performing I/O,
				51	- a software update is in progress,
				52	- indexing daemons are scanning filesystems and updating their
				53	databases,
				54	starting an application or loading a file from within an application
				55	takes about the same time as if the storage device was idle. As a
				56	comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
				57	applications experience high latencies, or even become unresponsive
				58	until the background workload terminates (also on SSDs).
				59
				60	Low latency for soft real-time applications
				61
				62	Soft real-time applications, such as audio and video
				63	players/streamers, also enjoy a low latency and a low drop rate,
				64	regardless of the background I/O workload. As a consequence, these
				65	applications suffer almost no glitches due to the background workload.
				66
				67	Higher speed for code-development tasks
				68
				69	If some additional workload happens to be executed in parallel, then
				70	BFQ executes the I/O-related components of typical code-development
				71	tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
				72	NOOP or DEADLINE.
				73
				74	High throughput
				75
				76	On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
				77	up to 150% higher throughput than DEADLINE and NOOP, with all the
				78	sequential workloads considered in our tests. With random workloads,
				79	and with all the workloads on flash-based devices, BFQ achieves,
				80	instead, about the same throughput as the other schedulers.
				81
				82	Strong fairness, bandwidth and delay guarantees
				83
				84	BFQ distributes the device throughput, and not just the device time,
				85	among I/O-bound applications in proportion to their weights, with any
				86	workload and regardless of the device parameters. From these bandwidth
				87	guarantees, it is possible to compute tight per-I/O-request delay
				88	guarantees by a simple formula. If not configured for strict service
				89	guarantees, BFQ switches to time-based resource sharing (only) for
				90	applications that would otherwise cause a throughput loss.
				91
				92	1-2 Server systems
				93	------------------
				94
				95	Most benefits for server systems follow from the same service
				96	properties as above. In particular, regardless of whether additional,
				97	possibly heavy workloads are being served, BFQ guarantees:
				98
				99	. audio and video-streaming with zero or very low jitter and drop
				100	rate;
				101
				102	. fast retrieval of WEB pages and embedded objects;
				103
				104	. real-time recording of data in live-dumping applications (e.g.,
				105	packet logging);
				106
				107	. responsiveness in local and remote access to a server.
				108
				109
				110	2. How does BFQ work?
				111	=====================
				112
				113	BFQ is a proportional-share I/O scheduler, whose general structure,
				114	plus a lot of code, are borrowed from CFQ.
				115
				116	- Each process doing I/O on a device is associated with a weight and a
				117	(bfq_)queue.
				118
				119	- BFQ grants exclusive access to the device, for a while, to one queue
				120	(process) at a time, and implements this service model by
				121	associating every queue with a budget, measured in number of
				122	sectors.
				123
				124	- After a queue is granted access to the device, the budget of the
				125	queue is decremented, on each request dispatch, by the size of the
				126	request.
				127
				128	- The in-service queue is expired, i.e., its service is suspended,
				129	only if one of the following events occurs: 1) the queue finishes
				130	its budget, 2) the queue empties, 3) a "budget timeout" fires.
				131
				132	- The budget timeout prevents processes doing random I/O from
				133	holding the device for too long and dramatically reducing
				134	throughput.
				135
				136	- Actually, as in CFQ, a queue associated with a process issuing
				137	sync requests may not be expired immediately when it empties. In
				138	Instead, BFQ may idle the device for a short time interval,
				139	giving the process the chance to go on being served if it issues
				140	a new request in time. Device idling typically boosts the
				141	throughput on rotational devices, if processes do synchronous
				142	and sequential I/O. In addition, under BFQ, device idling is
				143	also instrumental in guaranteeing the desired throughput
				144	fraction to processes issuing sync requests (see the description
				145	of the slice_idle tunable in this document, or [1, 2], for more
				146	details).
				147
				148	- With respect to idling for service guarantees, if several
				149	processes are competing for the device at the same time, but
				150	all processes (and groups) have
				151	the same weight, then BFQ guarantees the expected throughput
				152	distribution without ever idling the device. Throughput is
				153	thus as high as possible in this common scenario.
				154
				155	- If low-latency mode is enabled (default configuration), BFQ
				156	executes some special heuristics to detect interactive and soft
				157	real-time applications (e.g., video or audio players/streamers),
				158	and to reduce their latency. The most important action taken to
				159	achieve this goal is to give to the queues associated with these
				160	applications more than their fair share of the device
				161	throughput. For brevity, we simply call "weight-raising" the whole
				162	set of actions taken by BFQ to privilege these queues. In
				163	particular, BFQ provides a milder form of weight-raising for
				164	interactive applications, and a stronger form for soft real-time
				165	applications.
				166
				167	- BFQ automatically deactivates idling for queues born in a burst of
				168	queue creations. In fact, these queues are usually associated with
				169	the processes of applications and services that benefit mostly
				170	from a high throughput. Examples are systemd during boot, or git
				171	grep.
				172
				173	- Like CFQ, BFQ merges queues performing interleaved I/O, i.e.,
				174	performing random I/O that becomes mostly sequential if
				175	merged. Unlike CFQ, BFQ achieves this goal with a more
				176	reactive mechanism, called Early Queue Merge (EQM). EQM is so
				177	responsive in detecting interleaved I/O (cooperating processes),
				178	that it enables BFQ to achieve a high throughput, by queue
				179	merging, even for queues for which CFQ needs a different
				180	mechanism, preemption, to get a high throughput. As such, EQM is a
				181	unified mechanism to achieve a high throughput with interleaved
				182	I/O.
				183
				184	- Queues are scheduled according to a variant of WF2Q+, named
				185	B-WF2Q+, and implemented using an augmented rb-tree to preserve an
				186	O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
				187	also ready for hierarchical scheduling. However, for a cleaner
				188	logical breakdown, the code that enables and completes
				189	hierarchical support is provided in the next commit, which focuses
				190	exactly on this feature.
				191
				192	- B-WF2Q+ guarantees a tight deviation with respect to an ideal,
				193	perfectly fair, and smooth service. In particular, B-WF2Q+
				194	guarantees that each queue receives a fraction of the device
				195	throughput proportional to its weight, even if the throughput
				196	fluctuates, and regardless of: the device parameters, the current
				197	workload and the budgets assigned to the queue.
				198
				199	- The last, budget-independence, property (although probably
				200	counterintuitive in the first place) is definitely beneficial, for
				201	the following reasons:
				202
				203	- First, with any proportional-share scheduler, the maximum
				204	deviation with respect to an ideal service is proportional to
				205	the maximum budget (slice) assigned to queues. As a consequence,
				206	BFQ can keep this deviation tight not only because of the
				207	accurate service of B-WF2Q+, but also because BFQ does not
				208	need to assign a larger budget to a queue to let the queue
				209	receive a higher fraction of the device throughput.
				210
				211	- Second, BFQ is free to choose, for every process (queue), the
				212	budget that best fits the needs of the process, or best
				213	leverages the I/O pattern of the process. In particular, BFQ
				214	updates queue budgets with a simple feedback-loop algorithm that
				215	allows a high throughput to be achieved, while still providing
				216	tight latency guarantees to time-sensitive applications. When
				217	the in-service queue expires, this algorithm computes the next
				218	budget of the queue so as to:
				219
				220	- Let large budgets be eventually assigned to the queues
				221	associated with I/O-bound applications performing sequential
				222	I/O: in fact, the longer these applications are served once
				223	they get access to the device, the higher the throughput is.
				224
				225	- Let small budgets be eventually assigned to the queues
				226	associated with time-sensitive applications (which typically
				227	perform sporadic and short I/O), because, the smaller the
				228	budget assigned to a queue waiting for service is, the sooner
				229	B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
				230
				231	- If several processes are competing for the device at the same time,
				232	but all processes and groups have the same weight, then BFQ
				233	guarantees the expected throughput distribution without ever idling
				234	the device. It uses preemption instead. Throughput is then much
				235	higher in this common scenario.
				236
				237	- ioprio classes are served in strict priority order, i.e.,
				238	lower-priority queues are not served as long as there are
				239	higher-priority queues. Among queues in the same class, the
				240	bandwidth is distributed in proportion to the weight of each
				241	queue. A very thin extra bandwidth is however guaranteed to
				242	the Idle class, to prevent it from starving.
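The service model described in this section (a budget in sectors, decremented on each dispatch, with three expiration events) can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not BFQ's code: it ignores idling and weight-raising, and the function name, the (sectors, service time) request format and the reason strings are all hypothetical.

```python
from collections import deque


def serve_until_expired(budget, requests, timeout_ms):
    """Serve one queue until it expires.

    budget     -- budget assigned to the queue, in sectors
    requests   -- iterable of (sectors, service_time_ms) pairs, in arrival order
    timeout_ms -- budget timeout, in milliseconds

    Returns (reason, sectors_served).
    """
    pending = deque(requests)
    remaining = budget
    elapsed_ms = 0.0
    served = 0
    while True:
        if not pending:
            return "queue empty", served
        if elapsed_ms >= timeout_ms:
            return "budget timeout", served
        sectors, service_time_ms = pending[0]
        if sectors > remaining:
            return "budget exhausted", served
        pending.popleft()
        remaining -= sectors          # budget decremented on each dispatch
        served += sectors
        elapsed_ms += service_time_ms


# Sequential reader: large, cheap requests use up the budget.
print(serve_until_expired(100, [(40, 1)] * 3, timeout_ms=100))
# Random reader: small, slow requests hit the budget timeout first.
print(serve_until_expired(1000, [(8, 10)] * 20, timeout_ms=50))
```

In the second call the queue is expired by the timeout after only 40 sectors, which is how the budget timeout keeps a process doing random I/O from holding the device for too long.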
				243
				244
				245	3. What are BFQ's tunables?
				246	===========================
				247
				248	The tunables back_seek_max, back_seek_penalty, fifo_expire_async and
				249	fifo_expire_sync below are the same as in CFQ. Their description is
				250	just copied from that for CFQ. Some considerations in the description
				251	of slice_idle are copied from CFQ too.
				252
				253	per-process ioprio and weight
				254	-----------------------------
				255
Arianna Avanzini	e21b7a0	2017-04-12 18:23:08 +0200	[diff] [blame]	256	Unless the cgroups interface is used (see "4. BFQ group scheduling"),
				257	weights can be assigned to processes only indirectly, through I/O
				258	priorities, and according to the relation:
				259	weight = (IOPRIO_BE_NR - ioprio) * 10.
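As a quick illustration of the relation above, the following sketch computes the weight for each best-effort priority level. It assumes IOPRIO_BE_NR is 8 (priority levels 0 to 7); the function name is hypothetical.

```python
IOPRIO_BE_NR = 8  # assumption: number of best-effort priority levels (0..7)


def ioprio_to_weight(ioprio: int) -> int:
    """Weight derived from an I/O priority level, per the relation above."""
    if not 0 <= ioprio < IOPRIO_BE_NR:
        raise ValueError(f"ioprio must be in 0..{IOPRIO_BE_NR - 1}")
    return (IOPRIO_BE_NR - ioprio) * 10


for level in range(IOPRIO_BE_NR):
    print(f"ioprio {level} -> weight {ioprio_to_weight(level)}")
```

Under this assumption the highest priority (ioprio 0) maps to weight 80 and the lowest (ioprio 7) to weight 10.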
				260
				261	Beware that, if low_latency is set, then BFQ automatically raises the
				262	weight of the queues associated with interactive and soft real-time
				263	applications. Unset this tunable if you need/want to control weights.
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	264
				265	slice_idle
				266	----------
				267
				268	This parameter specifies how long BFQ should idle, waiting for the next
				269	I/O request, when certain sync BFQ queues become empty. By default
				270	slice_idle is a non-zero value. Idling has a double purpose: boosting
				271	throughput and making sure that the desired throughput distribution is
				272	respected (see the description of how BFQ works, and, if needed, the
				273	papers referenced there).
				274
				275	As for throughput, idling can be very helpful on highly seeky media
				276	like single spindle SATA/SAS disks where we can cut down on overall
				277	number of seeks and see improved throughput.
				278
				279	Setting slice_idle to 0 will remove all the idling on queues and one
				280	should see an overall improved throughput on faster storage devices
				281	like multiple SATA/SAS disks in hardware RAID configuration.
				282
				283	So depending on storage and workload, it might be useful to set
				284	slice_idle=0. In general for SATA/SAS disks and software RAID of
				285	SATA/SAS disks keeping slice_idle enabled should be useful. For any
				286	configurations where there are multiple spindles behind single LUN
				287	(Host based hardware RAID controller or for storage arrays), setting
				288	slice_idle=0 might end up in better throughput and acceptable
				289	latencies.
				290
				291	Idling is however necessary to have service guarantees enforced in
				292	case of differentiated weights or differentiated I/O-request lengths.
				293	To see why, suppose that a given BFQ queue A must get several I/O
				294	requests served for each request served for another queue B. Idling
				295	ensures that, if A makes a new I/O request slightly after becoming
				296	empty, then no request of B is dispatched in the middle, and thus A
				297	does not lose the possibility to get more than one request dispatched
				298	before the next request of B is dispatched. Note that idling
				299	guarantees the desired differentiated treatment of queues only in
				300	terms of I/O-request dispatches. To guarantee that the actual service
				301	order then corresponds to the dispatch order, the strict_guarantees
				302	tunable must be set too.
				303
				304	There is an important flipside for idling: apart from the above cases
				305	where it is beneficial also for throughput, idling can severely impact
				306	throughput. One important case is random workload. Because of this
				307	issue, BFQ tends to avoid idling as much as possible, when it is not
				308	beneficial also for throughput. As a consequence of this behavior, and
				309	of further issues described for the strict_guarantees tunable,
				310	short-term service guarantees may be occasionally violated. And, in
				311	some cases, these guarantees may be more important than guaranteeing
				312	maximum throughput. For example, in video playing/streaming, a very
				313	low drop rate may be more important than maximum throughput. In these
				314	cases, consider setting the strict_guarantees parameter.
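Tunables such as slice_idle are normally exposed as files under the iosched directory of the device's queue in sysfs. The sketch below shows one way to read and write them; it assumes the usual /sys/block/<device>/queue/iosched/ layout, that BFQ is the active scheduler for the device, and that the caller has the needed privileges. The helper names and the sysfs_root parameter (handy for testing against a scratch directory) are hypothetical.

```python
from pathlib import Path


def tunable_path(device, name, sysfs_root="/sys"):
    """Path of a scheduler tunable for a block device (assumed layout)."""
    return Path(sysfs_root) / "block" / device / "queue" / "iosched" / name


def read_tunable(device, name, sysfs_root="/sys"):
    """Return the current value of a tunable as a string."""
    return tunable_path(device, name, sysfs_root).read_text().strip()


def write_tunable(device, name, value, sysfs_root="/sys"):
    """Write a new value to a tunable."""
    tunable_path(device, name, sysfs_root).write_text(f"{value}\n")
```

For example, write_tunable("sda", "slice_idle", 0) would disable idling on a device named sda (a placeholder name), and write_tunable("sda", "strict_guarantees", 1) would enable the strict mode described in the next section.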
				315
				316	strict_guarantees
				317	-----------------
				318
				319	If this parameter is set (default: unset), then BFQ
				320
				321	- always performs idling when the in-service queue becomes empty;
				322
				323	- forces the device to serve one I/O request at a time, by dispatching a
				324	new request only if there is no outstanding request.
				325
				326	In the presence of differentiated weights or I/O-request sizes, both
				327	the above conditions are needed to guarantee that every BFQ queue
				328	receives its allotted share of the bandwidth. The first condition is
				329	needed for the reasons explained in the description of the slice_idle
				330	tunable. The second condition is needed because all modern storage
				331	devices reorder internally-queued requests, which may trivially break
				332	the service guarantees enforced by the I/O scheduler.
				333
				334	Setting strict_guarantees may evidently affect throughput.
				335
				336	back_seek_max
				337	-------------
				338
				339	This specifies, given in Kbytes, the maximum "distance" for backward seeking.
				340	The distance is the amount of space from the current head location to the
				341	sectors that are backward in terms of distance.
				342
				343	This parameter allows the scheduler to anticipate requests in the "backward"
				344	direction and consider them as being the "next" if they are within this
				345	distance from the current head location.
				346
				347	back_seek_penalty
				348	-----------------
				349
				350	This parameter is used to compute the cost of backward seeking. If the
				351	backward distance of a request is just 1/back_seek_penalty of the distance
				352	of a "front" request, then the seeking cost of the two requests is considered equivalent.
				353
				354	So the scheduler will not bias toward one or the other request (otherwise the scheduler
				355	will bias toward the front request). The default value of back_seek_penalty is 2.
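The interplay between back_seek_max and back_seek_penalty can be sketched as a cost function. This is an illustrative model of the description above, not the kernel's code; the function name is hypothetical, and positions and distances are assumed to be expressed in the same unit.

```python
def seek_cost(head, target, back_seek_max, back_seek_penalty=2):
    """Cost of seeking from head to target, or None if target is too far back.

    A forward seek costs its distance. A backward seek within back_seek_max
    costs its distance multiplied by back_seek_penalty.
    """
    if target >= head:
        return target - head
    backward = head - target
    if backward > back_seek_max:
        return None  # not considered a candidate "next" request
    return backward * back_seek_penalty


head = 1000
print(seek_cost(head, 1200, back_seek_max=500))  # front request, 200 away
print(seek_cost(head, 900, back_seek_max=500))   # back request, 100 away
```

With a penalty of 2, the request 100 units behind the head costs the same as the request 200 units ahead, so neither is preferred.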
				356
				357	fifo_expire_async
				358	-----------------
				359
				360	This parameter is used to set the timeout of asynchronous requests. Default
				361	value of this is 248ms.
				362
				363	fifo_expire_sync
				364	----------------
				365
				366	This parameter is used to set the timeout of synchronous requests. Default
				367	value of this is 124ms. To favor synchronous requests over asynchronous
				368	ones, this value should be decreased relative to fifo_expire_async.
				369
				370	low_latency
				371	-----------
				372
				373	This parameter is used to enable/disable BFQ's low latency mode. By
				374	default, low latency mode is enabled. If enabled, interactive and soft
				375	real-time applications are privileged and experience a lower latency,
				376	as explained in more detail in the description of how BFQ works.
				377
Paolo Valente	44e44a1	2017-04-12 18:23:12 +0200	[diff] [blame^]	378	DO NOT enable this mode if you need full control over bandwidth
				379	distribution. In fact, if it is enabled, then BFQ automatically
				380	increases the bandwidth share of privileged applications, as the main
				381	means to guarantee a lower latency to them.
				382
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	383	timeout_sync
				384	------------
				385
				386	Maximum amount of device time that can be given to a task (queue) once
				387	it has been selected for service. On devices with costly seeks,
				388	increasing this time usually increases maximum throughput. On the
				389	opposite end, increasing this time coarsens the granularity of the
				390	short-term bandwidth and latency guarantees, especially if the
				391	following parameter is set to zero.
				392
				393	max_budget
				394	----------
				395
				396	Maximum amount of service, measured in sectors, that can be provided
				397	to a BFQ queue once it is set in service (of course within the limits
				398	of the above timeout). According to what is said in the description of
				399	the algorithm, larger values increase the throughput in proportion to
				400	the percentage of sequential I/O requests issued. The price of larger
				401	values is that they coarsen the granularity of short-term bandwidth
				402	and latency guarantees.
				403
				404	The default value is 0, which enables auto-tuning: BFQ sets max_budget
				405	to the maximum number of sectors that can be served during
				406	timeout_sync, according to the estimated peak rate.
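The auto-tuning rule amounts to multiplying the estimated peak rate by the timeout. A rough sketch, assuming the peak rate is expressed in sectors per second and timeout_sync in milliseconds (both units, the function name and the sample values are assumptions made for this illustration):

```python
def autotuned_max_budget(peak_rate_sectors_per_sec: int, timeout_sync_ms: int) -> int:
    """Number of sectors that can be served during timeout_sync at peak rate."""
    return peak_rate_sectors_per_sec * timeout_sync_ms // 1000


# Hypothetical device sustaining 200,000 sectors/s, with a 125 ms timeout.
print(autotuned_max_budget(200_000, 125))
```

A faster device or a longer timeout yields a proportionally larger budget, which is why increasing timeout_sync coarsens short-term guarantees when max_budget is left at zero.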
				407
				408	weights
				409	-------
				410
				411	Read-only parameter, used to show the weights of the currently active
				412	BFQ queues.
				413
				414
				415	wr_ tunables
				416	------------
				417
				418	BFQ exports a few parameters to control/tune the behavior of
				419	low-latency heuristics.
				420
				421	wr_coeff
				422
				423	Factor by which the weight of a weight-raised queue is multiplied. If
				424	the queue is deemed soft real-time, then the weight is further
				425	multiplied by an additional, constant factor.
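The effect of wr_coeff on a queue's weight can be written down directly. In this sketch the value of the additional soft real-time factor is left as a parameter, since the text above does not state it; the function name and the sample numbers are hypothetical.

```python
def raised_weight(weight: int, wr_coeff: int, soft_rt: bool = False,
                  softrt_factor: int = 1) -> int:
    """Weight of a weight-raised queue.

    softrt_factor stands for the additional constant factor applied to
    soft real-time queues; its value here is an assumption.
    """
    raised = weight * wr_coeff
    if soft_rt:
        raised *= softrt_factor
    return raised


print(raised_weight(100, wr_coeff=30))
print(raised_weight(100, wr_coeff=30, soft_rt=True, softrt_factor=100))
```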
				426
				427	wr_max_time
				428
				429	Maximum duration of a weight-raising period for an interactive task
				430	(ms). If set to zero (default value), then this value is computed
				431	automatically, as a function of the peak rate of the device. In any
				432	case, when the value of this parameter is read, it always reports the
				433	current duration, regardless of whether it has been set manually or
				434	computed automatically.
				435
				436	wr_max_softrt_rate
				437
				438	Maximum service rate below which a queue is deemed to be associated
				439	with a soft real-time application, and is then weight-raised
				440	accordingly (sectors/sec).
				441
				442	wr_min_idle_time
				443
				444	Minimum idle period after which interactive weight-raising may be
				445	reactivated for a queue (in ms).
				446
				447	wr_rt_max_time
				448
				449	Maximum weight-raising duration for soft real-time queues (in ms). The
				450	start time from which this duration is considered is automatically
				451	moved forward if the queue is detected to be still soft real-time
				452	before the current soft real-time weight-raising period finishes.
				453
				454	wr_min_inter_arr_async
				455
				456	Minimum period between I/O request arrivals after which weight-raising
				457	may be reactivated for an already busy async queue (in ms).
				458
				459
				460	4. BFQ group scheduling
				461	=======================
				462
Arianna Avanzini	e21b7a0	2017-04-12 18:23:08 +0200	[diff] [blame]	463	BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
				464	blkio and io. In particular, BFQ supports weight-based proportional
				465	share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	466
				467	4-1 Service guarantees provided
				468	-------------------------------
				469
				470	With BFQ, proportional share means true proportional share of the
				471	device bandwidth, according to group weights. For example, a group
				472	with weight 200 gets twice the bandwidth, and not just twice the time,
				473	of a group with weight 100.
				474
				475	BFQ supports hierarchies (group trees) of any depth. Bandwidth is
				476	distributed among groups and processes in the expected way: for each
				477	group, the children of the group share the whole bandwidth of the
				478	group in proportion to their weights. In particular, this implies
				479	that, for each leaf group, every process of the group receives the
				480	same share of the whole group bandwidth, unless the ioprio of the
				481	process is modified.
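The hierarchical distribution described above can be worked out by multiplying, level by level, each child's fraction of its parent's bandwidth. The sketch below assumes every group and process always has I/O pending; the tree representation (name mapped to a (weight, children) pair) and the names in the example are hypothetical.

```python
def bandwidth_shares(children, parent_share=1.0, path=""):
    """Return the fraction of device bandwidth received by each leaf.

    children maps a name to (weight, sub_children); sub_children is a
    dict of the same shape, or None for a leaf.
    """
    total_weight = sum(weight for weight, _ in children.values())
    shares = {}
    for name, (weight, sub_children) in children.items():
        share = parent_share * weight / total_weight
        full_name = f"{path}/{name}"
        if sub_children:
            shares.update(bandwidth_shares(sub_children, share, full_name))
        else:
            shares[full_name] = share
    return shares


tree = {
    "A": (200, {"p1": (100, None), "p2": (100, None)}),
    "B": (100, None),
}
for leaf, share in bandwidth_shares(tree).items():
    print(f"{leaf}: {share:.3f}")
```

Group A, with weight 200, gets twice the bandwidth of group B, and its two equally weighted processes split that share evenly, so each of the three leaves ends up with about one third of the device bandwidth.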
				482
				483	The resource-sharing guarantee for a group may partially or totally
				484	switch from bandwidth to time, if providing bandwidth guarantees to
				485	the group lowers the throughput too much. This switch occurs on a
				486	per-process basis: if a process of a leaf group causes throughput loss
				487	if served in such a way to receive its share of the bandwidth, then
				488	BFQ switches back to just time-based proportional share for that
				489	process.
				490
				491	4-2 Interface
				492	-------------
				493
				494	To get proportional sharing of bandwidth with BFQ for a given device,
				495	BFQ must of course be the active scheduler for that device.
				496
				497	Within each group directory, the names of the files associated with
				498	BFQ-specific cgroup parameters and stats begin with the "bfq."
				499	prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
				500	BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
				501	parameter to set the weight of a group with BFQ is blkio.bfq.weight
				502	or io.bfq.weight.
				503
				504	Parameters to set
				505	-----------------
				506
				507	For each group, there is only the following parameter to set.
				508
				509	weight (namely blkio.bfq.weight or io.bfq.weight): the weight of the
				510	group inside its parent. Available values: 1..10000 (default 100). The
				511	linear mapping between ioprio and weights, described at the beginning
				512	of the tunable section, is still valid, but all weights higher than
				513	IOPRIO_BE_NR*10 are mapped to ioprio 0.
				514
Paolo Valente	44e44a1	2017-04-12 18:23:12 +0200	[diff] [blame^]	515	Recall that, if low-latency is set, then BFQ automatically raises the
				516	weight of the queues associated with interactive and soft real-time
				517	applications. Unset this tunable if you need/want to control weights.
				518
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	519
				520	[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
				521	Scheduler", Proceedings of the First Workshop on Mobile System
				522	Technologies (MST-2015), May 2015.
				523	http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
				524
				525	[2] P. Valente and M. Andreolini, "Improving Application
				526	Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
				527	the 5th Annual International Systems and Storage Conference
				528	(SYSTOR '12), June 2012.
				529	Slightly extended version:
				530	http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf