Blame - Documentation/scheduler/sched-capacity.rst - SHIFTPHONES/mainline/linux

blob: 805f85f330b54572d89ff4fbaacdf0097b47ecbe [file] [log] [blame]

Valentin Schneider	65065fd	2020-07-31 20:20:15 +0100	[diff] [blame]	1	=========================
				2	Capacity Aware Scheduling
				3	=========================
				4
				5	1. CPU Capacity
				6	===============
				7
				8	1.1 Introduction
				9	----------------
				10
				11	Conventional, homogeneous SMP platforms are composed of purely identical
				12	CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
				13	different performance characteristics - on such platforms, not all CPUs can be
				14	considered equal.
				15
				16	CPU capacity is a measure of the performance a CPU can reach, normalized against
				17	the most performant CPU in the system. Heterogeneous systems are also called
				18	asymmetric CPU capacity systems, as they contain CPUs of different capacities.
				19
				20	Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
				21	from two factors:
				22
				23	- not all CPUs may have the same microarchitecture (µarch).
				24	- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
				25	physically able to attain the higher Operating Performance Points (OPP).
				26
				27	Arm big.LITTLE systems are an example of both. The big CPUs are more
				28	performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
				29	smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
				30	can.
				31
				32	CPU performance is usually expressed in Millions of Instructions Per Second
				33	(MIPS), which can also be expressed as a given amount of instructions attainable
				34	per Hz, leading to::
				35
				36	capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
				37
				38	1.2 Scheduler terms
				39	-------------------
				40
				41	Two different capacity values are used within the scheduler. A CPU's
				42	``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
				43	attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
				44	which some loss of available performance (e.g. time spent handling IRQs) is
				45	subtracted.
				46
				47	Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
				48	while ``capacity_orig`` is class-agnostic. The rest of this document will use
				49	the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
				50	brevity.
				51
				52	1.3 Platform examples
				53	---------------------
				54
				55	1.3.1 Identical OPPs
				56	~~~~~~~~~~~~~~~~~~~~
				57
				58	Consider an hypothetical dual-core asymmetric CPU capacity system where
				59
				60	- work_per_hz(CPU0) = W
				61	- work_per_hz(CPU1) = W/2
				62	- all CPUs are running at the same fixed frequency
				63
				64	By the above definition of capacity:
				65
				66	- capacity(CPU0) = C
				67	- capacity(CPU1) = C/2
				68
				69	To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
				70	be a LITTLE.
				71
				72	With a workload that periodically does a fixed amount of work, you will get an
				73	execution trace like so::
				74
				75	CPU0 work ^
				76	\| ____ ____ ____
				77	\| \| \| \| \| \| \|
				78	+----+----+----+----+----+----+----+----+----+----+-> time
				79
				80	CPU1 work ^
				81	\| _________ _________ ____
				82	\| \| \| \| \| \|
				83	+----+----+----+----+----+----+----+----+----+----+-> time
				84
				85	CPU0 has the highest capacity in the system (C), and completes a fixed amount of
				86	work W in T units of time. On the other hand, CPU1 has half the capacity of
				87	CPU0, and thus only completes W/2 in T.
				88
				89	1.3.2 Different max OPPs
				90	~~~~~~~~~~~~~~~~~~~~~~~~
				91
				92	Usually, CPUs of different capacity values also have different maximum
				93	OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:
				94
				95	- max_freq(CPU0) = F
				96	- max_freq(CPU1) = 2/3 * F
				97
				98	This yields:
				99
				100	- capacity(CPU0) = C
				101	- capacity(CPU1) = C/3
				102
				103	Executing the same workload as described in 1.3.1, which each CPU running at its
				104	maximum frequency results in::
				105
				106	CPU0 work ^
				107	\| ____ ____ ____
				108	\| \| \| \| \| \| \|
				109	+----+----+----+----+----+----+----+----+----+----+-> time
				110
				111	workload on CPU1
				112	CPU1 work ^
				113	\| ______________ ______________ ____
				114	\| \| \| \| \| \|
				115	+----+----+----+----+----+----+----+----+----+----+-> time
				116
				117	1.4 Representation caveat
				118	-------------------------
				119
				120	It should be noted that having a single value to represent differences in CPU
				121	performance is somewhat of a contentious point. The relative performance
				122	difference between two different µarchs could be X% on integer operations, Y% on
				123	floating point operations, Z% on branches, and so on. Still, results using this
				124	simple approach have been satisfactory for now.
				125
				126	2. Task utilization
				127	===================
				128
				129	2.1 Introduction
				130	----------------
				131
				132	Capacity aware scheduling requires an expression of a task's requirements with
				133	regards to CPU capacity. Each scheduler class can express this differently, and
				134	while task utilization is specific to CFS, it is convenient to describe it here
				135	in order to introduce more generic concepts.
				136
				137	Task utilization is a percentage meant to represent the throughput requirements
				138	of a task. A simple approximation of it is the task's duty cycle, i.e.::
				139
				140	task_util(p) = duty_cycle(p)
				141
				142	On an SMP system with fixed frequencies, 100% utilization suggests the task is a
				143	busy loop. Conversely, 10% utilization hints it is a small periodic task that
				144	spends more time sleeping than executing. Variable CPU frequencies and
				145	asymmetric CPU capacities complexify this somewhat; the following sections will
				146	expand on these.
				147
				148	2.2 Frequency invariance
				149	------------------------
				150
				151	One issue that needs to be taken into account is that a workload's duty cycle is
				152	directly impacted by the current OPP the CPU is running at. Consider running a
				153	periodic workload at a given frequency F::
				154
				155	CPU work ^
				156	\| ____ ____ ____
				157	\| \| \| \| \| \| \|
				158	+----+----+----+----+----+----+----+----+----+----+-> time
				159
				160	This yields duty_cycle(p) == 25%.
				161
				162	Now, consider running the same workload at frequency F/2::
				163
				164	CPU work ^
				165	\| _________ _________ ____
				166	\| \| \| \| \| \|
				167	+----+----+----+----+----+----+----+----+----+----+-> time
				168
				169	This yields duty_cycle(p) == 50%, despite the task having the exact same
				170	behaviour (i.e. executing the same amount of work) in both executions.
				171
				172	The task utilization signal can be made frequency invariant using the following
				173	formula::
				174
				175	task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))
				176
				177	Applying this formula to the two examples above yields a frequency invariant
				178	task utilization of 25%.
				179
				180	2.3 CPU invariance
				181	------------------
				182
				183	CPU capacity has a similar effect on task utilization in that running an
				184	identical workload on CPUs of different capacity values will yield different
				185	duty cycles.
				186
				187	Consider the system described in 1.3.2., i.e.::
				188
				189	- capacity(CPU0) = C
				190	- capacity(CPU1) = C/3
				191
				192	Executing a given periodic workload on each CPU at their maximum frequency would
				193	result in::
				194
				195	CPU0 work ^
				196	\| ____ ____ ____
				197	\| \| \| \| \| \| \|
				198	+----+----+----+----+----+----+----+----+----+----+-> time
				199
				200	CPU1 work ^
				201	\| ______________ ______________ ____
				202	\| \| \| \| \| \|
				203	+----+----+----+----+----+----+----+----+----+----+-> time
				204
				205	IOW,
				206
				207	- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
				208	- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency
				209
				210	The task utilization signal can be made CPU invariant using the following
				211	formula::
				212
				213	task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)
				214
				215	with ``max_capacity`` being the highest CPU capacity value in the
				216	system. Applying this formula to the above example above yields a CPU
				217	invariant task utilization of 25%.
				218
				219	2.4 Invariant task utilization
				220	------------------------------
				221
				222	Both frequency and CPU invariance need to be applied to task utilization in
				223	order to obtain a truly invariant signal. The pseudo-formula for a task
				224	utilization that is both CPU and frequency invariant is thus, for a given
				225	task p::
				226
				227	curr_frequency(cpu) capacity(cpu)
				228	task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
				229	max_frequency(cpu) max_capacity
				230
				231	In other words, invariant task utilization describes the behaviour of a task as
				232	if it were running on the highest-capacity CPU in the system, running at its
				233	maximum frequency.
				234
				235	Any mention of task utilization in the following sections will imply its
				236	invariant form.
				237
				238	2.5 Utilization estimation
				239	--------------------------
				240
				241	Without a crystal ball, task behaviour (and thus task utilization) cannot
				242	accurately be predicted the moment a task first becomes runnable. The CFS class
				243	maintains a handful of CPU and task signals based on the Per-Entity Load
				244	Tracking (PELT) mechanism, one of those yielding an average utilization (as
				245	opposed to instantaneous).
				246
				247	This means that while the capacity aware scheduling criteria will be written
				248	considering a "true" task utilization (using a crystal ball), the implementation
				249	will only ever be able to use an estimator thereof.
				250
				251	3. Capacity aware scheduling requirements
				252	=========================================
				253
				254	3.1 CPU capacity
				255	----------------
				256
				257	Linux cannot currently figure out CPU capacity on its own, this information thus
				258	needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
				259	for that purpose.
				260
				261	The arm and arm64 architectures directly map this to the arch_topology driver
				262	CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
				263	Documentation/devicetree/bindings/arm/cpu-capacity.txt.
				264
				265	3.2 Frequency invariance
				266	------------------------
				267
				268	As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
				269	utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
				270	purpose.
				271
				272	Implementing this function requires figuring out at which frequency each CPU
				273	have been running at. One way to implement this is to leverage hardware counters
				274	whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86,
				275	AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
				276	when the kernel is aware of the switched-to frequency (also employed by
				277	arm/arm64).
				278
				279	4. Scheduler topology
				280	=====================
				281
				282	During the construction of the sched domains, the scheduler will figure out
				283	whether the system exhibits asymmetric CPU capacities. Should that be the
				284	case:
				285
				286	- The sched_asym_cpucapacity static key will be enabled.
Beata Michalska	adf3c31	2021-06-03 15:06:27 +0100	[diff] [blame]	287	- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
				288	level that spans all unique CPU capacity values.
				289	- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
				290	CPUs with any range of asymmetry.
Valentin Schneider	65065fd	2020-07-31 20:20:15 +0100	[diff] [blame]	291
				292	The sched_asym_cpucapacity static key is intended to guard sections of code that
				293	cater to asymmetric CPU capacity systems. Do note however that said key is
				294	system-wide. Imagine the following setup using cpusets::
				295
				296	capacity C/2 C
				297	________ ________
				298	/ \ / \
				299	CPUs 0 1 2 3 4 5 6 7
				300	\__/ \______________/
				301	cpusets cs0 cs1
				302
				303	Which could be created via:
				304
				305	.. code-block:: sh
				306
				307	mkdir /sys/fs/cgroup/cpuset/cs0
				308	echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
				309	echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems
				310
				311	mkdir /sys/fs/cgroup/cpuset/cs1
				312	echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
				313	echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems
				314
				315	echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
				316
				317	Since there is CPU capacity asymmetry in the system, the
				318	sched_asym_cpucapacity static key will be enabled. However, the sched_domain
				319	hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
				320	set in that hierarchy, it describes an SMP island and should be treated as such.
				321
				322	Therefore, the 'canonical' pattern for protecting codepaths that cater to
				323	asymmetric CPU capacities is to:
				324
				325	- Check the sched_asym_cpucapacity static key
				326	- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
				327	the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
				328	CPU or group thereof)
				329
				330	5. Capacity aware scheduling implementation
				331	===========================================
				332
				333	5.1 CFS
				334	-------
				335
				336	5.1.1 Capacity fitness
				337	~~~~~~~~~~~~~~~~~~~~~~
				338
				339	The main capacity scheduling criterion of CFS is::
				340
				341	task_util(p) < capacity(task_cpu(p))
				342
				343	This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
				344	task "fits" on its CPU. If it is violated, the task will need to achieve more
				345	work than what its CPU can provide: it will be CPU-bound.
				346
				347	Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
				348	value for a task, either via sched_setattr() or via the cgroup interface (see
				349	Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to
				350	clamp task_util() in the previous criterion.
				351
				352	5.1.2 Wakeup CPU selection
				353	~~~~~~~~~~~~~~~~~~~~~~~~~~
				354
				355	CFS task wakeup CPU selection follows the capacity fitness criterion described
				356	above. On top of that, uclamp is used to clamp the task utilization values,
				357	which lets userspace have more leverage over the CPU selection of CFS
				358	tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::
				359
				360	clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
				361
				362	By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
				363	on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
				364	periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
				365	giving it a high uclamp.min value.
				366
				367	.. note::
				368
				369	Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
Mauro Carvalho Chehab	e4e29e7	2020-09-09 16:10:36 +0200	[diff] [blame]	370	(EAS), which is described in Documentation/scheduler/sched-energy.rst.
Valentin Schneider	65065fd	2020-07-31 20:20:15 +0100	[diff] [blame]	371
				372	5.1.3 Load balancing
				373	~~~~~~~~~~~~~~~~~~~~
				374
				375	A pathological case in the wakeup CPU selection occurs when a task rarely
				376	sleeps, if at all - it thus rarely wakes up, if at all. Consider::
				377
				378	w == wakeup event
				379
				380	capacity(CPU0) = C
				381	capacity(CPU1) = C / 3
				382
				383	workload on CPU0
				384	CPU work ^
				385	\| _________ _________ ____
				386	\| \| \| \| \| \|
				387	+----+----+----+----+----+----+----+----+----+----+-> time
				388	w w w
				389
				390	workload on CPU1
				391	CPU work ^
				392	\| ____________________________________________
				393	\| \|
				394	+----+----+----+----+----+----+----+----+----+----+->
				395	w
				396
				397	This workload should run on CPU0, but if the task either:
				398
				399	- was improperly scheduled from the start (inaccurate initial
				400	utilization estimation)
				401	- was properly scheduled from the start, but suddenly needs more
				402	processing power
				403
				404	then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
				405	the CPU capacity scheduling criterion is violated, and there may not be any more
				406	wakeup event to fix this up via wakeup CPU selection.
				407
				408	Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
				409	put in place to handle this shares the same name. Misfit task migration
				410	leverages the CFS load balancer, more specifically the active load balance part
				411	(which caters to migrating currently running tasks). When load balance happens,
				412	a misfit active load balance will be triggered if a misfit task can be migrated
				413	to a CPU with more capacity than its current one.
				414
				415	5.2 RT
				416	------
				417
				418	5.2.1 Wakeup CPU selection
				419	~~~~~~~~~~~~~~~~~~~~~~~~~~
				420
				421	RT task wakeup CPU selection searches for a CPU that satisfies::
				422
				423	task_uclamp_min(p) <= capacity(task_cpu(cpu))
				424
				425	while still following the usual priority constraints. If none of the candidate
				426	CPUs can satisfy this capacity criterion, then strict priority based scheduling
				427	is followed and CPU capacities are ignored.
				428
				429	5.3 DL
				430	------
				431
				432	5.3.1 Wakeup CPU selection
				433	~~~~~~~~~~~~~~~~~~~~~~~~~~
				434
				435	DL task wakeup CPU selection searches for a CPU that satisfies::
				436
				437	task_bandwidth(p) < capacity(task_cpu(p))
				438
				439	while still respecting the usual bandwidth and deadline constraints. If
				440	none of the candidate CPUs can satisfy this capacity criterion, then the
				441	task will remain on its current CPU.