Blame - Documentation/scheduler/sched-energy.rst - SHIFTPHONES/mainline/linux

blob: 8fbce5e767d98066fb4fd3b1e1cf78ba2e6f1bc6

Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	1	=======================
				2	Energy Aware Scheduling
				3	=======================
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	4
				5	1. Introduction
				6	---------------
				7
				8	Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
				9	the impact of its decisions on the energy consumed by CPUs. EAS relies on an
				10	Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
				11	with a minimal impact on throughput. This document aims at providing an
				12	introduction on how EAS works, what the main design decisions behind it are,
				13	and what is needed to get it to run.
				14
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	15	Before going any further, please note that at the time of writing::
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	16
				17	/!\ EAS does not support platforms with symmetric CPU topologies /!\
				18
				19	EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
				20	because this is where the potential for saving energy through scheduling is
				21	the highest.
				22
				23	The actual EM used by EAS is _not_ maintained by the scheduler, but by a
				24	dedicated framework. For details about this framework and what it provides,
Linus Torvalds	fb4da21	2019-07-15 20:44:49 -0700	[diff] [blame]	25	please refer to its documentation (see Documentation/power/energy-model.rst).
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	26
				27
				28	2. Background and Terminology
				29	-----------------------------
				30
				31	To make it clear from the start:
				32	- energy = [joule] (resource like a battery on powered devices)
				33	- power = energy/time = [joule/second] = [watt]
				34
				35	The goal of EAS is to minimize energy, while still getting the job done. That
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	36	is, we want to maximize::
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	37
				38	performance [inst/s]
				39	--------------------
				40	power [W]
				41
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	42	which is equivalent to minimizing::
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	43
				44	energy [J]
				45	-----------
				46	instruction
				47
				48	while still getting 'good' performance. It is essentially an alternative
				49	optimization objective to the current performance-only objective for the
				50	scheduler. This alternative considers two objectives: energy-efficiency and
				51	performance.
				52
				53	The idea behind introducing an EM is to allow the scheduler to evaluate the
				54	implications of its decisions rather than blindly applying energy-saving
				55	techniques that may have positive effects only on some platforms. At the same
				56	time, the EM must be as simple as possible to minimize the scheduler latency
				57	impact.
				58
				59	In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
				60	for the scheduler to decide where a task should run (during wake-up), the EM
				61	is used to break the tie between several good CPU candidates and pick the one
				62	that is predicted to yield the best energy consumption without harming the
				63	system's throughput. The predictions made by EAS rely on specific elements of
				64	knowledge about the platform's topology, which include the 'capacity' of CPUs,
				65	and their respective energy costs.
				66
				67
				68	3. Topology information
				69	-----------------------
				70
				71	EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
				72	differentiate CPUs with different computing throughput. The 'capacity' of a CPU
				73	represents the amount of work it can absorb when running at its highest
				74	frequency compared to the most capable CPU of the system. Capacity values are
				75	normalized in a 1024 range, and are comparable with the utilization signals of
				76	tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
				77	to capacity and utilization values, EAS is able to estimate how big/busy a
				78	task/CPU is, and to take this into consideration when evaluating performance vs
				79	energy trade-offs. The capacity of CPUs is provided via arch-specific code
				80	through the arch_scale_cpu_capacity() callback.
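
As an illustration of the 1024-normalized scale, the following Python sketch
normalizes raw throughput figures against the most capable CPU and derives the
spare capacity of a CPU from its utilization. The raw figures and the helper
names are assumptions made for this example; they are not kernel interfaces.

```python
SCHED_CAPACITY_SCALE = 1024  # capacity assigned to the most capable CPU


def normalize_capacities(raw_throughput):
    """Scale raw per-CPU throughput so that the biggest CPU gets 1024."""
    biggest = max(raw_throughput.values())
    return {cpu: perf * SCHED_CAPACITY_SCALE // biggest
            for cpu, perf in raw_throughput.items()}


def spare_capacity(capacity, util):
    """Capacity left on a CPU once its utilization is accounted for."""
    return max(capacity - util, 0)


# Hypothetical raw figures: two little CPUs and two big CPUs.
raw = {0: 500, 1: 500, 2: 1000, 3: 1000}
caps = normalize_capacities(raw)
```

Because capacities and PELT utilization share the same scale, the two values
can be subtracted directly, which is what the spare capacity helper relies on.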
				81
				82	The rest of platform knowledge used by EAS is directly read from the Energy
				83	Model (EM) framework. The EM of a platform is composed of a power cost table
Linus Torvalds	fb4da21	2019-07-15 20:44:49 -0700	[diff] [blame]	84	per 'performance domain' in the system (see Documentation/power/energy-model.rst
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	85	for further details about performance domains).
				86
				87	The scheduler manages references to the EM objects in the topology code when the
				88	scheduling domains are built, or re-built. For each root domain (rd), the
				89	scheduler maintains a singly linked list of all performance domains intersecting
				90	the current rd->span. Each node in the list contains a pointer to a struct
				91	em_perf_domain as provided by the EM framework.
				92
				93	The lists are attached to the root domains in order to cope with exclusive
				94	cpuset configurations. Since the boundaries of exclusive cpusets do not
				95	necessarily match those of performance domains, the lists of different root
				96	domains can contain duplicate elements.
				97
				98	Example 1.
				99	Let us consider a platform with 12 CPUs, split in 3 performance domains
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	100	(pd0, pd4 and pd8), organized as follows::
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	101
				102	  CPUs:   0 1 2 3 4 5 6 7 8 9 10 11
				103	  PDs:   |--pd0--|--pd4--|---pd8---|
				104	  RDs:   |----rd1----|-----rd2-----|
				105
				106	Now, consider that userspace decided to split the system with two
				107	exclusive cpusets, hence creating two independent root domains, each
				108	containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
				109	above figure. Since pd4 intersects with both rd1 and rd2, it will be
				110	present in the linked list '->pd' attached to each of them:
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	111
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	112	* rd1->pd: pd0 -> pd4
				113	* rd2->pd: pd4 -> pd8
				114
				115	Please note that the scheduler will create two duplicate list nodes for
				116	pd4 (one for each list). However, both just hold a pointer to the same
				117	shared data structure of the EM framework.
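
The list construction of Example 1 can be sketched in Python as follows. This
is only a model of the idea: the scheduler uses linked lists of nodes pointing
at struct em_perf_domain, whereas the sketch uses plain Python containers, and
build_pd_lists() is a hypothetical name.

```python
def build_pd_lists(perf_domains, root_domains):
    """Return, for each root domain, the PDs whose CPUs intersect its span."""
    return {
        rd: [pd for pd, cpus in perf_domains.items() if cpus & span]
        for rd, span in root_domains.items()
    }


# Topology of Example 1: 12 CPUs, 3 performance domains, 2 root domains.
perf_domains = {
    "pd0": frozenset(range(0, 4)),
    "pd4": frozenset(range(4, 8)),
    "pd8": frozenset(range(8, 12)),
}
root_domains = {
    "rd1": frozenset(range(0, 6)),
    "rd2": frozenset(range(6, 12)),
}
pd_lists = build_pd_lists(perf_domains, root_domains)
```

Here pd4 appears in both lists because its CPUs are split between rd1 and rd2.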
				118
				119	Since the access to these lists can happen concurrently with hotplug and other
				120	things, they are protected by RCU, like the rest of topology structures
				121	manipulated by the scheduler.
				122
				123	EAS also maintains a static key (sched_energy_present) which is enabled when at
				124	least one root domain meets all conditions for EAS to start. Those conditions
				125	are summarized in Section 6.
				126
				127
				128	4. Energy-Aware task placement
				129	------------------------------
				130
				131	EAS overrides the CFS task wake-up balancing code. It uses the EM of the
				132	platform and the PELT signals to choose an energy-efficient target CPU during
				133	wake-up balance. When EAS is enabled, select_task_rq_fair() calls
				134	find_energy_efficient_cpu() to make the placement decision. This function looks
				135	for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
				136	each performance domain since it is the one which will allow us to keep the
				137	frequency the lowest. Then, the function checks if placing the task there could
				138	save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
				139	in its previous activation.
				140
				141	find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
				142	energy consumed by the system if the waking task was migrated. compute_energy()
				143	looks at the current utilization landscape of the CPUs and adjusts it to
				144	'simulate' the task migration. The EM framework provides the em_pd_energy() API
				145	which computes the expected energy consumption of each performance domain for
				146	the given utilization landscape.
				147
				148	An example of energy-optimized task placement decision is detailed below.
				149
				150	Example 2.
				151	Let us consider a (fake) platform with 2 independent performance domains
				152	composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
				153	are big.
				154
				155	The scheduler must decide where to place a task P whose util_avg = 200
				156	and prev_cpu = 0.
				157
				158	The current utilization landscape of the CPUs is depicted on the graph
				159	below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively.
				160	Each performance domain has three Operating Performance Points (OPPs).
				161	The CPU capacity and power cost associated with each OPP is listed in
				162	the Energy Model table. The util_avg of P is shown on the figures
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	163	below as 'PP'::
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	164
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	165	 CPU util.
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	166	  1024                 - - - - - - -              Energy Model
				167	                                           +-----------+-------------+
				168	                                           |  Little   |     Big     |
				169	   768                 =============       +-----+-----+------+------+
				170	                                           | Cap | Pwr | Cap  | Pwr  |
				171	                                           +-----+-----+------+------+
				172	   512  ===========    - ##- - - - -       | 170 | 50  | 512  | 400  |
				173	                         ##     ##         | 341 | 150 | 768  | 800  |
				174	   341  -PP - - - -      ##     ##         | 512 | 300 | 1024 | 1700 |
				175	         PP              ##     ##         +-----+-----+------+------+
				176	   170  -## - - - -      ##     ##
				177	         ##     ##       ##     ##
				178	       ------------    -------------
				179	        CPU0   CPU1     CPU2   CPU3
				180
				181	  Current OPP: =====       Other OPP: - - -     util_avg (100 each): ##
				182
				183
				184	find_energy_efficient_cpu() will first look for the CPUs with the
				185	maximum spare capacity in the two performance domains. In this example,
				186	CPU1 and CPU3. Then it will estimate the energy of the system if P was
				187	placed on either of them, and check if that would save some energy
				188	compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
				189	(which is coherent with the behaviour of the schedutil CPUFreq
				190	governor, see Section 6 for more details on this topic).
				191
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	192	Case 1. P is migrated to CPU1::
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	193
				194	  1024                 - - - - - - -
				195
				196	                                        Energy calculation:
				197	   768                 =============     * CPU0: 200 / 341 * 150 = 88
				198	                                         * CPU1: 300 / 341 * 150 = 131
				199	                                         * CPU2: 600 / 768 * 800 = 625
				200	   512  - - - - - -    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
				201	                         ##     ##          => total_energy = 1364
				202	   341  ===========      ##     ##
				203	                PP       ##     ##
				204	   170  -## - - PP-      ##     ##
				205	         ##     ##       ##     ##
				206	       ------------    -------------
				207	        CPU0   CPU1     CPU2   CPU3
				208
				209
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	210	Case 2. P is migrated to CPU3::
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	211
				212	  1024                 - - - - - - -
				213
				214	                                        Energy calculation:
				215	   768                 =============     * CPU0: 200 / 341 * 150 = 88
				216	                                         * CPU1: 100 / 341 * 150 = 43
				217	                                PP       * CPU2: 600 / 768 * 800 = 625
				218	   512  - - - - - -    - ##- - -PP -     * CPU3: 700 / 768 * 800 = 729
				219	                         ##     ##          => total_energy = 1485
				220	   341  ===========      ##     ##
				221	                         ##     ##
				222	   170  -## - - - -      ##     ##
				223	         ##     ##       ##     ##
				224	       ------------    -------------
				225	        CPU0   CPU1     CPU2   CPU3
				226
				227
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	228	Case 3. P stays on prev_cpu / CPU 0::
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	229
				230	  1024                 - - - - - - -
				231
				232	                                        Energy calculation:
				233	   768                 =============     * CPU0: 400 / 512 * 300 = 234
				234	                                         * CPU1: 100 / 512 * 300 = 58
				235	                                         * CPU2: 600 / 768 * 800 = 625
				236	   512  ===========    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
				237	                         ##     ##          => total_energy = 1437
				238	   341  -PP - - - -      ##     ##
				239	         PP              ##     ##
				240	   170  -## - - - -      ##     ##
				241	         ##     ##       ##     ##
				242	       ------------    -------------
				243	        CPU0   CPU1     CPU2   CPU3
				244
				245
				246	From these calculations, Case 1 has the lowest total energy. So CPU1
				247	is the best candidate from an energy-efficiency standpoint.
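
The arithmetic of Example 2 can be reproduced with a short Python sketch. It is
a simplified model, not the kernel implementation: it assumes that each
performance domain runs at the lowest OPP whose capacity covers its busiest
CPU, and it uses floating-point division, so its totals differ from the rounded
figures above by a unit or two. The names EM, pd_energy() and system_energy()
are hypothetical.

```python
# Energy Model of Example 2: (capacity, power) per OPP, lowest OPP first.
EM = {
    "little": [(170, 50), (341, 150), (512, 300)],
    "big": [(512, 400), (768, 800), (1024, 1700)],
}
PD_OF_CPU = {0: "little", 1: "little", 2: "big", 3: "big"}


def pd_energy(opps, utils):
    """Energy estimate of one performance domain for the given CPU utils."""
    # The whole domain runs at the lowest OPP able to serve its busiest CPU
    # (falling back to the highest OPP if none is large enough).
    cap, power = next((o for o in opps if o[0] >= max(utils)), opps[-1])
    return sum(util / cap * power for util in utils)


def system_energy(util_by_cpu):
    """Sum the per-domain estimates over the whole (fake) platform."""
    total = 0.0
    for pd, opps in EM.items():
        utils = [u for cpu, u in util_by_cpu.items() if PD_OF_CPU[cpu] == pd]
        total += pd_energy(opps, utils)
    return total


# Utilization of each CPU without P; P (util_avg = 200) is then added to
# each candidate in turn: prev_cpu (CPU0), CPU1 and CPU3.
base = {0: 200, 1: 100, 2: 600, 3: 500}
P_UTIL = 200
totals = {}
for candidate in (0, 1, 3):
    utils = dict(base)
    utils[candidate] += P_UTIL
    totals[candidate] = system_energy(utils)

best_cpu = min(totals, key=totals.get)
```

Only the ordering of the three totals matters for the decision, which is why
the small rounding differences do not change the outcome.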
				248
				249	Big CPUs are generally more power hungry than the little ones and are thus used
				250	mainly when a task doesn't fit the littles. However, little CPUs aren't always
				251	necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
				252	of the little CPUs can be less energy-efficient than the lowest OPPs of the
				253	bigs, for example. So, if the little CPUs happen to have enough utilization at
				254	a specific point in time, a small task waking up at that moment could be better
				255	off executing on the big side in order to save energy, even though it would fit
				256	on the little side.
				257
				258	And even in the case where all OPPs of the big CPUs are less energy-efficient
				259	than those of the little, using the big CPUs for a small task might still, under
				260	specific conditions, save energy. Indeed, placing a task on a little CPU can
				261	result in raising the OPP of the entire performance domain, and that will
				262	increase the cost of the tasks already running there. If the waking task is
				263	placed on a big CPU, its own execution cost might be higher than if it was
				264	running on a little, but it won't impact the other tasks of the little CPUs
				265	which will keep running at a lower OPP. So, when considering the total energy
				266	consumed by CPUs, the extra cost of running that one task on a big core can be
				267	smaller than the cost of raising the OPP on the little CPUs for all the other
				268	tasks.
				269
				270	The examples above would be nearly impossible to get right in a generic way, and
				271	for all platforms, without knowing the cost of running at different OPPs on all
				272	CPUs of the system. Thanks to its EM-based design, EAS should cope with them
				273	correctly without too much trouble. However, in order to ensure a minimal
				274	impact on throughput for high-utilization scenarios, EAS also implements another
				275	mechanism called 'over-utilization'.
				276
				277
				278	5. Over-utilization
				279	-------------------
				280
				281	From a general standpoint, the use-cases where EAS can help the most are those
				282	involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
				283	being run, they will require all of the available CPU capacity, and there isn't
				284	much that can be done by the scheduler to save energy without severely harming
				285	throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
				286	'over-utilized' as soon as they are used at more than 80% of their compute
				287	capacity. As long as no CPUs are over-utilized in a root domain, load balancing
				288	is disabled and EAS overrides the wake-up balancing code. EAS is likely to load
				289	the most energy efficient CPUs of the system more than the others if that can be
				290	done without harming throughput. So, the load-balancer is disabled to prevent
				291	it from breaking the energy-efficient task placement found by EAS. It is safe to
				292	do so when the system isn't overutilized since being below the 80% tipping point
				293	implies that:
				294
				295	a. there is some idle time on all CPUs, so the utilization signals used by
				296	EAS are likely to accurately represent the 'size' of the various tasks
				297	in the system;
				298	b. all tasks should already be provided with enough CPU capacity,
				299	regardless of their nice values;
				300	c. since there is spare capacity, all tasks must be blocking/sleeping
				301	regularly and balancing at wake-up is sufficient.
				302
				303	As soon as one CPU goes above the 80% tipping point, at least one of the three
				304	assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
				305	is raised for the entire root domain, EAS is disabled, and the load-balancer is
				306	re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
				307	wake-up and load balance under CPU-bound conditions. This better respects the
				308	nice values of tasks.
				309
				310	Since the notion of overutilization largely relies on detecting whether or not
				311	there is some idle time in the system, the CPU capacity 'stolen' by higher
				312	(than CFS) scheduling classes (as well as IRQ) must be taken into account. As
				313	such, the detection of overutilization accounts for the capacity used not only
				314	by CFS tasks, but also by the other scheduling classes and IRQ.
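
The 80% tipping point can be sketched in Python as follows. The integer
comparison is modelled on the kernel's fits_capacity() helper, but the function
names and the (util, capacity) input format are assumptions made for this
example. The utilization passed in is expected to already include the capacity
used by other scheduling classes and IRQ.

```python
def fits_capacity(util, capacity):
    """True while util stays below 80% of capacity (util * 1.25 < capacity)."""
    return util * 1280 < capacity * 1024


def root_domain_overutilized(cpus):
    """cpus: iterable of (util, capacity) pairs for one root domain."""
    # A single CPU above the tipping point flags the whole root domain.
    return any(not fits_capacity(util, cap) for util, cap in cpus)
```

For instance, a little CPU of capacity 512 tips over once its utilization
exceeds 409, and that alone is enough to re-enable the load-balancer for the
root domain.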
				315
				316
				317	6. Dependencies and requirements for EAS
				318	----------------------------------------
				319
				320	Energy Aware Scheduling depends on the CPUs of the system having specific
				321	hardware properties and on other features of the kernel being enabled. This
				322	section lists these dependencies and provides hints as to how they can be met.
				323
				324
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	325	6.1 - Asymmetric CPU topology
				326	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				327
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	328
				329	As mentioned in the introduction, EAS is only supported on platforms with
				330	asymmetric CPU topologies for now. This requirement is checked at run-time by
Beata Michalska	adf3c31	2021-06-03 15:06:27 +0100	[diff] [blame]	331	looking for the presence of the SD_ASYM_CPUCAPACITY_FULL flag when the scheduling
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	332	domains are built.
				333
Mauro Carvalho Chehab	e4e29e7	2020-09-09 16:10:36 +0200	[diff] [blame]	334	See Documentation/scheduler/sched-capacity.rst for requirements to be met for this
Valentin Schneider	949bcb8	2020-07-31 20:20:16 +0100	[diff] [blame]	335	flag to be set in the sched_domain hierarchy.
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	336
				337	Please note that EAS is not fundamentally incompatible with SMP, but no
				338	significant savings on SMP platforms have been observed yet. This restriction
				339	could be amended in the future if proven otherwise.
				340
				341
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	342	6.2 - Energy Model presence
				343	^^^^^^^^^^^^^^^^^^^^^^^^^^^
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	344
				345	EAS uses the EM of a platform to estimate the impact of scheduling decisions on
				346	energy. So, your platform must provide power cost tables to the EM framework in
				347	order to make EAS start. To do so, please refer to documentation of the
Linus Torvalds	fb4da21	2019-07-15 20:44:49 -0700	[diff] [blame]	348	independent EM framework in Documentation/power/energy-model.rst.
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	349
				350	Please also note that the scheduling domains need to be re-built after the
				351	EM has been registered in order to start EAS.
				352
Lukasz Luba	5a64f77	2020-11-03 09:05:58 +0000	[diff] [blame]	353	EAS uses the EM to make a forecasting decision on energy usage and thus it is
				354	more focused on the difference when checking possible options for task
				355	placement. For EAS it doesn't matter whether the EM power values are expressed
				356	in milli-Watts or in an 'abstract scale'.
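
A short Python sketch of why the unit does not matter: multiplying every
estimate by the same factor leaves the comparison between candidates unchanged.
The scaling factor is arbitrary and the helper name is hypothetical; the
sketch assumes that all performance domains use the same scale.

```python
def cheapest(estimates):
    """Return the candidate CPU with the lowest estimated energy."""
    return min(estimates, key=estimates.get)


# Totals of the three cases of Example 2, keyed by candidate CPU.
estimates = {0: 1437, 1: 1364, 3: 1485}
# The same estimates expressed on a hypothetical abstract scale.
rescaled = {cpu: energy * 0.37 for cpu, energy in estimates.items()}
```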
				357
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	358
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	359	6.3 - Energy Model complexity
				360	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	361
				362	The task wake-up path is very latency-sensitive. When the EM of a platform is
				363	too complex (too many CPUs, too many performance domains, too many performance
				364	states, ...), the cost of using it in the wake-up path can become prohibitive.
				365	The energy-aware wake-up algorithm has a complexity of:
				366
				367	C = Nd * (Nc + Ns)
				368
				369	with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
				370	total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
				371
				372	A complexity check is performed at the root domain level, when scheduling
				373	domains are built. EAS will not start on a root domain if its C happens to be
				374	higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
				375	time of writing).
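
The check can be sketched in Python as follows, using the formula and the
threshold quoted above. The function names and the input format (one
(nr_cpus, nr_opps) pair per performance domain) are assumptions made for this
example.

```python
EM_MAX_COMPLEXITY = 2048  # threshold quoted in this document


def em_complexity(pds):
    """pds: list of (nr_cpus, nr_opps) pairs, one per performance domain."""
    nd = len(pds)
    nc = sum(cpus for cpus, _ in pds)
    ns = sum(opps for _, opps in pds)
    return nd * (nc + ns)


def eas_can_start(pds):
    """EAS does not start when C is higher than the threshold."""
    return em_complexity(pds) <= EM_MAX_COMPLEXITY
```

Two performance domains of 4 CPUs and 4 OPPs each give C = 2 * (8 + 8) = 32,
far below the threshold, while 16 domains of 8 CPUs and 20 OPPs each give
C = 16 * (128 + 320) = 7168, which is too high.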
				376
				377	If you really want to use EAS but the complexity of your platform's Energy
				378	Model is too high to be used with a single root domain, you're left with only
				379	two possible options:
				380
				381	1. split your system into separate, smaller, root domains using exclusive
				382	cpusets and enable EAS locally on each of them. This option has the
				383	benefit of working out of the box but the drawback of preventing load
				384	balance between root domains, which can result in an unbalanced system
				385	overall;
				386	2. submit patches to reduce the complexity of the EAS wake-up algorithm,
				387	hence enabling it to cope with larger EMs in reasonable time.
				388
				389
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	390	6.4 - Schedutil governor
				391	^^^^^^^^^^^^^^^^^^^^^^^^
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	392
				393	EAS tries to predict at which OPP the CPUs will be running in the near future
				394	in order to estimate their energy consumption. To do so, it is assumed that OPPs
				395	of CPUs follow their utilization.
				396
				397	Although it is very difficult to provide hard guarantees regarding the accuracy
				398	of this assumption in practice (because the hardware might not do what it is
				399	told to do, for example), schedutil as opposed to other CPUFreq governors at
				400	least _requests_ frequencies calculated using the utilization signals.
				401	Consequently, the only sane governor to use together with EAS is schedutil,
				402	because it is the only one providing some degree of consistency between
				403	frequency requests and energy predictions.
				404
				405	Using EAS with any other governor than schedutil is not supported.
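
The assumption that OPPs follow utilization can be illustrated with a Python
sketch of a utilization-driven frequency request. The 25% headroom and the
function name are assumptions made for this example, loosely modelled on how
schedutil derives its requests; the hardware remains free to do something else.

```python
def requested_freq(util, max_capacity, max_freq_khz):
    """Frequency request proportional to utilization, with 25% headroom."""
    freq = (max_freq_khz + (max_freq_khz >> 2)) * util // max_capacity
    # A request can never exceed the highest frequency of the CPU.
    return min(freq, max_freq_khz)
```

With a maximum frequency of 2 GHz and a capacity of 1024, a utilization of 512
leads to a request of 1.25 GHz under these assumptions.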
				406
				407
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	408	6.5 - Scale-invariant utilization signals
				409	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	410
				411	In order to make accurate prediction across CPUs and for all performance
				412	states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
				413	be obtained using the architecture-defined arch_scale_{cpu,freq}_capacity()
				414	callbacks.
				415
				416	Using EAS on a platform that doesn't implement these two callbacks is not
				417	supported.
				418
				419
Mauro Carvalho Chehab	d6a3b24	2019-06-12 14:53:03 -0300	[diff] [blame]	420	6.6 - Multithreading (SMT)
				421	^^^^^^^^^^^^^^^^^^^^^^^^^^
Quentin Perret	81a930d	2019-01-10 11:05:46 +0000	[diff] [blame]	422
				423	EAS in its current form is SMT unaware and is not able to leverage
				424	multithreaded hardware to save energy. EAS considers threads as independent
				425	CPUs, which can actually be counter-productive for both performance and energy.
				426
				427	EAS on SMT is not supported.