NOTE: all this assumes a linear relation between frequency and work capacity;
we know this is flawed, but it is the best workable approximation.


PELT (Per Entity Load Tracking)
-------------------------------

With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
we use an Exponentially Weighted Moving Average (EWMA); each period (1024us)
is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
half, while the rest of history contributes the other half.

Specifically:

  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

  ewma(u) = ewma_sum(u) / ewma_sum(1)

Since this is essentially a progression of an infinite geometric series, the
results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
is key, since it gives the ability to recompose the averages when tasks move
around.
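
To make the decay concrete, here is a minimal user-space sketch (not the
kernel's fixed-point implementation, see kernel/sched/pelt.c) that computes
y from y^32 = 0.5 and checks that the most recent 32 periods carry roughly
half of the total weight:

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
          double y = pow(0.5, 1.0 / 32.0);        /* y^32 = 0.5 */
          double recent = 0.0, total = 0.0;

          /* ewma_sum(1): every 1024us period fully busy (u_n = 1). */
          for (int n = 0; n < 1000; n++) {
                  double w = pow(y, n);

                  total += w;
                  if (n < 32)                     /* the newest 32 periods */
                          recent += w;
          }

          printf("y = %.5f\n", y);                         /* ~0.97857 */
          printf("recent/total = %.3f\n", recent / total); /* ~0.500 */
          return 0;
  }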

Note that blocked tasks still contribute to the aggregates (task-group slices
and CPU runqueues), which reflects their expected contribution when they
resume running.

Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
reflects the time an entity spends on the CPU, while 'runnable' reflects the
time an entity spends on the runqueue. When there is only a single task these
two metrics are the same, but once there is contention for the CPU 'running'
will decrease to reflect the fraction of time each task spends on the CPU
while 'runnable' will increase to reflect the amount of contention.
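
For example, with two always-runnable tasks sharing one CPU, each task runs
half the time: each task's 'running' converges to 0.5 while its 'runnable'
approaches 1.0 (it occupies the runqueue the whole time), and the runqueue
aggregates converge to running = 1.0 and runnable = 2.0.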

For more detail see: kernel/sched/pelt.c


Frequency- / CPU Invariance
---------------------------

Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.

For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as:

            f_cur
  r_dvfs := -----
            f_max
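
For example, a task that is busy 50% of wall time on a CPU running at 1GHz
whose f_max is 2GHz has r_dvfs = 0.5 and thus accrues 0.5 * 0.5 = 0.25 of
invariant utilization; the same 50% at 2GHz would accrue the full 0.5.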

For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
For Intel specifically, we use:

           APERF
  f_cur := ----- * P0
           MPERF

             4C-turbo;  if available and turbo enabled
  f_max := { 1C-turbo;  if turbo enabled
             P0;        otherwise

                    f_cur
  r_dvfs := min( 1, ----- )
                    f_max

We pick 4C turbo over 1C turbo to make it slightly more sustainable.
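
As a sketch of this computation with made-up numbers (the real code keeps
per-CPU MSR snapshots, see arch/x86/kernel/smpboot.c; the frequencies and
helper name below are illustrative only), using the kernel's 1024 fixed-point
capacity scale:

  #include <stdint.h>

  #define SCHED_CAPACITY_SCALE  1024

  /* Assumed frequencies, in kHz, for illustration only. */
  static const uint64_t f_max_khz = 3600000;    /* 4C-turbo */
  static const uint64_t p0_khz    = 2400000;    /* P0 (guaranteed) */

  /* r_dvfs from the APERF/MPERF deltas since the last update. */
  static uint64_t r_dvfs(uint64_t aperf_delta, uint64_t mperf_delta)
  {
          /* f_cur = APERF/MPERF * P0 */
          uint64_t f_cur = aperf_delta * p0_khz / mperf_delta;

          /* r_dvfs = min(1, f_cur / f_max), as 1024 fixed point */
          uint64_t r = f_cur * SCHED_CAPACITY_SCALE / f_max_khz;

          return r > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : r;
  }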

r_cpu is determined as the ratio of highest performance level of the current
CPU vs the highest performance level of any other CPU in the system.

  r_tot = r_dvfs * r_cpu
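
Conceptually, the time delta fed into the EWMA is scaled by both ratios before
it is accumulated. A sketch of that step (the kernel does this in
kernel/sched/pelt.h:update_rq_clock_pelt(); the helper below is illustrative):

  #include <stdint.h>

  /* Both ratios are in the kernel's 1024 fixed-point capacity scale. */
  static inline uint64_t scale_delta(uint64_t delta_us,
                                     unsigned long r_dvfs,
                                     unsigned long r_cpu)
  {
          /* r_tot = r_dvfs * r_cpu */
          delta_us = delta_us * r_dvfs >> 10;   /* * r_dvfs / 1024 */
          delta_us = delta_us * r_cpu  >> 10;   /* * r_cpu  / 1024 */
          return delta_us;
  }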

The result is that the above 'running' and 'runnable' metrics become invariant
of DVFS and CPU type. IOW, we can transfer and compare them between CPUs.

For more detail see:

- kernel/sched/pelt.h:update_rq_clock_pelt()
- arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


UTIL_EST / UTIL_EST_FASTUP
--------------------------

Because periodic tasks have their averages decayed while they sleep, even
though their expected utilization is the same when they run again, they
suffer a (DVFS) ramp-up after waking.

To alleviate this, UTIL_EST (a default-enabled option) drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default-enabled option, UTIL_EST_FASTUP, modifies the IIR
filter to instantly increase and only decay on decrease.

A further runqueue-wide sum (over the runnable tasks) is maintained:

  util_est := \Sum_t max( t_running, t_util_est_ewma )
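
A sketch of the dequeue-time filter described above (the weight and the
helper name are assumptions for illustration; the kernel's version lives in
kernel/sched/fair.c):

  #define UTIL_EST_WEIGHT_SHIFT   2       /* assumed IIR weight: 1/4 */

  static unsigned long util_est_update(unsigned long ewma,
                                       unsigned long running)
  {
          /* UTIL_EST_FASTUP: increases take effect instantly... */
          if (running >= ewma)
                  return running;

          /* ...while decreases decay through the IIR filter:
             ewma = ewma - ewma/4 + running/4 */
          ewma -= ewma >> UTIL_EST_WEIGHT_SHIFT;
          ewma += running >> UTIL_EST_WEIGHT_SHIFT;
          return ewma;
  }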

For more detail see: kernel/sched/fair.c:util_est_dequeue()


UCLAMP
------

It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps a max aggregate of these clamps for all running tasks.
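
For example, user space can set the clamps with sched_setattr(2); a minimal
sketch (glibc provides no wrapper, so the raw syscall is used; values are in
the kernel's [0, 1024] capacity scale):

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/sched.h>
  #include <linux/sched/types.h>        /* struct sched_attr */

  int set_uclamp(pid_t pid, uint32_t umin, uint32_t umax)
  {
          struct sched_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          /* Keep policy/params, only change the clamps. */
          attr.sched_flags = SCHED_FLAG_KEEP_ALL |
                             SCHED_FLAG_UTIL_CLAMP_MIN |
                             SCHED_FLAG_UTIL_CLAMP_MAX;
          attr.sched_util_min = umin;   /* e.g. 256: at least 25% capacity */
          attr.sched_util_max = umax;   /* e.g. 512: at most 50% capacity */

          return syscall(SYS_sched_setattr, pid, &attr, 0);
  }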

For more detail see: include/uapi/linux/sched/types.h


Schedutil / DVFS
----------------

Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.

The basis is the CPU runqueue's 'running' metric, which per the above is the
frequency-invariant utilization estimate of the CPU. From this we compute a
desired frequency like:

             max( running, util_est );  if UTIL_EST
  u_cfs := { running;                   otherwise

               clamp( u_cfs + u_rt , u_min, u_max );  if UCLAMP_TASK
  u_clamp := { u_cfs + u_rt;                          otherwise

  u := u_clamp + u_irq + u_dl;          [approx. see source for more detail]

  f_des := min( f_max, 1.25 u * f_max )
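
A sketch of that last step, mirroring the shape of schedutil's frequency
selection in kernel/sched/cpufreq_schedutil.c ('max' is the capacity that the
utilization is measured against; the helper name is illustrative):

  static unsigned long f_des(unsigned long util, unsigned long max,
                             unsigned long f_max)
  {
          /* 1.25 * u * f_max, with 1.25 computed as 1 + 1/4 */
          unsigned long f = (f_max + (f_max >> 2)) * util / max;

          return f < f_max ? f : f_max;   /* min( f_max, ... ) */
  }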

XXX IO-wait: when the update is due to a task wakeup from IO-completion we
boost 'u' above.

This frequency is then used to select a P-state/OPP or directly munged into a
CPPC-style request to the hardware.

XXX: deadline tasks (Sporadic Task Model) allow us to calculate a hard f_min
required to satisfy the workload.

Because these callbacks are directly from the scheduler, the DVFS hardware
interaction should be 'fast' and non-blocking. Schedutil supports
rate-limiting DVFS requests for when hardware interaction is slow and
expensive; this reduces effectiveness.

For more information see: kernel/sched/cpufreq_schedutil.c


NOTES
-----

- On low-load scenarios, where DVFS is most relevant, the 'running' numbers
  will closely reflect utilization.

- In saturated scenarios task movement will cause some transient dips:
  suppose we have a CPU saturated with 4 tasks, then when we migrate a task
  to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
  new CPU will gain 0.25. This is inevitable and time progression will
  correct this. XXX do we still guarantee f_max due to no idle-time?

- Much of the above is about avoiding DVFS dips, and independent DVFS domains
  having to re-learn / ramp-up when load shifts.