User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>

This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".

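Before mounting, it can help to confirm that the CPU advertises the
flags named above. The following is a minimal sketch, not part of the
kernel interface: has_cpu_flags is a hypothetical helper that takes the
path of a cpuinfo-style file followed by the flag names to look for, and
it only inspects the first "flags" line of that file.

```shell
# Hypothetical helper: succeed only if every named flag appears on the
# first "flags" line of the given cpuinfo-style file.
has_cpu_flags() (
    file=$1
    shift
    # Pad with spaces so each flag can be matched as a whole word.
    flags=" $(sed -n 's/^flags[^:]*: *//p' "$file" | head -n 1) "
    for want in "$@"; do
        case $flags in
            *" $want "*) ;;
            *) exit 1 ;;
        esac
    done
    exit 0
)
```

For example, "has_cpu_flags /proc/cpuinfo rdt cat_l3" succeeds only when
both flags are present.
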
To use the feature, mount the file system:

 # mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl

Mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.


Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.
Cache resource (L3/L2) subdirectories contain the following files:

"num_closids":  The number of CLOSIDs which are valid for this
                resource. The kernel uses the smallest number of
                CLOSIDs of all enabled resources as the limit.

"cbm_mask":     The bitmask which is valid for this resource.
                This mask is equivalent to 100%.

"min_cbm_bits": The minimum number of consecutive bits which
                must be set when writing a mask.

The memory bandwidth (MB) subdirectory contains the following files:

"min_bandwidth":  The minimum memory bandwidth percentage which
                  the user can request.

"bandwidth_gran": The granularity in which the memory bandwidth
                  percentage is allocated. The allocated
                  b/w percentage is rounded off to the next
                  control step available on the hardware. The
                  available bandwidth control steps are:
                  min_bandwidth + N * bandwidth_gran.

"delay_linear":   Indicates if the delay scale is linear or
                  non-linear. This field is purely informational.

Resource groups
---------------
Resource groups are represented as directories in the resctrl file
system. The default group is the root directory. Other groups may be
created as desired by the system administrator using the "mkdir(1)"
command, and removed using "rmdir(1)".

There are four files associated with each group:

"tasks": A list of tasks that belong to this group. Tasks can be
        added to a group by writing the task ID to the "tasks" file
        (which will automatically remove them from the previous
        group to which they belonged). New tasks created by fork(2)
        and clone(2) are added to the same group as their parent.
        If a pid is not in any sub partition, it is in the root
        partition (i.e. the default partition).

"cpus": A bitmask of logical CPUs assigned to this group. Writing
        a new mask can add/remove CPUs from this group. Added CPUs
        are removed from their previous group. Removed ones are
        given to the default (root) group. You cannot remove CPUs
        from the default group.

"cpus_list": One or more CPU ranges of logical CPUs assigned to this
        group. The same rules apply as for the "cpus" file.

"schemata": A list of all the resources available to this group.
        Each resource has its own line and format - see below for
        details.

When a task is running the following rules define which resources
are available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for
   the CPU's group is used.

3) Otherwise the schemata for the default group is used.

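The three rules above amount to a simple precedence check. As a rough
illustration only, the hypothetical helper below takes the group a task
belongs to and the group that owns the CPU it is running on, and prints
the group whose schemata applies. Using "." to stand for the default
(root) group is a convention of this sketch, not of resctrl.

```shell
# Hypothetical helper: print the group whose schemata applies.
# "." stands for the default (root) group in this sketch.
effective_group() (
    task_group=$1 cpu_group=$2
    if [ "$task_group" != "." ]; then
        echo "$task_group"      # rule 1: task is in a non-default group
    elif [ "$cpu_group" != "." ]; then
        echo "$cpu_group"       # rule 2: CPU belongs to a specific group
    else
        echo "."                # rule 3: fall back to the default group
    fi
)
```
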
Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, or multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical CPUs sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id

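To see which cache ID each logical CPU maps to, the per-CPU "id" files
can be read in a loop. This is a minimal sketch that assumes the usual
sysfs layout of one cpu<N>/cache/index<M>/id file per CPU and cache
index; list_cache_ids is a hypothetical helper, and its optional
argument (the directory to scan) exists only so the function can be
tried against a copy of the tree.

```shell
# Hypothetical helper: print one "cpuN indexM id=ID" line per cache
# index found under ROOT (default: /sys/devices/system/cpu).
list_cache_ids() (
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cache/index[0-9]*/id; do
        [ -r "$f" ] || continue
        idx=${f%/id}            # strip the trailing "/id"
        idx=${idx##*/index}     # keep only the index number
        cpu=${f#"$root"/cpu}    # strip the leading "ROOT/cpu"
        cpu=${cpu%%/*}          # keep only the CPU number
        echo "cpu$cpu index$idx id=$(cat "$f")"
    done
)
```

CPUs that print the same id for the same index share that cache.
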
Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each CPU model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

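The contiguity rule and the four-way split above can be checked with a
little shell arithmetic. This is only a sketch: is_contiguous_cbm and
split_cbm are hypothetical helpers, masks are passed as decimal or
0x-prefixed hex numbers, and split_cbm assumes the mask width is a
multiple of the number of parts.

```shell
# Hypothetical helper: succeed if MASK is non-zero and all of its '1'
# bits form one contiguous block.
is_contiguous_cbm() (
    m=$(( $1 ))
    [ "$m" -ne 0 ] || exit 1
    low=$(( m & -m ))           # lowest set bit
    # Adding the lowest set bit clears a contiguous run and carries into
    # the bit above it; any bit shared with the original means a gap.
    [ $(( (m + low) & m )) -eq 0 ]
)

# Hypothetical helper: print PARTS equal, non-overlapping masks (hex,
# lowest first) that together cover a BITS-wide mask.
split_cbm() (
    bits=$1 parts=$2
    width=$(( bits / parts ))
    part=$(( (1 << width) - 1 ))
    i=0
    out=""
    while [ "$i" -lt "$parts" ]; do
        out="$out$(printf '%x' $(( part << (i * width) ))) "
        i=$(( i + 1 ))
    done
    echo "${out% }"
)
```

With these definitions "split_cbm 20 4" prints "1f 3e0 7c00 f8000",
matching the masks listed above.
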
Memory bandwidth (b/w) percentage
---------------------------------
For the memory b/w resource, the user controls the resource by indicating
the percentage of total memory b/w.

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.

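The control-step formula can be worked through with integer arithmetic.
The sketch below makes assumptions that the text does not spell out: it
reads "rounded to the next control step" as rounding up, treats 100 as
the upper limit, and simply rejects requests outside that range.
round_mb is a hypothetical helper, not something resctrl provides.

```shell
# Hypothetical helper: print the control step that REQ percent would be
# rounded to, given MIN (min_bandwidth) and GRAN (bandwidth_gran).
# Assumes rounding up and an upper limit of 100.
round_mb() (
    req=$1 min=$2 gran=$3
    # This sketch rejects requests outside [MIN, 100].
    if [ "$req" -lt "$min" ] || [ "$req" -gt 100 ]; then
        exit 1
    fi
    # Number of granularity steps above MIN, rounded up.
    steps=$(( (req - min + gran - 1) / gran ))
    val=$(( min + steps * gran ))
    if [ "$val" -gt 100 ]; then
        val=100
    fi
    echo "$val"
)
```

With min_bandwidth 10 and bandwidth_gran 10, "round_mb 35 10 10" prints
40, while a request of 20 is already on a control step and stays at 20.
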
Bandwidth throttling is a core-specific mechanism on some Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.

L3 details (code and data prioritization disabled)
--------------------------------------------------
With CDP disabled the L3 schemata format is:

        L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 details (CDP enabled via mount option to resctrl)
----------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

        L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 details
----------
L2 cache does not support code and data prioritization, so the
schemata format is always:

        L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

Memory b/w Allocation details
-----------------------------

The memory b/w domain is the L3 cache.

        MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.

 # cat schemata
 L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
 L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
 # echo "L3DATA:2=3c0;" > schemata
 # cat schemata
 L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
 L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

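Scripts often need a single value out of such a listing. The following
sketch shows one way to pull it out with plain shell string handling.
schemata_get is a hypothetical helper: it reads schemata text on
standard input and prints the value recorded for the given resource
name and domain ID, or nothing if there is no such entry.

```shell
# Hypothetical helper: print the value for domain ID on the RESOURCE
# line of a schemata listing read from stdin.
schemata_get() (
    res=$1 id=$2
    while read -r line; do
        # Skip lines that belong to other resources.
        case $line in
            "$res":*) ;;
            *) continue ;;
        esac
        rest=${line#*:}         # drop the "RESOURCE:" prefix
        old_ifs=$IFS
        IFS=';'                 # split "id=value" pairs on ';'
        for pair in $rest; do
            if [ "${pair%%=*}" = "$id" ]; then
                echo "${pair#*=}"
            fi
        done
        IFS=$old_ifs
    done
)
```

After the write shown above, "schemata_get L3DATA 2 < schemata" would
print 3c0.
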
Example 1
---------
On a two-socket machine (one L3 cache per socket) with just four bits
for cache bit masks, a minimum b/w of 10% and a memory bandwidth
granularity of 10%:

 # mount -t resctrl resctrl /sys/fs/resctrl
 # cd /sys/fs/resctrl
 # mkdir p0 p1
 # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
 # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.

Two real-time tasks, pid=1234 running on processor 0 and pid=5678 running
on processor 1, run on socket 0 of a 2-socket, dual-core machine. To avoid
noisy neighbors, each of the two real-time tasks exclusively occupies one
quarter of L3 cache on socket 0.

 # mount -t resctrl resctrl /sys/fs/resctrl
 # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks:

 # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real-time task and give
it access to the "top" 25% of the cache on socket 0.

 # mkdir p0
 # echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real-time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

 # echo 1234 > p0/tasks
 # taskset -cp 0 1234

Ditto for the second real-time task (with the remaining 25% of cache):

 # mkdir p1
 # echo "L3:0=7c00;1=fffff" > p1/schemata
 # echo 5678 > p1/tasks
 # taskset -cp 1 5678

For the same 2-socket system with memory b/w resource and CAT L3, the
schemata would look like this (assuming min_bandwidth is 10 and
bandwidth_gran is 10):

For our first real-time task this would request 20% memory b/w on socket
0.

 # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real-time task this would request another 20% memory b/w
on socket 0.

 # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

Example 3
---------

A single-socket system which has real-time tasks running on cores 4-7 and
a non-real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per-task association is not required and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.

 # mount -t resctrl resctrl /sys/fs/resctrl
 # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks:

 # echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real-time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.

 # mkdir p0
 # echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real-time threads are scheduled on the cores 4-7.

 # echo F0 > p0/cpus

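The value F0 written to "cpus" is simply the bitmask with bits 4 to 7
set. As a small illustration, the hypothetical helper below builds such
a mask from the first and last CPU number of a range; it is a sketch
that assumes the range fits in the shell's native integer width.

```shell
# Hypothetical helper: print the hex bitmask with bits FIRST..LAST set.
cpu_range_mask() (
    first=$1 last=$2
    count=$(( last - first + 1 ))
    printf '%x\n' $(( ((1 << count) - 1) << first ))
)
```

Here "cpu_range_mask 4 7" prints f0, and "cpu_range_mask 0 3" prints f
for the remaining cores.
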
Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of reads/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory cbmmasks
  3. Create a new directory
  4. Write the bits found in step 2 to the new directory's "schemata" file

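Step 2 above can be sketched with shell arithmetic. This is one possible
approach, not a prescribed algorithm: find_free_cbm is a hypothetical
helper that takes the global mask, the number of bits wanted and then
the masks already in use, all as decimal or 0x-prefixed hex numbers
(values read from a schemata file would need the 0x prefix added). It
assumes the global mask is itself a contiguous run starting at bit 0.

```shell
# Hypothetical helper: print (hex) the lowest run of WANT contiguous
# bits inside FULL that is clear in every mask in USED..., or fail if
# there is no such run.
find_free_cbm() (
    full=$(( $1 )) want=$2
    shift 2
    used=0
    for m in "$@"; do
        used=$(( used | $m ))
    done
    free=$(( full & ~used ))
    run=$(( (1 << want) - 1 ))
    shift_by=0
    # Slide the run upward until it no longer fits inside FULL.
    while [ $(( run << shift_by )) -le "$full" ]; do
        cand=$(( run << shift_by ))
        if [ $(( cand & free )) -eq "$cand" ]; then
            printf '%x\n' "$cand"
            exit 0
        fi
        shift_by=$(( shift_by + 1 ))
    done
    exit 1
)
```

With a 20-bit mask and 0x3ff and 0xf8000 already taken,
"find_free_cbm 0xfffff 5 0x3ff 0xf8000" prints 7c00. The search on its
own does nothing about concurrent callers, so it still has to run under
a lock that covers all four steps.
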
If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) On success, read the directory structure.
 C) Release the lock with flock(LOCK_UN)

Example with bash:

# Atomically read directory structure
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

# Read directory contents and create new subdirectory

$ cat create-dir.sh
#!/bin/bash
find /sys/fs/resctrl/ > output.txt
# function-of is a placeholder for the code that picks a free mask
mask=$(function-of output.txt)
mkdir /sys/fs/resctrl/newres/
echo "$mask" > /sys/fs/resctrl/newres/schemata

$ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C:

/*
 * Example code to take advisory locks
 * before accessing resctrl filesystem
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>

void resctrl_take_shared_lock(int fd)
{
	int ret;

	/* take shared lock on resctrl filesystem */
	ret = flock(fd, LOCK_SH);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_take_exclusive_lock(int fd)
{
	int ret;

	/* take exclusive lock on resctrl filesystem */
	ret = flock(fd, LOCK_EX);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_release_lock(int fd)
{
	int ret;

	/* release lock on resctrl filesystem */
	ret = flock(fd, LOCK_UN);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

int main(void)
{
	int fd;

	fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
	if (fd == -1) {
		perror("open");
		exit(-1);
	}
	resctrl_take_shared_lock(fd);
	/* code to read directory contents */
	resctrl_release_lock(fd);

	resctrl_take_exclusive_lock(fd);
	/* code to read and write directory contents */
	resctrl_release_lock(fd);

	return 0;
}