Blame - Documentation/x86/intel_rdt_ui.txt - SHIFTPHONES/kernel/common

Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	1	User Interface for Resource Allocation in Intel Resource Director Technology
				2
				3	Copyright (C) 2016 Intel Corporation
				4
				5	Fenghua Yu <fenghua.yu@intel.com>
				6	Tony Luck <tony.luck@intel.com>
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	7	Vikas Shivappa <vikas.shivappa@intel.com>
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	8
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	9	This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
				10	X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3".
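As a quick check before mounting, the flag bits named above can be looked up in /proc/cpuinfo. The following is a minimal sketch; the function name rdt_cpu_flags and its optional file argument are assumptions made for illustration and are not part of the resctrl interface.

```shell
# List which of the flag bits named in this document appear in cpuinfo.
# The optional file argument (hypothetical, for trying the sketch on a
# saved copy) defaults to /proc/cpuinfo.
rdt_cpu_flags() {
    cpuinfo=${1:-/proc/cpuinfo}
    # Take the first "flags" line and split it into one flag per line.
    flags=$(grep -m1 '^flags' "$cpuinfo" | cut -d: -f2 | tr ' ' '\n')
    for want in rdt cqm cat_l3 cdp_l3; do
        # -x matches whole lines only, so "cqm_llc" does not count as "cqm".
        if printf '%s\n' "$flags" | grep -qx "$want"; then
            echo "$want"
        fi
    done
}
```

Running rdt_cpu_flags with no argument prints one matching flag per line; an empty result suggests that none of the listed flags are reported on this system.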
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	11
				12	To use the feature mount the file system:
				13
				14	# mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl
				15
				16	mount options are:
				17
				18	"cdp": Enable code/data prioritization in L3 cache allocations.
				19
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	20	RDT features are orthogonal. A particular system may support only
				21	monitoring, only control, or both monitoring and control.
				22
The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
				25	For more details on the behavior of the interface during monitoring
				26	and allocation, see the "Resource alloc and monitor groups" section.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	27
Thomas Gleixner	458b0d6e	2016-11-07 11:58:12 +0100	[diff] [blame]	28	Info directory
				29	--------------
				30
				31	The 'info' directory contains information about the enabled
				32	resources. Each resource has its own subdirectory. The subdirectory
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	33	names reflect the resource names.
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	34
The files in each subdirectory depend on the resource type.

Each cache resource (L3/L2) subdirectory contains the following files
related to allocation:
Thomas Gleixner	458b0d6e	2016-11-07 11:58:12 +0100	[diff] [blame]	40
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	41	"num_closids": The number of CLOSIDs which are valid for this
				42	resource. The kernel uses the smallest number of
				43	CLOSIDs of all enabled resources as limit.
Thomas Gleixner	458b0d6e	2016-11-07 11:58:12 +0100	[diff] [blame]	44
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	45	"cbm_mask": The bitmask which is valid for this resource.
				46	This mask is equivalent to 100%.
Thomas Gleixner	458b0d6e	2016-11-07 11:58:12 +0100	[diff] [blame]	47
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	48	"min_cbm_bits": The minimum number of consecutive bits which
				49	must be set when writing a mask.
Thomas Gleixner	458b0d6e	2016-11-07 11:58:12 +0100	[diff] [blame]	50
"shareable_bits": Bitmask of the parts of the resource that are shared
                  with other executing entities (e.g. I/O). Use this
                  when setting up exclusive cache partitions. Note that
                  some platforms support devices that have their own
                  settings for cache use which can override these bits.
				57
The memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	60
				61	"min_bandwidth": The minimum memory bandwidth percentage which
				62	user can request.
				63
				64	"bandwidth_gran": The granularity in which the memory bandwidth
				65	percentage is allocated. The allocated
				66	b/w percentage is rounded off to the next
				67	control step available on the hardware. The
				68	available bandwidth control steps are:
				69	min_bandwidth + N * bandwidth_gran.
				70
"delay_linear": Indicates if the delay scale is linear or
                non-linear. This field is purely informational.
Thomas Gleixner	458b0d6e	2016-11-07 11:58:12 +0100	[diff] [blame]	74
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	75	If RDT monitoring is available there will be an "L3_MON" directory
				76	with the following files:
				77
				78	"num_rmids": The number of RMIDs available. This is the
				79	upper bound for how many "CTRL_MON" + "MON"
				80	groups can be created.
				81
				82	"mon_features": Lists the monitoring events if
				83	monitoring is enabled for the resource.
				84
"max_threshold_occupancy":
                Read/write file that provides the largest value (in
                bytes) at which a previously used LLC_occupancy
                counter can be considered for re-use.
				89
				90
				91	Resource alloc and monitor groups
				92	---------------------------------
				93
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	94	Resource groups are represented as directories in the resctrl file
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	95	system. The default group is the root directory which, immediately
				96	after mounting, owns all the tasks and cpus in the system and can make
				97	full use of all resources.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	98
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	99	On a system with RDT control features additional directories can be
				100	created in the root directory that specify different amounts of each
				101	resource (see "schemata" below). The root and these additional top level
				102	directories are referred to as "CTRL_MON" groups below.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	103
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	104	On a system with RDT monitoring the root directory and other top level
				105	directories contain a directory named "mon_groups" in which additional
				106	directories can be created to monitor subsets of tasks in the CTRL_MON
				107	group that is their ancestor. These are called "MON" groups in the rest
				108	of this document.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	109
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	110	Removing a directory will move all tasks and cpus owned by the group it
				111	represents to the parent. Removing one of the created CTRL_MON groups
				112	will automatically remove all MON groups below it.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	113
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	114	All groups contain the following files:
Jiri Olsa	4ffa3c9	2017-04-10 16:52:32 +0200	[diff] [blame]	115
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	116	"tasks":
				117	Reading this file shows the list of all tasks that belong to
				118	this group. Writing a task id to the file will add a task to the
				119	group. If the group is a CTRL_MON group the task is removed from
				120	whichever previous CTRL_MON group owned the task and also from
				121	any MON group that owned the task. If the group is a MON group,
				122	then the task must already belong to the CTRL_MON parent of this
				123	group. The task is removed from any previous MON group.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	124
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	125
				126	"cpus":
				127	Reading this file shows a bitmask of the logical CPUs owned by
				128	this group. Writing a mask to this file will add and remove
				129	CPUs to/from this group. As with the tasks file a hierarchy is
				130	maintained where MON groups may only include CPUs owned by the
				131	parent CTRL_MON group.
				132
				133
				134	"cpus_list":
				135	Just like "cpus", only using ranges of CPUs instead of bitmasks.
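The relationship between the two files can be illustrated by converting a CPU list into the equivalent bitmask. This is only a sketch: the helper name cpus_to_mask is hypothetical, and the conversion is limited to CPUs 0-62 by shell integer width, so it does not cover the longer masks found on large systems.

```shell
# Convert a CPU list such as "4-7" or "0,2,4-5" (the "cpus_list" form)
# into a single hex bitmask (the "cpus" form).
cpus_to_mask() {
    mask=0
    for range in $(printf '%s' "$1" | tr ',' ' '); do
        first=${range%-*}     # "4-7" -> 4, "4" -> 4
        last=${range#*-}      # "4-7" -> 7, "4" -> 4
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%x\n' "$mask"
}
```

For example, cpus_to_mask 4-7 prints f0, so writing "f0" to "cpus" selects the same CPUs as writing "4-7" to "cpus_list".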
				136
				137
				138	When control is enabled all CTRL_MON groups will also contain:
				139
				140	"schemata":
				141	A list of all the resources available to this group.
				142	Each resource has its own line and format - see below for details.
				143
				144	When monitoring is enabled all MON groups will also contain:
				145
"mon_data":
	This contains a set of files organized by L3 domain and by
	RDT event. E.g. on a system with two L3 domains there will
	be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
	directories has one file per event (e.g. "llc_occupancy",
	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
	files provide a readout of the current value of the event for
	all tasks in the group. In CTRL_MON groups these files provide
	the sum for all tasks in the CTRL_MON group and all tasks in
	its MON groups. See the examples section for more details on usage.
				156
				157	Resource allocation rules
				158	-------------------------
				159	When a task is running the following rules define which resources are
				160	available to it:
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	161
				162	1) If the task is a member of a non-default group, then the schemata
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	163	for that group is used.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	164
				165	2) Else if the task belongs to the default group, but is running on a
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	166	CPU that is assigned to some specific group, then the schemata for the
				167	CPU's group is used.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	168
				169	3) Otherwise the schemata for the default group is used.
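The three rules above form a simple precedence order, which can be sketched as a small decision function. The function name effective_group and the use of the string "default" for the root group are conventions of this illustration only.

```shell
# Print the group whose schemata applies to a running task.
#   $1: group the task belongs to ("default" for the root group)
#   $2: group that owns the CPU the task is currently running on
effective_group() {
    task_group=$1
    cpu_group=$2
    if [ "$task_group" != "default" ]; then
        echo "$task_group"      # rule 1: the task's own group wins
    elif [ "$cpu_group" != "default" ]; then
        echo "$cpu_group"       # rule 2: the CPU's group applies
    else
        echo "default"          # rule 3: fall back to the default group
    fi
}
```

So a task in "p0" running on a CPU owned by "p1" uses the schemata of "p0", while a task in the default group on that same CPU uses the schemata of "p1".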
				170
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	171	Resource monitoring rules
				172	-------------------------
				173	1) If a task is a member of a MON group, or non-default CTRL_MON group
				174	then RDT events for the task will be reported in that group.
				175
				176	2) If a task is a member of the default CTRL_MON group, but is running
				177	on a CPU that is assigned to some specific group, then the RDT events
				178	for the task will be reported in that group.
				179
				180	3) Otherwise RDT events for the task will be reported in the root level
				181	"mon_data" group.
				182
				183
				184	Notes on cache occupancy monitoring and control
				185	-----------------------------------------------
				186	When moving a task from one group to another you should remember that
				187	this only affects new cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
the task to a new group and immediately check the occupancy of the old and new
				190	groups you will likely see that the old group is still showing 3 MB and
				191	the new group zero. When the task accesses locations still in cache from
				192	before the move, the h/w does not update any counters. On a busy system
				193	you will likely see the occupancy in the old group go down as cache lines
				194	are evicted and re-used while the occupancy in the new group rises as
				195	the task accesses memory and loads into the cache are counted based on
				196	membership in the new group.
				197
				198	The same applies to cache allocation control. Moving a task to a group
				199	with a smaller cache partition will not evict any cache lines. The
				200	process may continue to use them from the old partition.
				201
Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource
Monitoring ID) to identify a control group and a monitoring group
respectively. Each resource group is mapped to these IDs based on the
kind of group. The numbers of CLOSIDs and RMIDs are limited by the
hardware, so creation of a "CTRL_MON" directory may fail if we run out
of either CLOSIDs or RMIDs, and creation of a "MON" group may fail if
we run out of RMIDs.
				208
				209	max_threshold_occupancy - generic concepts
				210	------------------------------------------
				211
Note that an RMID once freed may not be immediately available for use,
as the RMID is still tagged in the cache lines of its previous user.
Hence such RMIDs are placed on a limbo list and checked again once the
cache occupancy has gone down. If at some point the system has many
limbo RMIDs that are not yet ready to be used, the user may see -EBUSY
during mkdir.

max_threshold_occupancy is a user-configurable value that determines the
occupancy at which an RMID can be freed.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	221
				222	Schemata files - general concepts
				223	---------------------------------
				224	Each line in the file describes one resource. The line starts with
				225	the name of the resource, followed by specific values to be applied
				226	in each of the instances of that resource on the system.
				227
				228	Cache IDs
				229	---------
				230	On current generation systems there is one L3 cache per socket and L2
				231	caches are generally just shared by the hyperthreads on a core, but this
				232	isn't an architectural requirement. We could have multiple separate L3
				233	caches on a socket, multiple cores could share an L2 cache. So instead
				234	of using "socket" or "core" to define the set of logical cpus sharing
				235	a resource we use a "Cache ID". At a given cache level this will be a
				236	unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
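A minimal sketch of such a lookup follows. It assumes the usual sysfs layout in which each logical CPU has cache/indexN directories containing "level" and "id" files; the function name cache_ids and its optional root-directory argument are hypothetical.

```shell
# Print "<cpu> <cache id>" for every logical CPU at a given cache level.
#   $1: cache level, e.g. 3 for L3
#   $2: sysfs CPU directory (defaults to /sys/devices/system/cpu); this
#       argument exists only so the sketch can be tried on a copy.
cache_ids() {
    level=$1
    root=${2:-/sys/devices/system/cpu}
    for idx in "$root"/cpu[0-9]*/cache/index[0-9]*; do
        [ -r "$idx/level" ] && [ -r "$idx/id" ] || continue
        # Skip index directories that describe a different cache level.
        [ "$(cat "$idx/level")" = "$level" ] || continue
        cpu=${idx#"$root"/}     # strip the root prefix
        cpu=${cpu%%/*}          # keep only the "cpuN" component
        echo "$cpu $(cat "$idx/id")"
    done
}
```

Calling cache_ids 3 lists the L3 Cache ID of each logical CPU. CPUs that print the same ID share that cache and therefore share one entry in the schemata line.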
				239
				240	Cache Bit Masks (CBM)
				241	---------------------
				242	For cache resources we describe the portion of the cache that is available
				243	for allocation using a bitmask. The maximum value of the mask is defined
				244	by each cpu model (and may be different for different cache levels). It
				245	is found using CPUID, but is also provided in the "info" directory of
				246	the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
				247	requires that these masks have all the '1' bits in a contiguous block. So
				248	0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
				249	and 0xA are not. On a system with a 20-bit mask each bit represents 5%
				250	of the capacity of the cache. You could partition the cache into four
				251	equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
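The contiguity rule and the equal partitioning described above can be checked with a little shell arithmetic. This is a sketch only; the helper names cbm_is_contiguous and cbm_partition are made up for illustration, and cbm_partition assumes the mask width is a multiple of the number of parts.

```shell
# Succeed if all '1' bits of the mask form one contiguous block.
cbm_is_contiguous() {
    m=$(( $1 ))
    [ "$m" -ne 0 ] || return 1
    # Drop trailing zero bits, then test that what is left is 2^k - 1.
    while [ $(( m & 1 )) -eq 0 ]; do
        m=$(( m >> 1 ))
    done
    [ $(( m & (m + 1) )) -eq 0 ]
}

# Print <parts> equal, non-overlapping masks for a <width>-bit CBM.
#   $1: width of the mask in bits, $2: number of parts
cbm_partition() {
    width=$1
    parts=$2
    bits=$(( width / parts ))
    chunk=$(( (1 << bits) - 1 ))
    i=0
    while [ "$i" -lt "$parts" ]; do
        printf '%x\n' $(( chunk << (i * bits) ))
        i=$(( i + 1 ))
    done
}
```

With these helpers, cbm_is_contiguous 0x6 succeeds and cbm_is_contiguous 0x5 fails, and cbm_partition 20 4 prints 1f, 3e0, 7c00 and f8000, matching the four masks listed above.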
				252
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	253	Memory bandwidth(b/w) percentage
				254	--------------------------------
				255	For Memory b/w resource, user controls the resource by indicating the
				256	percentage of total memory b/w.
				257
				258	The minimum bandwidth percentage value for each cpu model is predefined
				259	and can be looked up through "info/MB/min_bandwidth". The bandwidth
				260	granularity that is allocated is also dependent on the cpu model and can
				261	be looked up at "info/MB/bandwidth_gran". The available bandwidth
				262	control steps are: min_bw + N * bw_gran. Intermediate values are rounded
				263	to the next control step available on the hardware.
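The control-step arithmetic can be sketched as follows. The sketch reads "next control step" as rounding up, caps the result at 100, and raises requests below the minimum to the minimum; these choices, and the function name mb_round, are assumptions of the example and do not describe how the kernel validates input.

```shell
# Round a requested percentage up to the next control step
# min_bw + N * bw_gran.
#   $1: min_bandwidth, $2: bandwidth_gran, $3: requested percentage
mb_round() {
    min_bw=$1
    bw_gran=$2
    req=$3
    if [ "$req" -le "$min_bw" ]; then
        echo "$min_bw"
        return
    fi
    # Number of whole granules above the minimum, rounded up.
    steps=$(( (req - min_bw + bw_gran - 1) / bw_gran ))
    bw=$(( min_bw + steps * bw_gran ))
    [ "$bw" -le 100 ] || bw=100
    echo "$bw"
}
```

With min_bandwidth 10 and bandwidth_gran 10, mb_round 10 10 35 prints 40, while a request of exactly 50 stays at 50.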
				264
Bandwidth throttling is a core-specific mechanism on some Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	269
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	270	L3 schemata file details (code and data prioritization disabled)
				271	----------------------------------------------------------------
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	272	With CDP disabled the L3 schemata format is:
				273
				274	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
				275
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	276	L3 schemata file details (CDP enabled via mount option to resctrl)
				277	------------------------------------------------------------------
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	278	When CDP is enabled L3 control is split into two separate resources
				279	so you can specify independent masks for code and data like this:
				280
	L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
				283
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	284	L2 schemata file details
				285	------------------------
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	286	L2 cache does not support code and data prioritization, so the
				287	schemata format is always:
				288
				289	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
				290
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	291	Memory b/w Allocation details
				292	-----------------------------
				293
				294	Memory b/w domain is L3 cache.
				295
				296	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
				297
Tony Luck	c4026b7b	2017-04-03 14:44:16 -0700	[diff] [blame]	298	Reading/writing the schemata file
				299	---------------------------------
				300	Reading the schemata file will show the state of all resources
				301	on all domains. When writing you only need to specify those values
				302	which you wish to change. E.g.
				303
				304	# cat schemata
				305	L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
				306	L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
				307	# echo "L3DATA:2=3c0;" > schemata
				308	# cat schemata
				309	L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
				310	L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
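Since every schemata line has the form "resource:id=value;id=value;...", a single domain's setting can be extracted with standard text tools. The helper below is a sketch; its name schemata_get is hypothetical, and it tolerates leading whitespace before the resource name as a precaution.

```shell
# Print the value configured for one domain of one resource.
#   $1: schemata file, $2: resource name (e.g. L3DATA), $3: cache id
schemata_get() {
    # Select the resource line, drop the "NAME:" prefix, split on ';'.
    grep "^[[:space:]]*$2:" "$1" | sed 's/^[^:]*://' | tr ';' '\n' |
    while IFS='=' read -r id value; do
        if [ "$id" = "$3" ]; then
            echo "$value"
        fi
    done
}
```

After the write shown above, schemata_get schemata L3DATA 2 would print 3c0, while the L3CODE entry for the same cache ID remains fffff.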
				311
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	312	Examples for RDT allocation usage:
				313
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	314	Example 1
				315	---------
				316	On a two socket machine (one L3 cache per socket) with just four bits
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	317	for cache bit masks, minimum b/w of 10% with a memory bandwidth
				318	granularity of 10%
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	319
				320	# mount -t resctrl resctrl /sys/fs/resctrl
				321	# cd /sys/fs/resctrl
				322	# mkdir p0 p1
# echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	325
				326	The default resource group is unmodified, so we have access to all parts
				327	of all caches (its schemata file reads "L3:0=f;1=f").
				328
				329	Tasks that are under the control of group "p0" may only allocate from the
				330	"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
				331	Tasks in group "p1" use the "lower" 50% of cache on both sockets.
				332
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	333	Similarly, tasks that are under the control of group "p0" may use a
				334	maximum memory b/w of 50% on socket0 and 50% on socket 1.
				335	Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.
				340
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	341	Example 2
				342	---------
				343	Again two sockets, but this time with a more realistic 20-bit mask.
				344
Two real time tasks, pid=1234 running on processor 0 and pid=5678 running
on processor 1, both on socket 0 of a 2-socket, dual-core machine. To avoid
noisy neighbors, each of the two real-time tasks exclusively occupies one
quarter of L3 cache on socket 0.
				349
				350	# mount -t resctrl resctrl /sys/fs/resctrl
				351	# cd /sys/fs/resctrl
				352
				353	First we reset the schemata for the default group so that the "upper"
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	354	50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
				355	ordinary tasks:
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	356
# echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	358
				359	Next we make a resource group for our first real time task and give
				360	it access to the "top" 25% of the cache on socket 0.
				361
				362	# mkdir p0
				363	# echo "L3:0=f8000;1=fffff" > p0/schemata
				364
				365	Finally we move our first real time task into this resource group. We
				366	also use taskset(1) to ensure the task always runs on a dedicated CPU
				367	on socket 0. Most uses of resource groups will also constrain which
				368	processors tasks run on.
				369
				370	# echo 1234 > p0/tasks
				371	# taskset -cp 1 1234
				372
				373	Ditto for the second real time task (with the remaining 25% of cache):
				374
				375	# mkdir p1
				376	# echo "L3:0=7c00;1=fffff" > p1/schemata
				377	# echo 5678 > p1/tasks
				378	# taskset -cp 2 5678
				379
For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assume min_bandwidth is 10 and
bandwidth_gran is 10):

For our first real time task this would request 20% memory b/w on socket
0.

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.

# echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
				393
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	394	Example 3
				395	---------
				396
A single socket system which has real-time tasks running on cores 4-7 and
a non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per-task association is not required, and due to
interaction with the kernel it is desired that the kernel on these cores
shares L3 with the tasks.
				402
				403	# mount -t resctrl resctrl /sys/fs/resctrl
				404	# cd /sys/fs/resctrl
				405
				406	First we reset the schemata for the default group so that the "upper"
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	407	50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
				408	cannot be used by ordinary tasks:
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	409
# echo -e "L3:0=3ff\nMB:0=50" > schemata
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	411
Vikas Shivappa	a9cad3d	2017-04-07 17:33:50 -0700	[diff] [blame]	412	Next we make a resource group for our real time cores and give it access
				413	to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
				414	socket 0.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	415
				416	# mkdir p0
# echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	418
Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
Fenghua Yu	f20e578	2016-10-28 15:04:40 -0700	[diff] [blame]	423
Xiaochen Shen	fb8fb46	2017-05-03 11:15:56 +0800	[diff] [blame]	424	# echo F0 > p0/cpus
Marcelo Tosatti	3c2a769	2016-12-14 15:08:37 -0200	[diff] [blame]	425
Locking between applications
----------------------------
				427
				428	Certain operations on the resctrl filesystem, composed of read/writes
				429	to/from multiple files, must be atomic.
				430
				431	As an example, the allocation of an exclusive reservation of L3 cache
				432	involves:
				433
  1. Read the CBMs from the "schemata" file in each directory
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory CBMs
  3. Create a new directory
  4. Write the bits found in step 2 to the new directory's "schemata" file
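Step 2 of the list above can be sketched as a search over bit positions. The function name find_free_bits is hypothetical, and the sketch covers only the arithmetic: it does not read or write any resctrl files, and it does not by itself make the sequence atomic.

```shell
# Print the lowest run of <count> contiguous bits that is set in the full
# mask and clear in every mask already in use.
#   $1: full mask, $2: count, remaining arguments: masks in use
# Prints nothing and fails if no such run exists.
find_free_bits() {
    full=$(( $1 ))
    count=$2
    shift 2
    used=0
    for m in "$@"; do
        used=$(( used | $m ))
    done
    free=$(( full & ~used ))
    want=$(( (1 << count) - 1 ))
    shift_by=0
    while [ $(( want << shift_by )) -le "$full" ]; do
        cand=$(( want << shift_by ))
        # The candidate fits only if all of its bits are still free.
        if [ $(( cand & free )) -eq "$cand" ]; then
            printf '%x\n' "$cand"
            return 0
        fi
        shift_by=$(( shift_by + 1 ))
    done
    return 1
}
```

For a 20-bit mask where 0x3ff is already in use, find_free_bits 0xfffff 5 0x3ff prints 7c00.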
				439
				440	If two applications attempt to allocate space concurrently then they can
				441	end up allocating the same bits so the reservations are shared instead of
				442	exclusive.
				443
				444	To coordinate atomic operations on the resctrlfs and to avoid the problem
				445	above, the following locking procedure is recommended:
				446
Locking is based on flock, which is available in libc and also as a
shell command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If successful, read the directory structure.
 C) Release the lock with flock(LOCK_UN)
				461
				462	Example with bash:
				463
				464	# Atomically read directory structure
				465	$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
				466
				467	# Read directory contents and create new subdirectory
				468
$ cat create-dir.sh
#!/bin/sh
find /sys/fs/resctrl/ > output.txt
# function-of is a placeholder for site-specific logic that picks a mask
mask=$(function-of output.txt)
mkdir /sys/fs/resctrl/newres/
echo "$mask" > /sys/fs/resctrl/newres/schemata
				474
				475	$ flock /sys/fs/resctrl/ ./create-dir.sh
				476
				477	Example with C:
				478
/*
 * Example code to take advisory locks
 * before accessing resctrl filesystem
 */
#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

void resctrl_take_shared_lock(int fd)
{
	int ret;

	/* take shared lock on resctrl filesystem */
	ret = flock(fd, LOCK_SH);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_take_exclusive_lock(int fd)
{
	int ret;

	/* take exclusive lock on resctrl filesystem */
	ret = flock(fd, LOCK_EX);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_release_lock(int fd)
{
	int ret;

	/* release lock on resctrl filesystem */
	ret = flock(fd, LOCK_UN);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

int main(void)
{
	int fd;

	fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
	if (fd == -1) {
		perror("open");
		exit(-1);
	}
	resctrl_take_shared_lock(fd);
	/* code to read directory contents */
	resctrl_release_lock(fd);

	resctrl_take_exclusive_lock(fd);
	/* code to read and write directory contents */
	resctrl_release_lock(fd);

	return 0;
}
Vikas Shivappa	1640ae9	2017-07-25 14:14:21 -0700	[diff] [blame]	539
				540	Examples for RDT Monitoring along with allocation usage:
				541
				542	Reading monitored data
				543	----------------------
Reading an event file (e.g. mon_data/mon_L3_00/llc_occupancy) shows
the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.
				547
				548
				549	Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
				550	---------
				551	On a two socket machine (one L3 cache per socket) with just four bits
				552	for cache bit masks
				553
				554	# mount -t resctrl resctrl /sys/fs/resctrl
				555	# cd /sys/fs/resctrl
				556	# mkdir p0 p1
				557	# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
				558	# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
				559	# echo 5678 > p1/tasks
				560	# echo 5679 > p1/tasks
				561
				562	The default resource group is unmodified, so we have access to all parts
				563	of all caches (its schemata file reads "L3:0=f;1=f").
				564
				565	Tasks that are under the control of group "p0" may only allocate from the
				566	"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
				567	Tasks in group "p1" use the "lower" 50% of cache on both sockets.
				568
				569	Create monitor groups and assign a subset of tasks to each monitor group.
				570
				571	# cd /sys/fs/resctrl/p1/mon_groups
				572	# mkdir m11 m12
				573	# echo 5678 > m11/tasks
				574	# echo 5679 > m12/tasks
				575
				576	fetch data (data shown in bytes)
				577
				578	# cat m11/mon_data/mon_L3_00/llc_occupancy
				579	16234000
				580	# cat m11/mon_data/mon_L3_01/llc_occupancy
				581	14789000
				582	# cat m12/mon_data/mon_L3_00/llc_occupancy
				583	16789000
				584
The parent CTRL_MON group shows the aggregated data.

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
				588	31234000
				589
				590	Example 2 (Monitor a task from its creation)
				591	---------
				592	On a two socket machine (one L3 cache per socket)
				593
				594	# mount -t resctrl resctrl /sys/fs/resctrl
				595	# cd /sys/fs/resctrl
				596	# mkdir p0 p1
				597
An RMID is allocated to the group once it is created, and hence the <cmd>
below is monitored from its creation.
				600
				601	# echo $$ > /sys/fs/resctrl/p1/tasks
				602	# <cmd>
				603
				604	Fetch the data
				605
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
				607	31789000
				608
				609	Example 3 (Monitor without CAT support or before creating CAT groups)
				610	---------
				611
Assume a system like HSW has only CQM and no CAT support. In this case
resctrl will still mount but CTRL_MON directories cannot be created.
The user can still create different MON groups within the root group and
thereby monitor all tasks, including kernel threads.

This can also be used to profile a job's cache footprint before
assigning it to an allocation group.
				619
				620	# mount -t resctrl resctrl /sys/fs/resctrl
				621	# cd /sys/fs/resctrl
				622	# mkdir mon_groups/m01
				623	# mkdir mon_groups/m02
				624
				625	# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
				626	# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
				627
Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.

# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
34555
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
32789
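When a single figure per group is more convenient than per-domain values, the domain files can be summed. The sketch below assumes the "mon_data/mon_L3_XX" layout described earlier and that every file read contains a plain byte count; the function name llc_total is hypothetical.

```shell
# Sum llc_occupancy over all L3 domains of one group directory.
#   $1: group directory, e.g. /sys/fs/resctrl/mon_groups/m01
llc_total() {
    total=0
    for f in "$1"/mon_data/mon_L3_*/llc_occupancy; do
        # Skip the unexpanded pattern when no domain directory exists.
        [ -r "$f" ] || continue
        total=$(( total + $(cat "$f") ))
    done
    echo "$total"
}
```

For the m01 values shown above (31234000 and 34555), llc_total /sys/fs/resctrl/mon_groups/m01 would print 31268555.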
				640
				641
				642	Example 4 (Monitor real time tasks)
				643	-----------------------------------
				644
				645	A single socket system which has real time tasks running on cores 4-7
				646	and non real time tasks on other cpus. We want to monitor the cache
				647	occupancy of the real time threads on these cores.
				648
				649	# mount -t resctrl resctrl /sys/fs/resctrl
				650	# cd /sys/fs/resctrl
				651	# mkdir p1
				652
Move the cpus 4-7 over to p1

# echo f0 > p1/cpus
				655
				656	View the llc occupancy snapshot
				657
				658	# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
				659	11234000