User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>

This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".

To use the feature mount the file system:

 # mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl

mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.


Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.
Cache resource (L3/L2) subdirectories contain the following files:

"num_closids":  The number of CLOSIDs which are valid for this
                resource. The kernel uses the smallest number of
                CLOSIDs of all enabled resources as limit.

"cbm_mask":     The bitmask which is valid for this resource.
                This mask is equivalent to 100%.

"min_cbm_bits": The minimum number of consecutive bits which
                must be set when writing a mask.

The memory bandwidth (MB) subdirectory contains the following files:

"min_bandwidth":  The minimum memory bandwidth percentage which
                  a user can request.

"bandwidth_gran": The granularity in which the memory bandwidth
                  percentage is allocated. The allocated
                  b/w percentage is rounded off to the next
                  control step available on the hardware. The
                  available bandwidth control steps are:
                  min_bandwidth + N * bandwidth_gran.

"delay_linear":   Indicates if the delay scale is linear or
                  non-linear. This field is purely informational.

Resource groups
---------------
Resource groups are represented as directories in the resctrl file
system. The default group is the root directory. Other groups may be
created as desired by the system administrator using the "mkdir(1)"
command, and removed using "rmdir(1)".

There are four files associated with each group:

"tasks": A list of tasks that belong to this group. Tasks can be
        added to a group by writing the task ID to the "tasks" file
        (which will automatically remove them from the previous
        group to which they belonged). New tasks created by fork(2)
        and clone(2) are added to the same group as their parent.
        If a pid is not in any sub partition, it is in the root
        partition (i.e. the default partition).

"cpus": A bitmask of logical CPUs assigned to this group. Writing
        a new mask can add/remove CPUs from this group. Added CPUs
        are removed from their previous group. Removed ones are
        given to the default (root) group. You cannot remove CPUs
        from the default group.

"cpus_list": One or more CPU ranges of logical CPUs assigned to this
        group. The same rules apply as for the "cpus" file.

"schemata": A list of all the resources available to this group.
        Each resource has its own line and format - see below for
        details.

When a task is running the following rules define which resources
are available to it:

1) If the task is a member of a non-default group, then the schemata
for that group is used.

2) Else if the task belongs to the default group, but is running on a
CPU that is assigned to some specific group, then the schemata for
the CPU's group is used.

3) Otherwise the schemata for the default group is used.
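
The three rules above can be read as a simple priority order. The
following is only a sketch of that order, not the kernel implementation.
The function name and the convention that group number 0 stands for the
default (root) group are assumptions made for this illustration.

```c
/* Assumption for this sketch: group 0 is the default (root) group. */
#define DEFAULT_GROUP 0

/*
 * Return the group whose schemata applies to a running task.
 * task_group: group whose "tasks" file lists the task
 * cpu_group:  group whose "cpus" mask contains the CPU the task runs on
 */
int effective_group(int task_group, int cpu_group)
{
	if (task_group != DEFAULT_GROUP)	/* rule 1 */
		return task_group;
	if (cpu_group != DEFAULT_GROUP)		/* rule 2 */
		return cpu_group;
	return DEFAULT_GROUP;			/* rule 3 */
}
```

A task association always wins over a CPU association, so moving a CPU
into a group only affects tasks that are still in the default group.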


Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
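
As an illustration of the lookup described above, the sketch below reads
the cache ID for one logical CPU from sysfs. The helper names are
hypothetical. Which "index" directory corresponds to which cache level
differs between systems, so a caller would normally check the "level"
file in the same directory first.

```c
#include <stdio.h>

/*
 * Build the sysfs path that holds the cache ID for one cache of one
 * logical CPU. Returns the path length, or -1 if buf is too small.
 */
int cache_id_path(char *buf, size_t len, int cpu, int index)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/cpu/cpu%d/cache/index%d/id",
			 cpu, index);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Read the cache ID; returns -1 if the file is missing or unreadable. */
int read_cache_id(int cpu, int index)
{
	char path[128];
	int id = -1;
	FILE *f;

	if (cache_id_path(path, sizeof(path), cpu, index) < 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &id) != 1)
		id = -1;
	fclose(f);
	return id;
}
```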

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
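
The contiguity rule and the equal-parts example can be checked with a
few lines of C. This is a sketch under stated assumptions: masks fit in
32 bits, and the function names are made up for this illustration. The
cbm_mask and min_cbm_bits arguments correspond to the values found in
the "info" directory.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if cbm is non-zero and all of its '1' bits form one block. */
bool cbm_is_contiguous(uint32_t cbm)
{
	if (cbm == 0)
		return false;
	while (!(cbm & 1))		/* drop trailing zero bits */
		cbm >>= 1;
	return (cbm & (cbm + 1)) == 0;	/* remaining value is 2^n - 1 */
}

/* True if cbm lies inside cbm_mask, is contiguous and wide enough. */
bool cbm_is_valid(uint32_t cbm, uint32_t cbm_mask, unsigned int min_cbm_bits)
{
	unsigned int bits = 0;
	uint32_t tmp;

	if ((cbm & ~cbm_mask) || !cbm_is_contiguous(cbm))
		return false;
	for (tmp = cbm; tmp; tmp &= tmp - 1)	/* count set bits */
		bits++;
	return bits >= min_cbm_bits;
}

/*
 * Mask for part idx (0 = lowest bits) when an nbits wide mask is split
 * into "parts" equal parts. Returns 0 if the split is not possible.
 */
uint32_t cbm_equal_part(unsigned int nbits, unsigned int parts,
			unsigned int idx)
{
	unsigned int width;
	uint32_t run;

	if (nbits == 0 || nbits > 32 || parts == 0 || idx >= parts ||
	    nbits % parts)
		return 0;
	width = nbits / parts;
	run = (width == 32) ? UINT32_MAX : (UINT32_C(1) << width) - 1;
	return run << (idx * width);
}
```

With nbits = 20 and parts = 4, indexes 0 to 3 give the four masks listed
above.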

Memory bandwidth (b/w) percentage
---------------------------------
For the memory b/w resource, the user controls the resource by indicating
the percentage of total memory b/w.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.
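
The step formula can be worked through in code. The sketch below reads
"rounded to the next control step" as rounding up to the nearest value
of the form min_bw + N * bw_gran, capped at 100. That reading, and the
function name, are assumptions of this example. The kernel's actual
rounding may differ in detail.

```c
/*
 * Round a requested bandwidth percentage up to the next control step.
 * Returns the rounded percentage, or -1 if the request is below
 * min_bw, above 100, or bw_gran is zero.
 */
int mb_round_to_step(unsigned int pct, unsigned int min_bw,
		     unsigned int bw_gran)
{
	unsigned int steps, rounded;

	if (bw_gran == 0 || pct < min_bw || pct > 100)
		return -1;
	/* number of whole steps above min_bw, rounded up */
	steps = (pct - min_bw + bw_gran - 1) / bw_gran;
	rounded = min_bw + steps * bw_gran;
	return rounded > 100 ? 100 : (int)rounded;
}
```

For example, with min_bw = 10 and bw_gran = 10 a request for 33 lands on
the 40 step, while a request for 50 is already on a step and is
unchanged.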

The bandwidth throttling is a core specific mechanism on some Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.

L3 details (code and data prioritization disabled)
--------------------------------------------------
With CDP disabled the L3 schemata format is:

        L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 details (CDP enabled via mount option to resctrl)
----------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

        L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 details
----------
L2 cache does not support code and data prioritization, so the
schemata format is always:

        L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

Memory b/w Allocation details
-----------------------------

Memory b/w domain is L3 cache.

        MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.

# cat schemata
L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
# echo "L3DATA:2=3c0;" > schemata
# cat schemata
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
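
The partial-write behaviour shown above can be modelled in user space.
The following sketch applies the "<id>=<mask>;..." part of one cache
schemata line to an array of per-domain masks and leaves unmentioned
domains untouched. It is an illustration only: the function name is
hypothetical, cache IDs are assumed to be usable as array indexes, and
values are parsed as hex (MB lines use decimal percentages and would
need a different base).

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Apply "<id>=<hexmask>;<id>=<hexmask>..." to vals[0..n-1].
 * Returns 0 on success, -1 on a parse error or an unknown id.
 * Entries parsed before an error stay applied in this simple sketch.
 */
int schemata_apply(const char *list, unsigned long *vals, size_t n)
{
	while (*list) {
		unsigned long id, val;
		char *end;

		id = strtoul(list, &end, 10);
		if (end == list || *end != '=')
			return -1;
		list = end + 1;
		val = strtoul(list, &end, 16);
		if (end == list || id >= n)
			return -1;
		vals[id] = val;
		if (*end == ';')
			end++;			/* optional separator */
		else if (*end != '\0')
			return -1;
		list = end;
	}
	return 0;
}
```

Applying "2=3c0;" to four domains that all hold fffff changes only
domain 2, matching the second "cat schemata" output above.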

Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocations specify the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks:

# echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assuming min_bandwidth is 10 and
bandwidth_gran is 10):

For our first real time task this would request 20% memory b/w on socket
0.

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.

# echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

Example 3
---------

A single socket system which has real-time tasks running on cores 4-7 and
non real-time workload assigned to cores 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks:

# echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.

# mkdir p0
# echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.

# echo F0 > p0/cpus
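
The value F0 written above is simply the bitmask with bits 4 to 7 set.
A small helper can derive such a mask from a CPU range. This is a sketch
with a hypothetical function name, limited to the first 64 logical CPUs.

```c
#include <stdint.h>

/*
 * Bitmask with bits first..last set (inclusive).
 * Returns 0 if the range is empty or reaches beyond CPU 63.
 */
uint64_t cpu_range_mask(unsigned int first, unsigned int last)
{
	unsigned int width;
	uint64_t run;

	if (first > last || last >= 64)
		return 0;
	width = last - first + 1;
	run = (width == 64) ? UINT64_MAX : (UINT64_C(1) << width) - 1;
	return run << first;
}
```

cpu_range_mask(4, 7) yields 0xf0. Printed in hex, that is the string
written to p0/cpus in this example.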

Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the CBMs from the "schemata" file in each directory
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory CBMs
  3. Create a new directory
  4. Write the bits found in step 2 to the new directory's "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.
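
The search for unused bits in the allocation sequence above could look
like the sketch below. It assumes the masks of all existing directories
have already been read into an array and that masks fit in 32 bits. The
function name is hypothetical. The result is only meaningful if no other
application changes any schemata between the read and the final write.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Find "want" contiguous bits inside cbm_mask that are clear in every
 * entry of used[0..n-1]. Returns the lowest such mask, or 0 if there
 * is no free block of that size.
 */
uint32_t cbm_find_free(uint32_t cbm_mask, const uint32_t *used, size_t n,
		       unsigned int want)
{
	uint32_t busy = 0, run;
	unsigned int shift;
	size_t i;

	if (want == 0 || want > 32)
		return 0;
	for (i = 0; i < n; i++)
		busy |= used[i];	/* bits claimed by any directory */
	run = (want == 32) ? UINT32_MAX : (UINT32_C(1) << want) - 1;
	for (shift = 0; shift + want <= 32; shift++) {
		uint32_t cand = run << shift;

		if ((cand & ~cbm_mask) == 0 && (cand & busy) == 0)
			return cand;
	}
	return 0;
}
```

With a 20-bit cbm_mask and existing masks 0x3ff and 0xf8000, a request
for five bits returns 0x7c00, and a request for six bits fails.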

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) On success, read the directory structure.
 C) Release the lock with flock(LOCK_UN)

Example with bash:

# Atomically read directory structure
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

# Read directory contents and create new subdirectory

$ cat create-dir.sh
#!/bin/sh
# "function-of" is a placeholder for a user-supplied command that
# computes the new schemata line from the directory listing.
find /sys/fs/resctrl/ > output.txt
mask=$(function-of output.txt)
mkdir /sys/fs/resctrl/newres/
echo "$mask" > /sys/fs/resctrl/newres/schemata

$ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C:

/*
 * Example code to take advisory locks
 * before accessing resctrl filesystem
 */
#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void resctrl_take_shared_lock(int fd)
{
	int ret;

	/* take shared lock on resctrl filesystem */
	ret = flock(fd, LOCK_SH);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_take_exclusive_lock(int fd)
{
	int ret;

	/* take exclusive lock on resctrl filesystem */
	ret = flock(fd, LOCK_EX);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

void resctrl_release_lock(int fd)
{
	int ret;

	/* release lock on resctrl filesystem */
	ret = flock(fd, LOCK_UN);
	if (ret) {
		perror("flock");
		exit(-1);
	}
}

int main(void)
{
	int fd;

	fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
	if (fd == -1) {
		perror("open");
		exit(-1);
	}
	resctrl_take_shared_lock(fd);
	/* code to read directory contents */
	resctrl_release_lock(fd);

	resctrl_take_exclusive_lock(fd);
	/* code to read and write directory contents */
	resctrl_release_lock(fd);

	close(fd);
	return 0;
}
412}