Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hw events, such
as instructions executed, cache misses suffered, or branches mispredicted,
without slowing down the kernel or applications. These registers can also
trigger interrupts when a threshold number of events has passed, and can
thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per-task and per-CPU counters, counter
groups, and event capabilities on top of those. It also provides "virtual"
64-bit counters, regardless of the width of the underlying hardware
counters.
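
One common way to present a 64-bit value on top of a narrower register is
to accumulate the difference between successive raw readings, modulo the
hardware width. The sketch below illustrates that idea only; it is not the
kernel's implementation, and 'struct virt_counter' and
virt_counter_update() are hypothetical names. It assumes the register
wraps at most once between two readings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one virtual counter; not a kernel structure. */
struct virt_counter {
	uint64_t total;     /* accumulated 64-bit value */
	uint64_t prev_raw;  /* last raw hardware reading */
	unsigned width;     /* hardware counter width in bits, 1..63 */
};

/* Fold a new raw reading into the 64-bit total, tolerating one wrap-around. */
static uint64_t virt_counter_update(struct virt_counter *c, uint64_t raw)
{
	uint64_t mask;

	assert(c->width > 0 && c->width < 64);
	mask = (UINT64_C(1) << c->width) - 1;

	/* Modular subtraction yields the right delta even if raw wrapped. */
	c->total += (raw - c->prev_raw) & mask;
	c->prev_raw = raw & mask;
	return c->total;
}
```

With a 32-bit register last read at 0xfffffff0, a new reading of 0x10 adds
0x20 events to the total even though the raw value went "backwards".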

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the perf_counter_open()
system call:

   int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
                             pid_t pid, int cpu, int group_fd,
                             unsigned long flags);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.

When creating a new counter fd, 'perf_counter_hw_event' is:

/*
 * Event to monitor via a performance monitoring counter:
 */
struct perf_counter_hw_event {
        __u64   event_config;

        __u64   irq_period;
        __u64   record_type;
        __u64   read_format;

        __u64   disabled       :  1, /* off by default */
                nmi            :  1, /* NMI sampling */
                inherit        :  1, /* children inherit it */
                pinned         :  1, /* must always be on PMU */
                exclusive      :  1, /* only group on PMU */
                exclude_user   :  1, /* don't count user */
                exclude_kernel :  1, /* ditto kernel */
                exclude_hv     :  1, /* ditto hypervisor */
                exclude_idle   :  1, /* don't count when idle */

                __reserved_1   : 55;

        __u32   extra_config_len;

        __u32   __reserved_4;
        __u64   __reserved_2;
        __u64   __reserved_3;
};

The 'event_config' field specifies what the counter should count. It
is divided into 3 bit-fields:

raw_type: 1 bit   (most significant bit)        0x8000_0000_0000_0000
type:     7 bits  (next most significant)       0x7f00_0000_0000_0000
event_id: 56 bits (least significant)           0x00ff_ffff_ffff_ffff

If 'raw_type' is 1, then the counter will count a hardware event
specified by the remaining 63 bits of event_config. The encoding is
machine-specific.

If 'raw_type' is 0, then the 'type' field says what kind of counter
this is, with the following encoding:

enum perf_event_types {
        PERF_TYPE_HARDWARE      = 0,
        PERF_TYPE_SOFTWARE      = 1,
        PERF_TYPE_TRACEPOINT    = 2,
};

A counter of PERF_TYPE_HARDWARE will count the hardware event
specified by 'event_id':

/*
 * Generalized performance counter event types, used by the hw_event.event_id
 * parameter of the sys_perf_counter_open() syscall:
 */
enum hw_event_ids {
        /*
         * Common hardware events, generalized by the kernel:
         */
        PERF_COUNT_CPU_CYCLES           = 0,
        PERF_COUNT_INSTRUCTIONS         = 1,
        PERF_COUNT_CACHE_REFERENCES     = 2,
        PERF_COUNT_CACHE_MISSES         = 3,
        PERF_COUNT_BRANCH_INSTRUCTIONS  = 4,
        PERF_COUNT_BRANCH_MISSES        = 5,
        PERF_COUNT_BUS_CYCLES           = 6,
};

These are standardized types of events that work relatively uniformly
on all CPUs that implement Performance Counters support under Linux,
although there may be variations (e.g., different CPUs might count
cache references and misses at different levels of the cache hierarchy).
If a CPU is not able to count the selected event, then the system call
will return -EINVAL.

More hardware events are supported as well, but they are CPU-specific
and accessed as raw events. For example, to count "External bus
cycles while bus lock signal asserted" events on Intel Core CPUs, pass
in an event_id value of 0x4064 and set the raw_type bit of event_config
to 1.
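
The event_config layout described above can be packed with a few shifts
and masks. The following is a minimal sketch: the macro and function
names are hypothetical (they are not part of the kernel interface), and
it takes event_id to be the low 56 bits of the word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the three bit-fields of event_config. */
#define EVENT_RAW_BIT    (UINT64_C(1) << 63)        /* raw_type: 1 bit */
#define EVENT_TYPE_SHIFT 56                         /* type: next 7 bits */
#define EVENT_TYPE_MASK  UINT64_C(0x7f)
#define EVENT_ID_MASK    ((UINT64_C(1) << 56) - 1)  /* event_id: low 56 bits */

/* Build an event_config for a generalized (raw_type == 0) event. */
static uint64_t make_event_config(unsigned type, uint64_t event_id)
{
	assert(type <= EVENT_TYPE_MASK);
	assert((event_id & ~EVENT_ID_MASK) == 0);
	return ((uint64_t)type << EVENT_TYPE_SHIFT) | event_id;
}

/* Build an event_config for a machine-specific raw event (63-bit code). */
static uint64_t make_raw_event_config(uint64_t raw)
{
	assert((raw & EVENT_RAW_BIT) == 0);
	return EVENT_RAW_BIT | raw;
}
```

For instance, make_event_config(1, 2) describes a PERF_TYPE_SOFTWARE
counter for PERF_COUNT_PAGE_FAULTS, and make_raw_event_config(0x4064)
encodes the Intel Core bus-lock example as 0x8000000000004064.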

A counter of type PERF_TYPE_SOFTWARE will count one of the available
software events, selected by 'event_id':

/*
 * Special "software" counters provided by the kernel, even if the hardware
 * does not support performance counters. These counters measure various
 * physical and sw events of the kernel (and allow the profiling of them as
 * well):
 */
enum sw_event_ids {
        PERF_COUNT_CPU_CLOCK            = 0,
        PERF_COUNT_TASK_CLOCK           = 1,
        PERF_COUNT_PAGE_FAULTS          = 2,
        PERF_COUNT_CONTEXT_SWITCHES     = 3,
        PERF_COUNT_CPU_MIGRATIONS       = 4,
        PERF_COUNT_PAGE_FAULTS_MIN      = 5,
        PERF_COUNT_PAGE_FAULTS_MAJ      = 6,
};

Counters come in two flavours: counting counters and sampling
counters. A "counting" counter is one that is used for counting the
number of events that occur, and is characterised by having
irq_period = 0 and record_type = PERF_RECORD_SIMPLE. A read() on a
counting counter simply returns the current value of the counter as
an 8-byte number.
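
Reading a counting counter therefore amounts to reading exactly 8 bytes
from its fd. A small helper along these lines could do it;
read_counter_value() is a hypothetical name, and the helper works on any
readable fd, so it makes no assumption about how the fd was obtained.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Read one 8-byte counter value from fd.
 * Returns 0 on success, -1 on error or short read.
 */
static int read_counter_value(int fd, uint64_t *value)
{
	ssize_t n;

	assert(value != NULL);
	do {
		n = read(fd, value, sizeof(*value));
	} while (n < 0 && errno == EINTR);   /* retry if interrupted by a signal */

	return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

A caller would pass the fd returned by the perf_counter_open() system
call and treat a -1 return as "no value available".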

A "sampling" counter is one that is set up to generate an interrupt
every N events, where N is given by 'irq_period'. A sampling counter
has irq_period > 0 and record_type != PERF_RECORD_SIMPLE. The
record_type controls what data is recorded on each interrupt, and the
available values are currently:

/*
 * IRQ-notification data record type:
 */
enum perf_counter_record_type {
        PERF_RECORD_SIMPLE      = 0,
        PERF_RECORD_IRQ         = 1,
        PERF_RECORD_GROUP       = 2,
};

A record_type value of PERF_RECORD_IRQ will record the instruction
pointer (IP) at which the interrupt occurred. A record_type value of
PERF_RECORD_GROUP will record the event_config and counter value of
all of the other counters in the group, and should only be used on a
group leader (see below). Currently these two values are mutually
exclusive, but record_type will become a bit-mask in future and
support other values.
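
The two flavours can be told apart from irq_period and record_type alone.
The sketch below encodes the rule as stated above. The enum and function
names are hypothetical stand-ins for the kernel definitions, and
combinations the text does not describe (for example irq_period > 0 with
PERF_RECORD_SIMPLE) are reported as unspecified rather than guessed at.

```c
#include <stdint.h>

/* Local stand-ins mirroring enum perf_counter_record_type. */
enum record_type {
	RECORD_SIMPLE = 0,
	RECORD_IRQ    = 1,
	RECORD_GROUP  = 2,
};

enum counter_flavour {
	FLAVOUR_COUNTING,
	FLAVOUR_SAMPLING,
	FLAVOUR_UNSPECIFIED,   /* combination not covered by the text */
};

/* Classify a counter from its irq_period and record_type settings. */
static enum counter_flavour classify_counter(uint64_t irq_period,
					     uint64_t record_type)
{
	if (irq_period == 0 && record_type == RECORD_SIMPLE)
		return FLAVOUR_COUNTING;
	if (irq_period > 0 && record_type != RECORD_SIMPLE)
		return FLAVOUR_SAMPLING;
	return FLAVOUR_UNSPECIFIED;
}
```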

A sampling counter has an event queue, into which an event is placed
on each interrupt. A read() on a sampling counter will read the next
event from the event queue. If the queue is empty, the read() will
either block or, if the fd has been set to non-blocking mode, return
an EAGAIN error.

The 'disabled' bit specifies whether the counter starts out disabled
or enabled. If it is initially disabled, it can be enabled by ioctl
or prctl (see below).

The 'nmi' bit specifies, for hardware events, whether the counter
should be set up to request non-maskable interrupts (NMIs) or normal
interrupts. This bit is ignored if the user doesn't have
CAP_SYS_ADMIN privilege (i.e. is not root) or if the CPU doesn't
generate NMIs from hardware counters.

The 'inherit' bit, if set, specifies that this counter should count
events on descendant tasks as well as the task specified. This only
applies to new descendants, not to any existing descendants at the
time the counter is created (nor to any new descendants of existing
descendants).

The 'pinned' bit, if set, specifies that the counter should always be
on the CPU if at all possible. It only applies to hardware counters
and only to group leaders. If a pinned counter cannot be put onto the
CPU (e.g. because there are not enough hardware counters or because of
a conflict with some other event), then the counter goes into an
'error' state, where reads return end-of-file (i.e. read() returns 0)
until the counter is subsequently enabled or disabled.

The 'exclusive' bit, if set, specifies that when this counter's group
is on the CPU, it should be the only group using the CPU's counters.
In future, this will allow sophisticated monitoring programs to supply
extra configuration information via 'extra_config_len' to exploit
advanced features of the CPU's Performance Monitor Unit (PMU) that are
not otherwise accessible and that might disrupt other hardware
counters.

The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
way to restrict counting to particular CPU modes: setting a bit stops
events from being counted while the CPU is in user, kernel or
hypervisor mode, respectively.


The 'pid' parameter to the perf_counter_open() system call allows the
counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu counters)

The 'cpu' parameter allows a counter to be made specific to a CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
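
The pid/cpu rules above can be summarised as a small decision function.
This is only an illustration of the combinations described in the text:
the enum and function names are hypothetical, 'int' stands in for pid_t,
and treating cpu values below -1 as invalid is an assumption, since the
text does not define them.

```c
enum counter_scope {
	SCOPE_INVALID,            /* pid < 0 with cpu == -1, or cpu < -1 */
	SCOPE_TASK_ANY_CPU,       /* per task counter, follows the task */
	SCOPE_TASK_ONE_CPU,       /* one task, counted only on one CPU */
	SCOPE_ALL_TASKS_ONE_CPU,  /* per CPU counter */
};

/* Map the 'pid' and 'cpu' syscall parameters to the scope they select. */
static enum counter_scope classify_scope(int pid, int cpu)
{
	if (cpu < -1)
		return SCOPE_INVALID;   /* assumption: not defined by the text */

	if (pid < 0)
		return cpu == -1 ? SCOPE_INVALID : SCOPE_ALL_TASKS_ONE_CPU;

	/* pid == 0 means the current task, pid > 0 a specific task. */
	return cpu == -1 ? SCOPE_TASK_ANY_CPU : SCOPE_TASK_ONE_CPU;
}
```

classify_scope(0, -1) describes the common case of a counter on the
current task that follows it across CPUs, while classify_scope(-1, 2)
describes a per CPU counter on CPU 2, which needs CAP_SYS_ADMIN.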

The 'flags' parameter is currently unused and must be zero.

The 'group_fd' parameter allows counter "groups" to be set up. A
counter group has one counter which is the group "leader". The leader
is created first, with group_fd = -1 in the perf_counter_open call
that creates it. The rest of the group members are created
subsequently, with group_fd giving the fd of the group leader.
(A single counter on its own is created with group_fd = -1 and is
considered to be a group with only 1 member.)

A counter group is scheduled onto the CPU as a unit, that is, it will
only be put onto the CPU if all of the counters in the group can be
put onto the CPU. This means that the values of the member counters
can be meaningfully compared, added, divided (to get ratios), etc.,
with each other, since they have counted events for the same set of
executed instructions.

Counters can be enabled and disabled in two ways: via ioctl and via
prctl. When a counter is disabled, it doesn't count or generate
events but does continue to exist and maintain its count value.

An individual counter or counter group can be enabled with

        ioctl(fd, PERF_COUNTER_IOC_ENABLE);

or disabled with

        ioctl(fd, PERF_COUNTER_IOC_DISABLE);

Enabling or disabling the leader of a group enables or disables the
whole group; that is, while the group leader is disabled, none of the
counters in the group will count. Enabling or disabling a member of a
group other than the leader only affects that counter - disabling a
non-leader stops that counter from counting but doesn't affect any
other counter.
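
Put differently, a counter counts only while both it and its group leader
are enabled. The toy model below expresses that rule. It is a user-space
illustration, not kernel code: 'struct counter_model' and
counter_is_counting() are hypothetical names, and the model ignores
whether the group is currently scheduled onto the CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one counter's enable state within a group. */
struct counter_model {
	bool enabled;                        /* this counter's own state */
	const struct counter_model *leader;  /* NULL if this is the leader */
};

/* A counter counts only if it and its group leader are both enabled. */
static bool counter_is_counting(const struct counter_model *c)
{
	assert(c != NULL);
	if (!c->enabled)
		return false;
	return c->leader == NULL || c->leader->enabled;
}
```

Disabling the leader makes counter_is_counting() false for every member,
whereas disabling one member leaves the leader and its siblings
unaffected.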

A process can enable or disable all the counter groups that are
attached to it, using prctl:

        prctl(PR_TASK_PERF_COUNTERS_ENABLE);

        prctl(PR_TASK_PERF_COUNTERS_DISABLE);

This applies to all counters on the current process, whether created
by this process or by another, and doesn't affect any counters that
this process has created on other processes. It only enables or
disables the group leaders, not any other members in the groups.