Ingo Molnar | e7bc62b | 2008-12-04 20:13:45 +0100 | [diff] [blame] | 1 | |
Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hw events, such
as instructions executed, cache misses suffered, or branches mispredicted,
without slowing down the kernel or applications. These registers can also
trigger interrupts when a threshold number of events have passed, and can
thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per task and per CPU counters, counter
groups, and it provides event capabilities on top of those.  It
provides "virtual" 64-bit counters, regardless of the width of the
underlying hardware counters.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the perf_counter_open()
system call:

   int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
                             pid_t pid, int cpu, int group_fd,
                             unsigned long flags);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.

When creating a new counter fd, 'perf_counter_hw_event' is:

/*
 * Event to monitor via a performance monitoring counter:
 */
struct perf_counter_hw_event {
        __u64                   event_config;

        __u64                   irq_period;
        __u64                   record_type;
        __u64                   read_format;

        __u64                   disabled       :  1, /* off by default        */
                                nmi            :  1, /* NMI sampling          */
                                inherit        :  1, /* children inherit it   */
                                pinned         :  1, /* must always be on PMU */
                                exclusive      :  1, /* only group on PMU     */
                                exclude_user   :  1, /* don't count user      */
                                exclude_kernel :  1, /* ditto kernel          */
                                exclude_hv     :  1, /* ditto hypervisor      */
                                exclude_idle   :  1, /* don't count when idle */

                                __reserved_1   : 55;

        __u32                   extra_config_len;

        __u32                   __reserved_4;
        __u64                   __reserved_2;
        __u64                   __reserved_3;
};

The 'event_config' field specifies what the counter should count.  It
is divided into 3 bit-fields:

        raw_type: 1 bit   (most significant bit)      0x8000_0000_0000_0000
        type:     7 bits  (next most significant)     0x7f00_0000_0000_0000
        event_id: 56 bits (least significant)         0x00ff_ffff_ffff_ffff

If 'raw_type' is 1, then the counter will count a hardware event
specified by the remaining 63 bits of event_config.  The encoding is
machine-specific.

If 'raw_type' is 0, then the 'type' field says what kind of counter
this is, with the following encoding:

enum perf_event_types {
        PERF_TYPE_HARDWARE      = 0,
        PERF_TYPE_SOFTWARE      = 1,
        PERF_TYPE_TRACEPOINT    = 2,
};

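The bit-field layout of 'event_config' can be sketched in C as follows.
This is only an illustration: the macro and function names are
hypothetical and are not part of the kernel interface, and the shift of
56 is derived from the widths given above (56 bits of event_id below 7
bits of type).

```c
#include <stdint.h>

/* Hypothetical names; the layout follows the three bit-fields above. */
#define EVENT_RAW_TYPE_MASK 0x8000000000000000ULL /* 1 bit,   bit 63      */
#define EVENT_TYPE_MASK     0x7f00000000000000ULL /* 7 bits,  bits 56..62 */
#define EVENT_ID_MASK       0x00ffffffffffffffULL /* 56 bits, bits 0..55  */
#define EVENT_TYPE_SHIFT    56

/* Build a non-raw event_config (raw_type = 0) from a type and an event_id. */
static inline uint64_t event_config_pack(unsigned int type, uint64_t event_id)
{
    return (((uint64_t)type << EVENT_TYPE_SHIFT) & EVENT_TYPE_MASK) |
           (event_id & EVENT_ID_MASK);
}

/* Non-zero if the raw_type bit is set. */
static inline int event_config_is_raw(uint64_t config)
{
    return (config & EVENT_RAW_TYPE_MASK) != 0;
}

/* Extract the 7-bit type field. */
static inline unsigned int event_config_type(uint64_t config)
{
    return (unsigned int)((config & EVENT_TYPE_MASK) >> EVENT_TYPE_SHIFT);
}

/* Extract the 56-bit event_id field. */
static inline uint64_t event_config_id(uint64_t config)
{
    return config & EVENT_ID_MASK;
}
```

For instance, event_config_pack(1, 3) yields 0x0100000000000003, that
is, type 1 (PERF_TYPE_SOFTWARE) with an event_id of 3.
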
A counter of PERF_TYPE_HARDWARE will count the hardware event
specified by 'event_id':

/*
 * Generalized performance counter event types, used by the hw_event.event_id
 * parameter of the sys_perf_counter_open() syscall:
 */
enum hw_event_ids {
        /*
         * Common hardware events, generalized by the kernel:
         */
        PERF_COUNT_CPU_CYCLES           = 0,
        PERF_COUNT_INSTRUCTIONS         = 1,
        PERF_COUNT_CACHE_REFERENCES     = 2,
        PERF_COUNT_CACHE_MISSES         = 3,
        PERF_COUNT_BRANCH_INSTRUCTIONS  = 4,
        PERF_COUNT_BRANCH_MISSES        = 5,
        PERF_COUNT_BUS_CYCLES           = 6,
};

These are standardized types of events that work relatively uniformly
on all CPUs that implement Performance Counters support under Linux,
although there may be variations (e.g., different CPUs might count
cache references and misses at different levels of the cache hierarchy).
If a CPU is not able to count the selected event, then the system call
will return -EINVAL.

More hardware event types are supported as well, but they are
CPU-specific and accessed as raw events.  For example, to count
"External bus cycles while bus lock signal asserted" events on Intel
Core CPUs, pass in a 0x4064 event_id value and set the raw_type bit of
hw_event.event_config to 1.

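As a rough sketch of the raw encoding, the helper below sets the
raw_type bit (the most significant bit of event_config) and places a
machine-specific code in the remaining 63 bits.  The helper and macro
names are made up for illustration.

```c
#include <stdint.h>

#define EVENT_RAW_TYPE_BIT 0x8000000000000000ULL /* most significant bit */
#define EVENT_RAW_MASK     0x7fffffffffffffffULL /* remaining 63 bits    */

/* Hypothetical helper: mark a machine-specific event code as raw. */
static inline uint64_t raw_event_config(uint64_t raw_code)
{
    return EVENT_RAW_TYPE_BIT | (raw_code & EVENT_RAW_MASK);
}
```

With the Intel Core example from the text, raw_event_config(0x4064)
gives an event_config of 0x8000000000004064.
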
A counter of type PERF_TYPE_SOFTWARE will count one of the available
software events, selected by 'event_id':

/*
 * Special "software" counters provided by the kernel, even if the hardware
 * does not support performance counters. These counters measure various
 * physical and sw events of the kernel (and allow the profiling of them as
 * well):
 */
enum sw_event_ids {
        PERF_COUNT_CPU_CLOCK            = 0,
        PERF_COUNT_TASK_CLOCK           = 1,
        PERF_COUNT_PAGE_FAULTS          = 2,
        PERF_COUNT_CONTEXT_SWITCHES     = 3,
        PERF_COUNT_CPU_MIGRATIONS       = 4,
        PERF_COUNT_PAGE_FAULTS_MIN      = 5,
        PERF_COUNT_PAGE_FAULTS_MAJ      = 6,
};

Counters come in two flavours: counting counters and sampling
counters.  A "counting" counter is one that is used for counting the
number of events that occur, and is characterised by having
irq_period = 0 and record_type = PERF_RECORD_SIMPLE.  A read() on a
counting counter simply returns the current value of the counter as
an 8-byte number.

A "sampling" counter is one that is set up to generate an interrupt
every N events, where N is given by 'irq_period'.  A sampling counter
has irq_period > 0 and record_type != PERF_RECORD_SIMPLE.  The
record_type controls what data is recorded on each interrupt, and the
available values are currently:

/*
 * IRQ-notification data record type:
 */
enum perf_counter_record_type {
        PERF_RECORD_SIMPLE      = 0,
        PERF_RECORD_IRQ         = 1,
        PERF_RECORD_GROUP       = 2,
};

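The two flavours can be told apart from 'irq_period' and 'record_type'
alone.  The sketch below encodes the two definitions given above; the
'counter_flavour' enum and classify_counter() are hypothetical, and how
the kernel treats a combination that matches neither definition is not
stated here, so the sketch only reports it as inconsistent.

```c
#include <stdint.h>

/* Values of record_type, copied from the enum above. */
enum perf_counter_record_type {
    PERF_RECORD_SIMPLE = 0,
    PERF_RECORD_IRQ    = 1,
    PERF_RECORD_GROUP  = 2,
};

/* Hypothetical classification, not a kernel type. */
enum counter_flavour {
    COUNTER_COUNTING,
    COUNTER_SAMPLING,
    COUNTER_INCONSISTENT, /* matches neither definition */
};

static inline enum counter_flavour classify_counter(uint64_t irq_period,
                                                    uint64_t record_type)
{
    /* Counting: no interrupt period and the simple record type. */
    if (irq_period == 0 && record_type == PERF_RECORD_SIMPLE)
        return COUNTER_COUNTING;
    /* Sampling: an interrupt every irq_period events, non-simple record. */
    if (irq_period > 0 && record_type != PERF_RECORD_SIMPLE)
        return COUNTER_SAMPLING;
    return COUNTER_INCONSISTENT;
}
```
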
A record_type value of PERF_RECORD_IRQ will record the instruction
pointer (IP) at which the interrupt occurred.  A record_type value of
PERF_RECORD_GROUP will record the event_config and counter value of
all of the other counters in the group, and should only be used on a
group leader (see below).  Currently these two values are mutually
exclusive, but record_type will become a bit-mask in future and
support other values.

A sampling counter has an event queue, into which an event is placed
on each interrupt.  A read() on a sampling counter will read the next
event from the event queue.  If the queue is empty, the read() will
either block or return an EAGAIN error, depending on whether the fd
has been set to non-blocking mode or not.

The 'disabled' bit specifies whether the counter starts out disabled
or enabled.  If it is initially disabled, it can be enabled by ioctl
or prctl (see below).

The 'nmi' bit specifies, for hardware events, whether the counter
should be set up to request non-maskable interrupts (NMIs) or normal
interrupts.  This bit is ignored if the user doesn't have
CAP_SYS_ADMIN privilege (i.e. is not root) or if the CPU doesn't
generate NMIs from hardware counters.

The 'inherit' bit, if set, specifies that this counter should count
events on descendant tasks as well as the task specified.  This only
applies to new descendants, not to any existing descendants at the
time the counter is created (nor to any new descendants of existing
descendants).

The 'pinned' bit, if set, specifies that the counter should always be
on the CPU if at all possible.  It only applies to hardware counters
and only to group leaders.  If a pinned counter cannot be put onto the
CPU (e.g. because there are not enough hardware counters or because of
a conflict with some other event), then the counter goes into an
'error' state, where reads return end-of-file (i.e. read() returns 0)
until the counter is subsequently enabled or disabled.

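The read() outcomes described so far (data, an empty queue reported as
EAGAIN on a non-blocking fd, and end-of-file for a pinned counter in the
'error' state) could be told apart along the following lines.  This is a
minimal sketch using only standard POSIX calls; the enum and function
names are hypothetical, and the layout of the data returned is left to
the caller since it depends on the kind of counter.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for one read() attempt on a counter fd. */
enum read_outcome {
    READ_GOT_DATA,    /* *nread bytes were stored in buf            */
    READ_QUEUE_EMPTY, /* non-blocking fd and nothing queued: EAGAIN */
    READ_EOF,         /* read() returned 0 (counter 'error' state)  */
    READ_FAILED,      /* any other error; errno is left set         */
};

/* Switch an fd to non-blocking mode with fcntl(); returns 0 on success. */
static inline int set_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL);

    if (fl < 0)
        return -1;
    return fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0 ? -1 : 0;
}

static inline enum read_outcome read_counter_fd(int fd, void *buf, size_t len,
                                                size_t *nread)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR); /* retry if interrupted by a signal */

    if (n > 0) {
        *nread = (size_t)n;
        return READ_GOT_DATA;
    }
    *nread = 0;
    if (n == 0)
        return READ_EOF;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return READ_QUEUE_EMPTY;
    return READ_FAILED;
}
```

For a counting counter the caller would pass an 8-byte buffer and
expect READ_GOT_DATA with *nread == 8.
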
The 'exclusive' bit, if set, specifies that when this counter's group
is on the CPU, it should be the only group using the CPU's counters.
In future, this will allow sophisticated monitoring programs to supply
extra configuration information via 'extra_config_len' to exploit
advanced features of the CPU's Performance Monitor Unit (PMU) that are
not otherwise accessible and that might disrupt other hardware
counters.

The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
way to request that counting of events be restricted to times when the
CPU is in user, kernel and/or hypervisor mode.


The 'pid' parameter to the perf_counter_open() system call allows the
counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu counters)

The 'cpu' parameter allows a counter to be made specific to a CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.

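The pid/cpu combinations above can be summarised in a small helper.
This is an illustrative sketch only: the enum and function are
hypothetical, and two cases rest on assumptions that the text does not
spell out, namely that any negative pid with cpu == -1 is rejected like
'pid == -1', and that cpu values below -1 are invalid.

```c
#include <sys/types.h>

/* Hypothetical names for the cases described above. */
enum counter_scope {
    SCOPE_INVALID,       /* combination that is not valid            */
    SCOPE_TASK_ALL_CPUS, /* per task counter, follows the task       */
    SCOPE_TASK_ONE_CPU,  /* one task, counted only on one CPU        */
    SCOPE_CPU_WIDE,      /* per CPU counter, all tasks on that CPU   */
};

static inline enum counter_scope classify_scope(pid_t pid, int cpu)
{
    if (cpu < -1)
        return SCOPE_INVALID; /* assumption: only cpu >= -1 is defined */
    if (pid < 0)
        return cpu == -1 ? SCOPE_INVALID : SCOPE_CPU_WIDE;
    /* pid == 0 (current task) or pid > 0 (a specific task) */
    return cpu == -1 ? SCOPE_TASK_ALL_CPUS : SCOPE_TASK_ONE_CPU;
}
```
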
The 'flags' parameter is currently unused and must be zero.

The 'group_fd' parameter allows counter "groups" to be set up.  A
counter group has one counter which is the group "leader".  The leader
is created first, with group_fd = -1 in the perf_counter_open call
that creates it.  The rest of the group members are created
subsequently, with group_fd giving the fd of the group leader.
(A single counter on its own is created with group_fd = -1 and is
considered to be a group with only 1 member.)

A counter group is scheduled onto the CPU as a unit, that is, it will
only be put onto the CPU if all of the counters in the group can be
put onto the CPU.  This means that the values of the member counters
can be meaningfully compared, added, divided (to get ratios), etc.,
with each other, since they have counted events for the same set of
executed instructions.

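As an example of such a ratio, two values read from members of the same
group (say, instructions and CPU cycles) can be divided to obtain
instructions per cycle.  The helper below is a hypothetical sketch;
returning 0.0 for a zero denominator is a choice made here, not
something the interface prescribes.

```c
#include <stdint.h>

/*
 * Hypothetical helper: ratio of two counter values read from members of
 * the same group.  Returns 0.0 if the denominator is zero.
 */
static inline double counter_ratio(uint64_t numerator, uint64_t denominator)
{
    if (denominator == 0)
        return 0.0;
    return (double)numerator / (double)denominator;
}
```
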
Counters can be enabled and disabled in two ways: via ioctl and via
prctl.  When a counter is disabled, it doesn't count or generate
events but does continue to exist and maintain its count value.

An individual counter or counter group can be enabled with

        ioctl(fd, PERF_COUNTER_IOC_ENABLE);

or disabled with

        ioctl(fd, PERF_COUNTER_IOC_DISABLE);

Enabling or disabling the leader of a group enables or disables the
whole group; that is, while the group leader is disabled, none of the
counters in the group will count.  Enabling or disabling a member of a
group other than the leader only affects that counter: disabling a
non-leader stops that counter from counting but doesn't affect any
other counter.

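The leader/member rule can be pictured with a small user-space model.
It is purely illustrative and does not reflect how the kernel stores
this state; 'struct counter_model' and counter_is_counting() are
hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a counter's enable state. */
struct counter_model {
    bool enabled;                       /* this counter's own state     */
    const struct counter_model *leader; /* NULL if this is the leader   */
};

/* A member counts only while both it and its group leader are enabled. */
static inline bool counter_is_counting(const struct counter_model *c)
{
    if (!c->enabled)
        return false;
    return c->leader == NULL || c->leader->enabled;
}
```

In this model, clearing 'enabled' on a member stops only that member,
while clearing it on the leader stops every counter in the group.
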
A process can enable or disable all the counter groups that are
attached to it, using prctl:

        prctl(PR_TASK_PERF_COUNTERS_ENABLE);

        prctl(PR_TASK_PERF_COUNTERS_DISABLE);

This applies to all counters on the current process, whether created
by this process or by another, and doesn't affect any counters that
this process has created on other processes.  It only enables or
disables the group leaders, not any other members in the groups.