
Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hardware events,
such as instructions executed, cache misses suffered, or branches
mis-predicted - without slowing down the kernel or applications. These
registers can also trigger interrupts when a threshold number of events
has passed - and can thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per-task and per-CPU counters, counter
groups, and it provides event capabilities on top of those.  It
provides "virtual" 64-bit counters, regardless of the width of the
underlying hardware counters.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the sys_perf_event_open()
system call:

	int sys_perf_event_open(struct perf_event_attr *hw_event_uptr,
				pid_t pid, int cpu, int group_fd,
				unsigned long flags);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.
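
As a minimal sketch of that flow - open one counter for the current
task, read it, close it - assuming a header that declares the
'perf_event_attr' structure described below and that
__NR_perf_event_open is defined for this architecture; error handling
is elided:

	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		struct perf_event_attr attr = { 0 };
		uint64_t count;
		int fd;

		/* raw_type == 0, type == PERF_TYPE_HARDWARE (0),
		 * event_id == PERF_COUNT_HW_INSTRUCTIONS (1), see below */
		attr.config = 1;

		/* pid == 0: this task; cpu == -1: any CPU; no group */
		fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

		/* ... run the code to be measured ... */

		read(fd, &count, sizeof(count));	/* counter value, a u64 */
		printf("instructions: %llu\n", (unsigned long long)count);
		close(fd);
		return 0;
	}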

When creating a new counter fd, 'perf_event_attr' is:

struct perf_event_attr {
	/*
	 * The MSB of the config word signifies if the rest contains cpu
	 * specific (raw) counter configuration data; if unset, the next
	 * 7 bits are an event type and the rest of the bits are the event
	 * identifier.
	 */
	__u64			config;

	__u64			irq_period;
	__u32			record_type;
	__u32			read_format;

	__u64			disabled       :  1, /* off by default        */
				inherit        :  1, /* children inherit it   */
				pinned         :  1, /* must always be on PMU */
				exclusive      :  1, /* only group on PMU     */
				exclude_user   :  1, /* don't count user      */
				exclude_kernel :  1, /* ditto kernel          */
				exclude_hv     :  1, /* ditto hypervisor      */
				exclude_idle   :  1, /* don't count when idle */
				mmap           :  1, /* include mmap data     */
				munmap         :  1, /* include munmap data   */
				comm           :  1, /* include comm data     */

				__reserved_1   : 52;

	__u32			extra_config_len;
	__u32			wakeup_events;	/* wakeup every n events */

	__u64			__reserved_2;
	__u64			__reserved_3;
};

The 'config' field specifies what the counter should count.  It
is divided into 3 bit-fields:

raw_type: 1 bit   (most significant bit)	0x8000_0000_0000_0000
type:	  7 bits  (next most significant)	0x7f00_0000_0000_0000
event_id: 56 bits (least significant)		0x00ff_ffff_ffff_ffff

If 'raw_type' is 1, then the counter will count a hardware event
specified by the remaining 63 bits of 'config'. The encoding is
machine-specific.

If 'raw_type' is 0, then the 'type' field says what kind of counter
this is, with the following encoding:

enum perf_type_id {
	PERF_TYPE_HARDWARE		= 0,
	PERF_TYPE_SOFTWARE		= 1,
	PERF_TYPE_TRACEPOINT		= 2,
};
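
A small sketch of how those bit-fields compose into a config word; the
PERF_CONFIG* helper names are illustrative, not from any kernel header:

	#define PERF_CONFIG_RAW		(1ULL << 63)
	#define PERF_CONFIG(type, id)	(((__u64)(type) << 56) | \
					 ((__u64)(id) & 0x00ffffffffffffffULL))

	/* a generalized hardware event: cache misses */
	attr.config = PERF_CONFIG(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);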

A counter of PERF_TYPE_HARDWARE will count the hardware event
specified by 'event_id':

/*
 * Generalized performance counter event types, used by the hw_event.event_id
 * parameter of the sys_perf_event_open() syscall:
 */
enum perf_hw_id {
	/*
	 * Common hardware events, generalized by the kernel:
	 */
	PERF_COUNT_HW_CPU_CYCLES		= 0,
	PERF_COUNT_HW_INSTRUCTIONS		= 1,
	PERF_COUNT_HW_CACHE_REFERENCES		= 2,
	PERF_COUNT_HW_CACHE_MISSES		= 3,
	PERF_COUNT_HW_BRANCH_INSTRUCTIONS	= 4,
	PERF_COUNT_HW_BRANCH_MISSES		= 5,
	PERF_COUNT_HW_BUS_CYCLES		= 6,
	PERF_COUNT_HW_STALLED_CYCLES_FRONTEND	= 7,
	PERF_COUNT_HW_STALLED_CYCLES_BACKEND	= 8,
	PERF_COUNT_HW_REF_CPU_CYCLES		= 9,
};

These are standardized types of events that work relatively uniformly
on all CPUs that implement Performance Counters support under Linux,
although there may be variations (e.g., different CPUs might count
cache references and misses at different levels of the cache hierarchy).
If a CPU is not able to count the selected event, then the system call
will return -EINVAL.

More hw_event_types are supported as well, but they are CPU-specific
and accessed as raw events. For example, to count "External bus
cycles while bus lock signal asserted" events on Intel Core CPUs, pass
in a 0x4064 event_id value and set hw_event.raw_type to 1.
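
Expressed with the layout above, that raw-event example looks like this
sketch - bit 63 selects raw mode, the low 63 bits carry the
machine-specific encoding:

	attr.config = (1ULL << 63) | 0x4064;	/* raw: bus cycles w/ LOCK asserted */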

A counter of type PERF_TYPE_SOFTWARE will count one of the available
software events, selected by 'event_id':

/*
 * Special "software" counters provided by the kernel, even if the hardware
 * does not support performance counters. These counters measure various
 * physical and sw events of the kernel (and allow the profiling of them as
 * well):
 */
enum perf_sw_ids {
	PERF_COUNT_SW_CPU_CLOCK		= 0,
	PERF_COUNT_SW_TASK_CLOCK	= 1,
	PERF_COUNT_SW_PAGE_FAULTS	= 2,
	PERF_COUNT_SW_CONTEXT_SWITCHES	= 3,
	PERF_COUNT_SW_CPU_MIGRATIONS	= 4,
	PERF_COUNT_SW_PAGE_FAULTS_MIN	= 5,
	PERF_COUNT_SW_PAGE_FAULTS_MAJ	= 6,
	PERF_COUNT_SW_ALIGNMENT_FAULTS	= 7,
	PERF_COUNT_SW_EMULATION_FAULTS	= 8,
};
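
For example, counting the current task's page faults is a software
counter; a sketch, reusing the illustrative PERF_CONFIG() helper from
above:

	attr.config = PERF_CONFIG(PERF_TYPE_SOFTWARE, PERF_COUNT_SW_PAGE_FAULTS);
	fd = syscall(__NR_perf_event_open, &attr, 0 /* self */, -1, -1, 0);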

Counters of the type PERF_TYPE_TRACEPOINT are available when the ftrace
event tracer is enabled, and event_id values can be obtained from

	/debug/tracing/events/*/*/id


Counters come in two flavours: counting counters and sampling
counters.  A "counting" counter is one that is used for counting the
number of events that occur, and is characterised by having
irq_period = 0.


A read() on a counter returns the current value of the counter and possible
additional values as specified by 'read_format'; each value is a u64 (8 bytes)
in size.

/*
 * Bits that can be set in hw_event.read_format to request that
 * reads on the counter should return the indicated quantities,
 * in increasing order of bit value, after the counter value.
 */
enum perf_event_read_format {
	PERF_FORMAT_TOTAL_TIME_ENABLED	= 1,
	PERF_FORMAT_TOTAL_TIME_RUNNING	= 2,
};

Using these additional values one can establish the overcommit ratio for a
particular counter, allowing one to take the round-robin scheduling effect
into account.
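
A sketch of such a read and the usual scaling computation, assuming
both format bits are set (the extra values arrive after the count, in
increasing bit order):

	struct {
		uint64_t count;
		uint64_t time_enabled;	/* PERF_FORMAT_TOTAL_TIME_ENABLED */
		uint64_t time_running;	/* PERF_FORMAT_TOTAL_TIME_RUNNING */
	} rf;
	uint64_t scaled;

	attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			   PERF_FORMAT_TOTAL_TIME_RUNNING;
	/* ... open the counter, run the workload ... */
	read(fd, &rf, sizeof(rf));

	/* estimate the count as if the counter had been on the PMU
	 * the whole time it was enabled: */
	if (rf.time_running)
		scaled = rf.count * rf.time_enabled / rf.time_running;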


A "sampling" counter is one that is set up to generate an interrupt
every N events, where N is given by 'irq_period'.  A sampling counter
has irq_period > 0. The record_type controls what data is recorded on each
interrupt:

/*
 * Bits that can be set in hw_event.record_type to request information
 * in the overflow packets.
 */
enum perf_event_record_format {
	PERF_RECORD_IP		= 1U << 0,
	PERF_RECORD_TID		= 1U << 1,
	PERF_RECORD_TIME	= 1U << 2,
	PERF_RECORD_ADDR	= 1U << 3,
	PERF_RECORD_GROUP	= 1U << 4,
	PERF_RECORD_CALLCHAIN	= 1U << 5,
};

Such (and other) events will be recorded in a ring-buffer, which is
available to user-space using mmap() (see below).
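
A sketch of a sampling-counter setup using the fields above - sample
every 100000 CPU cycles, recording the instruction pointer and pid/tid
of each sample (the period value is illustrative):

	attr.config      = PERF_CONFIG(PERF_TYPE_HARDWARE,
				       PERF_COUNT_HW_CPU_CYCLES);
	attr.irq_period  = 100000;	/* interrupt every 100000 events */
	attr.record_type = PERF_RECORD_IP | PERF_RECORD_TID;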

The 'disabled' bit specifies whether the counter starts out disabled
or enabled.  If it is initially disabled, it can be enabled by ioctl
or prctl (see below).

The 'inherit' bit, if set, specifies that this counter should count
events on descendant tasks as well as the task specified.  This only
applies to new descendants, not to any existing descendants at the
time the counter is created (nor to any new descendants of existing
descendants).

The 'pinned' bit, if set, specifies that the counter should always be
on the CPU if at all possible.  It only applies to hardware counters
and only to group leaders.  If a pinned counter cannot be put onto the
CPU (e.g. because there are not enough hardware counters or because of
a conflict with some other event), then the counter goes into an
'error' state, where reads return end-of-file (i.e. read() returns 0)
until the counter is subsequently enabled or disabled.

The 'exclusive' bit, if set, specifies that when this counter's group
is on the CPU, it should be the only group using the CPU's counters.
In future, this will allow sophisticated monitoring programs to supply
extra configuration information via 'extra_config_len' to exploit
advanced features of the CPU's Performance Monitor Unit (PMU) that are
not otherwise accessible and that might disrupt other hardware
counters.

The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
way to request that counting of events be restricted to times when the
CPU is in user, kernel and/or hypervisor mode.

Furthermore the 'exclude_host' and 'exclude_guest' bits provide a way
to request counting of events restricted to guest and host contexts when
using Linux as the hypervisor.
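
For instance, a sketch of counting kernel-mode events only, by
excluding the other contexts:

	attr.exclude_user = 1;	/* don't count while in user mode */
	attr.exclude_hv   = 1;	/* ... nor in hypervisor mode */
	attr.exclude_idle = 1;	/* ... nor while the CPU is idle */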

The 'mmap' and 'munmap' bits allow recording of PROT_EXEC mmap/munmap
operations; these can be used to relate userspace IP addresses to actual
code, even after the mapping (or even the whole process) is gone.
These events are recorded in the ring-buffer (see below).

The 'comm' bit allows tracking of process comm data on process creation.
This too is recorded in the ring-buffer (see below).

The 'pid' parameter to the sys_perf_event_open() system call allows the
counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu counters)

The 'cpu' parameter allows a counter to be made specific to a CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to.  Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x.  Per CPU counters need CAP_PERFMON or CAP_SYS_ADMIN
privilege.
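
The two common attachment modes as a sketch (the fd names are
illustrative):

	/* per-task: follows task 'pid' across CPUs */
	fd_task = syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);

	/* per-CPU: everything that runs on CPU 3; needs CAP_PERFMON
	 * or CAP_SYS_ADMIN */
	fd_cpu = syscall(__NR_perf_event_open, &attr, -1, 3, -1, 0);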

The 'flags' parameter is currently unused and must be zero.

The 'group_fd' parameter allows counter "groups" to be set up.  A
counter group has one counter which is the group "leader".  The leader
is created first, with group_fd = -1 in the sys_perf_event_open call
that creates it.  The rest of the group members are created
subsequently, with group_fd giving the fd of the group leader.
(A single counter on its own is created with group_fd = -1 and is
considered to be a group with only 1 member.)

A counter group is scheduled onto the CPU as a unit, that is, it will
only be put onto the CPU if all of the counters in the group can be
put onto the CPU.  This means that the values of the member counters
can be meaningfully compared, added, divided (to get ratios), etc.,
with each other, since they have counted events for the same set of
executed instructions.
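
A sketch of a two-counter group - cycles as leader, instructions as a
member - whose values can then be divided to get a meaningful
instructions-per-cycle ratio (the attr variables are illustrative):

	leader = syscall(__NR_perf_event_open, &cycles_attr, 0, -1, -1, 0);
	member = syscall(__NR_perf_event_open, &insns_attr,  0, -1, leader, 0);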


As stated above, asynchronous events, like counter overflow or PROT_EXEC mmap
tracking, are logged into a ring-buffer.  This ring-buffer is created and
accessed through mmap().

The mmap size should be 1+2^n pages, where the first page is a meta-data page
(struct perf_event_mmap_page) that contains various bits of information such
as where the ring-buffer head is.
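
A sketch of the mapping, here with n = 3 (an 8-page data area); struct
perf_event_mmap_page is described next:

	size_t page = sysconf(_SC_PAGESIZE);
	void *buf = mmap(NULL, (1 + 8) * page, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	struct perf_event_mmap_page *meta = buf;	/* first page: meta-data */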

/*
 * Structure of the page that can be mapped via mmap
 */
struct perf_event_mmap_page {
	__u32	version;		/* version number of this structure */
	__u32	compat_version;		/* lowest version this is compat with */

	/*
	 * Bits needed to read the hw counters in user-space.
	 *
	 *   u32 seq;
	 *   s64 count;
	 *
	 *   do {
	 *     seq = pc->lock;
	 *
	 *     barrier()
	 *     if (pc->index) {
	 *       count = pmc_read(pc->index - 1);
	 *       count += pc->offset;
	 *     } else
	 *       goto regular_read;
	 *
	 *     barrier();
	 *   } while (pc->lock != seq);
	 *
	 * NOTE: for obvious reason this only works on self-monitoring
	 *       processes.
	 */
	__u32	lock;			/* seqlock for synchronization */
	__u32	index;			/* hardware counter identifier */
	__s64	offset;			/* add to hardware counter value */

	/*
	 * Control data for the mmap() data buffer.
	 *
	 * User-space should issue an rmb(), on SMP-capable platforms, after
	 * reading this value -- see perf_event_wakeup().
	 */
	__u32	data_head;		/* head in the data section */
};

NOTE: the hw-counter userspace bits are arch specific and are currently only
implemented on powerpc.

The following 2^n pages are the ring-buffer which contains events of the form:

#define PERF_RECORD_MISC_KERNEL		(1 << 0)
#define PERF_RECORD_MISC_USER		(1 << 1)
#define PERF_RECORD_MISC_OVERFLOW	(1 << 2)

struct perf_event_header {
	__u32	type;
	__u16	misc;
	__u16	size;
};

enum perf_event_type {

	/*
	 * The MMAP events record the PROT_EXEC mappings so that we can
	 * correlate userspace IPs to code. They have the following structure:
	 *
	 * struct {
	 *	struct perf_event_header	header;
	 *
	 *	u32				pid, tid;
	 *	u64				addr;
	 *	u64				len;
	 *	u64				pgoff;
	 *	char				filename[];
	 * };
	 */
	PERF_RECORD_MMAP	= 1,
	PERF_RECORD_MUNMAP	= 2,

	/*
	 * struct {
	 *	struct perf_event_header	header;
	 *
	 *	u32				pid, tid;
	 *	char				comm[];
	 * };
	 */
	PERF_RECORD_COMM	= 3,

	/*
	 * When header.misc & PERF_RECORD_MISC_OVERFLOW the event_type field
	 * will be PERF_RECORD_*
	 *
	 * struct {
	 *	struct perf_event_header	header;
	 *
	 *	{ u64	ip;	  }	&& PERF_RECORD_IP
	 *	{ u32	pid, tid; }	&& PERF_RECORD_TID
	 *	{ u64	time;	  }	&& PERF_RECORD_TIME
	 *	{ u64	addr;	  }	&& PERF_RECORD_ADDR
	 *
	 *	{ u64	nr;
	 *	  { u64 event, val; }	cnt[nr];  }	&& PERF_RECORD_GROUP
	 *
	 *	{ u16	nr,
	 *		hv,
	 *		kernel,
	 *		user;
	 *	  u64	ips[nr];  }	&& PERF_RECORD_CALLCHAIN
	 * };
	 */
};

NOTE: PERF_RECORD_CALLCHAIN is arch specific and currently only implemented
on x86.
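
A sketch of a reader walking newly arrived records; 'base' (the first
ring-buffer page) and 'size' (the 2^n-page data-area size) are
illustrative, and rmb() stands for the read barrier that the
meta-data comment above asks for:

	static uint64_t tail;	/* reader-owned; persists across wakeups */
	uint64_t head = meta->data_head;

	rmb();	/* pairs with the kernel's write of data_head */

	while (tail < head) {
		struct perf_event_header *hdr =
			(void *)((char *)base + (tail & (size - 1)));

		/* ... dispatch on hdr->type, consume hdr->size bytes ...
		 * (a full reader must also handle records that wrap at
		 * the end of the buffer) */
		tail += hdr->size;
	}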

Notification of new events is possible through poll()/select()/epoll() and
fcntl() managing signals.

Normally a notification is generated for every page filled; however, one can
additionally set perf_event_attr.wakeup_events to generate one every
so many counter overflow events.

Future work will include a splice() interface to the ring-buffer.


Counters can be enabled and disabled in two ways: via ioctl and via
prctl.  When a counter is disabled, it doesn't count or generate
events but does continue to exist and maintain its count value.

An individual counter can be enabled with

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

or disabled with

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

For a counter group, pass PERF_IOC_FLAG_GROUP as the third argument.
Enabling or disabling the leader of a group enables or disables the
whole group; that is, while the group leader is disabled, none of the
counters in the group will count.  Enabling or disabling a member of a
group other than the leader only affects that counter - disabling a
non-leader stops that counter from counting but doesn't affect any
other counter.

Additionally, non-inherited overflow counters can use

	ioctl(fd, PERF_EVENT_IOC_REFRESH, nr);

to enable a counter for 'nr' events, after which it gets disabled again.

A process can enable or disable all the counter groups that are
attached to it, using prctl:

	prctl(PR_TASK_PERF_EVENTS_ENABLE);

	prctl(PR_TASK_PERF_EVENTS_DISABLE);

This applies to all counters on the current process, whether created
by this process or by another, and doesn't affect any counters that
this process has created on other processes.  It only enables or
disables the group leaders, not any other members in the groups.


Arch requirements
-----------------

If your architecture does not have hardware performance metrics, you can
still use the generic software counters based on hrtimers for sampling.

So to start with, in order to add HAVE_PERF_EVENTS to your Kconfig, you
will need at least this:
	- asm/perf_event.h - a basic stub will suffice at first
	- support for atomic64 types (and associated helper functions)

If your architecture does have hardware capabilities, you can override the
weak stub hw_perf_event_init() to register hardware counters.

Architectures that have d-cache aliasing issues, such as Sparc and ARM,
should select PERF_USE_VMALLOC in order to avoid these for perf mmap().