===================================
Supporting PMUs on RISC-V platforms
===================================

Alan Kao <alankao@andestech.com>, Mar 2018

Introduction
------------

As of this writing, the perf_event-related features mentioned in The RISC-V ISA
Privileged Version 1.10 are as follows
(please check the manual for more details):

* [m|s]counteren
* mcycle[h], cycle[h]
* minstret[h], instret[h]
* mhpeventx, mhpcounterx[h]

With this feature set alone, porting perf would require a lot of work, because
the following general architectural performance monitoring features are missing:

* Enabling/Disabling counters
  Counters are just free-running all the time in our case.
* Interrupt caused by counter overflow
  No such feature in the spec.
* Interrupt indicator
  It is not possible to have a separate interrupt port for every counter, so
  an interrupt indicator is required for software to tell which counter has
  just overflowed.
* Writing to counters
  There will be an SBI to support this since the kernel cannot modify the
  counters [1].  Alternatively, some vendors are considering implementing a
  hardware extension for M-S-U model machines to write counters directly.

This document aims to provide developers a quick guide on supporting their
PMUs in the kernel.  The following sections briefly explain perf's mechanism
and the remaining to-dos.

You may check previous discussions here [1][2].  Also, it might be helpful
to check the appendix for related kernel structures.


1. Initialization
-----------------

*riscv_pmu* is a global pointer of type *struct riscv_pmu*, which contains
various methods that follow perf's internal conventions, along with
PMU-specific parameters.  A developer should declare such an instance to
represent the PMU.  By default, *riscv_pmu* points to a constant structure,
*riscv_base_pmu*, which has very basic support for a baseline QEMU model.

The developer can then either assign the instance's address to *riscv_pmu* so
that the minimal, already-implemented logic can be leveraged, or write a
custom *riscv_init_platform_pmu* implementation.

In other words, the existing sources of *riscv_base_pmu* merely provide a
reference implementation.  Developers can decide how much of it to reuse,
and in the most extreme case, they can customize every function
according to their needs.
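The pointer-swapping idea above can be sketched in plain user-space C. This is
only an illustration under stated assumptions: *struct mock_riscv_pmu*, its
fields, the event codes and *mock_init_platform_pmu* are hypothetical
stand-ins, not the actual layout or API of *struct riscv_pmu*.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct riscv_pmu. */
struct mock_riscv_pmu {
	const char *name;
	int num_counters;                         /* PMU-specific parameter */
	int (*map_hw_event)(unsigned int config); /* generic event -> HW code */
};

/* Baseline: only generic events 0 and 1 are understood (made-up codes). */
static int mock_base_map_hw_event(unsigned int config)
{
	return config <= 1 ? (int)config : -1;
}

static const struct mock_riscv_pmu mock_base_pmu = {
	.name = "mock-base",
	.num_counters = 2,
	.map_hw_event = mock_base_map_hw_event,
};

/* Global pointer that defaults to the baseline description. */
static const struct mock_riscv_pmu *mock_pmu = &mock_base_pmu;

/* A vendor PMU that supports more events with its own (made-up) codes. */
static int mock_vendor_map_hw_event(unsigned int config)
{
	static const int codes[] = { 0x0, 0x1, 0x10, 0x11 };

	return config < sizeof(codes) / sizeof(codes[0]) ? codes[config] : -1;
}

static const struct mock_riscv_pmu mock_vendor_pmu = {
	.name = "mock-vendor",
	.num_counters = 6,
	.map_hw_event = mock_vendor_map_hw_event,
};

/* Platform hook: point the global at the vendor instance when present. */
static void mock_init_platform_pmu(int have_vendor_pmu)
{
	mock_pmu = have_vendor_pmu ? &mock_vendor_pmu : &mock_base_pmu;
}
```

Callers only ever go through *mock_pmu*, so switching the pointer is enough to
replace both the parameters and the methods at once.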


2. Event Initialization
-----------------------

When a user launches a perf command to monitor some events, the userspace perf
tool first translates it into multiple *perf_event_open* system calls, and each
of them then invokes the *event_init* member function that was assigned in the
previous step.  In *riscv_base_pmu*'s case, it is *riscv_event_init*.

The main purpose of this function is to translate the event provided by the
user into a bitmap, so that HW-related control registers or counters can be
manipulated directly.  The translation is based on the mappings and methods
provided in *riscv_pmu*.

Note that some features can be set up at this stage as well:

(1) interrupt setting, which is described in the next section;
(2) privilege level setting (user space only, kernel space only, both);
(3) destructor setting.  Normally it is sufficient to apply *riscv_destroy_event*;
(4) tweaks for non-sampling events, which will be utilized by functions such as
    *perf_adjust_period*, usually something like the following (as in the x86
    implementation)::

      if (!is_sampling_event(event)) {
              hwc->sample_period = x86_pmu.max_period;
              hwc->last_period = hwc->sample_period;
              local64_set(&hwc->period_left, hwc->sample_period);
      }

In the case of *riscv_base_pmu*, only (3) is provided for now.
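As a rough illustration of the translation step, the user-space sketch below
maps a generic event number to a hardware code, installs a destructor, and
applies the non-sampling tweak from (4).  Everything prefixed with *mock_* or
*MOCK_* (the structure, the mapping table, the codes and the maximum period)
is an assumption made for the example and does not mirror the kernel's real
definitions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum mock_type { MOCK_TYPE_HARDWARE, MOCK_TYPE_OTHER };

/* Hypothetical, flattened stand-in for perf_event plus hw_perf_event. */
struct mock_event {
	enum mock_type type;
	unsigned int config;      /* generic event number from the user */
	uint64_t sample_period;   /* 0 means a pure counting event */
	int hw_config;            /* translated hardware event code */
	uint64_t hw_sample_period;
	uint64_t hw_period_left;
	void (*destroy)(struct mock_event *event);
};

/* Made-up mapping: generic event number -> hardware event code. */
static const int mock_hw_event_map[] = { 0x0, 0x1 };
#define MOCK_MAP_LEN (sizeof(mock_hw_event_map) / sizeof(mock_hw_event_map[0]))

/* Made-up largest period used for events that never sample. */
#define MOCK_MAX_PERIOD ((UINT64_C(1) << 63) - 1)

static void mock_destroy_event(struct mock_event *event)
{
	event->hw_config = -1;  /* nothing else to release in this sketch */
}

static int mock_event_init(struct mock_event *event)
{
	/* -ENOENT tells the perf core that this PMU does not own the event. */
	if (event->type != MOCK_TYPE_HARDWARE)
		return -ENOENT;
	if (event->config >= MOCK_MAP_LEN)
		return -EINVAL;

	event->hw_config = mock_hw_event_map[event->config];
	event->destroy = mock_destroy_event;

	/* Counting events get the largest period so they rarely overflow. */
	event->hw_sample_period =
		event->sample_period ? event->sample_period : MOCK_MAX_PERIOD;
	event->hw_period_left = event->hw_sample_period;
	return 0;
}
```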


3. Interrupt
------------

3.1. Interrupt Initialization

This often occurs at the beginning of the *event_init* method.  In common
practice, this should be a code segment like::

  int x86_reserve_hardware(void)
  {
          int err = 0;

          if (!atomic_inc_not_zero(&pmc_refcount)) {
                  mutex_lock(&pmc_reserve_mutex);
                  if (atomic_read(&pmc_refcount) == 0) {
                          if (!reserve_pmc_hardware())
                                  err = -EBUSY;
                          else
                                  reserve_ds_buffers();
                  }
                  if (!err)
                          atomic_inc(&pmc_refcount);
                  mutex_unlock(&pmc_reserve_mutex);
          }

          return err;
  }

The key part is *reserve_pmc_hardware*, which usually performs atomic
operations to make the implemented IRQ accessible through some global function
pointer.  *release_pmc_hardware* serves the opposite purpose, and it is used in
the event destructors mentioned in the previous section.

(Note: In the implementations of all the architectures, the *reserve/release*
pair always deals with IRQ settings, so the name *pmc_hardware* is somewhat
misleading.  It does NOT deal with the binding between an event and a physical
counter, which is covered in the section on add()/del()/start()/stop().)

3.2. IRQ Structure

Basically, an IRQ runs the following pseudo code::

  for each hardware counter that triggered this overflow

      get the event of this counter

      // following two steps are defined as *read()*,
      // check the section Reading/Writing Counters for details.
      count the delta value since previous interrupt
      update the event->count (# event occurs) by adding delta, and
                 event->hw.period_left by subtracting delta

      if the event overflows
          sample data
          set the counter appropriately for the next overflow

          if the event overflows again too frequently
              throttle this event
          fi
      fi

  end for

However, as of this writing, none of the RISC-V implementations have designed an
interrupt for perf, so the details are to be completed in the future.
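Since no RISC-V interrupt exists yet, the pseudo code can only be exercised
against a simulation.  The following user-space sketch assumes a hypothetical
overflow-indicator bitmask (*pending*), fake 64-bit counters, and a made-up
throttling limit; none of these names exist in the kernel.

```c
#include <stdint.h>

#define MOCK_NUM_COUNTERS      4
#define MOCK_MAX_IRQS_PER_TICK 2   /* made-up throttling threshold */

struct mock_event {
	uint64_t count;           /* total events observed so far */
	int64_t period_left;      /* events remaining until the next sample */
	uint64_t sample_period;
	uint64_t prev;            /* raw counter value at the previous read */
	unsigned int samples;     /* number of samples taken */
	unsigned int irqs_this_tick;
	int throttled;
};

static uint64_t mock_counter[MOCK_NUM_COUNTERS];          /* fake HW counters */
static struct mock_event *mock_owner[MOCK_NUM_COUNTERS];  /* counter -> event */

/* 'pending' has bit idx set for every counter that has just overflowed. */
static void mock_pmu_irq(unsigned int pending)
{
	for (int idx = 0; idx < MOCK_NUM_COUNTERS; idx++) {
		struct mock_event *ev = mock_owner[idx];

		if (!(pending & (1u << idx)) || !ev)
			continue;

		/* read(): fold the delta into count and period_left. */
		uint64_t now = mock_counter[idx];
		uint64_t delta = now - ev->prev;  /* unsigned math absorbs the wrap */

		ev->count += delta;
		ev->period_left -= (int64_t)delta;
		ev->prev = now;

		if (ev->period_left > 0)
			continue;                 /* the event itself did not overflow */

		ev->samples++;                    /* "sample data" */

		/* Re-arm: the counter wraps again after period_left more events. */
		ev->period_left += (int64_t)ev->sample_period;
		if (ev->period_left <= 0)
			ev->period_left = (int64_t)ev->sample_period;
		mock_counter[idx] = UINT64_C(0) - (uint64_t)ev->period_left;
		ev->prev = mock_counter[idx];

		if (++ev->irqs_this_tick > MOCK_MAX_IRQS_PER_TICK)
			ev->throttled = 1;
	}
}
```

The handler re-arms the counter with the period shortened by however many
events slipped in between the overflow and the handler running, so samples
stay aligned to the requested period.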

4. Reading/Writing Counters
---------------------------

Reading and writing seem symmetric, but perf treats them quite differently.
For reading, there is a *read* interface in *struct pmu*, but it serves more
than just reading.  Depending on the context, the *read* function not only
reads the content of the counter (event->count), but also updates the period
left until the next interrupt (event->hw.period_left).

The core of perf, however, does not need to write counters directly.  Writing
counters is hidden behind two abstractions: 1) *pmu->start*, which literally
starts counting, so one has to set the counter to a suitable value for the
next interrupt; and 2) the IRQ handler, which should set the counter to the
same reasonable value.

Reading is not a problem in RISC-V, but writing would need some effort, since
S-mode is not allowed to write counters.
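The arithmetic behind both directions is small enough to show in isolation.
The sketch below assumes a counter that is *width* bits wide and wraps to zero
on overflow; the helper names are hypothetical, and the actual write (through
an SBI call or a vendor extension) is deliberately left out.

```c
#include <stdint.h>

/* Mask covering the implemented bits of a 'width'-bit counter. */
static uint64_t mock_counter_mask(unsigned int width)
{
	return width >= 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}

/*
 * Reading: number of events between two raw reads.  The masked unsigned
 * subtraction stays correct even if the counter wrapped once in between.
 */
static uint64_t mock_counter_delta(uint64_t prev, uint64_t now,
				   unsigned int width)
{
	return (now - prev) & mock_counter_mask(width);
}

/*
 * Writing: raw value to program so that the counter overflows after
 * exactly 'period_left' more events.
 */
static uint64_t mock_reload_value(uint64_t period_left, unsigned int width)
{
	return (UINT64_C(0) - period_left) & mock_counter_mask(width);
}
```

For example, with a 48-bit counter and 1000 events left, the reload value is
2^48 - 1000; reading zero right after the wrap then yields a delta of 1000.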


5. add()/del()/start()/stop()
-----------------------------

Basic idea: add()/del() adds/deletes events to/from a PMU, and start()/stop()
starts/stops the counter of some event in the PMU.  All of them take the same
arguments: *struct perf_event *event* and *int flag*.

If you consider perf as a state machine, you will find that these functions
serve as the transitions between its states.
Three states (event->hw.state) are defined:

* PERF_HES_STOPPED:	the counter is stopped
* PERF_HES_UPTODATE:	the event->count is up-to-date
* PERF_HES_ARCH:	arch-dependent usage ... we don't need this for now

A normal flow of these state transitions is as follows:

* A user launches a perf event, resulting in a call to *event_init*.
* When being context-switched in, *add* is called by the perf core, with the
  flag PERF_EF_START, which means that the event should be started after it is
  added.  At this stage, a general event is bound to a physical counter, if one
  is available.  The state changes to PERF_HES_STOPPED and PERF_HES_UPTODATE,
  because it is now stopped, and the (software) event count does not need
  updating.

  - *start* is then called, and the counter is enabled.
    With the flag PERF_EF_RELOAD, it writes an appropriate value to the counter
    (check the previous section for details).
    Nothing is written if the flag does not contain PERF_EF_RELOAD.
    The state is now reset to none, because the event is neither stopped nor
    up to date (counting has already started).

* When being context-switched out, *del* is called.  It then goes through all
  the events in the PMU and calls *stop* to update their counts.

  - *stop* is called by *del*
    and by the perf core with the flag PERF_EF_UPDATE, and it often shares the
    same subroutine as *read*, using the same logic.
    The state changes to PERF_HES_STOPPED and PERF_HES_UPTODATE, again.

  - Life cycle of these two pairs: *add* and *del* are called repeatedly as
    tasks switch in and out; *start* and *stop* are also called when the perf
    core needs a quick stop-and-start, for instance, when the interrupt period
    is being adjusted.

The current implementation is sufficient for now and can easily be extended
with new features in the future.
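The state transitions described above can be traced with a small user-space
model.  It assumes a single fake counter and uses hypothetical *MOCK_* flag
and state names that merely parallel the PERF_EF_* and PERF_HES_* ones; it is
a sketch of the flow, not the kernel implementation.

```c
#include <stdint.h>

enum { MOCK_EF_START = 1, MOCK_EF_RELOAD = 2, MOCK_EF_UPDATE = 4 };
enum { MOCK_HES_STOPPED = 1, MOCK_HES_UPTODATE = 2 };

struct mock_event {
	int state;             /* combination of MOCK_HES_* bits */
	int idx;               /* bound counter index, -1 if none */
	uint64_t count;        /* software event count */
	uint64_t prev;         /* raw counter value at the last update */
	uint64_t period_left;  /* events until the next interrupt */
};

static uint64_t mock_hw_counter;  /* the only (fake) hardware counter */
static int mock_hw_busy;          /* non-zero while an event owns it */

static void mock_start(struct mock_event *ev, int flags)
{
	if (flags & MOCK_EF_RELOAD)    /* arm the counter for the next overflow */
		mock_hw_counter = UINT64_C(0) - ev->period_left;
	ev->prev = mock_hw_counter;
	ev->state = 0;                 /* running: neither stopped nor up to date */
}

static void mock_stop(struct mock_event *ev, int flags)
{
	/* A free-running counter cannot be disabled, so only mark the state. */
	ev->state |= MOCK_HES_STOPPED;

	if ((flags & MOCK_EF_UPDATE) && !(ev->state & MOCK_HES_UPTODATE)) {
		ev->count += mock_hw_counter - ev->prev;  /* same logic as read */
		ev->prev = mock_hw_counter;
		ev->state |= MOCK_HES_UPTODATE;
	}
}

static int mock_add(struct mock_event *ev, int flags)
{
	if (mock_hw_busy)
		return -1;             /* no free physical counter */
	mock_hw_busy = 1;
	ev->idx = 0;                   /* bind the event to the counter */
	ev->state = MOCK_HES_STOPPED | MOCK_HES_UPTODATE;

	if (flags & MOCK_EF_START)
		mock_start(ev, MOCK_EF_RELOAD);
	return 0;
}

static void mock_del(struct mock_event *ev, int flags)
{
	(void)flags;
	mock_stop(ev, MOCK_EF_UPDATE); /* fold the final delta into the count */
	mock_hw_busy = 0;
	ev->idx = -1;                  /* release the binding */
}
```

A task switched in and out once goes through add (which starts the event),
some counting, and then del (which stops it), ending in the same
stopped-and-up-to-date state it had right after being bound.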

A. Related Structures
---------------------

* struct pmu: include/linux/perf_event.h
* struct riscv_pmu: arch/riscv/include/asm/perf_event.h

  Both structures are designed to be read-only.

  *struct pmu* defines some function pointer interfaces, and most of them take
  *struct perf_event* as a main argument, dealing with perf events according to
  perf's internal state machine (check kernel/events/core.c for details).

  *struct riscv_pmu* defines PMU-specific parameters.  The naming follows the
  convention of all other architectures.

* struct perf_event: include/linux/perf_event.h
* struct hw_perf_event

  The generic structure that represents perf events, and the hardware-related
  details.

* struct riscv_hw_events: arch/riscv/include/asm/perf_event.h

  The structure that holds the status of events.  It has two fixed members:
  the number of events and the array of the events.

References
----------

[1] https://github.com/riscv/riscv-linux/pull/124

[2] https://groups.google.com/a/groups.riscv.org/forum/#!topic/sw-dev/f19TmCNP6yA