perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] -- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides a means for Shared Data C2C/HITM analysis. It helps
you track down cacheline contention. A HITM access is a load that hits a
cacheline held in modified state in another cache.

On x86, the tool is based on the load latency and precise store facility events
provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
with the thresholding feature.

These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides a means to record this data and report access details
for the cachelines with the highest contention, that is, the highest number of
HITM accesses.

The basic workflow with this tool follows the standard record/report phases.
The user runs the record command to record event data and the report command
to display it.


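As an illustration of how the sampled memory address relates to a cacheline, the sketch below maps an address to its cacheline base and to the offset within that line. It is not the tool's implementation: the 64-byte line size, the `Sample` record and both helper names are assumptions made for this example.

```python
from dataclasses import dataclass

# Assumption: 64-byte cachelines. The real size depends on the CPU.
CACHELINE_SIZE = 64


@dataclass(frozen=True)
class Sample:
    """Hypothetical record holding the three items the events provide."""
    addr: int      # memory address of the access
    access: str    # type of the access, e.g. "load" or "store"
    latency: int   # latency in cycles (meaningful for loads only)


def cacheline_of(addr: int, line_size: int = CACHELINE_SIZE) -> int:
    """Return the base address of the cacheline that contains addr."""
    return addr & ~(line_size - 1)


def offset_in_line(addr: int, line_size: int = CACHELINE_SIZE) -> int:
    """Return the byte offset of addr within its cacheline."""
    return addr & (line_size - 1)


if __name__ == "__main__":
    s = Sample(addr=0x12345678, access="load", latency=212)
    print(hex(cacheline_of(s.addr)), hex(offset_in_line(s.addr)))
```

Two accesses contend for the same cacheline when `cacheline_of()` returns the same value for both, even if their offsets differ.
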
RECORD OPTIONS
--------------
-e::
--event=::
	Select the PMU event. Use 'perf c2c record -e list'
	to list available events.

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-l::
--ldlat::
	Configure mem-loads latency. (x86 only)

-k::
--all-kernel::
	Configure all used events to run in kernel space.

-u::
--all-user::
	Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
	vmlinux pathname

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-i::
--input::
	Specify the input file to process.

-N::
--node-info::
	Show extra node info in report (see NODE INFO section)

-c::
--coalesce::
	Specify sorting fields for single cacheline display.
	The following fields are available: tid,pid,iaddr,dso
	(see COALESCE)

-g::
--call-graph::
	Set up callchain parameters.
	Please refer to the perf-report man page for details.

--stdio::
	Force the stdio output (see STDIO OUTPUT)

--stats::
	Display only statistic tables and force stdio mode.

--full-symbols::
	Display full length of symbols.

--no-source::
	Do not display Source:Line column.

--show-all::
	Show all captured HITM lines, regardless of the HITM % 0.0005 limit.

-f::
--force::
	Don't do ownership validation.

-d::
--display::
	Select the HITM type (rmt, lcl) to display and sort on. Total HITMs is
	the default.

--stitch-lbr::
	Show callgraph with stitched LBRs, which may have more complete
	callgraph. The perf.data file must have been obtained using
	perf c2c record --call-graph lbr.
	Disabled by default. In common cases with call stack overflows,
	it can recreate better call stacks than the default lbr call stack
	output. But this approach is not foolproof. There can be cases
	where it creates incorrect call stacks from incorrect matches.
	The known limitations include exception handling such as
	setjmp/longjmp, where calls and returns do not match.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default:
(check the perf record man page for details)

  -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on x86:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

and the following on PowerPC:

  cpu/mem-loads/
  cpu/mem-stores/

The user can pass any 'perf record' option after the '--' mark, for example to
enable callchains and system-wide monitoring:

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

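The split between c2c record options and plain 'perf record' options can be sketched as a small command-line builder. This is only an illustration of the SYNOPSIS ordering; `build_record_argv` is a hypothetical helper, not part of perf, and the sketch does not run the command.

```python
import shlex
from typing import List, Sequence


def build_record_argv(c2c_opts: Sequence[str],
                      record_opts: Sequence[str],
                      command: Sequence[str]) -> List[str]:
    """Assemble: perf c2c record [<options>] -- [<record options>] <command>.

    c2c_opts    - options from the RECORD OPTIONS section
    record_opts - plain 'perf record' options, placed after the '--' mark
    command     - workload to run (may be empty, e.g. with system-wide '-a')
    """
    argv = ["perf", "c2c", "record", *c2c_opts]
    if record_opts:
        # Everything after '--' is handed to perf record.
        argv += ["--", *record_opts]
    argv += list(command)
    return argv


if __name__ == "__main__":
    # Same invocation as the example in this section.
    print(shlex.join(build_record_argv([], ["-g", "-a"], [])))
```

Keeping the two option groups separate mirrors the man page: options before '--' are interpreted by c2c record, options after it by perf record.
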
C2C REPORT
----------
The perf c2c report command displays shared data analysis. It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data

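The workflow steps above can be sketched in a few lines. This is a simplified model under stated assumptions, not the tool's code: the `Access` record, the `report` function, the 64-byte line size and the `"tot"` name for the default sort key are all made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Access(NamedTuple):
    """Hypothetical sample: address, access kind and HITM type."""
    addr: int
    kind: str   # "load" or "store"
    hitm: str   # "rmt", "lcl", or "" when the access was not a HITM


def report(samples: Iterable[Access], display: str = "tot",
           line_size: int = 64) -> List[Tuple[int, Dict[str, int]]]:
    """Group samples per cacheline and sort lines by HITM count."""
    lines: Dict[int, Dict[str, int]] = defaultdict(
        lambda: {"rmt": 0, "lcl": 0, "loads": 0, "stores": 0, "records": 0})

    # Steps 1 and 2: bucket by cacheline address and store access details.
    for s in samples:
        stats = lines[s.addr & ~(line_size - 1)]
        stats["records"] += 1
        stats["loads" if s.kind == "load" else "stores"] += 1
        if s.hitm in ("rmt", "lcl"):
            stats[s.hitm] += 1

    # Step 3: sort cachelines on the selected HITM type (total by default).
    def sort_key(item: Tuple[int, Dict[str, int]]) -> int:
        stats = item[1]
        if display == "tot":
            return stats["rmt"] + stats["lcl"]
        return stats[display]

    # Step 4 (display) is left to the caller.
    return sorted(lines.items(), key=sort_key, reverse=True)
```

The `display` parameter plays the role of the '-d' report option: `"rmt"` or `"lcl"` sorts on one HITM type, anything else in this sketch must be `"tot"`.
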
In general, the perf c2c report output consists of two basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data:
(Both stdio and TUI modes show the same fields)

  Index
  - zero-based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Rmt/Lcl Hitm
  - cacheline percentage of all Remote/Local HITM accesses

  LLC Load Hitm - Total, LclHitm, RmtHitm
  - count of Total/Local/Remote load HITMs

  Total records
  - sum of all cacheline accesses

  Total loads
  - sum of all load accesses

  Total stores
  - sum of all store accesses

  Store Reference - L1Hit, L1Miss
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - LlcHit, LclHitm
  - count of LLC load accesses, includes LLC hits and LLC HITMs

  RMT Load Hit - RmtHit, RmtHitm
  - count of remote load accesses, includes remote hits and remote HITMs

  Load Dram - Lcl, Rmt
  - count of local and remote DRAM accesses

For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl
  - % of Remote/Local HITM accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss
  - % of store accesses that hit/missed L1 for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the thread responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores}
  - node IDs with a list of affected CPUs, in the following format:
      Node{cpu list}

The user can switch between the above flavors with the -N option or
use the 'n' key to switch interactively in TUI mode.

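To make the three flavors concrete, the sketch below renders a made-up per-node summary in each of them. It only mimics the shapes listed above; the exact spacing and number formatting of the real output are not reproduced, and the function names, the flavor names ("ids", "stats", "cpus") and the choice to compute percentages relative to the totals over all listed nodes are assumptions.

```python
from typing import Dict, Iterable, List


def compact_cpu_list(cpus: Iterable[int]) -> str:
    """Collapse CPU ids into ranges, e.g. 0,1,2,5 becomes '0-2,5'."""
    ids = sorted(set(cpus))
    parts: List[str] = []
    i = 0
    while i < len(ids):
        j = i
        # Extend the run while the next id is consecutive.
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        parts.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(parts)


def _pct(part: int, total: int) -> str:
    return "n/a" if total == 0 else f"{100.0 * part / total:.1f}%"


def format_nodes(nodes: Dict[int, dict], flavor: str) -> str:
    """nodes maps node id -> {"cpus": [...], "hitms": int, "stores": int}."""
    tot_hitms = sum(n["hitms"] for n in nodes.values())
    tot_stores = sum(n["stores"] for n in nodes.values())
    out: List[str] = []
    for node_id in sorted(nodes):
        n = nodes[node_id]
        if flavor == "ids":
            out.append(str(node_id))
        elif flavor == "stats":
            ncpus = len(set(n["cpus"]))
            hitms = _pct(n["hitms"], tot_hitms)
            stores = _pct(n["stores"], tot_stores)
            out.append(f"{node_id}{{{ncpus} {hitms} {stores}}}")
        elif flavor == "cpus":
            out.append(f"{node_id}{{{compact_cpu_list(n['cpus'])}}}")
        else:
            raise ValueError(f"unknown flavor: {flavor}")
    return (",".join(out)) if flavor == "ids" else " ".join(out)
```
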
COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address, following fields are displayed:
             Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.

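Coalescing can be pictured as grouping the accesses of one cacheline by the chosen fields and then reporting each group's share of the HITMs. The sketch below is a simplified model, not the tool's code: the `coalesce` function and the dictionary layout of an access are hypothetical, and treating the offset as an implicit part of every grouping key is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

VALID_FIELDS = {"tid", "pid", "iaddr", "dso"}


def coalesce(accesses: Iterable[dict],
             fields: str = "pid,iaddr") -> Dict[Tuple, Dict[str, float]]:
    """Group one cacheline's accesses and return % of rmt/lcl HITMs per group.

    Each access is a dict with the keys: offset, tid, pid, iaddr, dso, hitm
    (hitm is "rmt", "lcl" or "" for a non-HITM access).
    """
    keys = fields.split(",")
    unknown = [k for k in keys if k not in VALID_FIELDS]
    if unknown:
        raise ValueError(f"unknown coalesce field(s): {unknown}")

    counts: Dict[Tuple, Dict[str, int]] = defaultdict(
        lambda: {"rmt": 0, "lcl": 0})
    for a in accesses:
        # Assumption: the offset always takes part in the grouping key.
        group = counts[(a["offset"], *(a[k] for k in keys))]
        if a["hitm"] in ("rmt", "lcl"):
            group[a["hitm"]] += 1

    totals = {h: sum(g[h] for g in counts.values()) for h in ("rmt", "lcl")}
    return {
        key: {h: (100.0 * g[h] / totals[h] if totals[h] else 0.0)
              for h in ("rmt", "lcl")}
        for key, g in counts.items()
    }
```

With the default 'pid,iaddr', two threads of the same process executing the same instruction end up in one row; coalescing on 'tid' instead splits them into separate rows.
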
STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:
  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cacheline list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on the c2c tool for a detailed use case explanation:
  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1]