perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] -- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides the means for Shared Data C2C/HITM analysis. It
allows you to track down cacheline contention.

On x86, the tool is based on load latency and precise store facility events
provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
with the thresholding feature.

These events provide:
- memory address of the access
- type of the access (load and store details)
- latency (in cycles) of the load access

The c2c tool provides the means to record this data and report back access
details for the cachelines with the highest contention, i.e. the highest
number of HITM accesses.

The basic workflow with this tool follows the standard record/report phases.
The user runs the record command to record event data and the report command
to display it.
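
A HITM is a load that hits a cacheline held in the Modified state in another core's cache. One common way to produce HITM-heavy cachelines is false sharing: unrelated variables written by different CPUs that happen to share a cacheline. The C sketch below is only an illustration, not part of perf. It assumes 64-byte cachelines (typical on current x86, not universal), and the struct names and the same_cacheline() helper are hypothetical. It contrasts a packed layout, whose two counters share one line, with a padded layout that gives each counter its own line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines; check your CPU. */
#define CACHELINE_SIZE 64

/* Two per-thread counters packed together. If thread A keeps writing 'a'
 * while thread B keeps writing 'b', each write invalidates the other
 * core's copy of the same line (false sharing). */
struct counters_packed {
	_Alignas(CACHELINE_SIZE) long a;
	long b;
};

/* The same counters, each aligned to start its own cacheline. */
struct counters_padded {
	_Alignas(CACHELINE_SIZE) long a;
	_Alignas(CACHELINE_SIZE) long b;
};

static_assert(sizeof(struct counters_padded) == 2 * CACHELINE_SIZE,
	      "padded layout should span exactly two cachelines");

/* True if both addresses fall into the same cacheline. */
static bool same_cacheline(const void *x, const void *y)
{
	return (uintptr_t)x / CACHELINE_SIZE == (uintptr_t)y / CACHELINE_SIZE;
}
```

A two-thread program that keeps incrementing packed.a and packed.b is the kind of workload one would expect 'perf c2c record <command>' followed by 'perf c2c report' to flag. Switching it to the padded layout would be expected to reduce the contention on that line.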

RECORD OPTIONS
--------------
-e::
--event=::
Select the PMU event. Use 'perf c2c record -e list'
to list available events.

-v::
--verbose::
Be more verbose (show counter open errors, etc.).

-l::
--ldlat::
Configure mem-loads latency (x86 only).

-k::
--all-kernel::
Configure all used events to run in kernel space.

-u::
--all-user::
Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
vmlinux pathname

-v::
--verbose::
Be more verbose (show counter open errors, etc.).

-i::
--input::
Specify the input file to process.

-N::
--node-info::
Show extra node info in report (see NODE INFO section).

-c::
--coalesce::
Specify sorting fields for single cacheline display.
The following fields are available: tid,pid,iaddr,dso
(see COALESCE section).

-g::
--call-graph::
Set up callchain parameters.
Please refer to the perf-report man page for details.

--stdio::
Force the stdio output (see STDIO OUTPUT section).

--stats::
Display only statistic tables and force stdio mode.

--full-symbols::
Display the full length of symbols.

--no-source::
Do not display the Source:Line column.

--show-all::
Show all captured HITM lines, regardless of the HITM % 0.0005 limit.

-f::
--force::
Don't do ownership validation.

-d::
--display::
Switch the HITM type (rmt, lcl) to display and sort on. Total HITMs are
used by default.

--stitch-lbr::
Show callgraph with stitched LBRs, which may produce a more complete
callgraph. The perf.data file must have been obtained using
perf c2c record --call-graph lbr.
Disabled by default. In common cases with call stack overflows,
it can recreate better call stacks than the default lbr call stack
output. But this approach is not foolproof. There can be cases
where it creates incorrect call stacks from incorrect matches.
The known limitations include exception handling such as
setjmp/longjmp, where calls and returns do not match.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default
(check the perf record man page for details):

-W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on x86:

cpu/mem-loads,ldlat=30/P
cpu/mem-stores/P

and the following on PowerPC:

cpu/mem-loads/
cpu/mem-stores/

The user can pass any 'perf record' option after the '--' mark, for example
(to enable callchains and system-wide monitoring):

$ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

C2C REPORT
----------
The perf c2c report command displays shared data analysis. It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
- sort all the data based on the cacheline address
- store access details for each cacheline
- sort all cachelines based on user settings
- display data
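
The first three steps can be pictured with the following C sketch. It is a simplified illustration, not perf's implementation. struct sample, struct cl_entry and build_cachelines() are hypothetical, a sample carries only a data address and a HITM flag, cachelines are assumed to be 64 bytes, and the HITM count stands in for the user-selected sort key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE_SIZE 64 /* assumption: 64-byte cachelines */

/* Hypothetical, heavily simplified sample record. */
struct sample {
	uint64_t addr; /* data address of the access */
	int hitm;      /* non-zero if the load was a HITM */
};

/* Hypothetical per-cacheline summary. */
struct cl_entry {
	uint64_t line;    /* cacheline address */
	unsigned records; /* samples that touched this line */
	unsigned hitm;    /* how many of them were HITMs */
};

static uint64_t cl_address(uint64_t addr)
{
	return addr & ~(uint64_t)(CACHELINE_SIZE - 1);
}

static int cmp_sample_line(const void *pa, const void *pb)
{
	uint64_t a = cl_address(((const struct sample *)pa)->addr);
	uint64_t b = cl_address(((const struct sample *)pb)->addr);

	return (a > b) - (a < b);
}

/* Descending by HITM count: most contended cachelines first. */
static int cmp_entry_hitm(const void *pa, const void *pb)
{
	const struct cl_entry *a = pa, *b = pb;

	return (a->hitm < b->hitm) - (a->hitm > b->hitm);
}

/* Fills 'out' (capacity 'max') and returns the number of cachelines. */
static size_t build_cachelines(struct sample *s, size_t n,
			       struct cl_entry *out, size_t max)
{
	size_t nr = 0;

	/* 1) sort all samples on their cacheline address */
	qsort(s, n, sizeof(*s), cmp_sample_line);

	/* 2) collapse each run of equal cachelines into one entry */
	for (size_t i = 0; i < n; i++) {
		uint64_t line = cl_address(s[i].addr);

		if (nr == 0 || out[nr - 1].line != line) {
			assert(nr < max);
			out[nr++] = (struct cl_entry){ .line = line };
		}
		out[nr - 1].records++;
		out[nr - 1].hitm += s[i].hitm != 0;
	}

	/* 3) order the cachelines, here by HITM count */
	qsort(out, nr, sizeof(*out), cmp_entry_hitm);
	return nr;
}
```

The last step, displaying the data, is then a walk over the first entries of the sorted array.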

In general, perf c2c report output consists of 2 basic views:
1) a list of the most expensive cachelines
2) offset details for each cacheline

For each cacheline in the 1) list we display the following data
(both stdio and TUI modes output the same fields):

Index
- zero-based index to identify the cacheline

Cacheline
- cacheline address (hex number)

Rmt/Lcl Hitm
- cacheline percentage of all Remote/Local HITM accesses

LLC Load Hitm - Total, LclHitm, RmtHitm
- count of Total/Local/Remote load HITMs

Total records
- sum of all accesses to the cacheline

Total loads
- sum of all load accesses

Total stores
- sum of all store accesses

Store Reference - L1Hit, L1Miss
- L1Hit: store accesses that hit L1
- L1Miss: store accesses that missed L1

Core Load Hit - FB, L1, L2
- count of load hits in FB (Fill Buffer), L1 and L2 cache

LLC Load Hit - LlcHit, LclHitm
- count of LLC load accesses, includes LLC hits and LLC HITMs

RMT Load Hit - RmtHit, RmtHitm
- count of remote load accesses, includes remote hits and remote HITMs

Load Dram - Lcl, Rmt
- count of local and remote DRAM accesses
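
Reading the column descriptions above literally, several counts are sums of others. The C sketch below spells that out and computes the Rmt Hitm percentage as a cacheline's share of all remote HITMs in the report. The struct and function names are hypothetical, and perf's own bookkeeping may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cacheline load counters, named after the columns. */
struct cl_counts {
	unsigned lcl_hitm; /* LclHitm */
	unsigned rmt_hitm; /* RmtHitm */
	unsigned llc_hit;  /* LlcHit */
	unsigned rmt_hit;  /* RmtHit */
};

/* LLC Load Hitm - Total: local plus remote HITMs. */
static unsigned tot_hitm(const struct cl_counts *c)
{
	return c->lcl_hitm + c->rmt_hitm;
}

/* LLC Load Hit: LLC hits plus local (LLC) HITMs. */
static unsigned llc_load_hit(const struct cl_counts *c)
{
	return c->llc_hit + c->lcl_hitm;
}

/* RMT Load Hit: remote hits plus remote HITMs. */
static unsigned rmt_load_hit(const struct cl_counts *c)
{
	return c->rmt_hit + c->rmt_hitm;
}

/* Rmt Hitm: share (in percent) of all remote HITMs that fall on line 'idx'. */
static double rmt_hitm_pct(const struct cl_counts *lines, size_t n, size_t idx)
{
	unsigned total = 0;

	assert(idx < n);
	for (size_t i = 0; i < n; i++)
		total += lines[i].rmt_hitm;
	return total ? 100.0 * lines[idx].rmt_hitm / total : 0.0;
}
```

The Lcl Hitm percentage follows the same pattern with lcl_hitm in place of rmt_hitm.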

For each offset in the 2) list we display the following data:

HITM - Rmt, Lcl
- % of Remote/Local HITM accesses for the given offset within the cacheline

Store Refs - L1 Hit, L1 Miss
- % of store accesses that hit/missed L1 for the given offset within the cacheline

Data address - Offset
- offset address

Pid
- pid of the process responsible for the accesses

Tid
- tid of the thread responsible for the accesses

Code address
- code address responsible for the accesses

cycles - rmt hitm, lcl hitm, load
- sum of cycles for the given accesses: Remote/Local HITM and generic load

cpu cnt
- number of cpus that participated in the access

Symbol
- code symbol related to the 'Code address' value

Shared Object
- shared object name related to the 'Code address' value

Source:Line
- source information related to the 'Code address' value

Node
- nodes participating in the access (see NODE INFO section)
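
The Offset column and the per-offset HITM percentages can be illustrated with the C sketch below. It assumes 64-byte cachelines and takes each percentage relative to the HITMs of the containing cacheline, which is one reading of "for the given offset within the cacheline". cl_offset() and hitm_pct_by_offset() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64 /* assumption: 64-byte cachelines */

/* Offset of a data address inside its cacheline. */
static unsigned cl_offset(uint64_t addr)
{
	return (unsigned)(addr & (CACHELINE_SIZE - 1));
}

/* For HITM data addresses that all belong to one cacheline, store in
 * pct[off] the percentage of the line's HITMs that hit byte offset 'off'. */
static void hitm_pct_by_offset(const uint64_t *addrs, size_t n,
			       double pct[CACHELINE_SIZE])
{
	unsigned count[CACHELINE_SIZE] = { 0 };

	for (size_t i = 0; i < n; i++) {
		/* every address is expected to be on the same cacheline */
		assert(addrs[i] / CACHELINE_SIZE == addrs[0] / CACHELINE_SIZE);
		count[cl_offset(addrs[i])]++;
	}
	for (unsigned off = 0; off < CACHELINE_SIZE; off++)
		pct[off] = n ? 100.0 * count[off] / n : 0.0;
}
```

Offsets with a non-zero percentage correspond to the rows one would expect in the offset view for that cacheline.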

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
- node IDs separated by ','
- node IDs with stats for each ID, in the following format:
Node{cpus %hitms %stores}
- node IDs with a list of affected CPUs, in the following format:
Node{cpu list}

The user can switch between the above flavors with the -N option or
use the 'n' key to switch interactively in TUI mode.
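
The Node{cpus %hitms %stores} flavor can be sketched in C as below. The topology (8 CPUs on 2 nodes), the record layout, and the choice to compute %hitms and %stores relative to all HITMs and stores on the offset are assumptions made for illustration. The cpus value is the number of distinct CPUs of each node seen in the accesses, which is also one way to read the 'cpu cnt' column.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 8  /* hypothetical machine size */
#define NR_NODES 2

/* Hypothetical topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
static const int cpu_node[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Hypothetical, simplified access to one cacheline offset. */
struct access {
	int cpu;
	bool hitm;
	bool store;
};

/* Values behind the Node{cpus %hitms %stores} flavor. */
struct node_stats {
	unsigned cpus;    /* distinct CPUs of the node seen in the accesses */
	double hitm_pct;  /* node's share of the offset's HITMs */
	double store_pct; /* node's share of the offset's stores */
};

static void node_info(const struct access *a, size_t n,
		      struct node_stats out[NR_NODES])
{
	bool seen[NR_CPUS] = { false };
	unsigned hitm[NR_NODES] = { 0 }, stores[NR_NODES] = { 0 };
	unsigned tot_hitm = 0, tot_stores = 0;

	for (size_t i = 0; i < n; i++) {
		int cpu = a[i].cpu;

		assert(cpu >= 0 && cpu < NR_CPUS);
		seen[cpu] = true;
		if (a[i].hitm) {
			hitm[cpu_node[cpu]]++;
			tot_hitm++;
		}
		if (a[i].store) {
			stores[cpu_node[cpu]]++;
			tot_stores++;
		}
	}

	for (int node = 0; node < NR_NODES; node++) {
		out[node].cpus = 0;
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			if (seen[cpu] && cpu_node[cpu] == node)
				out[node].cpus++;
		out[node].hitm_pct = tot_hitm ? 100.0 * hitm[node] / tot_hitm : 0.0;
		out[node].store_pct =
			tot_stores ? 100.0 * stores[node] / tot_stores : 0.0;
	}
}
```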

COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for cacheline offsets:

tid - coalesced by thread IDs (TIDs)
pid - coalesced by process IDs (PIDs)
iaddr - coalesced by code address, the following fields are displayed:
Code address, Code symbol, Shared Object, Source line
dso - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.
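
Coalescing can be thought of as grouping the accesses on the selected fields: accesses that compare equal on every selected field end up in one output row. The C sketch below illustrates that idea with the default 'pid,iaddr' and a few other combinations. The flags, the record layout and the use of a small integer id for the dso are hypothetical. Unlike the real option, the sketch always compares the fields in a fixed order rather than in the order they are listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bit flags, one per coalesce field. */
enum {
	COALESCE_TID   = 1 << 0,
	COALESCE_PID   = 1 << 1,
	COALESCE_IADDR = 1 << 2,
	COALESCE_DSO   = 1 << 3,
};

/* Hypothetical access record for one cacheline offset. */
struct offset_sample {
	int pid, tid;
	uint64_t iaddr; /* code address */
	int dso;        /* stand-in id for the shared object */
};

/* qsort() has no context argument, so the active fields live here. */
static unsigned coalesce_mask;

#define CMP(x, y) (((x) > (y)) - ((x) < (y)))

/* Compare only on the selected fields; equal means "same output row". */
static int cmp_coalesce(const void *pa, const void *pb)
{
	const struct offset_sample *a = pa, *b = pb;
	int r = 0;

	if (!r && (coalesce_mask & COALESCE_TID))
		r = CMP(a->tid, b->tid);
	if (!r && (coalesce_mask & COALESCE_PID))
		r = CMP(a->pid, b->pid);
	if (!r && (coalesce_mask & COALESCE_IADDR))
		r = CMP(a->iaddr, b->iaddr);
	if (!r && (coalesce_mask & COALESCE_DSO))
		r = CMP(a->dso, b->dso);
	return r;
}

/* Sorts the samples on the selected fields and returns how many
 * distinct rows they coalesce into. */
static size_t coalesce_rows(struct offset_sample *s, size_t n, unsigned mask)
{
	size_t rows = 0;

	assert(mask != 0); /* at least one field must be selected */
	coalesce_mask = mask;
	qsort(s, n, sizeof(*s), cmp_coalesce);
	for (size_t i = 0; i < n; i++)
		if (i == 0 || cmp_coalesce(&s[i - 1], &s[i]) != 0)
			rows++;
	return rows;
}
```

Adding a field (for example tid on top of pid,iaddr) can only split rows further, never merge them.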

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:
Trace Event Information
- overall statistics of memory accesses

Global Shared Cache Line Event Information
- overall statistics on shared cachelines

Shared Data Cache Line Table
- list of most expensive cachelines

Shared Cache Line Distribution Pareto
- list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details, please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on the c2c tool for a detailed use case explanation:
https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1]