Blame - tools/perf/Documentation/intel-pt.txt - SHIFTPHONES/kernel/common

Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	1	Intel Processor Trace
				2	=====================
				3
				4	Overview
				5	========
				6
				7	Intel Processor Trace (Intel PT) is an extension of Intel Architecture that
				8	collects information about software execution such as control flow, execution
				9	modes and timings and formats it into highly compressed binary packets.
				10	Technical details are documented in the Intel 64 and IA-32 Architectures
				11	Software Developer Manuals, Chapter 36 Intel Processor Trace.
				12
				13	Intel PT is first supported in Intel Core M and 5th generation Intel Core
				14	processors that are based on the Intel micro-architecture code name Broadwell.
				15
				16	Trace data is collected by 'perf record' and stored within the perf.data file.
				17	See below for options to 'perf record'.
				18
				19	Trace data must be 'decoded' which involves walking the object code and matching
				20	the trace data packets. For example a TNT packet only tells whether a
				21	conditional branch was taken or not taken, so to make use of that packet the
				22	decoder must know precisely which instruction was being executed.
				23
				24	Decoding is done on-the-fly. The decoder outputs samples in the same format as
				25	samples output by perf hardware events, for example as though the "instructions"
				26	or "branches" events had been recorded. Presently 3 tools support this:
				27	'perf script', 'perf report' and 'perf inject'. See below for more information
				28	on using those tools.
				29
				30	The main distinguishing feature of Intel PT is that the decoder can determine
				31	the exact flow of software execution. Intel PT can be used to understand why
				32	and how software got to a certain point, or behaved a certain way. The
				33	software does not have to be recompiled, so Intel PT works with debug or release
				34	builds. However, the executed images are needed, which makes use in JIT-compiled
				35	environments, or with self-modifying code, a challenge. Also, symbols need to be
				36	provided to make sense of addresses.
				37
				38	A limitation of Intel PT is that it produces huge amounts of trace data
				39	(hundreds of megabytes per second per core) which takes a long time to decode,
				40	for example two or three orders of magnitude longer than it took to collect.
				41	Another limitation is the performance impact of tracing, something that will
				42	vary depending on the use-case and architecture.
				43
				44
				45	Quickstart
				46	==========
				47
				48	It is important to start small. That is because it is easy to capture vastly
				49	more data than can possibly be processed.
				50
				51	The simplest thing to do with Intel PT is userspace profiling of small programs.
				52	Data is captured with 'perf record' e.g. to trace 'ls' userspace-only:
				53
				54	perf record -e intel_pt//u ls
				55
				56	And profiled with 'perf report' e.g.
				57
				58	perf report
				59
				60	To also trace kernel space presents a problem, namely kernel self-modifying
				61	code. A fairly good kernel image is available in /proc/kcore but to get an
				62	accurate image a copy of /proc/kcore needs to be made under the same conditions
				63	as the data capture. A script perf-with-kcore can do that, but beware that the
				64	script makes use of 'sudo' to copy /proc/kcore. If you have perf installed
				65	locally from the source tree you can do:
				66
				67	~/libexec/perf-core/perf-with-kcore record pt_ls -e intel_pt// -- ls
				68
				69	which will create a directory named 'pt_ls' and put the perf.data file and
				70	copies of /proc/kcore, /proc/kallsyms and /proc/modules into it. Then to use
				71	'perf report' becomes:
				72
				73	~/libexec/perf-core/perf-with-kcore report pt_ls
				74
				75	Because samples are synthesized after-the-fact, the sampling period can be
				76	selected for reporting, e.g. to sample every microsecond:
				77
				78	~/libexec/perf-core/perf-with-kcore report pt_ls --itrace=i1usge
				79
				80	See the sections below for more information about the --itrace option.
				81
				82	Beware: the smaller the period, the more samples are produced and the
				83	longer it takes to process them.
				84
				85	Also note that the coarseness of Intel PT timing information will start to
				86	distort the statistical value of the sampling as the sampling period becomes
				87	smaller.
				88
				89	To represent software control flow, "branches" samples are produced. By default
				90	a branch sample is synthesized for every single branch. To get an idea what
				91	data is available you can use the 'perf script' tool with no parameters, which
				92	will list all the samples.
				93
				94	perf record -e intel_pt//u ls
				95	perf script
				96
				97	An interesting field that is not printed by default is 'flags' which can be
				98	displayed as follows:
				99
				100	perf script -Fcomm,tid,pid,time,cpu,event,trace,ip,sym,dso,addr,symoff,flags
				101
				102	The flags are "bcrosyiABEx" which stand for branch, call, return, conditional,
				103	system, asynchronous, interrupt, transaction abort, trace begin, trace end, and
				104	in transaction, respectively.
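
The flag letters listed above can be turned into readable names with a small
lookup table. The following is only a sketch: describe_flags is a hypothetical
helper, not part of perf, and it assumes the 'flags' field arrives as a plain
run of the letters given above.

```python
# Letter-to-meaning table copied from the flag list in the text above.
FLAG_NAMES = {
    "b": "branch",
    "c": "call",
    "r": "return",
    "o": "conditional",
    "s": "system",
    "y": "asynchronous",
    "i": "interrupt",
    "A": "transaction abort",
    "B": "trace begin",
    "E": "trace end",
    "x": "in transaction",
}


def describe_flags(flags):
    """Return the meanings of the letters in one 'flags' field value.

    describe_flags is a hypothetical helper for illustration only.
    Unknown characters are passed through unchanged so nothing is
    silently dropped.
    """
    return [FLAG_NAMES.get(letter, letter) for letter in flags]


if __name__ == "__main__":
    print(describe_flags("bc"))
```

Keeping unknown letters in the result makes it obvious when the input does not
match the assumed format.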
				105
				106	While it is possible to create scripts to analyze the data, an alternative
				107	approach is available to export the data to a postgresql database. Refer to
				108	script export-to-postgresql.py for more details, and to script
				109	call-graph-from-postgresql.py for an example of using the database.
				110
				111	As mentioned above, it is easy to capture too much data. One way to limit the
				112	data captured is to use 'snapshot' mode which is explained further below.
				113	Refer to 'new snapshot option' and 'Intel PT modes of operation' further below.
				114
				115	Another problem that will be experienced is decoder errors. They can be caused
				116	by inability to access the executed image, self-modified or JIT-ed code, or the
				117	inability to match side-band information (such as context switches and mmaps)
				118	which results in the decoder not knowing what code was executed.
				119
				120	There is also the problem of perf not being able to copy the data fast enough,
				121	resulting in data lost because the buffer was full. See 'Buffer handling' below
				122	for more details.
				123
				124
				125	perf record
				126	===========
				127
				128	new event
				129	---------
				130
				131	The Intel PT kernel driver creates a new PMU for Intel PT. PMU events are
				132	selected by providing the PMU name followed by the "config" separated by slashes.
				133	An enhancement has been made to allow default "config" e.g. the option
				134
				135	-e intel_pt//
				136
				137	will use a default config value. Currently that is the same as
				138
				139	-e intel_pt/tsc,noretcomp=0/
				140
				141	which is the same as
				142
				143	-e intel_pt/tsc=1,noretcomp=0/
				144
Adrian Hunter	9d1bf02	2015-07-17 19:34:00 +0300	[diff] [blame]	145	Note there are now new config terms - see section 'config terms' further below.
				146
Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	147	The config terms are listed in /sys/devices/intel_pt/format. They are bit
				148	fields within the config member of the struct perf_event_attr which is
				149	passed to the kernel by the perf_event_open system call. They correspond to bit
				150	fields in the IA32_RTIT_CTL MSR. Here is a list of them and their definitions:
				151
Adrian Hunter	9d1bf02	2015-07-17 19:34:00 +0300	[diff] [blame]	152	$ grep -H . /sys/bus/event_source/devices/intel_pt/format/*
				153	/sys/bus/event_source/devices/intel_pt/format/cyc:config:1
				154	/sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22
				155	/sys/bus/event_source/devices/intel_pt/format/mtc:config:9
				156	/sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17
				157	/sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11
				158	/sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27
				159	/sys/bus/event_source/devices/intel_pt/format/tsc:config:10
Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	160
				161	Note that the default config must be overridden for each term i.e.
				162
				163	-e intel_pt/noretcomp=0/
				164
				165	is the same as:
				166
				167	-e intel_pt/tsc=1,noretcomp=0/
				168
				169	So, to disable TSC packets use:
				170
				171	-e intel_pt/tsc=0/
				172
				173	It is also possible to specify the config value explicitly:
				174
				175	-e intel_pt/config=0x400/
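
To illustrate how config terms map onto the config value, the sketch below
packs terms into an integer using the bit positions from the format listing
above. encode_config and the FORMAT table are hypothetical names for
illustration; perf itself reads the positions from sysfs, so the hard-coded
table is an assumption that only holds where the listing above applies.

```python
# Bit ranges (low bit, high bit) taken from the format listing above.
# Hard-coding them is an assumption made for this sketch.
FORMAT = {
    "cyc": (1, 1),
    "cyc_thresh": (19, 22),
    "mtc": (9, 9),
    "mtc_period": (14, 17),
    "noretcomp": (11, 11),
    "psb_period": (24, 27),
    "tsc": (10, 10),
}


def encode_config(**terms):
    """Pack config terms into a single config value (hypothetical helper)."""
    config = 0
    for name, value in terms.items():
        low, high = FORMAT[name]
        width = high - low + 1
        # Reject values that do not fit in the term's bit field.
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
        config |= value << low
    return config


if __name__ == "__main__":
    # The default terms tsc=1,noretcomp=0 give the value shown above.
    print(hex(encode_config(tsc=1, noretcomp=0)))
```

With tsc at bit 10 and noretcomp at bit 11, tsc=1,noretcomp=0 sets only bit 10,
which is why the explicit form above uses 0x400.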
				176
				177	Note that, as with all events, the event is suffixed with event modifiers:
				178
				179	u userspace
				180	k kernel
				181	h hypervisor
				182	G guest
				183	H host
				184	p precise ip
				185
				186	'h', 'G' and 'H' are for virtualization which is not supported by Intel PT.
				187	'p' is also not relevant to Intel PT. So only options 'u' and 'k' are
				188	meaningful for Intel PT.
				189
				190	perf_event_attr is displayed if the -vv option is used e.g.
				191
				192	------------------------------------------------------------
				193	perf_event_attr:
				194	type 6
				195	size 112
				196	config 0x400
				197	{ sample_period, sample_freq } 1
				198	sample_type IP|TID|TIME|CPU|IDENTIFIER
				199	read_format ID
				200	disabled 1
				201	inherit 1
				202	exclude_kernel 1
				203	exclude_hv 1
				204	enable_on_exec 1
				205	sample_id_all 1
				206	------------------------------------------------------------
				207	sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
				208	sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
				209	sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
				210	sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
				211	------------------------------------------------------------
				212
				213
Adrian Hunter	9d1bf02	2015-07-17 19:34:00 +0300	[diff] [blame]	214	config terms
				215	------------
				216
				217	The June 2015 version of Intel 64 and IA-32 Architectures Software Developer
				218	Manuals, Chapter 36 Intel Processor Trace, defined new Intel PT features.
				219	Some of the features are reflected in new config terms. All the config terms are
				220	described below.
				221
				222	tsc Always supported. Produces TSC timestamp packets to provide
				223	timing information. In some cases it is possible to decode
				224	without timing information, for example a per-thread context
				225	that does not overlap executable memory maps.
				226
				227	The default config selects tsc (i.e. tsc=1).
				228
				229	noretcomp Always supported. Disables "return compression" so a TIP packet
				230	is produced when a function returns. Causes more packets to be
				231	produced but might make decoding more reliable.
				232
				233	The default config does not select noretcomp (i.e. noretcomp=0).
				234
				235	psb_period Allows the frequency of PSB packets to be specified.
				236
				237	The PSB packet is a synchronization packet that provides a
				238	starting point for decoding or recovery from errors.
				239
				240	Support for psb_period is indicated by:
				241
				242	/sys/bus/event_source/devices/intel_pt/caps/psb_cyc
				243
				244	which contains "1" if the feature is supported and "0"
				245	otherwise.
				246
				247	Valid values are given by:
				248
				249	/sys/bus/event_source/devices/intel_pt/caps/psb_periods
				250
				251	which contains a hexadecimal value, the bits of which represent
				252	valid values e.g. bit 2 set means value 2 is valid.
				253
				254	The psb_period value is converted to the approximate number of
				255	trace bytes between PSB packets as:
				256
				257	2 ^ (value + 11)
				258
				259	e.g. value 3 means 16KiB between PSBs
				260
				261	If an invalid value is entered, the error message
				262	will give a list of valid values e.g.
				263
				264	$ perf record -e intel_pt/psb_period=15/u uname
				265	Invalid psb_period for intel_pt. Valid values are: 0-5
				266
				267	If MTC packets are selected, the default config selects a value
				268	of 3 (i.e. psb_period=3) or the nearest lower value that is
				269	supported (0 is always supported). Otherwise the default is 0.
				270
				271	If decoding is expected to be reliable and the buffer is large
				272	then a large PSB period can be used.
				273
				274	Because a TSC packet is produced with PSB, the PSB period can
				275	also affect the granularity of timing information in the absence
				276	of MTC or CYC.
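
The two calculations described for psb_period (decoding the caps bitmask and
converting a value to a byte count) can be sketched as follows. Both function
names are hypothetical, and the mask "3f" used in the example is an assumption
chosen to match the "0-5" error message shown above.

```python
def valid_values(caps_hex):
    """Decode a caps bitmask (hexadecimal string) into the valid values.

    Bit N set in the mask means value N is valid.
    """
    mask = int(caps_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def psb_period_bytes(value):
    """Approximate number of trace bytes between PSB packets: 2 ^ (value + 11)."""
    return 1 << (value + 11)


if __name__ == "__main__":
    # "3f" is an assumed example mask, not read from real hardware.
    for value in valid_values("3f"):
        print(value, psb_period_bytes(value))
```

The same bitmask decoding applies to the mtc_periods and cycle_thresholds
files described below, since they use the same "bit N means value N" encoding.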
				277
				278	mtc Produces MTC timing packets.
				279
				280	MTC packets provide finer grain timestamp information than TSC
				281	packets. MTC packets record time using the hardware crystal
				282	clock (CTC) which is related to TSC packets using a TMA packet.
				283
				284	Support for this feature is indicated by:
				285
				286	/sys/bus/event_source/devices/intel_pt/caps/mtc
				287
				288	which contains "1" if the feature is supported and
				289	"0" otherwise.
				290
				291	The frequency of MTC packets can also be specified - see
				292	mtc_period below.
				293
				294	mtc_period Specifies how frequently MTC packets are produced - see mtc
				295	above for how to determine if MTC packets are supported.
				296
				297	Valid values are given by:
				298
				299	/sys/bus/event_source/devices/intel_pt/caps/mtc_periods
				300
				301	which contains a hexadecimal value, the bits of which represent
				302	valid values e.g. bit 2 set means value 2 is valid.
				303
				304	The mtc_period value is converted to the MTC frequency as:
				305
				306	CTC-frequency / (2 ^ value)
				307
				308	e.g. value 3 means one eighth of CTC-frequency
				309
				310	Where CTC is the hardware crystal clock, the frequency of which
				311	can be related to TSC via values provided in cpuid leaf 0x15.
				312
				313	If an invalid value is entered, the error message
				314	will give a list of valid values e.g.
				315
				316	$ perf record -e intel_pt/mtc_period=15/u uname
				317	Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9
				318
				319	The default value is 3 or the nearest lower value
				320	that is supported (0 is always supported).
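
The default selection rule ("3 or the nearest lower value that is supported")
and the frequency formula can be sketched as below. The function names, the
mask "249" (chosen to match the "0,3,6,9" error message above) and the 24 MHz
CTC frequency are assumptions for illustration only.

```python
def default_period(caps_hex, preferred=3):
    """Pick the preferred value, or the nearest lower supported one."""
    mask = int(caps_hex, 16)
    for value in range(preferred, -1, -1):
        if mask & (1 << value):
            return value
    # The text states that 0 is always supported.
    return 0


def mtc_frequency(ctc_hz, value):
    """MTC packet frequency: CTC-frequency / (2 ^ value)."""
    return ctc_hz / (2 ** value)


if __name__ == "__main__":
    period = default_period("249")           # assumed example mask
    print(period, mtc_frequency(24_000_000, period))  # assumed 24 MHz CTC
```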
				321
				322	cyc Produces CYC timing packets.
				323
				324	CYC packets provide even finer grain timestamp information than
				325	MTC and TSC packets. A CYC packet contains the number of CPU
				326	cycles since the last CYC packet. Unlike MTC and TSC packets,
				327	CYC packets are only sent when another packet is also sent.
				328
				329	Support for this feature is indicated by:
				330
				331	/sys/bus/event_source/devices/intel_pt/caps/psb_cyc
				332
				333	which contains "1" if the feature is supported and
				334	"0" otherwise.
				335
				336	The number of CYC packets produced can be reduced by specifying
				337	a threshold - see cyc_thresh below.
				338
				339	cyc_thresh Specifies how frequently CYC packets are produced - see cyc
				340	above for how to determine if CYC packets are supported.
				341
				342	Valid cyc_thresh values are given by:
				343
				344	/sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds
				345
				346	which contains a hexadecimal value, the bits of which represent
				347	valid values e.g. bit 2 set means value 2 is valid.
				348
				349	The cyc_thresh value represents the minimum number of CPU cycles
				350	that must have passed before a CYC packet can be sent. The
				351	number of CPU cycles is:
				352
				353	2 ^ (value - 1)
				354
				355	e.g. value 4 means 8 CPU cycles must pass before a CYC packet
				356	can be sent. Note a CYC packet is still only sent when another
				357	packet is sent, not at, e.g. every 8 CPU cycles.
				358
				359	If an invalid value is entered, the error message
				360	will give a list of valid values e.g.
				361
				362	$ perf record -e intel_pt/cyc,cyc_thresh=15/u uname
				363	Invalid cyc_thresh for intel_pt. Valid values are: 0-12
				364
				365	CYC packets are not requested by default.
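
The cyc_thresh formula can be checked with a few lines of code. This is a
sketch only: cyc_thresh_cycles is a hypothetical helper, and treating value 0
as "no minimum" is an assumption, since the formula above is only meaningful
for values of 1 and above.

```python
def cyc_thresh_cycles(value):
    """Minimum CPU cycles that must pass before a CYC packet can be sent."""
    if value < 0:
        raise ValueError("cyc_thresh cannot be negative")
    if value == 0:
        # Assumption: 0 means there is no minimum cycle count.
        return 0
    # Formula from the text: 2 ^ (value - 1)
    return 1 << (value - 1)


if __name__ == "__main__":
    for value in (1, 4, 12):
        print(value, cyc_thresh_cycles(value))
```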
				366
Adrian Hunter	9d1bf02	2015-07-17 19:34:00 +0300	[diff] [blame]	367
Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	368	new snapshot option
				369	-------------------
				370
Adrian Hunter	9d1bf02	2015-07-17 19:34:00 +0300	[diff] [blame]	371	The difference between full trace and snapshot from the kernel's perspective is
				372	that in full trace we don't overwrite trace data that the user hasn't collected
				373	yet (and indicated that by advancing aux_tail), whereas in snapshot mode we let
				374	the trace run and overwrite older data in the buffer so that whenever something
				375	interesting happens, we can stop it and grab a snapshot of what was going on
				376	around that interesting moment.
				377
Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	378	To select snapshot mode a new option has been added:
				379
				380	-S
				381
				382	Optionally it can be followed by the snapshot size e.g.
				383
				384	-S0x100000
				385
				386	The default snapshot size is the auxtrace mmap size. If neither auxtrace mmap size
				387	nor snapshot size is specified, then the default is 4MiB for privileged users
				388	(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users.
				389	If an unprivileged user does not specify mmap pages, the mmap pages will be
				390	reduced as described in the 'new auxtrace mmap size option' section below.
				391
				392	The snapshot size is displayed if the option -vv is used e.g.
				393
				394	Intel PT snapshot size: %zu
				395
				396
				397	new auxtrace mmap size option
				398	-----------------------------
				399
				400	Intel PT buffer size is specified by an addition to the -m option e.g.
				401
				402	-m,16
				403
				404	selects a buffer size of 16 pages i.e. 64KiB.
				405
				406	Note that the existing functionality of -m is unchanged. The auxtrace mmap size
				407	is specified by the optional addition of a comma and the value.
				408
				409	The default auxtrace mmap size for Intel PT is 4MiB/page_size for privileged users
				410	(or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users.
				411	If an unprivileged user does not specify mmap pages, the mmap pages will be
				412	reduced from the default 512KiB/page_size to 256KiB/page_size, otherwise the
				413	user is likely to get an error as they exceed their mlock limit (Max locked
				414	memory as shown in /proc/self/limits). Note that perf does not count the first
				415	512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per cpu
				416	against the mlock limit so an unprivileged user is allowed 512KiB per cpu plus
				417	their mlock limit (which defaults to 64KiB but is not multiplied by the number
				418	of cpus).
				419
				420	In full-trace mode, powers of two are allowed for buffer size, with a minimum
				421	size of 2 pages. In snapshot mode, it is the same but the minimum size is
				422	1 page.
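
The size rules above (a power of two, at least 2 pages in full-trace mode and
at least 1 page in snapshot mode) can be sketched as a simple check. The
function names are hypothetical, and the 4096-byte page size is an assumption
that matches the "16 pages i.e. 64KiB" example above.

```python
def is_power_of_two(n):
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def aux_pages_ok(pages, snapshot=False):
    """Check an auxtrace mmap size (in pages) against the rules in the text."""
    minimum = 1 if snapshot else 2
    return is_power_of_two(pages) and pages >= minimum


def aux_bytes(pages, page_size=4096):
    """Convert pages to bytes; the 4096-byte page size is an assumption."""
    return pages * page_size


if __name__ == "__main__":
    for pages in (1, 2, 12, 16):
        print(pages, aux_pages_ok(pages), aux_pages_ok(pages, snapshot=True))
```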
				423
				424	The mmap size and auxtrace mmap size are displayed if the -vv option is used e.g.
				425
				426	mmap length 528384
				427	auxtrace mmap length 4198400
				428
				429
				430	Intel PT modes of operation
				431	---------------------------
				432
				433	Intel PT can be used in 2 modes:
				434	full-trace mode
				435	snapshot mode
				436
				437	Full-trace mode traces continuously e.g.
				438
				439	perf record -e intel_pt//u uname
				440
				441	Snapshot mode captures the available data when a signal is sent e.g.
				442
				443	perf record -v -e intel_pt//u -S ./loopy 1000000000 &
				444	[1] 11435
				445	kill -USR2 11435
				446	Recording AUX area tracing snapshot
				447
				448	Note that the signal sent is SIGUSR2.
				449	Note that "Recording AUX area tracing snapshot" is displayed because the -v
				450	option is used.
				451
				452	The 2 modes cannot be used together.
				453
				454
				455	Buffer handling
				456	---------------
				457
				458	There may be buffer limitations (i.e. single ToPA entry) which means that actual
				459	buffer sizes are limited to powers of 2 up to 4MiB (MAX_ORDER). In order to
				460	provide other sizes, and in particular an arbitrarily large size, multiple
				461	buffers are logically concatenated. However an interrupt must be used to switch
				462	between buffers. That has two potential problems:
				463	a) the interrupt may not be handled in time so that the current buffer
				464	becomes full and some trace data is lost.
				465	b) the interrupts may slow the system and affect the performance
				466	results.
				467
				468	If trace data is lost, the driver sets 'truncated' in the PERF_RECORD_AUX event
				469	which the tools report as an error.
				470
				471	In full-trace mode, the driver waits for data to be copied out before allowing
				472	the (logical) buffer to wrap-around. If data is not copied out quickly enough,
				473	again 'truncated' is set in the PERF_RECORD_AUX event. If the driver has to
				474	wait, the intel_pt event gets disabled. Because it is difficult to know when
				475	that happens, perf tools always re-enable the intel_pt event after copying out
				476	data.
				477
				478
				479	Intel PT and build ids
				480	----------------------
				481
				482	By default "perf record" post-processes the event stream to find all build ids
				483	for executables for all addresses sampled. Deliberately, Intel PT is not
				484	decoded for that purpose (it would take too long). Instead the build ids for
				485	all executables encountered (due to mmap, comm or task events) are included
				486	in the perf.data file.
				487
				488	To see buildids included in the perf.data file use the command:
				489
				490	perf buildid-list
				491
				492	If the perf.data file contains Intel PT data, that is the same as:
				493
				494	perf buildid-list --with-hits
				495
				496
				497	Snapshot mode and event disabling
				498	---------------------------------
				499
				500	In order to make a snapshot, the intel_pt event is disabled using an IOCTL,
				501	namely PERF_EVENT_IOC_DISABLE. However doing that can also disable the
				502	collection of side-band information. In order to prevent that, a dummy
				503	software event has been introduced that permits tracking events (like mmaps) to
				504	continue to be recorded while intel_pt is disabled. That is important to ensure
				505	there is complete side-band information to allow the decoding of subsequent
				506	snapshots.
				507
				508	A test has been created for that. To find the test:
				509
				510	perf test list
				511	...
				512	23: Test using a dummy software event to keep tracking
				513
				514	To run the test:
				515
				516	perf test 23
				517	23: Test using a dummy software event to keep tracking : Ok
				518
				519
				520	perf record modes (nothing new here)
				521	------------------------------------
				522
				523	perf record essentially operates in one of three modes:
				524	per thread
				525	per cpu
				526	workload only
				527
				528	"per thread" mode is selected by -t or by --per-thread (with -p or -u or just a
				529	workload).
				530	"per cpu" is selected by -C or -a.
				531	"workload only" mode is selected by not using the other options but providing a
				532	command to run (i.e. the workload).
				533
				534	In per-thread mode an exact list of threads is traced. There is no inheritance.
				535	Each thread has its own event buffer.
				536
				537	In per-cpu mode all processes (or processes from the selected cgroup i.e. -G
				538	option, or processes selected with -p or -u) are traced. Each cpu has its own
				539	buffer. Inheritance is allowed.
				540
				541	In workload-only mode, the workload is traced but with per-cpu buffers.
				542	Inheritance is allowed. Note that you can now trace a workload in per-thread
				543	mode by using the --per-thread option.
				544
				545
				546	Privileged vs non-privileged users
				547	----------------------------------
				548
				549	Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users
				550	have memory limits imposed upon them. That affects what buffer sizes they can
				551	have as outlined above.
				552
Arnaldo Carvalho de Melo	699c12a	2016-11-09 11:04:05 -0300	[diff] [blame]	553	The v4.2 kernel introduced support for a context switch metadata event,
				554	PERF_RECORD_SWITCH, which allows unprivileged users to see when their processes
				555	are scheduled out and in, but not by whom. That information is left to
				556	PERF_RECORD_SWITCH_CPU_WIDE, which is only accessible in system-wide context,
				557	which in turn requires CAP_SYS_ADMIN.
				558
				559	For further information, see commit 45ac1403f564 ("perf: Add PERF_RECORD_SWITCH
				560	to indicate context switches"), which introduces these metadata events.
				561
				562	When working with kernels < v4.2, the following considerations must be taken
				563	into account, because the sched:sched_switch tracepoint is used to obtain such
				563	information:
				564
Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	565	Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users are
				566	not permitted to use tracepoints which means there is insufficient side-band
				567	information to decode Intel PT in per-cpu mode, and potentially workload-only
				568	mode too if the workload creates new processes.
				569
				570	Note also, that to use tracepoints, read-access to debugfs is required. So if
				571	debugfs is not mounted or the user does not have read-access, it will again not
				572	be possible to decode Intel PT in per-cpu mode.
				573
				574
				575	sched_switch tracepoint
				576	-----------------------
				577
				578	The sched_switch tracepoint is used to provide side-band data for Intel PT
Arnaldo Carvalho de Melo	699c12a	2016-11-09 11:04:05 -0300	[diff] [blame]	579	decoding in kernels where the PERF_RECORD_SWITCH metadata event isn't
				580	available.
				581
				582	The sched_switch events are automatically added. e.g. the second event shown
				583	below:
Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	584
				585	$ perf record -vv -e intel_pt//u uname
				586	------------------------------------------------------------
				587	perf_event_attr:
				588	type 6
				589	size 112
				590	config 0x400
				591	{ sample_period, sample_freq } 1
				592	sample_type IP|TID|TIME|CPU|IDENTIFIER
				593	read_format ID
				594	disabled 1
				595	inherit 1
				596	exclude_kernel 1
				597	exclude_hv 1
				598	enable_on_exec 1
				599	sample_id_all 1
				600	------------------------------------------------------------
				601	sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
				602	sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
				603	sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
				604	sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
				605	------------------------------------------------------------
				606	perf_event_attr:
				607	type 2
				608	size 112
				609	config 0x108
				610	{ sample_period, sample_freq } 1
				611	sample_type IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER
				612	read_format ID
				613	inherit 1
				614	sample_id_all 1
				615	exclude_guest 1
				616	------------------------------------------------------------
				617	sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8
				618	sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8
				619	sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8
				620	sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8
				621	------------------------------------------------------------
				622	perf_event_attr:
				623	type 1
				624	size 112
				625	config 0x9
				626	{ sample_period, sample_freq } 1
				627	sample_type IP|TID|TIME|IDENTIFIER
				628	read_format ID
				629	disabled 1
				630	inherit 1
				631	exclude_kernel 1
				632	exclude_hv 1
				633	mmap 1
				634	comm 1
				635	enable_on_exec 1
				636	task 1
				637	sample_id_all 1
				638	mmap2 1
				639	comm_exec 1
				640	------------------------------------------------------------
				641	sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8
				642	sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8
				643	sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8
				644	sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8
				645	mmap size 528384B
				646	AUX area mmap length 4194304
				647	perf event ring buffer mmapped per cpu
				648	Synthesizing auxtrace information
				649	Linux
				650	[ perf record: Woken up 1 times to write data ]
				651	[ perf record: Captured and wrote 0.042 MB perf.data ]
				652
				653	Note, the sched_switch event is only added if the user is permitted to use it
				654	and only in per-cpu mode.
				655
				656	Note also, the sched_switch event is only added if TSC packets are requested.
				657	That is because, in the absence of timing information, the sched_switch events
				658	cannot be matched against the Intel PT trace.
				659
				660
				661	perf script
				662	===========
				663
				664	By default, perf script will decode trace data found in the perf.data file.
				665	This can be further controlled by new option --itrace.
				666
				667
				668	New --itrace option
				669	-------------------
				670
				671	Having no option is the same as
				672
				673	--itrace
				674
				675	which, in turn, is the same as
				676
				677	--itrace=ibxe
				678
				679	The letters are:
				680
				681	i synthesize "instructions" events
				682	b synthesize "branches" events
				683	x synthesize "transactions" events
				684	c synthesize branches events (calls only)
				685	r synthesize branches events (returns only)
				686	e synthesize tracing error events
				687	d create a debug log
				688	g synthesize a call chain (use with i or x)
Adrian Hunter	f14445e	2015-09-25 16:15:45 +0300	[diff] [blame]	689	l synthesize last branch entries (use with i or x)
Andi Kleen	d1706b3	2016-03-28 10:45:38 -0700	[diff] [blame]	690	s skip initial number of events
Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	691
				692	"Instructions" events look like they were recorded by "perf record -e
				693	instructions".
				694
				695	"Branches" events look like they were recorded by "perf record -e branches". "c"
				696	and "r" can be combined to get calls and returns.
				697
				698	"Transactions" events correspond to the start or end of transactions. The
				699	'flags' field can be used in perf script to determine whether the event is a
				700	transaction start, commit or abort.
				701
				702	Error events are new. They show where the decoder lost the trace. Error events
				703	are quite important. Users must know if what they are seeing is a complete
				704	picture or not.
				705
				706	The "d" option will cause the creation of a file "intel_pt.log" containing all
				707	decoded packets and instructions. Note that this option slows down the decoder
				708	and that the resulting file may be very large.
				709
				710	In addition, the period of the "instructions" event can be specified. e.g.
				711
				712	--itrace=i10us
				713
				714	sets the period to 10us i.e. one instruction sample is synthesized for each 10
				715	microseconds of trace. Alternatives to "us" are "ms" (milliseconds),
				716	"ns" (nanoseconds), "t" (TSC ticks) or "i" (instructions).
				717
				718	"ms", "us" and "ns" are converted to TSC ticks.
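
As a rough illustration of the unit handling described above, the sketch below
parses a period such as "10us" and converts time-based periods to TSC ticks.
It assumes a known, constant TSC frequency passed in as tsc_hz; perf itself
uses conversion parameters recorded with the trace rather than a plain
frequency, so both functions are hypothetical and the arithmetic is a
simplification.

```python
import re

# Nanoseconds per time unit accepted after the period number.
NS_PER_UNIT = {"ms": 1_000_000, "us": 1_000, "ns": 1}


def parse_period(text):
    """Split a period such as '10us' into (10, 'us')."""
    match = re.fullmatch(r"(\d+)(ms|us|ns|t|i)", text)
    if match is None:
        raise ValueError(f"unrecognized period: {text!r}")
    return int(match.group(1)), match.group(2)


def period_in_tsc_ticks(text, tsc_hz):
    """Convert a period to TSC ticks, assuming a constant TSC frequency."""
    count, unit = parse_period(text)
    if unit == "t":
        # Already expressed in TSC ticks.
        return count
    if unit == "i":
        raise ValueError("an instruction-count period has no tick equivalent")
    return count * NS_PER_UNIT[unit] * tsc_hz // 1_000_000_000


if __name__ == "__main__":
    # 2 GHz is an assumed TSC frequency used only for this example.
    print(period_in_tsc_ticks("10us", 2_000_000_000))
```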
				719
				720	The timing information included with Intel PT does not give the time of every
				721	instruction. Consequently, for the purpose of sampling, the decoder estimates
				722	the time since the last timing packet based on 1 tick per instruction. The time
				723	on the sample is not adjusted and reflects the last known value of TSC.
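
The period conversion and the time estimate described above can be sketched
with integer arithmetic. This is a simplified illustration only, not the
decoder's code: it assumes a constant, known TSC frequency (the tsc_hz
parameter is hypothetical), whereas perf derives the conversion from
parameters supplied by the kernel. Both function names are made up.

```shell
# Convert a period in nanoseconds to TSC ticks.
# Assumption: tsc_hz is the TSC frequency in Hz (hypothetical input).
ns_to_tsc_ticks() {
    ns=$1
    tsc_hz=$2
    echo $(( ns * tsc_hz / 1000000000 ))
}

# Estimate the current TSC value from the last timing packet, counting
# 1 tick per instruction decoded since that packet.
estimate_tsc() {
    last_tsc=$1
    insn_since=$2
    echo $(( last_tsc + insn_since ))
}

ns_to_tsc_ticks 10000 2400000000   # 10us at 2.4 GHz, prints 24000
estimate_tsc 1000000 350           # prints 1000350
```

The estimate only decides when the next sample is due. As stated above, the
time recorded on the sample remains the last known TSC value.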
				724
				725	For Intel PT, the default period is 100us.
				726
Adrian Hunter	e179134	2015-09-25 16:15:32 +0300	[diff] [blame]	727	Setting it to a zero period means "as often as possible".
				728
				729	In the case of Intel PT that is the same as a period of 1 and a unit of
				730	'instructions' (i.e. --itrace=i1i).
				731
Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	732	Also the call chain size (default 16, max. 1024) for instructions or
				733	transactions events can be specified. e.g.
				734
				735	--itrace=ig32
				736	--itrace=xg32
				737
Adrian Hunter	f14445e	2015-09-25 16:15:45 +0300	[diff] [blame]	738	Also the number of last branch entries (default 64, max. 1024) for instructions or
				739	transactions events can be specified. e.g.
				740
				741	--itrace=il10
				742	--itrace=xl10
				743
				744	Note that last branch entries are cleared for each sample, so there is no overlap
				745	from one sample to the next.
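
The size options above follow one pattern: an event letter (i or x), a kind
letter (g for call chain, l for last branch entries), then the size. The
helper below is a hypothetical sketch that builds such an option and rejects
sizes above the documented maximum of 1024. Rejecting is a choice made for
this example; it does not describe how perf itself treats oversized values.

```shell
# Build an --itrace option that sets the call chain size (g) or the
# number of last branch entries (l) for instructions (i) or
# transactions (x) events. itrace_sized is a made-up helper name.
itrace_sized() {
    event=$1   # i or x
    kind=$2    # g or l
    size=$3
    case $event in i|x) ;; *) echo "event must be i or x" >&2; return 1 ;; esac
    case $kind in g|l) ;; *) echo "kind must be g or l" >&2; return 1 ;; esac
    case $size in ''|*[!0-9]*) echo "size must be a number" >&2; return 1 ;; esac
    if [ "$size" -gt 1024 ]; then
        echo "size exceeds maximum of 1024" >&2
        return 1
    fi
    echo "--itrace=${event}${kind}${size}"
}

itrace_sized i g 32    # prints --itrace=ig32
itrace_sized x l 10    # prints --itrace=xl10

# Usage (needs perf and a perf.data file recorded with Intel PT):
#   perf script "$(itrace_sized i g 32)"
```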
				746
Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	747	To disable trace decoding entirely, use the option --no-itrace.
				748
Andi Kleen	d1706b3	2016-03-28 10:45:38 -0700	[diff] [blame]	749	It is also possible to skip events generated (instructions, branches, transactions)
				750	at the beginning. This is useful to ignore initialization code.
				751
				752	--itrace=i0nss1000000
				753
				754	skips the first million instructions.
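
The option string above combines two parts: "i0ns" requests instructions
events with a zero period (as often as possible), and "s1000000" skips the
first million of them. A minimal sketch that builds the option for an
arbitrary skip count follows; the helper name skip_initial_instructions is
made up for illustration.

```shell
# Build an --itrace option that synthesizes an instructions event as
# often as possible (i0ns) and skips the first <count> events (s<count>).
skip_initial_instructions() {
    count=$1
    case $count in ''|*[!0-9]*) echo "count must be a number" >&2; return 1 ;; esac
    echo "--itrace=i0nss${count}"
}

skip_initial_instructions 1000000   # prints --itrace=i0nss1000000

# Usage (needs perf and a perf.data file recorded with Intel PT):
#   perf script "$(skip_initial_instructions 1000000)"
```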
Adrian Hunter	5efb1d5	2015-07-17 19:33:42 +0300	[diff] [blame]	755
				756	dump option
				757	-----------
				758
				759	perf script has an option (-D) to "dump" the events i.e. display the binary
				760	data.
				761
				762	When -D is used, Intel PT packets are displayed. The packet decoder does not
				763	pay attention to PSB packets, but just decodes the bytes - so the packets seen
				764	by the actual decoder may not be identical in places where the data is corrupt.
				765	One example of that would be when the buffer-switching interrupt has been too
				766	slow, and the buffer has been filled completely. In that case, the last packet
				767	in the buffer might be truncated and immediately followed by a PSB as the trace
				768	continues in the next buffer.
				769
				770	To disable the display of Intel PT packets, combine the -D option with
				771	--no-itrace.
				772
				773
				774	perf report
				775	===========
				776
				777	By default, perf report will decode trace data found in the perf.data file.
				778	Decoding can be further controlled with the --itrace option, which works
				779	exactly as for perf script, except that the default is --itrace=igxe.
				780
				781
				782	perf inject
				783	===========
				784
				785	perf inject also accepts the --itrace option in which case tracing data is
				786	removed and replaced with the synthesized events. e.g.
				787
				788	perf inject --itrace -i perf.data -o perf.data.new
Adrian Hunter	ba11ba6	2015-09-25 16:15:56 +0300	[diff] [blame]	789
				790	Below is an example of using Intel PT with autofdo. It requires autofdo
				791	(https://github.com/google/autofdo) and gcc version 5. The bubble
				792	sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial)
				793	amended to take the number of elements as a parameter.
				794
				795	$ gcc-5 -O3 sort.c -o sort_optimized
				796	$ ./sort_optimized 30000
				797	Bubble sorting array of 30000 elements
				798	2254 ms
				799
				800	$ cat ~/.perfconfig
				801	[intel-pt]
				802	mispred-all
				803
				804	$ perf record -e intel_pt//u ./sort 3000
				805	Bubble sorting array of 3000 elements
				806	58 ms
				807	[ perf record: Woken up 2 times to write data ]
				808	[ perf record: Captured and wrote 3.939 MB perf.data ]
				809	$ perf inject -i perf.data -o inj --itrace=i100usle --strip
				810	$ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1
				811	$ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
				812	$ ./sort_autofdo 30000
				813	Bubble sorting array of 30000 elements
				814	2155 ms
				815
				816	Note there is currently no advantage to using Intel PT instead of LBR, but
				817	that may change in the future if greater use is made of the data.