========================
ftrace - Function Tracer
========================
				4
				5	Copyright 2008 Red Hat Inc.
				6
				7	:Author: Steven Rostedt <srostedt@redhat.com>
				8	:License: The GNU Free Documentation License, Version 1.2
				9	(dual licensed under the GPL v2)
				10	:Original Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
				11	John Kacur, and David Teigland.
				12
				13	- Written for: 2.6.28-rc2
				14	- Updated for: 3.10
				15	- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
				16	- Converted to rst format - Changbin Du <changbin.du@intel.com>
				17
				18	Introduction
				19	------------
				20
				21	Ftrace is an internal tracer designed to help out developers and
				22	designers of systems to find what is going on inside the kernel.
				23	It can be used for debugging or analyzing latencies and
				24	performance issues that take place outside of user-space.
				25
Although ftrace is typically considered the function tracer, it
is really a framework of several assorted tracing utilities.
There's latency tracing to examine what occurs between interrupts
disabled and enabled, as well as for preemption and from the time
a task is woken to the time the task is actually scheduled in.
				31
One of the most common uses of ftrace is event tracing.
Throughout the kernel are hundreds of static event points that
can be enabled via the tracefs file system to see what is
going on in certain parts of the kernel.
				36
				37	See events.txt for more information.
				38
				39
				40	Implementation Details
				41	----------------------
				42
				43	See :doc:`ftrace-design` for details for arch porters and such.
				44
				45
				46	The File System
				47	---------------
				48
				49	Ftrace uses the tracefs file system to hold the control files as
				50	well as the files to display output.
				51
				52	When tracefs is configured into the kernel (which selecting any ftrace
				53	option will do) the directory /sys/kernel/tracing will be created. To mount
				54	this directory, you can add to your /etc/fstab file::
				55
				56	tracefs /sys/kernel/tracing tracefs defaults 0 0
				57
				58	Or you can mount it at run time with::
				59
				60	mount -t tracefs nodev /sys/kernel/tracing
				61
				62	For quicker access to that directory you may want to make a soft link to
				63	it::
				64
				65	ln -s /sys/kernel/tracing /tracing
				66
				67	.. attention::
				68
				69	Before 4.1, all ftrace tracing control files were within the debugfs
				70	file system, which is typically located at /sys/kernel/debug/tracing.
				71	For backward compatibility, when mounting the debugfs file system,
				72	the tracefs file system will be automatically mounted at:
				73
				74	/sys/kernel/debug/tracing
				75
				76	All files located in the tracefs file system will be located in that
				77	debugfs file system directory as well.
				78
				79	.. attention::
				80
				81	Any selected ftrace option will also create the tracefs file system.
				82	The rest of the document will assume that you are in the ftrace directory
				83	(cd /sys/kernel/tracing) and will only concentrate on the files within that
				84	directory and not distract from the content with the extended
				85	"/sys/kernel/tracing" path name.
				86
				87	That's it! (assuming that you have ftrace configured into your kernel)
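
A quick way to confirm that tracefs is mounted, and where, is to scan the
mount table. The sketch below is illustrative only: the ``find_tracefs``
helper and its optional file argument are names made up for this example
and are not part of ftrace.

```shell
# Print the mount point of every tracefs instance found in a mount table.
# The table defaults to /proc/mounts; the optional argument (hypothetical,
# added so the helper can be pointed at another file) overrides it.
find_tracefs() {
    # /proc/mounts columns: device, mount point, file system type, ...
    awk '$3 == "tracefs" { print $2 }' "${1:-/proc/mounts}"
}
```

For instance, ``cd "$(find_tracefs | head -n 1)"`` changes into the first
tracefs mount that was found.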
				88
				89	After mounting tracefs you will have access to the control and output files
				90	of ftrace. Here is a list of some of the key files:
				91
				92
				93	Note: all time values are in microseconds.
				94
				95	current_tracer:
				96
				97	This is used to set or display the current tracer
				98	that is configured.
				99
				100	available_tracers:
				101
				102	This holds the different types of tracers that
				103	have been compiled into the kernel. The
				104	tracers listed here can be configured by
				105	echoing their name into current_tracer.
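
Because only the tracers listed in available_tracers can be selected, a
script may want to check the list before writing to current_tracer. A
minimal sketch, assuming a ``TRACEFS`` variable and a ``set_tracer`` helper
that are invented for this example:

```shell
# TRACEFS is a variable introduced for this sketch, not part of ftrace.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Select the tracer named in $1, but only if available_tracers lists it.
set_tracer() {
    for t in $(cat "$TRACEFS/available_tracers"); do
        if [ "$t" = "$1" ]; then
            echo "$1" > "$TRACEFS/current_tracer"
            return 0
        fi
    done
    echo "tracer '$1' is not listed in available_tracers" >&2
    return 1
}
```

With this in place, ``set_tracer function`` behaves like echoing
``function`` into current_tracer, while an unknown name leaves the current
tracer untouched.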
				106
				107	tracing_on:
				108
				109	This sets or displays whether writing to the trace
				110	ring buffer is enabled. Echo 0 into this file to disable
				111	the tracer or 1 to enable it. Note, this only disables
				112	writing to the ring buffer, the tracing overhead may
				113	still be occurring.
				114
				115	The kernel function tracing_off() can be used within the
				116	kernel to disable writing to the ring buffer, which will
				117	set this file to "0". User space can re-enable tracing by
				118	echoing "1" into the file.
				119
Note, the function and event trigger "traceoff" will also
set this file to zero and stop tracing. Tracing can then
be re-enabled by user space using this file.
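
A common pattern is to enable ring buffer writes only around the command
of interest, so that the buffer is not overwritten afterwards. The sketch
below assumes a ``TRACEFS`` variable and a ``trace_window`` helper, both
hypothetical names chosen for this example:

```shell
# TRACEFS is a variable introduced for this sketch, not part of ftrace.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Run the given command with tracing_on set to 1, then set it back to 0.
# The exit status of the command is passed through to the caller.
trace_window() {
    echo 1 > "$TRACEFS/tracing_on"
    status=0
    "$@" || status=$?
    echo 0 > "$TRACEFS/tracing_on"
    return "$status"
}
```

For example, ``trace_window ls /tmp`` records only while ``ls`` runs. As
noted above, this stops writes to the ring buffer; it does not remove the
tracing overhead itself.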
				123
				124	trace:
				125
				126	This file holds the output of the trace in a human
				127	readable format (described below). Note, tracing is temporarily
				128	disabled while this file is being read (opened).
				129
				130	trace_pipe:
				131
				132	The output is the same as the "trace" file but this
				133	file is meant to be streamed with live tracing.
				134	Reads from this file will block until new data is
				135	retrieved. Unlike the "trace" file, this file is a
				136	consumer. This means reading from this file causes
				137	sequential reads to display more current data. Once
				138	data is read from this file, it is consumed, and
				139	will not be read again with a sequential read. The
				140	"trace" file is static, and if the tracer is not
				141	adding more data, it will display the same
				142	information every time it is read. This file will not
				143	disable tracing while being read.
				144
				145	trace_options:
				146
				147	This file lets the user control the amount of data
				148	that is displayed in one of the above output
				149	files. Options also exist to modify how a tracer
				150	or events work (stack traces, timestamps, etc).
				151
				152	options:
				153
				154	This is a directory that has a file for every available
				155	trace option (also in trace_options). Options may also be set
				156	or cleared by writing a "1" or "0" respectively into the
				157	corresponding file with the option name.
				158
				159	tracing_max_latency:
				160
				161	Some of the tracers record the max latency.
				162	For example, the maximum time that interrupts are disabled.
				163	The maximum time is saved in this file. The max trace will also be
				164	stored, and displayed by "trace". A new max trace will only be
				165	recorded if the latency is greater than the value in this file
				166	(in microseconds).
				167
If a time is echoed into this file, no latency will be recorded
unless it is greater than the time in this file.
				170
				171	tracing_thresh:
				172
				173	Some latency tracers will record a trace whenever the
				174	latency is greater than the number in this file.
				175	Only active when the file contains a number greater than 0.
				176	(in microseconds)
				177
				178	buffer_size_kb:
				179
				180	This sets or displays the number of kilobytes each CPU
				181	buffer holds. By default, the trace buffers are the same size
				182	for each CPU. The displayed number is the size of the
				183	CPU buffer and not total size of all buffers. The
				184	trace buffers are allocated in pages (blocks of memory
				185	that the kernel uses for allocation, usually 4 KB in size).
				186	If the last page allocated has room for more bytes
				187	than requested, the rest of the page will be used,
				188	making the actual allocation bigger than requested or shown.
				189	( Note, the size may not be a multiple of the page size
				190	due to buffer management meta-data. )
				191
				192	Buffer sizes for individual CPUs may vary
				193	(see "per_cpu/cpu0/buffer_size_kb" below), and if they do
				194	this file will show "X".
				195
				196	buffer_total_size_kb:
				197
				198	This displays the total combined size of all the trace buffers.
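
When the per-CPU buffers differ in size, the individual values can be
added up by hand. The following sketch sums the per-CPU files described
later in this list; ``TRACEFS`` and ``sum_cpu_buffers_kb`` are names
invented for the example. Because the actual allocation may be slightly
larger than the displayed size, the result is not guaranteed to match
buffer_total_size_kb exactly.

```shell
# TRACEFS is a variable introduced for this sketch, not part of ftrace.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Add up per_cpu/cpuN/buffer_size_kb over all CPUs and print the total.
sum_cpu_buffers_kb() {
    total=0
    for f in "$TRACEFS"/per_cpu/cpu*/buffer_size_kb; do
        [ -r "$f" ] || continue
        # Only the first field is used; anything after it is ignored.
        read -r kb rest < "$f" || :
        total=$((total + ${kb:-0}))
    done
    echo "$total"
}
```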
				199
				200	free_buffer:
				201
If a process is performing tracing, and the ring buffer should be
shrunk ("freed") when the process is finished, even if it were to be
killed by a signal, this file can be used for that purpose. On close
of this file, the ring buffer will be resized to its minimum size.
If the tracing process also holds this file open, then when the process
exits, its file descriptor for this file will be closed, and in doing so,
the ring buffer will be "freed".
				209
				210	It may also stop tracing if disable_on_free option is set.
				211
				212	tracing_cpumask:
				213
				214	This is a mask that lets the user only trace on specified CPUs.
				215	The format is a hex string representing the CPUs.
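
The hex string has one bit per CPU, with bit N standing for CPU N. As a
small sketch of that encoding, the hypothetical helper ``cpus_to_mask``
below turns a list of CPU numbers into the corresponding hex value. It
relies on shell arithmetic, so it is only meant for CPU numbers below 63
on a shell with 64-bit integers.

```shell
# Build a hex CPU mask from a list of CPU numbers (bit N = CPU N).
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))
    done
    printf '%x\n' "$mask"
}
```

For example, ``cpus_to_mask 0 2 3 > tracing_cpumask`` would restrict
tracing to CPUs 0, 2 and 3.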
				216
				217	set_ftrace_filter:
				218
				219	When dynamic ftrace is configured in (see the
				220	section below "dynamic ftrace"), the code is dynamically
				221	modified (code text rewrite) to disable calling of the
				222	function profiler (mcount). This lets tracing be configured
				223	in with practically no overhead in performance. This also
				224	has a side effect of enabling or disabling specific functions
				225	to be traced. Echoing names of functions into this file
				226	will limit the trace to only those functions.
				227
				228	The functions listed in "available_filter_functions" are what
				229	can be written into this file.
				230
				231	This interface also allows for commands to be used. See the
				232	"Filter commands" section for more details.
				233
				234	set_ftrace_notrace:
				235
				236	This has an effect opposite to that of
				237	set_ftrace_filter. Any function that is added here will not
				238	be traced. If a function exists in both set_ftrace_filter
				239	and set_ftrace_notrace, the function will _not_ be traced.
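
The precedence rule above can be modelled in a few lines. This is only a
sketch of the documented rule, not of how the kernel evaluates it:
``TRACEFS`` and ``would_trace`` are hypothetical names, only plain function
names are considered (filter commands are ignored), and lines beginning
with ``#`` are assumed to be placeholders rather than function names.

```shell
# TRACEFS is a variable introduced for this sketch, not part of ftrace.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Succeed if the function named in $1 would be traced: it must pass
# set_ftrace_filter (an empty filter passes everything) and must not be
# listed in set_ftrace_notrace.
would_trace() {
    awk -v fn="$1" '
        /^#/ || NF == 0     { next }   # skip placeholder and blank lines
        FILENAME == ARGV[1] { nfilter++; if ($1 == fn) infilter = 1; next }
        $1 == fn            { innotrace = 1 }
        END { exit !((nfilter == 0 || infilter) && !innotrace) }
    ' "$TRACEFS/set_ftrace_filter" "$TRACEFS/set_ftrace_notrace"
}
```

A function present in both files makes ``would_trace`` fail, which matches
the rule that set_ftrace_notrace wins.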
				240
				241	set_ftrace_pid:
				242
Have the function tracer only trace the threads whose PIDs are
listed in this file.
				245
				246	If the "function-fork" option is set, then when a task whose
				247	PID is listed in this file forks, the child's PID will
				248	automatically be added to this file, and the child will be
				249	traced by the function tracer as well. This option will also
				250	cause PIDs of tasks that exit to be removed from the file.
				251
				252	set_event_pid:
				253
Have the events only trace a task with a PID listed in this file.
Note, sched_switch and sched_wakeup will also trace events
listed in this file.
				257
				258	To have the PIDs of children of tasks with their PID in this file
				259	added on fork, enable the "event-fork" option. That option will also
				260	cause the PIDs of tasks to be removed from this file when the task
				261	exits.
				262
				263	set_graph_function:
				264
				265	Functions listed in this file will cause the function graph
				266	tracer to only trace these functions and the functions that
				267	they call. (See the section "dynamic ftrace" for more details).
				268
				269	set_graph_notrace:
				270
				271	Similar to set_graph_function, but will disable function graph
				272	tracing when the function is hit until it exits the function.
				273	This makes it possible to ignore tracing functions that are called
				274	by a specific function.
				275
				276	available_filter_functions:
				277
				278	This lists the functions that ftrace has processed and can trace.
				279	These are the function names that you can pass to
				280	"set_ftrace_filter" or "set_ftrace_notrace".
				281	(See the section "dynamic ftrace" below for more details.)
				282
				283	dyn_ftrace_total_info:
				284
This file is for debugging purposes. It shows the number of functions
that have been converted to nops and are available to be traced.
				287
				288	enabled_functions:
				289
				290	This file is more for debugging ftrace, but can also be useful
				291	in seeing if any function has a callback attached to it.
				292	Not only does the trace infrastructure use ftrace function
				293	trace utility, but other subsystems might too. This file
				294	displays all functions that have a callback attached to them
				295	as well as the number of callbacks that have been attached.
				296	Note, a callback may also call multiple functions which will
				297	not be listed in this count.
				298
If a callback attached to a function was registered with
the "save regs" attribute (thus even more overhead), an 'R'
will be displayed on the same line as the function that
is returning registers.

If a callback attached to a function was registered with
the "ip modify" attribute (thus the regs->ip can be changed),
an 'I' will be displayed on the same line as the function that
can be overridden.
				308
				309	If the architecture supports it, it will also show what callback
				310	is being directly called by the function. If the count is greater
				311	than 1 it most likely will be ftrace_ops_list_func().
				312
If the callback of the function jumps to a trampoline that is
specific to the callback and not the standard trampoline,
its address will be printed as well as the function that the
trampoline calls.
				317
				318	function_profile_enabled:
				319
				320	When set it will enable all functions with either the function
				321	tracer, or if configured, the function graph tracer. It will
				322	keep a histogram of the number of functions that were called
				323	and if the function graph tracer was configured, it will also keep
				324	track of the time spent in those functions. The histogram
				325	content can be displayed in the files:
				326
trace_stat/function<cpu> (function0, function1, etc.).

trace_stat:

A directory that holds different tracing stats.
				332
				333	kprobe_events:
				334
				335	Enable dynamic trace points. See kprobetrace.txt.
				336
				337	kprobe_profile:
				338
				339	Dynamic trace points stats. See kprobetrace.txt.
				340
				341	max_graph_depth:
				342
				343	Used with the function graph tracer. This is the max depth
				344	it will trace into a function. Setting this to a value of
				345	one will show only the first kernel function that is called
				346	from user space.
				347
				348	printk_formats:
				349
				350	This is for tools that read the raw format files. If an event in
				351	the ring buffer references a string, only a pointer to the string
				352	is recorded into the buffer and not the string itself. This prevents
				353	tools from knowing what that string was. This file displays the string
				354	and address for the string allowing tools to map the pointers to what
				355	the strings were.
				356
				357	saved_cmdlines:
				358
				359	Only the pid of the task is recorded in a trace event unless
				360	the event specifically saves the task comm as well. Ftrace
				361	makes a cache of pid mappings to comms to try to display
				362	comms for events. If a pid for a comm is not listed, then
				363	"<...>" is displayed in the output.
				364
				365	If the option "record-cmd" is set to "0", then comms of tasks
				366	will not be saved during recording. By default, it is enabled.
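
The lookup that produces either a comm or "<...>" can be sketched as
follows. The example assumes that each line of saved_cmdlines has the form
``<pid> <comm>``; ``TRACEFS`` and ``comm_for_pid`` are hypothetical names
used only here.

```shell
# TRACEFS is a variable introduced for this sketch, not part of ftrace.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Print the cached comm for the PID in $1, or "<...>" if it is not cached.
comm_for_pid() {
    awk -v pid="$1" '
        $1 == pid { sub(/^[0-9]+ /, ""); print; found = 1; exit }
        END { if (!found) print "<...>" }
    ' "$TRACEFS/saved_cmdlines"
}
```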
				367
				368	saved_cmdlines_size:
				369
By default, 128 comms are saved (see "saved_cmdlines" above). To
increase or decrease the number of comms that are cached, echo
the number of comms to cache into this file.
				373
				374	saved_tgids:
				375
				376	If the option "record-tgid" is set, on each scheduling context switch
				377	the Task Group ID of a task is saved in a table mapping the PID of
				378	the thread to its TGID. By default, the "record-tgid" option is
				379	disabled.
				380
				381	snapshot:
				382
				383	This displays the "snapshot" buffer and also lets the user
				384	take a snapshot of the current running trace.
				385	See the "Snapshot" section below for more details.
				386
				387	stack_max_size:
				388
				389	When the stack tracer is activated, this will display the
				390	maximum stack size it has encountered.
				391	See the "Stack Trace" section below.
				392
				393	stack_trace:
				394
				395	This displays the stack back trace of the largest stack
				396	that was encountered when the stack tracer is activated.
				397	See the "Stack Trace" section below.
				398
				399	stack_trace_filter:
				400
				401	This is similar to "set_ftrace_filter" but it limits what
				402	functions the stack tracer will check.
				403
				404	trace_clock:
				405
				406	Whenever an event is recorded into the ring buffer, a
				407	"timestamp" is added. This stamp comes from a specified
				408	clock. By default, ftrace uses the "local" clock. This
				409	clock is very fast and strictly per cpu, but on some
				410	systems it may not be monotonic with respect to other
				411	CPUs. In other words, the local clocks may not be in sync
				412	with local clocks on other CPUs.
				413
				414	Usual clocks for tracing::
				415
				416	# cat trace_clock
				417	[local] global counter x86-tsc
				418
				419	The clock with the square brackets around it is the one in effect.
				420
				421	local:
				422	Default clock, but may not be in sync across CPUs
				423
				424	global:
				425	This clock is in sync with all CPUs but may
				426	be a bit slower than the local clock.
				427
				428	counter:
				429	This is not a clock at all, but literally an atomic
				430	counter. It counts up one by one, but is in sync
				431	with all CPUs. This is useful when you need to
				432	know exactly the order events occurred with respect to
				433	each other on different CPUs.
				434
				435	uptime:
				436	This uses the jiffies counter and the time stamp
				437	is relative to the time since boot up.
				438
				439	perf:
				440	This makes ftrace use the same clock that perf uses.
				441	Eventually perf will be able to read ftrace buffers
				442	and this will help out in interleaving the data.
				443
				444	x86-tsc:
				445	Architectures may define their own clocks. For
				446	example, x86 uses its own TSC cycle clock here.
				447
				448	ppc-tb:
				449	This uses the powerpc timebase register value.
				450	This is in sync across CPUs and can also be used
				451	to correlate events across hypervisor/guest if
				452	tb_offset is known.
				453
				454	mono:
				455	This uses the fast monotonic clock (CLOCK_MONOTONIC)
				456	which is monotonic and is subject to NTP rate adjustments.
				457
				458	mono_raw:
This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
which is monotonic but is not subject to any rate adjustments
and ticks at the same rate as the hardware clocksource.
				462
				463	boot:
Same as mono. Used to be a separate clock which accounted
for the time spent in suspend while CLOCK_MONOTONIC did
not.

To set a clock, simply echo the clock name into this file::

  # echo global > trace_clock

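Scripts that need to know which clock is in effect can pick out the
bracketed entry. A minimal sketch, assuming a ``TRACEFS`` variable and a
``current_clock`` helper that are invented for this example:

```shell
# TRACEFS is a variable introduced for this sketch, not part of ftrace.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Print the name of the clock that trace_clock shows in square brackets.
current_clock() {
    tr ' ' '\n' < "$TRACEFS/trace_clock" | sed -n 's/^\[\(.*\)\]$/\1/p'
}
```

With the contents shown above, ``current_clock`` prints ``local``.
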
				472	trace_marker:
				473
This is a very useful file for synchronizing user space
with events happening in the kernel. Strings written into
this file are recorded in the ftrace buffer.
				477
				478	It is useful in applications to open this file at the start
				479	of the application and just reference the file descriptor
				480	for the file::
				481
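    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <unistd.h>

    /* File descriptor for trace_marker; negative until it is opened. */
    static int trace_fd = -1;

    void trace_write(const char *fmt, ...)
    {
        va_list ap;
        char buf[256];
        int n;

        if (trace_fd < 0)
            return;

        va_start(ap, fmt);
        n = vsnprintf(buf, sizeof(buf), fmt, ap);
        va_end(ap);

        /* vsnprintf() returns the untruncated length; clamp it. */
        if (n < 0)
            return;
        if (n >= (int)sizeof(buf))
            n = sizeof(buf) - 1;

        write(trace_fd, buf, n);
    }

start::

    trace_fd = open("trace_marker", O_WRONLY);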
				501
				502	trace_marker_raw:
				503
This is similar to trace_marker above, but is meant for binary data
to be written to it, where a tool can be used to parse the data
from trace_pipe_raw.
				507
				508	uprobe_events:
				509
				510	Add dynamic tracepoints in programs.
				511	See uprobetracer.txt
				512
				513	uprobe_profile:
				514
Uprobe statistics. See uprobetracer.txt
				516
				517	instances:
				518
				519	This is a way to make multiple trace buffers where different
				520	events can be recorded in different buffers.
				521	See "Instances" section below.
				522
				523	events:
				524
				525	This is the trace event directory. It holds event tracepoints
				526	(also known as static tracepoints) that have been compiled
				527	into the kernel. It shows what event tracepoints exist
				528	and how they are grouped by system. There are "enable"
				529	files at various levels that can enable the tracepoints
				530	when a "1" is written to them.
				531
				532	See events.txt for more information.
				533
				534	set_event:
				535
Echoing an event name into this file will enable that event.
				537
				538	See events.txt for more information.
				539
				540	available_events:
				541
				542	A list of events that can be enabled in tracing.
				543
				544	See events.txt for more information.
				545
timestamp_mode:
				547
				548	Certain tracers may change the timestamp mode used when
				549	logging trace events into the event buffer. Events with
				550	different modes can coexist within a buffer but the mode in
				551	effect when an event is logged determines which timestamp mode
				552	is used for that event. The default timestamp mode is
				553	'delta'.
				554
Usual timestamp modes for tracing::

  # cat timestamp_mode
  [delta] absolute
				559
				560	The timestamp mode with the square brackets around it is the
				561	one in effect.
				562
				563	delta: Default timestamp mode - timestamp is a delta against
				564	a per-buffer timestamp.
				565
				566	absolute: The timestamp is a full timestamp, not a delta
				567	against some other value. As such it takes up more
				568	space and is less efficient.
				569
hwlat_detector:
				571
				572	Directory for the Hardware Latency Detector.
				573	See "Hardware Latency Detector" section below.
				574
				575	per_cpu:
				576
				577	This is a directory that contains the trace per_cpu information.
				578
				579	per_cpu/cpu0/buffer_size_kb:
				580
The ftrace buffer is defined per_cpu. That is, there's a separate
buffer for each CPU to allow writes to be done atomically,
and free from cache bouncing. These buffers may have different
sizes. This file is similar to the buffer_size_kb
file, but it only displays or sets the buffer size for the
specific CPU (here cpu0).
				587
				588	per_cpu/cpu0/trace:
				589
				590	This is similar to the "trace" file, but it will only display
				591	the data specific for the CPU. If written to, it only clears
				592	the specific CPU buffer.
				593
per_cpu/cpu0/trace_pipe:
				595
				596	This is similar to the "trace_pipe" file, and is a consuming
				597	read, but it will only display (and consume) the data specific
				598	for the CPU.
				599
per_cpu/cpu0/trace_pipe_raw:
				601
				602	For tools that can parse the ftrace ring buffer binary format,
				603	the trace_pipe_raw file can be used to extract the data
				604	from the ring buffer directly. With the use of the splice()
				605	system call, the buffer data can be quickly transferred to
				606	a file or to the network where a server is collecting the
				607	data.
				608
				609	Like trace_pipe, this is a consuming reader, where multiple
				610	reads will always produce different data.
				611
				612	per_cpu/cpu0/snapshot:
				613
				614	This is similar to the main "snapshot" file, but will only
				615	snapshot the current CPU (if supported). It only displays
				616	the content of the snapshot for a given CPU, and if
				617	written to, only clears this CPU buffer.
				618
				619	per_cpu/cpu0/snapshot_raw:
				620
				621	Similar to the trace_pipe_raw, but will read the binary format
				622	from the snapshot buffer for the given CPU.
				623
				624	per_cpu/cpu0/stats:
				625
				626	This displays certain stats about the ring buffer:
				627
				628	entries:
				629	The number of events that are still in the buffer.
				630
				631	overrun:
				632	The number of lost events due to overwriting when
				633	the buffer was full.
				634
				635	commit overrun:
				636	Should always be zero.
				637	This gets set if so many events happened within a nested
				638	event (ring buffer is re-entrant), that it fills the
				639	buffer and starts dropping events.
				640
				641	bytes:
				642	Bytes actually read (not overwritten).
				643
				644	oldest event ts:
				645	The oldest timestamp in the buffer
				646
				647	now ts:
				648	The current timestamp
				649
				650	dropped events:
				651	Events lost due to overwrite option being off.
				652
				653	read events:
				654	The number of events read.
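
Since these counters are kept per CPU, a system-wide figure has to be
added up across the stats files. The sketch below does this for the
"overrun" counter, assuming the ``name: value`` layout listed above;
``TRACEFS`` and ``total_overrun`` are names made up for the example.

```shell
# TRACEFS is a variable introduced for this sketch, not part of ftrace.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Sum the "overrun" counter over every per_cpu/cpuN/stats file.
# "commit overrun" is a different counter and is deliberately not matched.
total_overrun() {
    cat "$TRACEFS"/per_cpu/cpu*/stats |
        awk -F': *' '$1 == "overrun" { sum += $2 } END { print sum + 0 }'
}
```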
				655
				656	The Tracers
				657	-----------
				658
				659	Here is the list of current tracers that may be configured.
				660
				661	"function"
				662
				663	Function call tracer to trace all kernel functions.
				664
				665	"function_graph"
				666
				667	Similar to the function tracer except that the
				668	function tracer probes the functions on their entry
				669	whereas the function graph tracer traces on both entry
				670	and exit of the functions. It then provides the ability
				671	to draw a graph of function calls similar to C code
				672	source.
				673
				674	"blk"
				675
				676	The block tracer. The tracer used by the blktrace user
				677	application.
				678
				679	"hwlat"
				680
				681	The Hardware Latency tracer is used to detect if the hardware
				682	produces any latency. See "Hardware Latency Detector" section
				683	below.
				684
				685	"irqsoff"
				686
				687	Traces the areas that disable interrupts and saves
				688	the trace with the longest max latency.
				689	See tracing_max_latency. When a new max is recorded,
				690	it replaces the old trace. It is best to view this
				691	trace with the latency-format option enabled, which
				692	happens automatically when the tracer is selected.
				693
				694	"preemptoff"
				695
				696	Similar to irqsoff but traces and records the amount of
				697	time for which preemption is disabled.
				698
				699	"preemptirqsoff"
				700
				701	Similar to irqsoff and preemptoff, but traces and
				702	records the largest time for which irqs and/or preemption
				703	is disabled.
				704
				705	"wakeup"
				706
				707	Traces and records the max latency that it takes for
				708	the highest priority task to get scheduled after
				709	it has been woken up.
				710	Traces all tasks as an average developer would expect.
				711
				712	"wakeup_rt"
				713
				714	Traces and records the max latency that it takes for just
				715	RT tasks (as the current "wakeup" does). This is useful
				716	for those interested in wake up timings of RT tasks.
				717
				718	"wakeup_dl"
				719
				720	Traces and records the max latency that it takes for
				721	a SCHED_DEADLINE task to be woken (as the "wakeup" and
				722	"wakeup_rt" does).
				723
				724	"mmiotrace"
				725
A special tracer that is used to trace binary modules.
It will trace all the calls that a module makes to the
hardware, as well as everything it writes to and reads
from I/O.
				730
				731	"branch"
				732
This tracer can be configured when tracing likely/unlikely
calls within the kernel. It will trace when a likely or
unlikely branch is hit and whether its prediction
was correct.
				737
				738	"nop"
				739
				740	This is the "trace nothing" tracer. To remove all
				741	tracers from tracing simply echo "nop" into
				742	current_tracer.
				743
				744
				745	Examples of using the tracer
				746	----------------------------
				747
				748	Here are typical examples of using the tracers when controlling
				749	them only with the tracefs interface (without using any
				750	user-land utilities).
				751
				752	Output format:
				753	--------------
				754
				755	Here is an example of the output format of the file "trace"::
				756
  # tracer: function
  #
  # entries-in-buffer/entries-written: 140080/250280   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
              bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
              bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
              bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
              sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
              bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
              bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
              bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
              bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
              bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
              sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
              ....
				779
				780	A header is printed with the tracer name that is represented by
				781	the trace. In this case the tracer is "function". Then it shows the
				782	number of events in the buffer as well as the total number of entries
				783	that were written. The difference is the number of entries that were
				784	lost due to the buffer filling up (250280 - 140080 = 110200 events
				785	lost).
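
The same subtraction can be scripted against the header line. In this
sketch, ``lost_entries`` is a hypothetical helper that reads the contents
of the "trace" file on standard input:

```shell
# Print entries-written minus entries-in-buffer from the trace header.
lost_entries() {
    awk '/entries-in-buffer\/entries-written:/ {
        split($3, n, "/")      # $3 looks like "140080/250280"
        print n[2] - n[1]
        exit
    }'
}
```

For the header shown above, ``lost_entries < trace`` prints ``110200``.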
				786
				787	The header explains the content of the events. Task name "bash", the task
				788	PID "1977", the CPU that it was running on "000", the latency format
				789	(explained below), the timestamp in <secs>.<usecs> format, the
				790	function name that was traced "sys_close" and the parent function that
				791	called this function "system_call_fastpath". The timestamp is the time
				792	at which the function was entered.
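
To post-process this format, the fields described above can be split out
with a single pattern. The helper name ``parse_trace_line`` is made up for
this sketch, which only handles function tracer lines of the form shown
in the example; header lines and other event formats are silently
skipped.

```shell
# Read function tracer output on stdin and print, for every matching line:
#   task pid cpu flags timestamp function parent
parse_trace_line() {
    sed -n 's/^ *\(.*\)-\([0-9][0-9]*\) *\[\([0-9]*\)] *\([^ ]*\) *\([0-9.]*\): *\([^ ]*\) *<-\(.*\)$/\1 \2 \3 \4 \5 \6 \7/p'
}
```

Applied to the first line of the example, the output is
``bash 1977 000 .... 17284.993652 sys_close system_call_fastpath``.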
				793
				794	Latency trace format
				795	--------------------
				796
				797	When the latency-format option is enabled or when one of the latency
				798	tracers is set, the trace file gives somewhat more information to see
				799	why a latency happened. Here is a typical trace::
				800
				801	# tracer: irqsoff
				802	#
				803	# irqsoff latency trace v1.1.5 on 3.8.0-test+
				804	# --------------------------------------------------------------------
				805	# latency: 259 us, #4/4, CPU#2 \| (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
				806	# -----------------
				807	# \| task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
				808	# -----------------
				809	# => started at: __lock_task_sighand
				810	# => ended at: _raw_spin_unlock_irqrestore
				811	#
				812	#
				813	# _------=> CPU#
				814	# / _-----=> irqs-off
				815	# | / _----=> need-resched
				816	# || / _---=> hardirq/softirq
				817	# ||| / _--=> preempt-depth
				818	# |||| / delay
				819	# cmd pid ||||| time | caller
				820	# \ / ||||| \ | /
				821	ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand
				822	ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
				823	ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
				824	ps-6143 2d..1 306us : <stack trace>
				825	=> trace_hardirqs_on_caller
				826	=> trace_hardirqs_on
				827	=> _raw_spin_unlock_irqrestore
				828	=> do_task_stat
				829	=> proc_tgid_stat
				830	=> proc_single_show
				831	=> seq_read
				832	=> vfs_read
				833	=> sys_read
				834	=> system_call_fastpath
				835
				836
				837	This shows that the current tracer is "irqsoff", tracing the time
				838	for which interrupts were disabled. It gives the trace version (which
				839	never changes) and the version of the kernel on which this was executed
				840	(3.8). Then it displays the max latency in microseconds (259 us), followed
				841	by the number of trace entries displayed and the total number (both are
				842	four: #4/4). VP, KP, SP, and HP are always zero and are reserved for later
				843	use. #P is the number of online CPUs (#P:4).
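
The summary line can also be picked apart mechanically. This is a sketch
only: it assumes the "# latency:" line is spaced exactly as in the sample
header above, and latency_summary is a hypothetical helper name.

```shell
# latency_summary: read latency-format trace output on stdin and print
# "<max latency in us> <entries shown> <entries total>".
# Hypothetical helper; it assumes the "# latency:" line is laid out
# exactly as in the sample header in this document.
latency_summary() {
    sed -n 's/^# latency: \([0-9][0-9]*\) us, #\([0-9][0-9]*\)\/\([0-9][0-9]*\),.*/\1 \2 \3/p' |
        head -n 1
}
```

Fed the header shown above, it prints the latency first, then the two
entry counts.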
				844
				845	The task is the process that was running when the latency
				846	occurred. (ps pid: 6143).
				847
				848	The header then gives the start and stop points of the latency (the
				849	functions in which interrupts were disabled and enabled, respectively):
				850
				851	- __lock_task_sighand is where the interrupts were disabled.
				852	- _raw_spin_unlock_irqrestore is where they were enabled again.
				853
				854	The next lines after the header are the trace itself. The header
				855	explains which is which.
				856
				857	cmd: The name of the process in the trace.
				858
				859	pid: The PID of that process.
				860
				861	CPU#: The CPU which the process was running on.
				862
				863	irqs-off: 'd' interrupts are disabled. '.' otherwise.
				864	.. caution:: If the architecture does not support a way to
				865	read the irq flags variable, an 'X' will always
				866	be printed here.
				867
				868	need-resched:
				869	- 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
				870	- 'n' only TIF_NEED_RESCHED is set,
				871	- 'p' only PREEMPT_NEED_RESCHED is set,
				872	- '.' otherwise.
				873
				874	hardirq/softirq:
				875	- 'Z' - NMI occurred inside a hardirq
				876	- 'z' - NMI is running
				877	- 'H' - hard irq occurred inside a softirq.
				878	- 'h' - hard irq is running
				879	- 's' - soft irq is running
				880	- '.' - normal context.
				881
				882	preempt-depth: The level of preempt_disabled
				883
				884	The above is mostly meaningful for kernel developers.
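
For readers who prefer words to flag characters, the four columns can be
decoded with a small helper. This is a sketch only: decode_latency_flags
is a hypothetical name, it expects just the four flag characters (without
the leading CPU number), and it treats a '.' in the preempt-depth column
as depth 0, which is an assumption drawn from the sample traces.

```shell
# decode_latency_flags FLAGS
# FLAGS is the four-character field: irqs-off, need-resched,
# hardirq/softirq, preempt-depth (for example "dNs3").
# Hypothetical helper; '.' in the last column is assumed to mean 0.
decode_latency_flags() {
    flags=$1
    irq=$(printf '%s' "$flags" | cut -c1)
    resched=$(printf '%s' "$flags" | cut -c2)
    ctx=$(printf '%s' "$flags" | cut -c3)
    depth=$(printf '%s' "$flags" | cut -c4)

    case $irq in
        d) irq_s="irqs off" ;;
        X) irq_s="irq state unknown" ;;
        *) irq_s="irqs on" ;;
    esac
    case $resched in
        N) res_s="TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED set" ;;
        n) res_s="TIF_NEED_RESCHED set" ;;
        p) res_s="PREEMPT_NEED_RESCHED set" ;;
        *) res_s="no resched needed" ;;
    esac
    case $ctx in
        Z) ctx_s="NMI inside hardirq" ;;
        z) ctx_s="NMI" ;;
        H) ctx_s="hardirq inside softirq" ;;
        h) ctx_s="hardirq" ;;
        s) ctx_s="softirq" ;;
        *) ctx_s="normal context" ;;
    esac
    case $depth in
        .) depth=0 ;;
    esac
    printf '%s; %s; %s; preempt-depth %s\n' "$irq_s" "$res_s" "$ctx_s" "$depth"
}
```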
				885
				886	time:
				887	When the latency-format option is enabled, the trace file
				888	output includes a timestamp relative to the start of the
				889	trace. This differs from the output when latency-format
				890	is disabled, which includes an absolute timestamp.
				891
				892	delay:
				893	This is just to help catch your eye a bit better. It still
				894	needs to be fixed to be relative only to the same CPU.
				895	The marks are determined by the difference between the
				896	timestamps of the current trace entry and the next one.
				897
				898	- '$' - greater than 1 second
				899	- '@' - greater than 100 milliseconds
				900	- '*' - greater than 10 milliseconds
				901	- '#' - greater than 1000 microseconds
				902	- '!' - greater than 100 microseconds
				903	- '+' - greater than 10 microseconds
				904	- ' ' - less than or equal to 10 microseconds.
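
The thresholds in the list above amount to a simple cascade of
comparisons. The sketch below restates them for a delta given in whole
microseconds; delay_mark is a hypothetical name and this is not the
kernel's own code.

```shell
# delay_mark USECS
# Print the delay marker for a gap of USECS microseconds between one
# trace entry and the next, following the thresholds listed above.
# Hypothetical helper, not taken from the kernel sources.
delay_mark() {
    us=$1
    if   [ "$us" -gt 1000000 ]; then printf '%s' '$'   # > 1 second
    elif [ "$us" -gt 100000 ];  then printf '%s' '@'   # > 100 milliseconds
    elif [ "$us" -gt 10000 ];   then printf '%s' '*'   # > 10 milliseconds
    elif [ "$us" -gt 1000 ];    then printf '%s' '#'   # > 1000 microseconds
    elif [ "$us" -gt 100 ];     then printf '%s' '!'   # > 100 microseconds
    elif [ "$us" -gt 10 ];      then printf '%s' '+'   # > 10 microseconds
    else                             printf '%s' ' '
    fi
}
```

For instance, an entry at 0us followed by one at 259us has a delta above
100 microseconds and would be marked with '!'.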
				905
				906	The rest is the same as the 'trace' file.
				907
				908	Note, the latency tracers will usually end with a back trace
				909	to easily find where the latency occurred.
				910
				911	trace_options
				912	-------------
				913
				914	The trace_options file (or the options directory) is used to control
				915	what gets printed in the trace output, or manipulate the tracers.
				916	To see what is available, simply cat the file::
				917
				918	cat trace_options
				919	print-parent
				920	nosym-offset
				921	nosym-addr
				922	noverbose
				923	noraw
				924	nohex
				925	nobin
				926	noblock
				927	trace_printk
				928	annotate
				929	nouserstacktrace
				930	nosym-userobj
				931	noprintk-msg-only
				932	context-info
				933	nolatency-format
				934	record-cmd
				935	norecord-tgid
				936	overwrite
				937	nodisable_on_free
				938	irq-info
				939	markers
				940	noevent-fork
				941	function-trace
				942	nofunction-fork
				943	nodisplay-graph
				944	nostacktrace
				945	nobranch
				946
				947	To disable one of the options, echo in the option prepended with
				948	"no"::
				949
				950	echo noprint-parent > trace_options
				951
				952	To enable an option, leave off the "no"::
				953
				954	echo sym-offset > trace_options
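
The "no" prefix convention can be wrapped in a small helper. This is a
sketch under stated assumptions: set_trace_option is a hypothetical name,
and the first argument is the directory that holds trace_options (the
tracefs mount point, or an instance directory beneath it).

```shell
# set_trace_option DIR OPTION on|off
# Enable or disable OPTION by writing "OPTION" or "noOPTION" into
# DIR/trace_options. Hypothetical helper for illustration only.
set_trace_option() {
    dir=$1 opt=$2 state=$3
    case $state in
        on)  echo "$opt"   > "$dir/trace_options" ;;
        off) echo "no$opt" > "$dir/trace_options" ;;
        *)   echo "usage: set_trace_option DIR OPTION on|off" >&2
             return 1 ;;
    esac
}
```

The same effect can be had through the options directory, by writing 1
or 0 into the file named after the option.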
				955
				956	Here are the available options:
				957
				958	print-parent
				959	On function traces, display the calling (parent)
				960	function as well as the function being traced.
				961	::
				962
				963	print-parent:
				964	bash-4000 [01] 1477.606694: simple_strtoul <-kstrtoul
				965
				966	noprint-parent:
				967	bash-4000 [01] 1477.606694: simple_strtoul
				968
				969
				970	sym-offset
				971	Display not only the function name, but also the
				972	offset in the function. For example, instead of
				973	seeing just "ktime_get", you will see
				974	"ktime_get+0xb/0x20".
				975	::
				976
				977	sym-offset:
				978	bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0
				979
				980	sym-addr
				981	This will also display the function address as well
				982	as the function name.
				983	::
				984
				985	sym-addr:
				986	bash-4000 [01] 1477.606694: simple_strtoul <c0339346>
				987
				988	verbose
				989	This deals with the trace file when the
				990	latency-format option is enabled.
				991	::
				992
				993	bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
				994	(+0.000ms): simple_strtoul (kstrtoul)
				995
				996	raw
				997	This will display raw numbers. This option is best for
				998	use with user applications that can translate the raw
				999	numbers better than having it done in the kernel.
				1000
				1001	hex
				1002	Similar to raw, but the numbers will be in a hexadecimal format.
				1003
				1004	bin
				1005	This will print out the formats in raw binary.
				1006
				1007	block
				1008	When set, reading trace_pipe will not block when polled.
				1009
				1010	trace_printk
				1011	Can disable trace_printk() from writing into the buffer.
				1012
				1013	annotate
				1014	It is sometimes confusing when the CPU buffers are full
				1015	and one CPU buffer had a lot of events recently, thus
				1016	a shorter time frame, where another CPU may have only had
				1017	a few events, which lets it have older events. When
				1018	the trace is reported, it shows the oldest events first,
				1019	and it may look like only one CPU ran (the one with the
				1020	oldest events). When the annotate option is set, it will
				1021	display when a new CPU buffer started::
				1022
				1023	<idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
				1024	<idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
				1025	<idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
				1026	##### CPU 2 buffer started ####
				1027	<idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
				1028	<idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
				1029	<idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
				1030
				1031	userstacktrace
				1032	This option changes the trace. It records a
				1033	stacktrace of the current user space thread after
				1034	each trace event.
				1035
				1036	sym-userobj
				1037	When user stacktraces are enabled, look up which
				1038	object the address belongs to, and print a
				1039	relative address. This is especially useful when
				1040	ASLR is on, otherwise you don't get a chance to
				1041	resolve the address to object/file/line after
				1042	the app is no longer running.
				1043
				1044	The lookup is performed when you read
				1045	trace or trace_pipe. Example::
				1046
				1047	a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0x494]
				1048	<- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
				1049
				1050
				1051	printk-msg-only
				1052	When set, trace_printk()s will only show the format
				1053	and not their parameters (if trace_bprintk() or
				1054	trace_bputs() was used to save the trace_printk()).
				1055
				1056	context-info
				1057	When disabled, show only the event data; the comm, PID,
				1058	timestamp, CPU, and other useful data are hidden.
				1059
				1060	latency-format
				1061	This option changes the trace output. When it is enabled,
				1062	the trace displays additional information about the
				1063	latency, as described in "Latency trace format".
				1064
				1065	record-cmd
				1066	When any event or tracer is enabled, a hook is enabled
				1067	in the sched_switch trace point to fill comm cache
				1068	with mapped pids and comms. But this may cause some
				1069	overhead, and if you only care about pids, and not the
				1070	name of the task, disabling this option can lower the
				1071	impact of tracing. See "saved_cmdlines".
				1072
				1073	record-tgid
				1074	When any event or tracer is enabled, a hook is enabled
				1075	in the sched_switch trace point to fill the cache of
				1076	mapped Thread Group IDs (TGID) mapping to pids. See
				1077	"saved_tgids".
				1078
				1079	overwrite
				1080	This controls what happens when the trace buffer is
				1081	full. If "1" (default), the oldest events are
				1082	discarded and overwritten. If "0", then the newest
				1083	events are discarded.
				1084	(see per_cpu/cpu0/stats for overrun and dropped)
				1085
				1086	disable_on_free
				1087	When the free_buffer is closed, tracing will
				1088	stop (tracing_on set to 0).
				1089
				1090	irq-info
				1091	Shows the interrupt, preempt count, need resched data.
				1092	When disabled, the trace looks like::
				1093
				1094	# tracer: function
				1095	#
				1096	# entries-in-buffer/entries-written: 144405/9452052 #P:4
				1097	#
				1098	# TASK-PID CPU# TIMESTAMP FUNCTION
				1099	# | | | | |
				1100	<idle>-0 [002] 23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
				1101	<idle>-0 [002] 23636.756054: activate_task <-ttwu_do_activate.constprop.89
				1102	<idle>-0 [002] 23636.756055: enqueue_task <-activate_task
				1103
				1104
				1105	markers
				1106	When set, the trace_marker is writable (only by root).
				1107	When disabled, the trace_marker will error with EINVAL
				1108	on write.
				1109
				1110	event-fork
				1111	When set, tasks with PIDs listed in set_event_pid will have
				1112	the PIDs of their children added to set_event_pid when those
				1113	tasks fork. Also, when tasks with PIDs in set_event_pid exit,
				1114	their PIDs will be removed from the file.
				1115
				1116	function-trace
				1117	The latency tracers will enable function tracing
				1118	if this option is enabled (which it is by default). When
				1119	it is disabled, the latency tracers do not trace
				1120	functions. This keeps the overhead of the tracer down
				1121	when performing latency tests.
				1122
				1123	function-fork
				1124	When set, tasks with PIDs listed in set_ftrace_pid will
				1125	have the PIDs of their children added to set_ftrace_pid
				1126	when those tasks fork. Also, when tasks with PIDs in
				1127	set_ftrace_pid exit, their PIDs will be removed from the
				1128	file.
				1129
				1130	display-graph
				1131	When set, the latency tracers (irqsoff, wakeup, etc) will
				1132	use function graph tracing instead of function tracing.
				1133
				1134	stacktrace
				1135	When set, a stack trace is recorded after any trace event
				1136	is recorded.
				1137
				1138	branch
				1139	Enable branch tracing with the tracer. This enables branch
				1140	tracer along with the currently set tracer. Enabling this
				1141	with the "nop" tracer is the same as just enabling the
				1142	"branch" tracer.
				1143
				1144	.. tip:: Some tracers have their own options. They only appear in this
				1145	file when the tracer is active. They always appear in the
				1146	options directory.
				1147
				1148
				1149	Here are the per tracer options:
				1150
				1151	Options for function tracer:
				1152
				1153	func_stack_trace
				1154	When set, a stack trace is recorded after every
				1155	function that is recorded. NOTE! Limit the functions
				1156	that are recorded before enabling this, with
				1157	"set_ftrace_filter" otherwise the system performance
				1158	will be critically degraded. Remember to disable
				1159	this option before clearing the function filter.
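
The ordering described above (filter first, option second, and the
reverse when finishing) can be captured in two helpers. This is a sketch
only: both function names are hypothetical, and the first argument is
assumed to be the tracefs directory.

```shell
# start_func_stack_trace DIR FUNCTION...
# Limit tracing to the named functions BEFORE turning on
# func_stack_trace. Hypothetical helper; DIR is the tracefs directory.
start_func_stack_trace() {
    dir=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "refusing to enable func_stack_trace without a filter" >&2
        return 1
    fi
    echo "$*" > "$dir/set_ftrace_filter"
    echo function > "$dir/current_tracer"
    echo 1 > "$dir/options/func_stack_trace"
}

# stop_func_stack_trace DIR
# Turn the option off BEFORE clearing the function filter.
stop_func_stack_trace() {
    dir=$1
    echo 0 > "$dir/options/func_stack_trace"
    echo > "$dir/set_ftrace_filter"
}
```

Refusing to run without at least one function name is a deliberate
guard against enabling stack traces for every traced function.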
				1160
				1161	Options for function_graph tracer:
				1162
				1163	Since the function_graph tracer has a slightly different output
				1164	it has its own options to control what is displayed.
				1165
				1166	funcgraph-overrun
				1167	When set, the "overrun" of the graph stack is
				1168	displayed after each function traced. The
				1169	overrun is when the stack depth of the calls
				1170	is greater than what is reserved for each task.
				1171	Each task has a fixed array of functions to
				1172	trace in the call graph. If the depth of the
				1173	calls exceeds that, the function is not traced.
				1174	The overrun is the number of functions missed
				1175	due to exceeding this array.
				1176
				1177	funcgraph-cpu
				1178	When set, the CPU number of the CPU where the trace
				1179	occurred is displayed.
				1180
				1181	funcgraph-overhead
				1182	When set, if the function takes longer than
				1183	a certain amount, then a delay marker is
				1184	displayed. See "delay" above, under the
				1185	header description.
				1186
				1187	funcgraph-proc
				1188	Unlike other tracers, the process' command line
				1189	is not displayed by default, but instead only
				1190	when a task is traced in and out during a context
				1191	switch. Enabling this option has the command
				1192	of each process displayed at every line.
				1193
				1194	funcgraph-duration
				1195	At the end of each function (the return)
				1196	the duration of the amount of time in the
				1197	function is displayed in microseconds.
				1198
				1199	funcgraph-abstime
				1200	When set, the timestamp is displayed at each line.
				1201
				1202	funcgraph-irqs
				1203	When disabled, functions that happen inside an
				1204	interrupt will not be traced.
				1205
				1206	funcgraph-tail
				1207	When set, the return event will include the function
				1208	that it represents. By default this is off, and
				1209	only a closing curly bracket "}" is displayed for
				1210	the return of a function.
				1211
				1212	sleep-time
				1213	When running the function graph tracer, include
				1214	the time a task is scheduled out in its function.
				1215	When enabled, the time the task has been scheduled
				1216	out is accounted as part of the function call.
				1217
				1218	graph-time
				1219	When running the function profiler with the function graph
				1220	tracer, include the time to call nested functions. When this is
				1221	not set, the time reported for the function will only
				1222	include the time the function itself executed for, not the
				1223	time for functions that it called.
				1224
				1225	Options for blk tracer:
				1226
				1227	blk_classic
				1228	Shows a more minimalistic output.
				1229
				1230
				1231	irqsoff
				1232	-------
				1233
				1234	When interrupts are disabled, the CPU cannot react to any other
				1235	external event (besides NMIs and SMIs). This prevents the timer
				1236	interrupt from triggering or the mouse interrupt from letting
				1237	the kernel know of a new mouse event. The result is added
				1238	latency in the reaction time.
				1239
				1240	The irqsoff tracer tracks the time for which interrupts are
				1241	disabled. When a new maximum latency is hit, the tracer saves
				1242	the trace leading up to that latency point so that every time a
				1243	new maximum is reached, the old saved trace is discarded and the
				1244	new trace is saved.
				1245
				1246	To reset the maximum, echo 0 into tracing_max_latency. Here is
				1247	an example::
				1248
				1249	# echo 0 > options/function-trace
				1250	# echo irqsoff > current_tracer
				1251	# echo 1 > tracing_on
				1252	# echo 0 > tracing_max_latency
				1253	# ls -ltr
				1254	[...]
				1255	# echo 0 > tracing_on
				1256	# cat trace
				1257	# tracer: irqsoff
				1258	#
				1259	# irqsoff latency trace v1.1.5 on 3.8.0-test+
				1260	# --------------------------------------------------------------------
				1261	# latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
				1262	# -----------------
				1263	# | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
				1264	# -----------------
				1265	# => started at: run_timer_softirq
				1266	# => ended at: run_timer_softirq
				1267	#
				1268	#
				1269	# _------=> CPU#
				1270	# / _-----=> irqs-off
				1271	# | / _----=> need-resched
				1272	# || / _---=> hardirq/softirq
				1273	# ||| / _--=> preempt-depth
				1274	# |||| / delay
				1275	# cmd pid ||||| time | caller
				1276	# \ / ||||| \ | /
				1277	<idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq
				1278	<idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq
				1279	<idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq
				1280	<idle>-0 0dNs3 25us : <stack trace>
				1281	=> _raw_spin_unlock_irq
				1282	=> run_timer_softirq
				1283	=> __do_softirq
				1284	=> call_softirq
				1285	=> do_softirq
				1286	=> irq_exit
				1287	=> smp_apic_timer_interrupt
				1288	=> apic_timer_interrupt
				1289	=> rcu_idle_exit
				1290	=> cpu_idle
				1291	=> rest_init
				1292	=> start_kernel
				1293	=> x86_64_start_reservations
				1294	=> x86_64_start_kernel
				1295
				1296	Here we see that we had a latency of 16 microseconds (which is
				1297	very good). The _raw_spin_lock_irq in run_timer_softirq disabled
				1298	interrupts. The difference between the 16 and the displayed
				1299	timestamp 25us occurred because the clock was incremented
				1300	between the time of recording the max latency and the time of
				1301	recording the function that had that latency.
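
The command sequence used in this example can be wrapped in a function
so that the workload of interest replaces "ls -ltr". This is a sketch
only: trace_max_latency is a hypothetical name, and the first argument
is assumed to be the tracefs directory.

```shell
# trace_max_latency DIR TRACER COMMAND [ARG...]
# Run COMMAND under the given latency tracer and print the saved trace.
# Hypothetical helper; DIR is the tracefs directory. Writing there
# normally requires root.
trace_max_latency() {
    if [ "$#" -lt 3 ]; then
        echo "usage: trace_max_latency DIR TRACER COMMAND [ARG...]" >&2
        return 1
    fi
    dir=$1 tracer=$2
    shift 2
    echo 0 > "$dir/options/function-trace"   # keep tracer overhead down
    echo "$tracer" > "$dir/current_tracer"
    echo 1 > "$dir/tracing_on"
    echo 0 > "$dir/tracing_max_latency"      # reset the recorded maximum
    "$@" > /dev/null                         # the workload being measured
    echo 0 > "$dir/tracing_on"
    cat "$dir/trace"
}
```

Assuming tracefs is mounted at /sys/kernel/tracing, the session above
corresponds to "trace_max_latency /sys/kernel/tracing irqsoff ls -ltr".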
				1302
				1303	Note the above example had function-trace not set. If we set
				1304	function-trace, we get a much larger output::
				1305
				1306	with echo 1 > options/function-trace
				1307
				1308	# tracer: irqsoff
				1309	#
				1310	# irqsoff latency trace v1.1.5 on 3.8.0-test+
				1311	# --------------------------------------------------------------------
				1312	# latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
				1313	# -----------------
				1314	# | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
				1315	# -----------------
				1316	# => started at: ata_scsi_queuecmd
				1317	# => ended at: ata_scsi_queuecmd
				1318	#
				1319	#
				1320	# _------=> CPU#
				1321	# / _-----=> irqs-off
				1322	# | / _----=> need-resched
				1323	# || / _---=> hardirq/softirq
				1324	# ||| / _--=> preempt-depth
				1325	# |||| / delay
				1326	# cmd pid ||||| time | caller
				1327	# \ / ||||| \ | /
				1328	bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
				1329	bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave
				1330	bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd
				1331	bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev
				1332	bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
				1333	bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd
				1334	bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd
				1335	bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
				1336	bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat
				1337	[...]
				1338	bash-2042 3d..1 67us : delay_tsc <-__delay
				1339	bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
				1340	bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc
				1341	bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
				1342	bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc
				1343	bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue
				1344	bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
				1345	bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
				1346	bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd
				1347	bash-2042 3d..1 120us : <stack trace>
				1348	=> _raw_spin_unlock_irqrestore
				1349	=> ata_scsi_queuecmd
				1350	=> scsi_dispatch_cmd
				1351	=> scsi_request_fn
				1352	=> __blk_run_queue_uncond
				1353	=> __blk_run_queue
				1354	=> blk_queue_bio
				1355	=> generic_make_request
				1356	=> submit_bio
				1357	=> submit_bh
				1358	=> __ext3_get_inode_loc
				1359	=> ext3_iget
				1360	=> ext3_lookup
				1361	=> lookup_real
				1362	=> __lookup_hash
				1363	=> walk_component
				1364	=> lookup_last
				1365	=> path_lookupat
				1366	=> filename_lookup
				1367	=> user_path_at_empty
				1368	=> user_path_at
				1369	=> vfs_fstatat
				1370	=> vfs_stat
				1371	=> sys_newstat
				1372	=> system_call_fastpath
				1373
				1374
				1375	Here we traced a 71 microsecond latency. But we also see all the
				1376	functions that were called during that time. Note that by
				1377	enabling function tracing, we incur an added overhead. This
				1378	overhead may extend the latency times. But nevertheless, this
				1379	trace has provided some very helpful debugging information.
				1380
				1381
				1382	preemptoff
				1383	----------
				1384
				1385	When preemption is disabled, we may be able to receive
				1386	interrupts but the task cannot be preempted and a higher
				1387	priority task must wait for preemption to be enabled again
				1388	before it can preempt a lower priority task.
				1389
				1390	The preemptoff tracer traces the places that disable preemption.
				1391	Like the irqsoff tracer, it records the maximum latency for
				1392	which preemption was disabled. The control of preemptoff tracer
				1393	is much like the irqsoff tracer.
				1394	::
				1395
				1396	# echo 0 > options/function-trace
				1397	# echo preemptoff > current_tracer
				1398	# echo 1 > tracing_on
				1399	# echo 0 > tracing_max_latency
				1400	# ls -ltr
				1401	[...]
				1402	# echo 0 > tracing_on
				1403	# cat trace
				1404	# tracer: preemptoff
				1405	#
				1406	# preemptoff latency trace v1.1.5 on 3.8.0-test+
				1407	# --------------------------------------------------------------------
				1408	# latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
				1409	# -----------------
				1410	# | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
				1411	# -----------------
				1412	# => started at: do_IRQ
				1413	# => ended at: do_IRQ
				1414	#
				1415	#
				1416	# _------=> CPU#
				1417	# / _-----=> irqs-off
				1418	# | / _----=> need-resched
				1419	# || / _---=> hardirq/softirq
				1420	# ||| / _--=> preempt-depth
				1421	# |||| / delay
				1422	# cmd pid ||||| time | caller
				1423	# \ / ||||| \ | /
				1424	sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ
				1425	sshd-1991 1d..1 46us : irq_exit <-do_IRQ
				1426	sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ
				1427	sshd-1991 1d..1 52us : <stack trace>
				1428	=> sub_preempt_count
				1429	=> irq_exit
				1430	=> do_IRQ
				1431	=> ret_from_intr
				1432
				1433
				1434	This has some more changes. Preemption was disabled when an
				1435	interrupt came in (notice the 'h'), and was enabled on exit.
				1436	But we also see that interrupts have been disabled when entering
				1437	the preempt off section and leaving it (the 'd'). We do not know if
				1438	interrupts were enabled in the meantime or shortly after this
				1439	was over.
				1440	::
				1441
				1442	# tracer: preemptoff
				1443	#
				1444	# preemptoff latency trace v1.1.5 on 3.8.0-test+
				1445	# --------------------------------------------------------------------
				1446	# latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
				1447	# -----------------
				1448	# | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
				1449	# -----------------
				1450	# => started at: wake_up_new_task
				1451	# => ended at: task_rq_unlock
				1452	#
				1453	#
				1454	# _------=> CPU#
				1455	# / _-----=> irqs-off
				1456	# | / _----=> need-resched
				1457	# || / _---=> hardirq/softirq
				1458	# ||| / _--=> preempt-depth
				1459	# |||| / delay
				1460	# cmd pid ||||| time | caller
				1461	# \ / ||||| \ | /
				1462	bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task
				1463	bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq
				1464	bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair
				1465	bash-1994 1d..1 1us : source_load <-select_task_rq_fair
				1466	bash-1994 1d..1 1us : source_load <-select_task_rq_fair
				1467	[...]
				1468	bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt
				1469	bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter
				1470	bash-1994 1d..1 13us : add_preempt_count <-irq_enter
				1471	bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt
				1472	bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt
				1473	bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt
				1474	bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock
				1475	bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt
				1476	[...]
				1477	bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event
				1478	bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt
				1479	bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit
				1480	bash-1994 1d..2 36us : do_softirq <-irq_exit
				1481	bash-1994 1d..2 36us : __do_softirq <-call_softirq
				1482	bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq
				1483	bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq
				1484	bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq
				1485	bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock
				1486	bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq
				1487	[...]
				1488	bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks
				1489	bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq
				1490	bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable
				1491	bash-1994 1dN.2 82us : idle_cpu <-irq_exit
				1492	bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit
				1493	bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit
				1494	bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
				1495	bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock
				1496	bash-1994 1.N.1 104us : <stack trace>
				1497	=> sub_preempt_count
				1498	=> _raw_spin_unlock_irqrestore
				1499	=> task_rq_unlock
				1500	=> wake_up_new_task
				1501	=> do_fork
				1502	=> sys_clone
				1503	=> stub_clone
				1504
				1505
				1506	The above is an example of the preemptoff trace with
				1507	function-trace set. Here we see that interrupts were not disabled
				1508	the entire time. The irq_enter code lets us know that we entered
				1509	an interrupt 'h'. Before that, the functions being traced still
				1510	show that it is not in an interrupt, but we can see from the
				1511	functions themselves that this is not the case.
				1512
				1513	preemptirqsoff
				1514	--------------
				1515
				1516	Knowing the locations that have interrupts disabled or
				1517	preemption disabled for the longest times is helpful. But
				1518	sometimes we would like to know when preemption and/or
				1519	interrupts are disabled.
				1520
				1521	Consider the following code::
				1522
				1523	local_irq_disable();
				1524	call_function_with_irqs_off();
				1525	preempt_disable();
				1526	call_function_with_irqs_and_preemption_off();
				1527	local_irq_enable();
				1528	call_function_with_preemption_off();
				1529	preempt_enable();
				1530
				1531	The irqsoff tracer will record the total length of
				1532	call_function_with_irqs_off() and
				1533	call_function_with_irqs_and_preemption_off().
				1534
				1535	The preemptoff tracer will record the total length of
				1536	call_function_with_irqs_and_preemption_off() and
				1537	call_function_with_preemption_off().
				1538
				1539	But neither will trace the total time that interrupts and/or
				1540	preemption are disabled. This total time is the time that we
				1541	cannot schedule. To record this time, use the preemptirqsoff
				1542	tracer.
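
To make the difference concrete, the sketch below adds up the three
windows from the code above for each tracer. The function name
latency_windows and any durations passed to it are hypothetical, and
real measurements would also include tracer overhead.

```shell
# latency_windows A B C
#   A: time spent in call_function_with_irqs_off()
#   B: time spent in call_function_with_irqs_and_preemption_off()
#   C: time spent in call_function_with_preemption_off()
# Print the span each latency tracer would cover, in the same unit as
# the inputs. Hypothetical helper with made-up inputs.
latency_windows() {
    a=$1 b=$2 c=$3
    echo "irqsoff=$((a + b))"
    echo "preemptoff=$((b + c))"
    echo "preemptirqsoff=$((a + b + c))"
}
```

With windows of 30, 50 and 20 microseconds, irqsoff would cover 80,
preemptoff 70, and only preemptirqsoff the full 100.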
				1543
				1544	Again, using this trace is much like the irqsoff and preemptoff
				1545	tracers.
				1546	::
				1547
				1548	# echo 0 > options/function-trace
				1549	# echo preemptirqsoff > current_tracer
				1550	# echo 1 > tracing_on
				1551	# echo 0 > tracing_max_latency
				1552	# ls -ltr
				1553	[...]
				1554	# echo 0 > tracing_on
				1555	# cat trace
				1556	# tracer: preemptirqsoff
				1557	#
				1558	# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
				1559	# --------------------------------------------------------------------
				1560	# latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
				1561	# -----------------
				1562	# | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
				1563	# -----------------
				1564	# => started at: ata_scsi_queuecmd
				1565	# => ended at: ata_scsi_queuecmd
				1566	#
				1567	#
				1568	# _------=> CPU#
				1569	# / _-----=> irqs-off
				1570	# | / _----=> need-resched
				1571	# || / _---=> hardirq/softirq
				1572	# ||| / _--=> preempt-depth
				1573	# |||| / delay
				1574	# cmd pid ||||| time | caller
				1575	# \ / ||||| \ | /
				1576	ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
				1577	ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
				1578	ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd
				1579	ls-2230 3...1 111us : <stack trace>
				1580	=> sub_preempt_count
				1581	=> _raw_spin_unlock_irqrestore
				1582	=> ata_scsi_queuecmd
				1583	=> scsi_dispatch_cmd
				1584	=> scsi_request_fn
				1585	=> __blk_run_queue_uncond
				1586	=> __blk_run_queue
				1587	=> blk_queue_bio
				1588	=> generic_make_request
				1589	=> submit_bio
				1590	=> submit_bh
				1591	=> ext3_bread
				1592	=> ext3_dir_bread
				1593	=> htree_dirblock_to_tree
				1594	=> ext3_htree_fill_tree
				1595	=> ext3_readdir
				1596	=> vfs_readdir
				1597	=> sys_getdents
				1598	=> system_call_fastpath
				1599
				1600
				1601	The trace_hardirqs_off_thunk is called from assembly on x86 when
				1602	interrupts are disabled in the assembly code. Without the
				1603	function tracing, we do not know if interrupts were enabled
				1604	within the preemption points. We do see that it started with
				1605	preemption enabled.
				1606
				1607	Here is a trace with function-trace set::
				1608
				1609	# tracer: preemptirqsoff
				1610	#
				1611	# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
				1612	# --------------------------------------------------------------------
				1613	# latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
				1614	# -----------------
				1615	# | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
				1616	# -----------------
				1617	# => started at: schedule
				1618	# => ended at: mutex_unlock
				1619	#
				1620	#
				1621	# _------=> CPU#
				1622	# / _-----=> irqs-off
				1623	# | / _----=> need-resched
				1624	# || / _---=> hardirq/softirq
				1625	# ||| / _--=> preempt-depth
				1626	# |||| / delay
				1627	# cmd pid ||||| time | caller
				1628	# \ / ||||| \ | /
				1629	kworker/-59 3...1 0us : __schedule <-schedule
				1630	kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch
				1631	kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq
				1632	kworker/-59 3d..2 1us : deactivate_task <-__schedule
				1633	kworker/-59 3d..2 1us : dequeue_task <-deactivate_task
				1634	kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task
				1635	kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task
				1636	kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair
				1637	kworker/-59 3d..2 2us : update_min_vruntime <-update_curr
				1638	kworker/-59 3d..2 3us : cpuacct_charge <-update_curr
				1639	kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge
				1640	kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge
				1641	kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair
				1642	kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair
				1643	kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair
				1644	kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair
				1645	kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair
				1646	kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair
				1647	kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule
				1648	kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping
				1649	kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule
				1650	kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task
				1651	kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair
				1652	kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair
				1653	kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity
				1654	ls-2269 3d..2 7us : finish_task_switch <-__schedule
				1655	ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch
				1656	ls-2269 3d..2 8us : do_IRQ <-ret_from_intr
				1657	ls-2269 3d..2 8us : irq_enter <-do_IRQ
				1658	ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter
				1659	ls-2269 3d..2 9us : add_preempt_count <-irq_enter
				1660	ls-2269 3d.h2 9us : exit_idle <-do_IRQ
				1661	[...]
				1662	ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock
				1663	ls-2269 3d.h2 20us : irq_exit <-do_IRQ
				1664	ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit
				1665	ls-2269 3d..3 21us : do_softirq <-irq_exit
				1666	ls-2269 3d..3 21us : __do_softirq <-call_softirq
				1667	ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq
				1668	ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip
				1669	ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip
				1670	ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr
				1671	ls-2269 3d.s5 31us : irq_enter <-do_IRQ
				1672	ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
				1673	[...]
				1674	ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
				1675	ls-2269 3d.s5 32us : add_preempt_count <-irq_enter
				1676	ls-2269 3d.H5 32us : exit_idle <-do_IRQ
				1677	ls-2269 3d.H5 32us : handle_irq <-do_IRQ
				1678	ls-2269 3d.H5 32us : irq_to_desc <-handle_irq
				1679	ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq
				1680	[...]
				1681	ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
				1682	ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
				1683	ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq
				1684	ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable
				1685	ls-2269 3d..3 159us : idle_cpu <-irq_exit
				1686	ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit
				1687	ls-2269 3d..3 160us : sub_preempt_count <-irq_exit
				1688	ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock
				1689	ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock
				1690	ls-2269 3d... 186us : <stack trace>
				1691	=> __mutex_unlock_slowpath
				1692	=> mutex_unlock
				1693	=> process_output
				1694	=> n_tty_write
				1695	=> tty_write
				1696	=> vfs_write
				1697	=> sys_write
				1698	=> system_call_fastpath
				1699
				1700	This is an interesting trace. It started with kworker running and
				1701	scheduling out and ls taking over. But as soon as ls released the
				1702	rq lock and enabled interrupts (but not preemption) an interrupt
				1703	triggered. When the interrupt finished, it started running softirqs.
				1704	But while the softirq was running, another interrupt triggered.
				1705	When an interrupt is running inside a softirq, the annotation is 'H'.
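
The hardirq/softirq column can be decoded mechanically. The following
is only a sketch and not part of ftrace: the function names irq_flag_of
and irq_context are made up for this example, only the annotations seen
in the traces above are handled, and the preempt-depth column is assumed
to be a single character::

  # Map the hardirq/softirq annotation to a description.
  irq_context() {
      case "$1" in
          .) echo "normal context" ;;
          s) echo "softirq running" ;;
          h) echo "hardirq running" ;;
          H) echo "hardirq inside softirq" ;;
          *) echo "unknown" ;;
      esac
  }

  # The flags field is the CPU number followed by four single-character
  # columns: irqs-off, need-resched, hardirq/softirq, preempt-depth.
  # Counting from the right also works for multi-digit CPU numbers.
  irq_flag_of() {
      printf '%s\n' "$1" | awk '{ print substr($0, length($0) - 1, 1) }'
  }

  irq_context "$(irq_flag_of 3d.H5)"    # prints "hardirq inside softirq"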
				1706
				1707
				1708	wakeup
				1709	------
				1710
One common case that people are interested in tracing is the
time it takes for a task that is woken to actually start running.
For non Real-Time tasks this time can be arbitrary, but tracing
it can be interesting nonetheless.
				1715
				1716	Without function tracing::
				1717
				1718	# echo 0 > options/function-trace
				1719	# echo wakeup > current_tracer
				1720	# echo 1 > tracing_on
				1721	# echo 0 > tracing_max_latency
				1722	# chrt -f 5 sleep 1
				1723	# echo 0 > tracing_on
				1724	# cat trace
  # tracer: wakeup
  #
  # wakeup latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
  #    -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1742	<idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H
				1743	<idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
				1744	<idle>-0 3d..3 15us : __schedule <-schedule
				1745	<idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H
				1746
The tracer only traces the highest priority task in the system
to avoid tracing the normal circumstances. Here we see that
the kworker with a nice priority of -20 (not very nice) took
just 15 microseconds from the time it was woken to the time it
ran.
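
When comparing several runs it can help to pull the latency figure out
of the header instead of reading it by eye. A minimal sketch, assuming
the "# latency: N us" header line has the form shown above (the function
name trace_latency_us is made up for this example)::

  # Print the latency in microseconds from the "# latency:" header line
  # of a latency trace read on standard input.
  trace_latency_us() {
      sed -n 's/^# latency: \([0-9][0-9]*\) us.*/\1/p'
  }

From the tracing directory it would be used as
'trace_latency_us < trace'.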
				1752
				1753	Non Real-Time tasks are not that interesting. A more interesting
				1754	trace is to concentrate only on Real-Time tasks.
				1755
				1756	wakeup_rt
				1757	---------
				1758
In a Real-Time environment it is very important to know how long
it takes from the time the highest priority task is woken up to
the time that it executes. This is also known as "schedule
latency". I stress the point that this is about RT tasks. It is
also important to know the scheduling latency of non-RT tasks,
but for those the average schedule latency is the better measure.
Tools like LatencyTop are more appropriate for such
measurements.
				1767
				1768	Real-Time environments are interested in the worst case latency.
				1769	That is the longest latency it takes for something to happen,
				1770	and not the average. We can have a very fast scheduler that may
				1771	only have a large latency once in a while, but that would not
				1772	work well with Real-Time tasks. The wakeup_rt tracer was designed
				1773	to record the worst case wakeups of RT tasks. Non-RT tasks are
				1774	not recorded because the tracer only records one worst case and
				1775	tracing non-RT tasks that are unpredictable will overwrite the
				1776	worst case latency of RT tasks (just run the normal wakeup
				1777	tracer for a while to see that effect).
				1778
				1779	Since this tracer only deals with RT tasks, we will run this
				1780	slightly differently than we did with the previous tracers.
				1781	Instead of performing an 'ls', we will run 'sleep 1' under
				1782	'chrt' which changes the priority of the task.
				1783	::
				1784
				1785	# echo 0 > options/function-trace
				1786	# echo wakeup_rt > current_tracer
				1787	# echo 1 > tracing_on
				1788	# echo 0 > tracing_max_latency
				1789	# chrt -f 5 sleep 1
				1790	# echo 0 > tracing_on
				1791	# cat trace
  # tracer: wakeup_rt
  #
  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
  #    -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1811	<idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep
				1812	<idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
				1813	<idle>-0 3d..3 5us : __schedule <-schedule
				1814	<idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
				1815
				1816
				1817	Running this on an idle system, we see that it only took 5 microseconds
				1818	to perform the task switch. Note, since the trace point in the schedule
				1819	is before the actual "switch", we stop the tracing when the recorded task
				1820	is about to schedule in. This may change if we add a new marker at the
				1821	end of the scheduler.
				1822
				1823	Notice that the recorded task is 'sleep' with the PID of 2389
				1824	and it has an rt_prio of 5. This priority is user-space priority
				1825	and not the internal kernel priority. The policy is 1 for
				1826	SCHED_FIFO and 2 for SCHED_RR.
				1827
				1828	Note, that the trace data shows the internal priority (99 - rtprio).
				1829	::
				1830
				1831	<idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
				1832
The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
and in the running state 'R'. The sleep task was scheduled in with
2389: 94:R. That is, its priority is the kernel rtprio (99 - 5 = 94)
and it too is in the running state.
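
The two conversions used above can be written down as shell arithmetic.
This is only an illustration of the (99 - rtprio) and (priority - 120)
rules from the text; the function names are made up for this example::

  # Kernel-internal priority of an RT task: 99 - rtprio.
  rt_prio_to_kernel() {
      echo $((99 - $1))
  }

  # Nice value of a non-RT task from the priority shown in the trace
  # (priority 120 corresponds to nice 0).
  kernel_prio_to_nice() {
      echo $(($1 - 120))
  }

  rt_prio_to_kernel 5        # prints 94, the sleep task under "chrt -f 5"
  kernel_prio_to_nice 120    # prints 0, the idle task in the trace

The same rule gives -20 for the kworker shown with priority 100 in the
wakeup trace, which matches the nice:-20 in that trace's header.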
				1837
				1838	Doing the same with chrt -r 5 and function-trace set.
				1839	::
				1840
  # echo 1 > options/function-trace
				1842
  # tracer: wakeup_rt
  #
  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
  #    -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1860	<idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep
				1861	<idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
				1862	<idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup
				1863	<idle>-0 3d.h3 3us : resched_curr <-check_preempt_curr
				1864	<idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup
				1865	<idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up
				1866	<idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock
				1867	<idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up
				1868	<idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
				1869	<idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
				1870	<idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer
				1871	<idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock
				1872	<idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt
				1873	<idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock
				1874	<idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt
				1875	<idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event
				1876	<idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event
				1877	<idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event
				1878	<idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt
				1879	<idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit
				1880	<idle>-0 3dN.2 9us : idle_cpu <-irq_exit
				1881	<idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit
				1882	<idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
				1883	<idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit
				1884	<idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle
				1885	<idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
				1886	<idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle
				1887	<idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
				1888	<idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit
				1889	<idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
				1890	<idle>-0 3dN.1 13us : cpu_load_update_nohz <-tick_nohz_idle_exit
				1891	<idle>-0 3dN.1 13us : _raw_spin_lock <-cpu_load_update_nohz
				1892	<idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock
				1893	<idle>-0 3dN.2 13us : __cpu_load_update <-cpu_load_update_nohz
				1894	<idle>-0 3dN.2 14us : sched_avg_update <-__cpu_load_update
				1895	<idle>-0 3dN.2 14us : _raw_spin_unlock <-cpu_load_update_nohz
				1896	<idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock
				1897	<idle>-0 3dN.1 15us : calc_load_nohz_stop <-tick_nohz_idle_exit
				1898	<idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
				1899	<idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit
				1900	<idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel
				1901	<idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
				1902	<idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
				1903	<idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave
				1904	<idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16
				1905	<idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer
				1906	<idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram
				1907	<idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event
				1908	<idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event
				1909	<idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event
				1910	<idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
				1911	<idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
				1912	<idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit
				1913	<idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
				1914	<idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
				1915	<idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
				1916	<idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
				1917	<idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
				1918	<idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
				1919	<idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave
				1920	<idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns
				1921	<idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns
				1922	<idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns
				1923	<idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event
				1924	<idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event
				1925	<idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event
				1926	<idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
				1927	<idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
				1928	<idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit
				1929	<idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks
				1930	<idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle
				1931	<idle>-0 3.N.. 25us : schedule <-cpu_idle
				1932	<idle>-0 3.N.. 25us : __schedule <-preempt_schedule
				1933	<idle>-0 3.N.. 26us : add_preempt_count <-__schedule
				1934	<idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule
				1935	<idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch
				1936	<idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch
				1937	<idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule
				1938	<idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq
				1939	<idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule
				1940	<idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task
				1941	<idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task
				1942	<idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt
				1943	<idle>-0 3d..3 29us : __schedule <-preempt_schedule
				1944	<idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep
				1945
				1946	This isn't that big of a trace, even with function tracing enabled,
				1947	so I included the entire trace.
				1948
The interrupt went off while the system was idle. Somewhere
before task_woken_rt() was called, the NEED_RESCHED flag was set.
This is indicated by the first occurrence of the 'N' flag.
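
In a longer trace the first 'N' can be located with a small filter.
This is a sketch under two assumptions: the trace is in the latency
format shown above (the second field is the CPU number followed by the
irqs-off, need-resched, hardirq/softirq and preempt-depth columns), and
the command name contains no spaces. The name first_need_resched is
made up for this example::

  # Print the first trace line whose need-resched column is 'N'.
  # Header lines (starting with '#') are skipped.
  first_need_resched() {
      awk '!/^#/ && substr($2, length($2) - 2, 1) == "N" { print; exit }'
  }

Run as 'first_need_resched < trace' on the trace above, it prints the
task_woken_rt line.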
				1952
				1953	Latency tracing and events
				1954	--------------------------
Function tracing can induce a much larger latency, but without
seeing what happens within the latency it is hard to know what
caused it. There is a middle ground, and that is to enable
events.
				1959	::
				1960
				1961	# echo 0 > options/function-trace
				1962	# echo wakeup_rt > current_tracer
				1963	# echo 1 > events/enable
				1964	# echo 1 > tracing_on
				1965	# echo 0 > tracing_max_latency
				1966	# chrt -f 5 sleep 1
				1967	# echo 0 > tracing_on
				1968	# cat trace
  # tracer: wakeup_rt
  #
  # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
  # --------------------------------------------------------------------
  # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
  #    -----------------
  #    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
  #    -----------------
  #
  #                  _------=> CPU#
  #                 / _-----=> irqs-off
  #                | / _----=> need-resched
  #                || / _---=> hardirq/softirq
  #                ||| / _--=> preempt-depth
  #                |||| /     delay
  #  cmd     pid   ||||| time  |   caller
  #     \   /      |||||  \    |   /
				1986	<idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep
				1987	<idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
				1988	<idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
				1989	<idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
				1990	<idle>-0 2.N.2 2us : power_end: cpu_id=2
				1991	<idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2
				1992	<idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
				1993	<idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
				1994	<idle>-0 2.N.2 5us : rcu_utilization: Start context switch
				1995	<idle>-0 2.N.2 5us : rcu_utilization: End context switch
				1996	<idle>-0 2d..3 6us : __schedule <-schedule
				1997	<idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep
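
With all events enabled, a quick summary of which events fired inside
the latency window can be easier to read than the raw trace. The sketch
below assumes that an event line carries the event name, terminated by
a colon, as its own whitespace-separated field (as in the sched_wakeup
and rcu_utilization lines above) and that event names are lowercase.
The function name event_counts is made up for this example::

  # Count how often each event appears in a latency trace on stdin.
  # Function entries and the wakeup/switch marker lines are ignored
  # because they contain no "name:" field starting with a letter.
  event_counts() {
      awk '!/^#/ {
               for (i = 3; i <= NF; i++)
                   if ($i ~ /^[a-z_][a-z_0-9]*:$/) {
                       name = $i
                       sub(/:$/, "", name)
                       n[name]++
                       break
                   }
           }
           END { for (e in n) print n[e], e }' | sort -k2
  }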
				1998
				1999
				2000	Hardware Latency Detector
				2001	-------------------------
				2002
				2003	The hardware latency detector is executed by enabling the "hwlat" tracer.
				2004
				2005	NOTE, this tracer will affect the performance of the system as it will
				2006	periodically make a CPU constantly busy with interrupts disabled.
				2007	::
				2008
				2009	# echo hwlat > current_tracer
				2010	# sleep 100
				2011	# cat trace
  # tracer: hwlat
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
				2021	<...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940
				2022	<...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365
				2023	<...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062
				2024	<...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633
				2025	<...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961
				2026	<...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1
				2027
				2028
The header of the above output is much the same as for the other tracers.
All events will have interrupts disabled 'd'. Under the FUNCTION title there is:
				2031
				2032	#1
				2033	This is the count of events recorded that were greater than the
				2034	tracing_threshold (See below).
				2035
				2036	inner/outer(us): 12/14
				2037
	This shows two numbers as "inner latency" and "outer latency". The test
	runs in a loop checking a timestamp twice. The latency detected between
	the two timestamps is the "inner latency" and the latency detected
	between the previous timestamp and the next timestamp in the loop is
	the "outer latency".
				2043
				2044	ts:1499801089.066141940
				2045
	The absolute timestamp at which the event happened.
				2047
				2048	nmi-total:4 nmi-count:1
				2049
				2050	On architectures that support it, if an NMI comes in during the
				2051	test, the time spent in NMI is reported in "nmi-total" (in
				2052	microseconds).
				2053
				2054	All architectures that have NMIs will show the "nmi-count" if an
				2055	NMI comes in during the test.
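
The inner/outer pairs can be summarized with a short filter. A sketch,
assuming the "inner/outer(us):" label and the "N/M" pair are separate
whitespace-delimited fields as in the output above (the function name
hwlat_max is made up for this example)::

  # Print the largest inner and outer latency (in microseconds) found
  # in hwlat trace lines read from standard input.
  hwlat_max() {
      awk '{
               for (i = 1; i < NF; i++)
                   if ($i == "inner/outer(us):") {
                       split($(i + 1), v, "/")
                       if (v[1] + 0 > imax) imax = v[1] + 0
                       if (v[2] + 0 > omax) omax = v[2] + 0
                   }
           }
           END { printf "inner=%d outer=%d\n", imax, omax }'
  }

For the six events listed above, 'hwlat_max < trace' reports inner=12
and outer=14.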
				2056
				2057	hwlat files:
				2058
				2059	tracing_threshold
				2060	This gets automatically set to "10" to represent 10
				2061	microseconds. This is the threshold of latency that
				2062	needs to be detected before the trace will be recorded.
				2063
				2064	Note, when hwlat tracer is finished (another tracer is
				2065	written into "current_tracer"), the original value for
				2066	tracing_threshold is placed back into this file.
				2067
				2068	hwlat_detector/width
				2069	The length of time the test runs with interrupts disabled.
				2070
				2071	hwlat_detector/window
	The length of time of the window in which the test
	runs. That is, the test will run for "width"
	microseconds per "window" microseconds.
				2075
				2076	tracing_cpumask
	When the test is started, a kernel thread is created that
	runs the test. This thread will alternate between CPUs
	listed in the tracing_cpumask between each period
	(one "window"). To limit the test to specific CPUs
	set the mask in this file to only the CPUs that the test
	should run on.
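
Since the test runs for "width" microseconds out of every "window"
microseconds, the share of time one CPU spends in the test follows
directly from the two values. A small sketch (the function name
hwlat_duty_percent is made up, and the numbers are hypothetical
examples rather than values read from a real system)::

  # Percentage of each window spent sampling with interrupts disabled.
  # Both arguments are in microseconds; the result is rounded down.
  hwlat_duty_percent() {
      echo $(( $1 * 100 / $2 ))
  }

  hwlat_duty_percent 500000 1000000    # prints 50

On a live system the arguments could be taken from the files described
above, for example "$(cat hwlat_detector/width)" and
"$(cat hwlat_detector/window)".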
				2083
				2084	function
				2085	--------
				2086
				2087	This tracer is the function tracer. Enabling the function tracer
				2088	can be done from the debug file system. Make sure the
				2089	ftrace_enabled is set; otherwise this tracer is a nop.
				2090	See the "ftrace_enabled" section below.
				2091	::
				2092
				2093	# sysctl kernel.ftrace_enabled=1
				2094	# echo function > current_tracer
				2095	# echo 1 > tracing_on
				2096	# usleep 1
				2097	# echo 0 > tracing_on
				2098	# cat trace
  # tracer: function
  #
  # entries-in-buffer/entries-written: 24799/24799   #P:4
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
				2110	bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write
				2111	bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock
				2112	bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify
				2113	bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify
				2114	bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify
				2115	bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock
				2116	bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock
				2117	bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify
				2118	[...]
				2119
				2120
Note: function tracer uses ring buffers to store the above
entries. The newest data may overwrite the oldest data.
Sometimes using echo to stop the trace is not sufficient because
the tracing could have overwritten the data that you wanted to
record. For this reason, it is sometimes better to disable
tracing directly from a program. This allows you to stop the
tracing at the point that you hit the part that you are
interested in. To disable the tracing directly from a C program,
something like the following code snippet can be used::
				2130
  int trace_fd;
  [...]
  int main(int argc, char *argv[]) {
          [...]
          trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
          [...]
          if (condition_hit()) {
                  write(trace_fd, "0", 1);
          }
          [...]
  }
				2142
				2143
				2144	Single thread tracing
				2145	---------------------
				2146
				2147	By writing into set_ftrace_pid you can trace a
				2148	single thread. For example::
				2149
				2150	# cat set_ftrace_pid
				2151	no pid
				2152	# echo 3111 > set_ftrace_pid
				2153	# cat set_ftrace_pid
				2154	3111
				2155	# echo function > current_tracer
  # cat trace | head
  # tracer: function
  #
  #     TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #        | |       |          |         |
				2161	yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return
				2162	yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
				2163	yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
				2164	yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
				2165	yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll
				2166	yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll
				2167	# echo > set_ftrace_pid
  # cat trace | head
  # tracer: function
  #
  #     TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #        | |       |          |         |
				2173	##### CPU 3 buffer started ####
				2174	yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait
				2175	yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry
				2176	yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry
				2177	yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit
				2178	yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit
				2179
If you want to trace the functions executed on behalf of a command
from the moment it starts, you could use something like this simple program.
				2182	::
				2183
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <string.h>

  #define _STR(x) #x
  #define STR(x) _STR(x)
  #define MAX_PATH 256

  /* Locate the tracefs mount point by scanning /proc/mounts. */
  const char *find_tracefs(void)
  {
          static char tracefs[MAX_PATH+1];
          static int tracefs_found;
          char type[100] = "";
          FILE *fp;

          if (tracefs_found)
                  return tracefs;

          if ((fp = fopen("/proc/mounts","r")) == NULL) {
                  perror("/proc/mounts");
                  return NULL;
          }

          /* Each line: device, mount point, fs type, options, two numbers */
          while (fscanf(fp, "%*s %"
                        STR(MAX_PATH)
                        "s %99s %*s %*d %*d\n",
                        tracefs, type) == 2) {
                  if (strcmp(type, "tracefs") == 0)
                          break;
          }
          fclose(fp);

          if (strcmp(type, "tracefs") != 0) {
                  fprintf(stderr, "tracefs not mounted\n");
                  return NULL;
          }

          strcat(tracefs, "/tracing/");
          tracefs_found = 1;

          return tracefs;
  }

  /* Build the full path of a file in the tracing directory. */
  const char *tracing_file(const char *file_name)
  {
          static char trace_file[MAX_PATH+1];
          snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
          return trace_file;
  }

  int main (int argc, char **argv)
  {
          /* argv[1] is the command to trace, so it must be present */
          if (argc < 2)
                  exit(-1);

          if (fork() > 0) {
                  int fd, ffd;
                  char line[64];
                  int s;

                  ffd = open(tracing_file("current_tracer"), O_WRONLY);
                  if (ffd < 0)
                          exit(-1);
                  write(ffd, "nop", 3);

                  /* Restrict tracing to this process before it execs */
                  fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
                  s = sprintf(line, "%d\n", getpid());
                  write(fd, line, s);

                  write(ffd, "function", 8);

                  close(fd);
                  close(ffd);

                  execvp(argv[1], argv+1);
          }

          return 0;
  }
				2267
				2268	Or this simple script!
				2269	::
				2270
  #!/bin/bash

  tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
  echo nop > $tracefs/tracing/current_tracer
  echo 0 > $tracefs/tracing/tracing_on
  echo $$ > $tracefs/tracing/set_ftrace_pid
  echo function > $tracefs/tracing/current_tracer
  echo 1 > $tracefs/tracing/tracing_on
  exec "$@"
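
The mount point lookup in the script can also be done by matching on
the file system type column of /proc/mounts. This is an alternative
sketch, not what the script above does; the function name
tracefs_mount_point and its optional file argument are additions for
this example so that it can be tried against a sample file::

  # Print the mount point of the first tracefs entry in a mounts table.
  # Fields are: device, mount point, file system type, options, ...
  tracefs_mount_point() {
      awk '$3 == "tracefs" { print $2; exit }' "${1:-/proc/mounts}"
  }

Called without an argument it reads /proc/mounts and prints nothing if
tracefs is not mounted.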
				2280
				2281
				2282	function graph tracer
				2283	---------------------------
				2284
				2285	This tracer is similar to the function tracer except that it
				2286	probes a function on its entry and its exit. This is done by
				2287	using a dynamically allocated stack of return addresses in each
task_struct. On function entry the tracer overwrites the return
address of each function traced to set a custom probe. Thus the
original return address is stored on the stack of return addresses
in the task_struct.
				2292
Probing on both ends of a function leads to special features
such as:

- measuring a function's execution time
- having a reliable call stack from which to draw the function call graph
				2298
				2299	This tracer is useful in several situations:
				2300
- you want to find the reason for a strange kernel behavior and
  need to see in detail what happens in any area (or in specific
  ones).

- you are experiencing weird latencies but it's difficult to
  find their origin.

- you want to quickly find which path is taken by a specific
  function.

- you just want to peek inside a working kernel and see
  what happens there.
				2313
				2314	::
				2315
  # tracer: function_graph
  #
  # CPU  DURATION                  FUNCTION CALLS
  # |     |   |                     |   |   |   |

   0)               |  sys_open() {
   0)               |    do_sys_open() {
   0)               |      getname() {
   0)               |        kmem_cache_alloc() {
   0)   1.382 us    |          __might_sleep();
   0)   2.478 us    |        }
   0)               |        strncpy_from_user() {
   0)               |          might_fault() {
   0)   1.389 us    |            __might_sleep();
   0)   2.553 us    |          }
   0)   3.807 us    |        }
   0)   7.876 us    |      }
   0)               |      alloc_fd() {
   0)   0.668 us    |        _spin_lock();
   0)   0.570 us    |        expand_files();
   0)   0.586 us    |        _spin_unlock();
				2337
				2338
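Because every leaf call and every closing brace carries a duration, the
output lends itself to simple filtering. The sketch below lists the
entries at or above a given duration. It assumes the column layout
shown above, with the duration in microseconds immediately before "us"
and a '|' separating it from the function column; the name
slow_functions is made up for this example::

  # List entries whose duration is at least MIN microseconds.
  # Reads function_graph output on stdin; lines without a duration
  # (function entries) and header lines are skipped.
  slow_functions() {
      awk -F'|' -v min="$1" '
          !/^#/ && $1 ~ / us/ {
              n = split($1, f, " ")      # e.g. "0)" "7.876" "us"
              dur = f[n - 1]
              name = $2
              gsub(/^ +| +$/, "", name)
              if (dur + 0 >= min + 0) print dur, name
          }'
  }

For non-leaf functions the duration is on the closing brace line, so
the printed name is "}" unless the trace annotates it with a comment.
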
There are several columns that can be dynamically
enabled/disabled. You can use any combination of options you
want, depending on your needs.

- The number of the CPU on which the function executed is enabled
  by default. It is sometimes better to trace only one CPU (see
  the tracing_cpumask file); otherwise you might see function
  calls out of order where the trace switches between CPUs.

  - hide: echo nofuncgraph-cpu > trace_options
  - show: echo funcgraph-cpu > trace_options
				2350
- The duration (function's time of execution) is displayed on
  the closing bracket line of a function or, for a leaf function,
  on the same line as the function itself. It is enabled by
  default.

  - hide: echo nofuncgraph-duration > trace_options
  - show: echo funcgraph-duration > trace_options

- The overhead field precedes the duration field when the
  duration reaches one of the thresholds listed under "Flags".

  - hide: echo nofuncgraph-overhead > trace_options
  - show: echo funcgraph-overhead > trace_options
  - depends on: funcgraph-duration
				2365
For example::
				2367
				2368	3) # 1837.709 us \| } /* __switch_to */
				2369	3) \| finish_task_switch() {
				2370	3) 0.313 us \| _raw_spin_unlock_irq();
				2371	3) 3.177 us \| }
				2372	3) # 1889.063 us \| } /* __schedule */
				2373	3) ! 140.417 us \| } /* __schedule */
				2374	3) # 2034.948 us \| } /* schedule */
				2375	3) * 33998.59 us \| } /* schedule_preempt_disabled */
				2376
				2377	[...]
				2378
				2379	1) 0.260 us \| msecs_to_jiffies();
				2380	1) 0.313 us \| __rcu_read_unlock();
				2381	1) + 61.770 us \| }
				2382	1) + 64.479 us \| }
				2383	1) 0.313 us \| rcu_bh_qs();
				2384	1) 0.313 us \| __local_bh_enable();
				2385	1) ! 217.240 us \| }
				2386	1) 0.365 us \| idle_cpu();
				2387	1) \| rcu_irq_exit() {
				2388	1) 0.417 us \| rcu_eqs_enter_common.isra.47();
				2389	1) 3.125 us \| }
				2390	1) ! 227.812 us \| }
				2391	1) ! 457.395 us \| }
				2392	1) @ 119760.2 us \| }
				2393
				2394	[...]
				2395
				2396	2) \| handle_IPI() {
				2397	1) 6.979 us \| }
				2398	2) 0.417 us \| scheduler_ipi();
				2399	1) 9.791 us \| }
				2400	1) + 12.917 us \| }
				2401	2) 3.490 us \| }
				2402	1) + 15.729 us \| }
				2403	1) + 18.542 us \| }
				2404	2) $ 3594274 us \| }
				2405
				2406	Flags::
				2407
				2408	+ means that the function exceeded 10 usecs.
				2409	! means that the function exceeded 100 usecs.
				2410	# means that the function exceeded 1000 usecs.
				2411	* means that the function exceeded 10 msecs.
				2412	@ means that the function exceeded 100 msecs.
				2413	$ means that the function exceeded 1 sec.
				2414
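The overhead flags make slow calls easy to spot by eye. To filter them mechanically, the duration column can be compared against a threshold. This is a sketch under stated assumptions: ``slow_calls`` is a hypothetical helper, and it assumes the only field equal to ``us`` on a line is the unit of the duration column (a trace_printk() comment containing a bare ``us`` word would confuse it).

```shell
#!/bin/sh
# slow_calls MIN_US
# Hypothetical helper: read function_graph output on stdin and print only
# the lines whose duration is at least MIN_US microseconds.
slow_calls() {
    awk -v min="$1" '
        /^#/ { next }    # skip the header lines
        {
            # The duration is the field just before the "us" unit.
            for (i = 2; i <= NF; i++)
                if ($i == "us" && $(i - 1) + 0 >= min) { print; break }
        }'
}
```

With a threshold of 100, this keeps the lines flagged ``!`` or higher and drops the rest, for example ``slow_calls 100 < trace``.
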
				2415
- The task/pid field displays the thread cmdline and pid which
  executed the function. It is disabled by default.
				2418
				2419	- hide: echo nofuncgraph-proc > trace_options
				2420	- show: echo funcgraph-proc > trace_options
				2421
For example::
				2423
				2424	# tracer: function_graph
				2425	#
				2426	# CPU TASK/PID DURATION FUNCTION CALLS
				2427	# \| \| \| \| \| \| \| \| \|
				2428	0) sh-4802 \| \| d_free() {
				2429	0) sh-4802 \| \| call_rcu() {
				2430	0) sh-4802 \| \| __call_rcu() {
				2431	0) sh-4802 \| 0.616 us \| rcu_process_gp_end();
				2432	0) sh-4802 \| 0.586 us \| check_for_new_grace_period();
				2433	0) sh-4802 \| 2.899 us \| }
				2434	0) sh-4802 \| 4.040 us \| }
				2435	0) sh-4802 \| 5.151 us \| }
				2436	0) sh-4802 \| + 49.370 us \| }
				2437
				2438
- The absolute time field is an absolute timestamp given by the
  system clock since it started. A snapshot of this time is
  given on each entry/exit of functions.
				2442
				2443	- hide: echo nofuncgraph-abstime > trace_options
				2444	- show: echo funcgraph-abstime > trace_options
				2445
For example::
				2447
				2448	#
				2449	# TIME CPU DURATION FUNCTION CALLS
				2450	# \| \| \| \| \| \| \| \|
				2451	360.774522 \| 1) 0.541 us \| }
				2452	360.774522 \| 1) 4.663 us \| }
				2453	360.774523 \| 1) 0.541 us \| __wake_up_bit();
				2454	360.774524 \| 1) 6.796 us \| }
				2455	360.774524 \| 1) 7.952 us \| }
				2456	360.774525 \| 1) 9.063 us \| }
				2457	360.774525 \| 1) 0.615 us \| journal_mark_dirty();
				2458	360.774527 \| 1) 0.578 us \| __brelse();
				2459	360.774528 \| 1) \| reiserfs_prepare_for_journal() {
				2460	360.774528 \| 1) \| unlock_buffer() {
				2461	360.774529 \| 1) \| wake_up_bit() {
				2462	360.774529 \| 1) \| bit_waitqueue() {
				2463	360.774530 \| 1) 0.594 us \| __phys_addr();
				2464
				2465
				2466	The function name is always displayed after the closing bracket
				2467	for a function if the start of that function is not in the
				2468	trace buffer.
				2469
Display of the function name after the closing bracket may be
enabled for functions whose start is in the trace buffer,
allowing easier searching with grep for function durations.
It is disabled by default.
				2474
				2475	- hide: echo nofuncgraph-tail > trace_options
				2476	- show: echo funcgraph-tail > trace_options
				2477
				2478	Example with nofuncgraph-tail (default)::
				2479
   0)               |      putname() {
   0)               |        kmem_cache_free() {
   0)   0.518 us    |          __phys_addr();
   0)   1.757 us    |        }
   0)   2.861 us    |      }
				2485
				2486	Example with funcgraph-tail::
				2487
   0)               |      putname() {
   0)               |        kmem_cache_free() {
   0)   0.518 us    |          __phys_addr();
   0)   1.757 us    |        } /* kmem_cache_free() */
   0)   2.861 us    |      } /* putname() */
				2493
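With funcgraph-tail enabled, the durations of one function can be pulled out of a trace with a short filter. The sketch below is illustrative only: ``func_durations`` is a hypothetical helper, and it accepts the tail comment both with and without the trailing ``()`` because the exact form may differ between kernel versions.

```shell
#!/bin/sh
# func_durations NAME
# Hypothetical helper: read function_graph output on stdin and print the
# duration (in us) of every completed call to NAME.
func_durations() {
    awk -v fn="$1" '
        {
            dur = ""; hit = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "us") dur = $(i - 1)           # duration column
                if ($i == (fn "();")) hit = 1            # leaf call
                if ($(i - 1) == "/*" && ($i == fn || $i == (fn "()")))
                    hit = 1                              # tail comment
            }
            if (hit && dur != "") print dur
        }'
}
```

For example, ``func_durations kmem_cache_free < trace`` prints one duration per line, ready to be sorted or averaged.
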
You can put some comments on specific functions by using
trace_printk(). For example, if you want to put a comment inside
the __might_sleep() function, you just have to include
<linux/ftrace.h> and call trace_printk() inside __might_sleep()::

  trace_printk("I'm a comment!\n");
				2500
				2501	will produce::
				2502
   1)               |             __might_sleep() {
   1)               |                /* I'm a comment! */
   1)   1.449 us    |             }
				2506
				2507
				2508	You might find other useful features for this tracer in the
				2509	following "dynamic ftrace" section such as tracing only specific
				2510	functions or tasks.
				2511
				2512	dynamic ftrace
				2513	--------------
				2514
If CONFIG_DYNAMIC_FTRACE is set, the system will run with
virtually no overhead when function tracing is disabled. The way
this works is the mcount function call (placed at the start of
every kernel function, produced by the -pg switch in gcc),
starts off pointing to a simple return. (Enabling FTRACE will
include the -pg switch in the compiling of the kernel.)

At compile time every C file object is run through the
recordmcount program (located in the scripts directory). This
program will parse the ELF headers in the C object to find all
the locations in the .text section that call mcount. Starting
with gcc version 4.6, the -mfentry option has been added for x86,
which calls "__fentry__" instead of "mcount". This call is made
before the creation of the stack frame.
				2529
Note, not all sections are traced. Functions may be excluded by
a notrace annotation or blocked in another way, and inline functions
are never traced. Check the "available_filter_functions" file to see
which functions can be traced.
				2534
				2535	A section called "__mcount_loc" is created that holds
				2536	references to all the mcount/fentry call sites in the .text section.
				2537	The recordmcount program re-links this section back into the
				2538	original object. The final linking stage of the kernel will add all these
				2539	references into a single table.
				2540
				2541	On boot up, before SMP is initialized, the dynamic ftrace code
				2542	scans this table and updates all the locations into nops. It
				2543	also records the locations, which are added to the
				2544	available_filter_functions list. Modules are processed as they
				2545	are loaded and before they are executed. When a module is
				2546	unloaded, it also removes its functions from the ftrace function
				2547	list. This is automatic in the module unload code, and the
				2548	module author does not need to worry about it.
				2549
When tracing is enabled, the process of modifying the function
tracepoints is dependent on architecture. The old method is to use
stop_machine to prevent races with the CPUs executing code being
modified (which can cause the CPU to do undesirable things, especially
if the modified code crosses cache (or page) boundaries), and then
patch the nops back to calls. But this time, they do not call mcount
(which is just a function stub). They now call into the ftrace
infrastructure.

The new method of modifying the function tracepoints is to place
a breakpoint at the location to be modified, sync all CPUs, and modify
the rest of the instruction not covered by the breakpoint. Then sync
all CPUs again and remove the breakpoint, leaving the finished
version of the ftrace call site.
				2564
				2565	Some archs do not even need to monkey around with the synchronization,
				2566	and can just slap the new code on top of the old without any
				2567	problems with other CPUs executing it at the same time.
				2568
One special side-effect to the recording of the functions being
traced is that we can now selectively choose which functions we
wish to trace and for which ones we want the mcount calls to remain
as nops.
				2573
				2574	Two files are used, one for enabling and one for disabling the
				2575	tracing of specified functions. They are:
				2576
				2577	set_ftrace_filter
				2578
				2579	and
				2580
				2581	set_ftrace_notrace
				2582
				2583	A list of available functions that you can add to these files is
				2584	listed in:
				2585
				2586	available_filter_functions
				2587
				2588	::
				2589
				2590	# cat available_filter_functions
				2591	put_prev_task_idle
				2592	kmem_cache_create
				2593	pick_next_task_rt
				2594	get_online_cpus
				2595	pick_next_task_fair
				2596	mutex_lock
				2597	[...]
				2598
				2599	If I am only interested in sys_nanosleep and hrtimer_interrupt::
				2600
				2601	# echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
				2602	# echo function > current_tracer
				2603	# echo 1 > tracing_on
				2604	# usleep 1
				2605	# echo 0 > tracing_on
				2606	# cat trace
				2607	# tracer: function
				2608	#
				2609	# entries-in-buffer/entries-written: 5/5 #P:4
				2610	#
				2611	# _-----=> irqs-off
				2612	# / _----=> need-resched
				2613	# \| / _---=> hardirq/softirq
				2614	# \|\| / _--=> preempt-depth
				2615	# \|\|\| / delay
				2616	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				2617	# \| \| \| \|\|\|\| \| \|
				2618	usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath
				2619	<idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
				2620	usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
				2621	<idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
				2622	<idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
				2623
				2624	To see which functions are being traced, you can cat the file:
				2625	::
				2626
				2627	# cat set_ftrace_filter
				2628	hrtimer_interrupt
				2629	sys_nanosleep
				2630
				2631
				2632	Perhaps this is not enough. The filters also allow glob(7) matching.
				2633
``<match>*``
    will match functions that begin with <match>
``*<match>``
    will match functions that end with <match>
``*<match>*``
    will match functions that have <match> in it
``<match1>*<match2>``
    will match functions that begin with <match1> and end with <match2>
				2642
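Before writing a glob to the filter it can help to preview what it would select. The sketch below matches a pattern against available_filter_functions using the shell's own pattern matching, which is assumed here to be close enough to the kernel's glob handling for simple patterns. ``preview_filter`` is a hypothetical helper and the ``TRACEFS`` default is an assumption.

```shell
#!/bin/sh
# Assumption: tracefs is mounted here; override TRACEFS if it is elsewhere.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# preview_filter PATTERN
# Hypothetical helper: list the traceable functions whose name matches the
# glob PATTERN. Nothing is written to the tracing files.
preview_filter() {
    pattern=$1
    # Each line holds a function name, optionally followed by "[module]".
    while read -r name rest; do
        case $name in
            $pattern) echo "$name" ;;    # unquoted: treated as a glob
        esac
    done < "$TRACEFS/available_filter_functions"
}
```

For example, ``preview_filter 'hrtimer_*'`` lists the functions that the same pattern would be expected to add to set_ftrace_filter.
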
				2643	.. note::
				2644	It is better to use quotes to enclose the wild cards,
				2645	otherwise the shell may expand the parameters into names
				2646	of files in the local directory.
				2647
				2648	::
				2649
				2650	# echo 'hrtimer_*' > set_ftrace_filter
				2651
				2652	Produces::
				2653
				2654	# tracer: function
				2655	#
				2656	# entries-in-buffer/entries-written: 897/897 #P:4
				2657	#
				2658	# _-----=> irqs-off
				2659	# / _----=> need-resched
				2660	# \| / _---=> hardirq/softirq
				2661	# \|\| / _--=> preempt-depth
				2662	# \|\|\| / delay
				2663	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				2664	# \| \| \| \|\|\|\| \| \|
				2665	<idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
				2666	<idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
				2667	<idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
				2668	<idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit
				2669	<idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
				2670	<idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
				2671	<idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter
				2672	<idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem
				2673
				2674	Notice that we lost the sys_nanosleep.
				2675	::
				2676
				2677	# cat set_ftrace_filter
				2678	hrtimer_run_queues
				2679	hrtimer_run_pending
				2680	hrtimer_init
				2681	hrtimer_cancel
				2682	hrtimer_try_to_cancel
				2683	hrtimer_forward
				2684	hrtimer_start
				2685	hrtimer_reprogram
				2686	hrtimer_force_reprogram
				2687	hrtimer_get_next_event
				2688	hrtimer_interrupt
				2689	hrtimer_nanosleep
				2690	hrtimer_wakeup
				2691	hrtimer_get_remaining
				2692	hrtimer_get_res
				2693	hrtimer_init_sleeper
				2694
				2695
This is because the '>' and '>>' act just like they do in bash.
To rewrite the filters, use '>'.
To append to the filters, use '>>'.
				2699
				2700	To clear out a filter so that all functions will be recorded
				2701	again::
				2702
				2703	# echo > set_ftrace_filter
				2704	# cat set_ftrace_filter
				2705	#
				2706
				2707	Again, now we want to append.
				2708
				2709	::
				2710
				2711	# echo sys_nanosleep > set_ftrace_filter
				2712	# cat set_ftrace_filter
				2713	sys_nanosleep
				2714	# echo 'hrtimer_*' >> set_ftrace_filter
				2715	# cat set_ftrace_filter
				2716	hrtimer_run_queues
				2717	hrtimer_run_pending
				2718	hrtimer_init
				2719	hrtimer_cancel
				2720	hrtimer_try_to_cancel
				2721	hrtimer_forward
				2722	hrtimer_start
				2723	hrtimer_reprogram
				2724	hrtimer_force_reprogram
				2725	hrtimer_get_next_event
				2726	hrtimer_interrupt
				2727	sys_nanosleep
				2728	hrtimer_nanosleep
				2729	hrtimer_wakeup
				2730	hrtimer_get_remaining
				2731	hrtimer_get_res
				2732	hrtimer_init_sleeper
				2733
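The rewrite, append and clear operations can be wrapped so that the redirection in use is explicit at the call site. This is only a sketch: the three helper names are hypothetical and the ``TRACEFS`` default is an assumption.

```shell
#!/bin/sh
# Assumption: tracefs is mounted here; override TRACEFS if it is elsewhere.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Hypothetical helpers around set_ftrace_filter.
filter_set()   { echo "$*" >  "$TRACEFS/set_ftrace_filter"; }   # '>'  rewrites
filter_add()   { echo "$*" >> "$TRACEFS/set_ftrace_filter"; }   # '>>' appends
filter_clear() { echo      >  "$TRACEFS/set_ftrace_filter"; }   # trace all again
```

The append example above then becomes ``filter_set sys_nanosleep`` followed by ``filter_add 'hrtimer_*'``.
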
				2734
The set_ftrace_notrace file prevents the listed functions from being
traced.
::

  # echo '*preempt*' '*lock*' > set_ftrace_notrace
				2740
				2741	Produces::
				2742
				2743	# tracer: function
				2744	#
				2745	# entries-in-buffer/entries-written: 39608/39608 #P:4
				2746	#
				2747	# _-----=> irqs-off
				2748	# / _----=> need-resched
				2749	# \| / _---=> hardirq/softirq
				2750	# \|\| / _--=> preempt-depth
				2751	# \|\|\| / delay
				2752	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				2753	# \| \| \| \|\|\|\| \| \|
				2754	bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open
				2755	bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last
				2756	bash-1994 [000] .... 4342.324897: ima_file_check <-do_last
				2757	bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check
				2758	bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement
				2759	bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action
				2760	bash-1994 [000] .... 4342.324899: do_truncate <-do_last
				2761	bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate
				2762	bash-1994 [000] .... 4342.324899: notify_change <-do_truncate
				2763	bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change
				2764	bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time
				2765	bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time
				2766
				2767	We can see that there's no more lock or preempt tracing.
				2768
				2769
				2770	Dynamic ftrace with the function graph tracer
				2771	---------------------------------------------
				2772
Although what has been explained above concerns both the
function tracer and the function graph tracer, there are some
special features only available in the function graph tracer.
				2776
				2777	If you want to trace only one function and all of its children,
				2778	you just have to echo its name into set_graph_function::
				2779
				2780	echo __do_fault > set_graph_function
				2781
				2782	will produce the following "expanded" trace of the __do_fault()
				2783	function::
				2784
				2785	0) \| __do_fault() {
				2786	0) \| filemap_fault() {
				2787	0) \| find_lock_page() {
				2788	0) 0.804 us \| find_get_page();
				2789	0) \| __might_sleep() {
				2790	0) 1.329 us \| }
				2791	0) 3.904 us \| }
				2792	0) 4.979 us \| }
				2793	0) 0.653 us \| _spin_lock();
				2794	0) 0.578 us \| page_add_file_rmap();
				2795	0) 0.525 us \| native_set_pte_at();
				2796	0) 0.585 us \| _spin_unlock();
				2797	0) \| unlock_page() {
				2798	0) 0.541 us \| page_waitqueue();
				2799	0) 0.639 us \| __wake_up_bit();
				2800	0) 2.786 us \| }
				2801	0) + 14.237 us \| }
				2802	0) \| __do_fault() {
				2803	0) \| filemap_fault() {
				2804	0) \| find_lock_page() {
				2805	0) 0.698 us \| find_get_page();
				2806	0) \| __might_sleep() {
				2807	0) 1.412 us \| }
				2808	0) 3.950 us \| }
				2809	0) 5.098 us \| }
				2810	0) 0.631 us \| _spin_lock();
				2811	0) 0.571 us \| page_add_file_rmap();
				2812	0) 0.526 us \| native_set_pte_at();
				2813	0) 0.586 us \| _spin_unlock();
				2814	0) \| unlock_page() {
				2815	0) 0.533 us \| page_waitqueue();
				2816	0) 0.638 us \| __wake_up_bit();
				2817	0) 2.793 us \| }
				2818	0) + 14.012 us \| }
				2819
				2820	You can also expand several functions at once::
				2821
				2822	echo sys_open > set_graph_function
				2823	echo sys_close >> set_graph_function
				2824
				2825	Now if you want to go back to trace all functions you can clear
				2826	this special filter via::
				2827
				2828	echo > set_graph_function
				2829
				2830
				2831	ftrace_enabled
				2832	--------------
				2833
Note, the proc sysctl ftrace_enabled is a big on/off switch for the
				2835	function tracer. By default it is enabled (when function tracing is
				2836	enabled in the kernel). If it is disabled, all function tracing is
				2837	disabled. This includes not only the function tracers for ftrace, but
				2838	also for any other uses (perf, kprobes, stack tracing, profiling, etc).
				2839
				2840	Please disable this with care.
				2841
This can be disabled (and enabled) with::
				2843
				2844	sysctl kernel.ftrace_enabled=0
				2845	sysctl kernel.ftrace_enabled=1
				2846
				2847	or
				2848
				2849	echo 0 > /proc/sys/kernel/ftrace_enabled
				2850	echo 1 > /proc/sys/kernel/ftrace_enabled
				2851
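Because a disabled switch silently stops all function tracing, a script may want to check it before starting a trace. A minimal sketch, in which the ``ftrace_is_enabled`` helper and the ``FTRACE_SYSCTL`` override variable are hypothetical:

```shell
#!/bin/sh
# Path of the sysctl file described above; the override variable is an
# assumption made so the helper can be pointed at another file.
FTRACE_SYSCTL="${FTRACE_SYSCTL:-/proc/sys/kernel/ftrace_enabled}"

# ftrace_is_enabled
# Hypothetical helper: succeed only if the switch currently reads 1.
ftrace_is_enabled() {
    [ "$(cat "$FTRACE_SYSCTL")" = "1" ]
}
```

A caller could then write ``ftrace_is_enabled || echo "function tracing is switched off" >&2`` before selecting a tracer.
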
				2852
				2853	Filter commands
				2854	---------------
				2855
				2856	A few commands are supported by the set_ftrace_filter interface.
				2857	Trace commands have the following format::
				2858
				2859	<function>:<command>:<parameter>
				2860
				2861	The following commands are supported:
				2862
- mod:
  This command enables function filtering per module. The
  parameter defines the module. For example, if only the write*
  functions in the ext3 module are desired, run::

    echo 'write*:mod:ext3' > set_ftrace_filter

  This command interacts with the filter in the same way as
  filtering based on function names. Thus, adding more functions
  in a different module is accomplished by appending (>>) to the
  filter file. Remove specific module functions by prepending
  '!'::

    echo '!writeback*:mod:ext3' >> set_ftrace_filter

  The mod command supports module globbing. Disable tracing for all
  functions except a specific module::

    echo '!*:mod:!ext3' >> set_ftrace_filter

  Disable tracing for all modules, but still trace kernel::

    echo '!*:mod:*' >> set_ftrace_filter

  Enable filter only for kernel::

    echo '*write*:mod:!*' >> set_ftrace_filter

  Enable filter for module globbing::

    echo '*write*:mod:*snd*' >> set_ftrace_filter
				2894
				2895	- traceon/traceoff:
				2896	These commands turn tracing on and off when the specified
				2897	functions are hit. The parameter determines how many times the
				2898	tracing system is turned on and off. If unspecified, there is
				2899	no limit. For example, to disable tracing when a schedule bug
				2900	is hit the first 5 times, run::
				2901
				2902	echo '__schedule_bug:traceoff:5' > set_ftrace_filter
				2903
				2904	To always disable tracing when __schedule_bug is hit::
				2905
				2906	echo '__schedule_bug:traceoff' > set_ftrace_filter
				2907
  These commands are cumulative whether or not they are appended
  to set_ftrace_filter. To remove a command, prefix it with '!'.
  If the command was added with a counter, give 0 as the parameter::

    echo '!__schedule_bug:traceoff:0' > set_ftrace_filter

  The above removes the traceoff command for __schedule_bug
  that has a counter. To remove commands without counters::
				2916
				2917	echo '!__schedule_bug:traceoff' > set_ftrace_filter
				2918
				2919	- snapshot:
				2920	Will cause a snapshot to be triggered when the function is hit.
				2921	::
				2922
				2923	echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
				2924
				2925	To only snapshot once:
				2926	::
				2927
				2928	echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
				2929
				2930	To remove the above commands::
				2931
				2932	echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
				2933	echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
				2934
				2935	- enable_event/disable_event:
				2936	These commands can enable or disable a trace event. Note, because
				2937	function tracing callbacks are very sensitive, when these commands
				2938	are registered, the trace point is activated, but disabled in
				2939	a "soft" mode. That is, the tracepoint will be called, but
				2940	just will not be traced. The event tracepoint stays in this mode
				2941	as long as there's a command that triggers it.
				2942	::
				2943
				2944	echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
				2945	set_ftrace_filter
				2946
				2947	The format is::
				2948
				2949	<function>:enable_event:<system>:<event>[:count]
				2950	<function>:disable_event:<system>:<event>[:count]
				2951
				2952	To remove the events commands::
				2953
				2954	echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
				2955	set_ftrace_filter
				2956	echo '!schedule:disable_event:sched:sched_switch' > \
				2957	set_ftrace_filter
				2958
- dump:
  When the function is hit, it will dump the contents of the ftrace
  ring buffer to the console. This is useful if you need to debug
  something, and want to dump the trace when a certain function
  is hit. Perhaps it's a function that is called before a triple
  fault happens and does not allow you to get a regular dump.
				2965
				2966	- cpudump:
				2967	When the function is hit, it will dump the contents of the ftrace
				2968	ring buffer for the current CPU to the console. Unlike the "dump"
				2969	command, it only prints out the contents of the ring buffer for the
				2970	CPU that executed the function that triggered the dump.
				2971
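The command strings above all follow the same pattern, so they can be assembled rather than typed by hand. The helpers below are a sketch with hypothetical names. They only build the strings shown in this section, and the caller still redirects the result into set_ftrace_filter.

```shell
#!/bin/sh
# filter_cmd FUNCTION COMMAND [PARAMETER]
# Hypothetical helper: build "<function>:<command>[:<parameter>]".
filter_cmd() {
    if [ -n "${3:-}" ]; then
        printf '%s:%s:%s\n' "$1" "$2" "$3"
    else
        printf '%s:%s\n' "$1" "$2"
    fi
}

# filter_cmd_remove FUNCTION COMMAND [counted]
# Hypothetical helper: build the removal form. Pass "counted" as the third
# argument if the command was registered with a counter, in which case the
# examples above use a parameter of 0.
filter_cmd_remove() {
    if [ "${3:-}" = "counted" ]; then
        printf '!%s:%s:0\n' "$1" "$2"
    else
        printf '!%s:%s\n' "$1" "$2"
    fi
}
```

For example, ``filter_cmd __schedule_bug traceoff 5 > set_ftrace_filter`` registers the counted traceoff command, and ``filter_cmd_remove __schedule_bug traceoff counted > set_ftrace_filter`` removes it again.
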
				2972	trace_pipe
				2973	----------
				2974
				2975	The trace_pipe outputs the same content as the trace file, but
				2976	the effect on the tracing is different. Every read from
				2977	trace_pipe is consumed. This means that subsequent reads will be
				2978	different. The trace is live.
				2979	::
				2980
				2981	# echo function > current_tracer
				2982	# cat trace_pipe > /tmp/trace.out &
				2983	[1] 4153
				2984	# echo 1 > tracing_on
				2985	# usleep 1
				2986	# echo 0 > tracing_on
				2987	# cat trace
				2988	# tracer: function
				2989	#
				2990	# entries-in-buffer/entries-written: 0/0 #P:4
				2991	#
				2992	# _-----=> irqs-off
				2993	# / _----=> need-resched
				2994	# \| / _---=> hardirq/softirq
				2995	# \|\| / _--=> preempt-depth
				2996	# \|\|\| / delay
				2997	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				2998	# \| \| \| \|\|\|\| \| \|
				2999
				3000	#
				3001	# cat /tmp/trace.out
				3002	bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write
				3003	bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock
				3004	bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify
				3005	bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify
				3006	bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify
				3007	bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock
				3008	bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock
				3009	bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify
				3010	bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath
				3011
				3012
				3013	Note, reading the trace_pipe file will block until more input is
				3014	added.
				3015
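Since a reader of trace_pipe blocks rather than reaching end-of-file, a script has to stop the reader itself. The sketch below mirrors the background ``cat`` used above. ``record_pipe`` is a hypothetical helper, the ``TRACEFS`` default is an assumption, and the one-second drain delay is an arbitrary choice.

```shell
#!/bin/sh
# Assumption: tracefs is mounted here; override TRACEFS if it is elsewhere.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# record_pipe OUTFILE CMD [ARGS...]
# Hypothetical helper: stream trace_pipe into OUTFILE while CMD runs.
record_pipe() {
    out=$1; shift
    cat "$TRACEFS/trace_pipe" > "$out" &
    reader=$!
    echo 1 > "$TRACEFS/tracing_on"
    "$@"
    echo 0 > "$TRACEFS/tracing_on"
    sleep 1                              # crude: let the reader drain
    kill "$reader" 2>/dev/null || true   # the reader would block forever
    wait "$reader" 2>/dev/null || true
    return 0
}
```

Usage would look like ``record_pipe /tmp/trace.out usleep 1``, after a tracer has been selected in current_tracer.
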
				3016	trace entries
				3017	-------------
				3018
Having too much or not enough data can be troublesome in
diagnosing an issue in the kernel. The file buffer_size_kb is
used to modify the size of the internal trace buffers. The
number listed is the size, in kilobytes, of the buffer for each
CPU. To know the full size, multiply this number by the number
of possible CPUs.
				3025	::
				3026
				3027	# cat buffer_size_kb
				3028	1408 (units kilobytes)
				3029
				3030	Or simply read buffer_total_size_kb
				3031	::
				3032
				3033	# cat buffer_total_size_kb
				3034	5632
				3035
To modify the buffer, simply echo in a number (in 1024 byte segments).
				3037	::
				3038
				3039	# echo 10000 > buffer_size_kb
				3040	# cat buffer_size_kb
				3041	10000 (units kilobytes)
				3042
				3043	It will try to allocate as much as possible. If you allocate too
				3044	much, it can cause Out-Of-Memory to trigger.
				3045	::
				3046
				3047	# echo 1000000000000 > buffer_size_kb
				3048	-bash: echo: write error: Cannot allocate memory
				3049	# cat buffer_size_kb
				3050	85
				3051
				3052	The per_cpu buffers can be changed individually as well:
				3053	::
				3054
				3055	# echo 10000 > per_cpu/cpu0/buffer_size_kb
				3056	# echo 100 > per_cpu/cpu1/buffer_size_kb
				3057
				3058	When the per_cpu buffers are not the same, the buffer_size_kb
				3059	at the top level will just show an X
				3060	::
				3061
				3062	# cat buffer_size_kb
				3063	X
				3064
				3065	This is where the buffer_total_size_kb is useful:
				3066	::
				3067
				3068	# cat buffer_total_size_kb
				3069	12916
				3070
				3071	Writing to the top level buffer_size_kb will reset all the buffers
				3072	to be the same again.
				3073
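The total can also be cross-checked by summing the per-CPU files. This is a sketch only: ``total_buffer_kb`` is a hypothetical helper, the ``TRACEFS`` default is an assumption, and it assumes that the first field of each per-CPU buffer_size_kb file is the size in kilobytes and that every CPU counted in the total has a per_cpu directory.

```shell
#!/bin/sh
# Assumption: tracefs is mounted here; override TRACEFS if it is elsewhere.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# total_buffer_kb
# Hypothetical helper: add up the first field of every per-CPU
# buffer_size_kb file and print the sum in kilobytes.
total_buffer_kb() {
    cat "$TRACEFS"/per_cpu/cpu*/buffer_size_kb |
        awk '{ sum += $1 } END { print sum + 0 }'
}
```

With the sizes used above on a four-CPU machine (10000 and 100 for cpu0 and cpu1, 1408 for each of the other two), the sum is 10000 + 100 + 1408 + 1408 = 12916, which agrees with the buffer_total_size_kb value shown.
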
				3074	Snapshot
				3075	--------
				3076	CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
				3077	available to all non latency tracers. (Latency tracers which
				3078	record max latency, such as "irqsoff" or "wakeup", can't use
				3079	this feature, since those are already using the snapshot
				3080	mechanism internally.)
				3081
				3082	Snapshot preserves a current trace buffer at a particular point
				3083	in time without stopping tracing. Ftrace swaps the current
				3084	buffer with a spare buffer, and tracing continues in the new
				3085	current (=previous spare) buffer.
				3086
				3087	The following tracefs files in "tracing" are related to this
				3088	feature:
				3089
				3090	snapshot:
				3091
				3092	This is used to take a snapshot and to read the output
				3093	of the snapshot. Echo 1 into this file to allocate a
				3094	spare buffer and to take a snapshot (swap), then read
				3095	the snapshot from this file in the same format as
				3096	"trace" (described above in the section "The File
	System"). Reading the snapshot and tracing can be done
	in parallel. When the spare buffer is allocated, echoing
	0 frees it, and echoing any other (positive) value clears
	the snapshot contents.
				3101	More details are shown in the table below.
				3102
	+--------------+------------+------------+------------+
	|status\\input |     0      |     1      |    else    |
	+==============+============+============+============+
	|not allocated |(do nothing)| alloc+swap |(do nothing)|
	+--------------+------------+------------+------------+
	|allocated     |    free    |    swap    |   clear    |
	+--------------+------------+------------+------------+
				3110
				3111	Here is an example of using the snapshot feature.
				3112	::
				3113
				3114	# echo 1 > events/sched/enable
				3115	# echo 1 > snapshot
				3116	# cat snapshot
				3117	# tracer: nop
				3118	#
				3119	# entries-in-buffer/entries-written: 71/71 #P:8
				3120	#
				3121	# _-----=> irqs-off
				3122	# / _----=> need-resched
				3123	# \| / _---=> hardirq/softirq
				3124	# \|\| / _--=> preempt-depth
				3125	# \|\|\| / delay
				3126	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				3127	# \| \| \| \|\|\|\| \| \|
				3128	<idle>-0 [005] d... 2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
				3129	sleep-2242 [005] d... 2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
				3130	[...]
				3131	<idle>-0 [002] d... 2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
				3132
				3133	# cat trace
				3134	# tracer: nop
				3135	#
				3136	# entries-in-buffer/entries-written: 77/77 #P:8
				3137	#
				3138	# _-----=> irqs-off
				3139	# / _----=> need-resched
				3140	# \| / _---=> hardirq/softirq
				3141	# \|\| / _--=> preempt-depth
				3142	# \|\|\| / delay
				3143	# TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION
				3144	# \| \| \| \|\|\|\| \| \|
				3145	<idle>-0 [007] d... 2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
				3146	snapshot-test-2-2229 [002] d... 2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
				3147	[...]
				3148
				3149
If you try to use the snapshot feature while the current tracer is
one of the latency tracers, both writing to and reading from the
snapshot file fail with EBUSY.
				3152	::
				3153
				3154	# echo wakeup > current_tracer
				3155	# echo 1 > snapshot
				3156	bash: echo: write error: Device or resource busy
				3157	# cat snapshot
				3158	cat: snapshot: Device or resource busy
				3159
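A script that takes snapshots can avoid this error by checking
current_tracer first. The following is a minimal sketch, not part of
ftrace: the function name save_snapshot is hypothetical, and the list of
tracer names is an assumption based on the latency tracers described
earlier in this document.

```shell
# Hypothetical helper (not part of ftrace): take a snapshot only when the
# current tracer is not one of the latency tracers.
# $1 = tracing directory, $2 = file to save the snapshot to
save_snapshot() {
    tracing_dir=$1
    out_file=$2
    tracer=$(cat "$tracing_dir/current_tracer") || return 1

    case "$tracer" in
    # Assumed list of latency tracers; adjust it to match available_tracers.
    irqsoff|preemptoff|preemptirqsoff|wakeup|wakeup_rt|wakeup_dl)
        echo "snapshot unavailable while '$tracer' is the current tracer" >&2
        return 1
        ;;
    esac

    # Writing 1 allocates the spare buffer if needed and swaps it in.
    echo 1 > "$tracing_dir/snapshot" || return 1
    # Save the frozen contents so they survive the next snapshot.
    cat "$tracing_dir/snapshot" > "$out_file"
}
```

For example, assuming tracefs is mounted at /sys/kernel/tracing,
"save_snapshot /sys/kernel/tracing /tmp/snapshot.txt" stores the snapshot
in /tmp/snapshot.txt, or returns non-zero without touching the snapshot
file when a latency tracer is active.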
				3160
				3161	Instances
				3162	---------
The tracefs tracing directory contains a directory called "instances".
New directories can be created inside it with mkdir and removed with
rmdir. A directory created with mkdir here is populated with files and
other directories as soon as it is created.
				3168	::
				3169
				3170	# mkdir instances/foo
				3171	# ls instances/foo
				3172	buffer_size_kb buffer_total_size_kb events free_buffer per_cpu
				3173	set_event snapshot trace trace_clock trace_marker trace_options
				3174	trace_pipe tracing_on
				3175
As you can see, the new directory looks similar to the tracing directory
itself. In fact, it is very similar, except that the buffer and
events are independent of the main directory and of any other
instances that are created.
				3180
The files in the new directory work just like the files with the
same name in the tracing directory, except that they operate on a
separate, new buffer. The files affect that buffer but do not
affect the main buffer, with the exception of trace_options. Currently,
trace_options affects all instances and the top level buffer
alike, but this may change in future releases. That is, options
may become specific to the instance they reside in.
				3188
Notice that none of the function tracer files are there, nor are
current_tracer and available_tracers. This is because the buffers
can currently only have events enabled for them.
				3192	::
				3193
				3194	# mkdir instances/foo
				3195	# mkdir instances/bar
				3196	# mkdir instances/zoot
				3197	# echo 100000 > buffer_size_kb
				3198	# echo 1000 > instances/foo/buffer_size_kb
				3199	# echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
# echo function > current_tracer
				3201	# echo 1 > instances/foo/events/sched/sched_wakeup/enable
				3202	# echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
				3203	# echo 1 > instances/foo/events/sched/sched_switch/enable
				3204	# echo 1 > instances/bar/events/irq/enable
				3205	# echo 1 > instances/zoot/events/syscalls/enable
				3206	# cat trace_pipe
				3207	CPU:2 [LOST 11745 EVENTS]
				3208	bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
				3209	bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
				3210	bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
				3211	bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
				3212	bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
				3213	bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
				3214	bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
				3215	bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
				3216	bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
				3217	bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
				3218	bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process
				3219	[...]
				3220
				3221	# cat instances/foo/trace_pipe
				3222	bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
				3223	bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
				3224	<idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
				3225	<idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
				3226	rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
				3227	bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
				3228	bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
				3229	bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
				3230	kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
				3231	kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
				3232	[...]
				3233
				3234	# cat instances/bar/trace_pipe
				3235	migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX]
				3236	<idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX]
				3237	bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER]
				3238	bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU]
				3239	bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER]
				3240	bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER]
				3241	bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU]
				3242	bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU]
				3243	sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
				3244	sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled
				3245	sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0
				3246	sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled
				3247	[...]
				3248
				3249	# cat instances/zoot/trace
				3250	# tracer: nop
				3251	#
				3252	# entries-in-buffer/entries-written: 18996/18996 #P:4
				3253	#
#                             _-----=> irqs-off
#                            / _----=> need-resched
#                           | / _---=> hardirq/softirq
#                           || / _--=> preempt-depth
#                           ||| /     delay
#          TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#             | |       |   ||||       |         |
				3261	bash-1998 [000] d... 140.733501: sys_write -> 0x2
				3262	bash-1998 [000] d... 140.733504: sys_dup2(oldfd: a, newfd: 1)
				3263	bash-1998 [000] d... 140.733506: sys_dup2 -> 0x1
				3264	bash-1998 [000] d... 140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
				3265	bash-1998 [000] d... 140.733509: sys_fcntl -> 0x1
				3266	bash-1998 [000] d... 140.733510: sys_close(fd: a)
				3267	bash-1998 [000] d... 140.733510: sys_close -> 0x0
				3268	bash-1998 [000] d... 140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
				3269	bash-1998 [000] d... 140.733515: sys_rt_sigprocmask -> 0x0
				3270	bash-1998 [000] d... 140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
				3271	bash-1998 [000] d... 140.733516: sys_rt_sigaction -> 0x0
				3272
You can see that the trace of the topmost trace buffer shows only
the function tracing. The foo instance displays wakeups and task
switches, the bar instance displays interrupt and softirq events,
and the zoot instance displays system calls.
				3276
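Because each instance has its own events directory, enabling an event
for one instance is just a write to that instance's enable file. The
following is a minimal sketch of that step; the function name
enable_instance_event is hypothetical and not part of ftrace.

```shell
# Hypothetical helper (not part of ftrace): enable one event in an instance,
# creating the instance first if it does not exist yet.
# $1 = tracing directory, $2 = instance name, $3 = event as system/event
enable_instance_event() {
    instance_dir=$1/instances/$2
    # In tracefs, mkdir populates the new instance with its control files.
    [ -d "$instance_dir" ] || mkdir "$instance_dir" || return 1
    # Only this instance's buffer records the event.
    echo 1 > "$instance_dir/events/$3/enable"
}
```

Assuming tracefs is mounted at /sys/kernel/tracing,
"enable_instance_event /sys/kernel/tracing foo sched/sched_switch" has the
same effect as the mkdir and the sched_switch echo in the example above.
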
				3277	To remove the instances, simply delete their directories:
				3278	::
				3279
				3280	# rmdir instances/foo
				3281	# rmdir instances/bar
				3282	# rmdir instances/zoot
				3283
				3284	Note, if a process has a trace file open in one of the instance
				3285	directories, the rmdir will fail with EBUSY.
				3286
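A cleanup script may want to retry the rmdir for a short while, in case
a reader is about to close its trace file. The following is a minimal
sketch under that assumption; the function name remove_instance and the
retry policy are hypothetical and not part of ftrace.

```shell
# Hypothetical helper (not part of ftrace): remove an instance, retrying
# while rmdir fails (for example with EBUSY because a file is still open).
# $1 = tracing directory, $2 = instance name, $3 = attempts (default 5)
remove_instance() {
    instance_dir=$1/instances/$2
    attempts=${3:-5}
    while [ "$attempts" -gt 0 ]; do
        if rmdir "$instance_dir" 2>/dev/null; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Give readers of trace or trace_pipe a moment to close their files.
        if [ "$attempts" -gt 0 ]; then
            sleep 1
        fi
    done
    echo "could not remove $instance_dir (still in use?)" >&2
    return 1
}
```

For example, "remove_instance /sys/kernel/tracing foo" tries up to five
times, one second apart, before giving up with a non-zero status.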
				3287
				3288	Stack trace
				3289	-----------
Since the kernel has a fixed-size stack, it is important not to
waste it in functions. A kernel developer must be conscious of
what they allocate on the stack. If they add too much, the system
can be in danger of a stack overflow, and corruption will occur,
usually leading to a system panic.
				3295
There are some tools that check this, usually with interrupts
that periodically check usage. A check performed at every function
call is far more thorough. As ftrace provides
a function tracer, it makes it convenient to check the stack size
at every function call. This is enabled via the stack tracer.
				3301
				3302	CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
				3303	To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
				3304	::
				3305
				3306	# echo 1 > /proc/sys/kernel/stack_tracer_enabled
				3307
You can also enable it from the kernel command line to trace
the stack size of the kernel during boot up, by adding "stacktrace"
to the kernel command line parameters.
				3311
				3312	After running it for a few minutes, the output looks like:
				3313	::
				3314
				3315	# cat stack_max_size
				3316	2928
				3317
				3318	# cat stack_trace
				3319	Depth Size Location (18 entries)
				3320	----- ---- --------
				3321	0) 2928 224 update_sd_lb_stats+0xbc/0x4ac
				3322	1) 2704 160 find_busiest_group+0x31/0x1f1
				3323	2) 2544 256 load_balance+0xd9/0x662
				3324	3) 2288 80 idle_balance+0xbb/0x130
				3325	4) 2208 128 __schedule+0x26e/0x5b9
				3326	5) 2080 16 schedule+0x64/0x66
				3327	6) 2064 128 schedule_timeout+0x34/0xe0
				3328	7) 1936 112 wait_for_common+0x97/0xf1
				3329	8) 1824 16 wait_for_completion+0x1d/0x1f
				3330	9) 1808 128 flush_work+0xfe/0x119
				3331	10) 1680 16 tty_flush_to_ldisc+0x1e/0x20
				3332	11) 1664 48 input_available_p+0x1d/0x5c
				3333	12) 1616 48 n_tty_poll+0x6d/0x134
				3334	13) 1568 64 tty_poll+0x64/0x7f
				3335	14) 1504 880 do_select+0x31e/0x511
				3336	15) 624 400 core_sys_select+0x177/0x216
				3337	16) 224 96 sys_select+0x91/0xb9
				3338	17) 128 128 system_call_fastpath+0x16/0x1b
				3339
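In this output, the Size column is the stack used by each individual
function, and the sizes add up to the Depth of entry 0 (2928 bytes here).
To find the functions that use the most stack, the listing can be sorted
by that column. The following is a minimal sketch; the function name
largest_frames is hypothetical and not part of ftrace, and it assumes
the three-column layout shown above.

```shell
# Hypothetical helper (not part of ftrace): print the N largest stack
# frames from a stack_trace listing as "size location".
# $1 = path to a stack_trace file, $2 = number of entries (default 3)
largest_frames() {
    # Keep only entry lines such as "14) 1504 880 do_select+0x31e/0x511";
    # field 3 is the frame size and field 4 is the location.
    awk '$1 ~ /^[0-9]+\)$/ { print $3, $4 }' "$1" |
        sort -rn |
        head -n "${2:-3}"
}
```

Applied to the listing above, "largest_frames stack_trace 3" reports
do_select (880 bytes), core_sys_select (400 bytes) and load_balance
(256 bytes).
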
				3340	Note, if -mfentry is being used by gcc, functions get traced before
				3341	they set up the stack frame. This means that leaf level functions
				3342	are not tested by the stack tracer when -mfentry is used.
				3343
				3344	Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
				3345
				3346	More
				3347	----
				3348	More details can be found in the source code, in the `kernel/trace/*.c` files.