<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>A Tour Through RCU's Requirements [LWN.net]</title>
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">

<h1>A Tour Through RCU's Requirements</h1>

<p>Copyright IBM Corporation, 2015</p>
<p>Author: Paul E. McKenney</p>
<p><i>The initial version of this document appeared in the
<a href="https://lwn.net/">LWN</a> articles
<a href="https://lwn.net/Articles/652156/">here</a>,
<a href="https://lwn.net/Articles/652677/">here</a>, and
<a href="https://lwn.net/Articles/653326/">here</a>.</i></p>

<h2>Introduction</h2>

<p>
Read-copy update (RCU) is a synchronization mechanism that is often
used as a replacement for reader-writer locking.
RCU is unusual in that updaters do not block readers,
which means that RCU's read-side primitives can be exceedingly fast
and scalable.
In addition, updaters can make useful forward progress concurrently
with readers.
However, all this concurrency between RCU readers and updaters does raise
the question of exactly what RCU readers are doing, which in turn
raises the question of exactly what RCU's requirements are.

<p>
This document therefore summarizes RCU's requirements, and can be thought
of as an informal, high-level specification for RCU.
It is important to understand that RCU's specification is primarily
empirical in nature;
in fact, I learned about many of these requirements the hard way.
This situation might cause some consternation; however, not only
has this learning process been a lot of fun, but it has also been
a great privilege to work with so many people willing to apply
technologies in interesting new ways.

<p>
All that aside, here are the categories of currently known RCU requirements:
</p>

<ol>
<li>	<a href="#Fundamental Requirements">
	Fundamental Requirements</a>
<li>	<a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a>
<li>	<a href="#Parallelism Facts of Life">
	Parallelism Facts of Life</a>
<li>	<a href="#Quality-of-Implementation Requirements">
	Quality-of-Implementation Requirements</a>
<li>	<a href="#Linux Kernel Complications">
	Linux Kernel Complications</a>
<li>	<a href="#Software-Engineering Requirements">
	Software-Engineering Requirements</a>
<li>	<a href="#Other RCU Flavors">
	Other RCU Flavors</a>
<li>	<a href="#Possible Future Changes">
	Possible Future Changes</a>
</ol>

<p>
This is followed by a <a href="#Summary">summary</a>,
which is in turn followed by the inevitable
<a href="#Answers to Quick Quizzes">answers to the quick quizzes</a>.

<h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2>

<p>
RCU's fundamental requirements are the closest thing RCU has to hard
mathematical requirements.
These are:

<ol>
<li>	<a href="#Grace-Period Guarantee">
	Grace-Period Guarantee</a>
<li>	<a href="#Publish-Subscribe Guarantee">
	Publish-Subscribe Guarantee</a>
<li>	<a href="#Memory-Barrier Guarantees">
	Memory-Barrier Guarantees</a>
<li>	<a href="#RCU Primitives Guaranteed to Execute Unconditionally">
	RCU Primitives Guaranteed to Execute Unconditionally</a>
<li>	<a href="#Guaranteed Read-to-Write Upgrade">
	Guaranteed Read-to-Write Upgrade</a>
</ol>

<h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3>

<p>
RCU's grace-period guarantee is unusual in being premeditated:
Jack Slingwine and I had this guarantee firmly in mind when we started
work on RCU (then called “rclock”) in the early 1990s.
That said, the past two decades of experience with RCU have produced
a much more detailed understanding of this guarantee.

<p>
RCU's grace-period guarantee allows updaters to wait for the completion
of all pre-existing RCU read-side critical sections.
An RCU read-side critical section
begins with the marker <tt>rcu_read_lock()</tt> and ends with
the marker <tt>rcu_read_unlock()</tt>.
These markers may be nested, and RCU treats a nested set as one
big RCU read-side critical section.
Production-quality implementations of <tt>rcu_read_lock()</tt> and
<tt>rcu_read_unlock()</tt> are extremely lightweight, and in
fact have exactly zero overhead in Linux kernels built for production
use with <tt>CONFIG_PREEMPT=n</tt>.

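<p>
For illustration, here is a minimal sketch of such nesting, in which
only the outermost pair of markers delimits the critical section
(<tt>do_this()</tt> and <tt>do_that()</tt> are hypothetical functions):

<blockquote>
<pre>
 1 void nested_reader(void)
 2 {
 3   rcu_read_lock();   /* Critical section begins. */
 4   do_this();
 5   rcu_read_lock();   /* Nesting is legal... */
 6   do_that();
 7   rcu_read_unlock(); /* ...and this does not end the critical section. */
 8   rcu_read_unlock(); /* Critical section ends here. */
 9 }
</pre>
</blockquote>
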
<p>
This guarantee allows ordering to be enforced with extremely low
overhead to readers, for example:

<blockquote>
<pre>
 1 int x, y;
 2
 3 void thread0(void)
 4 {
 5   rcu_read_lock();
 6   r1 = READ_ONCE(x);
 7   r2 = READ_ONCE(y);
 8   rcu_read_unlock();
 9 }
10
11 void thread1(void)
12 {
13   WRITE_ONCE(x, 1);
14   synchronize_rcu();
15   WRITE_ONCE(y, 1);
16 }
</pre>
</blockquote>

<p>
Because the <tt>synchronize_rcu()</tt> on line 14 waits for
all pre-existing readers, any instance of <tt>thread0()</tt> that
loads a value of zero from <tt>x</tt> must complete before
<tt>thread1()</tt> stores to <tt>y</tt>, so that instance must
also load a value of zero from <tt>y</tt>.
Similarly, any instance of <tt>thread0()</tt> that loads a value of
one from <tt>y</tt> must have started after the
<tt>synchronize_rcu()</tt> started, and must therefore also load
a value of one from <tt>x</tt>.
Therefore, the outcome:
<blockquote>
<pre>
(r1 == 0 && r2 == 1)
</pre>
</blockquote>
cannot happen.

<p>@@QQ@@
Wait a minute!
You said that updaters can make useful forward progress concurrently
with readers, but pre-existing readers will block
<tt>synchronize_rcu()</tt>!!!
Just who are you trying to fool???
<p>@@QQA@@
First, if updaters do not wish to be blocked by readers, they can use
<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will
be discussed later.
Second, even when using <tt>synchronize_rcu()</tt>, the other
update-side code does run concurrently with readers, whether pre-existing
or not.
<p>@@QQE@@

<p>
This scenario resembles one of the first uses of RCU in
<a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>,
which managed a distributed lock manager's transition into
a state suitable for handling recovery from node failure,
more or less as follows:

<blockquote>
<pre>
 1 #define STATE_NORMAL        0
 2 #define STATE_WANT_RECOVERY 1
 3 #define STATE_RECOVERING    2
 4 #define STATE_WANT_NORMAL   3
 5
 6 int state = STATE_NORMAL;
 7
 8 void do_something_dlm(void)
 9 {
10   int state_snap;
11
12   rcu_read_lock();
13   state_snap = READ_ONCE(state);
14   if (state_snap == STATE_NORMAL)
15     do_something();
16   else
17     do_something_carefully();
18   rcu_read_unlock();
19 }
20
21 void start_recovery(void)
22 {
23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
24   synchronize_rcu();
25   WRITE_ONCE(state, STATE_RECOVERING);
26   recovery();
27   WRITE_ONCE(state, STATE_WANT_NORMAL);
28   synchronize_rcu();
29   WRITE_ONCE(state, STATE_NORMAL);
30 }
</pre>
</blockquote>

<p>
The RCU read-side critical section in <tt>do_something_dlm()</tt>
works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt>
to guarantee that <tt>do_something()</tt> never runs concurrently
with <tt>recovery()</tt>, but with little or no synchronization
overhead in <tt>do_something_dlm()</tt>.

<p>@@QQ@@
Why is the <tt>synchronize_rcu()</tt> on line 28 needed?
<p>@@QQA@@
Without that extra grace period, memory reordering could result in
<tt>do_something_dlm()</tt> executing <tt>do_something()</tt>
concurrently with the last bits of <tt>recovery()</tt>.
<p>@@QQE@@

<p>
In order to avoid fatal problems such as deadlocks,
an RCU read-side critical section must not contain calls to
<tt>synchronize_rcu()</tt>.
Similarly, an RCU read-side critical section must not
contain anything that waits, directly or indirectly, on completion of
an invocation of <tt>synchronize_rcu()</tt>.

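<p>
As an illustration, the following minimal sketch shows the forbidden
pattern, which self-deadlocks because the grace period cannot end
until the enclosing reader completes:

<blockquote>
<pre>
 1 void buggy_reader(void)
 2 {
 3   rcu_read_lock();
 4   synchronize_rcu(); /* BUG: waits for all readers, including this one! */
 5   rcu_read_unlock();
 6 }
</pre>
</blockquote>
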
<p>
Although RCU's grace-period guarantee is useful in and of itself, with
<a href="https://lwn.net/Articles/573497/">quite a few use cases</a>,
it would be good to be able to use RCU to coordinate read-side
access to linked data structures.
For this, the grace-period guarantee is not sufficient, as can
be seen in function <tt>add_gp_buggy()</tt> below.
We will look at the reader's code later, but in the meantime, just think of
the reader as locklessly picking up the <tt>gp</tt> pointer,
and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the
<tt>->a</tt> and <tt>->b</tt> fields.

<blockquote>
<pre>
 1 bool add_gp_buggy(int a, int b)
 2 {
 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
 4   if (!p)
 5     return false;
 6   spin_lock(&gp_lock);
 7   if (rcu_access_pointer(gp)) {
 8     spin_unlock(&gp_lock);
 9     return false;
10   }
11   p->a = a;
12   p->b = b;
13   gp = p; /* ORDERING BUG */
14   spin_unlock(&gp_lock);
15   return true;
16 }
</pre>
</blockquote>

<p>
The problem is that both the compiler and weakly ordered CPUs are within
their rights to reorder this code as follows:

<blockquote>
<pre>
 1 bool add_gp_buggy_optimized(int a, int b)
 2 {
 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
 4   if (!p)
 5     return false;
 6   spin_lock(&gp_lock);
 7   if (rcu_access_pointer(gp)) {
 8     spin_unlock(&gp_lock);
 9     return false;
10   }
<b>11   gp = p; /* ORDERING BUG */
12   p->a = a;
13   p->b = b;</b>
14   spin_unlock(&gp_lock);
15   return true;
16 }
</pre>
</blockquote>

<p>
If an RCU reader fetches <tt>gp</tt> just after
<tt>add_gp_buggy_optimized</tt> executes line 11,
it will see garbage in the <tt>->a</tt> and <tt>->b</tt>
fields.
And this is but one of many ways in which compiler and hardware optimizations
could cause trouble.
Therefore, we clearly need some way to prevent the compiler and the CPU from
reordering in this manner, which brings us to the publish-subscribe
guarantee discussed in the next section.

<h3><a name="Publish-Subscribe Guarantee">Publish-Subscribe Guarantee</a></h3>

<p>
RCU's publish-subscribe guarantee allows data to be inserted
into a linked data structure without disrupting RCU readers.
The updater uses <tt>rcu_assign_pointer()</tt> to insert the
new data, and readers use <tt>rcu_dereference()</tt> to
access data, whether new or old.
The following shows an example of insertion:

<blockquote>
<pre>
 1 bool add_gp(int a, int b)
 2 {
 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
 4   if (!p)
 5     return false;
 6   spin_lock(&gp_lock);
 7   if (rcu_access_pointer(gp)) {
 8     spin_unlock(&gp_lock);
 9     return false;
10   }
11   p->a = a;
12   p->b = b;
13   rcu_assign_pointer(gp, p);
14   spin_unlock(&gp_lock);
15   return true;
16 }
</pre>
</blockquote>

<p>
The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually
equivalent to a simple assignment statement, but also guarantees
that its assignment will
happen after the two assignments in lines 11 and 12,
similar to the C11 <tt>memory_order_release</tt> store operation.
It also prevents any number of “interesting” compiler
optimizations, for example, the use of <tt>gp</tt> as a scratch
location immediately preceding the assignment.

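<p>
One way to model this release-store behavior, as a conceptual
sketch only (the real macro also deals with <tt>sparse</tt>
annotations and type checking), is:

<blockquote>
<pre>
 1 /* Conceptual model only, not the real definition. */
 2 #define rcu_assign_pointer(p, v) smp_store_release(&(p), (v))
</pre>
</blockquote>
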
<p>@@QQ@@
But <tt>rcu_assign_pointer()</tt> does nothing to prevent the
two assignments to <tt>p->a</tt> and <tt>p->b</tt>
from being reordered.
Can't that also cause problems?
<p>@@QQA@@
No, it cannot.
The readers cannot see either of these two fields until
the assignment to <tt>gp</tt>, by which time both fields are
fully initialized.
So reordering the assignments
to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly
cause any problems.
<p>@@QQE@@

<p>
It is tempting to assume that the reader need not do anything special
to control its accesses to the RCU-protected data,
as shown in <tt>do_something_gp_buggy()</tt> below:

<blockquote>
<pre>
 1 bool do_something_gp_buggy(void)
 2 {
 3   rcu_read_lock();
 4   p = gp;  /* OPTIMIZATIONS GALORE!!! */
 5   if (p) {
 6     do_something(p->a, p->b);
 7     rcu_read_unlock();
 8     return true;
 9   }
10   rcu_read_unlock();
11   return false;
12 }
</pre>
</blockquote>

<p>
However, this temptation must be resisted because there are a
surprisingly large number of ways that the compiler
(to say nothing of
<a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>)
can trip this code up.
For but one example, if the compiler were short of registers, it
might choose to refetch from <tt>gp</tt> rather than keeping
a separate copy in <tt>p</tt> as follows:

<blockquote>
<pre>
 1 bool do_something_gp_buggy_optimized(void)
 2 {
 3   rcu_read_lock();
 4   if (gp) { /* OPTIMIZATIONS GALORE!!! */
<b> 5     do_something(gp->a, gp->b);</b>
 6     rcu_read_unlock();
 7     return true;
 8   }
 9   rcu_read_unlock();
10   return false;
11 }
</pre>
</blockquote>

<p>
If this function ran concurrently with a series of updates that
replaced the current structure with a new one,
the fetches of <tt>gp->a</tt>
and <tt>gp->b</tt> might well come from two different structures,
which could cause serious confusion.
To prevent this (and much else besides), <tt>do_something_gp()</tt> uses
<tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>:

<blockquote>
<pre>
 1 bool do_something_gp(void)
 2 {
 3   rcu_read_lock();
 4   p = rcu_dereference(gp);
 5   if (p) {
 6     do_something(p->a, p->b);
 7     rcu_read_unlock();
 8     return true;
 9   }
10   rcu_read_unlock();
11   return false;
12 }
</pre>
</blockquote>

<p>
The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha)
memory barriers in the Linux kernel.
Should a
<a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a>
ever appear, then <tt>rcu_dereference()</tt> could be implemented
as a <tt>memory_order_consume</tt> load.
Regardless of the exact implementation, a pointer fetched by
<tt>rcu_dereference()</tt> may not be used outside of the
outermost RCU read-side critical section containing that
<tt>rcu_dereference()</tt>, unless protection of
the corresponding data element has been passed from RCU to some
other synchronization mechanism, most commonly locking or
<a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>.

<p>
In short, updaters use <tt>rcu_assign_pointer()</tt> and readers
use <tt>rcu_dereference()</tt>, and these two RCU API elements
work together to ensure that readers have a consistent view of
newly added data elements.

<p>
Of course, it is also necessary to remove elements from RCU-protected
data structures, for example, using the following process:

<ol>
<li>	Remove the data element from the enclosing structure.
<li>	Wait for all pre-existing RCU read-side critical sections
	to complete (because only pre-existing readers can possibly have
	a reference to the newly removed data element).
<li>	At this point, only the updater has a reference to the
	newly removed data element, so it can safely reclaim
	the data element, for example, by passing it to <tt>kfree()</tt>.
</ol>

This process is implemented by <tt>remove_gp_synchronous()</tt>:

<blockquote>
<pre>
 1 bool remove_gp_synchronous(void)
 2 {
 3   struct foo *p;
 4
 5   spin_lock(&gp_lock);
 6   p = rcu_access_pointer(gp);
 7   if (!p) {
 8     spin_unlock(&gp_lock);
 9     return false;
10   }
11   rcu_assign_pointer(gp, NULL);
12   spin_unlock(&gp_lock);
13   synchronize_rcu();
14   kfree(p);
15   return true;
16 }
</pre>
</blockquote>

<p>
This function is straightforward, with line 13 waiting for a grace
period before line 14 frees the old data element.
This waiting ensures that readers will reach line 7 of
<tt>do_something_gp()</tt> before the data element referenced by
<tt>p</tt> is freed.
The <tt>rcu_access_pointer()</tt> on line 6 is similar to
<tt>rcu_dereference()</tt>, except that:

<ol>
<li>	The value returned by <tt>rcu_access_pointer()</tt>
	cannot be dereferenced.
	If you want to access the value pointed to as well as
	the pointer itself, use <tt>rcu_dereference()</tt>
	instead of <tt>rcu_access_pointer()</tt>.
<li>	The call to <tt>rcu_access_pointer()</tt> need not be
	protected.
	In contrast, <tt>rcu_dereference()</tt> must either be
	within an RCU read-side critical section or in a code
	segment where the pointer cannot change, for example, in
	code protected by the corresponding update-side lock.
</ol>

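<p>
As a minimal sketch of the second point, the following hypothetical
helper tests for presence of the <tt>gp</tt> element without
dereferencing the pointer, and therefore needs neither an RCU
read-side critical section nor <tt>gp_lock</tt>:

<blockquote>
<pre>
 1 bool gp_is_present(void)
 2 {
 3   return rcu_access_pointer(gp) != NULL; /* Pointer value only. */
 4 }
</pre>
</blockquote>
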
<p>@@QQ@@
Without the <tt>rcu_dereference()</tt> or the
<tt>rcu_access_pointer()</tt>, what destructive optimizations
might the compiler make use of?
<p>@@QQA@@
Let's start with what happens to <tt>do_something_gp()</tt>
if it fails to use <tt>rcu_dereference()</tt>.
It could reuse a value formerly fetched from this same pointer.
It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time
manner, resulting in <i>load tearing</i>, in turn resulting in a bytewise
mash-up of two distinct pointer values.
It might even use value-speculation optimizations, where it makes a wrong
guess, but by the time it gets around to checking the value, an update
has changed the pointer to match the wrong guess.
Too bad about any dereferences that returned pre-initialization garbage
in the meantime!

<p>
For <tt>remove_gp_synchronous()</tt>, as long as all modifications
to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>,
the above optimizations are harmless.
However,
with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt>,
<tt>sparse</tt> will complain if you
define <tt>gp</tt> with <tt>__rcu</tt> and then
access it without using
either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>.
<p>@@QQE@@

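<p>
Before moving on, here is a preview sketch of the asynchronous
alternative to <tt>remove_gp_synchronous()</tt> that was promised
earlier; it assumes that <tt>struct foo</tt> embeds a
<tt>struct rcu_head</tt> field named <tt>rh</tt>, and it returns
before the grace period elapses:

<blockquote>
<pre>
 1 static void remove_gp_cb(struct rcu_head *rhp)
 2 {
 3   struct foo *p = container_of(rhp, struct foo, rh);
 4
 5   kfree(p); /* Invoked only after a grace period elapses. */
 6 }
 7
 8 bool remove_gp_asynchronous(void)
 9 {
10   struct foo *p;
11
12   spin_lock(&gp_lock);
13   p = rcu_access_pointer(gp);
14   if (!p) {
15     spin_unlock(&gp_lock);
16     return false;
17   }
18   rcu_assign_pointer(gp, NULL);
19   spin_unlock(&gp_lock);
20   call_rcu(&p->rh, remove_gp_cb); /* Returns immediately. */
21   return true;
22 }
</pre>
</blockquote>
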
<p>
In short, RCU's publish-subscribe guarantee is provided by the combination
of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>.
This guarantee allows data elements to be safely added to RCU-protected
linked data structures without disrupting RCU readers.
This guarantee can be used in combination with the grace-period
guarantee to also allow data elements to be removed from RCU-protected
linked data structures, again without disrupting RCU readers.

<p>
This guarantee was only partially premeditated.
DYNIX/ptx used an explicit memory barrier for publication, but had nothing
resembling <tt>rcu_dereference()</tt> for subscription, nor did it
have anything resembling the <tt>smp_read_barrier_depends()</tt>
that was later subsumed into <tt>rcu_dereference()</tt>.
The need for these operations made itself known quite suddenly at a
late-1990s meeting with the DEC Alpha architects, back in the days when
DEC was still a free-standing company.
It took the Alpha architects a good hour to convince me that any sort
of barrier would ever be needed, and it then took me a good <i>two</i> hours
to convince them that their documentation did not make this point clear.
More recent work with the C and C++ standards committees has provided
much education on tricks and traps from the compiler.
In short, compilers were much less tricky in the early 1990s, but in
2015, don't even think about omitting <tt>rcu_dereference()</tt>!

<h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3>

<p>
The previous section's simple linked-data-structure scenario clearly
demonstrates the need for RCU's stringent memory-ordering guarantees on
systems with more than one CPU:

<ol>
<li>	Each CPU that has an RCU read-side critical section that
	begins before <tt>synchronize_rcu()</tt> starts is
	guaranteed to execute a full memory barrier between the time
	that the RCU read-side critical section ends and the time that
	<tt>synchronize_rcu()</tt> returns.
	Without this guarantee, a pre-existing RCU read-side critical section
	might hold a reference to the newly removed <tt>struct foo</tt>
	after the <tt>kfree()</tt> on line 14 of
	<tt>remove_gp_synchronous()</tt>.
<li>	Each CPU that has an RCU read-side critical section that ends
	after <tt>synchronize_rcu()</tt> returns is guaranteed
	to execute a full memory barrier between the time that
	<tt>synchronize_rcu()</tt> begins and the time that the RCU
	read-side critical section begins.
	Without this guarantee, a later RCU read-side critical section
	running after the <tt>kfree()</tt> on line 14 of
	<tt>remove_gp_synchronous()</tt> might
	later run <tt>do_something_gp()</tt> and find the
	newly deleted <tt>struct foo</tt>.
<li>	If the task invoking <tt>synchronize_rcu()</tt> remains
	on a given CPU, then that CPU is guaranteed to execute a full
	memory barrier sometime during the execution of
	<tt>synchronize_rcu()</tt>.
	This guarantee ensures that the <tt>kfree()</tt> on
	line 14 of <tt>remove_gp_synchronous()</tt> really does
	execute after the removal on line 11.
<li>	If the task invoking <tt>synchronize_rcu()</tt> migrates
	among a group of CPUs during that invocation, then each of the
	CPUs in that group is guaranteed to execute a full memory barrier
	sometime during the execution of <tt>synchronize_rcu()</tt>.
	This guarantee ensures that the <tt>kfree()</tt> on
	line 14 of <tt>remove_gp_synchronous()</tt> really does
	execute after the removal on
	line 11, even in the case where the thread executing the
	<tt>synchronize_rcu()</tt> migrates in the meantime.
</ol>

<p>@@QQ@@
Given that multiple CPUs can start RCU read-side critical sections
at any time without any ordering whatsoever, how can RCU possibly tell whether
or not a given RCU read-side critical section starts before a
given instance of <tt>synchronize_rcu()</tt>?
<p>@@QQA@@
If RCU cannot tell whether or not a given
RCU read-side critical section starts before a
given instance of <tt>synchronize_rcu()</tt>,
then it must assume that the RCU read-side critical section
started first.
In other words, a given instance of <tt>synchronize_rcu()</tt>
can avoid waiting on a given RCU read-side critical section only
if it can prove that <tt>synchronize_rcu()</tt> started first.
<p>@@QQE@@

<p>@@QQ@@
The first and second guarantees require unbelievably strict ordering!
Are all these memory barriers <i>really</i> required?
<p>@@QQA@@
Yes, they really are required.
To see why the first guarantee is required, consider the following
sequence of events:

<ol>
<li>	CPU 1: <tt>rcu_read_lock()</tt>
<li>	CPU 1: <tt>q = rcu_dereference(gp);
	/* Very likely to return p. */</tt>
<li>	CPU 0: <tt>list_del_rcu(p);</tt>
<li>	CPU 0: <tt>synchronize_rcu()</tt> starts.
<li>	CPU 1: <tt>do_something_with(q->a);
	/* No smp_mb(), so might happen after kfree(). */</tt>
<li>	CPU 1: <tt>rcu_read_unlock()</tt>
<li>	CPU 0: <tt>synchronize_rcu()</tt> returns.
<li>	CPU 0: <tt>kfree(p);</tt>
</ol>

<p>
Therefore, there absolutely must be a full memory barrier between the
end of the RCU read-side critical section and the end of the
grace period.

<p>
The sequence of events demonstrating the necessity of the second rule
is roughly similar:

<ol>
<li>	CPU 0: <tt>list_del_rcu(p);</tt>
<li>	CPU 0: <tt>synchronize_rcu()</tt> starts.
<li>	CPU 1: <tt>rcu_read_lock()</tt>
<li>	CPU 1: <tt>q = rcu_dereference(gp);
	/* Might return p if no memory barrier. */</tt>
<li>	CPU 0: <tt>synchronize_rcu()</tt> returns.
<li>	CPU 0: <tt>kfree(p);</tt>
<li>	CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt>
<li>	CPU 1: <tt>rcu_read_unlock()</tt>
</ol>

<p>
And similarly, without a memory barrier between the beginning of the
grace period and the beginning of the RCU read-side critical section,
CPU 1 might end up accessing the freelist.

<p>
The “as if” rule of course applies, so that any implementation
that acts as if the appropriate memory barriers were in place is a
correct implementation.
That said, it is much easier to fool yourself into believing that you have
adhered to the as-if rule than it is to actually adhere to it!
<p>@@QQE@@

<p>
Note that these memory-barrier requirements do not replace the fundamental
RCU requirement that a grace period wait for all pre-existing readers.
On the contrary, the memory barriers called out in this section must operate in
such a way as to <i>enforce</i> this fundamental requirement.
Of course, different implementations enforce this requirement in different
ways, but enforce it they must.

<h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3>

<p>
The common-case RCU primitives are unconditional.
They are invoked, they do their job, and they return, with no possibility
of error, and no need to retry.
This is a key RCU design philosophy.

<p>
However, this philosophy is pragmatic rather than pigheaded.
If someone comes up with a good justification for a particular conditional
RCU primitive, it might well be implemented and added.
After all, this guarantee was reverse-engineered, not premeditated.
The unconditional nature of the RCU primitives was initially an
accident of implementation, and later experience with conditional
primitives in other synchronization mechanisms caused me to elevate this
accident to a guarantee.
Therefore, the justification for adding a conditional primitive to
RCU would need to be based on detailed and compelling use cases.

<h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3>

<p>
As far as RCU is concerned, it is always possible to carry out an
update within an RCU read-side critical section.
For example, that RCU read-side critical section might search for
a given data element, and then might acquire the update-side
spinlock in order to update that element, all while remaining
in that RCU read-side critical section.
Of course, it is necessary to exit the RCU read-side critical section
before invoking <tt>synchronize_rcu()</tt>; however, this
inconvenience can be avoided through use of the
<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members
described later in this document.

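<p>
Here is a minimal sketch of this pattern, reusing <tt>gp</tt> and
<tt>gp_lock</tt> from the earlier examples; a production version
would typically also recheck, after acquiring the lock, that the
element is still the current one:

<blockquote>
<pre>
 1 bool upgrade_and_update(int new_a)
 2 {
 3   struct foo *p;
 4
 5   rcu_read_lock();
 6   p = rcu_dereference(gp);
 7   if (!p) {
 8     rcu_read_unlock();
 9     return false;
10   }
11   spin_lock(&gp_lock); /* Upgrade to write while still in the reader. */
12   p->a = new_a;
13   spin_unlock(&gp_lock);
14   rcu_read_unlock();
15   return true;
16 }
</pre>
</blockquote>
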
<p>@@QQ@@
But how does the upgrade-to-write operation exclude other readers?
<p>@@QQA@@
It doesn't, just like normal RCU updates, which also do not exclude
RCU readers.
<p>@@QQE@@

<p>
This guarantee allows lookup code to be shared between read-side
and update-side code, and was premeditated, appearing in the earliest
DYNIX/ptx RCU documentation.

<h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2>

<p>
RCU provides extremely lightweight readers, and its read-side guarantees,
though quite useful, are correspondingly lightweight.
It is therefore all too easy to assume that RCU is guaranteeing more
than it really is.
Of course, the list of things that RCU does not guarantee is infinitely
long; however, the following sections list a few non-guarantees that
have caused confusion.
Except where otherwise noted, these non-guarantees were premeditated.

<ol>
<li>	<a href="#Readers Impose Minimal Ordering">
	Readers Impose Minimal Ordering</a>
<li>	<a href="#Readers Do Not Exclude Updaters">
	Readers Do Not Exclude Updaters</a>
<li>	<a href="#Updaters Only Wait For Old Readers">
	Updaters Only Wait For Old Readers</a>
<li>	<a href="#Grace Periods Don't Partition Read-Side Critical Sections">
	Grace Periods Don't Partition Read-Side Critical Sections</a>
<li>	<a href="#Read-Side Critical Sections Don't Partition Grace Periods">
	Read-Side Critical Sections Don't Partition Grace Periods</a>
<li>	<a href="#Disabling Preemption Does Not Block Grace Periods">
	Disabling Preemption Does Not Block Grace Periods</a>
</ol>

<h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3>

<p>
Reader-side markers such as <tt>rcu_read_lock()</tt> and
<tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees
except through their interaction with the grace-period APIs such as
<tt>synchronize_rcu()</tt>.
To see this, consider the following pair of threads:

<blockquote>
<pre>
 1 void thread0(void)
 2 {
 3   rcu_read_lock();
 4   WRITE_ONCE(x, 1);
 5   rcu_read_unlock();
 6   rcu_read_lock();
 7   WRITE_ONCE(y, 1);
 8   rcu_read_unlock();
 9 }
10
11 void thread1(void)
12 {
13   rcu_read_lock();
14   r1 = READ_ONCE(y);
15   rcu_read_unlock();
16   rcu_read_lock();
17   r2 = READ_ONCE(x);
18   rcu_read_unlock();
19 }
</pre>
</blockquote>

<p>
After <tt>thread0()</tt> and <tt>thread1()</tt> execute
concurrently, it is quite possible to have

<blockquote>
<pre>
(r1 == 1 && r2 == 0)
</pre>
</blockquote>

(that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>),
which would not be possible if <tt>rcu_read_lock()</tt> and
<tt>rcu_read_unlock()</tt> had much in the way of ordering
properties.
But they do not, so the CPU is within its rights
to do significant reordering.
This is by design: Any significant ordering constraints would slow down
these fast-path APIs.

<p>@@QQ@@
Can't the compiler also reorder this code?
<p>@@QQA@@
No, the volatile casts in <tt>READ_ONCE()</tt> and
<tt>WRITE_ONCE()</tt> prevent the compiler from reordering in
this particular case.
<p>@@QQE@@

<h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3>

<p>
Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt>
excludes updates.
All they do is to prevent grace periods from ending.
The following example illustrates this:

<blockquote>
<pre>
 1 void thread0(void)
 2 {
 3   rcu_read_lock();
 4   r1 = READ_ONCE(y);
 5   if (r1) {
 6     do_something_with_nonzero_x();
 7     r2 = READ_ONCE(x);
 8     WARN_ON(!r2); /* BUG!!! */
 9   }
10   rcu_read_unlock();
11 }
12
13 void thread1(void)
14 {
15   spin_lock(&my_lock);
16   WRITE_ONCE(x, 1);
17   WRITE_ONCE(y, 1);
18   spin_unlock(&my_lock);
19 }
</pre>
</blockquote>

<p>
If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt>
excluded the <tt>thread1()</tt> function's update,
the <tt>WARN_ON()</tt> could never fire.
But the fact is that <tt>rcu_read_lock()</tt> does not exclude
much of anything aside from subsequent grace periods, of which
<tt>thread1()</tt> has none, so the
<tt>WARN_ON()</tt> can and does fire.

<h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3>

<p>
It might be tempting to assume that after <tt>synchronize_rcu()</tt>
completes, there are no readers executing.
This temptation must be avoided because
new readers can start immediately after <tt>synchronize_rcu()</tt>
starts, and <tt>synchronize_rcu()</tt> is under no
obligation to wait for these new readers.

<p>@@QQ@@
Suppose that synchronize_rcu() did wait until all readers had completed.
Would the updater be able to rely on this?
<p>@@QQA@@
No.
Even if <tt>synchronize_rcu()</tt> were to wait until
all readers had completed, a new reader might start immediately after
<tt>synchronize_rcu()</tt> completed.
Therefore, the code following
<tt>synchronize_rcu()</tt> cannot rely on there being no readers
in any case.
<p>@@QQE@@

<h3><a name="Grace Periods Don't Partition Read-Side Critical Sections">
Grace Periods Don't Partition Read-Side Critical Sections</a></h3>

<p>
It is tempting to assume that if any part of one RCU read-side critical
section precedes a given grace period, and if any part of another RCU
read-side critical section follows that same grace period, then all of
the first RCU read-side critical section must precede all of the second.
However, this just isn't the case: A single grace period does not
partition the set of RCU read-side critical sections.
An example of this situation can be illustrated as follows, where
<tt>a</tt>, <tt>b</tt>, and <tt>c</tt> are initially all zero:

<blockquote>
<pre>
 1 void thread0(void)
 2 {
 3   rcu_read_lock();
 4   WRITE_ONCE(a, 1);
 5   WRITE_ONCE(b, 1);
 6   rcu_read_unlock();
 7 }
 8
 9 void thread1(void)
10 {
11   r1 = READ_ONCE(a);
12   synchronize_rcu();
13   WRITE_ONCE(c, 1);
14 }
15
16 void thread2(void)
17 {
18   rcu_read_lock();
19   r2 = READ_ONCE(b);
20   r3 = READ_ONCE(c);
21   rcu_read_unlock();
22 }
</pre>
</blockquote>

<p>
It turns out that the outcome:

<blockquote>
<pre>
(r1 == 1 && r2 == 0 && r3 == 1)
</pre>
</blockquote>

is entirely possible.
The following figure shows how this can happen, with each circled
<tt>QS</tt> indicating the point at which RCU recorded a
<i>quiescent state</i> for each thread, that is, a state in which
RCU knows that the thread cannot be in the midst of an RCU read-side
critical section that started before the current grace period:

<p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p>

<p>
If it is necessary to partition RCU read-side critical sections in this
manner, it is necessary to use two grace periods, where the first
grace period is known to end before the second grace period starts:

<blockquote>
<pre>
 1 void thread0(void)
 2 {
 3   rcu_read_lock();
 4   WRITE_ONCE(a, 1);
 5   WRITE_ONCE(b, 1);
 6   rcu_read_unlock();
 7 }
 8
 9 void thread1(void)
10 {
11   r1 = READ_ONCE(a);
12   synchronize_rcu();
13   WRITE_ONCE(c, 1);
14 }
15
16 void thread2(void)
17 {
18   r2 = READ_ONCE(c);
19   synchronize_rcu();
20   WRITE_ONCE(d, 1);
21 }
22
23 void thread3(void)
24 {
25   rcu_read_lock();
26   r3 = READ_ONCE(b);
27   r4 = READ_ONCE(d);
28   rcu_read_unlock();
29 }
</pre>
</blockquote>

<p>
Here, if <tt>(r1 == 1)</tt>, then
<tt>thread0()</tt>'s write to <tt>b</tt> must happen
before the end of <tt>thread1()</tt>'s grace period.
If in addition <tt>(r4 == 1)</tt>, then
<tt>thread3()</tt>'s read from <tt>b</tt> must happen
after the beginning of <tt>thread2()</tt>'s grace period.
If it is also the case that <tt>(r2 == 1)</tt>, then the
end of <tt>thread1()</tt>'s grace period must precede the
beginning of <tt>thread2()</tt>'s grace period.
This means that the two RCU read-side critical sections cannot overlap,
guaranteeing that <tt>(r3 == 1)</tt>.
As a result, the outcome:

<blockquote>
<pre>
(r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)
</pre>
</blockquote>

cannot happen.

<p>
This non-requirement was also non-premeditated, but became apparent
when studying RCU's interaction with memory ordering.

<h3><a name="Read-Side Critical Sections Don't Partition Grace Periods">
Read-Side Critical Sections Don't Partition Grace Periods</a></h3>

<p>
It is also tempting to assume that if an RCU read-side critical section
happens between a pair of grace periods, then those grace periods cannot
overlap.
However, this temptation leads nowhere good, as can be illustrated by
the following, with all variables initially zero:

<blockquote>
<pre>
 1 void thread0(void)
 2 {
 3   rcu_read_lock();
 4   WRITE_ONCE(a, 1);
 5   WRITE_ONCE(b, 1);
 6   rcu_read_unlock();
 7 }
 8
 9 void thread1(void)
10 {
11   r1 = READ_ONCE(a);
12   synchronize_rcu();
13   WRITE_ONCE(c, 1);
14 }
15
16 void thread2(void)
17 {
18   rcu_read_lock();
19   WRITE_ONCE(d, 1);
20   r2 = READ_ONCE(c);
21   rcu_read_unlock();
22 }
23
24 void thread3(void)
25 {
26   r3 = READ_ONCE(d);
27   synchronize_rcu();
28   WRITE_ONCE(e, 1);
29 }
30
31 void thread4(void)
32 {
33   rcu_read_lock();
34   r4 = READ_ONCE(b);
35   r5 = READ_ONCE(e);
36   rcu_read_unlock();
37 }
</pre>
</blockquote>

<p>
In this case, the outcome:

<blockquote>
<pre>
(r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)
</pre>
</blockquote>

is entirely possible, as illustrated below:

<p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p>

<p>
Again, an RCU read-side critical section can overlap almost all of a
given grace period, just so long as it does not overlap the entire
grace period.
As a result, an RCU read-side critical section cannot partition a pair
of RCU grace periods.

<p>@@QQ@@
How many grace periods, each separated by an RCU read-side
critical section, would be required to partition the RCU read-side
critical sections at the beginning and end of the chain?
<p>@@QQA@@
In theory, an infinite number.
In practice, an unknown number that is sensitive to both implementation
details and timing considerations.
Therefore, even in practice, RCU users must abide by the theoretical rather
than the practical answer.
<p>@@QQE@@

<h3><a name="Disabling Preemption Does Not Block Grace Periods">
Disabling Preemption Does Not Block Grace Periods</a></h3>

<p>
There was a time when disabling preemption on any given CPU would block
subsequent grace periods.
However, this was an accident of implementation and is not a requirement.
And in the current Linux-kernel implementation, disabling preemption
on a given CPU in fact does not block grace periods, as Oleg Nesterov
<a href="https://lkml.kernel.org/g/20150614193825.GA19582@redhat.com">demonstrated</a>.

<p>
If you need a preempt-disable region to block grace periods, you need to add
<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>, for example
as follows:

<blockquote>
<pre>
 1 preempt_disable();
 2 rcu_read_lock();
 3 do_something();
 4 rcu_read_unlock();
 5 preempt_enable();
 6
 7 /* Spinlocks implicitly disable preemption. */
 8 spin_lock(&mylock);
 9 rcu_read_lock();
10 do_something();
11 rcu_read_unlock();
12 spin_unlock(&mylock);
</pre>
</blockquote>

<p>
In theory, you could enter the RCU read-side critical section first,
but it is more efficient to keep the entire RCU read-side critical
section contained in the preempt-disable region as shown above.
Of course, RCU read-side critical sections that extend outside of
preempt-disable regions will work correctly, but such critical sections
can be preempted, which forces <tt>rcu_read_unlock()</tt> to do
more work.
And no, this is <i>not</i> an invitation to enclose all of your RCU
read-side critical sections within preempt-disable regions, because
doing so would degrade real-time response.

<p>
This non-requirement appeared with preemptible RCU.
If you need a grace period that waits on non-preemptible code regions, use
<a href="#Sched Flavor">RCU-sched</a>.

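<p>
For reference, here is a minimal sketch of the RCU-sched approach
as of this writing, in which <tt>preempt_disable()</tt> marks an
RCU-sched read-side critical section and <tt>synchronize_sched()</tt>
waits for all pre-existing such regions:

<blockquote>
<pre>
 1 /* Reader: any preempt-disable region. */
 2 preempt_disable();
 3 do_something();
 4 preempt_enable();
 5
 6 /* Updater: waits for all pre-existing preempt-disable regions. */
 7 synchronize_sched();
</pre>
</blockquote>
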
<h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2>

<p>
These parallelism facts of life are by no means specific to RCU, but
the RCU implementation must abide by them.
They therefore bear repeating:

<ol>
<li>	Any CPU or task may be delayed at any time,
	and any attempts to avoid these delays by disabling
	preemption, interrupts, or whatever are completely futile.
	This is most obvious in preemptible user-level
	environments and in virtualized environments (where
	a given guest OS's VCPUs can be preempted at any time by
	the underlying hypervisor), but can also happen in bare-metal
	environments due to ECC errors, NMIs, and other hardware
	events.
	Although a delay of more than about 20 seconds can result
	in splats, the RCU implementation is obligated to use
	algorithms that can tolerate extremely long delays, but where
	“extremely long” is not long enough to allow
	wrap-around when incrementing a 64-bit counter.
<li>	Both the compiler and the CPU can reorder memory accesses.
	Where it matters, RCU must use compiler directives and
	memory-barrier instructions to preserve ordering.
<li>	Conflicting writes to memory locations in any given cache line
	will result in expensive cache misses.
	Greater numbers of concurrent writes and more-frequent
	concurrent writes will result in more dramatic slowdowns.
	RCU is therefore obligated to use algorithms that have
	sufficient locality to avoid significant performance and
	scalability problems.
<li>	As a rough rule of thumb, only one CPU's worth of processing
	may be carried out under the protection of any given exclusive
	lock.
	RCU must therefore use scalable locking designs.
<li>	Counters are finite, especially on 32-bit systems.
	RCU's use of counters must therefore tolerate counter wrap,
	or be designed such that counter wrap would take way more
	time than a single system is likely to run.
	An uptime of ten years is quite possible, a runtime
	of a century much less so.
	As an example of the latter, RCU's dyntick-idle nesting counter
	allows 54 bits for interrupt nesting level (this counter
	is 64 bits even on a 32-bit system).
	Overflowing this counter requires 2<sup>54</sup>
	half-interrupts on a given CPU without that CPU ever going idle.
	If a half-interrupt happened every microsecond, it would take
	570 years of runtime to overflow this counter, which is currently
	believed to be an acceptably long time (see the back-of-envelope
	check following this list).
<li>	Linux systems can have thousands of CPUs running a single
	Linux kernel in a single shared-memory environment.
	RCU must therefore pay close attention to high-end scalability.
</ol>

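<p>
As promised, here is a back-of-envelope check of the 570-year figure
from the dyntick-idle example above, written as a standalone program:

<blockquote>
<pre>
 1 #include &lt;stdio.h&gt;
 2
 3 int main(void)
 4 {
 5   double half_irqs = (double)(1ULL &lt;&lt; 54); /* 2^54 half-interrupts. */
 6   double per_sec = 1e6;                    /* One per microsecond. */
 7   double sec_per_year = 365.25 * 24 * 3600;
 8
 9   printf("%.0f years\n", half_irqs / per_sec / sec_per_year);
10   return 0; /* Prints approximately 571 years. */
11 }
</pre>
</blockquote>
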
<p>
This last parallelism fact of life means that RCU must pay special
attention to the preceding facts of life.
The idea that Linux might scale to systems with thousands of CPUs would
have been met with some skepticism in the 1990s, but these requirements
would otherwise have been unsurprising, even in the early 1990s.

<h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2>

<p>
These sections list quality-of-implementation requirements.
Although an RCU implementation that ignores these requirements could
still be used, it would likely be subject to limitations that would
make it inappropriate for industrial-strength production use.
Classes of quality-of-implementation requirements are as follows:

<ol>
<li>	<a href="#Specialization">Specialization</a>
<li>	<a href="#Performance and Scalability">Performance and Scalability</a>
<li>	<a href="#Composability">Composability</a>
<li>	<a href="#Corner Cases">Corner Cases</a>
</ol>

<p>
These classes are covered in the following sections.

<h3><a name="Specialization">Specialization</a></h3>

<p>
RCU is and always has been intended primarily for read-mostly situations, as
illustrated by the following figure.
This means that RCU's read-side primitives are optimized, often at the
expense of its update-side primitives.

<p><img src="RCUApplicability.svg" alt="RCUApplicability.svg" width="70%"></p>

<p>
This focus on read-mostly situations means that RCU must interoperate
with other synchronization primitives.
For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt>
examples discussed earlier use RCU to protect readers and locking to
coordinate updaters.
However, the need extends much farther, requiring that a variety of
synchronization primitives be legal within RCU read-side critical sections,
including spinlocks, sequence locks, atomic operations, reference
counters, and memory barriers.

<p>@@QQ@@
What about sleeping locks?
<p>@@QQA@@
These are forbidden within Linux-kernel RCU read-side critical sections
because it is not legal to place a quiescent state (in this case,
voluntary context switch) within an RCU read-side critical section.
However, sleeping locks may be used within userspace RCU read-side critical
sections, and also within Linux-kernel sleepable RCU
<a href="#Sleepable RCU">(SRCU)</a>
read-side critical sections.
In addition, the -rt patchset turns spinlocks into sleeping locks so
that the corresponding critical sections can be preempted, which
also means that these sleeplockified spinlocks (but not other sleeping locks!)
may be acquired within -rt-Linux-kernel RCU read-side critical sections.

<p>
Note that it <i>is</i> legal for a normal RCU read-side critical section
to conditionally acquire a sleeping lock (as in <tt>mutex_trylock()</tt>),
but only as long as it does not loop indefinitely attempting to
conditionally acquire that sleeping lock.
The key point is that things like <tt>mutex_trylock()</tt>
either return with the mutex held, or return an error indication if
the mutex was not immediately available.
Either way, <tt>mutex_trylock()</tt> returns immediately without sleeping.
<p>@@QQE@@

<p>
It often comes as a surprise that many algorithms do not require a
consistent view of data, but can instead function on stale or
inconsistent values, with network routing being the poster child.
Internet routing algorithms take significant time to propagate
updates, so that by the time an update arrives at a given system,
that system has been sending network traffic the wrong way for
a considerable length of time.
Having a few threads continue to send traffic the wrong way for a
few more milliseconds is clearly not a problem: In the worst case,
TCP retransmissions will eventually get the data where it needs to go.
In general, when tracking the state of the universe outside of the
computer, some level of inconsistency must be tolerated due to
speed-of-light delays if nothing else.

| 1291 | <p> |
| 1292 | Furthermore, uncertainty about external state is inherent in many cases. |
| 1293 | For example, a pair of veternarians might use heartbeat to determine |
| 1294 | whether or not a given cat was alive. |
| 1295 | But how long should they wait after the last heartbeat to decide that |
| 1296 | the cat is in fact dead? |
| 1297 | Waiting less than 400 milliseconds makes no sense because this would |
| 1298 | mean that a relaxed cat would be considered to cycle between death |
| 1299 | and life more than 100 times per minute. |
| 1300 | Moreover, just as with human beings, a cat's heart might stop for |
| 1301 | some period of time, so the exact wait period is a judgment call. |
One of our pair of veterinarians might wait 30 seconds before pronouncing
the cat dead, while the other might insist on waiting a full minute.
The two veterinarians would then disagree on the state of the cat during
| 1305 | the final 30 seconds of the minute following the last heartbeat, as |
| 1306 | fancifully illustrated below: |
| 1307 | |
| 1308 | <p><img src="2013-08-is-it-dead.png" alt="2013-08-is-it-dead.png" width="431"></p> |
| 1309 | |
| 1310 | <p> |
| 1311 | Interestingly enough, this same situation applies to hardware. |
| 1312 | When push comes to shove, how do we tell whether or not some |
| 1313 | external server has failed? |
| 1314 | We send messages to it periodically, and declare it failed if we |
| 1315 | don't receive a response within a given period of time. |
| 1316 | Policy decisions can usually tolerate short |
| 1317 | periods of inconsistency. |
| 1318 | The policy was decided some time ago, and is only now being put into |
| 1319 | effect, so a few milliseconds of delay is normally inconsequential. |
| 1320 | |
| 1321 | <p> |
| 1322 | However, there are algorithms that absolutely must see consistent data. |
| 1323 | For example, the translation between a user-level SystemV semaphore |
| 1324 | ID to the corresponding in-kernel data structure is protected by RCU, |
| 1325 | but it is absolutely forbidden to update a semaphore that has just been |
| 1326 | removed. |
| 1327 | In the Linux kernel, this need for consistency is accommodated by acquiring |
| 1328 | spinlocks located in the in-kernel data structure from within |
| 1329 | the RCU read-side critical section, and this is indicated by the |
| 1330 | green box in the figure above. |
| 1331 | Many other techniques may be used, and are in fact used within the |
| 1332 | Linux kernel. |
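
<p>
A sketch of this pattern appears below.
(The <tt>sem</tt> structure, its <tt>-&gt;deleted</tt> flag, and the
<tt>sem_lookup()</tt> function are hypothetical stand-ins for the
actual SystemV semaphore code.)

<blockquote>
<pre>
struct sem {
  spinlock_t lock;
  bool deleted;   /* Set under -&gt;lock before the RCU-delayed free. */
  int value;
};

int sem_update(int id, int value)
{
  struct sem *s;
  int ret = -EINVAL;

  rcu_read_lock();
  s = sem_lookup(id);        /* Hypothetical RCU-protected lookup. */
  if (s) {
    spin_lock(&s-&gt;lock);     /* The lock in the in-kernel structure. */
    if (!s-&gt;deleted) {       /* Removal sets -&gt;deleted under -&gt;lock, */
      s-&gt;value = value;      /* so this update cannot touch a */
      ret = 0;               /* semaphore that has been removed. */
    }
    spin_unlock(&s-&gt;lock);
  }
  rcu_read_unlock();
  return ret;
}
</pre>
</blockquote>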
| 1333 | |
| 1334 | <p> |
| 1335 | In short, RCU is not required to maintain consistency, and other |
| 1336 | mechanisms may be used in concert with RCU when consistency is required. |
| 1337 | RCU's specialization allows it to do its job extremely well, and its |
| 1338 | ability to interoperate with other synchronization mechanisms allows |
| 1339 | the right mix of synchronization tools to be used for a given job. |
| 1340 | |
| 1341 | <h3><a name="Performance and Scalability">Performance and Scalability</a></h3> |
| 1342 | |
| 1343 | <p> |
| 1344 | Energy efficiency is a critical component of performance today, |
| 1345 | and Linux-kernel RCU implementations must therefore avoid unnecessarily |
| 1346 | awakening idle CPUs. |
| 1347 | I cannot claim that this requirement was premeditated. |
| 1348 | In fact, I learned of it during a telephone conversation in which I |
| 1349 | was given “frank and open” feedback on the importance |
| 1350 | of energy efficiency in battery-powered systems and on specific |
| 1351 | energy-efficiency shortcomings of the Linux-kernel RCU implementation. |
| 1352 | In my experience, the battery-powered embedded community will consider |
| 1353 | any unnecessary wakeups to be extremely unfriendly acts. |
| 1354 | So much so that mere Linux-kernel-mailing-list posts are |
| 1355 | insufficient to vent their ire. |
| 1356 | |
| 1357 | <p> |
Memory consumption is not particularly important in most
situations, and has become decreasingly
| 1360 | so as memory sizes have expanded and memory |
| 1361 | costs have plummeted. |
| 1362 | However, as I learned from Matt Mackall's |
| 1363 | <a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a> |
| 1364 | efforts, memory footprint is critically important on single-CPU systems with |
| 1365 | non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus |
| 1366 | <a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a> |
| 1367 | was born. |
| 1368 | Josh Triplett has since taken over the small-memory banner with his |
| 1369 | <a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a> |
| 1370 | project, which resulted in |
| 1371 | <a href="#Sleepable RCU">SRCU</a> |
| 1372 | becoming optional for those kernels not needing it. |
| 1373 | |
| 1374 | <p> |
| 1375 | The remaining performance requirements are, for the most part, |
| 1376 | unsurprising. |
| 1377 | For example, in keeping with RCU's read-side specialization, |
| 1378 | <tt>rcu_dereference()</tt> should have negligible overhead (for |
| 1379 | example, suppression of a few minor compiler optimizations). |
| 1380 | Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and |
| 1381 | <tt>rcu_read_unlock()</tt> should have exactly zero overhead. |
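
<p>
For example, in <tt>CONFIG_PREEMPT=n</tt> kernels built without debugging,
the read-side markers can reduce to something like the following sketch
(simplified from the actual Linux-kernel definitions), in which
<tt>barrier()</tt> is a compiler directive that emits no instructions:

<blockquote>
<pre>
static inline void rcu_read_lock(void)
{
  preempt_disable();  /* Just barrier() when CONFIG_PREEMPT=n, */
}                     /* so the generated code is empty. */

static inline void rcu_read_unlock(void)
{
  preempt_enable();   /* Likewise a no-op in this configuration. */
}
</pre>
</blockquote>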
| 1382 | |
| 1383 | <p> |
| 1384 | In preemptible environments, in the case where the RCU read-side |
| 1385 | critical section was not preempted (as will be the case for the |
| 1386 | highest-priority real-time process), <tt>rcu_read_lock()</tt> and |
| 1387 | <tt>rcu_read_unlock()</tt> should have minimal overhead. |
| 1388 | In particular, they should not contain atomic read-modify-write |
| 1389 | operations, memory-barrier instructions, preemption disabling, |
| 1390 | interrupt disabling, or backwards branches. |
| 1391 | However, in the case where the RCU read-side critical section was preempted, |
| 1392 | <tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts. |
| 1393 | This is why it is better to nest an RCU read-side critical section |
| 1394 | within a preempt-disable region than vice versa, at least in cases |
| 1395 | where that critical section is short enough to avoid unduly degrading |
| 1396 | real-time latencies. |
| 1397 | |
| 1398 | <p> |
| 1399 | The <tt>synchronize_rcu()</tt> grace-period-wait primitive is |
| 1400 | optimized for throughput. |
| 1401 | It may therefore incur several milliseconds of latency in addition to |
| 1402 | the duration of the longest RCU read-side critical section. |
| 1403 | On the other hand, multiple concurrent invocations of |
| 1404 | <tt>synchronize_rcu()</tt> are required to use batching optimizations |
| 1405 | so that they can be satisfied by a single underlying grace-period-wait |
| 1406 | operation. |
| 1407 | For example, in the Linux kernel, it is not unusual for a single |
| 1408 | grace-period-wait operation to serve more than |
| 1409 | <a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a> |
| 1410 | of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation |
| 1411 | overhead down to nearly zero. |
| 1412 | However, the grace-period optimization is also required to avoid |
| 1413 | measurable degradation of real-time scheduling and interrupt latencies. |
| 1414 | |
| 1415 | <p> |
| 1416 | In some cases, the multi-millisecond <tt>synchronize_rcu()</tt> |
| 1417 | latencies are unacceptable. |
| 1418 | In these cases, <tt>synchronize_rcu_expedited()</tt> may be used |
| 1419 | instead, reducing the grace-period latency down to a few tens of |
| 1420 | microseconds on small systems, at least in cases where the RCU read-side |
| 1421 | critical sections are short. |
| 1422 | There are currently no special latency requirements for |
| 1423 | <tt>synchronize_rcu_expedited()</tt> on large systems, but, |
| 1424 | consistent with the empirical nature of the RCU specification, |
| 1425 | that is subject to change. |
| 1426 | However, there most definitely are scalability requirements: |
| 1427 | A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096 |
| 1428 | CPUs should at least make reasonable forward progress. |
| 1429 | In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt> |
| 1430 | is permitted to impose modest degradation of real-time latency |
| 1431 | on non-idle online CPUs. |
| 1432 | That said, it will likely be necessary to take further steps to reduce this |
| 1433 | degradation, hopefully to roughly that of a scheduling-clock interrupt. |
| 1434 | |
| 1435 | <p> |
| 1436 | There are a number of situations where even |
| 1437 | <tt>synchronize_rcu_expedited()</tt>'s reduced grace-period |
| 1438 | latency is unacceptable. |
| 1439 | In these situations, the asynchronous <tt>call_rcu()</tt> can be |
| 1440 | used in place of <tt>synchronize_rcu()</tt> as follows: |
| 1441 | |
| 1442 | <blockquote> |
| 1443 | <pre> |
| 1444 | 1 struct foo { |
| 1445 | 2 int a; |
| 1446 | 3 int b; |
| 1447 | 4 struct rcu_head rh; |
| 1448 | 5 }; |
| 1449 | 6 |
| 1450 | 7 static void remove_gp_cb(struct rcu_head *rhp) |
| 1451 | 8 { |
| 1452 | 9 struct foo *p = container_of(rhp, struct foo, rh); |
| 1453 | 10 |
| 1454 | 11 kfree(p); |
| 1455 | 12 } |
| 1456 | 13 |
| 1457 | 14 bool remove_gp_asynchronous(void) |
| 1458 | 15 { |
| 1459 | 16 struct foo *p; |
| 1460 | 17 |
| 1461 | 18 spin_lock(&gp_lock); |
19     p = rcu_access_pointer(gp);
| 1463 | 20 if (!p) { |
| 1464 | 21 spin_unlock(&gp_lock); |
| 1465 | 22 return false; |
| 1466 | 23 } |
| 1467 | 24 rcu_assign_pointer(gp, NULL); |
| 1468 | 25 call_rcu(&p->rh, remove_gp_cb); |
| 1469 | 26 spin_unlock(&gp_lock); |
| 1470 | 27 return true; |
| 1471 | 28 } |
| 1472 | </pre> |
| 1473 | </blockquote> |
| 1474 | |
| 1475 | <p> |
| 1476 | A definition of <tt>struct foo</tt> is finally needed, and appears |
| 1477 | on lines 1-5. |
| 1478 | The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt> |
| 1479 | on line 25, and will be invoked after the end of a subsequent |
| 1480 | grace period. |
| 1481 | This gets the same effect as <tt>remove_gp_synchronous()</tt>, |
| 1482 | but without forcing the updater to wait for a grace period to elapse. |
| 1483 | The <tt>call_rcu()</tt> function may be used in a number of |
| 1484 | situations where neither <tt>synchronize_rcu()</tt> nor |
| 1485 | <tt>synchronize_rcu_expedited()</tt> would be legal, |
| 1486 | including within preempt-disable code, <tt>local_bh_disable()</tt> code, |
| 1487 | interrupt-disable code, and interrupt handlers. |
| 1488 | However, even <tt>call_rcu()</tt> is illegal within NMI handlers. |
The callback function (<tt>remove_gp_cb()</tt> in this case) will be
executed within a softirq (software interrupt) environment within the
Linux kernel,
| 1492 | either within a real softirq handler or under the protection |
| 1493 | of <tt>local_bh_disable()</tt>. |
| 1494 | In both the Linux kernel and in userspace, it is bad practice to |
| 1495 | write an RCU callback function that takes too long. |
| 1496 | Long-running operations should be relegated to separate threads or |
| 1497 | (in the Linux kernel) workqueues. |
| 1498 | |
| 1499 | <p>@@QQ@@ |
| 1500 | Why does line 19 use <tt>rcu_access_pointer()</tt>? |
| 1501 | After all, <tt>call_rcu()</tt> on line 25 stores into the |
| 1502 | structure, which would interact badly with concurrent insertions. |
| 1503 | Doesn't this mean that <tt>rcu_dereference()</tt> is required? |
| 1504 | <p>@@QQA@@ |
Presumably the <tt>gp_lock</tt> acquired on line 18 excludes
any changes, including any insertions that <tt>rcu_dereference()</tt>
would protect against.
Therefore, any insertions will be delayed until after <tt>gp_lock</tt>
is released on line 26, which in turn means that
<tt>rcu_access_pointer()</tt> suffices.
| 1511 | <p>@@QQE@@ |
| 1512 | |
| 1513 | <p> |
| 1514 | However, all that <tt>remove_gp_cb()</tt> is doing is |
| 1515 | invoking <tt>kfree()</tt> on the data element. |
| 1516 | This is a common idiom, and is supported by <tt>kfree_rcu()</tt>, |
| 1517 | which allows “fire and forget” operation as shown below: |
| 1518 | |
| 1519 | <blockquote> |
| 1520 | <pre> |
| 1521 | 1 struct foo { |
| 1522 | 2 int a; |
| 1523 | 3 int b; |
| 1524 | 4 struct rcu_head rh; |
| 1525 | 5 }; |
| 1526 | 6 |
| 1527 | 7 bool remove_gp_faf(void) |
| 1528 | 8 { |
| 1529 | 9 struct foo *p; |
| 1530 | 10 |
| 1531 | 11 spin_lock(&gp_lock); |
| 1532 | 12 p = rcu_dereference(gp); |
| 1533 | 13 if (!p) { |
| 1534 | 14 spin_unlock(&gp_lock); |
| 1535 | 15 return false; |
| 1536 | 16 } |
| 1537 | 17 rcu_assign_pointer(gp, NULL); |
| 1538 | 18 kfree_rcu(p, rh); |
| 1539 | 19 spin_unlock(&gp_lock); |
| 1540 | 20 return true; |
| 1541 | 21 } |
| 1542 | </pre> |
| 1543 | </blockquote> |
| 1544 | |
| 1545 | <p> |
| 1546 | Note that <tt>remove_gp_faf()</tt> simply invokes |
| 1547 | <tt>kfree_rcu()</tt> and proceeds, without any need to pay any |
| 1548 | further attention to the subsequent grace period and <tt>kfree()</tt>. |
| 1549 | It is permissible to invoke <tt>kfree_rcu()</tt> from the same |
| 1550 | environments as for <tt>call_rcu()</tt>. |
| 1551 | Interestingly enough, DYNIX/ptx had the equivalents of |
| 1552 | <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not |
| 1553 | <tt>synchronize_rcu()</tt>. |
| 1554 | This was due to the fact that RCU was not heavily used within DYNIX/ptx, |
| 1555 | so the very few places that needed something like |
| 1556 | <tt>synchronize_rcu()</tt> simply open-coded it. |
| 1557 | |
| 1558 | <p>@@QQ@@ |
| 1559 | Earlier it was claimed that <tt>call_rcu()</tt> and |
| 1560 | <tt>kfree_rcu()</tt> allowed updaters to avoid being blocked |
| 1561 | by readers. |
| 1562 | But how can that be correct, given that the invocation of the callback |
| 1563 | and the freeing of the memory (respectively) must still wait for |
| 1564 | a grace period to elapse? |
| 1565 | <p>@@QQA@@ |
| 1566 | We could define things this way, but keep in mind that this sort of |
| 1567 | definition would say that updates in garbage-collected languages |
| 1568 | cannot complete until the next time the garbage collector runs, |
| 1569 | which does not seem at all reasonable. |
| 1570 | The key point is that in most cases, an updater using either |
| 1571 | <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the |
| 1572 | next update as soon as it has invoked <tt>call_rcu()</tt> or |
| 1573 | <tt>kfree_rcu()</tt>, without having to wait for a subsequent |
| 1574 | grace period. |
| 1575 | <p>@@QQE@@ |
| 1576 | |
| 1577 | <p> |
| 1578 | But what if the updater must wait for the completion of code to be |
| 1579 | executed after the end of the grace period, but has other tasks |
| 1580 | that can be carried out in the meantime? |
| 1581 | The polling-style <tt>get_state_synchronize_rcu()</tt> and |
| 1582 | <tt>cond_synchronize_rcu()</tt> functions may be used for this |
| 1583 | purpose, as shown below: |
| 1584 | |
| 1585 | <blockquote> |
| 1586 | <pre> |
| 1587 | 1 bool remove_gp_poll(void) |
| 1588 | 2 { |
| 1589 | 3 struct foo *p; |
| 1590 | 4 unsigned long s; |
| 1591 | 5 |
| 1592 | 6 spin_lock(&gp_lock); |
| 1593 | 7 p = rcu_access_pointer(gp); |
| 1594 | 8 if (!p) { |
| 1595 | 9 spin_unlock(&gp_lock); |
| 1596 | 10 return false; |
| 1597 | 11 } |
| 1598 | 12 rcu_assign_pointer(gp, NULL); |
| 1599 | 13 spin_unlock(&gp_lock); |
| 1600 | 14 s = get_state_synchronize_rcu(); |
| 1601 | 15 do_something_while_waiting(); |
| 1602 | 16 cond_synchronize_rcu(s); |
| 1603 | 17 kfree(p); |
| 1604 | 18 return true; |
| 1605 | 19 } |
| 1606 | </pre> |
| 1607 | </blockquote> |
| 1608 | |
| 1609 | <p> |
| 1610 | On line 14, <tt>get_state_synchronize_rcu()</tt> obtains a |
| 1611 | “cookie” from RCU, |
| 1612 | then line 15 carries out other tasks, |
| 1613 | and finally, line 16 returns immediately if a grace period has |
| 1614 | elapsed in the meantime, but otherwise waits as required. |
The need for <tt>get_state_synchronize_rcu()</tt> and
| 1616 | <tt>cond_synchronize_rcu()</tt> has appeared quite recently, |
| 1617 | so it is too early to tell whether they will stand the test of time. |
| 1618 | |
| 1619 | <p> |
| 1620 | RCU thus provides a range of tools to allow updaters to strike the |
required tradeoff between latency, flexibility, and CPU overhead.
| 1622 | |
| 1623 | <h3><a name="Composability">Composability</a></h3> |
| 1624 | |
| 1625 | <p> |
| 1626 | Composability has received much attention in recent years, perhaps in part |
| 1627 | due to the collision of multicore hardware with object-oriented techniques |
| 1628 | designed in single-threaded environments for single-threaded use. |
| 1629 | And in theory, RCU read-side critical sections may be composed, and in |
| 1630 | fact may be nested arbitrarily deeply. |
| 1631 | In practice, as with all real-world implementations of composable |
| 1632 | constructs, there are limitations. |
| 1633 | |
| 1634 | <p> |
| 1635 | Implementations of RCU for which <tt>rcu_read_lock()</tt> |
| 1636 | and <tt>rcu_read_unlock()</tt> generate no code, such as |
| 1637 | Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be |
| 1638 | nested arbitrarily deeply. |
| 1639 | After all, there is no overhead. |
| 1640 | Except that if all these instances of <tt>rcu_read_lock()</tt> |
| 1641 | and <tt>rcu_read_unlock()</tt> are visible to the compiler, |
| 1642 | compilation will eventually fail due to exhausting memory, |
| 1643 | mass storage, or user patience, whichever comes first. |
| 1644 | If the nesting is not visible to the compiler, as is the case with |
| 1645 | mutually recursive functions each in its own translation unit, |
| 1646 | stack overflow will result. |
| 1647 | If the nesting takes the form of loops, either the control variable |
| 1648 | will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. |
| 1649 | Nevertheless, this class of RCU implementations is one |
| 1650 | of the most composable constructs in existence. |
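
<p>
For example, given this class of implementations, an RCU-protected traversal
can invoke another function containing its own RCU read-side critical section
without either function knowing about the other, as in the following sketch
(the <tt>bus</tt> and <tt>dev</tt> structures and both functions are
hypothetical):

<blockquote>
<pre>
struct dev {
  struct list_head list;
  int stat;
};
struct bus {
  struct list_head devs;
};

int dev_stat(struct dev *d)
{
  int ret;

  rcu_read_lock();    /* Inner critical section nests freely... */
  ret = READ_ONCE(d-&gt;stat);
  rcu_read_unlock();
  return ret;
}

int bus_stat(struct bus *b)
{
  struct dev *d;
  int sum = 0;

  rcu_read_lock();    /* ...within this outer critical section. */
  list_for_each_entry_rcu(d, &b-&gt;devs, list)
    sum += dev_stat(d);
  rcu_read_unlock();
  return sum;
}
</pre>
</blockquote>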
| 1651 | |
| 1652 | <p> |
| 1653 | RCU implementations that explicitly track nesting depth |
| 1654 | are limited by the nesting-depth counter. |
| 1655 | For example, the Linux kernel's preemptible RCU limits nesting to |
| 1656 | <tt>INT_MAX</tt>. |
| 1657 | This should suffice for almost all practical purposes. |
| 1658 | That said, a consecutive pair of RCU read-side critical sections |
| 1659 | between which there is an operation that waits for a grace period |
| 1660 | cannot be enclosed in another RCU read-side critical section. |
| 1661 | This is because it is not legal to wait for a grace period within |
| 1662 | an RCU read-side critical section: To do so would result either |
| 1663 | in deadlock or |
| 1664 | in RCU implicitly splitting the enclosing RCU read-side critical |
| 1665 | section, neither of which is conducive to a long-lived and prosperous |
| 1666 | kernel. |
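
<p>
In other words, the following sketch of the forbidden pattern
(with <tt>do_something()</tt> and <tt>do_something_else()</tt> standing in
for arbitrary reader code) must be avoided:

<blockquote>
<pre>
rcu_read_lock();          /* Enclosing critical section. */
rcu_read_lock();          /* First of the consecutive pair. */
do_something();
rcu_read_unlock();
synchronize_rcu();        /* BUG: cannot complete while the enclosing */
                          /* critical section remains in effect. */
rcu_read_lock();          /* Second of the consecutive pair. */
do_something_else();
rcu_read_unlock();
rcu_read_unlock();        /* Would have ended the enclosing critical section. */
</pre>
</blockquote>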
| 1667 | |
| 1668 | <p> |
It is worth noting that RCU is not alone in limiting composability.
| 1670 | For example, many transactional-memory implementations prohibit |
| 1671 | composing a pair of transactions separated by an irrevocable |
| 1672 | operation (for example, a network receive operation). |
| 1673 | For another example, lock-based critical sections can be composed |
| 1674 | surprisingly freely, but only if deadlock is avoided. |
| 1675 | |
| 1676 | <p> |
In short, although RCU read-side critical sections are highly composable,
| 1678 | care is required in some situations, just as is the case for any other |
| 1679 | composable synchronization mechanism. |
| 1680 | |
| 1681 | <h3><a name="Corner Cases">Corner Cases</a></h3> |
| 1682 | |
| 1683 | <p> |
| 1684 | A given RCU workload might have an endless and intense stream of |
| 1685 | RCU read-side critical sections, perhaps even so intense that there |
| 1686 | was never a point in time during which there was not at least one |
| 1687 | RCU read-side critical section in flight. |
| 1688 | RCU cannot allow this situation to block grace periods: As long as |
| 1689 | all the RCU read-side critical sections are finite, grace periods |
| 1690 | must also be finite. |
| 1691 | |
| 1692 | <p> |
| 1693 | That said, preemptible RCU implementations could potentially result |
| 1694 | in RCU read-side critical sections being preempted for long durations, |
| 1695 | which has the effect of creating a long-duration RCU read-side |
| 1696 | critical section. |
| 1697 | This situation can arise only in heavily loaded systems, but systems using |
| 1698 | real-time priorities are of course more vulnerable. |
| 1699 | Therefore, RCU priority boosting is provided to help deal with this |
| 1700 | case. |
| 1701 | That said, the exact requirements on RCU priority boosting will likely |
| 1702 | evolve as more experience accumulates. |
| 1703 | |
| 1704 | <p> |
| 1705 | Other workloads might have very high update rates. |
| 1706 | Although one can argue that such workloads should instead use |
| 1707 | something other than RCU, the fact remains that RCU must |
| 1708 | handle such workloads gracefully. |
| 1709 | This requirement is another factor driving batching of grace periods, |
| 1710 | but it is also the driving force behind the checks for large numbers |
| 1711 | of queued RCU callbacks in the <tt>call_rcu()</tt> code path. |
| 1712 | Finally, high update rates should not delay RCU read-side critical |
| 1713 | sections, although some read-side delays can occur when using |
| 1714 | <tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use |
| 1715 | of <tt>try_stop_cpus()</tt>. |
| 1716 | (In the future, <tt>synchronize_rcu_expedited()</tt> will be |
| 1717 | converted to use lighter-weight inter-processor interrupts (IPIs), |
| 1718 | but this will still disturb readers, though to a much smaller degree.) |
| 1719 | |
| 1720 | <p> |
| 1721 | Although all three of these corner cases were understood in the early |
| 1722 | 1990s, a simple user-level test consisting of <tt>close(open(path))</tt> |
| 1723 | in a tight loop |
| 1724 | in the early 2000s suddenly provided a much deeper appreciation of the |
| 1725 | high-update-rate corner case. |
This test also motivated addition of some RCU code to react to high update
rates.  For example, if a given CPU finds itself with more than 10,000
| 1728 | RCU callbacks queued, it will cause RCU to take evasive action by |
| 1729 | more aggressively starting grace periods and more aggressively forcing |
| 1730 | completion of grace-period processing. |
| 1731 | This evasive action causes the grace period to complete more quickly, |
| 1732 | but at the cost of restricting RCU's batching optimizations, thus |
| 1733 | increasing the CPU overhead incurred by that grace period. |
| 1734 | |
| 1735 | <h2><a name="Software-Engineering Requirements"> |
| 1736 | Software-Engineering Requirements</a></h2> |
| 1737 | |
| 1738 | <p> |
| 1739 | Between Murphy's Law and “To err is human”, it is necessary to |
| 1740 | guard against mishaps and misuse: |
| 1741 | |
| 1742 | <ol> |
| 1743 | <li> It is all too easy to forget to use <tt>rcu_read_lock()</tt> |
| 1744 | everywhere that it is needed, so kernels built with |
<tt>CONFIG_PROVE_RCU=y</tt> will splat if
| 1746 | <tt>rcu_dereference()</tt> is used outside of an |
| 1747 | RCU read-side critical section. |
| 1748 | Update-side code can use <tt>rcu_dereference_protected()</tt>, |
| 1749 | which takes a |
| 1750 | <a href="https://lwn.net/Articles/371986/">lockdep expression</a> |
| 1751 | to indicate what is providing the protection. |
If the indicated protection is not provided, a lockdep splat
is emitted.
(A usage sketch of these lockdep-checking APIs appears after this list.)
| 1754 | |
| 1755 | <p> |
| 1756 | Code shared between readers and updaters can use |
| 1757 | <tt>rcu_dereference_check()</tt>, which also takes a |
| 1758 | lockdep expression, and emits a lockdep splat if neither |
| 1759 | <tt>rcu_read_lock()</tt> nor the indicated protection |
| 1760 | is in place. |
| 1761 | In addition, <tt>rcu_dereference_raw()</tt> is used in those |
| 1762 | (hopefully rare) cases where the required protection cannot |
| 1763 | be easily described. |
| 1764 | Finally, <tt>rcu_read_lock_held()</tt> is provided to |
| 1765 | allow a function to verify that it has been invoked within |
| 1766 | an RCU read-side critical section. |
| 1767 | I was made aware of this set of requirements shortly after Thomas |
| 1768 | Gleixner audited a number of RCU uses. |
| 1769 | <li> A given function might wish to check for RCU-related preconditions |
| 1770 | upon entry, before using any other RCU API. |
The <tt>rcu_lockdep_assert()</tt> macro does this job,
| 1772 | asserting the expression in kernels having lockdep enabled |
| 1773 | and doing nothing otherwise. |
| 1774 | <li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt> |
| 1775 | and <tt>rcu_dereference()</tt>, perhaps (incorrectly) |
| 1776 | substituting a simple assignment. |
| 1777 | To catch this sort of error, a given RCU-protected pointer may be |
| 1778 | tagged with <tt>__rcu</tt>, after which running sparse |
| 1779 | with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt> will complain |
| 1780 | about simple-assignment accesses to that pointer. |
| 1781 | Arnd Bergmann made me aware of this requirement, and also |
| 1782 | supplied the needed |
| 1783 | <a href="https://lwn.net/Articles/376011/">patch series</a>. |
| 1784 | <li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt> |
| 1785 | will splat if a data element is passed to <tt>call_rcu()</tt> |
| 1786 | twice in a row, without a grace period in between. |
| 1787 | (This error is similar to a double free.) |
| 1788 | The corresponding <tt>rcu_head</tt> structures that are |
| 1789 | dynamically allocated are automatically tracked, but |
| 1790 | <tt>rcu_head</tt> structures allocated on the stack |
| 1791 | must be initialized with <tt>init_rcu_head_on_stack()</tt> |
| 1792 | and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>. |
| 1793 | Similarly, statically allocated non-stack <tt>rcu_head</tt> |
| 1794 | structures must be initialized with <tt>init_rcu_head()</tt> |
| 1795 | and cleaned up with <tt>destroy_rcu_head()</tt>. |
| 1796 | Mathieu Desnoyers made me aware of this requirement, and also |
| 1797 | supplied the needed |
| 1798 | <a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>. |
| 1799 | <li> An infinite loop in an RCU read-side critical section will |
eventually trigger an RCU CPU stall warning splat, with
| 1801 | the duration of “eventually” being controlled by the |
| 1802 | <tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or, |
| 1803 | alternatively, by the |
| 1804 | <tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs |
| 1805 | parameter. |
However, RCU is not obligated to produce this splat
| 1807 | unless there is a grace period waiting on that particular |
| 1808 | RCU read-side critical section. |
<p>
| 1810 | Some extreme workloads might intentionally delay |
| 1811 | RCU grace periods, and systems running those workloads can |
| 1812 | be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt> |
| 1813 | to suppress the splats. |
| 1814 | This kernel parameter may also be set via <tt>sysfs</tt>. |
| 1815 | Furthermore, RCU CPU stall warnings are counter-productive |
| 1816 | during sysrq dumps and during panics. |
| 1817 | RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and |
| 1818 | <tt>rcu_sysrq_end()</tt> API members to be called before |
| 1819 | and after long sysrq dumps. |
| 1820 | RCU also supplies the <tt>rcu_panic()</tt> notifier that is |
| 1821 | automatically invoked at the beginning of a panic to suppress |
| 1822 | further RCU CPU stall warnings. |
| 1823 | |
| 1824 | <p> |
This requirement made itself known in the early 1990s, pretty
| 1826 | much the first time that it was necessary to debug a CPU stall. |
That said, the initial implementation in DYNIX/ptx was quite
| 1828 | generic in comparison with that of Linux. |
<li>	Although it would be very good to detect pointers leaking out
| 1830 | of RCU read-side critical sections, there is currently no |
| 1831 | good way of doing this. |
| 1832 | One complication is the need to distinguish between pointers |
| 1833 | leaking and pointers that have been handed off from RCU to |
| 1834 | some other synchronization mechanism, for example, reference |
| 1835 | counting. |
| 1836 | <li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related |
| 1837 | information is provided via both debugfs and event tracing. |
| 1838 | <li> Open-coded use of <tt>rcu_assign_pointer()</tt> and |
| 1839 | <tt>rcu_dereference()</tt> to create typical linked |
| 1840 | data structures can be surprisingly error-prone. |
| 1841 | Therefore, RCU-protected |
| 1842 | <a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a> |
| 1843 | and, more recently, RCU-protected |
| 1844 | <a href="https://lwn.net/Articles/612100/">hash tables</a> |
| 1845 | are available. |
| 1846 | Many other special-purpose RCU-protected data structures are |
| 1847 | available in the Linux kernel and the userspace RCU library. |
| 1848 | <li> Some linked structures are created at compile time, but still |
| 1849 | require <tt>__rcu</tt> checking. |
| 1850 | The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this |
| 1851 | purpose. |
| 1852 | <li> It is not necessary to use <tt>rcu_assign_pointer()</tt> |
| 1853 | when creating linked structures that are to be published via |
| 1854 | a single external pointer. |
| 1855 | The <tt>RCU_INIT_POINTER()</tt> macro is provided for |
| 1856 | this task and also for assigning <tt>NULL</tt> pointers |
| 1857 | at runtime. |
| 1858 | </ol> |
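
<p>
As promised, here is a short sketch of the lockdep-checking APIs from the
first two items of the above list.
The <tt>gp</tt> pointer and <tt>gp_lock</tt> are the ones from this
document's earlier examples, the function names are hypothetical, and
<tt>gp</tt> is assumed to be non-<tt>NULL</tt> for brevity:

<blockquote>
<pre>
struct foo __rcu *gp;       /* The __rcu tag enables sparse checking. */
DEFINE_SPINLOCK(gp_lock);   /* Serializes updates to gp. */

int reader(void)            /* Reader: RCU provides the protection. */
{
  int a;

  rcu_read_lock();
  a = rcu_dereference(gp)-&gt;a;  /* CONFIG_PROVE_RCU=y checks this. */
  rcu_read_unlock();
  return a;
}

int updater_peek(void)      /* Updater: gp_lock provides the protection. */
{
  rcu_lockdep_assert(lockdep_is_held(&gp_lock),
                     "updater_peek() needs gp_lock");
  return rcu_dereference_protected(gp, lockdep_is_held(&gp_lock))-&gt;a;
}

int either_peek(void)       /* Shared code: either form of protection is OK. */
{
  return rcu_dereference_check(gp, lockdep_is_held(&gp_lock))-&gt;a;
}
</pre>
</blockquote>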
| 1859 | |
| 1860 | <p> |
This is not a hard-and-fast list: RCU's diagnostic capabilities will
| 1862 | continue to be guided by the number and type of usage bugs found |
| 1863 | in real-world RCU usage. |
| 1864 | |
| 1865 | <h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2> |
| 1866 | |
| 1867 | <p> |
| 1868 | The Linux kernel provides an interesting environment for all kinds of |
| 1869 | software, including RCU. |
| 1870 | Some of the relevant points of interest are as follows: |
| 1871 | |
| 1872 | <ol> |
| 1873 | <li> <a href="#Configuration">Configuration</a>. |
| 1874 | <li> <a href="#Firmware Interface">Firmware Interface</a>. |
| 1875 | <li> <a href="#Early Boot">Early Boot</a>. |
| 1876 | <li> <a href="#Interrupts and NMIs"> |
| 1877 | Interrupts and non-maskable interrupts (NMIs)</a>. |
| 1878 | <li> <a href="#Loadable Modules">Loadable Modules</a>. |
| 1879 | <li> <a href="#Hotplug CPU">Hotplug CPU</a>. |
| 1880 | <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. |
| 1881 | <li> <a href="#Tracing and RCU">Tracing and RCU</a>. |
| 1882 | <li> <a href="#Energy Efficiency">Energy Efficiency</a>. |
<li>	<a href="#Memory Efficiency">Memory Efficiency</a>.
<li>	<a href="#Performance, Scalability, Response Time, and Reliability">
| 1885 | Performance, Scalability, Response Time, and Reliability</a>. |
| 1886 | </ol> |
| 1887 | |
| 1888 | <p> |
| 1889 | This list is probably incomplete, but it does give a feel for the |
| 1890 | most notable Linux-kernel complications. |
| 1891 | Each of the following sections covers one of the above topics. |
| 1892 | |
| 1893 | <h3><a name="Configuration">Configuration</a></h3> |
| 1894 | |
| 1895 | <p> |
| 1896 | RCU's goal is automatic configuration, so that almost nobody |
| 1897 | needs to worry about RCU's <tt>Kconfig</tt> options. |
| 1898 | And for almost all users, RCU does in fact work well |
| 1899 | “out of the box.” |
| 1900 | |
| 1901 | <p> |
| 1902 | However, there are specialized use cases that are handled by |
| 1903 | kernel boot parameters and <tt>Kconfig</tt> options. |
| 1904 | Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users |
about new <tt>Kconfig</tt> options, which requires that almost all of them
| 1906 | be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option. |
| 1907 | |
| 1908 | <p> |
| 1909 | This all should be quite obvious, but the fact remains that |
| 1910 | Linus Torvalds recently had to |
| 1911 | <a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a> |
| 1912 | me of this requirement. |
| 1913 | |
| 1914 | <h3><a name="Firmware Interface">Firmware Interface</a></h3> |
| 1915 | |
| 1916 | <p> |
In many cases, the kernel obtains information about the system from the
| 1918 | firmware, and sometimes things are lost in translation. |
| 1919 | Or the translation is accurate, but the original message is bogus. |
| 1920 | |
| 1921 | <p> |
| 1922 | For example, some systems' firmware overreports the number of CPUs, |
| 1923 | sometimes by a large factor. |
| 1924 | If RCU naively believed the firmware, as it used to do, |
| 1925 | it would create too many per-CPU kthreads. |
| 1926 | Although the resulting system will still run correctly, the extra |
| 1927 | kthreads needlessly consume memory and can cause confusion |
| 1928 | when they show up in <tt>ps</tt> listings. |
| 1929 | |
| 1930 | <p> |
| 1931 | RCU must therefore wait for a given CPU to actually come online before |
| 1932 | it can allow itself to believe that the CPU actually exists. |
| 1933 | The resulting “ghost CPUs” (which are never going to |
| 1934 | come online) cause a number of |
| 1935 | <a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>. |
| 1936 | |
| 1937 | <h3><a name="Early Boot">Early Boot</a></h3> |
| 1938 | |
| 1939 | <p> |
| 1940 | The Linux kernel's boot sequence is an interesting process, |
| 1941 | and RCU is used early, even before <tt>rcu_init()</tt> |
| 1942 | is invoked. |
| 1943 | In fact, a number of RCU's primitives can be used as soon as the |
| 1944 | initial task's <tt>task_struct</tt> is available and the |
| 1945 | boot CPU's per-CPU variables are set up. |
| 1946 | The read-side primitives (<tt>rcu_read_lock()</tt>, |
| 1947 | <tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>, |
| 1948 | and <tt>rcu_access_pointer()</tt>) will operate normally very early on, |
| 1949 | as will <tt>rcu_assign_pointer()</tt>. |
| 1950 | |
| 1951 | <p> |
| 1952 | Although <tt>call_rcu()</tt> may be invoked at any |
| 1953 | time during boot, callbacks are not guaranteed to be invoked until after |
| 1954 | the scheduler is fully up and running. |
| 1955 | This delay in callback invocation is due to the fact that RCU does not |
| 1956 | invoke callbacks until it is fully initialized, and this full initialization |
| 1957 | cannot occur until after the scheduler has initialized itself to the |
| 1958 | point where RCU can spawn and run its kthreads. |
In theory, it would be possible to invoke callbacks earlier;
however, this is not a panacea because there would be severe restrictions
| 1961 | on what operations those callbacks could invoke. |
| 1962 | |
| 1963 | <p> |
| 1964 | Perhaps surprisingly, <tt>synchronize_rcu()</tt>, |
| 1965 | <a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a> |
| 1966 | (<a href="#Bottom-Half Flavor">discussed below</a>), |
| 1967 | and |
| 1968 | <a href="#Sched Flavor"><tt>synchronize_sched()</tt></a> |
| 1969 | will all operate normally |
| 1970 | during very early boot, the reason being that there is only one CPU |
| 1971 | and preemption is disabled. |
This means that a call to <tt>synchronize_rcu()</tt> (or friends)
is itself a quiescent
state and thus a grace period, so the early-boot implementation can
be a no-op.
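
<p>
Conceptually, then, the very-early-boot implementation can be as simple
as the following sketch:

<blockquote>
<pre>
void synchronize_rcu(void)
{
  /* With but one CPU and with preemption disabled, every point in
   * the code is a quiescent state, so the needed grace period has
   * already elapsed by the time this function is invoked.
   * There is therefore nothing to wait for. */
}
</pre>
</blockquote>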
| 1976 | |
| 1977 | <p> |
| 1978 | Both <tt>synchronize_rcu_bh()</tt> and <tt>synchronize_sched()</tt> |
| 1979 | continue to operate normally through the remainder of boot, courtesy |
| 1980 | of the fact that preemption is disabled across their RCU read-side |
| 1981 | critical sections and also courtesy of the fact that there is still |
| 1982 | only one CPU. |
| 1983 | However, once the scheduler starts initializing, preemption is enabled. |
| 1984 | There is still only a single CPU, but the fact that preemption is enabled |
| 1985 | means that the no-op implementation of <tt>synchronize_rcu()</tt> no |
| 1986 | longer works in <tt>CONFIG_PREEMPT=y</tt> kernels. |
| 1987 | Therefore, as soon as the scheduler starts initializing, the early-boot |
| 1988 | fastpath is disabled. |
| 1989 | This means that <tt>synchronize_rcu()</tt> switches to its runtime |
| 1990 | mode of operation where it posts callbacks, which in turn means that |
| 1991 | any call to <tt>synchronize_rcu()</tt> will block until the corresponding |
| 1992 | callback is invoked. |
| 1993 | Unfortunately, the callback cannot be invoked until RCU's runtime |
| 1994 | grace-period machinery is up and running, which cannot happen until |
| 1995 | the scheduler has initialized itself sufficiently to allow RCU's |
| 1996 | kthreads to be spawned. |
| 1997 | Therefore, invoking <tt>synchronize_rcu()</tt> during scheduler |
| 1998 | initialization can result in deadlock. |
| 1999 | |
| 2000 | <p>@@QQ@@ |
| 2001 | So what happens with <tt>synchronize_rcu()</tt> during |
| 2002 | scheduler initialization for <tt>CONFIG_PREEMPT=n</tt> |
| 2003 | kernels? |
| 2004 | <p>@@QQA@@ |
In <tt>CONFIG_PREEMPT=n</tt> kernels, <tt>synchronize_rcu()</tt>
| 2006 | maps directly to <tt>synchronize_sched()</tt>. |
| 2007 | Therefore, <tt>synchronize_rcu()</tt> works normally throughout |
| 2008 | boot in <tt>CONFIG_PREEMPT=n</tt> kernels. |
| 2009 | However, your code must also work in <tt>CONFIG_PREEMPT=y</tt> kernels, |
| 2010 | so it is still necessary to avoid invoking <tt>synchronize_rcu()</tt> |
| 2011 | during scheduler initialization. |
| 2012 | <p>@@QQE@@ |
| 2013 | |
| 2014 | <p> |
| 2015 | I learned of these boot-time requirements as a result of a series of |
| 2016 | system hangs. |
| 2017 | |
| 2018 | <h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3> |
| 2019 | |
| 2020 | <p> |
| 2021 | The Linux kernel has interrupts, and RCU read-side critical sections are |
| 2022 | legal within interrupt handlers and within interrupt-disabled regions |
| 2023 | of code, as are invocations of <tt>call_rcu()</tt>. |
| 2024 | |
| 2025 | <p> |
| 2026 | Some Linux-kernel architectures can enter an interrupt handler from |
| 2027 | non-idle process context, and then just never leave it, instead stealthily |
| 2028 | transitioning back to process context. |
| 2029 | This trick is sometimes used to invoke system calls from inside the kernel. |
| 2030 | These “half-interrupts” mean that RCU has to be very careful |
| 2031 | about how it counts interrupt nesting levels. |
| 2032 | I learned of this requirement the hard way during a rewrite |
| 2033 | of RCU's dyntick-idle code. |
| 2034 | |
| 2035 | <p> |
| 2036 | The Linux kernel has non-maskable interrupts (NMIs), and |
| 2037 | RCU read-side critical sections are legal within NMI handlers. |
| 2038 | Thankfully, RCU update-side primitives, including |
| 2039 | <tt>call_rcu()</tt>, are prohibited within NMI handlers. |
| 2040 | |
| 2041 | <p> |
| 2042 | The name notwithstanding, some Linux-kernel architectures |
| 2043 | can have nested NMIs, which RCU must handle correctly. |
| 2044 | Andy Lutomirski |
| 2045 | <a href="https://lkml.kernel.org/g/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a> |
| 2046 | with this requirement; |
| 2047 | he also kindly surprised me with |
| 2048 | <a href="https://lkml.kernel.org/g/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a> |
| 2049 | that meets this requirement. |
| 2050 | |
| 2051 | <h3><a name="Loadable Modules">Loadable Modules</a></h3> |
| 2052 | |
| 2053 | <p> |
| 2054 | The Linux kernel has loadable modules, and these modules can |
| 2055 | also be unloaded. |
| 2056 | After a given module has been unloaded, any attempt to call |
| 2057 | one of its functions results in a segmentation fault. |
| 2058 | The module-unload functions must therefore cancel any |
| 2059 | delayed calls to loadable-module functions, for example, |
| 2060 | any outstanding <tt>mod_timer()</tt> must be dealt with |
| 2061 | via <tt>del_timer_sync()</tt> or similar. |
| 2062 | |
| 2063 | <p> |
| 2064 | Unfortunately, there is no way to cancel an RCU callback; |
| 2065 | once you invoke <tt>call_rcu()</tt>, the callback function is |
| 2066 | going to eventually be invoked, unless the system goes down first. |
| 2067 | Because it is normally considered socially irresponsible to crash the system |
| 2068 | in response to a module unload request, we need some other way |
| 2069 | to deal with in-flight RCU callbacks. |
| 2070 | |
| 2071 | <p> |
| 2072 | RCU therefore provides |
| 2073 | <tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>, |
| 2074 | which waits until all in-flight RCU callbacks have been invoked. |
| 2075 | If a module uses <tt>call_rcu()</tt>, its exit function should therefore |
| 2076 | prevent any future invocation of <tt>call_rcu()</tt>, then invoke |
| 2077 | <tt>rcu_barrier()</tt>. |
| 2078 | In theory, the underlying module-unload code could invoke |
| 2079 | <tt>rcu_barrier()</tt> unconditionally, but in practice this would |
| 2080 | incur unacceptable latencies. |
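
<p>
A module exit function might therefore follow the pattern in this sketch,
in which <tt>stop_new_callbacks()</tt> is a hypothetical stand-in for
whatever module-specific mechanism prevents subsequent invocations of
<tt>call_rcu()</tt>:

<blockquote>
<pre>
static void __exit mymodule_exit(void)
{
  stop_new_callbacks();  /* No new call_rcu() invocations after this. */
  rcu_barrier();         /* Wait for all in-flight callbacks to finish. */
  /* Only now is it safe for this module's code and data to vanish. */
}
module_exit(mymodule_exit);
</pre>
</blockquote>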
| 2081 | |
| 2082 | <p> |
| 2083 | Nikita Danilov noted this requirement for an analogous filesystem-unmount |
| 2084 | situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU. |
| 2085 | The need for <tt>rcu_barrier()</tt> for module unloading became |
| 2086 | apparent later. |
| 2087 | |
| 2088 | <h3><a name="Hotplug CPU">Hotplug CPU</a></h3> |
| 2089 | |
| 2090 | <p> |
| 2091 | The Linux kernel supports CPU hotplug, which means that CPUs |
| 2092 | can come and go. |
| 2093 | It is of course illegal to use any RCU API member from an offline CPU. |
| 2094 | This requirement was present from day one in DYNIX/ptx, but |
| 2095 | on the other hand, the Linux kernel's CPU-hotplug implementation |
| 2096 | is “interesting.” |
| 2097 | |
| 2098 | <p> |
| 2099 | The Linux-kernel CPU-hotplug implementation has notifiers that |
| 2100 | are used to allow the various kernel subsystems (including RCU) |
| 2101 | to respond appropriately to a given CPU-hotplug operation. |
| 2102 | Most RCU operations may be invoked from CPU-hotplug notifiers, |
| 2103 | including even normal synchronous grace-period operations |
| 2104 | such as <tt>synchronize_rcu()</tt>. |
| 2105 | However, expedited grace-period operations such as |
| 2106 | <tt>synchronize_rcu_expedited()</tt> are not supported, |
| 2107 | due to the fact that current implementations block CPU-hotplug |
| 2108 | operations, which could result in deadlock. |
| 2109 | |
| 2110 | <p> |
| 2111 | In addition, all-callback-wait operations such as |
| 2112 | <tt>rcu_barrier()</tt> are also not supported, due to the |
| 2113 | fact that there are phases of CPU-hotplug operations where |
| 2114 | the outgoing CPU's callbacks will not be invoked until after |
| 2115 | the CPU-hotplug operation ends, which could also result in deadlock. |
| 2116 | |
| 2117 | <h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3> |
| 2118 | |
| 2119 | <p> |
| 2120 | RCU depends on the scheduler, and the scheduler uses RCU to |
| 2121 | protect some of its data structures. |
| 2122 | This means the scheduler is forbidden from acquiring |
| 2123 | the runqueue locks and the priority-inheritance locks |
in the middle of an outermost RCU read-side critical section unless either
| 2125 | (1) it releases them before exiting that same |
| 2126 | RCU read-side critical section, or |
(2)	interrupts are disabled across
that entire RCU read-side critical section.
| 2129 | This same prohibition also applies (recursively!) to any lock that is acquired |
while holding any lock to which this prohibition applies.
Adhering to this rule prevents preemptible RCU from invoking
| 2132 | <tt>rcu_read_unlock_special()</tt> while either runqueue or |
| 2133 | priority-inheritance locks are held, thus avoiding deadlock. |
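
<p>
In a sketch, option (2) looks as follows, with <tt>rq_lock</tt> standing in
for any scheduler lock and <tt>do_scheduler_update()</tt> being a
hypothetical helper:

<blockquote>
<pre>
static DEFINE_SPINLOCK(rq_lock);  /* Hypothetical stand-in for a scheduler lock. */

static void update_scheduler_state(void)
{
  unsigned long flags;

  local_irq_save(flags);   /* Option (2): interrupts disabled across */
  rcu_read_lock();         /* the entire RCU read-side critical section, */
  spin_lock(&rq_lock);     /* so acquiring the scheduler lock is safe. */
  do_scheduler_update();   /* Hypothetical. */
  spin_unlock(&rq_lock);
  rcu_read_unlock();       /* Cannot invoke rcu_read_unlock_special(). */
  local_irq_restore(flags);
}
</pre>
</blockquote>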
| 2135 | <p> |
Prior to v4.4, it was only necessary to disable preemption across
| 2137 | RCU read-side critical sections that acquired scheduler locks. |
| 2138 | In v4.4, expedited grace periods started using IPIs, and these |
| 2139 | IPIs could force a <tt>rcu_read_unlock()</tt> to take the slowpath. |
| 2140 | Therefore, this expedited-grace-period change required disabling of |
| 2141 | interrupts, not just preemption. |
| 2142 | |
| 2143 | <p> |
For RCU's part, the preemptible-RCU <tt>rcu_read_unlock()</tt>
| 2145 | implementation must be written carefully to avoid similar deadlocks. |
| 2146 | In particular, <tt>rcu_read_unlock()</tt> must tolerate an |
| 2147 | interrupt where the interrupt handler invokes both |
| 2148 | <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. |
| 2149 | This possibility requires <tt>rcu_read_unlock()</tt> to use |
| 2150 | negative nesting levels to avoid destructive recursion via |
the interrupt handler's use of RCU.
| 2152 | |
| 2153 | <p> |
| 2154 | This pair of mutual scheduler-RCU requirements came as a |
| 2155 | <a href="https://lwn.net/Articles/453002/">complete surprise</a>. |
| 2156 | |
| 2157 | <p> |
| 2158 | As noted above, RCU makes use of kthreads, and it is necessary to |
| 2159 | avoid excessive CPU-time accumulation by these kthreads. |
| 2160 | This requirement was no surprise, but RCU's violation of it |
when running context-switch-heavy workloads on kernels built with
| 2162 | <tt>CONFIG_NO_HZ_FULL=y</tt> |
| 2163 | <a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>. |
| 2164 | RCU has made good progress towards meeting this requirement, even |
for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads,
| 2166 | but there is room for further improvement. |
| 2167 | |
| 2168 | <h3><a name="Tracing and RCU">Tracing and RCU</a></h3> |
| 2169 | |
| 2170 | <p> |
| 2171 | It is possible to use tracing on RCU code, but tracing itself |
| 2172 | uses RCU. |
| 2173 | For this reason, <tt>rcu_dereference_raw_notrace()</tt> |
| 2174 | is provided for use by tracing, which avoids the destructive |
| 2175 | recursion that could otherwise ensue. |
| 2176 | This API is also used by virtualization in some architectures, |
| 2177 | where RCU readers execute in environments in which tracing |
| 2178 | cannot be used. |
| 2179 | The tracing folks both located the requirement and provided the |
| 2180 | needed fix, so this surprise requirement was relatively painless. |
| 2181 | |
| 2182 | <h3><a name="Energy Efficiency">Energy Efficiency</a></h3> |
| 2183 | |
| 2184 | <p> |
| 2185 | Interrupting idle CPUs is considered socially unacceptable, |
| 2186 | especially by people with battery-powered embedded systems. |
| 2187 | RCU therefore conserves energy by detecting which CPUs are |
| 2188 | idle, including tracking CPUs that have been interrupted from idle. |
| 2189 | This is a large part of the energy-efficiency requirement, |
| 2190 | so I learned of this via an irate phone call. |
| 2191 | |
| 2192 | <p> |
| 2193 | Because RCU avoids interrupting idle CPUs, it is illegal to |
| 2194 | execute an RCU read-side critical section on an idle CPU. |
| 2195 | (Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat |
| 2196 | if you try it.) |
The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt>
event tracing are provided to work around this restriction.
| 2199 | In addition, <tt>rcu_is_watching()</tt> may be used to |
| 2200 | test whether or not it is currently legal to run RCU read-side |
| 2201 | critical sections on this CPU. |
| 2202 | I learned of the need for diagnostics on the one hand |
| 2203 | and <tt>RCU_NONIDLE()</tt> on the other while inspecting |
| 2204 | idle-loop code. |
| 2205 | Steven Rostedt supplied <tt>_rcuidle</tt> event tracing, |
| 2206 | which is used quite heavily in the idle loop. |
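
<p>
For example, a tracepoint invoked from the idle loop (where RCU is not
watching) can be wrapped as in this sketch, in which
<tt>trace_my_idle_event()</tt> is a hypothetical tracepoint:

<blockquote>
<pre>
static void report_idle_event(int cpu)
{
  WARN_ON_ONCE(rcu_is_watching());       /* Expect to be called from idle. */
  RCU_NONIDLE(trace_my_idle_event(cpu)); /* RCU watches just long enough */
                                         /* for the tracepoint's RCU readers. */
}
</pre>
</blockquote>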
| 2207 | |
| 2208 | <p> |
| 2209 | It is similarly socially unacceptable to interrupt an |
| 2210 | <tt>nohz_full</tt> CPU running in userspace. |
| 2211 | RCU must therefore track <tt>nohz_full</tt> userspace |
| 2212 | execution. |
| 2213 | And in |
| 2214 | <a href="https://lwn.net/Articles/558284/"><tt>CONFIG_NO_HZ_FULL_SYSIDLE=y</tt></a> |
| 2215 | kernels, RCU must separately track idle CPUs on the one hand and |
| 2216 | CPUs that are either idle or executing in userspace on the other. |
| 2217 | In both cases, RCU must be able to sample state at two points in |
| 2218 | time, and be able to determine whether or not some other CPU spent |
| 2219 | any time idle and/or executing in userspace. |
| 2220 | |
| 2221 | <p> |
| 2222 | These energy-efficiency requirements have proven quite difficult to |
| 2223 | understand and to meet, for example, there have been more than five |
| 2224 | clean-sheet rewrites of RCU's energy-efficiency code, the last of |
| 2225 | which was finally able to demonstrate |
| 2226 | <a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>. |
| 2227 | As noted earlier, |
| 2228 | I learned of many of these requirements via angry phone calls: |
| 2229 | Flaming me on the Linux-kernel mailing list was apparently not |
| 2230 | sufficient to fully vent their ire at RCU's energy-efficiency bugs! |
| 2231 | |
<h3><a name="Memory Efficiency">Memory Efficiency</a></h3>
| 2233 | |
| 2234 | <p> |
| 2235 | Although small-memory non-realtime systems can simply use Tiny RCU, |
| 2236 | code size is only one aspect of memory efficiency. |
| 2237 | Another aspect is the size of the <tt>rcu_head</tt> structure |
| 2238 | used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>. |
| 2239 | Although this structure contains nothing more than a pair of pointers, |
| 2240 | it does appear in many RCU-protected data structures, including |
| 2241 | some that are size critical. |
| 2242 | The <tt>page</tt> structure is a case in point, as evidenced by |
| 2243 | the many occurrences of the <tt>union</tt> keyword within that structure. |
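
<p>
For reference, the <tt>rcu_head</tt> structure really is just a pair of
pointers, roughly as follows (simplified from the actual Linux-kernel
definition):

<blockquote>
<pre>
struct rcu_head {
  struct rcu_head *next;                 /* Next callback in the list. */
  void (*func)(struct rcu_head *head);   /* Invoked after the grace period. */
};
</pre>
</blockquote>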
| 2244 | |
| 2245 | <p> |
| 2246 | This need for memory efficiency is one reason that RCU uses hand-crafted |
| 2247 | singly linked lists to track the <tt>rcu_head</tt> structures that |
| 2248 | are waiting for a grace period to elapse. |
| 2249 | It is also the reason why <tt>rcu_head</tt> structures do not contain |
| 2250 | debug information, such as fields tracking the file and line of the |
| 2251 | <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them. |
| 2252 | Although this information might appear in debug-only kernel builds at some |
| 2253 | point, in the meantime, the <tt>->func</tt> field will often provide |
| 2254 | the needed debug information. |
| 2255 | |
| 2256 | <p> |
| 2257 | However, in some cases, the need for memory efficiency leads to even |
| 2258 | more extreme measures. |
| 2259 | Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field |
| 2260 | shares storage with a great many other structures that are used at |
| 2261 | various points in the corresponding page's lifetime. |
| 2262 | In order to correctly resolve certain |
| 2263 | <a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>, |
| 2264 | the Linux kernel's memory-management subsystem needs a particular bit |
| 2265 | to remain zero during all phases of grace-period processing, |
| 2266 | and that bit happens to map to the bottom bit of the |
| 2267 | <tt>rcu_head</tt> structure's <tt>->next</tt> field. |
| 2268 | RCU makes this guarantee as long as <tt>call_rcu()</tt> |
| 2269 | is used to post the callback, as opposed to <tt>kfree_rcu()</tt> |
| 2270 | or some future “lazy” |
| 2271 | variant of <tt>call_rcu()</tt> that might one day be created for |
| 2272 | energy-efficiency purposes. |
| 2273 | |
<h3><a name="Performance, Scalability, Response Time, and Reliability">
| 2275 | Performance, Scalability, Response Time, and Reliability</a></h3> |
| 2276 | |
| 2277 | <p> |
| 2278 | Expanding on the |
| 2279 | <a href="#Performance and Scalability">earlier discussion</a>, |
| 2280 | RCU is used heavily by hot code paths in performance-critical |
| 2281 | portions of the Linux kernel's networking, security, virtualization, |
| 2282 | and scheduling code paths. |
| 2283 | RCU must therefore use efficient implementations, especially in its |
| 2284 | read-side primitives. |
| 2285 | To that end, it would be good if preemptible RCU's implementation |
of <tt>rcu_read_lock()</tt> could be inlined; however, doing
| 2287 | this requires resolving <tt>#include</tt> issues with the |
| 2288 | <tt>task_struct</tt> structure. |
| 2289 | |
| 2290 | <p> |
| 2291 | The Linux kernel supports hardware configurations with up to |
| 2292 | 4096 CPUs, which means that RCU must be extremely scalable. |
| 2293 | Algorithms that involve frequent acquisitions of global locks or |
| 2294 | frequent atomic operations on global variables simply cannot be |
| 2295 | tolerated within the RCU implementation. |
| 2296 | RCU therefore makes heavy use of a combining tree based on the |
| 2297 | <tt>rcu_node</tt> structure. |
| 2298 | RCU is required to tolerate all CPUs continuously invoking any |
| 2299 | combination of RCU's runtime primitives with minimal per-operation |
| 2300 | overhead. |
| 2301 | In fact, in many cases, increasing load must <i>decrease</i> the |
| 2302 | per-operation overhead, witness the batching optimizations for |
| 2303 | <tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>, |
| 2304 | <tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>. |
| 2305 | As a general rule, RCU must cheerfully accept whatever the |
| 2306 | rest of the Linux kernel decides to throw at it. |
| 2307 | |
| 2308 | <p> |
| 2309 | The Linux kernel is used for real-time workloads, especially |
| 2310 | in conjunction with the |
| 2311 | <a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>. |
| 2312 | The real-time-latency response requirements are such that the |
| 2313 | traditional approach of disabling preemption across RCU |
| 2314 | read-side critical sections is inappropriate. |
| 2315 | Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore |
| 2316 | use an RCU implementation that allows RCU read-side critical |
| 2317 | sections to be preempted. |
| 2318 | This requirement made its presence known after users made it |
| 2319 | clear that an earlier |
| 2320 | <a href="https://lwn.net/Articles/107930/">real-time patch</a> |
| 2321 | did not meet their needs, in conjunction with some |
| 2322 | <a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a> |
| 2323 | encountered by a very early version of the -rt patchset. |
| 2324 | |
| 2325 | <p> |
| 2326 | In addition, RCU must make do with a sub-100-microsecond real-time latency |
| 2327 | budget. |
| 2328 | In fact, on smaller systems with the -rt patchset, the Linux kernel |
| 2329 | provides sub-20-microsecond real-time latencies for the whole kernel, |
| 2330 | including RCU. |
| 2331 | RCU's scalability and latency must therefore be sufficient for |
| 2332 | these sorts of configurations. |
| 2333 | To my surprise, the sub-100-microsecond real-time latency budget |
| 2334 | <a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf"> |
| 2335 | applies to even the largest systems [PDF]</a>, |
| 2336 | up to and including systems with 4096 CPUs. |
| 2337 | This real-time requirement motivated the grace-period kthread, which |
| 2338 | also simplified handling of a number of race conditions. |
| 2339 | |
| 2340 | <p> |
| 2341 | Finally, RCU's status as a synchronization primitive means that |
| 2342 | any RCU failure can result in arbitrary memory corruption that can be |
| 2343 | extremely difficult to debug. |
| 2344 | This means that RCU must be extremely reliable, which in |
| 2345 | practice also means that RCU must have an aggressive stress-test |
| 2346 | suite. |
| 2347 | This stress-test suite is called <tt>rcutorture</tt>. |
| 2348 | |
| 2349 | <p> |
| 2350 | Although the need for <tt>rcutorture</tt> was no surprise, |
| 2351 | the current immense popularity of the Linux kernel is posing |
| 2352 | interesting—and perhaps unprecedented—validation |
| 2353 | challenges. |
| 2354 | To see this, keep in mind that there are well over one billion |
| 2355 | instances of the Linux kernel running today, given Android |
| 2356 | smartphones, Linux-powered televisions, and servers. |
| 2357 | This number can be expected to increase sharply with the advent of |
| 2358 | the celebrated Internet of Things. |
| 2359 | |
| 2360 | <p> |
Suppose that RCU contains a race condition that manifests on average
once per million years of runtime.
Given more than a billion instances, that amounts to roughly one
thousand failures per year, or about three per <i>day</i>, across
the installed base.
| 2365 | RCU could simply hide behind hardware error rates, given that no one |
| 2366 | should really expect their smartphone to last for a million years. |
| 2367 | However, anyone taking too much comfort from this thought should |
| 2368 | consider the fact that in most jurisdictions, a successful multi-year |
| 2369 | test of a given mechanism, which might include a Linux kernel, |
| 2370 | suffices for a number of types of safety-critical certifications. |
| 2371 | In fact, rumor has it that the Linux kernel is already being used |
| 2372 | in production for safety-critical applications. |
I don't know about you, but I would feel quite bad if a bug in RCU
killed someone, which might explain my recent focus on validation
and verification.
| 2376 | |
| 2377 | <h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2> |
| 2378 | |
| 2379 | <p> |
| 2380 | One of the more surprising things about RCU is that there are now |
| 2381 | no fewer than five <i>flavors</i>, or API families. |
| 2382 | In addition, the primary flavor that has been the sole focus up to |
| 2383 | this point has two different implementations, non-preemptible and |
| 2384 | preemptible. |
| 2385 | The other four flavors are listed below, with requirements for each |
| 2386 | described in a separate section. |
| 2387 | |
| 2388 | <ol> |
| 2389 | <li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor</a> |
| 2390 | <li> <a href="#Sched Flavor">Sched Flavor</a> |
| 2391 | <li> <a href="#Sleepable RCU">Sleepable RCU</a> |
| 2392 | <li> <a href="#Tasks RCU">Tasks RCU</a> |
| 2393 | </ol> |
| 2394 | |
| 2395 | <h3><a name="Bottom-Half Flavor">Bottom-Half Flavor</a></h3> |
| 2396 | |
| 2397 | <p> |
| 2398 | The softirq-disable (AKA “bottom-half”, |
| 2399 | hence the “_bh” abbreviations) |
| 2400 | flavor of RCU, or <i>RCU-bh</i>, was developed by |
| 2401 | Dipankar Sarma to provide a flavor of RCU that could withstand the |
| 2402 | network-based denial-of-service attacks researched by Robert |
| 2403 | Olsson. |
| 2404 | These attacks placed so much networking load on the system |
| 2405 | that some of the CPUs never exited softirq execution, |
| 2406 | which in turn prevented those CPUs from ever executing a context switch, |
| 2407 | which, in the RCU implementation of that time, prevented grace periods |
| 2408 | from ever ending. |
| 2409 | The result was an out-of-memory condition and a system hang. |
| 2410 | |
| 2411 | <p> |
| 2412 | The solution was the creation of RCU-bh, which does |
| 2413 | <tt>local_bh_disable()</tt> |
| 2414 | across its read-side critical sections, and which uses the transition |
| 2415 | from one type of softirq processing to another as a quiescent state |
| 2416 | in addition to context switch, idle, user mode, and offline. |
| 2417 | This means that RCU-bh grace periods can complete even when some of |
| 2418 | the CPUs execute in softirq indefinitely, thus allowing algorithms |
| 2419 | based on RCU-bh to withstand network-based denial-of-service attacks. |
| 2420 | |
| 2421 | <p> |
| 2422 | Because |
| 2423 | <tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt> |
disable and re-enable softirq handlers, any attempt to start a softirq
handler during an RCU-bh read-side critical section will be deferred.
| 2427 | In this case, <tt>rcu_read_unlock_bh()</tt> |
| 2428 | will invoke softirq processing, which can take considerable time. |
| 2429 | One can of course argue that this softirq overhead should be associated |
| 2430 | with the code following the RCU-bh read-side critical section rather |
| 2431 | than <tt>rcu_read_unlock_bh()</tt>, but the fact |
| 2432 | is that most profiling tools cannot be expected to make this sort |
| 2433 | of fine distinction. |
| 2434 | For example, suppose that a three-millisecond-long RCU-bh read-side |
| 2435 | critical section executes during a time of heavy networking load. |
| 2436 | There will very likely be an attempt to invoke at least one softirq |
| 2437 | handler during that three milliseconds, but any such invocation will |
| 2438 | be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>. |
| 2439 | This can of course make it appear at first glance as if |
| 2440 | <tt>rcu_read_unlock_bh()</tt> was executing very slowly. |
| 2441 | |
| 2442 | <p> |
| 2443 | The |
| 2444 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a> |
| 2445 | includes |
| 2446 | <tt>rcu_read_lock_bh()</tt>, |
| 2447 | <tt>rcu_read_unlock_bh()</tt>, |
| 2448 | <tt>rcu_dereference_bh()</tt>, |
| 2449 | <tt>rcu_dereference_bh_check()</tt>, |
| 2450 | <tt>synchronize_rcu_bh()</tt>, |
| 2451 | <tt>synchronize_rcu_bh_expedited()</tt>, |
| 2452 | <tt>call_rcu_bh()</tt>, |
| 2453 | <tt>rcu_barrier_bh()</tt>, and |
| 2454 | <tt>rcu_read_lock_bh_held()</tt>. |
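
<p>
As a minimal sketch (the <tt>gp</tt> pointer and
<tt>do_something_with()</tt> are hypothetical), an RCU-bh reader
might look like this:

<blockquote>
<pre>
1 rcu_read_lock_bh();
2 p = rcu_dereference_bh(gp);
3 if (p)
4   do_something_with(p->a);
5 rcu_read_unlock_bh();
</pre>
</blockquote>

<p>
An updater would then use <tt>synchronize_rcu_bh()</tt> or
<tt>call_rcu_bh()</tt> to wait for all such readers before freeing
the old data.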
| 2455 | |
| 2456 | <h3><a name="Sched Flavor">Sched Flavor</a></h3> |
| 2457 | |
| 2458 | <p> |
| 2459 | Before preemptible RCU, waiting for an RCU grace period had the |
| 2460 | side effect of also waiting for all pre-existing interrupt |
| 2461 | and NMI handlers. |
| 2462 | However, there are legitimate preemptible-RCU implementations that |
| 2463 | do not have this property, given that any point in the code outside |
| 2464 | of an RCU read-side critical section can be a quiescent state. |
| 2465 | Therefore, <i>RCU-sched</i> was created, which follows “classic” |
RCU in that an RCU-sched grace period waits for pre-existing
| 2467 | interrupt and NMI handlers. |
| 2468 | In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched |
| 2469 | APIs have identical implementations, while kernels built with |
| 2470 | <tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each. |
| 2471 | |
| 2472 | <p> |
| 2473 | Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels, |
| 2474 | <tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> |
| 2475 | disable and re-enable preemption, respectively. |
| 2476 | This means that if there was a preemption attempt during the |
| 2477 | RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt> |
| 2478 | will enter the scheduler, with all the latency and overhead entailed. |
| 2479 | Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look |
| 2480 | as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly. |
| 2481 | However, the highest-priority task won't be preempted, so that task |
| 2482 | will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations. |
| 2483 | |
| 2484 | <p> |
| 2485 | The |
| 2486 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a> |
| 2487 | includes |
| 2488 | <tt>rcu_read_lock_sched()</tt>, |
| 2489 | <tt>rcu_read_unlock_sched()</tt>, |
| 2490 | <tt>rcu_read_lock_sched_notrace()</tt>, |
| 2491 | <tt>rcu_read_unlock_sched_notrace()</tt>, |
| 2492 | <tt>rcu_dereference_sched()</tt>, |
| 2493 | <tt>rcu_dereference_sched_check()</tt>, |
| 2494 | <tt>synchronize_sched()</tt>, |
| 2495 | <tt>synchronize_rcu_sched_expedited()</tt>, |
| 2496 | <tt>call_rcu_sched()</tt>, |
| 2497 | <tt>rcu_barrier_sched()</tt>, and |
| 2498 | <tt>rcu_read_lock_sched_held()</tt>. |
| 2499 | However, anything that disables preemption also marks an RCU-sched |
| 2500 | read-side critical section, including |
| 2501 | <tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>, |
| 2502 | <tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>, |
| 2503 | and so on. |
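
<p>
The following sketch (in which <tt>gp</tt>, <tt>gp_lock</tt>, and
<tt>do_something_with()</tt> are hypothetical) shows a
preemption-disabled region acting as an RCU-sched reader, with the
updater using <tt>synchronize_sched()</tt> to wait for it:

<blockquote>
<pre>
 1 /* Reader: any preemption-disabled region qualifies. */
 2 preempt_disable();
 3 p = rcu_dereference_sched(gp);
 4 if (p)
 5   do_something_with(p->a);
 6 preempt_enable();
 7
 8 /* Updater, with gp_lock held: */
 9 oldp = gp;
10 rcu_assign_pointer(gp, newp);
11 synchronize_sched();
12 kfree(oldp);
</pre>
</blockquote>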
| 2504 | |
| 2505 | <h3><a name="Sleepable RCU">Sleepable RCU</a></h3> |
| 2506 | |
| 2507 | <p> |
| 2508 | For well over a decade, someone saying “I need to block within |
| 2509 | an RCU read-side critical section” was a reliable indication |
| 2510 | that this someone did not understand RCU. |
| 2511 | After all, if you are always blocking in an RCU read-side critical |
| 2512 | section, you can probably afford to use a higher-overhead synchronization |
| 2513 | mechanism. |
| 2514 | However, that changed with the advent of the Linux kernel's notifiers, |
| 2515 | whose RCU read-side critical |
| 2516 | sections almost never sleep, but sometimes need to. |
| 2517 | This resulted in the introduction of |
| 2518 | <a href="https://lwn.net/Articles/202847/">sleepable RCU</a>, |
| 2519 | or <i>SRCU</i>. |
| 2520 | |
| 2521 | <p> |
| 2522 | SRCU allows different domains to be defined, with each such domain |
| 2523 | defined by an instance of an <tt>srcu_struct</tt> structure. |
| 2524 | A pointer to this structure must be passed in to each SRCU function, |
| 2525 | for example, <tt>synchronize_srcu(&ss)</tt>, where |
| 2526 | <tt>ss</tt> is the <tt>srcu_struct</tt> structure. |
| 2527 | The key benefit of these domains is that a slow SRCU reader in one |
| 2528 | domain does not delay an SRCU grace period in some other domain. |
| 2529 | That said, one consequence of these domains is that read-side code |
| 2530 | must pass a “cookie” from <tt>srcu_read_lock()</tt> |
| 2531 | to <tt>srcu_read_unlock()</tt>, for example, as follows: |
| 2532 | |
| 2533 | <blockquote> |
| 2534 | <pre> |
| 2535 | 1 int idx; |
| 2536 | 2 |
| 2537 | 3 idx = srcu_read_lock(&ss); |
| 2538 | 4 do_something(); |
| 2539 | 5 srcu_read_unlock(&ss, idx); |
| 2540 | </pre> |
| 2541 | </blockquote> |
| 2542 | |
| 2543 | <p> |
As noted above, it is legal to block within SRCU read-side critical
sections; however, with great power comes great responsibility.
| 2546 | If you block forever in one of a given domain's SRCU read-side critical |
| 2547 | sections, then that domain's grace periods will also be blocked forever. |
| 2548 | Of course, one good way to block forever is to deadlock, which can |
| 2549 | happen if any operation in a given domain's SRCU read-side critical |
| 2550 | section can block waiting, either directly or indirectly, for that domain's |
| 2551 | grace period to elapse. |
| 2552 | For example, this results in a self-deadlock: |
| 2553 | |
| 2554 | <blockquote> |
| 2555 | <pre> |
| 2556 | 1 int idx; |
| 2557 | 2 |
| 2558 | 3 idx = srcu_read_lock(&ss); |
| 2559 | 4 do_something(); |
| 2560 | 5 synchronize_srcu(&ss); |
| 2561 | 6 srcu_read_unlock(&ss, idx); |
| 2562 | </pre> |
| 2563 | </blockquote> |
| 2564 | |
| 2565 | <p> |
| 2566 | However, if line 5 acquired a mutex that was held across |
| 2567 | a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>, |
| 2568 | deadlock would still be possible. |
| 2569 | Furthermore, if line 5 acquired a mutex that was held across |
| 2570 | a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>, |
| 2571 | and if an <tt>ss1</tt>-domain SRCU read-side critical section |
acquired another mutex that was held across an <tt>ss</tt>-domain
| 2573 | <tt>synchronize_srcu()</tt>, |
| 2574 | deadlock would again be possible. |
| 2575 | Such a deadlock cycle could extend across an arbitrarily large number |
| 2576 | of different SRCU domains. |
| 2577 | Again, with great power comes great responsibility. |
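
<p>
The following sketch (in which the mutexes and the assignment of
code to CPUs are hypothetical) illustrates one such cross-domain
deadlock cycle:

<blockquote>
<pre>
 1 /* CPU 0: an ss-domain reader acquires mutex m1. */
 2 i0 = srcu_read_lock(&ss);
 3 mutex_lock(&m1);          /* Waits on CPU 1. */
 4 mutex_unlock(&m1);
 5 srcu_read_unlock(&ss, i0);
 6
 7 /* CPU 1: holds m1 across an ss1-domain grace period. */
 8 mutex_lock(&m1);
 9 synchronize_srcu(&ss1);   /* Waits on CPU 2. */
10 mutex_unlock(&m1);
11
12 /* CPU 2: an ss1-domain reader acquires mutex m2. */
13 i2 = srcu_read_lock(&ss1);
14 mutex_lock(&m2);          /* Waits on CPU 3. */
15 mutex_unlock(&m2);
16 srcu_read_unlock(&ss1, i2);
17
18 /* CPU 3: holds m2 across an ss-domain grace period. */
19 mutex_lock(&m2);
20 synchronize_srcu(&ss);    /* Waits on CPU 0. */
21 mutex_unlock(&m2);
</pre>
</blockquote>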
| 2578 | |
| 2579 | <p> |
| 2580 | Unlike the other RCU flavors, SRCU read-side critical sections can |
| 2581 | run on idle and even offline CPUs. |
| 2582 | This ability requires that <tt>srcu_read_lock()</tt> and |
| 2583 | <tt>srcu_read_unlock()</tt> contain memory barriers, which means |
| 2584 | that SRCU readers will run a bit slower than would RCU readers. |
| 2585 | It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt> |
| 2586 | API, which, in combination with <tt>srcu_read_unlock()</tt>, |
| 2587 | guarantees a full memory barrier. |
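
<p>
A minimal sketch of this combination (<tt>do_something()</tt> and
<tt>later_code()</tt> are hypothetical) might be as follows:

<blockquote>
<pre>
1 idx = srcu_read_lock(&ss);
2 do_something();
3 srcu_read_unlock(&ss, idx);
4 smp_mb__after_srcu_read_unlock(); /* Full memory barrier. */
5 later_code();
</pre>
</blockquote>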
| 2588 | |
| 2589 | <p> |
| 2590 | The |
| 2591 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a> |
| 2592 | includes |
| 2593 | <tt>srcu_read_lock()</tt>, |
| 2594 | <tt>srcu_read_unlock()</tt>, |
| 2595 | <tt>srcu_dereference()</tt>, |
| 2596 | <tt>srcu_dereference_check()</tt>, |
| 2597 | <tt>synchronize_srcu()</tt>, |
| 2598 | <tt>synchronize_srcu_expedited()</tt>, |
| 2599 | <tt>call_srcu()</tt>, |
| 2600 | <tt>srcu_barrier()</tt>, and |
| 2601 | <tt>srcu_read_lock_held()</tt>. |
| 2602 | It also includes |
| 2603 | <tt>DEFINE_SRCU()</tt>, |
| 2604 | <tt>DEFINE_STATIC_SRCU()</tt>, and |
| 2605 | <tt>init_srcu_struct()</tt> |
| 2606 | APIs for defining and initializing <tt>srcu_struct</tt> structures. |
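
<p>
For example, a statically allocated domain might be defined and used
as follows (the <tt>my_srcu</tt> domain and
<tt>my_srcu_callback()</tt> function are hypothetical):

<blockquote>
<pre>
1 DEFINE_STATIC_SRCU(my_srcu);
2
3 /* Post a callback in this domain... */
4 call_srcu(&my_srcu, &p->rh, my_srcu_callback);
5
6 /* ...or wait synchronously for a grace period. */
7 synchronize_srcu(&my_srcu);
</pre>
</blockquote>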
| 2607 | |
| 2608 | <h3><a name="Tasks RCU">Tasks RCU</a></h3> |
| 2609 | |
| 2610 | <p> |
Some forms of tracing use “trampolines” to handle the
| 2612 | binary rewriting required to install different types of probes. |
| 2613 | It would be good to be able to free old trampolines, which sounds |
| 2614 | like a job for some form of RCU. |
| 2615 | However, because it is necessary to be able to install a trace |
| 2616 | anywhere in the code, it is not possible to use read-side markers |
| 2617 | such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. |
| 2618 | In addition, it does not work to have these markers in the trampoline |
| 2619 | itself, because there would need to be instructions following |
| 2620 | <tt>rcu_read_unlock()</tt>. |
| 2621 | Although <tt>synchronize_rcu()</tt> would guarantee that execution |
| 2622 | reached the <tt>rcu_read_unlock()</tt>, it would not be able to |
| 2623 | guarantee that execution had completely left the trampoline. |
| 2624 | |
| 2625 | <p> |
| 2626 | The solution, in the form of |
| 2627 | <a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>, |
| 2628 | is to have implicit |
| 2629 | read-side critical sections that are delimited by voluntary context |
| 2630 | switches, that is, calls to <tt>schedule()</tt>, |
| 2631 | <tt>cond_resched_rcu_qs()</tt>, and |
| 2632 | <tt>synchronize_rcu_tasks()</tt>. |
| 2633 | In addition, transitions to and from userspace execution also delimit |
| 2634 | tasks-RCU read-side critical sections. |
| 2635 | |
| 2636 | <p> |
| 2637 | The tasks-RCU API is quite compact, consisting only of |
| 2638 | <tt>call_rcu_tasks()</tt>, |
| 2639 | <tt>synchronize_rcu_tasks()</tt>, and |
| 2640 | <tt>rcu_barrier_tasks()</tt>. |
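
<p>
For example, a tracer might free an old trampoline as in the
following sketch, in which <tt>unbind_trampoline()</tt> and
<tt>free_trampoline()</tt> stand in for the actual bookkeeping:

<blockquote>
<pre>
 1 /* Prevent new calls into the old trampoline. */
 2 unbind_trampoline(tramp);
 3
 4 /* Wait for every task to voluntarily context switch or be
 5  * seen in userspace, so that none can still be executing
 6  * in the trampoline. */
 7 synchronize_rcu_tasks();
 8
 9 /* Now it is safe to free the trampoline. */
10 free_trampoline(tramp);
</pre>
</blockquote>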
| 2641 | |
| 2642 | <h2><a name="Possible Future Changes">Possible Future Changes</a></h2> |
| 2643 | |
| 2644 | <p> |
| 2645 | One of the tricks that RCU uses to attain update-side scalability is |
| 2646 | to increase grace-period latency with increasing numbers of CPUs. |
| 2647 | If this becomes a serious problem, it will be necessary to rework the |
| 2648 | grace-period state machine so as to avoid the need for the additional |
| 2649 | latency. |
| 2650 | |
| 2651 | <p> |
Expedited grace periods scan the CPUs, so their latency and overhead
increase with increasing numbers of CPUs.
| 2654 | If this becomes a serious problem on large systems, it will be necessary |
| 2655 | to do some redesign to avoid this scalability problem. |
| 2656 | |
| 2657 | <p> |
| 2658 | RCU disables CPU hotplug in a few places, perhaps most notably in the |
| 2659 | expedited grace-period and <tt>rcu_barrier()</tt> operations. |
| 2660 | If there is a strong reason to use expedited grace periods in CPU-hotplug |
| 2661 | notifiers, it will be necessary to avoid disabling CPU hotplug. |
| 2662 | This would introduce some complexity, so there had better be a <i>very</i> |
| 2663 | good reason. |
| 2664 | |
| 2665 | <p> |
| 2666 | The tradeoff between grace-period latency on the one hand and interruptions |
| 2667 | of other CPUs on the other hand may need to be re-examined. |
The ideal is of course zero grace-period latency combined with zero
interprocessor interrupts during an expedited grace-period
operation.
| 2671 | While this ideal is unlikely to be achievable, it is quite possible that |
| 2672 | further improvements can be made. |
| 2673 | |
| 2674 | <p> |
| 2675 | The multiprocessor implementations of RCU use a combining tree that |
| 2676 | groups CPUs so as to reduce lock contention and increase cache locality. |
| 2677 | However, this combining tree does not spread its memory across NUMA |
| 2678 | nodes nor does it align the CPU groups with hardware features such |
| 2679 | as sockets or cores. |
| 2680 | Such spreading and alignment is currently believed to be unnecessary |
| 2681 | because the hotpath read-side primitives do not access the combining |
| 2682 | tree, nor does <tt>call_rcu()</tt> in the common case. |
| 2683 | If you believe that your architecture needs such spreading and alignment, |
| 2684 | then your architecture should also benefit from the |
| 2685 | <tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set |
| 2686 | to the number of CPUs in a socket, NUMA node, or whatever. |
| 2687 | If the number of CPUs is too large, use a fraction of the number of |
| 2688 | CPUs. |
| 2689 | If the number of CPUs is a large prime number, well, that certainly |
| 2690 | is an “interesting” architectural choice! |
| 2691 | More flexible arrangements might be considered, but only if |
| 2692 | <tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only |
| 2693 | if the inadequacy has been demonstrated by a carefully run and |
| 2694 | realistic system-level workload. |
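
<p>
For example, on a hypothetical system built from 16-CPU sockets,
the leaf-level fanout might be set on the kernel boot command line
as follows:

<blockquote>
<pre>
rcutree.rcu_fanout_leaf=16
</pre>
</blockquote>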
| 2695 | |
| 2696 | <p> |
| 2697 | Please note that arrangements that require RCU to remap CPU numbers will |
| 2698 | require extremely good demonstration of need and full exploration of |
| 2699 | alternatives. |
| 2700 | |
| 2701 | <p> |
| 2702 | There is an embarrassingly large number of flavors of RCU, and this |
| 2703 | number has been increasing over time. |
| 2704 | Perhaps it will be possible to combine some at some future date. |
| 2705 | |
| 2706 | <p> |
| 2707 | RCU's various kthreads are reasonably recent additions. |
| 2708 | It is quite likely that adjustments will be required to more gracefully |
| 2709 | handle extreme loads. |
| 2710 | It might also be necessary to be able to relate CPU utilization by |
| 2711 | RCU's kthreads and softirq handlers to the code that instigated this |
| 2712 | CPU utilization. |
| 2713 | For example, RCU callback overhead might be charged back to the |
| 2714 | originating <tt>call_rcu()</tt> instance, though probably not |
| 2715 | in production kernels. |
| 2716 | |
| 2717 | <h2><a name="Summary">Summary</a></h2> |
| 2718 | |
| 2719 | <p> |
This document has presented more than two decades' worth of RCU
requirements.
| 2722 | Given that the requirements keep changing, this will not be the last |
| 2723 | word on this subject, but at least it serves to get an important |
| 2724 | subset of the requirements set forth. |
| 2725 | |
| 2726 | <h2><a name="Acknowledgments">Acknowledgments</a></h2> |
| 2727 | |
| 2728 | I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, |
| 2729 | Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and |
| 2730 | Andy Lutomirski for their help in rendering |
| 2731 | this article human readable, and to Michelle Rankin for her support |
| 2732 | of this effort. |
| 2733 | Other contributions are acknowledged in the Linux kernel's git archive. |
| 2734 | The cartoon is copyright (c) 2013 by Melissa Broussard, |
| 2735 | and is provided |
| 2736 | under the terms of the Creative Commons Attribution-Share Alike 3.0 |
| 2737 | United States license. |
| 2738 | |
| 2739 | <p>@@QQAL@@ |
| 2740 | |
| 2741 | </body></html> |