This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or only one memory location is accessed, and the
second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.  Some aspects of the C language are unordered.  For example,
    in the expression "f(x) + g(y)", the order in which f and g are
    called is not defined; the object code is allowed to use either
    order or even to interleave the computations.

2.  Compilers are permitted to use the "as-if" rule.  That is, a
    compiler can emit whatever code it likes for normal accesses,
    as long as the results of a single-threaded execution appear
    just as if the compiler had followed all the relevant rules.
    To see this, compile with a high level of optimization and run
    the debugger on the resulting binary.

3.  If there is only one variable but multiple CPUs, that variable
    must be properly aligned and all accesses to that variable must
    be full sized.  Variables that straddle cachelines or pages void
    your full-ordering warranty, as do undersized accesses that load
    from or store to only part of the variable.

4.  If there are multiple CPUs, accesses to shared variables should
    use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
    tearing, load/store fusing, and invented loads and stores, as
    sketched in the example following this list.  There are exceptions
    to this rule, including:

    i.  When there is no possibility of a given shared variable
        being updated by some other CPU, for example, while
        holding the update-side lock, reads from that variable
        need not use READ_ONCE().

    ii. When there is no possibility of a given shared variable
        being either read or updated by other CPUs, for example,
        when running during early boot, reads from that variable
        need not use READ_ONCE() and writes to that variable
        need not use WRITE_ONCE().
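
For example, consider a hypothetical stop_flag variable (not taken from
the Linux kernel) that one CPU sets and another CPU polls.  Using
WRITE_ONCE() for the store and READ_ONCE() for the polled load prevents
the compiler from tearing the store or from fusing the repeated loads
into a single load hoisted out of the loop:

        /* Hypothetical sketch, not from the Linux kernel. */
        int stop_flag;

        void CPU0(void)
        {
                /* Marked store: cannot be torn or invented. */
                WRITE_ONCE(stop_flag, 1);
        }

        void CPU1(void)
        {
                /* Marked load: cannot be fused out of the loop. */
                while (!READ_ONCE(stop_flag))
                        cpu_relax();
        }

Note that READ_ONCE() and WRITE_ONCE() constrain only the compiler:
they provide no ordering against accesses to other variables, which is
the subject of much of the rest of this document.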


Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

        /* See MP+polocks.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
                r1 = READ_ONCE(x);
        }

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

        /* See MP+porevlocks.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(y);
                spin_lock(&mylock);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
                WRITE_ONCE(y, 1);
        }

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.
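
For example, the following hypothetical three-CPU fragment (constructed
for this document rather than taken from an existing litmus test) chains
three critical sections of the same lock:

        /* Hypothetical extension of MP+polocks.litmus to three CPUs. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                spin_lock(&mylock);
                r1 = READ_ONCE(y);
                r2 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

If the final value of r0 is 1 (so that CPU0()'s critical section came
first) and the final value of r1 is 1 (so that CPU1()'s critical section
preceded CPU2()'s), then the final value of r2 must also be 1, because
each critical section sees everything seen or done by the critical
sections that preceded it.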

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

        /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.
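
In litmus-test notation, this outcome is queried by an "exists" clause,
which for this test should read something like the following (check the
actual Z6.0+pooncelock+pooncelock+pombonce.litmus file to be sure):

        exists (1:r0=1 /\ z=2 /\ 2:r1=0)

The memory model reports this outcome as possible.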

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

        /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                smp_mb__after_spinlock();
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.
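
For reference, the unordered version looks roughly like the following;
this is a paraphrase of MP+poonceonces.litmus rather than a verbatim
copy, so consult that file for the authoritative version:

        /* Roughly MP+poonceonces.litmus: no ordering whatsoever. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                r1 = READ_ONCE(x);
        }

Here nothing prevents the final outcome r0 == 1 && r1 == 0, which is
exactly the outcome that the techniques below rule out.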


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

        /* See MP+pooncerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                r1 = READ_ONCE(x);
        }

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack depot.  Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

        /* See MP+onceassign+derefonce.litmus. */
        int z;
        int *y = &z;
        int x;

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                rcu_assign_pointer(y, &x);
        }

        void CPU1(void)
        {
                rcu_read_lock();
                r0 = rcu_dereference(y);
                r1 = READ_ONCE(*r0);
                rcu_read_unlock();
        }

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
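
The three kinds of dependency can be sketched as follows.  These
fragments are illustrative only (the variables a and b are hypothetical
and the fragment is not drawn from any litmus test in the kernel tree):

        r0 = rcu_dereference(y);

        r1 = READ_ONCE(*r0);        /* Address dependency: r0 supplies the address. */
        WRITE_ONCE(a, (long)r0);    /* Data dependency: r0 supplies the stored value. */
        if (r0 == &x)               /* Control dependency: r0 determines whether... */
                WRITE_ONCE(b, 1);   /* ...this later store executes at all. */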

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

        /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_rmb();
                r1 = READ_ONCE(x);
        }

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

        log->l_curr_block -= log->l_logBBsize;
        ASSERT(log->l_curr_block >= 0);
        smp_wmb();
        log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

        cur_cycle = READ_ONCE(log->l_curr_cycle);
        smp_rmb();
        cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

         * kernel                             user
         *
         * if (LOAD ->data_tail) {            LOAD ->data_head
         *    (A) smp_rmb()       (C)
         *      STORE $data                   LOAD $data
         *      smp_wmb()         (B)         smp_mb()          (D)
         *      STORE ->data_head             STORE ->data_tail
         * }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

        /* See LB+fencembonceonce+ctrlonceonce.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                if (r0)
                        WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                smp_mb();
                WRITE_ONCE(x, 1);
        }

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

         * kernel                             user
         *
         * if (LOAD ->data_tail) {            LOAD ->data_head
         *    (A) smp_rmb()       (C)
         *      STORE $data                   LOAD $data
         *      smp_wmb()         (B)         smp_mb()          (D)
         *      STORE ->data_head             STORE ->data_tail
         * }
         *
         * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

        /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().
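
A sketch of that weaker variant (shown here for illustration, not taken
from a named litmus test) would change only CPU1():

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_store_release(&z, 1);
        }

The smp_store_release() still orders CPU1()'s load from y before its
store to z, so the release-acquire chain to CPU2() remains intact.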

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

        /* See Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

        /* See SB+fencembonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        void CPU1(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

         * CPU0 - waker                      CPU1 - waiter
         *
         *                                   for (;;) {
         *   @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
         *   smp_mb();                         // smp_mb() from set_current_state()
         *   if (waitqueue_active(wq_head))    if (@cond)
         *     wake_up(wq_head);                 break;
         *                                      schedule();
         *                                    }
         *                                    finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.
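
As a rough illustration of that simplification, consider the following
hypothetical fragment (constructed for this document, not the actual
wait-queue implementation), which places the store-buffering accesses of
both CPUs inside critical sections of the same lock:

        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

Because the two critical sections cannot overlap, whichever CPU acquires
mylock second is guaranteed to see the other CPU's store, so r0 and r1
cannot both have the final value of 0 and no smp_mb() is required.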


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.  Write-to-read, where the next CPU reads the value that the
    previous CPU wrote.  The LB litmus-test patterns contain only
    this type of relation.  In formal memory-modeling texts, this
    relation is called "reads-from" and is usually abbreviated "rf".

2.  Read-to-write, where the next CPU overwrites the value that the
    previous CPU read.  The SB litmus test contains only this type
    of relation.  In formal memory-modeling texts, this relation is
    often called "from-reads" and is sometimes abbreviated "fr".

3.  Write-to-write, where the next CPU overwrites the value written
    by the previous CPU.  The Z6.0 litmus test pattern contains a
    write-to-write relation between the last access of CPU1() and
    the first access of CPU2().  In formal memory-modeling texts,
    this relation is often called "coherence order" and is sometimes
    abbreviated "co".  In the C++ standard, it is instead called
    "modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o   If all links are write-to-read links, then the weakest
    possible ordering within each CPU suffices.  For example, in
    the LB litmus test, a control dependency was enough to do the
    job.

o   If all but one of the links are write-to-read links, then a
    release-acquire chain suffices.  Both the MP and the ISA2
    litmus tests illustrate this case.

o   If more than one of the links are something other than
    write-to-read links, then a full memory barrier is required
    between each successive pair of non-write-to-read links.  This
    case is illustrated by the Z6.0 litmus tests, both in the
    locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.
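
For readers who have not yet written a litmus test, a complete file
looks something like the sketch below.  This is modeled on the kernel's
SB+fencembonceonces.litmus rather than copied from it, so treat the
details as assumptions to be checked against the actual file in
tools/memory-model/litmus-tests/:

        C SB+fencembonceonces

        (* An SB test whose "exists" clause the model should report as
           never being satisfied. *)

        {}

        P0(int *x, int *y)
        {
                int r0;

                WRITE_ONCE(*x, 1);
                smp_mb();
                r0 = READ_ONCE(*y);
        }

        P1(int *x, int *y)
        {
                int r0;

                WRITE_ONCE(*y, 1);
                smp_mb();
                r0 = READ_ONCE(*x);
        }

        exists (0:r0=0 /\ 1:r0=0)

Such a test can be run through the memory model using the herd7 tool as
described in tools/memory-model/README, which reports whether the
outcome named in the "exists" clause is possible.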