blob: 03f58b11c2525c0e35dfadf714ab2495154c5908 [file] [log] [blame]
Paul E. McKenney1c27b642018-01-18 19:58:55 -08001This document provides "recipes", that is, litmus tests for commonly
2occurring situations, as well as a few that illustrate subtly broken but
3attractive nuisances. Many of these recipes include example code from
Paul E. McKenneycc9628b2020-07-22 16:08:23 -07004v5.7 of the Linux kernel.
Paul E. McKenney1c27b642018-01-18 19:58:55 -08005
6The first section covers simple special cases, the second section
7takes off the training wheels to cover more involved examples,
8and the third section provides a few rules of thumb.
9
10
11Simple special cases
12====================
13
14This section presents two simple special cases, the first being where
15there is only one CPU or only one memory location is accessed, and the
16second being use of that old concurrency workhorse, locking.
17
18
19Single CPU or single memory location
20------------------------------------
21
22If there is only one CPU on the one hand or only one variable
23on the other, the code will execute in order. There are (as
24usual) some things to be careful of:
25
261. Some aspects of the C language are unordered. For example,
27 in the expression "f(x) + g(y)", the order in which f and g are
28 called is not defined; the object code is allowed to use either
29 order or even to interleave the computations.
30
312. Compilers are permitted to use the "as-if" rule. That is, a
32 compiler can emit whatever code it likes for normal accesses,
33 as long as the results of a single-threaded execution appear
34 just as if the compiler had followed all the relevant rules.
35 To see this, compile with a high level of optimization and run
36 the debugger on the resulting binary.
37
383. If there is only one variable but multiple CPUs, that variable
39 must be properly aligned and all accesses to that variable must
40 be full sized. Variables that straddle cachelines or pages void
41 your full-ordering warranty, as do undersized accesses that load
42 from or store to only part of the variable.
43
444. If there are multiple CPUs, accesses to shared variables should
45 use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
46 tearing, load/store fusing, and invented loads and stores.
47 There are exceptions to this rule, including:
48
49 i. When there is no possibility of a given shared variable
50 being updated by some other CPU, for example, while
51 holding the update-side lock, reads from that variable
52 need not use READ_ONCE().
53
54 ii. When there is no possibility of a given shared variable
55 being either read or updated by other CPUs, for example,
56 when running during early boot, reads from that variable
57 need not use READ_ONCE() and writes to that variable
58 need not use WRITE_ONCE().
59
60
61Locking
62-------
63
64Locking is well-known and straightforward, at least if you don't think
65about it too hard. And the basic rule is indeed quite simple: Any CPU that
66has acquired a given lock sees any changes previously seen or made by any
67CPU before it released that same lock. Note that this statement is a bit
68stronger than "Any CPU holding a given lock sees all changes made by any
69CPU during the time that CPU was holding this same lock". For example,
70consider the following pair of code fragments:
71
72 /* See MP+polocks.litmus. */
73 void CPU0(void)
74 {
75 WRITE_ONCE(x, 1);
76 spin_lock(&mylock);
77 WRITE_ONCE(y, 1);
78 spin_unlock(&mylock);
79 }
80
81 void CPU1(void)
82 {
83 spin_lock(&mylock);
84 r0 = READ_ONCE(y);
85 spin_unlock(&mylock);
86 r1 = READ_ONCE(x);
87 }
88
89The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
90then both r0 and r1 must be set to the value 1. This also has the
91consequence that if the final value of r0 is equal to 1, then the final
92value of r1 must also be equal to 1. In contrast, the weaker rule would
93say nothing about the final value of r1.
94
95The converse to the basic rule also holds, as illustrated by the
96following litmus test:
97
98 /* See MP+porevlocks.litmus. */
99 void CPU0(void)
100 {
101 r0 = READ_ONCE(y);
102 spin_lock(&mylock);
103 r1 = READ_ONCE(x);
104 spin_unlock(&mylock);
105 }
106
107 void CPU1(void)
108 {
109 spin_lock(&mylock);
110 WRITE_ONCE(x, 1);
111 spin_unlock(&mylock);
112 WRITE_ONCE(y, 1);
113 }
114
115This converse to the basic rule guarantees that if CPU0() acquires
116mylock before CPU1(), then both r0 and r1 must be set to the value 0.
117This also has the consequence that if the final value of r1 is equal
118to 0, then the final value of r0 must also be equal to 0. In contrast,
119the weaker rule would say nothing about the final value of r0.
120
121These examples show only a single pair of CPUs, but the effects of the
122locking basic rule extend across multiple acquisitions of a given lock
123across multiple CPUs.
124
125However, it is not necessarily the case that accesses ordered by
126locking will be seen as ordered by CPUs not holding that lock.
127Consider this example:
128
Akira Yokosawa9725dd52020-05-10 13:37:14 +0900129 /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
Paul E. McKenney1c27b642018-01-18 19:58:55 -0800130 void CPU0(void)
131 {
132 spin_lock(&mylock);
133 WRITE_ONCE(x, 1);
134 WRITE_ONCE(y, 1);
135 spin_unlock(&mylock);
136 }
137
138 void CPU1(void)
139 {
140 spin_lock(&mylock);
141 r0 = READ_ONCE(y);
142 WRITE_ONCE(z, 1);
143 spin_unlock(&mylock);
144 }
145
146 void CPU2(void)
147 {
148 WRITE_ONCE(z, 2);
149 smp_mb();
150 r1 = READ_ONCE(x);
151 }
152
153Counter-intuitive though it might be, it is quite possible to have
154the final value of r0 be 1, the final value of z be 2, and the final
155value of r1 be 0. The reason for this surprising outcome is that
156CPU2() never acquired the lock, and thus did not benefit from the
157lock's ordering properties.
158
159Ordering can be extended to CPUs not holding the lock by careful use
160of smp_mb__after_spinlock():
161
162 /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
163 void CPU0(void)
164 {
165 spin_lock(&mylock);
166 WRITE_ONCE(x, 1);
167 WRITE_ONCE(y, 1);
168 spin_unlock(&mylock);
169 }
170
171 void CPU1(void)
172 {
173 spin_lock(&mylock);
174 smp_mb__after_spinlock();
175 r0 = READ_ONCE(y);
176 WRITE_ONCE(z, 1);
177 spin_unlock(&mylock);
178 }
179
180 void CPU2(void)
181 {
182 WRITE_ONCE(z, 2);
183 smp_mb();
184 r1 = READ_ONCE(x);
185 }
186
187This addition of smp_mb__after_spinlock() strengthens the lock acquisition
188sufficiently to rule out the counter-intuitive outcome.
189
190
191Taking off the training wheels
192==============================
193
194This section looks at more complex examples, including message passing,
195load buffering, release-acquire chains, store buffering.
196Many classes of litmus tests have abbreviated names, which may be found
197here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
198
199
200Message passing (MP)
201--------------------
202
203The MP pattern has one CPU execute a pair of stores to a pair of variables
204and another CPU execute a pair of loads from this same pair of variables,
205but in the opposite order. The goal is to avoid the counter-intuitive
206outcome in which the first load sees the value written by the second store
207but the second load does not see the value written by the first store.
208In the absence of any ordering, this goal may not be met, as can be seen
209in the MP+poonceonces.litmus litmus test. This section therefore looks at
210a number of ways of meeting this goal.
211
212
213Release and acquire
214~~~~~~~~~~~~~~~~~~~
215
216Use of smp_store_release() and smp_load_acquire() is one way to force
217the desired MP ordering. The general approach is shown below:
218
219 /* See MP+pooncerelease+poacquireonce.litmus. */
220 void CPU0(void)
221 {
222 WRITE_ONCE(x, 1);
223 smp_store_release(&y, 1);
224 }
225
226 void CPU1(void)
227 {
228 r0 = smp_load_acquire(&y);
229 r1 = READ_ONCE(x);
230 }
231
232The smp_store_release() macro orders any prior accesses against the
233store, while the smp_load_acquire macro orders the load against any
234subsequent accesses. Therefore, if the final value of r0 is the value 1,
235the final value of r1 must also be the value 1.
236
237The init_stack_slab() function in lib/stackdepot.c uses release-acquire
238in this way to safely initialize of a slab of the stack. Working out
239the mutual-exclusion design is left as an exercise for the reader.
240
241
242Assign and dereference
243~~~~~~~~~~~~~~~~~~~~~~
244
245Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
246use of smp_store_release() and smp_load_acquire(), except that both
247rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
248pointers. The general approach is shown below:
249
250 /* See MP+onceassign+derefonce.litmus. */
251 int z;
252 int *y = &z;
253 int x;
254
255 void CPU0(void)
256 {
257 WRITE_ONCE(x, 1);
258 rcu_assign_pointer(y, &x);
259 }
260
261 void CPU1(void)
262 {
263 rcu_read_lock();
264 r0 = rcu_dereference(y);
265 r1 = READ_ONCE(*r0);
266 rcu_read_unlock();
267 }
268
269In this example, if the final value of r0 is &x then the final value of
270r1 must be 1.
271
272The rcu_assign_pointer() macro has the same ordering properties as does
273smp_store_release(), but the rcu_dereference() macro orders the load only
274against later accesses that depend on the value loaded. A dependency
275is present if the value loaded determines the address of a later access
276(address dependency, as shown above), the value written by a later store
277(data dependency), or whether or not a later store is executed in the
278first place (control dependency). Note that the term "data dependency"
279is sometimes casually used to cover both address and data dependencies.
280
Paul E. McKenneycc9628b2020-07-22 16:08:23 -0700281In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
Paul E. McKenney1c27b642018-01-18 19:58:55 -0800282rcu_assign_pointer(), and the next_prime_number() function invokes
283rcu_dereference(). This combination mediates access to a bit vector
284that is expanded as additional primes are needed.
285
286
287Write and read memory barriers
288~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
289
290It is usually better to use smp_store_release() instead of smp_wmb()
291and to use smp_load_acquire() instead of smp_rmb(). However, the older
292smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
293to understand their use cases. The general approach is shown below:
294
Andrea Parri71b7ff52018-07-16 11:06:05 -0700295 /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
Paul E. McKenney1c27b642018-01-18 19:58:55 -0800296 void CPU0(void)
297 {
298 WRITE_ONCE(x, 1);
299 smp_wmb();
300 WRITE_ONCE(y, 1);
301 }
302
303 void CPU1(void)
304 {
305 r0 = READ_ONCE(y);
306 smp_rmb();
307 r1 = READ_ONCE(x);
308 }
309
310The smp_wmb() macro orders prior stores against later stores, and the
311smp_rmb() macro orders prior loads against later loads. Therefore, if
312the final value of r0 is 1, the final value of r1 must also be 1.
313
SeongJae Park3d2046a2018-09-26 11:29:18 -0700314The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
Paul E. McKenney1c27b642018-01-18 19:58:55 -0800315the following write-side code fragment:
316
317 log->l_curr_block -= log->l_logBBsize;
318 ASSERT(log->l_curr_block >= 0);
319 smp_wmb();
320 log->l_curr_cycle++;
321
322And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
323the corresponding read-side code fragment:
324
Mark Rutland5bde06b2018-07-16 11:05:56 -0700325 cur_cycle = READ_ONCE(log->l_curr_cycle);
Paul E. McKenney1c27b642018-01-18 19:58:55 -0800326 smp_rmb();
Mark Rutland5bde06b2018-07-16 11:05:56 -0700327 cur_block = READ_ONCE(log->l_curr_block);
Paul E. McKenney1c27b642018-01-18 19:58:55 -0800328
329Alternatively, consider the following comment in function
330perf_output_put_handle() in kernel/events/ring_buffer.c:
331
332 * kernel user
333 *
334 * if (LOAD ->data_tail) { LOAD ->data_head
335 * (A) smp_rmb() (C)
336 * STORE $data LOAD $data
337 * smp_wmb() (B) smp_mb() (D)
338 * STORE ->data_head STORE ->data_tail
339 * }
340
341The B/C pairing is an example of the MP pattern using smp_wmb() on the
342write side and smp_rmb() on the read side.
343
344Of course, given that smp_mb() is strictly stronger than either smp_wmb()
345or smp_rmb(), any code fragment that would work with smp_rmb() and
346smp_wmb() would also work with smp_mb() replacing either or both of the
347weaker barriers.
348
349
350Load buffering (LB)
351-------------------
352
353The LB pattern has one CPU load from one variable and then store to a
354second, while another CPU loads from the second variable and then stores
355to the first. The goal is to avoid the counter-intuitive situation where
356each load reads the value written by the other CPU's store. In the
357absence of any ordering it is quite possible that this may happen, as
358can be seen in the LB+poonceonces.litmus litmus test.
359
360One way of avoiding the counter-intuitive outcome is through the use of a
361control dependency paired with a full memory barrier:
362
Andrea Parri71b7ff52018-07-16 11:06:05 -0700363 /* See LB+fencembonceonce+ctrlonceonce.litmus. */
Paul E. McKenney1c27b642018-01-18 19:58:55 -0800364 void CPU0(void)
365 {
366 r0 = READ_ONCE(x);
367 if (r0)
368 WRITE_ONCE(y, 1);
369 }
370
371 void CPU1(void)
372 {
373 r1 = READ_ONCE(y);
374 smp_mb();
375 WRITE_ONCE(x, 1);
376 }
377
378This pairing of a control dependency in CPU0() with a full memory
379barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.
380
381The A/D pairing from the ring-buffer use case shown earlier also
382illustrates LB. Here is a repeat of the comment in
383perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
384control dependency on the kernel side and a full memory barrier on
385the user side:
386
387 * kernel user
388 *
389 * if (LOAD ->data_tail) { LOAD ->data_head
390 * (A) smp_rmb() (C)
391 * STORE $data LOAD $data
392 * smp_wmb() (B) smp_mb() (D)
393 * STORE ->data_head STORE ->data_tail
394 * }
395 *
396 * Where A pairs with D, and B pairs with C.
397
398The kernel's control dependency between the load from ->data_tail
399and the store to data combined with the user's full memory barrier
400between the load from data and the store to ->data_tail prevents
401the counter-intuitive outcome where the kernel overwrites the data
402before the user gets done loading it.
403
404
405Release-acquire chains
406----------------------
407
408Release-acquire chains are a low-overhead, flexible, and easy-to-use
409method of maintaining order. However, they do have some limitations that
410need to be fully understood. Here is an example that maintains order:
411
412 /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
413 void CPU0(void)
414 {
415 WRITE_ONCE(x, 1);
416 smp_store_release(&y, 1);
417 }
418
419 void CPU1(void)
420 {
421 r0 = smp_load_acquire(y);
422 smp_store_release(&z, 1);
423 }
424
425 void CPU2(void)
426 {
427 r1 = smp_load_acquire(z);
428 r2 = READ_ONCE(x);
429 }
430
431In this case, if r0 and r1 both have final values of 1, then r2 must
432also have a final value of 1.
433
434The ordering in this example is stronger than it needs to be. For
435example, ordering would still be preserved if CPU1()'s smp_load_acquire()
436invocation was replaced with READ_ONCE().
437
438It is tempting to assume that CPU0()'s store to x is globally ordered
439before CPU1()'s store to z, but this is not the case:
440
441 /* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
442 void CPU0(void)
443 {
444 WRITE_ONCE(x, 1);
445 smp_store_release(&y, 1);
446 }
447
448 void CPU1(void)
449 {
450 r0 = smp_load_acquire(y);
451 smp_store_release(&z, 1);
452 }
453
454 void CPU2(void)
455 {
456 WRITE_ONCE(z, 2);
457 smp_mb();
458 r1 = READ_ONCE(x);
459 }
460
461One might hope that if the final value of r0 is 1 and the final value
462of z is 2, then the final value of r1 must also be 1, but it really is
463possible for r1 to have the final value of 0. The reason, of course,
464is that in this version, CPU2() is not part of the release-acquire chain.
465This situation is accounted for in the rules of thumb below.
466
467Despite this limitation, release-acquire chains are low-overhead as
468well as simple and powerful, at least as memory-ordering mechanisms go.
469
470
471Store buffering
472---------------
473
474Store buffering can be thought of as upside-down load buffering, so
475that one CPU first stores to one variable and then loads from a second,
476while another CPU stores to the second variable and then loads from the
477first. Preserving order requires nothing less than full barriers:
478
Andrea Parri71b7ff52018-07-16 11:06:05 -0700479 /* See SB+fencembonceonces.litmus. */
Paul E. McKenney1c27b642018-01-18 19:58:55 -0800480 void CPU0(void)
481 {
482 WRITE_ONCE(x, 1);
483 smp_mb();
484 r0 = READ_ONCE(y);
485 }
486
487 void CPU1(void)
488 {
489 WRITE_ONCE(y, 1);
490 smp_mb();
491 r1 = READ_ONCE(x);
492 }
493
494Omitting either smp_mb() will allow both r0 and r1 to have final
495values of 0, but providing both full barriers as shown above prevents
496this counter-intuitive outcome.
497
498This pattern most famously appears as part of Dekker's locking
499algorithm, but it has a much more practical use within the Linux kernel
500of ordering wakeups. The following comment taken from waitqueue_active()
501in include/linux/wait.h shows the canonical pattern:
502
503 * CPU0 - waker CPU1 - waiter
504 *
505 * for (;;) {
506 * @cond = true; prepare_to_wait(&wq_head, &wait, state);
507 * smp_mb(); // smp_mb() from set_current_state()
508 * if (waitqueue_active(wq_head)) if (@cond)
509 * wake_up(wq_head); break;
510 * schedule();
511 * }
512 * finish_wait(&wq_head, &wait);
513
514On CPU0, the store is to @cond and the load is in waitqueue_active().
515On CPU1, prepare_to_wait() contains both a store to wq_head and a call
516to set_current_state(), which contains an smp_mb() barrier; the load is
517"if (@cond)". The full barriers prevent the undesirable outcome where
518CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.
519
520Note that use of locking can greatly simplify this pattern.
521
522
523Rules of thumb
524==============
525
526There might seem to be no pattern governing what ordering primitives are
527needed in which situations, but this is not the case. There is a pattern
528based on the relation between the accesses linking successive CPUs in a
529given litmus test. There are three types of linkage:
530
5311. Write-to-read, where the next CPU reads the value that the
532 previous CPU wrote. The LB litmus-test patterns contain only
533 this type of relation. In formal memory-modeling texts, this
534 relation is called "reads-from" and is usually abbreviated "rf".
535
5362. Read-to-write, where the next CPU overwrites the value that the
537 previous CPU read. The SB litmus test contains only this type
538 of relation. In formal memory-modeling texts, this relation is
539 often called "from-reads" and is sometimes abbreviated "fr".
540
5413. Write-to-write, where the next CPU overwrites the value written
542 by the previous CPU. The Z6.0 litmus test pattern contains a
543 write-to-write relation between the last access of CPU1() and
544 the first access of CPU2(). In formal memory-modeling texts,
545 this relation is often called "coherence order" and is sometimes
546 abbreviated "co". In the C++ standard, it is instead called
547 "modification order" and often abbreviated "mo".
548
549The strength of memory ordering required for a given litmus test to
550avoid a counter-intuitive outcome depends on the types of relations
551linking the memory accesses for the outcome in question:
552
553o If all links are write-to-read links, then the weakest
554 possible ordering within each CPU suffices. For example, in
555 the LB litmus test, a control dependency was enough to do the
556 job.
557
558o If all but one of the links are write-to-read links, then a
559 release-acquire chain suffices. Both the MP and the ISA2
560 litmus tests illustrate this case.
561
562o If more than one of the links are something other than
563 write-to-read links, then a full memory barrier is required
564 between each successive pair of non-write-to-read links. This
565 case is illustrated by the Z6.0 litmus tests, both in the
566 locking and in the release-acquire sections.
567
568However, if you find yourself having to stretch these rules of thumb
569to fit your situation, you should consider creating a litmus test and
570running it on the model.