This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or only one memory location is accessed, and the
second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.  Some aspects of the C language are unordered.  For example,
    in the expression "f(x) + g(y)", the order in which f and g are
    called is not defined; the object code is allowed to use either
    order or even to interleave the computations.

2.  Compilers are permitted to use the "as-if" rule.  That is, a
    compiler can emit whatever code it likes for normal accesses,
    as long as the results of a single-threaded execution appear
    just as if the compiler had followed all the relevant rules.
    To see this, compile with a high level of optimization and run
    the debugger on the resulting binary.

3.  If there is only one variable but multiple CPUs, that variable
    must be properly aligned and all accesses to that variable must
    be full sized.  Variables that straddle cachelines or pages void
    your full-ordering warranty, as do undersized accesses that load
    from or store to only part of the variable.

4.  If there are multiple CPUs, accesses to shared variables should
    use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
    tearing, load/store fusing, and invented loads and stores, as
    sketched in the example following this list.  There are exceptions
    to this rule, including:

    i.  When there is no possibility of a given shared variable
        being updated by some other CPU, for example, while
        holding the update-side lock, reads from that variable
        need not use READ_ONCE().

    ii. When there is no possibility of a given shared variable
        being either read or updated by other CPUs, for example,
        when running during early boot, reads from that variable
        need not use READ_ONCE() and writes to that variable
        need not use WRITE_ONCE().
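
For example, consider a hypothetical stop_flag variable (not taken from
the Linux kernel) that one CPU sets and another CPU polls.  Using
WRITE_ONCE() for the store and READ_ONCE() for the polled load prevents
the compiler from tearing the store or from fusing the repeated loads
into a single load hoisted out of the loop:

        /* Hypothetical sketch, not from the Linux kernel. */
        int stop_flag;

        void CPU0(void)
        {
                /* Marked store: cannot be torn or invented. */
                WRITE_ONCE(stop_flag, 1);
        }

        void CPU1(void)
        {
                /* Marked load: cannot be fused out of the loop. */
                while (!READ_ONCE(stop_flag))
                        cpu_relax();
        }

Note that READ_ONCE() and WRITE_ONCE() constrain only the compiler:
they provide no ordering against accesses to other variables, which is
the subject of much of the rest of this document.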


Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

        /* See MP+polocks.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
                r1 = READ_ONCE(x);
        }

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.

The converse to the basic rule also holds, as illustrated by the
following litmus test:

        /* See MP+porevlocks.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(y);
                spin_lock(&mylock);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
                WRITE_ONCE(y, 1);
        }

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.
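
For example, the following hypothetical three-CPU fragment (constructed
for this document rather than taken from an existing litmus test) chains
three critical sections of the same lock:

        /* Hypothetical extension of MP+polocks.litmus to three CPUs. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                spin_lock(&mylock);
                r1 = READ_ONCE(y);
                r2 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

If the final value of r0 is 1 (so that CPU0()'s critical section came
first) and the final value of r1 is 1 (so that CPU1()'s critical section
preceded CPU2()'s), then the final value of r2 must also be 1, because
each critical section sees everything seen or done by the critical
sections that preceded it.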

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

        /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.
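
In litmus-test notation, this outcome is queried by an "exists" clause,
which for this test should read something like the following (check the
actual Z6.0+pooncelock+pooncelock+pombonce.litmus file to be sure):

        exists (1:r0=1 /\ z=2 /\ 2:r1=0)

The memory model reports this outcome as possible.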

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

        /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                smp_mb__after_spinlock();
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test.  This section therefore looks at
a number of ways of meeting this goal.
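
For reference, the unordered version looks roughly like the following;
this is a paraphrase of MP+poonceonces.litmus rather than a verbatim
copy, so consult that file for the authoritative version:

        /* Roughly MP+poonceonces.litmus: no ordering whatsoever. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                r1 = READ_ONCE(x);
        }

Here nothing prevents the final outcome r0 == 1 && r1 == 0, which is
exactly the outcome that the techniques below rule out.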


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

        /* See MP+pooncerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                r1 = READ_ONCE(x);
        }

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack depot.  Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

        /* See MP+onceassign+derefonce.litmus. */
        int z;
        int *y = &z;
        int x;

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                rcu_assign_pointer(y, &x);
        }

        void CPU1(void)
        {
                rcu_read_lock();
                r0 = rcu_dereference(y);
                r1 = READ_ONCE(*r0);
                rcu_read_unlock();
        }

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.
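
The three kinds of dependency can be sketched as follows.  These
fragments are illustrative only (the variables a and b are hypothetical
and the fragment is not drawn from any litmus test in the kernel tree):

        r0 = rcu_dereference(y);

        r1 = READ_ONCE(*r0);        /* Address dependency: r0 supplies the address. */
        WRITE_ONCE(a, (long)r0);    /* Data dependency: r0 supplies the stored value. */
        if (r0 == &x)               /* Control dependency: r0 determines whether... */
                WRITE_ONCE(b, 1);   /* ...this later store executes at all. */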

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

        /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_rmb();
                r1 = READ_ONCE(x);
        }

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

        log->l_curr_block -= log->l_logBBsize;
        ASSERT(log->l_curr_block >= 0);
        smp_wmb();
        log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

        cur_cycle = READ_ONCE(log->l_curr_cycle);
        smp_rmb();
        cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

         * kernel                             user
         *
         * if (LOAD ->data_tail) {            LOAD ->data_head
         *    (A) smp_rmb()       (C)
         *      STORE $data                   LOAD $data
         *      smp_wmb()         (B)         smp_mb()          (D)
         *      STORE ->data_head             STORE ->data_tail
         * }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test.

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

        /* See LB+fencembonceonce+ctrlonceonce.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                if (r0)
                        WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                smp_mb();
                WRITE_ONCE(x, 1);
        }

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

         * kernel                             user
         *
         * if (LOAD ->data_tail) {            LOAD ->data_head
         *    (A) smp_rmb()       (C)
         *      STORE $data                   LOAD $data
         *      smp_wmb()         (B)         smp_mb()          (D)
         *      STORE ->data_head             STORE ->data_tail
         * }
         *
         * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

        /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation was replaced with READ_ONCE().
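
A sketch of that weaker variant (shown here for illustration, not taken
from a named litmus test) would change only CPU1():

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_store_release(&z, 1);
        }

The smp_store_release() still orders CPU1()'s load from y before its
store to z, so the release-acquire chain to CPU2() remains intact.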

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

        /* See Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

        /* See SB+fencembonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        void CPU1(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

         * CPU0 - waker                      CPU1 - waiter
         *
         *                                   for (;;) {
         *   @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
         *   smp_mb();                         // smp_mb() from set_current_state()
         *   if (waitqueue_active(wq_head))    if (@cond)
         *     wake_up(wq_head);                 break;
         *                                      schedule();
         *                                    }
         *                                    finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.

Note that use of locking can greatly simplify this pattern.
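
As a rough illustration of that simplification, consider the following
hypothetical fragment (constructed for this document, not the actual
wait-queue implementation), which places the store-buffering accesses of
both CPUs inside critical sections of the same lock:

        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

Because the two critical sections cannot overlap, whichever CPU acquires
mylock second is guaranteed to see the other CPU's store, so r0 and r1
cannot both have the final value of 0 and no smp_mb() is required.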


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.  Write-to-read, where the next CPU reads the value that the
    previous CPU wrote.  The LB litmus-test patterns contain only
    this type of relation.  In formal memory-modeling texts, this
    relation is called "reads-from" and is usually abbreviated "rf".

2.  Read-to-write, where the next CPU overwrites the value that the
    previous CPU read.  The SB litmus test contains only this type
    of relation.  In formal memory-modeling texts, this relation is
    often called "from-reads" and is sometimes abbreviated "fr".

3.  Write-to-write, where the next CPU overwrites the value written
    by the previous CPU.  The Z6.0 litmus test pattern contains a
    write-to-write relation between the last access of CPU1() and
    the first access of CPU2().  In formal memory-modeling texts,
    this relation is often called "coherence order" and is sometimes
    abbreviated "co".  In the C++ standard, it is instead called
    "modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o   If all links are write-to-read links, then the weakest
    possible ordering within each CPU suffices.  For example, in
    the LB litmus test, a control dependency was enough to do the
    job.

o   If all but one of the links are write-to-read links, then a
    release-acquire chain suffices.  Both the MP and the ISA2
    litmus tests illustrate this case.

o   If more than one of the links are something other than
    write-to-read links, then a full memory barrier is required
    between each successive pair of non-write-to-read links.  This
    case is illustrated by the Z6.0 litmus tests, both in the
    locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.
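
For readers who have not yet written a litmus test, a complete file
looks something like the sketch below.  This is modeled on the kernel's
SB+fencembonceonces.litmus rather than copied from it, so treat the
details as assumptions to be checked against the actual file in
tools/memory-model/litmus-tests/:

        C SB+fencembonceonces

        (* An SB test whose "exists" clause the model should report as
           never being satisfied. *)

        {}

        P0(int *x, int *y)
        {
                int r0;

                WRITE_ONCE(*x, 1);
                smp_mb();
                r0 = READ_ONCE(*y);
        }

        P1(int *x, int *y)
        {
                int r0;

                WRITE_ONCE(*y, 1);
                smp_mb();
                r0 = READ_ONCE(*x);
        }

        exists (0:r0=0 /\ 1:r0=0)

Such a test can be run through the memory model using the herd7 tool as
described in tools/memory-model/README, which reports whether the
outcome named in the "exists" clause is possible.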