<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>A Tour Through RCU's Requirements [LWN.net]</title>
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">

<h1>A Tour Through RCU's Requirements</h1>

<p>Copyright IBM Corporation, 2015</p>
<p>Author: Paul E. McKenney</p>
<p><i>The initial version of this document appeared in the
<a href="https://lwn.net/">LWN</a> articles
<a href="https://lwn.net/Articles/652156/">here</a>,
<a href="https://lwn.net/Articles/652677/">here</a>, and
<a href="https://lwn.net/Articles/653326/">here</a>.</i></p>

<h2>Introduction</h2>

<p>
Read-copy update (RCU) is a synchronization mechanism that is often
used as a replacement for reader-writer locking.
RCU is unusual in that updaters do not block readers,
which means that RCU's read-side primitives can be exceedingly fast
and scalable.
In addition, updaters can make useful forward progress concurrently
with readers.
However, all this concurrency between RCU readers and updaters does raise
the question of exactly what RCU readers are doing, which in turn
raises the question of exactly what RCU's requirements are.

<p>
This document therefore summarizes RCU's requirements, and can be thought
of as an informal, high-level specification for RCU.
It is important to understand that RCU's specification is primarily
empirical in nature;
in fact, I learned about many of these requirements the hard way.
This situation might cause some consternation; however, not only
has this learning process been a lot of fun, but it has also been
a great privilege to work with so many people willing to apply
technologies in interesting new ways.

<p>
All that aside, here are the categories of currently known RCU requirements:
</p>

<ol>
<li>	<a href="#Fundamental Requirements">
	Fundamental Requirements</a>
<li>	<a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a>
<li>	<a href="#Parallelism Facts of Life">
	Parallelism Facts of Life</a>
<li>	<a href="#Quality-of-Implementation Requirements">
	Quality-of-Implementation Requirements</a>
<li>	<a href="#Linux Kernel Complications">
	Linux Kernel Complications</a>
<li>	<a href="#Software-Engineering Requirements">
	Software-Engineering Requirements</a>
<li>	<a href="#Other RCU Flavors">
	Other RCU Flavors</a>
<li>	<a href="#Possible Future Changes">
	Possible Future Changes</a>
</ol>

<p>
This is followed by a <a href="#Summary">summary</a>,
which is in turn followed by the inevitable
<a href="#Answers to Quick Quizzes">answers to the quick quizzes</a>.

<h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2>

<p>
RCU's fundamental requirements are the closest thing RCU has to hard
mathematical requirements.
These are:

<ol>
<li>	<a href="#Grace-Period Guarantee">
	Grace-Period Guarantee</a>
<li>	<a href="#Publish-Subscribe Guarantee">
	Publish-Subscribe Guarantee</a>
<li>	<a href="#Memory-Barrier Guarantees">
	Memory-Barrier Guarantees</a>
<li>	<a href="#RCU Primitives Guaranteed to Execute Unconditionally">
	RCU Primitives Guaranteed to Execute Unconditionally</a>
<li>	<a href="#Guaranteed Read-to-Write Upgrade">
	Guaranteed Read-to-Write Upgrade</a>
</ol>

<h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3>

<p>
RCU's grace-period guarantee is unusual in being premeditated:
Jack Slingwine and I had this guarantee firmly in mind when we started
work on RCU (then called “rclock”) in the early 1990s.
That said, the past two decades of experience with RCU have produced
a much more detailed understanding of this guarantee.

<p>
RCU's grace-period guarantee allows updaters to wait for the completion
of all pre-existing RCU read-side critical sections.
An RCU read-side critical section
begins with the marker <tt>rcu_read_lock()</tt> and ends with
the marker <tt>rcu_read_unlock()</tt>.
These markers may be nested, and RCU treats a nested set as one
big RCU read-side critical section.
Production-quality implementations of <tt>rcu_read_lock()</tt> and
<tt>rcu_read_unlock()</tt> are extremely lightweight, and in
fact have exactly zero overhead in Linux kernels built for production
use with <tt>CONFIG_PREEMPT=n</tt>.

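<p>
For illustration, here is a minimal sketch of such nesting, in which
only the outermost pair of markers delimits the critical section
(<tt>do_this()</tt> and <tt>do_that()</tt> are hypothetical functions):

<blockquote>
<pre>
 1 void nested_reader(void)
 2 {
 3   rcu_read_lock();   /* Critical section begins. */
 4   do_this();
 5   rcu_read_lock();   /* Nesting is legal... */
 6   do_that();
 7   rcu_read_unlock(); /* ...and this does not end the critical section. */
 8   rcu_read_unlock(); /* Critical section ends here. */
 9 }
</pre>
</blockquote>
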
<p>
This guarantee allows ordering to be enforced with extremely low
overhead to readers, for example:

<blockquote>
<pre>
 1 int x, y;
 2
 3 void thread0(void)
 4 {
 5   rcu_read_lock();
 6   r1 = READ_ONCE(x);
 7   r2 = READ_ONCE(y);
 8   rcu_read_unlock();
 9 }
10
11 void thread1(void)
12 {
13   WRITE_ONCE(x, 1);
14   synchronize_rcu();
15   WRITE_ONCE(y, 1);
16 }
</pre>
</blockquote>

<p>
Because the <tt>synchronize_rcu()</tt> on line 14 waits for
all pre-existing readers, any instance of <tt>thread0()</tt> that
loads a value of zero from <tt>x</tt> must complete before
<tt>thread1()</tt> stores to <tt>y</tt>, so that instance must
also load a value of zero from <tt>y</tt>.
Similarly, any instance of <tt>thread0()</tt> that loads a value of
one from <tt>y</tt> must have started after the
<tt>synchronize_rcu()</tt> started, and must therefore also load
a value of one from <tt>x</tt>.
Therefore, the outcome:
<blockquote>
<pre>
(r1 == 0 && r2 == 1)
</pre>
</blockquote>
cannot happen.

<p>@@QQ@@
Wait a minute!
You said that updaters can make useful forward progress concurrently
with readers, but pre-existing readers will block
<tt>synchronize_rcu()</tt>!!!
Just who are you trying to fool???
<p>@@QQA@@
First, if updaters do not wish to be blocked by readers, they can use
<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will
be discussed later.
Second, even when using <tt>synchronize_rcu()</tt>, the other
update-side code does run concurrently with readers, whether pre-existing
or not.
<p>@@QQE@@

<p>
This scenario resembles one of the first uses of RCU in
<a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>,
which managed a distributed lock manager's transition into
a state suitable for handling recovery from node failure,
more or less as follows:

<blockquote>
<pre>
 1 #define STATE_NORMAL        0
 2 #define STATE_WANT_RECOVERY 1
 3 #define STATE_RECOVERING    2
 4 #define STATE_WANT_NORMAL   3
 5
 6 int state = STATE_NORMAL;
 7
 8 void do_something_dlm(void)
 9 {
10   int state_snap;
11
12   rcu_read_lock();
13   state_snap = READ_ONCE(state);
14   if (state_snap == STATE_NORMAL)
15     do_something();
16   else
17     do_something_carefully();
18   rcu_read_unlock();
19 }
20
21 void start_recovery(void)
22 {
23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
24   synchronize_rcu();
25   WRITE_ONCE(state, STATE_RECOVERING);
26   recovery();
27   WRITE_ONCE(state, STATE_WANT_NORMAL);
28   synchronize_rcu();
29   WRITE_ONCE(state, STATE_NORMAL);
30 }
</pre>
</blockquote>

<p>
The RCU read-side critical section in <tt>do_something_dlm()</tt>
works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt>
to guarantee that <tt>do_something()</tt> never runs concurrently
with <tt>recovery()</tt>, but with little or no synchronization
overhead in <tt>do_something_dlm()</tt>.

<p>@@QQ@@
Why is the <tt>synchronize_rcu()</tt> on line 28 needed?
<p>@@QQA@@
Without that extra grace period, memory reordering could result in
<tt>do_something_dlm()</tt> executing <tt>do_something()</tt>
concurrently with the last bits of <tt>recovery()</tt>.
<p>@@QQE@@

<p>
In order to avoid fatal problems such as deadlocks,
an RCU read-side critical section must not contain calls to
<tt>synchronize_rcu()</tt>.
Similarly, an RCU read-side critical section must not
contain anything that waits, directly or indirectly, on completion of
an invocation of <tt>synchronize_rcu()</tt>.

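<p>
As an illustration, the following minimal sketch shows the forbidden
pattern, which self-deadlocks because the grace period cannot end
until the enclosing reader completes:

<blockquote>
<pre>
 1 void buggy_reader(void)
 2 {
 3   rcu_read_lock();
 4   synchronize_rcu(); /* BUG: waits for all readers, including this one! */
 5   rcu_read_unlock();
 6 }
</pre>
</blockquote>
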
<p>
Although RCU's grace-period guarantee is useful in and of itself, with
<a href="https://lwn.net/Articles/573497/">quite a few use cases</a>,
it would be good to be able to use RCU to coordinate read-side
access to linked data structures.
For this, the grace-period guarantee is not sufficient, as can
be seen in function <tt>add_gp_buggy()</tt> below.
We will look at the reader's code later, but in the meantime, just think of
the reader as locklessly picking up the <tt>gp</tt> pointer,
and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the
<tt>->a</tt> and <tt>->b</tt> fields.

<blockquote>
<pre>
 1 bool add_gp_buggy(int a, int b)
 2 {
 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
 4   if (!p)
 5     return false;
 6   spin_lock(&gp_lock);
 7   if (rcu_access_pointer(gp)) {
 8     spin_unlock(&gp_lock);
 9     return false;
10   }
11   p->a = a;
12   p->b = b;
13   gp = p; /* ORDERING BUG */
14   spin_unlock(&gp_lock);
15   return true;
16 }
</pre>
</blockquote>

<p>
The problem is that both the compiler and weakly ordered CPUs are within
their rights to reorder this code as follows:

<blockquote>
<pre>
 1 bool add_gp_buggy_optimized(int a, int b)
 2 {
 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
 4   if (!p)
 5     return false;
 6   spin_lock(&gp_lock);
 7   if (rcu_access_pointer(gp)) {
 8     spin_unlock(&gp_lock);
 9     return false;
10   }
<b>11   gp = p; /* ORDERING BUG */
12   p->a = a;
13   p->b = b;</b>
14   spin_unlock(&gp_lock);
15   return true;
16 }
</pre>
</blockquote>

<p>
If an RCU reader fetches <tt>gp</tt> just after
<tt>add_gp_buggy_optimized</tt> executes line 11,
it will see garbage in the <tt>->a</tt> and <tt>->b</tt>
fields.
And this is but one of many ways in which compiler and hardware optimizations
could cause trouble.
Therefore, we clearly need some way to prevent the compiler and the CPU from
reordering in this manner, which brings us to the publish-subscribe
guarantee discussed in the next section.

<h3><a name="Publish-Subscribe Guarantee">Publish-Subscribe Guarantee</a></h3>

<p>
RCU's publish-subscribe guarantee allows data to be inserted
into a linked data structure without disrupting RCU readers.
The updater uses <tt>rcu_assign_pointer()</tt> to insert the
new data, and readers use <tt>rcu_dereference()</tt> to
access data, whether new or old.
The following shows an example of insertion:

<blockquote>
<pre>
 1 bool add_gp(int a, int b)
 2 {
 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
 4   if (!p)
 5     return false;
 6   spin_lock(&gp_lock);
 7   if (rcu_access_pointer(gp)) {
 8     spin_unlock(&gp_lock);
 9     return false;
10   }
11   p->a = a;
12   p->b = b;
13   rcu_assign_pointer(gp, p);
14   spin_unlock(&gp_lock);
15   return true;
16 }
</pre>
</blockquote>

<p>
The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually
equivalent to a simple assignment statement, but also guarantees
that its assignment will
happen after the two assignments in lines 11 and 12,
similar to the C11 <tt>memory_order_release</tt> store operation.
It also prevents any number of “interesting” compiler
optimizations, for example, the use of <tt>gp</tt> as a scratch
location immediately preceding the assignment.

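<p>
One way to model this release-store behavior, as a conceptual
sketch only (the real macro also deals with <tt>sparse</tt>
annotations and type checking), is:

<blockquote>
<pre>
 1 /* Conceptual model only, not the real definition. */
 2 #define rcu_assign_pointer(p, v) smp_store_release(&(p), (v))
</pre>
</blockquote>
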
<p>@@QQ@@
But <tt>rcu_assign_pointer()</tt> does nothing to prevent the
two assignments to <tt>p->a</tt> and <tt>p->b</tt>
from being reordered.
Can't that also cause problems?
<p>@@QQA@@
No, it cannot.
The readers cannot see either of these two fields until
the assignment to <tt>gp</tt>, by which time both fields are
fully initialized.
So reordering the assignments
to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly
cause any problems.
<p>@@QQE@@

<p>
It is tempting to assume that the reader need not do anything special
to control its accesses to the RCU-protected data,
as shown in <tt>do_something_gp_buggy()</tt> below:

<blockquote>
<pre>
 1 bool do_something_gp_buggy(void)
 2 {
 3   rcu_read_lock();
 4   p = gp;  /* OPTIMIZATIONS GALORE!!! */
 5   if (p) {
 6     do_something(p->a, p->b);
 7     rcu_read_unlock();
 8     return true;
 9   }
10   rcu_read_unlock();
11   return false;
12 }
</pre>
</blockquote>

<p>
However, this temptation must be resisted because there are a
surprisingly large number of ways that the compiler
(to say nothing of
<a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>)
can trip this code up.
For but one example, if the compiler were short of registers, it
might choose to refetch from <tt>gp</tt> rather than keeping
a separate copy in <tt>p</tt> as follows:

<blockquote>
<pre>
 1 bool do_something_gp_buggy_optimized(void)
 2 {
 3   rcu_read_lock();
 4   if (gp) { /* OPTIMIZATIONS GALORE!!! */
<b> 5     do_something(gp->a, gp->b);</b>
 6     rcu_read_unlock();
 7     return true;
 8   }
 9   rcu_read_unlock();
10   return false;
11 }
</pre>
</blockquote>

<p>
If this function ran concurrently with a series of updates that
replaced the current structure with a new one,
the fetches of <tt>gp->a</tt>
and <tt>gp->b</tt> might well come from two different structures,
which could cause serious confusion.
To prevent this (and much else besides), <tt>do_something_gp()</tt> uses
<tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>:

<blockquote>
<pre>
 1 bool do_something_gp(void)
 2 {
 3   rcu_read_lock();
 4   p = rcu_dereference(gp);
 5   if (p) {
 6     do_something(p->a, p->b);
 7     rcu_read_unlock();
 8     return true;
 9   }
10   rcu_read_unlock();
11   return false;
12 }
</pre>
</blockquote>

<p>
The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha)
memory barriers in the Linux kernel.
Should a
<a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a>
ever appear, then <tt>rcu_dereference()</tt> could be implemented
as a <tt>memory_order_consume</tt> load.
Regardless of the exact implementation, a pointer fetched by
<tt>rcu_dereference()</tt> may not be used outside of the
outermost RCU read-side critical section containing that
<tt>rcu_dereference()</tt>, unless protection of
the corresponding data element has been passed from RCU to some
other synchronization mechanism, most commonly locking or
<a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>.

<p>
In short, updaters use <tt>rcu_assign_pointer()</tt> and readers
use <tt>rcu_dereference()</tt>, and these two RCU API elements
work together to ensure that readers have a consistent view of
newly added data elements.

<p>
Of course, it is also necessary to remove elements from RCU-protected
data structures, for example, using the following process:

<ol>
<li>	Remove the data element from the enclosing structure.
<li>	Wait for all pre-existing RCU read-side critical sections
	to complete (because only pre-existing readers can possibly have
	a reference to the newly removed data element).
<li>	At this point, only the updater has a reference to the
	newly removed data element, so it can safely reclaim
	the data element, for example, by passing it to <tt>kfree()</tt>.
</ol>

This process is implemented by <tt>remove_gp_synchronous()</tt>:

<blockquote>
<pre>
 1 bool remove_gp_synchronous(void)
 2 {
 3   struct foo *p;
 4
 5   spin_lock(&gp_lock);
 6   p = rcu_access_pointer(gp);
 7   if (!p) {
 8     spin_unlock(&gp_lock);
 9     return false;
10   }
11   rcu_assign_pointer(gp, NULL);
12   spin_unlock(&gp_lock);
13   synchronize_rcu();
14   kfree(p);
15   return true;
16 }
</pre>
</blockquote>

<p>
This function is straightforward, with line 13 waiting for a grace
period before line 14 frees the old data element.
This waiting ensures that readers will reach line 7 of
<tt>do_something_gp()</tt> before the data element referenced by
<tt>p</tt> is freed.
The <tt>rcu_access_pointer()</tt> on line 6 is similar to
<tt>rcu_dereference()</tt>, except that:

<ol>
<li>	The value returned by <tt>rcu_access_pointer()</tt>
	cannot be dereferenced.
	If you want to access the value pointed to as well as
	the pointer itself, use <tt>rcu_dereference()</tt>
	instead of <tt>rcu_access_pointer()</tt>.
<li>	The call to <tt>rcu_access_pointer()</tt> need not be
	protected.
	In contrast, <tt>rcu_dereference()</tt> must either be
	within an RCU read-side critical section or in a code
	segment where the pointer cannot change, for example, in
	code protected by the corresponding update-side lock.
</ol>

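<p>
As a minimal sketch of the second point, the following hypothetical
helper tests for presence of the <tt>gp</tt> element without
dereferencing the pointer, and therefore needs neither an RCU
read-side critical section nor <tt>gp_lock</tt>:

<blockquote>
<pre>
 1 bool gp_is_present(void)
 2 {
 3   return rcu_access_pointer(gp) != NULL; /* Pointer value only. */
 4 }
</pre>
</blockquote>
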
<p>@@QQ@@
Without the <tt>rcu_dereference()</tt> or the
<tt>rcu_access_pointer()</tt>, what destructive optimizations
might the compiler make use of?
<p>@@QQA@@
Let's start with what happens to <tt>do_something_gp()</tt>
if it fails to use <tt>rcu_dereference()</tt>.
It could reuse a value formerly fetched from this same pointer.
It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time
manner, resulting in <i>load tearing</i>, in turn resulting in a bytewise
mash-up of two distinct pointer values.
It might even use value-speculation optimizations, where it makes a wrong
guess, but by the time it gets around to checking the value, an update
has changed the pointer to match the wrong guess.
Too bad about any dereferences that returned pre-initialization garbage
in the meantime!

<p>
For <tt>remove_gp_synchronous()</tt>, as long as all modifications
to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>,
the above optimizations are harmless.
However,
with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt>,
<tt>sparse</tt> will complain if you
define <tt>gp</tt> with <tt>__rcu</tt> and then
access it without using
either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>.
<p>@@QQE@@

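<p>
Before moving on, here is a preview sketch of the asynchronous
alternative to <tt>remove_gp_synchronous()</tt> that was promised
earlier; it assumes that <tt>struct foo</tt> embeds a
<tt>struct rcu_head</tt> field named <tt>rh</tt>, and it returns
before the grace period elapses:

<blockquote>
<pre>
 1 static void remove_gp_cb(struct rcu_head *rhp)
 2 {
 3   struct foo *p = container_of(rhp, struct foo, rh);
 4
 5   kfree(p); /* Invoked only after a grace period elapses. */
 6 }
 7
 8 bool remove_gp_asynchronous(void)
 9 {
10   struct foo *p;
11
12   spin_lock(&gp_lock);
13   p = rcu_access_pointer(gp);
14   if (!p) {
15     spin_unlock(&gp_lock);
16     return false;
17   }
18   rcu_assign_pointer(gp, NULL);
19   spin_unlock(&gp_lock);
20   call_rcu(&p->rh, remove_gp_cb); /* Returns immediately. */
21   return true;
22 }
</pre>
</blockquote>
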
<p>
In short, RCU's publish-subscribe guarantee is provided by the combination
of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>.
This guarantee allows data elements to be safely added to RCU-protected
linked data structures without disrupting RCU readers.
This guarantee can be used in combination with the grace-period
guarantee to also allow data elements to be removed from RCU-protected
linked data structures, again without disrupting RCU readers.

<p>
This guarantee was only partially premeditated.
DYNIX/ptx used an explicit memory barrier for publication, but had nothing
resembling <tt>rcu_dereference()</tt> for subscription, nor did it
have anything resembling the <tt>smp_read_barrier_depends()</tt>
that was later subsumed into <tt>rcu_dereference()</tt>.
The need for these operations made itself known quite suddenly at a
late-1990s meeting with the DEC Alpha architects, back in the days when
DEC was still a free-standing company.
It took the Alpha architects a good hour to convince me that any sort
of barrier would ever be needed, and it then took me a good <i>two</i> hours
to convince them that their documentation did not make this point clear.
More recent work with the C and C++ standards committees has provided
much education on tricks and traps from the compiler.
In short, compilers were much less tricky in the early 1990s, but in
2015, don't even think about omitting <tt>rcu_dereference()</tt>!

<h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3>

<p>
The previous section's simple linked-data-structure scenario clearly
demonstrates the need for RCU's stringent memory-ordering guarantees on
systems with more than one CPU:

<ol>
<li>	Each CPU that has an RCU read-side critical section that
	begins before <tt>synchronize_rcu()</tt> starts is
	guaranteed to execute a full memory barrier between the time
	that the RCU read-side critical section ends and the time that
	<tt>synchronize_rcu()</tt> returns.
	Without this guarantee, a pre-existing RCU read-side critical section
	might hold a reference to the newly removed <tt>struct foo</tt>
	after the <tt>kfree()</tt> on line 14 of
	<tt>remove_gp_synchronous()</tt>.
<li>	Each CPU that has an RCU read-side critical section that ends
	after <tt>synchronize_rcu()</tt> returns is guaranteed
	to execute a full memory barrier between the time that
	<tt>synchronize_rcu()</tt> begins and the time that the RCU
	read-side critical section begins.
	Without this guarantee, a later RCU read-side critical section
	running after the <tt>kfree()</tt> on line 14 of
	<tt>remove_gp_synchronous()</tt> might
	later run <tt>do_something_gp()</tt> and find the
	newly deleted <tt>struct foo</tt>.
<li>	If the task invoking <tt>synchronize_rcu()</tt> remains
	on a given CPU, then that CPU is guaranteed to execute a full
	memory barrier sometime during the execution of
	<tt>synchronize_rcu()</tt>.
	This guarantee ensures that the <tt>kfree()</tt> on
	line 14 of <tt>remove_gp_synchronous()</tt> really does
	execute after the removal on line 11.
<li>	If the task invoking <tt>synchronize_rcu()</tt> migrates
	among a group of CPUs during that invocation, then each of the
	CPUs in that group is guaranteed to execute a full memory barrier
	sometime during the execution of <tt>synchronize_rcu()</tt>.
	This guarantee ensures that the <tt>kfree()</tt> on
	line 14 of <tt>remove_gp_synchronous()</tt> really does
	execute after the removal on
	line 11, even in the case where the thread executing the
	<tt>synchronize_rcu()</tt> migrates in the meantime.
</ol>

<p>@@QQ@@
Given that multiple CPUs can start RCU read-side critical sections
at any time without any ordering whatsoever, how can RCU possibly tell whether
or not a given RCU read-side critical section starts before a
given instance of <tt>synchronize_rcu()</tt>?
<p>@@QQA@@
If RCU cannot tell whether or not a given
RCU read-side critical section starts before a
given instance of <tt>synchronize_rcu()</tt>,
then it must assume that the RCU read-side critical section
started first.
In other words, a given instance of <tt>synchronize_rcu()</tt>
can avoid waiting on a given RCU read-side critical section only
if it can prove that <tt>synchronize_rcu()</tt> started first.
<p>@@QQE@@

<p>@@QQ@@
The first and second guarantees require unbelievably strict ordering!
Are all these memory barriers <i>really</i> required?
<p>@@QQA@@
Yes, they really are required.
To see why the first guarantee is required, consider the following
sequence of events:

<ol>
<li>	CPU 1: <tt>rcu_read_lock()</tt>
<li>	CPU 1: <tt>q = rcu_dereference(gp);
	/* Very likely to return p. */</tt>
<li>	CPU 0: <tt>list_del_rcu(p);</tt>
<li>	CPU 0: <tt>synchronize_rcu()</tt> starts.
<li>	CPU 1: <tt>do_something_with(q->a);
	/* No smp_mb(), so might happen after kfree(). */</tt>
<li>	CPU 1: <tt>rcu_read_unlock()</tt>
<li>	CPU 0: <tt>synchronize_rcu()</tt> returns.
<li>	CPU 0: <tt>kfree(p);</tt>
</ol>

<p>
Therefore, there absolutely must be a full memory barrier between the
end of the RCU read-side critical section and the end of the
grace period.

<p>
The sequence of events demonstrating the necessity of the second rule
is roughly similar:

<ol>
<li>	CPU 0: <tt>list_del_rcu(p);</tt>
<li>	CPU 0: <tt>synchronize_rcu()</tt> starts.
<li>	CPU 1: <tt>rcu_read_lock()</tt>
<li>	CPU 1: <tt>q = rcu_dereference(gp);
	/* Might return p if no memory barrier. */</tt>
<li>	CPU 0: <tt>synchronize_rcu()</tt> returns.
<li>	CPU 0: <tt>kfree(p);</tt>
<li>	CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt>
<li>	CPU 1: <tt>rcu_read_unlock()</tt>
</ol>

<p>
And similarly, without a memory barrier between the beginning of the
grace period and the beginning of the RCU read-side critical section,
CPU 1 might end up accessing the freelist.

<p>
The “as if” rule of course applies, so that any implementation
that acts as if the appropriate memory barriers were in place is a
correct implementation.
That said, it is much easier to fool yourself into believing that you have
adhered to the as-if rule than it is to actually adhere to it!
<p>@@QQE@@

<p>
Note that these memory-barrier requirements do not replace the fundamental
RCU requirement that a grace period wait for all pre-existing readers.
On the contrary, the memory barriers called out in this section must operate in
such a way as to <i>enforce</i> this fundamental requirement.
Of course, different implementations enforce this requirement in different
ways, but enforce it they must.

<h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3>

<p>
The common-case RCU primitives are unconditional.
They are invoked, they do their job, and they return, with no possibility
of error, and no need to retry.
This is a key RCU design philosophy.

<p>
However, this philosophy is pragmatic rather than pigheaded.
If someone comes up with a good justification for a particular conditional
RCU primitive, it might well be implemented and added.
After all, this guarantee was reverse-engineered, not premeditated.
The unconditional nature of the RCU primitives was initially an
accident of implementation, and later experience with conditional
primitives in other synchronization mechanisms caused me to elevate this
accident to a guarantee.
Therefore, the justification for adding a conditional primitive to
RCU would need to be based on detailed and compelling use cases.

<h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3>

<p>
As far as RCU is concerned, it is always possible to carry out an
update within an RCU read-side critical section.
For example, that RCU read-side critical section might search for
a given data element, and then might acquire the update-side
spinlock in order to update that element, all while remaining
in that RCU read-side critical section.
Of course, it is necessary to exit the RCU read-side critical section
before invoking <tt>synchronize_rcu()</tt>; however, this
inconvenience can be avoided through use of the
<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members
described later in this document.

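<p>
Here is a minimal sketch of this pattern, reusing <tt>gp</tt> and
<tt>gp_lock</tt> from the earlier examples; a production version
would typically also recheck, after acquiring the lock, that the
element is still the current one:

<blockquote>
<pre>
 1 bool upgrade_and_update(int new_a)
 2 {
 3   struct foo *p;
 4
 5   rcu_read_lock();
 6   p = rcu_dereference(gp);
 7   if (!p) {
 8     rcu_read_unlock();
 9     return false;
10   }
11   spin_lock(&gp_lock); /* Upgrade to write while still in the reader. */
12   p->a = new_a;
13   spin_unlock(&gp_lock);
14   rcu_read_unlock();
15   return true;
16 }
</pre>
</blockquote>
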
<p>@@QQ@@
But how does the upgrade-to-write operation exclude other readers?
<p>@@QQA@@
It doesn't, just like normal RCU updates, which also do not exclude
RCU readers.
<p>@@QQE@@

<p>
This guarantee allows lookup code to be shared between read-side
and update-side code, and was premeditated, appearing in the earliest
DYNIX/ptx RCU documentation.

<h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2>

<p>
RCU provides extremely lightweight readers, and its read-side guarantees,
though quite useful, are correspondingly lightweight.
It is therefore all too easy to assume that RCU is guaranteeing more
than it really is.
Of course, the list of things that RCU does not guarantee is infinitely
long; however, the following sections list a few non-guarantees that
have caused confusion.
Except where otherwise noted, these non-guarantees were premeditated.

<ol>
<li>	<a href="#Readers Impose Minimal Ordering">
	Readers Impose Minimal Ordering</a>
<li>	<a href="#Readers Do Not Exclude Updaters">
	Readers Do Not Exclude Updaters</a>
<li>	<a href="#Updaters Only Wait For Old Readers">
	Updaters Only Wait For Old Readers</a>
<li>	<a href="#Grace Periods Don't Partition Read-Side Critical Sections">
	Grace Periods Don't Partition Read-Side Critical Sections</a>
<li>	<a href="#Read-Side Critical Sections Don't Partition Grace Periods">
	Read-Side Critical Sections Don't Partition Grace Periods</a>
<li>	<a href="#Disabling Preemption Does Not Block Grace Periods">
	Disabling Preemption Does Not Block Grace Periods</a>
</ol>

<h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3>

<p>
Reader-side markers such as <tt>rcu_read_lock()</tt> and
<tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees
except through their interaction with the grace-period APIs such as
<tt>synchronize_rcu()</tt>.
To see this, consider the following pair of threads:

<blockquote>
<pre>
 1 void thread0(void)
 2 {
 3   rcu_read_lock();
 4   WRITE_ONCE(x, 1);
 5   rcu_read_unlock();
 6   rcu_read_lock();
 7   WRITE_ONCE(y, 1);
 8   rcu_read_unlock();
 9 }
10
11 void thread1(void)
12 {
13   rcu_read_lock();
14   r1 = READ_ONCE(y);
15   rcu_read_unlock();
16   rcu_read_lock();
17   r2 = READ_ONCE(x);
18   rcu_read_unlock();
19 }
</pre>
</blockquote>

<p>
After <tt>thread0()</tt> and <tt>thread1()</tt> execute
concurrently, it is quite possible to have

<blockquote>
<pre>
(r1 == 1 && r2 == 0)
</pre>
</blockquote>

(that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>),
which would not be possible if <tt>rcu_read_lock()</tt> and
<tt>rcu_read_unlock()</tt> had much in the way of ordering
properties.
But they do not, so the CPU is within its rights
to do significant reordering.
This is by design: Any significant ordering constraints would slow down
these fast-path APIs.

<p>@@QQ@@
Can't the compiler also reorder this code?
<p>@@QQA@@
No, the volatile casts in <tt>READ_ONCE()</tt> and
<tt>WRITE_ONCE()</tt> prevent the compiler from reordering in
this particular case.
<p>@@QQE@@

<h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3>

<p>
Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt>
excludes updates.
All they do is to prevent grace periods from ending.
The following example illustrates this:

<blockquote>
<pre>
 1 void thread0(void)
 2 {
 3   rcu_read_lock();
 4   r1 = READ_ONCE(y);
 5   if (r1) {
 6     do_something_with_nonzero_x();
 7     r2 = READ_ONCE(x);
 8     WARN_ON(!r2); /* BUG!!! */
 9   }
10   rcu_read_unlock();
11 }
12
13 void thread1(void)
14 {
15   spin_lock(&my_lock);
16   WRITE_ONCE(x, 1);
17   WRITE_ONCE(y, 1);
18   spin_unlock(&my_lock);
19 }
</pre>
</blockquote>

<p>
If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt>
excluded the <tt>thread1()</tt> function's update,
the <tt>WARN_ON()</tt> could never fire.
But the fact is that <tt>rcu_read_lock()</tt> does not exclude
much of anything aside from subsequent grace periods, of which
<tt>thread1()</tt> has none, so the
<tt>WARN_ON()</tt> can and does fire.

<h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3>

<p>
It might be tempting to assume that after <tt>synchronize_rcu()</tt>
completes, there are no readers executing.
This temptation must be avoided because
new readers can start immediately after <tt>synchronize_rcu()</tt>
starts, and <tt>synchronize_rcu()</tt> is under no
obligation to wait for these new readers.

<p>@@QQ@@
Suppose that synchronize_rcu() did wait until all readers had completed.
Would the updater be able to rely on this?
<p>@@QQA@@
No.
Even if <tt>synchronize_rcu()</tt> were to wait until
all readers had completed, a new reader might start immediately after
<tt>synchronize_rcu()</tt> completed.
Therefore, the code following
<tt>synchronize_rcu()</tt> cannot rely on there being no readers
in any case.
<p>@@QQE@@

<h3><a name="Grace Periods Don't Partition Read-Side Critical Sections">
Grace Periods Don't Partition Read-Side Critical Sections</a></h3>

<p>
It is tempting to assume that if any part of one RCU read-side critical
section precedes a given grace period, and if any part of another RCU
read-side critical section follows that same grace period, then all of
the first RCU read-side critical section must precede all of the second.
However, this just isn't the case: A single grace period does not
partition the set of RCU read-side critical sections.
An example of this situation can be illustrated as follows, where
<tt>a</tt>, <tt>b</tt>, and <tt>c</tt> are initially all zero:

<blockquote>
<pre>
 1 void thread0(void)
 2 {
 3   rcu_read_lock();
 4   WRITE_ONCE(a, 1);
 5   WRITE_ONCE(b, 1);
 6   rcu_read_unlock();
 7 }
 8
 9 void thread1(void)
10 {
11   r1 = READ_ONCE(a);
12   synchronize_rcu();
13   WRITE_ONCE(c, 1);
14 }
15
16 void thread2(void)
17 {
18   rcu_read_lock();
19   r2 = READ_ONCE(b);
20   r3 = READ_ONCE(c);
21   rcu_read_unlock();
22 }
</pre>
</blockquote>

<p>
It turns out that the outcome:

<blockquote>
<pre>
(r1 == 1 && r2 == 0 && r3 == 1)
</pre>
</blockquote>

is entirely possible.
The following figure shows how this can happen, with each circled
<tt>QS</tt> indicating the point at which RCU recorded a
<i>quiescent state</i> for each thread, that is, a state in which
RCU knows that the thread cannot be in the midst of an RCU read-side
critical section that started before the current grace period:

<p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p>

<p>
If it is necessary to partition RCU read-side critical sections in this
manner, it is necessary to use two grace periods, where the first
grace period is known to end before the second grace period starts:

<blockquote>
<pre>
 1 void thread0(void)
 2 {
 3   rcu_read_lock();
 4   WRITE_ONCE(a, 1);
 5   WRITE_ONCE(b, 1);
 6   rcu_read_unlock();
 7 }
 8
 9 void thread1(void)
10 {
11   r1 = READ_ONCE(a);
12   synchronize_rcu();
13   WRITE_ONCE(c, 1);
14 }
15
16 void thread2(void)
17 {
18   r2 = READ_ONCE(c);
19   synchronize_rcu();
20   WRITE_ONCE(d, 1);
21 }
22
23 void thread3(void)
24 {
25   rcu_read_lock();
26   r3 = READ_ONCE(b);
27   r4 = READ_ONCE(d);
28   rcu_read_unlock();
29 }
</pre>
</blockquote>

<p>
Here, if <tt>(r1 == 1)</tt>, then
<tt>thread0()</tt>'s write to <tt>b</tt> must happen
before the end of <tt>thread1()</tt>'s grace period.
If in addition <tt>(r4 == 1)</tt>, then
<tt>thread3()</tt>'s read from <tt>b</tt> must happen
after the beginning of <tt>thread2()</tt>'s grace period.
If it is also the case that <tt>(r2 == 1)</tt>, then the
end of <tt>thread1()</tt>'s grace period must precede the
beginning of <tt>thread2()</tt>'s grace period.
This means that the two RCU read-side critical sections cannot overlap,
guaranteeing that <tt>(r3 == 1)</tt>.
As a result, the outcome:

<blockquote>
<pre>
(r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)
</pre>
</blockquote>

cannot happen.

<p>
This non-requirement was also non-premeditated, but became apparent
when studying RCU's interaction with memory ordering.

<h3><a name="Read-Side Critical Sections Don't Partition Grace Periods">
Read-Side Critical Sections Don't Partition Grace Periods</a></h3>

<p>
It is also tempting to assume that if an RCU read-side critical section
happens between a pair of grace periods, then those grace periods cannot
overlap.
However, this temptation leads nowhere good, as can be illustrated by
the following, with all variables initially zero:

<blockquote>
<pre>
 1 void thread0(void)
 2 {
 3   rcu_read_lock();
 4   WRITE_ONCE(a, 1);
 5   WRITE_ONCE(b, 1);
 6   rcu_read_unlock();
 7 }
 8
 9 void thread1(void)
10 {
11   r1 = READ_ONCE(a);
12   synchronize_rcu();
13   WRITE_ONCE(c, 1);
14 }
15
16 void thread2(void)
17 {
18   rcu_read_lock();
19   WRITE_ONCE(d, 1);
20   r2 = READ_ONCE(c);
21   rcu_read_unlock();
22 }
23
24 void thread3(void)
25 {
26   r3 = READ_ONCE(d);
27   synchronize_rcu();
28   WRITE_ONCE(e, 1);
29 }
30
31 void thread4(void)
32 {
33   rcu_read_lock();
34   r4 = READ_ONCE(b);
35   r5 = READ_ONCE(e);
36   rcu_read_unlock();
37 }
</pre>
</blockquote>

<p>
In this case, the outcome:

<blockquote>
<pre>
(r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)
</pre>
</blockquote>

is entirely possible, as illustrated below:

<p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p>

<p>
Again, an RCU read-side critical section can overlap almost all of a
given grace period, just so long as it does not overlap the entire
grace period.
As a result, an RCU read-side critical section cannot partition a pair
of RCU grace periods.

<p>@@QQ@@
How many grace periods, each separated by an RCU read-side
critical section, would be required to partition the RCU read-side
critical sections at the beginning and end of the chain?
<p>@@QQA@@
In theory, an infinite number.
In practice, an unknown number that is sensitive to both implementation
details and timing considerations.
Therefore, even in practice, RCU users must abide by the theoretical rather
than the practical answer.
<p>@@QQE@@

<h3><a name="Disabling Preemption Does Not Block Grace Periods">
Disabling Preemption Does Not Block Grace Periods</a></h3>

<p>
There was a time when disabling preemption on any given CPU would block
subsequent grace periods.
However, this was an accident of implementation and is not a requirement.
And in the current Linux-kernel implementation, disabling preemption
on a given CPU in fact does not block grace periods, as Oleg Nesterov
<a href="https://lkml.kernel.org/g/20150614193825.GA19582@redhat.com">demonstrated</a>.

<p>
If you need a preempt-disable region to block grace periods, you need to add
<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>, for example
as follows:

<blockquote>
<pre>
 1 preempt_disable();
 2 rcu_read_lock();
 3 do_something();
 4 rcu_read_unlock();
 5 preempt_enable();
 6
 7 /* Spinlocks implicitly disable preemption. */
 8 spin_lock(&mylock);
 9 rcu_read_lock();
10 do_something();
11 rcu_read_unlock();
12 spin_unlock(&mylock);
</pre>
</blockquote>

<p>
In theory, you could enter the RCU read-side critical section first,
but it is more efficient to keep the entire RCU read-side critical
section contained in the preempt-disable region as shown above.
Of course, RCU read-side critical sections that extend outside of
preempt-disable regions will work correctly, but such critical sections
can be preempted, which forces <tt>rcu_read_unlock()</tt> to do
more work.
And no, this is <i>not</i> an invitation to enclose all of your RCU
read-side critical sections within preempt-disable regions, because
doing so would degrade real-time response.

<p>
This non-requirement appeared with preemptible RCU.
If you need a grace period that waits on non-preemptible code regions, use
<a href="#Sched Flavor">RCU-sched</a>.

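<p>
For reference, here is a minimal sketch of the RCU-sched approach
as of this writing, in which <tt>preempt_disable()</tt> marks an
RCU-sched read-side critical section and <tt>synchronize_sched()</tt>
waits for all pre-existing such regions:

<blockquote>
<pre>
 1 /* Reader: any preempt-disable region. */
 2 preempt_disable();
 3 do_something();
 4 preempt_enable();
 5
 6 /* Updater: waits for all pre-existing preempt-disable regions. */
 7 synchronize_sched();
</pre>
</blockquote>
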
<h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2>

<p>
These parallelism facts of life are by no means specific to RCU, but
the RCU implementation must abide by them.
They therefore bear repeating:

<ol>
<li>	Any CPU or task may be delayed at any time,
	and any attempts to avoid these delays by disabling
	preemption, interrupts, or whatever are completely futile.
	This is most obvious in preemptible user-level
	environments and in virtualized environments (where
	a given guest OS's VCPUs can be preempted at any time by
	the underlying hypervisor), but can also happen in bare-metal
	environments due to ECC errors, NMIs, and other hardware
	events.
	Although a delay of more than about 20 seconds can result
	in splats, the RCU implementation is obligated to use
	algorithms that can tolerate extremely long delays, but where
	“extremely long” is not long enough to allow
	wrap-around when incrementing a 64-bit counter.
<li>	Both the compiler and the CPU can reorder memory accesses.
	Where it matters, RCU must use compiler directives and
	memory-barrier instructions to preserve ordering.
<li>	Conflicting writes to memory locations in any given cache line
	will result in expensive cache misses.
	Greater numbers of concurrent writes and more-frequent
	concurrent writes will result in more dramatic slowdowns.
	RCU is therefore obligated to use algorithms that have
	sufficient locality to avoid significant performance and
	scalability problems.
<li>	As a rough rule of thumb, only one CPU's worth of processing
	may be carried out under the protection of any given exclusive
	lock.
	RCU must therefore use scalable locking designs.
<li>	Counters are finite, especially on 32-bit systems.
	RCU's use of counters must therefore tolerate counter wrap,
	or be designed such that counter wrap would take way more
	time than a single system is likely to run.
	An uptime of ten years is quite possible, a runtime
	of a century much less so.
	As an example of the latter, RCU's dyntick-idle nesting counter
	allows 54 bits for interrupt nesting level (this counter
	is 64 bits even on a 32-bit system).
	Overflowing this counter requires 2<sup>54</sup>
	half-interrupts on a given CPU without that CPU ever going idle.
	If a half-interrupt happened every microsecond, it would take
	570 years of runtime to overflow this counter, which is currently
	believed to be an acceptably long time (see the back-of-envelope
	check following this list).
<li>	Linux systems can have thousands of CPUs running a single
	Linux kernel in a single shared-memory environment.
	RCU must therefore pay close attention to high-end scalability.
</ol>

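<p>
As promised, here is a back-of-envelope check of the 570-year figure
from the dyntick-idle example above, written as a standalone program:

<blockquote>
<pre>
 1 #include &lt;stdio.h&gt;
 2
 3 int main(void)
 4 {
 5   double half_irqs = (double)(1ULL &lt;&lt; 54); /* 2^54 half-interrupts. */
 6   double per_sec = 1e6;                    /* One per microsecond. */
 7   double sec_per_year = 365.25 * 24 * 3600;
 8
 9   printf("%.0f years\n", half_irqs / per_sec / sec_per_year);
10   return 0; /* Prints approximately 571 years. */
11 }
</pre>
</blockquote>
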
<p>
This last parallelism fact of life means that RCU must pay special
attention to the preceding facts of life.
The idea that Linux might scale to systems with thousands of CPUs would
have been met with some skepticism in the 1990s, but these requirements
would otherwise have been unsurprising, even in the early 1990s.

<h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2>

<p>
These sections list quality-of-implementation requirements.
Although an RCU implementation that ignores these requirements could
still be used, it would likely be subject to limitations that would
make it inappropriate for industrial-strength production use.
Classes of quality-of-implementation requirements are as follows:

<ol>
<li>	<a href="#Specialization">Specialization</a>
<li>	<a href="#Performance and Scalability">Performance and Scalability</a>
<li>	<a href="#Composability">Composability</a>
<li>	<a href="#Corner Cases">Corner Cases</a>
</ol>

<p>
These classes are covered in the following sections.

<h3><a name="Specialization">Specialization</a></h3>

<p>
RCU is and always has been intended primarily for read-mostly situations, as
illustrated by the following figure.
This means that RCU's read-side primitives are optimized, often at the
expense of its update-side primitives.

<p><img src="RCUApplicability.svg" alt="RCUApplicability.svg" width="70%"></p>

<p>
This focus on read-mostly situations means that RCU must interoperate
with other synchronization primitives.
For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt>
examples discussed earlier use RCU to protect readers and locking to
coordinate updaters.
However, the need extends much farther, requiring that a variety of
synchronization primitives be legal within RCU read-side critical sections,
including spinlocks, sequence locks, atomic operations, reference
counters, and memory barriers.

<p>@@QQ@@
What about sleeping locks?
<p>@@QQA@@
These are forbidden within Linux-kernel RCU read-side critical sections
because it is not legal to place a quiescent state (in this case,
voluntary context switch) within an RCU read-side critical section.
However, sleeping locks may be used within userspace RCU read-side critical
sections, and also within Linux-kernel sleepable RCU
<a href="#Sleepable RCU">(SRCU)</a>
read-side critical sections.
In addition, the -rt patchset turns spinlocks into sleeping locks so
that the corresponding critical sections can be preempted, which
also means that these sleeplockified spinlocks (but not other sleeping locks!)
may be acquired within -rt-Linux-kernel RCU read-side critical sections.

<p>
Note that it <i>is</i> legal for a normal RCU read-side critical section
to conditionally acquire a sleeping lock (as in <tt>mutex_trylock()</tt>),
but only as long as it does not loop indefinitely attempting to
conditionally acquire that sleeping lock.
The key point is that things like <tt>mutex_trylock()</tt>
either return with the mutex held, or return an error indication if
the mutex was not immediately available.
Either way, <tt>mutex_trylock()</tt> returns immediately without sleeping.
<p>@@QQE@@

<p>
It often comes as a surprise that many algorithms do not require a
consistent view of data, but can instead function on stale or
inconsistent values, with network routing being the poster child.
Internet routing algorithms take significant time to propagate
updates, so that by the time an update arrives at a given system,
that system has been sending network traffic the wrong way for
a considerable length of time.
Having a few threads continue to send traffic the wrong way for a
few more milliseconds is clearly not a problem: In the worst case,
TCP retransmissions will eventually get the data where it needs to go.
In general, when tracking the state of the universe outside of the
computer, some level of inconsistency must be tolerated due to
speed-of-light delays if nothing else.

| 1291 | <p> |
| 1292 | Furthermore, uncertainty about external state is inherent in many cases. |
| 1293 | For example, a pair of veternarians might use heartbeat to determine |
| 1294 | whether or not a given cat was alive. |
| 1295 | But how long should they wait after the last heartbeat to decide that |
| 1296 | the cat is in fact dead? |
| 1297 | Waiting less than 400 milliseconds makes no sense because this would |
| 1298 | mean that a relaxed cat would be considered to cycle between death |
| 1299 | and life more than 100 times per minute. |
| 1300 | Moreover, just as with human beings, a cat's heart might stop for |
| 1301 | some period of time, so the exact wait period is a judgment call. |
One of our pair of veterinarians might wait 30 seconds before pronouncing
the cat dead, while the other might insist on waiting a full minute.
The two veterinarians would then disagree on the state of the cat during
| 1305 | the final 30 seconds of the minute following the last heartbeat, as |
| 1306 | fancifully illustrated below: |
| 1307 | |
| 1308 | <p><img src="2013-08-is-it-dead.png" alt="2013-08-is-it-dead.png" width="431"></p> |
| 1309 | |
| 1310 | <p> |
| 1311 | Interestingly enough, this same situation applies to hardware. |
| 1312 | When push comes to shove, how do we tell whether or not some |
| 1313 | external server has failed? |
| 1314 | We send messages to it periodically, and declare it failed if we |
| 1315 | don't receive a response within a given period of time. |
| 1316 | Policy decisions can usually tolerate short |
| 1317 | periods of inconsistency. |
| 1318 | The policy was decided some time ago, and is only now being put into |
| 1319 | effect, so a few milliseconds of delay is normally inconsequential. |
| 1320 | |
| 1321 | <p> |
| 1322 | However, there are algorithms that absolutely must see consistent data. |
| 1323 | For example, the translation between a user-level SystemV semaphore |
| 1324 | ID to the corresponding in-kernel data structure is protected by RCU, |
| 1325 | but it is absolutely forbidden to update a semaphore that has just been |
| 1326 | removed. |
| 1327 | In the Linux kernel, this need for consistency is accommodated by acquiring |
| 1328 | spinlocks located in the in-kernel data structure from within |
| 1329 | the RCU read-side critical section, and this is indicated by the |
| 1330 | green box in the figure above. |
| 1331 | Many other techniques may be used, and are in fact used within the |
| 1332 | Linux kernel. |
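
<p>
A sketch of this pattern appears below.
(The <tt>sem</tt> structure, its <tt>-&gt;deleted</tt> flag, and the
<tt>sem_lookup()</tt> function are hypothetical stand-ins for the
actual SystemV semaphore code.)

<blockquote>
<pre>
struct sem {
  spinlock_t lock;
  bool deleted;   /* Set under -&gt;lock before the RCU-delayed free. */
  int value;
};

int sem_update(int id, int value)
{
  struct sem *s;
  int ret = -EINVAL;

  rcu_read_lock();
  s = sem_lookup(id);        /* Hypothetical RCU-protected lookup. */
  if (s) {
    spin_lock(&s-&gt;lock);     /* The lock in the in-kernel structure. */
    if (!s-&gt;deleted) {       /* Removal sets -&gt;deleted under -&gt;lock, */
      s-&gt;value = value;      /* so this update cannot touch a */
      ret = 0;               /* semaphore that has been removed. */
    }
    spin_unlock(&s-&gt;lock);
  }
  rcu_read_unlock();
  return ret;
}
</pre>
</blockquote>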
| 1333 | |
| 1334 | <p> |
| 1335 | In short, RCU is not required to maintain consistency, and other |
| 1336 | mechanisms may be used in concert with RCU when consistency is required. |
| 1337 | RCU's specialization allows it to do its job extremely well, and its |
| 1338 | ability to interoperate with other synchronization mechanisms allows |
| 1339 | the right mix of synchronization tools to be used for a given job. |
| 1340 | |
| 1341 | <h3><a name="Performance and Scalability">Performance and Scalability</a></h3> |
| 1342 | |
| 1343 | <p> |
| 1344 | Energy efficiency is a critical component of performance today, |
| 1345 | and Linux-kernel RCU implementations must therefore avoid unnecessarily |
| 1346 | awakening idle CPUs. |
| 1347 | I cannot claim that this requirement was premeditated. |
| 1348 | In fact, I learned of it during a telephone conversation in which I |
| 1349 | was given “frank and open” feedback on the importance |
| 1350 | of energy efficiency in battery-powered systems and on specific |
| 1351 | energy-efficiency shortcomings of the Linux-kernel RCU implementation. |
| 1352 | In my experience, the battery-powered embedded community will consider |
| 1353 | any unnecessary wakeups to be extremely unfriendly acts. |
| 1354 | So much so that mere Linux-kernel-mailing-list posts are |
| 1355 | insufficient to vent their ire. |
| 1356 | |
| 1357 | <p> |
Memory consumption is not particularly important in most
situations, and has become decreasingly
| 1360 | so as memory sizes have expanded and memory |
| 1361 | costs have plummeted. |
| 1362 | However, as I learned from Matt Mackall's |
| 1363 | <a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a> |
| 1364 | efforts, memory footprint is critically important on single-CPU systems with |
| 1365 | non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus |
| 1366 | <a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a> |
| 1367 | was born. |
| 1368 | Josh Triplett has since taken over the small-memory banner with his |
| 1369 | <a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a> |
| 1370 | project, which resulted in |
| 1371 | <a href="#Sleepable RCU">SRCU</a> |
| 1372 | becoming optional for those kernels not needing it. |
| 1373 | |
| 1374 | <p> |
| 1375 | The remaining performance requirements are, for the most part, |
| 1376 | unsurprising. |
| 1377 | For example, in keeping with RCU's read-side specialization, |
| 1378 | <tt>rcu_dereference()</tt> should have negligible overhead (for |
| 1379 | example, suppression of a few minor compiler optimizations). |
| 1380 | Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and |
| 1381 | <tt>rcu_read_unlock()</tt> should have exactly zero overhead. |
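
<p>
For example, in <tt>CONFIG_PREEMPT=n</tt> kernels built without debugging,
the read-side markers can reduce to something like the following sketch
(simplified from the actual Linux-kernel definitions), in which
<tt>barrier()</tt> is a compiler directive that emits no instructions:

<blockquote>
<pre>
static inline void rcu_read_lock(void)
{
  preempt_disable();  /* Just barrier() when CONFIG_PREEMPT=n, */
}                     /* so the generated code is empty. */

static inline void rcu_read_unlock(void)
{
  preempt_enable();   /* Likewise a no-op in this configuration. */
}
</pre>
</blockquote>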
| 1382 | |
| 1383 | <p> |
| 1384 | In preemptible environments, in the case where the RCU read-side |
| 1385 | critical section was not preempted (as will be the case for the |
| 1386 | highest-priority real-time process), <tt>rcu_read_lock()</tt> and |
| 1387 | <tt>rcu_read_unlock()</tt> should have minimal overhead. |
| 1388 | In particular, they should not contain atomic read-modify-write |
| 1389 | operations, memory-barrier instructions, preemption disabling, |
| 1390 | interrupt disabling, or backwards branches. |
| 1391 | However, in the case where the RCU read-side critical section was preempted, |
| 1392 | <tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts. |
| 1393 | This is why it is better to nest an RCU read-side critical section |
| 1394 | within a preempt-disable region than vice versa, at least in cases |
| 1395 | where that critical section is short enough to avoid unduly degrading |
| 1396 | real-time latencies. |
| 1397 | |
| 1398 | <p> |
| 1399 | The <tt>synchronize_rcu()</tt> grace-period-wait primitive is |
| 1400 | optimized for throughput. |
| 1401 | It may therefore incur several milliseconds of latency in addition to |
| 1402 | the duration of the longest RCU read-side critical section. |
| 1403 | On the other hand, multiple concurrent invocations of |
| 1404 | <tt>synchronize_rcu()</tt> are required to use batching optimizations |
| 1405 | so that they can be satisfied by a single underlying grace-period-wait |
| 1406 | operation. |
| 1407 | For example, in the Linux kernel, it is not unusual for a single |
| 1408 | grace-period-wait operation to serve more than |
| 1409 | <a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a> |
| 1410 | of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation |
| 1411 | overhead down to nearly zero. |
| 1412 | However, the grace-period optimization is also required to avoid |
| 1413 | measurable degradation of real-time scheduling and interrupt latencies. |
| 1414 | |
| 1415 | <p> |
| 1416 | In some cases, the multi-millisecond <tt>synchronize_rcu()</tt> |
| 1417 | latencies are unacceptable. |
| 1418 | In these cases, <tt>synchronize_rcu_expedited()</tt> may be used |
| 1419 | instead, reducing the grace-period latency down to a few tens of |
| 1420 | microseconds on small systems, at least in cases where the RCU read-side |
| 1421 | critical sections are short. |
| 1422 | There are currently no special latency requirements for |
| 1423 | <tt>synchronize_rcu_expedited()</tt> on large systems, but, |
| 1424 | consistent with the empirical nature of the RCU specification, |
| 1425 | that is subject to change. |
| 1426 | However, there most definitely are scalability requirements: |
| 1427 | A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096 |
| 1428 | CPUs should at least make reasonable forward progress. |
| 1429 | In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt> |
| 1430 | is permitted to impose modest degradation of real-time latency |
| 1431 | on non-idle online CPUs. |
| 1432 | That said, it will likely be necessary to take further steps to reduce this |
| 1433 | degradation, hopefully to roughly that of a scheduling-clock interrupt. |
| 1434 | |
| 1435 | <p> |
| 1436 | There are a number of situations where even |
| 1437 | <tt>synchronize_rcu_expedited()</tt>'s reduced grace-period |
| 1438 | latency is unacceptable. |
| 1439 | In these situations, the asynchronous <tt>call_rcu()</tt> can be |
| 1440 | used in place of <tt>synchronize_rcu()</tt> as follows: |
| 1441 | |
| 1442 | <blockquote> |
| 1443 | <pre> |
| 1444 | 1 struct foo { |
| 1445 | 2 int a; |
| 1446 | 3 int b; |
| 1447 | 4 struct rcu_head rh; |
| 1448 | 5 }; |
| 1449 | 6 |
| 1450 | 7 static void remove_gp_cb(struct rcu_head *rhp) |
| 1451 | 8 { |
| 1452 | 9 struct foo *p = container_of(rhp, struct foo, rh); |
| 1453 | 10 |
| 1454 | 11 kfree(p); |
| 1455 | 12 } |
| 1456 | 13 |
| 1457 | 14 bool remove_gp_asynchronous(void) |
| 1458 | 15 { |
| 1459 | 16 struct foo *p; |
| 1460 | 17 |
| 1461 | 18 spin_lock(&gp_lock); |
19     p = rcu_access_pointer(gp);
| 1463 | 20 if (!p) { |
| 1464 | 21 spin_unlock(&gp_lock); |
| 1465 | 22 return false; |
| 1466 | 23 } |
| 1467 | 24 rcu_assign_pointer(gp, NULL); |
| 1468 | 25 call_rcu(&p->rh, remove_gp_cb); |
| 1469 | 26 spin_unlock(&gp_lock); |
| 1470 | 27 return true; |
| 1471 | 28 } |
| 1472 | </pre> |
| 1473 | </blockquote> |
| 1474 | |
| 1475 | <p> |
| 1476 | A definition of <tt>struct foo</tt> is finally needed, and appears |
| 1477 | on lines 1-5. |
| 1478 | The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt> |
| 1479 | on line 25, and will be invoked after the end of a subsequent |
| 1480 | grace period. |
| 1481 | This gets the same effect as <tt>remove_gp_synchronous()</tt>, |
| 1482 | but without forcing the updater to wait for a grace period to elapse. |
| 1483 | The <tt>call_rcu()</tt> function may be used in a number of |
| 1484 | situations where neither <tt>synchronize_rcu()</tt> nor |
| 1485 | <tt>synchronize_rcu_expedited()</tt> would be legal, |
| 1486 | including within preempt-disable code, <tt>local_bh_disable()</tt> code, |
| 1487 | interrupt-disable code, and interrupt handlers. |
| 1488 | However, even <tt>call_rcu()</tt> is illegal within NMI handlers. |
The callback function (<tt>remove_gp_cb()</tt> in this case) will be
executed within a softirq (software interrupt) environment within the
Linux kernel,
| 1492 | either within a real softirq handler or under the protection |
| 1493 | of <tt>local_bh_disable()</tt>. |
| 1494 | In both the Linux kernel and in userspace, it is bad practice to |
| 1495 | write an RCU callback function that takes too long. |
| 1496 | Long-running operations should be relegated to separate threads or |
| 1497 | (in the Linux kernel) workqueues. |
| 1498 | |
| 1499 | <p>@@QQ@@ |
| 1500 | Why does line 19 use <tt>rcu_access_pointer()</tt>? |
| 1501 | After all, <tt>call_rcu()</tt> on line 25 stores into the |
| 1502 | structure, which would interact badly with concurrent insertions. |
| 1503 | Doesn't this mean that <tt>rcu_dereference()</tt> is required? |
| 1504 | <p>@@QQA@@ |
Presumably the <tt>gp_lock</tt> acquired on line 18 excludes
any changes, including any insertions that <tt>rcu_dereference()</tt>
would protect against.
Therefore, any insertions will be delayed until after <tt>gp_lock</tt>
is released on line 26, which in turn means that
<tt>rcu_access_pointer()</tt> suffices.
| 1511 | <p>@@QQE@@ |
| 1512 | |
| 1513 | <p> |
| 1514 | However, all that <tt>remove_gp_cb()</tt> is doing is |
| 1515 | invoking <tt>kfree()</tt> on the data element. |
| 1516 | This is a common idiom, and is supported by <tt>kfree_rcu()</tt>, |
| 1517 | which allows “fire and forget” operation as shown below: |
| 1518 | |
| 1519 | <blockquote> |
| 1520 | <pre> |
| 1521 | 1 struct foo { |
| 1522 | 2 int a; |
| 1523 | 3 int b; |
| 1524 | 4 struct rcu_head rh; |
| 1525 | 5 }; |
| 1526 | 6 |
| 1527 | 7 bool remove_gp_faf(void) |
| 1528 | 8 { |
| 1529 | 9 struct foo *p; |
| 1530 | 10 |
| 1531 | 11 spin_lock(&gp_lock); |
| 1532 | 12 p = rcu_dereference(gp); |
| 1533 | 13 if (!p) { |
| 1534 | 14 spin_unlock(&gp_lock); |
| 1535 | 15 return false; |
| 1536 | 16 } |
| 1537 | 17 rcu_assign_pointer(gp, NULL); |
| 1538 | 18 kfree_rcu(p, rh); |
| 1539 | 19 spin_unlock(&gp_lock); |
| 1540 | 20 return true; |
| 1541 | 21 } |
| 1542 | </pre> |
| 1543 | </blockquote> |
| 1544 | |
| 1545 | <p> |
| 1546 | Note that <tt>remove_gp_faf()</tt> simply invokes |
| 1547 | <tt>kfree_rcu()</tt> and proceeds, without any need to pay any |
| 1548 | further attention to the subsequent grace period and <tt>kfree()</tt>. |
| 1549 | It is permissible to invoke <tt>kfree_rcu()</tt> from the same |
| 1550 | environments as for <tt>call_rcu()</tt>. |
| 1551 | Interestingly enough, DYNIX/ptx had the equivalents of |
| 1552 | <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not |
| 1553 | <tt>synchronize_rcu()</tt>. |
| 1554 | This was due to the fact that RCU was not heavily used within DYNIX/ptx, |
| 1555 | so the very few places that needed something like |
| 1556 | <tt>synchronize_rcu()</tt> simply open-coded it. |
| 1557 | |
| 1558 | <p>@@QQ@@ |
| 1559 | Earlier it was claimed that <tt>call_rcu()</tt> and |
| 1560 | <tt>kfree_rcu()</tt> allowed updaters to avoid being blocked |
| 1561 | by readers. |
| 1562 | But how can that be correct, given that the invocation of the callback |
| 1563 | and the freeing of the memory (respectively) must still wait for |
| 1564 | a grace period to elapse? |
| 1565 | <p>@@QQA@@ |
| 1566 | We could define things this way, but keep in mind that this sort of |
| 1567 | definition would say that updates in garbage-collected languages |
| 1568 | cannot complete until the next time the garbage collector runs, |
| 1569 | which does not seem at all reasonable. |
| 1570 | The key point is that in most cases, an updater using either |
| 1571 | <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the |
| 1572 | next update as soon as it has invoked <tt>call_rcu()</tt> or |
| 1573 | <tt>kfree_rcu()</tt>, without having to wait for a subsequent |
| 1574 | grace period. |
| 1575 | <p>@@QQE@@ |
| 1576 | |
| 1577 | <p> |
| 1578 | But what if the updater must wait for the completion of code to be |
| 1579 | executed after the end of the grace period, but has other tasks |
| 1580 | that can be carried out in the meantime? |
| 1581 | The polling-style <tt>get_state_synchronize_rcu()</tt> and |
| 1582 | <tt>cond_synchronize_rcu()</tt> functions may be used for this |
| 1583 | purpose, as shown below: |
| 1584 | |
| 1585 | <blockquote> |
| 1586 | <pre> |
| 1587 | 1 bool remove_gp_poll(void) |
| 1588 | 2 { |
| 1589 | 3 struct foo *p; |
| 1590 | 4 unsigned long s; |
| 1591 | 5 |
| 1592 | 6 spin_lock(&gp_lock); |
| 1593 | 7 p = rcu_access_pointer(gp); |
| 1594 | 8 if (!p) { |
| 1595 | 9 spin_unlock(&gp_lock); |
| 1596 | 10 return false; |
| 1597 | 11 } |
| 1598 | 12 rcu_assign_pointer(gp, NULL); |
| 1599 | 13 spin_unlock(&gp_lock); |
| 1600 | 14 s = get_state_synchronize_rcu(); |
| 1601 | 15 do_something_while_waiting(); |
| 1602 | 16 cond_synchronize_rcu(s); |
| 1603 | 17 kfree(p); |
| 1604 | 18 return true; |
| 1605 | 19 } |
| 1606 | </pre> |
| 1607 | </blockquote> |
| 1608 | |
| 1609 | <p> |
| 1610 | On line 14, <tt>get_state_synchronize_rcu()</tt> obtains a |
| 1611 | “cookie” from RCU, |
| 1612 | then line 15 carries out other tasks, |
| 1613 | and finally, line 16 returns immediately if a grace period has |
| 1614 | elapsed in the meantime, but otherwise waits as required. |
The need for <tt>get_state_synchronize_rcu()</tt> and
| 1616 | <tt>cond_synchronize_rcu()</tt> has appeared quite recently, |
| 1617 | so it is too early to tell whether they will stand the test of time. |
| 1618 | |
| 1619 | <p> |
| 1620 | RCU thus provides a range of tools to allow updaters to strike the |
required tradeoff between latency, flexibility, and CPU overhead.
| 1622 | |
| 1623 | <h3><a name="Composability">Composability</a></h3> |
| 1624 | |
| 1625 | <p> |
| 1626 | Composability has received much attention in recent years, perhaps in part |
| 1627 | due to the collision of multicore hardware with object-oriented techniques |
| 1628 | designed in single-threaded environments for single-threaded use. |
| 1629 | And in theory, RCU read-side critical sections may be composed, and in |
| 1630 | fact may be nested arbitrarily deeply. |
| 1631 | In practice, as with all real-world implementations of composable |
| 1632 | constructs, there are limitations. |
| 1633 | |
| 1634 | <p> |
| 1635 | Implementations of RCU for which <tt>rcu_read_lock()</tt> |
| 1636 | and <tt>rcu_read_unlock()</tt> generate no code, such as |
| 1637 | Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be |
| 1638 | nested arbitrarily deeply. |
| 1639 | After all, there is no overhead. |
| 1640 | Except that if all these instances of <tt>rcu_read_lock()</tt> |
| 1641 | and <tt>rcu_read_unlock()</tt> are visible to the compiler, |
| 1642 | compilation will eventually fail due to exhausting memory, |
| 1643 | mass storage, or user patience, whichever comes first. |
| 1644 | If the nesting is not visible to the compiler, as is the case with |
| 1645 | mutually recursive functions each in its own translation unit, |
| 1646 | stack overflow will result. |
| 1647 | If the nesting takes the form of loops, either the control variable |
| 1648 | will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. |
| 1649 | Nevertheless, this class of RCU implementations is one |
| 1650 | of the most composable constructs in existence. |
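
<p>
For example, given this class of implementations, an RCU-protected traversal
can invoke another function containing its own RCU read-side critical section
without either function knowing about the other, as in the following sketch
(the <tt>bus</tt> and <tt>dev</tt> structures and both functions are
hypothetical):

<blockquote>
<pre>
struct dev {
  struct list_head list;
  int stat;
};
struct bus {
  struct list_head devs;
};

int dev_stat(struct dev *d)
{
  int ret;

  rcu_read_lock();    /* Inner critical section nests freely... */
  ret = READ_ONCE(d-&gt;stat);
  rcu_read_unlock();
  return ret;
}

int bus_stat(struct bus *b)
{
  struct dev *d;
  int sum = 0;

  rcu_read_lock();    /* ...within this outer critical section. */
  list_for_each_entry_rcu(d, &b-&gt;devs, list)
    sum += dev_stat(d);
  rcu_read_unlock();
  return sum;
}
</pre>
</blockquote>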
| 1651 | |
| 1652 | <p> |
| 1653 | RCU implementations that explicitly track nesting depth |
| 1654 | are limited by the nesting-depth counter. |
| 1655 | For example, the Linux kernel's preemptible RCU limits nesting to |
| 1656 | <tt>INT_MAX</tt>. |
| 1657 | This should suffice for almost all practical purposes. |
| 1658 | That said, a consecutive pair of RCU read-side critical sections |
| 1659 | between which there is an operation that waits for a grace period |
| 1660 | cannot be enclosed in another RCU read-side critical section. |
| 1661 | This is because it is not legal to wait for a grace period within |
| 1662 | an RCU read-side critical section: To do so would result either |
| 1663 | in deadlock or |
| 1664 | in RCU implicitly splitting the enclosing RCU read-side critical |
| 1665 | section, neither of which is conducive to a long-lived and prosperous |
| 1666 | kernel. |
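
<p>
In other words, the following sketch of the forbidden pattern
(with <tt>do_something()</tt> and <tt>do_something_else()</tt> standing in
for arbitrary reader code) must be avoided:

<blockquote>
<pre>
rcu_read_lock();          /* Enclosing critical section. */
rcu_read_lock();          /* First of the consecutive pair. */
do_something();
rcu_read_unlock();
synchronize_rcu();        /* BUG: cannot complete while the enclosing */
                          /* critical section remains in effect. */
rcu_read_lock();          /* Second of the consecutive pair. */
do_something_else();
rcu_read_unlock();
rcu_read_unlock();        /* Would have ended the enclosing critical section. */
</pre>
</blockquote>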
| 1667 | |
| 1668 | <p> |
It is worth noting that RCU is not alone in limiting composability.
| 1670 | For example, many transactional-memory implementations prohibit |
| 1671 | composing a pair of transactions separated by an irrevocable |
| 1672 | operation (for example, a network receive operation). |
| 1673 | For another example, lock-based critical sections can be composed |
| 1674 | surprisingly freely, but only if deadlock is avoided. |
| 1675 | |
| 1676 | <p> |
In short, although RCU read-side critical sections are highly composable,
| 1678 | care is required in some situations, just as is the case for any other |
| 1679 | composable synchronization mechanism. |
| 1680 | |
| 1681 | <h3><a name="Corner Cases">Corner Cases</a></h3> |
| 1682 | |
| 1683 | <p> |
| 1684 | A given RCU workload might have an endless and intense stream of |
| 1685 | RCU read-side critical sections, perhaps even so intense that there |
| 1686 | was never a point in time during which there was not at least one |
| 1687 | RCU read-side critical section in flight. |
| 1688 | RCU cannot allow this situation to block grace periods: As long as |
| 1689 | all the RCU read-side critical sections are finite, grace periods |
| 1690 | must also be finite. |
| 1691 | |
| 1692 | <p> |
| 1693 | That said, preemptible RCU implementations could potentially result |
| 1694 | in RCU read-side critical sections being preempted for long durations, |
| 1695 | which has the effect of creating a long-duration RCU read-side |
| 1696 | critical section. |
| 1697 | This situation can arise only in heavily loaded systems, but systems using |
| 1698 | real-time priorities are of course more vulnerable. |
| 1699 | Therefore, RCU priority boosting is provided to help deal with this |
| 1700 | case. |
| 1701 | That said, the exact requirements on RCU priority boosting will likely |
| 1702 | evolve as more experience accumulates. |
| 1703 | |
| 1704 | <p> |
| 1705 | Other workloads might have very high update rates. |
| 1706 | Although one can argue that such workloads should instead use |
| 1707 | something other than RCU, the fact remains that RCU must |
| 1708 | handle such workloads gracefully. |
| 1709 | This requirement is another factor driving batching of grace periods, |
| 1710 | but it is also the driving force behind the checks for large numbers |
| 1711 | of queued RCU callbacks in the <tt>call_rcu()</tt> code path. |
| 1712 | Finally, high update rates should not delay RCU read-side critical |
| 1713 | sections, although some read-side delays can occur when using |
| 1714 | <tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use |
| 1715 | of <tt>try_stop_cpus()</tt>. |
| 1716 | (In the future, <tt>synchronize_rcu_expedited()</tt> will be |
| 1717 | converted to use lighter-weight inter-processor interrupts (IPIs), |
| 1718 | but this will still disturb readers, though to a much smaller degree.) |
| 1719 | |
| 1720 | <p> |
| 1721 | Although all three of these corner cases were understood in the early |
| 1722 | 1990s, a simple user-level test consisting of <tt>close(open(path))</tt> |
| 1723 | in a tight loop |
| 1724 | in the early 2000s suddenly provided a much deeper appreciation of the |
| 1725 | high-update-rate corner case. |
This test also motivated addition of some RCU code to react to high update
rates.  For example, if a given CPU finds itself with more than 10,000
| 1728 | RCU callbacks queued, it will cause RCU to take evasive action by |
| 1729 | more aggressively starting grace periods and more aggressively forcing |
| 1730 | completion of grace-period processing. |
| 1731 | This evasive action causes the grace period to complete more quickly, |
| 1732 | but at the cost of restricting RCU's batching optimizations, thus |
| 1733 | increasing the CPU overhead incurred by that grace period. |
| 1734 | |
| 1735 | <h2><a name="Software-Engineering Requirements"> |
| 1736 | Software-Engineering Requirements</a></h2> |
| 1737 | |
| 1738 | <p> |
| 1739 | Between Murphy's Law and “To err is human”, it is necessary to |
| 1740 | guard against mishaps and misuse: |
| 1741 | |
| 1742 | <ol> |
| 1743 | <li> It is all too easy to forget to use <tt>rcu_read_lock()</tt> |
| 1744 | everywhere that it is needed, so kernels built with |
<tt>CONFIG_PROVE_RCU=y</tt> will splat if
| 1746 | <tt>rcu_dereference()</tt> is used outside of an |
| 1747 | RCU read-side critical section. |
| 1748 | Update-side code can use <tt>rcu_dereference_protected()</tt>, |
| 1749 | which takes a |
| 1750 | <a href="https://lwn.net/Articles/371986/">lockdep expression</a> |
| 1751 | to indicate what is providing the protection. |
If the indicated protection is not provided, a lockdep splat
is emitted.
(A usage sketch of these lockdep-checking APIs appears after this list.)
| 1754 | |
| 1755 | <p> |
| 1756 | Code shared between readers and updaters can use |
| 1757 | <tt>rcu_dereference_check()</tt>, which also takes a |
| 1758 | lockdep expression, and emits a lockdep splat if neither |
| 1759 | <tt>rcu_read_lock()</tt> nor the indicated protection |
| 1760 | is in place. |
| 1761 | In addition, <tt>rcu_dereference_raw()</tt> is used in those |
| 1762 | (hopefully rare) cases where the required protection cannot |
| 1763 | be easily described. |
| 1764 | Finally, <tt>rcu_read_lock_held()</tt> is provided to |
| 1765 | allow a function to verify that it has been invoked within |
| 1766 | an RCU read-side critical section. |
| 1767 | I was made aware of this set of requirements shortly after Thomas |
| 1768 | Gleixner audited a number of RCU uses. |
| 1769 | <li> A given function might wish to check for RCU-related preconditions |
| 1770 | upon entry, before using any other RCU API. |
The <tt>rcu_lockdep_assert()</tt> macro does this job,
| 1772 | asserting the expression in kernels having lockdep enabled |
| 1773 | and doing nothing otherwise. |
| 1774 | <li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt> |
| 1775 | and <tt>rcu_dereference()</tt>, perhaps (incorrectly) |
| 1776 | substituting a simple assignment. |
| 1777 | To catch this sort of error, a given RCU-protected pointer may be |
| 1778 | tagged with <tt>__rcu</tt>, after which running sparse |
| 1779 | with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt> will complain |
| 1780 | about simple-assignment accesses to that pointer. |
| 1781 | Arnd Bergmann made me aware of this requirement, and also |
| 1782 | supplied the needed |
| 1783 | <a href="https://lwn.net/Articles/376011/">patch series</a>. |
| 1784 | <li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt> |
| 1785 | will splat if a data element is passed to <tt>call_rcu()</tt> |
| 1786 | twice in a row, without a grace period in between. |
| 1787 | (This error is similar to a double free.) |
| 1788 | The corresponding <tt>rcu_head</tt> structures that are |
| 1789 | dynamically allocated are automatically tracked, but |
| 1790 | <tt>rcu_head</tt> structures allocated on the stack |
| 1791 | must be initialized with <tt>init_rcu_head_on_stack()</tt> |
| 1792 | and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>. |
| 1793 | Similarly, statically allocated non-stack <tt>rcu_head</tt> |
| 1794 | structures must be initialized with <tt>init_rcu_head()</tt> |
| 1795 | and cleaned up with <tt>destroy_rcu_head()</tt>. |
| 1796 | Mathieu Desnoyers made me aware of this requirement, and also |
| 1797 | supplied the needed |
| 1798 | <a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>. |
| 1799 | <li> An infinite loop in an RCU read-side critical section will |
eventually trigger an RCU CPU stall warning splat, with
| 1801 | the duration of “eventually” being controlled by the |
| 1802 | <tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or, |
| 1803 | alternatively, by the |
| 1804 | <tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs |
| 1805 | parameter. |
However, RCU is not obligated to produce this splat
| 1807 | unless there is a grace period waiting on that particular |
| 1808 | RCU read-side critical section. |
<p>
| 1810 | Some extreme workloads might intentionally delay |
| 1811 | RCU grace periods, and systems running those workloads can |
| 1812 | be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt> |
| 1813 | to suppress the splats. |
| 1814 | This kernel parameter may also be set via <tt>sysfs</tt>. |
| 1815 | Furthermore, RCU CPU stall warnings are counter-productive |
| 1816 | during sysrq dumps and during panics. |
| 1817 | RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and |
| 1818 | <tt>rcu_sysrq_end()</tt> API members to be called before |
| 1819 | and after long sysrq dumps. |
| 1820 | RCU also supplies the <tt>rcu_panic()</tt> notifier that is |
| 1821 | automatically invoked at the beginning of a panic to suppress |
| 1822 | further RCU CPU stall warnings. |
| 1823 | |
| 1824 | <p> |
This requirement made itself known in the early 1990s, pretty
| 1826 | much the first time that it was necessary to debug a CPU stall. |
That said, the initial implementation in DYNIX/ptx was quite
| 1828 | generic in comparison with that of Linux. |
<li>	Although it would be very good to detect pointers leaking out
| 1830 | of RCU read-side critical sections, there is currently no |
| 1831 | good way of doing this. |
| 1832 | One complication is the need to distinguish between pointers |
| 1833 | leaking and pointers that have been handed off from RCU to |
| 1834 | some other synchronization mechanism, for example, reference |
| 1835 | counting. |
| 1836 | <li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related |
| 1837 | information is provided via both debugfs and event tracing. |
| 1838 | <li> Open-coded use of <tt>rcu_assign_pointer()</tt> and |
| 1839 | <tt>rcu_dereference()</tt> to create typical linked |
| 1840 | data structures can be surprisingly error-prone. |
| 1841 | Therefore, RCU-protected |
| 1842 | <a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a> |
| 1843 | and, more recently, RCU-protected |
| 1844 | <a href="https://lwn.net/Articles/612100/">hash tables</a> |
| 1845 | are available. |
| 1846 | Many other special-purpose RCU-protected data structures are |
| 1847 | available in the Linux kernel and the userspace RCU library. |
| 1848 | <li> Some linked structures are created at compile time, but still |
| 1849 | require <tt>__rcu</tt> checking. |
| 1850 | The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this |
| 1851 | purpose. |
| 1852 | <li> It is not necessary to use <tt>rcu_assign_pointer()</tt> |
| 1853 | when creating linked structures that are to be published via |
| 1854 | a single external pointer. |
| 1855 | The <tt>RCU_INIT_POINTER()</tt> macro is provided for |
| 1856 | this task and also for assigning <tt>NULL</tt> pointers |
| 1857 | at runtime. |
| 1858 | </ol> |
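
<p>
As promised, here is a short sketch of the lockdep-checking APIs from the
first two items of the above list.
The <tt>gp</tt> pointer and <tt>gp_lock</tt> are the ones from this
document's earlier examples, the function names are hypothetical, and
<tt>gp</tt> is assumed to be non-<tt>NULL</tt> for brevity:

<blockquote>
<pre>
struct foo __rcu *gp;       /* The __rcu tag enables sparse checking. */
DEFINE_SPINLOCK(gp_lock);   /* Serializes updates to gp. */

int reader(void)            /* Reader: RCU provides the protection. */
{
  int a;

  rcu_read_lock();
  a = rcu_dereference(gp)-&gt;a;  /* CONFIG_PROVE_RCU=y checks this. */
  rcu_read_unlock();
  return a;
}

int updater_peek(void)      /* Updater: gp_lock provides the protection. */
{
  rcu_lockdep_assert(lockdep_is_held(&gp_lock),
                     "updater_peek() needs gp_lock");
  return rcu_dereference_protected(gp, lockdep_is_held(&gp_lock))-&gt;a;
}

int either_peek(void)       /* Shared code: either form of protection is OK. */
{
  return rcu_dereference_check(gp, lockdep_is_held(&gp_lock))-&gt;a;
}
</pre>
</blockquote>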
| 1859 | |
| 1860 | <p> |
This is not a hard-and-fast list: RCU's diagnostic capabilities will
| 1862 | continue to be guided by the number and type of usage bugs found |
| 1863 | in real-world RCU usage. |
| 1864 | |
| 1865 | <h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2> |
| 1866 | |
| 1867 | <p> |
| 1868 | The Linux kernel provides an interesting environment for all kinds of |
| 1869 | software, including RCU. |
| 1870 | Some of the relevant points of interest are as follows: |
| 1871 | |
| 1872 | <ol> |
| 1873 | <li> <a href="#Configuration">Configuration</a>. |
| 1874 | <li> <a href="#Firmware Interface">Firmware Interface</a>. |
| 1875 | <li> <a href="#Early Boot">Early Boot</a>. |
| 1876 | <li> <a href="#Interrupts and NMIs"> |
| 1877 | Interrupts and non-maskable interrupts (NMIs)</a>. |
| 1878 | <li> <a href="#Loadable Modules">Loadable Modules</a>. |
| 1879 | <li> <a href="#Hotplug CPU">Hotplug CPU</a>. |
| 1880 | <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. |
| 1881 | <li> <a href="#Tracing and RCU">Tracing and RCU</a>. |
| 1882 | <li> <a href="#Energy Efficiency">Energy Efficiency</a>. |
<li>	<a href="#Memory Efficiency">Memory Efficiency</a>.
<li>	<a href="#Performance, Scalability, Response Time, and Reliability">
| 1885 | Performance, Scalability, Response Time, and Reliability</a>. |
| 1886 | </ol> |
| 1887 | |
| 1888 | <p> |
| 1889 | This list is probably incomplete, but it does give a feel for the |
| 1890 | most notable Linux-kernel complications. |
| 1891 | Each of the following sections covers one of the above topics. |
| 1892 | |
| 1893 | <h3><a name="Configuration">Configuration</a></h3> |
| 1894 | |
| 1895 | <p> |
| 1896 | RCU's goal is automatic configuration, so that almost nobody |
| 1897 | needs to worry about RCU's <tt>Kconfig</tt> options. |
| 1898 | And for almost all users, RCU does in fact work well |
| 1899 | “out of the box.” |
| 1900 | |
| 1901 | <p> |
| 1902 | However, there are specialized use cases that are handled by |
| 1903 | kernel boot parameters and <tt>Kconfig</tt> options. |
| 1904 | Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users |
about new <tt>Kconfig</tt> options, which requires that almost all of them
| 1906 | be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option. |
| 1907 | |
| 1908 | <p> |
| 1909 | This all should be quite obvious, but the fact remains that |
| 1910 | Linus Torvalds recently had to |
| 1911 | <a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a> |
| 1912 | me of this requirement. |
| 1913 | |
| 1914 | <h3><a name="Firmware Interface">Firmware Interface</a></h3> |
| 1915 | |
| 1916 | <p> |
In many cases, the kernel obtains information about the system from the
| 1918 | firmware, and sometimes things are lost in translation. |
| 1919 | Or the translation is accurate, but the original message is bogus. |
| 1920 | |
| 1921 | <p> |
| 1922 | For example, some systems' firmware overreports the number of CPUs, |
| 1923 | sometimes by a large factor. |
| 1924 | If RCU naively believed the firmware, as it used to do, |
| 1925 | it would create too many per-CPU kthreads. |
| 1926 | Although the resulting system will still run correctly, the extra |
| 1927 | kthreads needlessly consume memory and can cause confusion |
| 1928 | when they show up in <tt>ps</tt> listings. |
| 1929 | |
| 1930 | <p> |
| 1931 | RCU must therefore wait for a given CPU to actually come online before |
| 1932 | it can allow itself to believe that the CPU actually exists. |
| 1933 | The resulting “ghost CPUs” (which are never going to |
| 1934 | come online) cause a number of |
| 1935 | <a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>. |
| 1936 | |
| 1937 | <h3><a name="Early Boot">Early Boot</a></h3> |
| 1938 | |
| 1939 | <p> |
| 1940 | The Linux kernel's boot sequence is an interesting process, |
| 1941 | and RCU is used early, even before <tt>rcu_init()</tt> |
| 1942 | is invoked. |
| 1943 | In fact, a number of RCU's primitives can be used as soon as the |
| 1944 | initial task's <tt>task_struct</tt> is available and the |
| 1945 | boot CPU's per-CPU variables are set up. |
| 1946 | The read-side primitives (<tt>rcu_read_lock()</tt>, |
| 1947 | <tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>, |
| 1948 | and <tt>rcu_access_pointer()</tt>) will operate normally very early on, |
| 1949 | as will <tt>rcu_assign_pointer()</tt>. |
| 1950 | |
| 1951 | <p> |
| 1952 | Although <tt>call_rcu()</tt> may be invoked at any |
| 1953 | time during boot, callbacks are not guaranteed to be invoked until after |
| 1954 | the scheduler is fully up and running. |
| 1955 | This delay in callback invocation is due to the fact that RCU does not |
| 1956 | invoke callbacks until it is fully initialized, and this full initialization |
| 1957 | cannot occur until after the scheduler has initialized itself to the |
| 1958 | point where RCU can spawn and run its kthreads. |
In theory, it would be possible to invoke callbacks earlier;
however, this is not a panacea because there would be severe restrictions
| 1961 | on what operations those callbacks could invoke. |
| 1962 | |
| 1963 | <p> |
| 1964 | Perhaps surprisingly, <tt>synchronize_rcu()</tt>, |
| 1965 | <a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a> |
| 1966 | (<a href="#Bottom-Half Flavor">discussed below</a>), |
| 1967 | and |
| 1968 | <a href="#Sched Flavor"><tt>synchronize_sched()</tt></a> |
| 1969 | will all operate normally |
| 1970 | during very early boot, the reason being that there is only one CPU |
| 1971 | and preemption is disabled. |
This means that a call to <tt>synchronize_rcu()</tt> (or friends)
is itself a quiescent
state and thus a grace period, so the early-boot implementation can
be a no-op.
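
<p>
Conceptually, then, the very-early-boot implementation can be as simple
as the following sketch:

<blockquote>
<pre>
void synchronize_rcu(void)
{
  /* With but one CPU and with preemption disabled, every point in
   * the code is a quiescent state, so the needed grace period has
   * already elapsed by the time this function is invoked.
   * There is therefore nothing to wait for. */
}
</pre>
</blockquote>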
| 1976 | |
| 1977 | <p> |
| 1978 | Both <tt>synchronize_rcu_bh()</tt> and <tt>synchronize_sched()</tt> |
| 1979 | continue to operate normally through the remainder of boot, courtesy |
| 1980 | of the fact that preemption is disabled across their RCU read-side |
| 1981 | critical sections and also courtesy of the fact that there is still |
| 1982 | only one CPU. |
| 1983 | However, once the scheduler starts initializing, preemption is enabled. |
| 1984 | There is still only a single CPU, but the fact that preemption is enabled |
| 1985 | means that the no-op implementation of <tt>synchronize_rcu()</tt> no |
| 1986 | longer works in <tt>CONFIG_PREEMPT=y</tt> kernels. |
| 1987 | Therefore, as soon as the scheduler starts initializing, the early-boot |
| 1988 | fastpath is disabled. |
| 1989 | This means that <tt>synchronize_rcu()</tt> switches to its runtime |
| 1990 | mode of operation where it posts callbacks, which in turn means that |
| 1991 | any call to <tt>synchronize_rcu()</tt> will block until the corresponding |
| 1992 | callback is invoked. |
| 1993 | Unfortunately, the callback cannot be invoked until RCU's runtime |
| 1994 | grace-period machinery is up and running, which cannot happen until |
| 1995 | the scheduler has initialized itself sufficiently to allow RCU's |
| 1996 | kthreads to be spawned. |
| 1997 | Therefore, invoking <tt>synchronize_rcu()</tt> during scheduler |
| 1998 | initialization can result in deadlock. |
| 1999 | |
| 2000 | <p>@@QQ@@ |
| 2001 | So what happens with <tt>synchronize_rcu()</tt> during |
| 2002 | scheduler initialization for <tt>CONFIG_PREEMPT=n</tt> |
| 2003 | kernels? |
| 2004 | <p>@@QQA@@ |
In <tt>CONFIG_PREEMPT=n</tt> kernels, <tt>synchronize_rcu()</tt>
| 2006 | maps directly to <tt>synchronize_sched()</tt>. |
| 2007 | Therefore, <tt>synchronize_rcu()</tt> works normally throughout |
| 2008 | boot in <tt>CONFIG_PREEMPT=n</tt> kernels. |
| 2009 | However, your code must also work in <tt>CONFIG_PREEMPT=y</tt> kernels, |
| 2010 | so it is still necessary to avoid invoking <tt>synchronize_rcu()</tt> |
| 2011 | during scheduler initialization. |
| 2012 | <p>@@QQE@@ |
| 2013 | |
| 2014 | <p> |
| 2015 | I learned of these boot-time requirements as a result of a series of |
| 2016 | system hangs. |
| 2017 | |
| 2018 | <h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3> |
| 2019 | |
| 2020 | <p> |
| 2021 | The Linux kernel has interrupts, and RCU read-side critical sections are |
| 2022 | legal within interrupt handlers and within interrupt-disabled regions |
| 2023 | of code, as are invocations of <tt>call_rcu()</tt>. |
| 2024 | |
| 2025 | <p> |
| 2026 | Some Linux-kernel architectures can enter an interrupt handler from |
| 2027 | non-idle process context, and then just never leave it, instead stealthily |
| 2028 | transitioning back to process context. |
| 2029 | This trick is sometimes used to invoke system calls from inside the kernel. |
| 2030 | These “half-interrupts” mean that RCU has to be very careful |
| 2031 | about how it counts interrupt nesting levels. |
| 2032 | I learned of this requirement the hard way during a rewrite |
| 2033 | of RCU's dyntick-idle code. |
| 2034 | |
| 2035 | <p> |
| 2036 | The Linux kernel has non-maskable interrupts (NMIs), and |
| 2037 | RCU read-side critical sections are legal within NMI handlers. |
| 2038 | Thankfully, RCU update-side primitives, including |
| 2039 | <tt>call_rcu()</tt>, are prohibited within NMI handlers. |
| 2040 | |
| 2041 | <p> |
| 2042 | The name notwithstanding, some Linux-kernel architectures |
| 2043 | can have nested NMIs, which RCU must handle correctly. |
| 2044 | Andy Lutomirski |
| 2045 | <a href="https://lkml.kernel.org/g/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a> |
| 2046 | with this requirement; |
| 2047 | he also kindly surprised me with |
| 2048 | <a href="https://lkml.kernel.org/g/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a> |
| 2049 | that meets this requirement. |
| 2050 | |
| 2051 | <h3><a name="Loadable Modules">Loadable Modules</a></h3> |
| 2052 | |
| 2053 | <p> |
| 2054 | The Linux kernel has loadable modules, and these modules can |
| 2055 | also be unloaded. |
| 2056 | After a given module has been unloaded, any attempt to call |
| 2057 | one of its functions results in a segmentation fault. |
| 2058 | The module-unload functions must therefore cancel any |
| 2059 | delayed calls to loadable-module functions, for example, |
| 2060 | any outstanding <tt>mod_timer()</tt> must be dealt with |
| 2061 | via <tt>del_timer_sync()</tt> or similar. |
| 2062 | |
| 2063 | <p> |
| 2064 | Unfortunately, there is no way to cancel an RCU callback; |
| 2065 | once you invoke <tt>call_rcu()</tt>, the callback function is |
| 2066 | going to eventually be invoked, unless the system goes down first. |
| 2067 | Because it is normally considered socially irresponsible to crash the system |
| 2068 | in response to a module unload request, we need some other way |
| 2069 | to deal with in-flight RCU callbacks. |
| 2070 | |
| 2071 | <p> |
| 2072 | RCU therefore provides |
| 2073 | <tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>, |
| 2074 | which waits until all in-flight RCU callbacks have been invoked. |
| 2075 | If a module uses <tt>call_rcu()</tt>, its exit function should therefore |
| 2076 | prevent any future invocation of <tt>call_rcu()</tt>, then invoke |
| 2077 | <tt>rcu_barrier()</tt>. |
| 2078 | In theory, the underlying module-unload code could invoke |
| 2079 | <tt>rcu_barrier()</tt> unconditionally, but in practice this would |
| 2080 | incur unacceptable latencies. |
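
<p>
A module exit function might therefore follow the pattern in this sketch,
in which <tt>stop_new_callbacks()</tt> is a hypothetical stand-in for
whatever module-specific mechanism prevents subsequent invocations of
<tt>call_rcu()</tt>:

<blockquote>
<pre>
static void __exit mymodule_exit(void)
{
  stop_new_callbacks();  /* No new call_rcu() invocations after this. */
  rcu_barrier();         /* Wait for all in-flight callbacks to finish. */
  /* Only now is it safe for this module's code and data to vanish. */
}
module_exit(mymodule_exit);
</pre>
</blockquote>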
| 2081 | |
| 2082 | <p> |
| 2083 | Nikita Danilov noted this requirement for an analogous filesystem-unmount |
| 2084 | situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU. |
| 2085 | The need for <tt>rcu_barrier()</tt> for module unloading became |
| 2086 | apparent later. |
| 2087 | |
| 2088 | <h3><a name="Hotplug CPU">Hotplug CPU</a></h3> |
| 2089 | |
| 2090 | <p> |
| 2091 | The Linux kernel supports CPU hotplug, which means that CPUs |
| 2092 | can come and go. |
| 2093 | It is of course illegal to use any RCU API member from an offline CPU. |
| 2094 | This requirement was present from day one in DYNIX/ptx, but |
| 2095 | on the other hand, the Linux kernel's CPU-hotplug implementation |
| 2096 | is “interesting.” |
| 2097 | |
| 2098 | <p> |
| 2099 | The Linux-kernel CPU-hotplug implementation has notifiers that |
| 2100 | are used to allow the various kernel subsystems (including RCU) |
| 2101 | to respond appropriately to a given CPU-hotplug operation. |
| 2102 | Most RCU operations may be invoked from CPU-hotplug notifiers, |
| 2103 | including even normal synchronous grace-period operations |
| 2104 | such as <tt>synchronize_rcu()</tt>. |
| 2105 | However, expedited grace-period operations such as |
| 2106 | <tt>synchronize_rcu_expedited()</tt> are not supported, |
| 2107 | due to the fact that current implementations block CPU-hotplug |
| 2108 | operations, which could result in deadlock. |
| 2109 | |
| 2110 | <p> |
| 2111 | In addition, all-callback-wait operations such as |
| 2112 | <tt>rcu_barrier()</tt> are also not supported, due to the |
| 2113 | fact that there are phases of CPU-hotplug operations where |
| 2114 | the outgoing CPU's callbacks will not be invoked until after |
| 2115 | the CPU-hotplug operation ends, which could also result in deadlock. |
| 2116 | |
| 2117 | <h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3> |
| 2118 | |
| 2119 | <p> |
| 2120 | RCU depends on the scheduler, and the scheduler uses RCU to |
| 2121 | protect some of its data structures. |
| 2122 | This means the scheduler is forbidden from acquiring |
| 2123 | the runqueue locks and the priority-inheritance locks |
in the middle of an outermost RCU read-side critical section unless either
| 2125 | (1) it releases them before exiting that same |
| 2126 | RCU read-side critical section, or |
(2)	interrupts are disabled across
that entire RCU read-side critical section.
| 2129 | This same prohibition also applies (recursively!) to any lock that is acquired |
while holding any lock to which this prohibition applies.
Adhering to this rule prevents preemptible RCU from invoking
| 2132 | <tt>rcu_read_unlock_special()</tt> while either runqueue or |
| 2133 | priority-inheritance locks are held, thus avoiding deadlock. |
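
<p>
In a sketch, option (2) looks as follows, with <tt>rq_lock</tt> standing in
for any scheduler lock and <tt>do_scheduler_update()</tt> being a
hypothetical helper:

<blockquote>
<pre>
static DEFINE_SPINLOCK(rq_lock);  /* Hypothetical stand-in for a scheduler lock. */

static void update_scheduler_state(void)
{
  unsigned long flags;

  local_irq_save(flags);   /* Option (2): interrupts disabled across */
  rcu_read_lock();         /* the entire RCU read-side critical section, */
  spin_lock(&rq_lock);     /* so acquiring the scheduler lock is safe. */
  do_scheduler_update();   /* Hypothetical. */
  spin_unlock(&rq_lock);
  rcu_read_unlock();       /* Cannot invoke rcu_read_unlock_special(). */
  local_irq_restore(flags);
}
</pre>
</blockquote>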
| 2135 | <p> |
Prior to v4.4, it was only necessary to disable preemption across
| 2137 | RCU read-side critical sections that acquired scheduler locks. |
| 2138 | In v4.4, expedited grace periods started using IPIs, and these |
| 2139 | IPIs could force a <tt>rcu_read_unlock()</tt> to take the slowpath. |
| 2140 | Therefore, this expedited-grace-period change required disabling of |
| 2141 | interrupts, not just preemption. |
| 2142 | |
| 2143 | <p> |
For RCU's part, the preemptible-RCU <tt>rcu_read_unlock()</tt>
| 2145 | implementation must be written carefully to avoid similar deadlocks. |
| 2146 | In particular, <tt>rcu_read_unlock()</tt> must tolerate an |
| 2147 | interrupt where the interrupt handler invokes both |
| 2148 | <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. |
| 2149 | This possibility requires <tt>rcu_read_unlock()</tt> to use |
| 2150 | negative nesting levels to avoid destructive recursion via |
the interrupt handler's use of RCU.
| 2152 | |
| 2153 | <p> |
| 2154 | This pair of mutual scheduler-RCU requirements came as a |
| 2155 | <a href="https://lwn.net/Articles/453002/">complete surprise</a>. |
| 2156 | |
| 2157 | <p> |
| 2158 | As noted above, RCU makes use of kthreads, and it is necessary to |
| 2159 | avoid excessive CPU-time accumulation by these kthreads. |
| 2160 | This requirement was no surprise, but RCU's violation of it |
when running context-switch-heavy workloads on kernels built with
| 2162 | <tt>CONFIG_NO_HZ_FULL=y</tt> |
| 2163 | <a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>. |
| 2164 | RCU has made good progress towards meeting this requirement, even |
for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads,
| 2166 | but there is room for further improvement. |
| 2167 | |
| 2168 | <h3><a name="Tracing and RCU">Tracing and RCU</a></h3> |
| 2169 | |
| 2170 | <p> |
| 2171 | It is possible to use tracing on RCU code, but tracing itself |
| 2172 | uses RCU. |
| 2173 | For this reason, <tt>rcu_dereference_raw_notrace()</tt> |
| 2174 | is provided for use by tracing, which avoids the destructive |
| 2175 | recursion that could otherwise ensue. |
| 2176 | This API is also used by virtualization in some architectures, |
| 2177 | where RCU readers execute in environments in which tracing |
| 2178 | cannot be used. |
| 2179 | The tracing folks both located the requirement and provided the |
| 2180 | needed fix, so this surprise requirement was relatively painless. |
| 2181 | |
| 2182 | <h3><a name="Energy Efficiency">Energy Efficiency</a></h3> |
| 2183 | |
| 2184 | <p> |
| 2185 | Interrupting idle CPUs is considered socially unacceptable, |
| 2186 | especially by people with battery-powered embedded systems. |
| 2187 | RCU therefore conserves energy by detecting which CPUs are |
| 2188 | idle, including tracking CPUs that have been interrupted from idle. |
| 2189 | This is a large part of the energy-efficiency requirement, |
| 2190 | so I learned of this via an irate phone call. |
| 2191 | |
| 2192 | <p> |
| 2193 | Because RCU avoids interrupting idle CPUs, it is illegal to |
| 2194 | execute an RCU read-side critical section on an idle CPU. |
| 2195 | (Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat |
| 2196 | if you try it.) |
The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt>
event tracing are provided to work around this restriction.
| 2199 | In addition, <tt>rcu_is_watching()</tt> may be used to |
| 2200 | test whether or not it is currently legal to run RCU read-side |
| 2201 | critical sections on this CPU. |
| 2202 | I learned of the need for diagnostics on the one hand |
| 2203 | and <tt>RCU_NONIDLE()</tt> on the other while inspecting |
| 2204 | idle-loop code. |
| 2205 | Steven Rostedt supplied <tt>_rcuidle</tt> event tracing, |
| 2206 | which is used quite heavily in the idle loop. |
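
<p>
For example, a tracepoint invoked from the idle loop (where RCU is not
watching) can be wrapped as in this sketch, in which
<tt>trace_my_idle_event()</tt> is a hypothetical tracepoint:

<blockquote>
<pre>
static void report_idle_event(int cpu)
{
  WARN_ON_ONCE(rcu_is_watching());       /* Expect to be called from idle. */
  RCU_NONIDLE(trace_my_idle_event(cpu)); /* RCU watches just long enough */
                                         /* for the tracepoint's RCU readers. */
}
</pre>
</blockquote>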
| 2207 | |
| 2208 | <p> |
| 2209 | It is similarly socially unacceptable to interrupt an |
| 2210 | <tt>nohz_full</tt> CPU running in userspace. |
| 2211 | RCU must therefore track <tt>nohz_full</tt> userspace |
| 2212 | execution. |
| 2213 | And in |
| 2214 | <a href="https://lwn.net/Articles/558284/"><tt>CONFIG_NO_HZ_FULL_SYSIDLE=y</tt></a> |
| 2215 | kernels, RCU must separately track idle CPUs on the one hand and |
| 2216 | CPUs that are either idle or executing in userspace on the other. |
| 2217 | In both cases, RCU must be able to sample state at two points in |
| 2218 | time, and be able to determine whether or not some other CPU spent |
| 2219 | any time idle and/or executing in userspace. |
| 2220 | |
| 2221 | <p> |
| 2222 | These energy-efficiency requirements have proven quite difficult to |
| 2223 | understand and to meet, for example, there have been more than five |
| 2224 | clean-sheet rewrites of RCU's energy-efficiency code, the last of |
| 2225 | which was finally able to demonstrate |
| 2226 | <a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>. |
| 2227 | As noted earlier, |
| 2228 | I learned of many of these requirements via angry phone calls: |
| 2229 | Flaming me on the Linux-kernel mailing list was apparently not |
| 2230 | sufficient to fully vent their ire at RCU's energy-efficiency bugs! |
| 2231 | |
<h3><a name="Memory Efficiency">Memory Efficiency</a></h3>
| 2233 | |
| 2234 | <p> |
| 2235 | Although small-memory non-realtime systems can simply use Tiny RCU, |
| 2236 | code size is only one aspect of memory efficiency. |
| 2237 | Another aspect is the size of the <tt>rcu_head</tt> structure |
| 2238 | used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>. |
| 2239 | Although this structure contains nothing more than a pair of pointers, |
| 2240 | it does appear in many RCU-protected data structures, including |
| 2241 | some that are size critical. |
| 2242 | The <tt>page</tt> structure is a case in point, as evidenced by |
| 2243 | the many occurrences of the <tt>union</tt> keyword within that structure. |
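
<p>
For reference, the <tt>rcu_head</tt> structure really is just a pair of
pointers, roughly as follows (simplified from the actual Linux-kernel
definition):

<blockquote>
<pre>
struct rcu_head {
  struct rcu_head *next;                 /* Next callback in the list. */
  void (*func)(struct rcu_head *head);   /* Invoked after the grace period. */
};
</pre>
</blockquote>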
| 2244 | |
| 2245 | <p> |
| 2246 | This need for memory efficiency is one reason that RCU uses hand-crafted |
| 2247 | singly linked lists to track the <tt>rcu_head</tt> structures that |
| 2248 | are waiting for a grace period to elapse. |
| 2249 | It is also the reason why <tt>rcu_head</tt> structures do not contain |
| 2250 | debug information, such as fields tracking the file and line of the |
| 2251 | <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them. |
| 2252 | Although this information might appear in debug-only kernel builds at some |
| 2253 | point, in the meantime, the <tt>->func</tt> field will often provide |
| 2254 | the needed debug information. |
| 2255 | |
| 2256 | <p> |
| 2257 | However, in some cases, the need for memory efficiency leads to even |
| 2258 | more extreme measures. |
| 2259 | Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field |
| 2260 | shares storage with a great many other structures that are used at |
| 2261 | various points in the corresponding page's lifetime. |
| 2262 | In order to correctly resolve certain |
| 2263 | <a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>, |
| 2264 | the Linux kernel's memory-management subsystem needs a particular bit |
| 2265 | to remain zero during all phases of grace-period processing, |
| 2266 | and that bit happens to map to the bottom bit of the |
| 2267 | <tt>rcu_head</tt> structure's <tt>->next</tt> field. |
| 2268 | RCU makes this guarantee as long as <tt>call_rcu()</tt> |
| 2269 | is used to post the callback, as opposed to <tt>kfree_rcu()</tt> |
| 2270 | or some future “lazy” |
| 2271 | variant of <tt>call_rcu()</tt> that might one day be created for |
| 2272 | energy-efficiency purposes. |
| 2273 | |
<h3><a name="Performance, Scalability, Response Time, and Reliability">
| 2275 | Performance, Scalability, Response Time, and Reliability</a></h3> |
| 2276 | |
| 2277 | <p> |
| 2278 | Expanding on the |
| 2279 | <a href="#Performance and Scalability">earlier discussion</a>, |
| 2280 | RCU is used heavily by hot code paths in performance-critical |
| 2281 | portions of the Linux kernel's networking, security, virtualization, |
| 2282 | and scheduling code paths. |
| 2283 | RCU must therefore use efficient implementations, especially in its |
| 2284 | read-side primitives. |
| 2285 | To that end, it would be good if preemptible RCU's implementation |
of <tt>rcu_read_lock()</tt> could be inlined; however, doing
| 2287 | this requires resolving <tt>#include</tt> issues with the |
| 2288 | <tt>task_struct</tt> structure. |
| 2289 | |
| 2290 | <p> |
| 2291 | The Linux kernel supports hardware configurations with up to |
| 2292 | 4096 CPUs, which means that RCU must be extremely scalable. |
| 2293 | Algorithms that involve frequent acquisitions of global locks or |
| 2294 | frequent atomic operations on global variables simply cannot be |
| 2295 | tolerated within the RCU implementation. |
| 2296 | RCU therefore makes heavy use of a combining tree based on the |
| 2297 | <tt>rcu_node</tt> structure. |
| 2298 | RCU is required to tolerate all CPUs continuously invoking any |
| 2299 | combination of RCU's runtime primitives with minimal per-operation |
| 2300 | overhead. |
| 2301 | In fact, in many cases, increasing load must <i>decrease</i> the |
| 2302 | per-operation overhead, witness the batching optimizations for |
| 2303 | <tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>, |
| 2304 | <tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>. |
| 2305 | As a general rule, RCU must cheerfully accept whatever the |
| 2306 | rest of the Linux kernel decides to throw at it. |
| 2307 | |
| 2308 | <p> |
| 2309 | The Linux kernel is used for real-time workloads, especially |
| 2310 | in conjunction with the |
| 2311 | <a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>. |
| 2312 | The real-time-latency response requirements are such that the |
| 2313 | traditional approach of disabling preemption across RCU |
| 2314 | read-side critical sections is inappropriate. |
| 2315 | Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore |
| 2316 | use an RCU implementation that allows RCU read-side critical |
| 2317 | sections to be preempted. |
| 2318 | This requirement made its presence known after users made it |
| 2319 | clear that an earlier |
| 2320 | <a href="https://lwn.net/Articles/107930/">real-time patch</a> |
| 2321 | did not meet their needs, in conjunction with some |
| 2322 | <a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a> |
| 2323 | encountered by a very early version of the -rt patchset. |
| 2324 | |
| 2325 | <p> |
| 2326 | In addition, RCU must make do with a sub-100-microsecond real-time latency |
| 2327 | budget. |
| 2328 | In fact, on smaller systems with the -rt patchset, the Linux kernel |
| 2329 | provides sub-20-microsecond real-time latencies for the whole kernel, |
| 2330 | including RCU. |
| 2331 | RCU's scalability and latency must therefore be sufficient for |
| 2332 | these sorts of configurations. |
| 2333 | To my surprise, the sub-100-microsecond real-time latency budget |
| 2334 | <a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf"> |
| 2335 | applies to even the largest systems [PDF]</a>, |
| 2336 | up to and including systems with 4096 CPUs. |
| 2337 | This real-time requirement motivated the grace-period kthread, which |
| 2338 | also simplified handling of a number of race conditions. |
| 2339 | |
| 2340 | <p> |
| 2341 | Finally, RCU's status as a synchronization primitive means that |
| 2342 | any RCU failure can result in arbitrary memory corruption that can be |
| 2343 | extremely difficult to debug. |
| 2344 | This means that RCU must be extremely reliable, which in |
| 2345 | practice also means that RCU must have an aggressive stress-test |
| 2346 | suite. |
| 2347 | This stress-test suite is called <tt>rcutorture</tt>. |
| 2348 | |
| 2349 | <p> |
| 2350 | Although the need for <tt>rcutorture</tt> was no surprise, |
| 2351 | the current immense popularity of the Linux kernel is posing |
| 2352 | interesting—and perhaps unprecedented—validation |
| 2353 | challenges. |
| 2354 | To see this, keep in mind that there are well over one billion |
| 2355 | instances of the Linux kernel running today, given Android |
| 2356 | smartphones, Linux-powered televisions, and servers. |
| 2357 | This number can be expected to increase sharply with the advent of |
| 2358 | the celebrated Internet of Things. |
| 2359 | |
| 2360 | <p> |
Suppose that RCU contains a race condition that manifests on average
once per million years of runtime.
Given more than a billion instances, that amounts to roughly one
thousand failures per year, or about three per <i>day</i>, across
the installed base.
| 2365 | RCU could simply hide behind hardware error rates, given that no one |
| 2366 | should really expect their smartphone to last for a million years. |
| 2367 | However, anyone taking too much comfort from this thought should |
| 2368 | consider the fact that in most jurisdictions, a successful multi-year |
| 2369 | test of a given mechanism, which might include a Linux kernel, |
| 2370 | suffices for a number of types of safety-critical certifications. |
| 2371 | In fact, rumor has it that the Linux kernel is already being used |
| 2372 | in production for safety-critical applications. |
I don't know about you, but I would feel quite bad if a bug in RCU
killed someone, which might explain my recent focus on validation
and verification.
| 2376 | |
| 2377 | <h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2> |
| 2378 | |
| 2379 | <p> |
| 2380 | One of the more surprising things about RCU is that there are now |
| 2381 | no fewer than five <i>flavors</i>, or API families. |
| 2382 | In addition, the primary flavor that has been the sole focus up to |
| 2383 | this point has two different implementations, non-preemptible and |
| 2384 | preemptible. |
| 2385 | The other four flavors are listed below, with requirements for each |
| 2386 | described in a separate section. |
| 2387 | |
| 2388 | <ol> |
| 2389 | <li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor</a> |
| 2390 | <li> <a href="#Sched Flavor">Sched Flavor</a> |
| 2391 | <li> <a href="#Sleepable RCU">Sleepable RCU</a> |
| 2392 | <li> <a href="#Tasks RCU">Tasks RCU</a> |
| 2393 | </ol> |
| 2394 | |
| 2395 | <h3><a name="Bottom-Half Flavor">Bottom-Half Flavor</a></h3> |
| 2396 | |
| 2397 | <p> |
| 2398 | The softirq-disable (AKA “bottom-half”, |
| 2399 | hence the “_bh” abbreviations) |
| 2400 | flavor of RCU, or <i>RCU-bh</i>, was developed by |
| 2401 | Dipankar Sarma to provide a flavor of RCU that could withstand the |
| 2402 | network-based denial-of-service attacks researched by Robert |
| 2403 | Olsson. |
| 2404 | These attacks placed so much networking load on the system |
| 2405 | that some of the CPUs never exited softirq execution, |
| 2406 | which in turn prevented those CPUs from ever executing a context switch, |
| 2407 | which, in the RCU implementation of that time, prevented grace periods |
| 2408 | from ever ending. |
| 2409 | The result was an out-of-memory condition and a system hang. |
| 2410 | |
| 2411 | <p> |
| 2412 | The solution was the creation of RCU-bh, which does |
| 2413 | <tt>local_bh_disable()</tt> |
| 2414 | across its read-side critical sections, and which uses the transition |
| 2415 | from one type of softirq processing to another as a quiescent state |
| 2416 | in addition to context switch, idle, user mode, and offline. |
| 2417 | This means that RCU-bh grace periods can complete even when some of |
| 2418 | the CPUs execute in softirq indefinitely, thus allowing algorithms |
| 2419 | based on RCU-bh to withstand network-based denial-of-service attacks. |
| 2420 | |
| 2421 | <p> |
| 2422 | Because |
| 2423 | <tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt> |
disable and re-enable softirq handlers, any attempt to start a softirq
handler during an RCU-bh read-side critical section will be deferred.
| 2427 | In this case, <tt>rcu_read_unlock_bh()</tt> |
| 2428 | will invoke softirq processing, which can take considerable time. |
| 2429 | One can of course argue that this softirq overhead should be associated |
| 2430 | with the code following the RCU-bh read-side critical section rather |
| 2431 | than <tt>rcu_read_unlock_bh()</tt>, but the fact |
| 2432 | is that most profiling tools cannot be expected to make this sort |
| 2433 | of fine distinction. |
| 2434 | For example, suppose that a three-millisecond-long RCU-bh read-side |
| 2435 | critical section executes during a time of heavy networking load. |
| 2436 | There will very likely be an attempt to invoke at least one softirq |
| 2437 | handler during that three milliseconds, but any such invocation will |
| 2438 | be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>. |
| 2439 | This can of course make it appear at first glance as if |
| 2440 | <tt>rcu_read_unlock_bh()</tt> was executing very slowly. |
| 2441 | |
| 2442 | <p> |
| 2443 | The |
| 2444 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a> |
| 2445 | includes |
| 2446 | <tt>rcu_read_lock_bh()</tt>, |
| 2447 | <tt>rcu_read_unlock_bh()</tt>, |
| 2448 | <tt>rcu_dereference_bh()</tt>, |
| 2449 | <tt>rcu_dereference_bh_check()</tt>, |
| 2450 | <tt>synchronize_rcu_bh()</tt>, |
| 2451 | <tt>synchronize_rcu_bh_expedited()</tt>, |
| 2452 | <tt>call_rcu_bh()</tt>, |
| 2453 | <tt>rcu_barrier_bh()</tt>, and |
| 2454 | <tt>rcu_read_lock_bh_held()</tt>. |
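
<p>
As a minimal sketch (the <tt>gp</tt> pointer and
<tt>do_something_with()</tt> are hypothetical), an RCU-bh reader
might look like this:

<blockquote>
<pre>
1 rcu_read_lock_bh();
2 p = rcu_dereference_bh(gp);
3 if (p)
4   do_something_with(p->a);
5 rcu_read_unlock_bh();
</pre>
</blockquote>

<p>
An updater would then use <tt>synchronize_rcu_bh()</tt> or
<tt>call_rcu_bh()</tt> to wait for all such readers before freeing
the old data.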
| 2455 | |
| 2456 | <h3><a name="Sched Flavor">Sched Flavor</a></h3> |
| 2457 | |
| 2458 | <p> |
| 2459 | Before preemptible RCU, waiting for an RCU grace period had the |
| 2460 | side effect of also waiting for all pre-existing interrupt |
| 2461 | and NMI handlers. |
| 2462 | However, there are legitimate preemptible-RCU implementations that |
| 2463 | do not have this property, given that any point in the code outside |
| 2464 | of an RCU read-side critical section can be a quiescent state. |
| 2465 | Therefore, <i>RCU-sched</i> was created, which follows “classic” |
RCU in that an RCU-sched grace period waits for pre-existing
| 2467 | interrupt and NMI handlers. |
| 2468 | In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched |
| 2469 | APIs have identical implementations, while kernels built with |
| 2470 | <tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each. |
| 2471 | |
| 2472 | <p> |
| 2473 | Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels, |
| 2474 | <tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> |
| 2475 | disable and re-enable preemption, respectively. |
| 2476 | This means that if there was a preemption attempt during the |
| 2477 | RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt> |
| 2478 | will enter the scheduler, with all the latency and overhead entailed. |
| 2479 | Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look |
| 2480 | as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly. |
| 2481 | However, the highest-priority task won't be preempted, so that task |
| 2482 | will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations. |
| 2483 | |
| 2484 | <p> |
| 2485 | The |
| 2486 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a> |
| 2487 | includes |
| 2488 | <tt>rcu_read_lock_sched()</tt>, |
| 2489 | <tt>rcu_read_unlock_sched()</tt>, |
| 2490 | <tt>rcu_read_lock_sched_notrace()</tt>, |
| 2491 | <tt>rcu_read_unlock_sched_notrace()</tt>, |
| 2492 | <tt>rcu_dereference_sched()</tt>, |
| 2493 | <tt>rcu_dereference_sched_check()</tt>, |
| 2494 | <tt>synchronize_sched()</tt>, |
| 2495 | <tt>synchronize_rcu_sched_expedited()</tt>, |
| 2496 | <tt>call_rcu_sched()</tt>, |
| 2497 | <tt>rcu_barrier_sched()</tt>, and |
| 2498 | <tt>rcu_read_lock_sched_held()</tt>. |
| 2499 | However, anything that disables preemption also marks an RCU-sched |
| 2500 | read-side critical section, including |
| 2501 | <tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>, |
| 2502 | <tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>, |
| 2503 | and so on. |
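
<p>
The following sketch (in which <tt>gp</tt>, <tt>gp_lock</tt>, and
<tt>do_something_with()</tt> are hypothetical) shows a
preemption-disabled region acting as an RCU-sched reader, with the
updater using <tt>synchronize_sched()</tt> to wait for it:

<blockquote>
<pre>
 1 /* Reader: any preemption-disabled region qualifies. */
 2 preempt_disable();
 3 p = rcu_dereference_sched(gp);
 4 if (p)
 5   do_something_with(p->a);
 6 preempt_enable();
 7
 8 /* Updater, with gp_lock held: */
 9 oldp = gp;
10 rcu_assign_pointer(gp, newp);
11 synchronize_sched();
12 kfree(oldp);
</pre>
</blockquote>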
| 2504 | |
| 2505 | <h3><a name="Sleepable RCU">Sleepable RCU</a></h3> |
| 2506 | |
| 2507 | <p> |
| 2508 | For well over a decade, someone saying “I need to block within |
| 2509 | an RCU read-side critical section” was a reliable indication |
| 2510 | that this someone did not understand RCU. |
| 2511 | After all, if you are always blocking in an RCU read-side critical |
| 2512 | section, you can probably afford to use a higher-overhead synchronization |
| 2513 | mechanism. |
| 2514 | However, that changed with the advent of the Linux kernel's notifiers, |
| 2515 | whose RCU read-side critical |
| 2516 | sections almost never sleep, but sometimes need to. |
| 2517 | This resulted in the introduction of |
| 2518 | <a href="https://lwn.net/Articles/202847/">sleepable RCU</a>, |
| 2519 | or <i>SRCU</i>. |
| 2520 | |
| 2521 | <p> |
| 2522 | SRCU allows different domains to be defined, with each such domain |
| 2523 | defined by an instance of an <tt>srcu_struct</tt> structure. |
| 2524 | A pointer to this structure must be passed in to each SRCU function, |
| 2525 | for example, <tt>synchronize_srcu(&ss)</tt>, where |
| 2526 | <tt>ss</tt> is the <tt>srcu_struct</tt> structure. |
| 2527 | The key benefit of these domains is that a slow SRCU reader in one |
| 2528 | domain does not delay an SRCU grace period in some other domain. |
| 2529 | That said, one consequence of these domains is that read-side code |
| 2530 | must pass a “cookie” from <tt>srcu_read_lock()</tt> |
| 2531 | to <tt>srcu_read_unlock()</tt>, for example, as follows: |
| 2532 | |
| 2533 | <blockquote> |
| 2534 | <pre> |
| 2535 | 1 int idx; |
| 2536 | 2 |
| 2537 | 3 idx = srcu_read_lock(&ss); |
| 2538 | 4 do_something(); |
| 2539 | 5 srcu_read_unlock(&ss, idx); |
| 2540 | </pre> |
| 2541 | </blockquote> |
| 2542 | |
| 2543 | <p> |
As noted above, it is legal to block within SRCU read-side critical
sections; however, with great power comes great responsibility.
| 2546 | If you block forever in one of a given domain's SRCU read-side critical |
| 2547 | sections, then that domain's grace periods will also be blocked forever. |
| 2548 | Of course, one good way to block forever is to deadlock, which can |
| 2549 | happen if any operation in a given domain's SRCU read-side critical |
| 2550 | section can block waiting, either directly or indirectly, for that domain's |
| 2551 | grace period to elapse. |
| 2552 | For example, this results in a self-deadlock: |
| 2553 | |
| 2554 | <blockquote> |
| 2555 | <pre> |
| 2556 | 1 int idx; |
| 2557 | 2 |
| 2558 | 3 idx = srcu_read_lock(&ss); |
| 2559 | 4 do_something(); |
| 2560 | 5 synchronize_srcu(&ss); |
| 2561 | 6 srcu_read_unlock(&ss, idx); |
| 2562 | </pre> |
| 2563 | </blockquote> |
| 2564 | |
| 2565 | <p> |
| 2566 | However, if line 5 acquired a mutex that was held across |
| 2567 | a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>, |
| 2568 | deadlock would still be possible. |
| 2569 | Furthermore, if line 5 acquired a mutex that was held across |
| 2570 | a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>, |
| 2571 | and if an <tt>ss1</tt>-domain SRCU read-side critical section |
acquired another mutex that was held across an <tt>ss</tt>-domain
| 2573 | <tt>synchronize_srcu()</tt>, |
| 2574 | deadlock would again be possible. |
| 2575 | Such a deadlock cycle could extend across an arbitrarily large number |
| 2576 | of different SRCU domains. |
| 2577 | Again, with great power comes great responsibility. |
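
<p>
The following sketch (in which the mutexes and the assignment of
code to CPUs are hypothetical) illustrates one such cross-domain
deadlock cycle:

<blockquote>
<pre>
 1 /* CPU 0: an ss-domain reader acquires mutex m1. */
 2 i0 = srcu_read_lock(&ss);
 3 mutex_lock(&m1);          /* Waits on CPU 1. */
 4 mutex_unlock(&m1);
 5 srcu_read_unlock(&ss, i0);
 6
 7 /* CPU 1: holds m1 across an ss1-domain grace period. */
 8 mutex_lock(&m1);
 9 synchronize_srcu(&ss1);   /* Waits on CPU 2. */
10 mutex_unlock(&m1);
11
12 /* CPU 2: an ss1-domain reader acquires mutex m2. */
13 i2 = srcu_read_lock(&ss1);
14 mutex_lock(&m2);          /* Waits on CPU 3. */
15 mutex_unlock(&m2);
16 srcu_read_unlock(&ss1, i2);
17
18 /* CPU 3: holds m2 across an ss-domain grace period. */
19 mutex_lock(&m2);
20 synchronize_srcu(&ss);    /* Waits on CPU 0. */
21 mutex_unlock(&m2);
</pre>
</blockquote>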
| 2578 | |
| 2579 | <p> |
| 2580 | Unlike the other RCU flavors, SRCU read-side critical sections can |
| 2581 | run on idle and even offline CPUs. |
| 2582 | This ability requires that <tt>srcu_read_lock()</tt> and |
| 2583 | <tt>srcu_read_unlock()</tt> contain memory barriers, which means |
| 2584 | that SRCU readers will run a bit slower than would RCU readers. |
| 2585 | It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt> |
| 2586 | API, which, in combination with <tt>srcu_read_unlock()</tt>, |
| 2587 | guarantees a full memory barrier. |
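
<p>
A minimal sketch of this combination (<tt>do_something()</tt> and
<tt>later_code()</tt> are hypothetical) might be as follows:

<blockquote>
<pre>
1 idx = srcu_read_lock(&ss);
2 do_something();
3 srcu_read_unlock(&ss, idx);
4 smp_mb__after_srcu_read_unlock(); /* Full memory barrier. */
5 later_code();
</pre>
</blockquote>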
| 2588 | |
| 2589 | <p> |
| 2590 | The |
| 2591 | <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a> |
| 2592 | includes |
| 2593 | <tt>srcu_read_lock()</tt>, |
| 2594 | <tt>srcu_read_unlock()</tt>, |
| 2595 | <tt>srcu_dereference()</tt>, |
| 2596 | <tt>srcu_dereference_check()</tt>, |
| 2597 | <tt>synchronize_srcu()</tt>, |
| 2598 | <tt>synchronize_srcu_expedited()</tt>, |
| 2599 | <tt>call_srcu()</tt>, |
| 2600 | <tt>srcu_barrier()</tt>, and |
| 2601 | <tt>srcu_read_lock_held()</tt>. |
| 2602 | It also includes |
| 2603 | <tt>DEFINE_SRCU()</tt>, |
| 2604 | <tt>DEFINE_STATIC_SRCU()</tt>, and |
| 2605 | <tt>init_srcu_struct()</tt> |
| 2606 | APIs for defining and initializing <tt>srcu_struct</tt> structures. |
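
<p>
For example, a statically allocated domain might be defined and used
as follows (the <tt>my_srcu</tt> domain and
<tt>my_srcu_callback()</tt> function are hypothetical):

<blockquote>
<pre>
1 DEFINE_STATIC_SRCU(my_srcu);
2
3 /* Post a callback in this domain... */
4 call_srcu(&my_srcu, &p->rh, my_srcu_callback);
5
6 /* ...or wait synchronously for a grace period. */
7 synchronize_srcu(&my_srcu);
</pre>
</blockquote>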
| 2607 | |
| 2608 | <h3><a name="Tasks RCU">Tasks RCU</a></h3> |
| 2609 | |
| 2610 | <p> |
Some forms of tracing use “trampolines” to handle the
| 2612 | binary rewriting required to install different types of probes. |
| 2613 | It would be good to be able to free old trampolines, which sounds |
| 2614 | like a job for some form of RCU. |
| 2615 | However, because it is necessary to be able to install a trace |
| 2616 | anywhere in the code, it is not possible to use read-side markers |
| 2617 | such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. |
| 2618 | In addition, it does not work to have these markers in the trampoline |
| 2619 | itself, because there would need to be instructions following |
| 2620 | <tt>rcu_read_unlock()</tt>. |
| 2621 | Although <tt>synchronize_rcu()</tt> would guarantee that execution |
| 2622 | reached the <tt>rcu_read_unlock()</tt>, it would not be able to |
| 2623 | guarantee that execution had completely left the trampoline. |
| 2624 | |
| 2625 | <p> |
| 2626 | The solution, in the form of |
| 2627 | <a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>, |
| 2628 | is to have implicit |
| 2629 | read-side critical sections that are delimited by voluntary context |
| 2630 | switches, that is, calls to <tt>schedule()</tt>, |
| 2631 | <tt>cond_resched_rcu_qs()</tt>, and |
| 2632 | <tt>synchronize_rcu_tasks()</tt>. |
| 2633 | In addition, transitions to and from userspace execution also delimit |
| 2634 | tasks-RCU read-side critical sections. |
| 2635 | |
| 2636 | <p> |
| 2637 | The tasks-RCU API is quite compact, consisting only of |
| 2638 | <tt>call_rcu_tasks()</tt>, |
| 2639 | <tt>synchronize_rcu_tasks()</tt>, and |
| 2640 | <tt>rcu_barrier_tasks()</tt>. |
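
<p>
For example, a tracer might free an old trampoline as in the
following sketch, in which <tt>unbind_trampoline()</tt> and
<tt>free_trampoline()</tt> stand in for the actual bookkeeping:

<blockquote>
<pre>
 1 /* Prevent new calls into the old trampoline. */
 2 unbind_trampoline(tramp);
 3
 4 /* Wait for every task to voluntarily context switch or be
 5  * seen in userspace, so that none can still be executing
 6  * in the trampoline. */
 7 synchronize_rcu_tasks();
 8
 9 /* Now it is safe to free the trampoline. */
10 free_trampoline(tramp);
</pre>
</blockquote>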
| 2641 | |
| 2642 | <h2><a name="Possible Future Changes">Possible Future Changes</a></h2> |
| 2643 | |
| 2644 | <p> |
| 2645 | One of the tricks that RCU uses to attain update-side scalability is |
| 2646 | to increase grace-period latency with increasing numbers of CPUs. |
| 2647 | If this becomes a serious problem, it will be necessary to rework the |
| 2648 | grace-period state machine so as to avoid the need for the additional |
| 2649 | latency. |
| 2650 | |
| 2651 | <p> |
Expedited grace periods scan the CPUs, so their latency and overhead
increase with increasing numbers of CPUs.
| 2654 | If this becomes a serious problem on large systems, it will be necessary |
| 2655 | to do some redesign to avoid this scalability problem. |
| 2656 | |
| 2657 | <p> |
| 2658 | RCU disables CPU hotplug in a few places, perhaps most notably in the |
| 2659 | expedited grace-period and <tt>rcu_barrier()</tt> operations. |
| 2660 | If there is a strong reason to use expedited grace periods in CPU-hotplug |
| 2661 | notifiers, it will be necessary to avoid disabling CPU hotplug. |
| 2662 | This would introduce some complexity, so there had better be a <i>very</i> |
| 2663 | good reason. |
| 2664 | |
| 2665 | <p> |
| 2666 | The tradeoff between grace-period latency on the one hand and interruptions |
| 2667 | of other CPUs on the other hand may need to be re-examined. |
The ideal is of course zero grace-period latency combined with zero
interprocessor interrupts during an expedited grace-period
operation.
| 2671 | While this ideal is unlikely to be achievable, it is quite possible that |
| 2672 | further improvements can be made. |
| 2673 | |
| 2674 | <p> |
| 2675 | The multiprocessor implementations of RCU use a combining tree that |
| 2676 | groups CPUs so as to reduce lock contention and increase cache locality. |
| 2677 | However, this combining tree does not spread its memory across NUMA |
| 2678 | nodes nor does it align the CPU groups with hardware features such |
| 2679 | as sockets or cores. |
| 2680 | Such spreading and alignment is currently believed to be unnecessary |
| 2681 | because the hotpath read-side primitives do not access the combining |
| 2682 | tree, nor does <tt>call_rcu()</tt> in the common case. |
| 2683 | If you believe that your architecture needs such spreading and alignment, |
| 2684 | then your architecture should also benefit from the |
| 2685 | <tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set |
| 2686 | to the number of CPUs in a socket, NUMA node, or whatever. |
| 2687 | If the number of CPUs is too large, use a fraction of the number of |
| 2688 | CPUs. |
| 2689 | If the number of CPUs is a large prime number, well, that certainly |
| 2690 | is an “interesting” architectural choice! |
| 2691 | More flexible arrangements might be considered, but only if |
| 2692 | <tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only |
| 2693 | if the inadequacy has been demonstrated by a carefully run and |
| 2694 | realistic system-level workload. |
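
<p>
For example, on a hypothetical system built from 16-CPU sockets,
the leaf-level fanout might be set on the kernel boot command line
as follows:

<blockquote>
<pre>
rcutree.rcu_fanout_leaf=16
</pre>
</blockquote>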
| 2695 | |
| 2696 | <p> |
| 2697 | Please note that arrangements that require RCU to remap CPU numbers will |
| 2698 | require extremely good demonstration of need and full exploration of |
| 2699 | alternatives. |
| 2700 | |
| 2701 | <p> |
| 2702 | There is an embarrassingly large number of flavors of RCU, and this |
| 2703 | number has been increasing over time. |
| 2704 | Perhaps it will be possible to combine some at some future date. |
| 2705 | |
| 2706 | <p> |
| 2707 | RCU's various kthreads are reasonably recent additions. |
| 2708 | It is quite likely that adjustments will be required to more gracefully |
| 2709 | handle extreme loads. |
| 2710 | It might also be necessary to be able to relate CPU utilization by |
| 2711 | RCU's kthreads and softirq handlers to the code that instigated this |
| 2712 | CPU utilization. |
| 2713 | For example, RCU callback overhead might be charged back to the |
| 2714 | originating <tt>call_rcu()</tt> instance, though probably not |
| 2715 | in production kernels. |
| 2716 | |
| 2717 | <h2><a name="Summary">Summary</a></h2> |
| 2718 | |
| 2719 | <p> |
This document has presented more than two decades' worth of RCU
requirements.
| 2722 | Given that the requirements keep changing, this will not be the last |
| 2723 | word on this subject, but at least it serves to get an important |
| 2724 | subset of the requirements set forth. |
| 2725 | |
| 2726 | <h2><a name="Acknowledgments">Acknowledgments</a></h2> |
| 2727 | |
| 2728 | I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, |
| 2729 | Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and |
| 2730 | Andy Lutomirski for their help in rendering |
| 2731 | this article human readable, and to Michelle Rankin for her support |
| 2732 | of this effort. |
| 2733 | Other contributions are acknowledged in the Linux kernel's git archive. |
| 2734 | The cartoon is copyright (c) 2013 by Melissa Broussard, |
| 2735 | and is provided |
| 2736 | under the terms of the Creative Commons Attribution-Share Alike 3.0 |
| 2737 | United States license. |
| 2738 | |
| 2739 | <p>@@QQAL@@ |
| 2740 | |
| 2741 | </body></html> |