Mauro Carvalho Chehab | 6b05dfa | 2020-04-21 19:04:02 +0200 | [diff] [blame] | 1 | .. SPDX-License-Identifier: GPL-2.0 |
| 2 | |
| 3 | ================================ |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 4 | Review Checklist for RCU Patches |
Mauro Carvalho Chehab | 6b05dfa | 2020-04-21 19:04:02 +0200 | [diff] [blame] | 5 | ================================ |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 6 | |
| 7 | |
| 8 | This document contains a checklist for producing and reviewing patches |
| 9 | that make use of RCU. Violating any of the rules listed below will |
| 10 | result in the same sorts of problems that leaving out a locking primitive |
| 11 | would cause. This list is based on experiences reviewing such patches |
| 12 | over a rather long period of time, but improvements are always welcome! |
| 13 | |
| 14 | 0. Is RCU being applied to a read-mostly situation? If the data |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 15 | structure is updated more than about 10% of the time, then you |
| 16 | should strongly consider some other approach, unless detailed |
| 17 | performance measurements show that RCU is nonetheless the right |
| 18 | tool for the job. Yes, RCU does reduce read-side overhead by |
| 19 | increasing write-side overhead, which is exactly why normal uses |
| 20 | of RCU will do much more reading than updating. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 21 | |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 22 | Another exception is where performance is not an issue, and RCU |
| 23 | provides a simpler implementation. An example of this situation |
| 24 | is the dynamic NMI code in the Linux 2.6 kernel, at least on |
| 25 | architectures where NMIs are rare. |
| 26 | |
| 27 | Yet another exception is where the low real-time latency of RCU's |
| 28 | read-side primitives is critically important. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 29 | |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 30 | One final exception is where RCU readers are used to prevent |
| 31 | the ABA problem (https://en.wikipedia.org/wiki/ABA_problem) |
| 32 | for lockless updates. This does result in the mildly |
| 33 | counter-intuitive situation where rcu_read_lock() and |
| 34 | rcu_read_unlock() are used to protect updates, however, this |
| 35 | approach provides the same potential simplifications that garbage |
| 36 | collectors do. |
| 37 | |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 38 | 1. Does the update code have proper mutual exclusion? |
| 39 | |
| 40 | RCU does allow -readers- to run (almost) naked, but -writers- must |
| 41 | still use some sort of mutual exclusion, such as: |
| 42 | |
| 43 | a. locking, |
| 44 | b. atomic operations, or |
| 45 | c. restricting updates to a single task. |
| 46 | |
| 47 | If you choose #b, be prepared to describe how you have handled |
| 48 | memory barriers on weakly ordered machines (pretty much all of |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 49 | them -- even x86 allows later loads to be reordered to precede |
| 50 | earlier stores), and be prepared to explain why this added |
| 51 | complexity is worthwhile. If you choose #c, be prepared to |
| 52 | explain how this single task does not become a major bottleneck on |
| 53 | big multiprocessor machines (for example, if the task is updating |
| 54 | information relating to itself that other tasks can read, there |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 55 | by definition can be no bottleneck). Note that the definition |
| 56 | of "large" has changed significantly: Eight CPUs was "large" |
| 57 | in the year 2000, but a hundred CPUs was unremarkable in 2017. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 58 | |
| 59 | 2. Do the RCU read-side critical sections make proper use of |
| 60 | rcu_read_lock() and friends? These primitives are needed |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 61 | to prevent grace periods from ending prematurely, which |
| 62 | could result in data being unceremoniously freed out from |
| 63 | under your read-side code, which can greatly increase the |
| 64 | actuarial risk of your kernel. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 65 | |
Paul E. McKenney | dd81eca | 2005-09-10 00:26:24 -0700 | [diff] [blame] | 66 | As a rough rule of thumb, any dereference of an RCU-protected |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 67 | pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(), |
| 68 | rcu_read_lock_sched(), or by the appropriate update-side lock. |
| 69 | Disabling of preemption can serve as rcu_read_lock_sched(), but |
Joel Fernandes (Google) | 090c168 | 2018-10-05 16:18:11 -0700 | [diff] [blame] | 70 | is less readable and prevents lockdep from detecting locking issues. |
Paul E. McKenney | dd81eca | 2005-09-10 00:26:24 -0700 | [diff] [blame] | 71 | |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 72 | Letting RCU-protected pointers "leak" out of an RCU read-side |
Paul Gortmaker | 9d3a048 | 2020-11-28 15:32:59 -0500 | [diff] [blame] | 73 | critical section is every bit as bad as letting them leak out |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 74 | from under a lock. Unless, of course, you have arranged some |
| 75 | other means of protection, such as a lock or a reference count |
| 76 | -before- letting them out of the RCU read-side critical section. |
| 77 | |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 78 | 3. Does the update code tolerate concurrent accesses? |
| 79 | |
| 80 | The whole point of RCU is to permit readers to run without |
| 81 | any locks or atomic operations. This means that readers will |
| 82 | be running while updates are in progress. There are a number |
| 83 | of ways to handle this concurrency, depending on the situation: |
| 84 | |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 85 | a. Use the RCU variants of the list and hlist update |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 86 | primitives to add, remove, and replace elements on |
| 87 | an RCU-protected list. Alternatively, use the other |
| 88 | RCU-protected data structures that have been added to |
| 89 | the Linux kernel. |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 90 | |
| 91 | This is almost always the best approach. |
| 92 | |
| 93 | b. Proceed as in (a) above, but also maintain per-element |
| 94 | locks (that are acquired by both readers and writers) |
| 95 | that guard per-element state. Of course, fields that |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 96 | the readers refrain from accessing can be guarded by |
| 97 | some other lock acquired only by updaters, if desired. |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 98 | |
| 99 | This works quite well, also. |
| 100 | |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 101 | c. Make updates appear atomic to readers. For example, |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 102 | pointer updates to properly aligned fields will |
| 103 | appear atomic, as will individual atomic primitives. |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 104 | Sequences of operations performed under a lock will -not- |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 105 | appear to be atomic to RCU readers, nor will sequences |
| 106 | of multiple atomic primitives. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 107 | |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 108 | This can work, but is starting to get a bit tricky. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 109 | |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 110 | d. Carefully order the updates and the reads so that |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 111 | readers see valid data at all phases of the update. |
| 112 | This is often more difficult than it sounds, especially |
| 113 | given modern CPUs' tendency to reorder memory references. |
| 114 | One must usually liberally sprinkle memory barriers |
| 115 | (smp_wmb(), smp_rmb(), smp_mb()) through the code, |
| 116 | making it difficult to understand and to test. |
| 117 | |
| 118 | It is usually better to group the changing data into |
| 119 | a separate structure, so that the change may be made |
| 120 | to appear atomic by updating a pointer to reference |
| 121 | a new structure containing updated values. |
| 122 | |
| 123 | 4. Weakly ordered CPUs pose special challenges. Almost all CPUs |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 124 | are weakly ordered -- even x86 CPUs allow later loads to be |
| 125 | reordered to precede earlier stores. RCU code must take all of |
| 126 | the following measures to prevent memory-corruption problems: |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 127 | |
| 128 | a. Readers must maintain proper ordering of their memory |
| 129 | accesses. The rcu_dereference() primitive ensures that |
| 130 | the CPU picks up the pointer before it picks up the data |
| 131 | that the pointer points to. This really is necessary |
Paul Gortmaker | 9d3a048 | 2020-11-28 15:32:59 -0500 | [diff] [blame] | 132 | on Alpha CPUs. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 133 | |
| 134 | The rcu_dereference() primitive is also an excellent |
Paul E. McKenney | b4c5bf3 | 2014-02-28 16:11:28 -0800 | [diff] [blame] | 135 | documentation aid, letting the person reading the |
| 136 | code know exactly which pointers are protected by RCU. |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 137 | Please note that compilers can also reorder code, and |
| 138 | they are becoming increasingly aggressive about doing |
Paul E. McKenney | b4c5bf3 | 2014-02-28 16:11:28 -0800 | [diff] [blame] | 139 | just that. The rcu_dereference() primitive therefore also |
| 140 | prevents destructive compiler optimizations. However, |
| 141 | with a bit of devious creativity, it is possible to |
| 142 | mishandle the return value from rcu_dereference(). |
| 143 | Please see rcu_dereference.txt in this directory for |
| 144 | more information. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 145 | |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 146 | The rcu_dereference() primitive is used by the |
| 147 | various "_rcu()" list-traversal primitives, such |
| 148 | as the list_for_each_entry_rcu(). Note that it is |
| 149 | perfectly legal (if redundant) for update-side code to |
| 150 | use rcu_dereference() and the "_rcu()" list-traversal |
| 151 | primitives. This is particularly useful in code that |
Paul E. McKenney | c598a07 | 2010-02-22 17:04:57 -0800 | [diff] [blame] | 152 | is common to readers and updaters. However, lockdep |
| 153 | will complain if you access rcu_dereference() outside |
| 154 | of an RCU read-side critical section. See lockdep.txt |
| 155 | to learn what to do about this. |
| 156 | |
| 157 | Of course, neither rcu_dereference() nor the "_rcu()" |
| 158 | list-traversal primitives can substitute for a good |
| 159 | concurrency design coordinating among multiple updaters. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 160 | |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 161 | b. If the list macros are being used, the list_add_tail_rcu() |
| 162 | and list_add_rcu() primitives must be used in order |
| 163 | to prevent weakly ordered machines from misordering |
| 164 | structure initialization and pointer planting. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 165 | Similarly, if the hlist macros are being used, the |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 166 | hlist_add_head_rcu() primitive is required. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 167 | |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 168 | c. If the list macros are being used, the list_del_rcu() |
| 169 | primitive must be used to keep list_del()'s pointer |
| 170 | poisoning from inflicting toxic effects on concurrent |
| 171 | readers. Similarly, if the hlist macros are being used, |
| 172 | the hlist_del_rcu() primitive is required. |
| 173 | |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 174 | The list_replace_rcu() and hlist_replace_rcu() primitives |
| 175 | may be used to replace an old structure with a new one |
| 176 | in their respective types of RCU-protected lists. |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 177 | |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 178 | d. Rules similar to (4b) and (4c) apply to the "hlist_nulls" |
| 179 | type of RCU-protected linked lists. |
| 180 | |
| 181 | e. Updates must ensure that initialization of a given |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 182 | structure happens before pointers to that structure are |
| 183 | publicized. Use the rcu_assign_pointer() primitive |
| 184 | when publicizing a pointer to a structure that can |
| 185 | be traversed by an RCU read-side critical section. |
| 186 | |
Paul E. McKenney | 4fea6ef | 2019-01-09 14:48:09 -0800 | [diff] [blame] | 187 | 5. If call_rcu() or call_srcu() is used, the callback function will |
| 188 | be called from softirq context. In particular, it cannot block. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 189 | |
Paul E. McKenney | 4fea6ef | 2019-01-09 14:48:09 -0800 | [diff] [blame] | 190 | 6. Since synchronize_rcu() can block, it cannot be called |
| 191 | from any sort of irq context. The same rule applies |
| 192 | for synchronize_srcu(), synchronize_rcu_expedited(), and |
| 193 | synchronize_srcu_expedited(). |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 194 | |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 195 | The expedited forms of these primitives have the same semantics |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 196 | as the non-expedited forms, but expediting is both expensive and |
| 197 | (with the exception of synchronize_srcu_expedited()) unfriendly |
| 198 | to real-time workloads. Use of the expedited primitives should |
| 199 | be restricted to rare configuration-change operations that would |
| 200 | not normally be undertaken while a real-time workload is running. |
| 201 | However, real-time workloads can use rcupdate.rcu_normal kernel |
| 202 | boot parameter to completely disable expedited grace periods, |
| 203 | though this might have performance implications. |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 204 | |
Paul E. McKenney | 236fefa | 2012-01-31 14:00:41 -0800 | [diff] [blame] | 205 | In particular, if you find yourself invoking one of the expedited |
| 206 | primitives repeatedly in a loop, please do everyone a favor: |
| 207 | Restructure your code so that it batches the updates, allowing |
| 208 | a single non-expedited primitive to cover the entire batch. |
| 209 | This will very likely be faster than the loop containing the |
| 210 | expedited primitive, and will be much much easier on the rest |
| 211 | of the system, especially to real-time workloads running on |
| 212 | the rest of the system. |
| 213 | |
Paul E. McKenney | 9a145c0 | 2021-06-24 18:05:52 +0200 | [diff] [blame] | 214 | 7. As of v4.20, a given kernel implements only one RCU flavor, which |
| 215 | is RCU-sched for PREEMPTION=n and RCU-preempt for PREEMPTION=y. |
| 216 | If the updater uses call_rcu() or synchronize_rcu(), then |
| 217 | the corresponding readers may use: (1) rcu_read_lock() and |
| 218 | rcu_read_unlock(), (2) any pair of primitives that disables |
| 219 | and re-enables softirq, for example, rcu_read_lock_bh() and |
| 220 | rcu_read_unlock_bh(), or (3) any pair of primitives that disables |
| 221 | and re-enables preemption, for example, rcu_read_lock_sched() and |
| 222 | rcu_read_unlock_sched(). If the updater uses synchronize_srcu() |
| 223 | or call_srcu(), then the corresponding readers must use |
| 224 | srcu_read_lock() and srcu_read_unlock(), and with the same |
| 225 | srcu_struct. The rules for the expedited RCU grace-period-wait |
| 226 | primitives are the same as for their non-expedited counterparts. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 227 | |
Paul E. McKenney | 9a145c0 | 2021-06-24 18:05:52 +0200 | [diff] [blame] | 228 | If the updater uses call_rcu_tasks() or synchronize_rcu_tasks(), |
| 229 | then the readers must refrain from executing voluntary |
| 230 | context switches, that is, from blocking. If the updater uses |
| 231 | call_rcu_tasks_trace() or synchronize_rcu_tasks_trace(), then |
| 232 | the corresponding readers must use rcu_read_lock_trace() and |
| 233 | rcu_read_unlock_trace(). If an updater uses call_rcu_tasks_rude() |
| 234 | or synchronize_rcu_tasks_rude(), then the corresponding readers |
| 235 | must use anything that disables interrupts. |
| 236 | |
| 237 | Mixing things up will result in confusion and broken kernels, and |
| 238 | has even resulted in an exploitable security issue. Therefore, |
Toke Høiland-Jørgensen | e74c74f | 2021-06-24 18:05:53 +0200 | [diff] [blame^] | 239 | when using non-obvious pairs of primitives, commenting is |
| 240 | of course a must. One example of non-obvious pairing is |
| 241 | the XDP feature in networking, which calls BPF programs from |
| 242 | network-driver NAPI (softirq) context. BPF relies heavily on RCU |
| 243 | protection for its data structures, but because the BPF program |
| 244 | invocation happens entirely within a single local_bh_disable() |
| 245 | section in a NAPI poll cycle, this usage is safe. The reason |
| 246 | that this usage is safe is that readers can use anything that |
| 247 | disables BH when updaters use call_rcu() or synchronize_rcu(). |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 248 | |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 249 | 8. Although synchronize_rcu() is slower than is call_rcu(), it |
Paul E. McKenney | 3f944ad | 2013-03-04 17:55:49 -0800 | [diff] [blame] | 250 | usually results in simpler code. So, unless update performance is |
| 251 | critically important, the updaters cannot block, or the latency of |
| 252 | synchronize_rcu() is visible from userspace, synchronize_rcu() |
| 253 | should be used in preference to call_rcu(). Furthermore, |
| 254 | kfree_rcu() usually results in even simpler code than does |
| 255 | synchronize_rcu() without synchronize_rcu()'s multi-millisecond |
| 256 | latency. So please take advantage of kfree_rcu()'s "fire and |
| 257 | forget" memory-freeing capabilities where it applies. |
Paul E. McKenney | 165d6c7 | 2006-06-25 05:48:44 -0700 | [diff] [blame] | 258 | |
| 259 | An especially important property of the synchronize_rcu() |
| 260 | primitive is that it automatically self-limits: if grace periods |
| 261 | are delayed for whatever reason, then the synchronize_rcu() |
| 262 | primitive will correspondingly delay updates. In contrast, |
| 263 | code using call_rcu() should explicitly limit update rate in |
| 264 | cases where grace periods are delayed, as failing to do so can |
| 265 | result in excessive realtime latencies or even OOM conditions. |
| 266 | |
| 267 | Ways of gaining this self-limiting property when using call_rcu() |
| 268 | include: |
| 269 | |
| 270 | a. Keeping a count of the number of data-structure elements |
Paul E. McKenney | 5cc6517 | 2010-08-13 16:34:22 -0700 | [diff] [blame] | 271 | used by the RCU-protected data structure, including |
| 272 | those waiting for a grace period to elapse. Enforce a |
| 273 | limit on this number, stalling updates as needed to allow |
| 274 | previously deferred frees to complete. Alternatively, |
| 275 | limit only the number awaiting deferred free rather than |
| 276 | the total number of elements. |
Paul E. McKenney | 165d6c7 | 2006-06-25 05:48:44 -0700 | [diff] [blame] | 277 | |
Paul E. McKenney | 5cc6517 | 2010-08-13 16:34:22 -0700 | [diff] [blame] | 278 | One way to stall the updates is to acquire the update-side |
| 279 | mutex. (Don't try this with a spinlock -- other CPUs |
| 280 | spinning on the lock could prevent the grace period |
| 281 | from ever ending.) Another way to stall the updates |
| 282 | is for the updates to use a wrapper function around |
| 283 | the memory allocator, so that this wrapper function |
| 284 | simulates OOM when there is too much memory awaiting an |
| 285 | RCU grace period. There are of course many other |
| 286 | variations on this theme. |
Paul E. McKenney | 165d6c7 | 2006-06-25 05:48:44 -0700 | [diff] [blame] | 287 | |
| 288 | b. Limiting update rate. For example, if updates occur only |
Paul E. McKenney | 6e67669 | 2013-12-05 14:56:54 -0800 | [diff] [blame] | 289 | once per hour, then no explicit rate limiting is |
| 290 | required, unless your system is already badly broken. |
| 291 | Older versions of the dcache subsystem take this approach, |
| 292 | guarding updates with a global lock, limiting their rate. |
Paul E. McKenney | 165d6c7 | 2006-06-25 05:48:44 -0700 | [diff] [blame] | 293 | |
| 294 | c. Trusted update -- if updates can only be done manually by |
| 295 | superuser or some other trusted user, then it might not |
| 296 | be necessary to automatically limit them. The theory |
| 297 | here is that superuser already has lots of ways to crash |
| 298 | the machine. |
| 299 | |
Joel Fernandes (Google) | bc2072c | 2018-10-05 16:18:12 -0700 | [diff] [blame] | 300 | d. Periodically invoke synchronize_rcu(), permitting a limited |
Paul E. McKenney | 165d6c7 | 2006-06-25 05:48:44 -0700 | [diff] [blame] | 301 | number of updates per grace period. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 302 | |
Paul E. McKenney | 4fea6ef | 2019-01-09 14:48:09 -0800 | [diff] [blame] | 303 | The same cautions apply to call_srcu() and kfree_rcu(). |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 304 | |
Paul E. McKenney | 6e67669 | 2013-12-05 14:56:54 -0800 | [diff] [blame] | 305 | Note that although these primitives do take action to avoid memory |
| 306 | exhaustion when any given CPU has too many callbacks, a determined |
| 307 | user could still exhaust memory. This is especially the case |
| 308 | if a system with a large number of CPUs has been configured to |
| 309 | offload all of its RCU callbacks onto a single CPU, or if the |
| 310 | system has relatively little free memory. |
| 311 | |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 312 | 9. All RCU list-traversal primitives, which include |
Paul E. McKenney | bb08f76 | 2012-10-20 12:33:37 -0700 | [diff] [blame] | 313 | rcu_dereference(), list_for_each_entry_rcu(), and |
| 314 | list_for_each_safe_rcu(), must be either within an RCU read-side |
| 315 | critical section or must be protected by appropriate update-side |
| 316 | locks. RCU read-side critical sections are delimited by |
| 317 | rcu_read_lock() and rcu_read_unlock(), or by similar primitives |
| 318 | such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which |
| 319 | case the matching rcu_dereference() primitive must be used in |
| 320 | order to keep lockdep happy, in this case, rcu_dereference_bh(). |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 321 | |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 322 | The reason that it is permissible to use RCU list-traversal |
| 323 | primitives when the update-side lock is held is that doing so |
| 324 | can be quite helpful in reducing code bloat when common code is |
Paul E. McKenney | 50aec00 | 2010-04-09 15:39:12 -0700 | [diff] [blame] | 325 | shared between readers and updaters. Additional primitives |
| 326 | are provided for this case, as discussed in lockdep.txt. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 327 | |
Paul E. McKenney | 86b5a73 | 2020-09-24 20:53:25 -0700 | [diff] [blame] | 328 | One exception to this rule is when data is only ever added to |
| 329 | the linked data structure, and is never removed during any |
| 330 | time that readers might be accessing that structure. In such |
| 331 | cases, READ_ONCE() may be used in place of rcu_dereference() |
| 332 | and the read-side markers (rcu_read_lock() and rcu_read_unlock(), |
| 333 | for example) may be omitted. |
| 334 | |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 335 | 10. Conversely, if you are in an RCU read-side critical section, |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 336 | and you don't hold the appropriate update-side lock, you -must- |
| 337 | use the "_rcu()" variants of the list macros. Failing to do so |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 338 | will break Alpha, cause aggressive compilers to generate bad code, |
| 339 | and confuse people trying to read your code. |
Paul E. McKenney | a83f1fe | 2005-05-01 08:59:05 -0700 | [diff] [blame] | 340 | |
Joel Fernandes (Google) | e060a03 | 2018-10-05 16:18:13 -0700 | [diff] [blame] | 341 | 11. Any lock acquired by an RCU callback must be acquired elsewhere |
Paul E. McKenney | 240ebbf | 2009-06-25 09:08:18 -0700 | [diff] [blame] | 342 | with softirq disabled, e.g., via spin_lock_irqsave(), |
Paul E. McKenney | 884b429 | 2019-03-06 11:24:35 -0800 | [diff] [blame] | 343 | spin_lock_bh(), etc. Failing to disable softirq on a given |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 344 | acquisition of that lock will result in deadlock as soon as |
| 345 | the RCU softirq handler happens to run your RCU callback while |
| 346 | interrupting that acquisition's critical section. |
Paul E. McKenney | 621934e | 2006-10-04 02:17:02 -0700 | [diff] [blame] | 347 | |
Joel Fernandes (Google) | e060a03 | 2018-10-05 16:18:13 -0700 | [diff] [blame] | 348 | 12. RCU callbacks can be and are executed in parallel. In many cases, |
Paul E. McKenney | ef48bd2 | 2007-07-15 23:41:03 -0700 | [diff] [blame] | 349 | the callback code simply wrappers around kfree(), so that this |
| 350 | is not an issue (or, more accurately, to the extent that it is |
| 351 | an issue, the memory-allocator locking handles it). However, |
| 352 | if the callbacks do manipulate a shared data structure, they |
| 353 | must use whatever locking or other synchronization is required |
| 354 | to safely access and/or modify that data structure. |
| 355 | |
Paul E. McKenney | 884b429 | 2019-03-06 11:24:35 -0800 | [diff] [blame] | 356 | Do not assume that RCU callbacks will be executed on the same |
| 357 | CPU that executed the corresponding call_rcu() or call_srcu(). |
| 358 | For example, if a given CPU goes offline while having an RCU |
| 359 | callback pending, then that RCU callback will execute on some |
| 360 | surviving CPU. (If this was not the case, a self-spawning RCU |
| 361 | callback would prevent the victim CPU from ever going offline.) |
| 362 | Furthermore, CPUs designated by rcu_nocbs= might well -always- |
| 363 | have their RCU callbacks executed on some other CPUs, in fact, |
| 364 | for some real-time workloads, this is the whole point of using |
| 365 | the rcu_nocbs= kernel boot parameter. |
Paul E. McKenney | 3230075 | 2008-05-12 21:21:05 +0200 | [diff] [blame] | 366 | |
Joel Fernandes (Google) | e060a03 | 2018-10-05 16:18:13 -0700 | [diff] [blame] | 367 | 13. Unlike other forms of RCU, it -is- permissible to block in an |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 368 | SRCU read-side critical section (demarked by srcu_read_lock() |
| 369 | and srcu_read_unlock()), hence the "SRCU": "sleepable RCU". |
| 370 | Please note that if you don't need to sleep in read-side critical |
| 371 | sections, you should be using RCU rather than SRCU, because RCU |
| 372 | is almost always faster and easier to use than is SRCU. |
Paul E. McKenney | 621934e | 2006-10-04 02:17:02 -0700 | [diff] [blame] | 373 | |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 374 | Also unlike other forms of RCU, explicit initialization and |
| 375 | cleanup is required either at build time via DEFINE_SRCU() |
| 376 | or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct() |
| 377 | and cleanup_srcu_struct(). These last two are passed a |
| 378 | "struct srcu_struct" that defines the scope of a given |
| 379 | SRCU domain. Once initialized, the srcu_struct is passed |
| 380 | to srcu_read_lock(), srcu_read_unlock() synchronize_srcu(), |
| 381 | synchronize_srcu_expedited(), and call_srcu(). A given |
| 382 | synchronize_srcu() waits only for SRCU read-side critical |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 383 | sections governed by srcu_read_lock() and srcu_read_unlock() |
| 384 | calls that have been passed the same srcu_struct. This property |
| 385 | is what makes sleeping read-side critical sections tolerable -- |
| 386 | a given subsystem delays only its own updates, not those of other |
| 387 | subsystems using SRCU. Therefore, SRCU is less prone to OOM the |
| 388 | system than RCU would be if RCU's read-side critical sections |
| 389 | were permitted to sleep. |
Paul E. McKenney | 621934e | 2006-10-04 02:17:02 -0700 | [diff] [blame] | 390 | |
| 391 | The ability to sleep in read-side critical sections does not |
| 392 | come for free. First, corresponding srcu_read_lock() and |
| 393 | srcu_read_unlock() calls must be passed the same srcu_struct. |
| 394 | Second, grace-period-detection overhead is amortized only |
| 395 | over those updates sharing a given srcu_struct, rather than |
| 396 | being globally amortized as they are for other forms of RCU. |
| 397 | Therefore, SRCU should be used in preference to rw_semaphore |
| 398 | only in extremely read-intensive situations, or in situations |
| 399 | requiring SRCU's read-side deadlock immunity or low read-side |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 400 | realtime latency. You should also consider percpu_rw_semaphore |
| 401 | when you need lightweight readers. |
Paul E. McKenney | 621934e | 2006-10-04 02:17:02 -0700 | [diff] [blame] | 402 | |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 403 | SRCU's expedited primitive (synchronize_srcu_expedited()) |
| 404 | never sends IPIs to other CPUs, so it is easier on |
Paul E. McKenney | 4fea6ef | 2019-01-09 14:48:09 -0800 | [diff] [blame] | 405 | real-time workloads than is synchronize_rcu_expedited(). |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 406 | |
Paul E. McKenney | 884b429 | 2019-03-06 11:24:35 -0800 | [diff] [blame] | 407 | Note that rcu_assign_pointer() relates to SRCU just as it does to |
| 408 | other forms of RCU, but instead of rcu_dereference() you should |
| 409 | use srcu_dereference() in order to avoid lockdep splats. |
Paul E. McKenney | 0612ea0 | 2009-03-10 12:55:57 -0700 | [diff] [blame] | 410 | |
Joel Fernandes (Google) | e060a03 | 2018-10-05 16:18:13 -0700 | [diff] [blame] | 411 | 14. The whole point of call_rcu(), synchronize_rcu(), and friends |
Paul E. McKenney | 0612ea0 | 2009-03-10 12:55:57 -0700 | [diff] [blame] | 412 | is to wait until all pre-existing readers have finished before |
| 413 | carrying out some otherwise-destructive operation. It is |
| 414 | therefore critically important to -first- remove any path |
| 415 | that readers can follow that could be affected by the |
| 416 | destructive operation, and -only- -then- invoke call_rcu(), |
| 417 | synchronize_rcu(), or friends. |
| 418 | |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 419 | Because these primitives only wait for pre-existing readers, it |
| 420 | is the caller's responsibility to guarantee that any subsequent |
| 421 | readers will execute safely. |
Paul E. McKenney | 240ebbf | 2009-06-25 09:08:18 -0700 | [diff] [blame] | 422 | |
Joel Fernandes (Google) | e060a03 | 2018-10-05 16:18:13 -0700 | [diff] [blame] | 423 | 15. The various RCU read-side primitives do -not- necessarily contain |
Paul E. McKenney | 4c54005 | 2010-01-14 16:10:57 -0800 | [diff] [blame] | 424 | memory barriers. You should therefore plan for the CPU |
| 425 | and the compiler to freely reorder code into and out of RCU |
| 426 | read-side critical sections. It is the responsibility of the |
| 427 | RCU update-side primitives to deal with this. |
Paul E. McKenney | 84483ea | 2010-06-16 16:48:13 -0700 | [diff] [blame] | 428 | |
Paul E. McKenney | 884b429 | 2019-03-06 11:24:35 -0800 | [diff] [blame] | 429 | For SRCU readers, you can use smp_mb__after_srcu_read_unlock() |
| 430 | immediately after an srcu_read_unlock() to get a full barrier. |
| 431 | |
Joel Fernandes (Google) | e060a03 | 2018-10-05 16:18:13 -0700 | [diff] [blame] | 432 | 16. Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the |
Paul E. McKenney | 41a2901 | 2017-05-12 15:56:35 -0700 | [diff] [blame] | 433 | __rcu sparse checks to validate your RCU code. These can help |
| 434 | find problems as follows: |
Paul E. McKenney | 84483ea | 2010-06-16 16:48:13 -0700 | [diff] [blame] | 435 | |
Mauro Carvalho Chehab | 6b05dfa | 2020-04-21 19:04:02 +0200 | [diff] [blame] | 436 | CONFIG_PROVE_LOCKING: |
| 437 | check that accesses to RCU-protected data |
Paul E. McKenney | 84483ea | 2010-06-16 16:48:13 -0700 | [diff] [blame] | 438 | structures are carried out under the proper RCU |
| 439 | read-side critical section, while holding the right |
| 440 | combination of locks, or whatever other conditions |
| 441 | are appropriate. |
| 442 | |
Mauro Carvalho Chehab | 6b05dfa | 2020-04-21 19:04:02 +0200 | [diff] [blame] | 443 | CONFIG_DEBUG_OBJECTS_RCU_HEAD: |
| 444 | check that you don't pass the |
Paul E. McKenney | 84483ea | 2010-06-16 16:48:13 -0700 | [diff] [blame] | 445 | same object to call_rcu() (or friends) before an RCU |
| 446 | grace period has elapsed since the last time that you |
| 447 | passed that same object to call_rcu() (or friends). |
| 448 | |
Mauro Carvalho Chehab | 6b05dfa | 2020-04-21 19:04:02 +0200 | [diff] [blame] | 449 | __rcu sparse checks: |
| 450 | tag the pointer to the RCU-protected data |
Paul E. McKenney | 84483ea | 2010-06-16 16:48:13 -0700 | [diff] [blame] | 451 | structure with __rcu, and sparse will warn you if you |
| 452 | access that pointer without the services of one of the |
| 453 | variants of rcu_dereference(). |
| 454 | |
| 455 | These debugging aids can help you find problems that are |
| 456 | otherwise extremely difficult to spot. |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 457 | |
Paul E. McKenney | 884b429 | 2019-03-06 11:24:35 -0800 | [diff] [blame] | 458 | 17. If you register a callback using call_rcu() or call_srcu(), and |
| 459 | pass in a function defined within a loadable module, then it in |
| 460 | necessary to wait for all pending callbacks to be invoked after |
| 461 | the last invocation and before unloading that module. Note that |
| 462 | it is absolutely -not- sufficient to wait for a grace period! |
| 463 | The current (say) synchronize_rcu() implementation is -not- |
Paul E. McKenney | 4fea6ef | 2019-01-09 14:48:09 -0800 | [diff] [blame] | 464 | guaranteed to wait for callbacks registered on other CPUs. |
Paul E. McKenney | 884b429 | 2019-03-06 11:24:35 -0800 | [diff] [blame] | 465 | Or even on the current CPU if that CPU recently went offline |
| 466 | and came back online. |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 467 | |
| 468 | You instead need to use one of the barrier functions: |
| 469 | |
Mauro Carvalho Chehab | 6b05dfa | 2020-04-21 19:04:02 +0200 | [diff] [blame] | 470 | - call_rcu() -> rcu_barrier() |
| 471 | - call_srcu() -> srcu_barrier() |
Paul E. McKenney | 4de5f89 | 2017-06-06 15:04:03 -0700 | [diff] [blame] | 472 | |
| 473 | However, these barrier functions are absolutely -not- guaranteed |
| 474 | to wait for a grace period. In fact, if there are no call_rcu() |
| 475 | callbacks waiting anywhere in the system, rcu_barrier() is within |
| 476 | its rights to return immediately. |
| 477 | |
| 478 | So if you need to wait for both an RCU grace period and for |
| 479 | all pre-existing call_rcu() callbacks, you will need to execute |
| 480 | both rcu_barrier() and synchronize_rcu(), if necessary, using |
| 481 | something like workqueues to to execute them concurrently. |
| 482 | |
| 483 | See rcubarrier.txt for more information. |