blob: f1744161ef0694cef8dab811d25e0efea9c1ea7a [file] [log] [blame]
Linus Torvalds1da177e2005-04-16 15:20:36 -07001 Semantics and Behavior of Atomic and
2 Bitmask Operations
3
4 David S. Miller
5
6 This document is intended to serve as a guide to Linux port
7maintainers on how to implement atomic counter, bitops, and spinlock
8interfaces properly.
9
10 The atomic_t type should be defined as a signed integer.
11Also, it should be made opaque such that any kind of cast to a normal
12C integer type will fail. Something like the following should
13suffice:
14
15 typedef struct { volatile int counter; } atomic_t;
16
17 The first operations to implement for atomic_t's are the
18initializers and plain reads.
19
20 #define ATOMIC_INIT(i) { (i) }
21 #define atomic_set(v, i) ((v)->counter = (i))
22
23The first macro is used in definitions, such as:
24
25static atomic_t my_counter = ATOMIC_INIT(1);
26
27The second interface can be used at runtime, as in:
28
29 struct foo { atomic_t counter; };
30 ...
31
32 struct foo *k;
33
34 k = kmalloc(sizeof(*k), GFP_KERNEL);
35 if (!k)
36 return -ENOMEM;
37 atomic_set(&k->counter, 0);
38
39Next, we have:
40
41 #define atomic_read(v) ((v)->counter)
42
43which simply reads the current value of the counter.
44
45Now, we move onto the actual atomic operation interfaces.
46
47 void atomic_add(int i, atomic_t *v);
48 void atomic_sub(int i, atomic_t *v);
49 void atomic_inc(atomic_t *v);
50 void atomic_dec(atomic_t *v);
51
52These four routines add and subtract integral values to/from the given
53atomic_t value. The first two routines pass explicit integers by
54which to make the adjustment, whereas the latter two use an implicit
55adjustment value of "1".
56
57One very important aspect of these two routines is that they DO NOT
58require any explicit memory barriers. They need only perform the
59atomic_t counter update in an SMP safe manner.
60
61Next, we have:
62
63 int atomic_inc_return(atomic_t *v);
64 int atomic_dec_return(atomic_t *v);
65
66These routines add 1 and subtract 1, respectively, from the given
67atomic_t and return the new counter value after the operation is
68performed.
69
70Unlike the above routines, it is required that explicit memory
71barriers are performed before and after the operation. It must be
72done such that all memory operations before and after the atomic
73operation calls are strongly ordered with respect to the atomic
74operation itself.
75
76For example, it should behave as if a smp_mb() call existed both
77before and after the atomic operation.
78
79If the atomic instructions used in an implementation provide explicit
80memory barrier semantics which satisfy the above requirements, that is
81fine as well.
82
83Let's move on:
84
85 int atomic_add_return(int i, atomic_t *v);
86 int atomic_sub_return(int i, atomic_t *v);
87
88These behave just like atomic_{inc,dec}_return() except that an
89explicit counter adjustment is given instead of the implicit "1".
90This means that like atomic_{inc,dec}_return(), the memory barrier
91semantics are required.
92
93Next:
94
95 int atomic_inc_and_test(atomic_t *v);
96 int atomic_dec_and_test(atomic_t *v);
97
98These two routines increment and decrement by 1, respectively, the
99given atomic counter. They return a boolean indicating whether the
100resulting counter value was zero or not.
101
102It requires explicit memory barrier semantics around the operation as
103above.
104
105 int atomic_sub_and_test(int i, atomic_t *v);
106
107This is identical to atomic_dec_and_test() except that an explicit
108decrement is given instead of the implicit "1". It requires explicit
109memory barrier semantics around the operation.
110
111 int atomic_add_negative(int i, atomic_t *v);
112
113The given increment is added to the given atomic counter value. A
114boolean is return which indicates whether the resulting counter value
115is negative. It requires explicit memory barrier semantics around the
116operation.
117
Nick Piggin4a6dae62005-11-13 16:07:24 -0800118Finally:
119
120 int atomic_cmpxchg(atomic_t *v, int old, int new);
121
122This performs an atomic compare exchange operation on the atomic value v,
123with the given old and new values. Like all atomic_xxx operations,
124atomic_cmpxchg will only satisfy its atomicity semantics as long as all
125other accesses of *v are performed through atomic_xxx operations.
126
127atomic_cmpxchg requires explicit memory barriers around the operation.
128
129The semantics for atomic_cmpxchg are the same as those defined for 'cas'
130below.
131
132
Linus Torvalds1da177e2005-04-16 15:20:36 -0700133If a caller requires memory barrier semantics around an atomic_t
134operation which does not return a value, a set of interfaces are
135defined which accomplish this:
136
137 void smp_mb__before_atomic_dec(void);
138 void smp_mb__after_atomic_dec(void);
139 void smp_mb__before_atomic_inc(void);
140 void smp_mb__after_atomic_dec(void);
141
142For example, smp_mb__before_atomic_dec() can be used like so:
143
144 obj->dead = 1;
145 smp_mb__before_atomic_dec();
146 atomic_dec(&obj->ref_count);
147
148It makes sure that all memory operations preceeding the atomic_dec()
149call are strongly ordered with respect to the atomic counter
150operation. In the above example, it guarentees that the assignment of
151"1" to obj->dead will be globally visible to other cpus before the
152atomic counter decrement.
153
154Without the explicitl smp_mb__before_atomic_dec() call, the
155implementation could legally allow the atomic counter update visible
156to other cpus before the "obj->dead = 1;" assignment.
157
158The other three interfaces listed are used to provide explicit
159ordering with respect to memory operations after an atomic_dec() call
160(smp_mb__after_atomic_dec()) and around atomic_inc() calls
161(smp_mb__{before,after}_atomic_inc()).
162
163A missing memory barrier in the cases where they are required by the
164atomic_t implementation above can have disasterous results. Here is
165an example, which follows a pattern occuring frequently in the Linux
166kernel. It is the use of atomic counters to implement reference
167counting, and it works such that once the counter falls to zero it can
168be guarenteed that no other entity can be accessing the object:
169
170static void obj_list_add(struct obj *obj)
171{
172 obj->active = 1;
173 list_add(&obj->list);
174}
175
176static void obj_list_del(struct obj *obj)
177{
178 list_del(&obj->list);
179 obj->active = 0;
180}
181
182static void obj_destroy(struct obj *obj)
183{
184 BUG_ON(obj->active);
185 kfree(obj);
186}
187
188struct obj *obj_list_peek(struct list_head *head)
189{
190 if (!list_empty(head)) {
191 struct obj *obj;
192
193 obj = list_entry(head->next, struct obj, list);
194 atomic_inc(&obj->refcnt);
195 return obj;
196 }
197 return NULL;
198}
199
200void obj_poke(void)
201{
202 struct obj *obj;
203
204 spin_lock(&global_list_lock);
205 obj = obj_list_peek(&global_list);
206 spin_unlock(&global_list_lock);
207
208 if (obj) {
209 obj->ops->poke(obj);
210 if (atomic_dec_and_test(&obj->refcnt))
211 obj_destroy(obj);
212 }
213}
214
215void obj_timeout(struct obj *obj)
216{
217 spin_lock(&global_list_lock);
218 obj_list_del(obj);
219 spin_unlock(&global_list_lock);
220
221 if (atomic_dec_and_test(&obj->refcnt))
222 obj_destroy(obj);
223}
224
225(This is a simplification of the ARP queue management in the
226 generic neighbour discover code of the networking. Olaf Kirch
227 found a bug wrt. memory barriers in kfree_skb() that exposed
228 the atomic_t memory barrier requirements quite clearly.)
229
230Given the above scheme, it must be the case that the obj->active
231update done by the obj list deletion be visible to other processors
232before the atomic counter decrement is performed.
233
234Otherwise, the counter could fall to zero, yet obj->active would still
235be set, thus triggering the assertion in obj_destroy(). The error
236sequence looks like this:
237
238 cpu 0 cpu 1
239 obj_poke() obj_timeout()
240 obj = obj_list_peek();
241 ... gains ref to obj, refcnt=2
242 obj_list_del(obj);
243 obj->active = 0 ...
244 ... visibility delayed ...
245 atomic_dec_and_test()
246 ... refcnt drops to 1 ...
247 atomic_dec_and_test()
248 ... refcount drops to 0 ...
249 obj_destroy()
250 BUG() triggers since obj->active
251 still seen as one
252 obj->active update visibility occurs
253
254With the memory barrier semantics required of the atomic_t operations
255which return values, the above sequence of memory visibility can never
256happen. Specifically, in the above case the atomic_dec_and_test()
257counter decrement would not become globally visible until the
258obj->active update does.
259
260As a historical note, 32-bit Sparc used to only allow usage of
26124-bits of it's atomic_t type. This was because it used 8 bits
262as a spinlock for SMP safety. Sparc32 lacked a "compare and swap"
263type instruction. However, 32-bit Sparc has since been moved over
264to a "hash table of spinlocks" scheme, that allows the full 32-bit
265counter to be realized. Essentially, an array of spinlocks are
266indexed into based upon the address of the atomic_t being operated
267on, and that lock protects the atomic operation. Parisc uses the
268same scheme.
269
270Another note is that the atomic_t operations returning values are
271extremely slow on an old 386.
272
273We will now cover the atomic bitmask operations. You will find that
274their SMP and memory barrier semantics are similar in shape and scope
275to the atomic_t ops above.
276
277Native atomic bit operations are defined to operate on objects aligned
278to the size of an "unsigned long" C data type, and are least of that
279size. The endianness of the bits within each "unsigned long" are the
280native endianness of the cpu.
281
282 void set_bit(unsigned long nr, volatils unsigned long *addr);
283 void clear_bit(unsigned long nr, volatils unsigned long *addr);
284 void change_bit(unsigned long nr, volatils unsigned long *addr);
285
286These routines set, clear, and change, respectively, the bit number
287indicated by "nr" on the bit mask pointed to by "ADDR".
288
289They must execute atomically, yet there are no implicit memory barrier
290semantics required of these interfaces.
291
292 int test_and_set_bit(unsigned long nr, volatils unsigned long *addr);
293 int test_and_clear_bit(unsigned long nr, volatils unsigned long *addr);
294 int test_and_change_bit(unsigned long nr, volatils unsigned long *addr);
295
296Like the above, except that these routines return a boolean which
297indicates whether the changed bit was set _BEFORE_ the atomic bit
298operation.
299
300WARNING! It is incredibly important that the value be a boolean,
301ie. "0" or "1". Do not try to be fancy and save a few instructions by
302declaring the above to return "long" and just returning something like
303"old_val & mask" because that will not work.
304
305For one thing, this return value gets truncated to int in many code
306paths using these interfaces, so on 64-bit if the bit is set in the
307upper 32-bits then testers will never see that.
308
309One great example of where this problem crops up are the thread_info
310flag operations. Routines such as test_and_set_ti_thread_flag() chop
311the return value into an int. There are other places where things
312like this occur as well.
313
314These routines, like the atomic_t counter operations returning values,
315require explicit memory barrier semantics around their execution. All
316memory operations before the atomic bit operation call must be made
317visible globally before the atomic bit operation is made visible.
318Likewise, the atomic bit operation must be visible globally before any
319subsequent memory operation is made visible. For example:
320
321 obj->dead = 1;
322 if (test_and_set_bit(0, &obj->flags))
323 /* ... */;
324 obj->killed = 1;
325
326The implementation of test_and_set_bit() must guarentee that
327"obj->dead = 1;" is visible to cpus before the atomic memory operation
328done by test_and_set_bit() becomes visible. Likewise, the atomic
329memory operation done by test_and_set_bit() must become visible before
330"obj->killed = 1;" is visible.
331
332Finally there is the basic operation:
333
334 int test_bit(unsigned long nr, __const__ volatile unsigned long *addr);
335
336Which returns a boolean indicating if bit "nr" is set in the bitmask
337pointed to by "addr".
338
339If explicit memory barriers are required around clear_bit() (which
340does not return a value, and thus does not need to provide memory
341barrier semantics), two interfaces are provided:
342
343 void smp_mb__before_clear_bit(void);
344 void smp_mb__after_clear_bit(void);
345
346They are used as follows, and are akin to their atomic_t operation
347brothers:
348
349 /* All memory operations before this call will
350 * be globally visible before the clear_bit().
351 */
352 smp_mb__before_clear_bit();
353 clear_bit( ... );
354
355 /* The clear_bit() will be visible before all
356 * subsequent memory operations.
357 */
358 smp_mb__after_clear_bit();
359
360Finally, there are non-atomic versions of the bitmask operations
361provided. They are used in contexts where some other higher-level SMP
362locking scheme is being used to protect the bitmask, and thus less
363expensive non-atomic operations may be used in the implementation.
364They have names similar to the above bitmask operation interfaces,
365except that two underscores are prefixed to the interface name.
366
367 void __set_bit(unsigned long nr, volatile unsigned long *addr);
368 void __clear_bit(unsigned long nr, volatile unsigned long *addr);
369 void __change_bit(unsigned long nr, volatile unsigned long *addr);
370 int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
371 int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
372 int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr);
373
374These non-atomic variants also do not require any special memory
375barrier semantics.
376
377The routines xchg() and cmpxchg() need the same exact memory barriers
378as the atomic and bit operations returning values.
379
380Spinlocks and rwlocks have memory barrier expectations as well.
381The rule to follow is simple:
382
3831) When acquiring a lock, the implementation must make it globally
384 visible before any subsequent memory operation.
385
3862) When releasing a lock, the implementation must make it such that
387 all previous memory operations are globally visible before the
388 lock release.
389
390Which finally brings us to _atomic_dec_and_lock(). There is an
391architecture-neutral version implemented in lib/dec_and_lock.c,
392but most platforms will wish to optimize this in assembler.
393
394 int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock);
395
396Atomically decrement the given counter, and if will drop to zero
397atomically acquire the given spinlock and perform the decrement
398of the counter to zero. If it does not drop to zero, do nothing
399with the spinlock.
400
401It is actually pretty simple to get the memory barrier correct.
402Simply satisfy the spinlock grab requirements, which is make
403sure the spinlock operation is globally visible before any
404subsequent memory operation.
405
406We can demonstrate this operation more clearly if we define
407an abstract atomic operation:
408
409 long cas(long *mem, long old, long new);
410
411"cas" stands for "compare and swap". It atomically:
412
4131) Compares "old" with the value currently at "mem".
4142) If they are equal, "new" is written to "mem".
4153) Regardless, the current value at "mem" is returned.
416
417As an example usage, here is what an atomic counter update
418might look like:
419
420void example_atomic_inc(long *counter)
421{
422 long old, new, ret;
423
424 while (1) {
425 old = *counter;
426 new = old + 1;
427
428 ret = cas(counter, old, new);
429 if (ret == old)
430 break;
431 }
432}
433
434Let's use cas() in order to build a pseudo-C atomic_dec_and_lock():
435
436int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
437{
438 long old, new, ret;
439 int went_to_zero;
440
441 went_to_zero = 0;
442 while (1) {
443 old = atomic_read(atomic);
444 new = old - 1;
445 if (new == 0) {
446 went_to_zero = 1;
447 spin_lock(lock);
448 }
449 ret = cas(atomic, old, new);
450 if (ret == old)
451 break;
452 if (went_to_zero) {
453 spin_unlock(lock);
454 went_to_zero = 0;
455 }
456 }
457
458 return went_to_zero;
459}
460
461Now, as far as memory barriers go, as long as spin_lock()
462strictly orders all subsequent memory operations (including
463the cas()) with respect to itself, things will be fine.
464
465Said another way, _atomic_dec_and_lock() must guarentee that
466a counter dropping to zero is never made visible before the
467spinlock being acquired.
468
469Note that this also means that for the case where the counter
470is not dropping to zero, there are no memory ordering
471requirements.