Blame - Documentation/atomic_t.txt - SHIFTPHONES/mainline/linux

Peter Zijlstra	706eeb3	2017-06-12 14:50:27 +0200	[diff] [blame]	1
On atomic types (atomic_t atomic64_t and atomic_long_t).

The atomic type provides an interface to the architecture's means of atomic
RMW operations between CPUs (atomic operations on MMIO are not supported and
can lead to fatal traps on some platforms).

API
---

The 'full' API consists of (atomic64_ and atomic_long_ prefixes omitted for
brevity):

Non-RMW ops:

  atomic_read(), atomic_set()
  atomic_read_acquire(), atomic_set_release()


RMW atomic operations:

Arithmetic:

  atomic_{add,sub,inc,dec}()
  atomic_{add,sub,inc,dec}_return{,_relaxed,_acquire,_release}()
  atomic_fetch_{add,sub,inc,dec}{,_relaxed,_acquire,_release}()


Bitwise:

  atomic_{and,or,xor,andnot}()
  atomic_fetch_{and,or,xor,andnot}{,_relaxed,_acquire,_release}()


Swap:

  atomic_xchg{,_relaxed,_acquire,_release}()
  atomic_cmpxchg{,_relaxed,_acquire,_release}()
  atomic_try_cmpxchg{,_relaxed,_acquire,_release}()


Reference count (but please see refcount_t):

  atomic_add_unless(), atomic_inc_not_zero()
  atomic_sub_and_test(), atomic_dec_and_test()


Misc:

  atomic_inc_and_test(), atomic_add_negative()
  atomic_dec_unless_positive(), atomic_inc_unless_negative()


Barriers:

  smp_mb__{before,after}_atomic()


TYPES (signed vs unsigned)
-----

While atomic_t, atomic_long_t and atomic64_t use int, long and s64
respectively (for hysterical raisins), the kernel uses -fno-strict-overflow
(which implies -fwrapv) and defines signed overflow to behave like
2s-complement.

Therefore, an explicitly unsigned variant of the atomic ops is strictly
unnecessary; we can simply cast, and there is no UB.

There was a bug in UBSAN prior to GCC-8 that would generate UB warnings for
signed types.

With this we also conform to the C/C++ _Atomic behaviour and things like
P1236R1.
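
As a rough userspace illustration of the "simply cast" point, here is a
minimal sketch using C11 <stdatomic.h> rather than the kernel API. C11 atomic
arithmetic on signed types is defined to wrap like 2s-complement, which is
the behaviour the kernel obtains via -fwrapv. The helper name
counter_add_as_unsigned() is hypothetical and not part of any kernel
interface.

```c
#include <limits.h>
#include <stdatomic.h>

/*
 * Hypothetical helper: add 'i' to a signed atomic counter and return the
 * new value viewed as unsigned.
 */
static unsigned int counter_add_as_unsigned(atomic_int *v, int i)
{
  /* atomic_fetch_add() returns the old value; signed wrap is defined. */
  int old = atomic_fetch_add(v, i);

  /* Unsigned addition wraps modulo 2^N, matching the stored result. */
  return (unsigned int)old + (unsigned int)i;
}
```

Incrementing a counter that holds INT_MAX leaves INT_MIN in the variable, and
the same bits read as unsigned are INT_MAX + 1.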


SEMANTICS
---------

Non-RMW ops:

The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
smp_store_release() respectively. Therefore, if you find yourself only using
the Non-RMW operations of atomic_t, you do not in fact need atomic_t at all
and are doing it wrong.

A note for the implementation of atomic_set{}() is that it must not break the
atomicity of the RMW ops. That is:

  C Atomic-RMW-ops-are-atomic-WRT-atomic_set

  {
    atomic_t v = ATOMIC_INIT(1);
  }

  P0(atomic_t *v)
  {
    (void)atomic_add_unless(v, 1, 0);
  }

  P1(atomic_t *v)
  {
    atomic_set(v, 0);
  }

  exists
  (v=2)

In this case we would expect the atomic_set() from CPU1 to either happen
before the atomic_add_unless(), in which case that latter one would no-op, or
_after_ in which case we'd overwrite its result. In no case is "2" a valid
outcome.

This is typically true on 'normal' platforms, where a regular competing STORE
will invalidate a LL/SC or fail a CMPXCHG.

The obvious case where this is not so is when we need to implement atomic ops
with a lock:

  CPU0                                          CPU1

  atomic_add_unless(v, 1, 0);
    lock();
    ret = READ_ONCE(v->counter); // == 1
                                                atomic_set(v, 0);
    if (ret != u)                                 WRITE_ONCE(v->counter, 0);
      WRITE_ONCE(v->counter, ret + 1);
    unlock();

the typical solution is to then implement atomic_set{}() with atomic_xchg().
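
The lock-based case can be sketched in userspace as follows. This is only an
illustration under stated assumptions: struct locked_atomic and all la_*()
names are hypothetical, the spinlock is a plain C11 atomic_flag, and none of
this is the kernel's actual fallback code. The point it shows is that
la_set() goes through la_xchg(), and therefore takes the same lock as the RMW
ops instead of issuing a bare store.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock-based atomic for a platform without native RMW ops. */
struct locked_atomic {
  atomic_flag lock;   /* simple test-and-set spinlock */
  int counter;        /* only accessed with the lock held */
};

static void la_lock(struct locked_atomic *v)
{
  while (atomic_flag_test_and_set_explicit(&v->lock, memory_order_acquire))
    ;                 /* spin until the flag was previously clear */
}

static void la_unlock(struct locked_atomic *v)
{
  atomic_flag_clear_explicit(&v->lock, memory_order_release);
}

static void la_init(struct locked_atomic *v, int i)
{
  atomic_flag_clear(&v->lock);
  v->counter = i;
}

static int la_xchg(struct locked_atomic *v, int new_val)
{
  int old;

  la_lock(v);
  old = v->counter;
  v->counter = new_val;
  la_unlock(v);
  return old;
}

/* Built on xchg, so it cannot land in the middle of another RMW op. */
static void la_set(struct locked_atomic *v, int i)
{
  (void)la_xchg(v, i);
}

static int la_read(struct locked_atomic *v)
{
  int val;

  la_lock(v);
  val = v->counter;
  la_unlock(v);
  return val;
}

/* Add 'a' unless the counter equals 'u'; returns true if the add happened. */
static bool la_add_unless(struct locked_atomic *v, int a, int u)
{
  bool done = false;

  la_lock(v);
  if (v->counter != u) {
    v->counter += a;
    done = true;
  }
  la_unlock(v);
  return done;
}
```

With this arrangement a concurrent la_set(v, 0) either runs entirely before
la_add_unless(v, 1, 0), which then does nothing, or entirely after it, so the
final value is 0 in both cases and never 2.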


RMW ops:

These come in various forms:

 - plain operations without return value: atomic_{}()

 - operations which return the modified value: atomic_{}_return()

   these are limited to the arithmetic operations because those are
   reversible. Bitops are irreversible and therefore the modified value
   is of dubious utility.

 - operations which return the original value: atomic_fetch_{}()

 - swap operations: xchg(), cmpxchg() and try_cmpxchg()

 - misc; the special purpose operations that are commonly used and would,
   given the interface, normally be implemented using (try_)cmpxchg loops but
   are time critical and can, (typically) on LL/SC architectures, be more
   efficiently implemented.

All these operations are SMP atomic; that is, the operations (for a single
atomic variable) can be fully ordered and no intermediate state is lost or
visible.
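
The difference between the _return and fetch_ forms can be sketched with C11
<stdatomic.h>, which only offers the fetch_ flavour. The names
my_add_return() and my_fetch_or() are hypothetical userspace stand-ins, not
kernel functions.

```c
#include <stdatomic.h>

/* Returns the modified (new) value, in the spirit of atomic_add_return(). */
static int my_add_return(atomic_int *v, int i)
{
  /*
   * fetch_add yields the old value. Addition is reversible, so the new
   * value is old + i; the sum is formed in unsigned arithmetic so that it
   * wraps instead of overflowing a plain signed int.
   */
  int old = atomic_fetch_add(v, i);

  return (int)((unsigned int)old + (unsigned int)i);
}

/* Returns the original value, in the spirit of atomic_fetch_or(). */
static int my_fetch_or(atomic_int *v, int mask)
{
  return atomic_fetch_or(v, mask);
}
```

The bitwise case shows why only the fetch_ form is useful there: after
OR-ing in 0x1 and seeing 9, the caller cannot tell whether the old value was
8 or 9, whereas the returned old value answers that directly.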


ORDERING  (go read memory-barriers.txt first)
--------

The rule of thumb:

 - non-RMW operations are unordered;

 - RMW operations that have no return value are unordered;

 - RMW operations that have a return value are fully ordered;

 - RMW operations that are conditional are unordered on FAILURE,
   otherwise the above rules apply.

Except of course when an operation has an explicit ordering like:

 {}_relaxed: unordered
 {}_acquire: the R of the RMW (or atomic_read) is an ACQUIRE
 {}_release: the W of the RMW (or atomic_set)  is a  RELEASE

Where 'unordered' is against other memory locations. Address dependencies are
not defeated.

Fully ordered primitives are ordered against everything prior and everything
subsequent. Therefore a fully ordered primitive is like having an smp_mb()
before and an smp_mb() after the primitive.


The barriers:

  smp_mb__{before,after}_atomic()

only apply to the RMW atomic ops and can be used to augment/upgrade the
ordering inherent to the op. These barriers act almost like a full smp_mb():
smp_mb__before_atomic() orders all earlier accesses against the RMW op
itself and all accesses following it, and smp_mb__after_atomic() orders all
later accesses against the RMW op and all accesses preceding it. However,
accesses between the smp_mb__{before,after}_atomic() and the RMW op are not
ordered, so it is advisable to place the barrier right next to the RMW atomic
op whenever possible.

These helper barriers exist because architectures have varying implicit
ordering on their SMP atomic primitives. For example our TSO architectures
provide fully ordered atomics and these barriers are no-ops.

NOTE: when the atomic RMW ops are fully ordered, they should also imply a
compiler barrier.

Thus:

  atomic_fetch_add();

is equivalent to:

  smp_mb__before_atomic();
  atomic_fetch_add_relaxed();
  smp_mb__after_atomic();

However the atomic_fetch_add() might be implemented more efficiently.
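
A loose userspace analogue of that equivalence, written with C11
<stdatomic.h>, is sketched below. The seq_cst fences are only stand-ins for
smp_mb__{before,after}_atomic(); the C11 memory model is not the Linux kernel
memory model, so this shows the shape of the construction rather than an
exact equivalence. The name fetch_add_full() is hypothetical.

```c
#include <stdatomic.h>

/* Relaxed RMW bracketed by full fences; returns the old value. */
static int fetch_add_full(atomic_int *v, int i)
{
  int old;

  /* stand-in for smp_mb__before_atomic() */
  atomic_thread_fence(memory_order_seq_cst);

  /* the RMW itself carries no ordering of its own */
  old = atomic_fetch_add_explicit(v, i, memory_order_relaxed);

  /* stand-in for smp_mb__after_atomic() */
  atomic_thread_fence(memory_order_seq_cst);

  return old;
}
```

Note that the fences sit directly next to the RMW op, in line with the advice
above about placing the barrier right next to the atomic.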

Further, while something like:

  smp_mb__before_atomic();
  atomic_dec(&X);

is a 'typical' RELEASE pattern, the barrier is strictly stronger than
a RELEASE because it orders preceding instructions against both the read
and write parts of the atomic_dec(), and against all following instructions
as well. Similarly, something like:

  atomic_inc(&X);
  smp_mb__after_atomic();

is an ACQUIRE pattern (though very much not typical), but again the barrier is
strictly stronger than ACQUIRE. As illustrated:

  C Atomic-RMW+mb__after_atomic-is-stronger-than-acquire

  {
  }

  P0(int *x, atomic_t *y)
  {
    r0 = READ_ONCE(*x);
    smp_rmb();
    r1 = atomic_read(y);
  }

  P1(int *x, atomic_t *y)
  {
    atomic_inc(y);
    smp_mb__after_atomic();
    WRITE_ONCE(*x, 1);
  }

  exists
  (0:r0=1 /\ 0:r1=0)

This should not happen; but a hypothetical atomic_inc_acquire() --
(void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
because it would not order the W part of the RMW against the following
WRITE_ONCE.  Thus:

  P0                      P1

                          t = LL.acq *y (0)
                          t++;
                          *x = 1;
  r0 = *x (1)
  RMB
  r1 = *y (0)
                          SC *y, t;

is allowed.


CMPXCHG vs TRY_CMPXCHG
----------------------

  int atomic_cmpxchg(atomic_t *ptr, int old, int new);
  bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new);

Both provide the same functionality, but try_cmpxchg() can lead to more
compact code. The functions relate like:

  bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new)
  {
    int ret, old = *oldp;
    ret = atomic_cmpxchg(ptr, old, new);
    if (ret != old)
      *oldp = ret;
    return ret == old;
  }

and:

  int atomic_cmpxchg(atomic_t *ptr, int old, int new)
  {
    (void)atomic_try_cmpxchg(ptr, &old, new);
    return old;
  }

Usage:

  old = atomic_read(&v);                        old = atomic_read(&v);
  for (;;) {                                    do {
    new = func(old);                              new = func(old);
    tmp = atomic_cmpxchg(&v, old, new);         } while (!atomic_try_cmpxchg(&v, &old, new));
    if (tmp == old)
      break;
    old = tmp;
  }

NB. try_cmpxchg() also generates better code on some platforms (notably x86)
where the function more closely matches the hardware instruction.
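
Both loop shapes can be tried out in userspace with C11 <stdatomic.h>, whose
atomic_compare_exchange_strong() has try_cmpxchg()-like semantics (it returns
a bool and writes the observed value back through the 'expected' pointer on
failure). In the sketch below, my_cmpxchg(), func() and the two update_*()
helpers are hypothetical names chosen for illustration; func() is an
arbitrary saturating increment.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* cmpxchg()-style wrapper: returns the value found in *ptr. */
static int my_cmpxchg(atomic_int *ptr, int old, int new_val)
{
  /* On failure 'old' is overwritten with the current value of *ptr. */
  (void)atomic_compare_exchange_strong(ptr, &old, new_val);
  return old;
}

/* Hypothetical update function: increment, saturating at 10. */
static int func(int old)
{
  return old < 10 ? old + 1 : old;
}

/* Loop written against the cmpxchg()-style interface. */
static int update_with_cmpxchg(atomic_int *v)
{
  int old = atomic_load(v), new_val, tmp;

  for (;;) {
    new_val = func(old);
    tmp = my_cmpxchg(v, old, new_val);
    if (tmp == old)
      break;
    old = tmp;          /* retry with the value we actually observed */
  }
  return new_val;
}

/* Same loop against the try_cmpxchg()-style interface. */
static int update_with_try_cmpxchg(atomic_int *v)
{
  int old = atomic_load(v), new_val;

  do {
    new_val = func(old);
  } while (!atomic_compare_exchange_strong(v, &old, new_val));
  return new_val;
}
```

The second loop needs no temporary and no explicit reload, because the
failed compare-exchange already refreshed 'old'.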


FORWARD PROGRESS
----------------

In general strong forward progress is expected of all unconditional atomic
operations -- those in the Arithmetic and Bitwise classes and xchg(). However
a fair amount of code also requires forward progress from the conditional
atomic operations.

Specifically 'simple' cmpxchg() loops are expected to not starve one another
indefinitely. However, this is not evident on LL/SC architectures, because
while an LL/SC architecture 'can/should/must' provide forward progress
guarantees between competing LL/SC sections, such a guarantee does not
transfer to cmpxchg() implemented using LL/SC. Consider:

  old = atomic_read(&v);
  do {
    new = func(old);
  } while (!atomic_try_cmpxchg(&v, &old, new));

which on LL/SC becomes something like:

  old = atomic_read(&v);
  do {
    new = func(old);
  } while (!({
    volatile asm ("1: LL %[oldval], %[v]\n"
                  "   CMP %[oldval], %[old]\n"
                  "   BNE 2f\n"
                  "   SC %[new], %[v]\n"
                  "   BNE 1b\n"
                  "2:\n"
                  : [oldval] "=&r" (oldval), [v] "m" (v)
                  : [old] "r" (old), [new] "r" (new)
                  : "memory");
    success = (oldval == old);
    if (!success)
      old = oldval;
    success; }));

However, even the forward branch from the failed compare can cause the LL/SC
to fail on some architectures, let alone whatever the compiler makes of the C
loop body. As a result there is no guarantee whatsoever that the cacheline
containing @v will stay on the local CPU and progress is made.

Even native CAS architectures can fail to provide forward progress for their
primitive (see Sparc64 for an example).

Such implementations are strongly encouraged to add exponential backoff loops
to a failed CAS in order to ensure some progress. Affected architectures are
also strongly encouraged to inspect/audit the atomic fallbacks, refcount_t and
their locking primitives.
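
One possible shape for such a backoff loop is sketched below in userspace C11.
It is an assumption-laden illustration, not what any architecture actually
does: BACKOFF_MAX, spin_delay() and inc_not_zero_backoff() are hypothetical
names, the delay is a crude busy-wait, and real implementations would tune
the cap and use an architecture-specific relax/pause primitive.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BACKOFF_MAX 1024u   /* hypothetical cap on spin iterations */

/* Crude busy-wait; 'volatile' keeps the compiler from deleting the loop. */
static void spin_delay(unsigned int n)
{
  for (volatile unsigned int i = 0; i < n; i++)
    ;
}

/* Increment *v unless it is zero, backing off after each failed CAS. */
static bool inc_not_zero_backoff(atomic_int *v)
{
  unsigned int delay = 1;
  int old = atomic_load_explicit(v, memory_order_relaxed);

  while (old != 0) {
    if (atomic_compare_exchange_weak(v, &old, old + 1))
      return true;

    /* Failed (contention or spurious): wait, then double the wait. */
    spin_delay(delay);
    if (delay < BACKOFF_MAX)
      delay *= 2;

    /* The value seen by the failed CAS is stale after the delay. */
    old = atomic_load_explicit(v, memory_order_relaxed);
  }
  return false;
}
```

Doubling the delay after each failure spreads competing CPUs apart in time,
which gives at least one of them a window in which its CAS can succeed.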