.. _list_rcu_doc:

Using RCU to Protect Read-Mostly Linked Lists
=============================================

One of the best applications of RCU is to protect read-mostly linked lists
(``struct list_head`` in list.h). One big advantage of this approach
is that all of the required memory barriers are included for you in
the list macros. This document describes several applications of RCU,
with the best fits first.


Example 1: Read-mostly list: Deferred Destruction
-------------------------------------------------

A widely used use case for RCU lists in the kernel is lockless iteration over
all processes in the system. ``task_struct::tasks`` represents the list node
that links all the processes. The list can be traversed in parallel with any
list additions or removals.

The traversal of the list is done using ``for_each_process()``, which is
defined by the following two macros::

	#define next_task(p) \
		list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

	#define for_each_process(p) \
		for (p = &init_task ; (p = next_task(p)) != &init_task ; )

The code traversing the list of all processes typically looks like::

	rcu_read_lock();
	for_each_process(p) {
		/* Do something with p */
	}
	rcu_read_unlock();

The simplified code for removing a process from a task list is::

	void release_task(struct task_struct *p)
	{
		write_lock(&tasklist_lock);
		list_del_rcu(&p->tasks);
		write_unlock(&tasklist_lock);
		call_rcu(&p->rcu, delayed_put_task_struct);
	}

When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)``
under ``tasklist_lock`` writer lock protection, to remove the task from the
list of all tasks. The ``tasklist_lock`` prevents concurrent list
additions/removals from corrupting the list. Readers using
``for_each_process()`` are not protected by the ``tasklist_lock``. To prevent
readers from noticing changes in the list pointers, the ``task_struct`` object
is freed only after one or more grace periods elapse (with the help of
call_rcu()). This deferring of destruction ensures that any readers traversing
the list will see valid ``p->tasks.next`` pointers, and that deletion/freeing
can happen in parallel with traversal of the list. This pattern is also called
an **existence lock**, since RCU pins the object in memory until all existing
readers finish.
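
For reference, the callback passed to call_rcu() receives the ``rcu_head``
that was embedded in the object and uses container_of() to recover the
enclosing structure. A minimal sketch of such a callback (the real
``delayed_put_task_struct()`` in kernel/exit.c does additional bookkeeping)
might look like::

	static void delayed_put_task_struct(struct rcu_head *rhp)
	{
		/* Recover the task_struct from its embedded rcu_head. */
		struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

		/* Safe to drop the reference: all pre-existing readers are done. */
		put_task_struct(tsk);
	}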


Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
----------------------------------------------------------------------

The best applications are cases where, if reader-writer locking were
used, the read-side lock would be dropped before taking any action
based on the results of the search. The most celebrated example is
the routing table. Because the routing table is tracking the state of
equipment outside of the computer, it will at times contain stale data.
Therefore, once the route has been computed, there is no need to hold
the routing table static during transmission of the packet. After all,
you can hold the routing table static all you want, but that won't keep
the external Internet from changing, and it is the state of the external
Internet that really matters. In addition, routing entries are typically
added or deleted, rather than being modified in place.

A straightforward example of this use of RCU may be found in the
system-call auditing support. For example, a reader-writer locked
implementation of ``audit_filter_task()`` might be as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk)
	{
		struct audit_entry *e;
		enum audit_state state;

		read_lock(&auditsc_lock);
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				read_unlock(&auditsc_lock);
				return state;
			}
		}
		read_unlock(&auditsc_lock);
		return AUDIT_BUILD_CONTEXT;
	}

Here the list is searched under the lock, but the lock is dropped before
the corresponding value is returned. By the time that this value is acted
on, the list may well have been modified. This makes sense, since if
you are turning auditing off, it is OK to audit a few extra system calls.

This means that RCU can be easily applied to the read side, as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk)
	{
		struct audit_entry *e;
		enum audit_state state;

		rcu_read_lock();
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry_rcu(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				rcu_read_unlock();
				return state;
			}
		}
		rcu_read_unlock();
		return AUDIT_BUILD_CONTEXT;
	}

The ``read_lock()`` and ``read_unlock()`` calls have become rcu_read_lock()
and rcu_read_unlock(), respectively, and the list_for_each_entry() has
become list_for_each_entry_rcu(). The **_rcu()** list-traversal primitives
insert the read-side memory barriers that are required on DEC Alpha CPUs.

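Conceptually, the only essential difference is how the ``next`` pointer is
fetched. A simplified sketch of the traversal machinery (eliding the kernel's
debugging and type-checking support, so treat it as illustrative rather than
the exact rculist.h implementation) might look like::

	/*
	 * Sketch: the pointer load is wrapped in rcu_dereference(),
	 * which supplies the dependency-ordering barrier that DEC Alpha
	 * requires and documents the RCU-protected access.
	 */
	#define list_entry_rcu(ptr, type, member) \
		container_of(rcu_dereference(ptr), type, member)

	#define list_for_each_entry_rcu(pos, head, member) \
		for (pos = list_entry_rcu((head)->next, typeof(*pos), member); \
		     &pos->member != (head); \
		     pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
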
The changes to the update side are also straightforward. A reader-writer lock
might be used as follows for deletion and insertion::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		write_lock(&auditsc_lock);
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				list_del(&e->list);
				write_unlock(&auditsc_lock);
				return 0;
			}
		}
		write_unlock(&auditsc_lock);
		return -EFAULT;	/* No matching rule */
	}

	static inline int audit_add_rule(struct audit_entry *entry,
					 struct list_head *list)
	{
		write_lock(&auditsc_lock);
		if (entry->rule.flags & AUDIT_PREPEND) {
			entry->rule.flags &= ~AUDIT_PREPEND;
			list_add(&entry->list, list);
		} else {
			list_add_tail(&entry->list, list);
		}
		write_unlock(&auditsc_lock);
		return 0;
	}

Following are the RCU equivalents for these two functions::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		/* No need to use the _rcu iterator here, since this is the only
		 * deletion routine. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				list_del_rcu(&e->list);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;	/* No matching rule */
	}

	static inline int audit_add_rule(struct audit_entry *entry,
					 struct list_head *list)
	{
		if (entry->rule.flags & AUDIT_PREPEND) {
			entry->rule.flags &= ~AUDIT_PREPEND;
			list_add_rcu(&entry->list, list);
		} else {
			list_add_tail_rcu(&entry->list, list);
		}
		return 0;
	}

Normally, the ``write_lock()`` and ``write_unlock()`` would be replaced by a
spin_lock() and a spin_unlock(). But in this case, all callers hold
``audit_filter_mutex``, so no additional locking is required. The
``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates the
need for writers to exclude readers.

The list_del(), list_add(), and list_add_tail() primitives have been
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
The **_rcu()** list-manipulation primitives add memory barriers that are
needed on weakly ordered CPUs (most of them!). The list_del_rcu() primitive
omits the pointer poisoning debug-assist code that would otherwise cause
concurrent readers to fail spectacularly.
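
To illustrate, here is a simplified sketch of the core of these primitives,
with the kernel's debug checks omitted (treat it as illustrative rather than
the exact rculist.h implementation). The ordering comes from
rcu_assign_pointer(), which guarantees that the new entry is fully initialized
before it is published, and list_del_rcu() leaves ``->next`` intact so that
concurrent readers can still advance past the removed entry::

	static inline void __list_add_rcu(struct list_head *new,
					  struct list_head *prev,
					  struct list_head *next)
	{
		new->next = next;
		new->prev = prev;
		/* Publish only after *new is fully initialized. */
		rcu_assign_pointer(list_next_rcu(prev), new);
		next->prev = new;
	}

	static inline void list_del_rcu(struct list_head *entry)
	{
		__list_del_entry(entry);
		/* Poison only ->prev: readers may still follow ->next. */
		entry->prev = LIST_POISON2;
	}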

So, when readers can tolerate stale data, and when entries are added or
deleted without in-place modification, it is very easy to use RCU!


Example 3: Handling In-Place Updates
------------------------------------

The system-call auditing code does not update auditing rules in place.
However, if it did, the reader-writer-locked code to do so might look as
follows (assuming that only ``field_count`` is updated; otherwise, the added
fields would need to be filled in)::

	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
					 __u32 newaction,
					 __u32 newfield_count)
	{
		struct audit_entry *e;

		write_lock(&auditsc_lock);
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				e->rule.action = newaction;
				e->rule.field_count = newfield_count;
				write_unlock(&auditsc_lock);
				return 0;
			}
		}
		write_unlock(&auditsc_lock);
		return -EFAULT;	/* No matching rule */
	}

The RCU version creates a copy, updates the copy, then replaces the old
entry with the newly updated entry. This sequence of actions, allowing
concurrent reads while making a copy to perform an update, is what gives
RCU (*read-copy update*) its name. The RCU code is as follows::

	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
					 __u32 newaction,
					 __u32 newfield_count)
	{
		struct audit_entry *e;
		struct audit_entry *ne;

		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
				if (ne == NULL)
					return -ENOMEM;
				audit_copy_rule(&ne->rule, &e->rule);
				ne->rule.action = newaction;
				ne->rule.field_count = newfield_count;
				list_replace_rcu(&e->list, &ne->list);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;	/* No matching rule */
	}

Again, this assumes that the caller holds ``audit_filter_mutex``. Normally, the
writer lock would become a spinlock in this sort of code.

Another use of this pattern can be found in the Open vSwitch driver's
*connection tracking table* code in ``ct_limit_set()``. The table holds
connection tracking entries and has a limit on the maximum entries. There is
one such table per zone and hence one *limit* per zone. The zones are mapped
to their limits through a hashtable using an RCU-managed hlist for the hash
chains. When a new limit is set, a new limit object is allocated and
``ct_limit_set()`` is called to replace the old limit object with the new one
using list_replace_rcu() (strictly, its hlist counterpart, hlist_replace_rcu()).
The old limit object is then freed after a grace period using kfree_rcu().
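
A condensed sketch of ``ct_limit_set()`` (simplified from
net/openvswitch/conntrack.c; the helper and field names shown here are
abbreviated, so treat the details as illustrative) might look like::

	static void ct_limit_set(const struct ovs_ct_limit_info *info,
				 struct ovs_ct_limit *new_ct_limit)
	{
		struct ovs_ct_limit *ct_limit;
		struct hlist_head *head;

		head = ct_limit_hash_bucket(info, new_ct_limit->zone);
		hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
			if (ct_limit->zone == new_ct_limit->zone) {
				/* Atomically publish the new limit object ... */
				hlist_replace_rcu(&ct_limit->hlist_node,
						  &new_ct_limit->hlist_node);
				/* ... and free the old one after a grace period. */
				kfree_rcu(ct_limit, rcu);
				return;
			}
		}
		hlist_add_head_rcu(&new_ct_limit->hlist_node, head);
	}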


Example 4: Eliminating Stale Data
---------------------------------

The auditing example above tolerates stale data, as do most algorithms
that are tracking external state. Because there is a delay between the
time the external state changes and the time Linux becomes aware of the
change, additional RCU-induced staleness is generally not a problem.

However, there are many examples where stale data cannot be tolerated.
One example in the Linux kernel is System V IPC (see the shm_lock()
function in ipc/shm.c). This code checks a *deleted* flag under a
per-entry spinlock, and, if the *deleted* flag is set, pretends that the
entry does not exist. For this to be helpful, the search function must
return holding the per-entry spinlock, as shm_lock() does in fact do.
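
The general shape of this technique, sketched here with hypothetical names
(``entry``, ``search()``, and the ``key`` field are illustrative, not the
actual ipc/shm.c identifiers), is a search that validates the flag under the
per-entry lock and returns with that lock still held::

	/* Hypothetical illustration of the deleted-flag pattern. */
	struct entry *search(struct list_head *head, int key)
	{
		struct entry *e;

		rcu_read_lock();
		list_for_each_entry_rcu(e, head, list) {
			if (e->key == key) {
				spin_lock(&e->lock);
				if (!e->deleted) {
					/* Return with e->lock held. */
					rcu_read_unlock();
					return e;
				}
				/* Deleted: pretend it was never there. */
				spin_unlock(&e->lock);
				break;
			}
		}
		rcu_read_unlock();
		return NULL;
	}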

.. _quick_quiz:

Quick Quiz:
	For the deleted-flag technique to be helpful, why is it necessary
	to hold the per-entry lock while returning from the search function?

:ref:`Answer to Quick Quiz <quick_quiz_answer>`

If the system-call audit module were ever to need to reject stale data, one
way to accomplish this would be to add a ``deleted`` flag and a ``lock``
spinlock to the audit_entry structure, and modify ``audit_filter_task()``
as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk)
	{
		struct audit_entry *e;
		enum audit_state state;

		rcu_read_lock();
		list_for_each_entry_rcu(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				spin_lock(&e->lock);
				if (e->deleted) {
					spin_unlock(&e->lock);
					rcu_read_unlock();
					return AUDIT_BUILD_CONTEXT;
				}
				rcu_read_unlock();
				return state;
			}
		}
		rcu_read_unlock();
		return AUDIT_BUILD_CONTEXT;
	}

Note that this example assumes that entries are only added and deleted.
Additional mechanisms are required to deal correctly with the update-in-place
performed by ``audit_upd_rule()``. For one thing, ``audit_upd_rule()`` would
need additional memory barriers to ensure that the list_add_rcu() was really
executed before the list_del_rcu().
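
One plausible way to provide that ordering, sketched here as an illustration
rather than taken from the kernel, is to rely on list_replace_rcu(), whose
internal rcu_assign_pointer() publishes the new entry before the old one is
unlinked, and to mark the old entry deleted under its per-entry lock. The
matching branch of ``audit_upd_rule()`` might then become::

	/* Hedged sketch: copy-based update combined with the deleted flag. */
	ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
	if (ne == NULL)
		return -ENOMEM;
	audit_copy_rule(&ne->rule, &e->rule);
	ne->rule.action = newaction;
	ne->rule.field_count = newfield_count;
	spin_lock_init(&ne->lock);
	ne->deleted = 0;
	spin_lock(&e->lock);
	list_replace_rcu(&e->list, &ne->list); /* new entry visible first */
	e->deleted = 1;
	spin_unlock(&e->lock);
	call_rcu(&e->rcu, audit_free_rule);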

The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
spinlock as follows::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		/* No need to use the _rcu iterator here, since this
		 * is the only deletion routine. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				spin_lock(&e->lock);
				list_del_rcu(&e->list);
				e->deleted = 1;
				spin_unlock(&e->lock);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;	/* No matching rule */
	}

This too assumes that the caller holds ``audit_filter_mutex``.


Example 5: Skipping Stale Objects
---------------------------------

For some use cases, reader performance can be improved by skipping, during
read-side list traversal, stale objects that are pending destruction after
one or more grace periods. One such example can be found in the timerfd
subsystem. When a ``CLOCK_REALTIME`` clock is reprogrammed, for example due
to setting of the system time, all programmed timerfds that depend on this
clock get triggered and processes waiting on them are woken up in advance
of their scheduled expiry. To facilitate this, all such timers are added to
an RCU-managed ``cancel_list`` when they are set up in
``timerfd_setup_cancel()``::

	static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
	{
		spin_lock(&ctx->cancel_lock);
		if (ctx->clockid == CLOCK_REALTIME &&
		    (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
			if (!ctx->might_cancel) {
				ctx->might_cancel = true;
				spin_lock(&cancel_lock);
				list_add_rcu(&ctx->clist, &cancel_list);
				spin_unlock(&cancel_lock);
			}
		}
		spin_unlock(&ctx->cancel_lock);
	}

When a timerfd is freed (fd is closed), the ``might_cancel`` flag of the
timerfd object is cleared, and the object is removed from the ``cancel_list``
and destroyed::

	int timerfd_release(struct inode *inode, struct file *file)
	{
		struct timerfd_ctx *ctx = file->private_data;

		spin_lock(&ctx->cancel_lock);
		if (ctx->might_cancel) {
			ctx->might_cancel = false;
			spin_lock(&cancel_lock);
			list_del_rcu(&ctx->clist);
			spin_unlock(&cancel_lock);
		}
		spin_unlock(&ctx->cancel_lock);

		hrtimer_cancel(&ctx->t.tmr);
		kfree_rcu(ctx, rcu);
		return 0;
	}

If the ``CLOCK_REALTIME`` clock is set, for example by a time server, the
hrtimer framework calls ``timerfd_clock_was_set()``, which walks the
``cancel_list`` and wakes up processes waiting on the timerfd. While iterating
the ``cancel_list``, the ``might_cancel`` flag is consulted to skip stale
objects::

	void timerfd_clock_was_set(void)
	{
		struct timerfd_ctx *ctx;
		unsigned long flags;

		rcu_read_lock();
		list_for_each_entry_rcu(ctx, &cancel_list, clist) {
			if (!ctx->might_cancel)
				continue;
			spin_lock_irqsave(&ctx->wqh.lock, flags);
			if (ctx->moffs != ktime_mono_to_real(0)) {
				ctx->moffs = KTIME_MAX;
				ctx->ticks++;
				wake_up_locked_poll(&ctx->wqh, EPOLLIN);
			}
			spin_unlock_irqrestore(&ctx->wqh.lock, flags);
		}
		rcu_read_unlock();
	}

The key point here is that, because RCU traversal of the ``cancel_list``
happens while objects are being added to and removed from the list, the
traversal can sometimes step on an object that has been removed. In this
example, it is better to skip such stale objects using a flag.


Summary
-------

Read-mostly list-based data structures that can tolerate stale data are
the most amenable to use of RCU. The simplest case is where entries are
either added to or deleted from the data structure (or atomically modified
in place), but non-atomic in-place modifications can be handled by making
a copy, updating the copy, then replacing the original with the copy.
If stale data cannot be tolerated, then a *deleted* flag may be used
in conjunction with a per-entry spinlock in order to allow the search
function to reject newly deleted data.

.. _quick_quiz_answer:

Answer to Quick Quiz:
	For the deleted-flag technique to be helpful, why is it necessary
	to hold the per-entry lock while returning from the search function?

	If the search function drops the per-entry lock before returning,
	then the caller will be processing stale data in any case. If it
	is really OK to be processing stale data, then you don't need a
	*deleted* flag. If processing stale data really is a problem,
	then you need to hold the per-entry lock across all of the code
	that uses the value that was returned.

:ref:`Back to Quick Quiz <quick_quiz>`