blob: 1846c199da23e3e5e2da764a21d0ba4e682fb041 [file] [log] [blame]
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -03001======================================
Thomas Hellstrom08295b32018-06-15 10:17:38 +02002Wound/Wait Deadlock-Proof Mutex Design
Maarten Lankhorst040a0a32013-06-24 10:30:04 +02003======================================
4
5Please read mutex-design.txt first, as it applies to wait/wound mutexes too.
6
7Motivation for WW-Mutexes
8-------------------------
9
10GPU's do operations that commonly involve many buffers. Those buffers
11can be shared across contexts/processes, exist in different memory
12domains (for example VRAM vs system memory), and so on. And with
13PRIME / dmabuf, they can even be shared across devices. So there are
14a handful of situations where the driver needs to wait for buffers to
15become ready. If you think about this in terms of waiting on a buffer
16mutex for it to become available, this presents a problem because
17there is no way to guarantee that buffers appear in a execbuf/batch in
18the same order in all contexts. That is directly under control of
19userspace, and a result of the sequence of GL calls that an application
20makes. Which results in the potential for deadlock. The problem gets
21more complex when you consider that the kernel may need to migrate the
22buffer(s) into VRAM before the GPU operates on the buffer(s), which
23may in turn require evicting some other buffers (and you don't want to
24evict other buffers which are already queued up to the GPU), but for a
25simplified understanding of the problem you can ignore this.
26
27The algorithm that the TTM graphics subsystem came up with for dealing with
28this problem is quite simple. For each group of buffers (execbuf) that need
29to be locked, the caller would be assigned a unique reservation id/ticket,
30from a global counter. In case of deadlock while locking all the buffers
31associated with a execbuf, the one with the lowest reservation ticket (i.e.
32the oldest task) wins, and the one with the higher reservation id (i.e. the
33younger task) unlocks all of the buffers that it has already locked, and then
34tries again.
35
Thomas Hellstrom08295b32018-06-15 10:17:38 +020036In the RDBMS literature, a reservation ticket is associated with a transaction.
37and the deadlock handling approach is called Wait-Die. The name is based on
38the actions of a locking thread when it encounters an already locked mutex.
39If the transaction holding the lock is younger, the locking transaction waits.
40If the transaction holding the lock is older, the locking transaction backs off
41and dies. Hence Wait-Die.
42There is also another algorithm called Wound-Wait:
43If the transaction holding the lock is younger, the locking transaction
44wounds the transaction holding the lock, requesting it to die.
45If the transaction holding the lock is older, it waits for the other
46transaction. Hence Wound-Wait.
47The two algorithms are both fair in that a transaction will eventually succeed.
48However, the Wound-Wait algorithm is typically stated to generate fewer backoffs
49compared to Wait-Die, but is, on the other hand, associated with more work than
50Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
51algorithm in that transactions are wounded by other transactions, and that
52requires a reliable way to pick up up the wounded condition and preempt the
53running transaction. Note that this is not the same as process preemption. A
54Wound-Wait transaction is considered preempted when it dies (returning
55-EDEADLK) following a wound.
Maarten Lankhorst040a0a32013-06-24 10:30:04 +020056
57Concepts
58--------
59
60Compared to normal mutexes two additional concepts/objects show up in the lock
61interface for w/w mutexes:
62
63Acquire context: To ensure eventual forward progress it is important the a task
64trying to acquire locks doesn't grab a new reservation id, but keeps the one it
65acquired when starting the lock acquisition. This ticket is stored in the
66acquire context. Furthermore the acquire context keeps track of debugging state
Thomas Hellstrom08295b32018-06-15 10:17:38 +020067to catch w/w mutex interface abuse. An acquire context is representing a
68transaction.
Maarten Lankhorst040a0a32013-06-24 10:30:04 +020069
70W/w class: In contrast to normal mutexes the lock class needs to be explicit for
Thomas Hellstrom08295b32018-06-15 10:17:38 +020071w/w mutexes, since it is required to initialize the acquire context. The lock
72class also specifies what algorithm to use, Wound-Wait or Wait-Die.
Maarten Lankhorst040a0a32013-06-24 10:30:04 +020073
74Furthermore there are three different class of w/w lock acquire functions:
75
76* Normal lock acquisition with a context, using ww_mutex_lock.
77
Peter Ziljstra55f036c2018-06-15 10:07:12 +020078* Slowpath lock acquisition on the contending lock, used by the task that just
79 killed its transaction after having dropped all already acquired locks.
80 These functions have the _slow postfix.
Maarten Lankhorst040a0a32013-06-24 10:30:04 +020081
82 From a simple semantics point-of-view the _slow functions are not strictly
83 required, since simply calling the normal ww_mutex_lock functions on the
84 contending lock (after having dropped all other already acquired locks) will
85 work correctly. After all if no other ww mutex has been acquired yet there's
86 no deadlock potential and hence the ww_mutex_lock call will block and not
87 prematurely return -EDEADLK. The advantage of the _slow functions is in
88 interface safety:
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -030089
Maarten Lankhorst040a0a32013-06-24 10:30:04 +020090 - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
91 has a void return type. Note that since ww mutex code needs loops/retries
92 anyway the __must_check doesn't result in spurious warnings, even though the
93 very first lock operation can never fail.
94 - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
95 ww mutex have been released (preventing deadlocks) and makes sure that we
96 block on the contending lock (preventing spinning through the -EDEADLK
97 slowpath until the contended lock can be acquired).
98
99* Functions to only acquire a single w/w mutex, which results in the exact same
100 semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
101 context.
102
103 Again this is not strictly required. But often you only want to acquire a
104 single lock in which case it's pointless to set up an acquire context (and so
105 better to avoid grabbing a deadlock avoidance ticket).
106
107Of course, all the usual variants for handling wake-ups due to signals are also
108provided.
109
110Usage
111-----
112
Thomas Hellstrom08295b32018-06-15 10:17:38 +0200113The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
114DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die)
115As a rough rule of thumb, use Wound-Wait iff you
116expect the number of simultaneous competing transactions to be typically small,
117and you want to reduce the number of rollbacks.
118
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200119Three different ways to acquire locks within the same w/w class. Common
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300120definitions for methods #1 and #2::
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200121
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300122 static DEFINE_WW_CLASS(ww_class);
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200123
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300124 struct obj {
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200125 struct ww_mutex lock;
126 /* obj data */
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300127 };
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200128
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300129 struct obj_entry {
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200130 struct list_head head;
131 struct obj *obj;
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300132 };
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200133
134Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
135This is useful if a list of required objects is already tracked somewhere.
136Furthermore the lock helper can use propagate the -EALREADY return code back to
137the caller as a signal that an object is twice on the list. This is useful if
138the list is constructed from userspace input and the ABI requires userspace to
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300139not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl)::
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200140
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300141 int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
142 {
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200143 struct obj *res_obj = NULL;
144 struct obj_entry *contended_entry = NULL;
145 struct obj_entry *entry;
146
147 ww_acquire_init(ctx, &ww_class);
148
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300149 retry:
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200150 list_for_each_entry (entry, list, head) {
151 if (entry->obj == res_obj) {
152 res_obj = NULL;
153 continue;
154 }
155 ret = ww_mutex_lock(&entry->obj->lock, ctx);
156 if (ret < 0) {
157 contended_entry = entry;
158 goto err;
159 }
160 }
161
162 ww_acquire_done(ctx);
163 return 0;
164
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300165 err:
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200166 list_for_each_entry_continue_reverse (entry, list, head)
167 ww_mutex_unlock(&entry->obj->lock);
168
169 if (res_obj)
170 ww_mutex_unlock(&res_obj->lock);
171
172 if (ret == -EDEADLK) {
173 /* we lost out in a seqno race, lock and retry.. */
174 ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
175 res_obj = contended_entry->obj;
176 goto retry;
177 }
178 ww_acquire_fini(ctx);
179
180 return ret;
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300181 }
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200182
183Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
184of duplicate entry detection using -EALREADY as method 1 above. But the
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300185list-reordering allows for a bit more idiomatic code::
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200186
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300187 int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
188 {
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200189 struct obj_entry *entry, *entry2;
190
191 ww_acquire_init(ctx, &ww_class);
192
193 list_for_each_entry (entry, list, head) {
194 ret = ww_mutex_lock(&entry->obj->lock, ctx);
195 if (ret < 0) {
196 entry2 = entry;
197
198 list_for_each_entry_continue_reverse (entry2, list, head)
199 ww_mutex_unlock(&entry2->obj->lock);
200
201 if (ret != -EDEADLK) {
202 ww_acquire_fini(ctx);
203 return ret;
204 }
205
206 /* we lost out in a seqno race, lock and retry.. */
207 ww_mutex_lock_slow(&entry->obj->lock, ctx);
208
209 /*
210 * Move buf to head of the list, this will point
211 * buf->next to the first unlocked entry,
212 * restarting the for loop.
213 */
214 list_del(&entry->head);
215 list_add(&entry->head, list);
216 }
217 }
218
219 ww_acquire_done(ctx);
220 return 0;
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300221 }
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200222
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300223Unlocking works the same way for both methods #1 and #2::
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200224
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300225 void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
226 {
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200227 struct obj_entry *entry;
228
229 list_for_each_entry (entry, list, head)
230 ww_mutex_unlock(&entry->obj->lock);
231
232 ww_acquire_fini(ctx);
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300233 }
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200234
235Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
236e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
237and edges can only be changed when holding the locks of all involved nodes. w/w
238mutexes are a natural fit for such a case for two reasons:
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300239
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200240- They can handle lock-acquisition in any order which allows us to start walking
241 a graph from a starting point and then iteratively discovering new edges and
242 locking down the nodes those edges connect to.
243- Due to the -EALREADY return code signalling that a given objects is already
244 held there's no need for additional book-keeping to break cycles in the graph
245 or keep track off which looks are already held (when using more than one node
246 as a starting point).
247
248Note that this approach differs in two important ways from the above methods:
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300249
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200250- Since the list of objects is dynamically constructed (and might very well be
Peter Ziljstra55f036c2018-06-15 10:07:12 +0200251 different when retrying due to hitting the -EDEADLK die condition) there's
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200252 no need to keep any object on a persistent list when it's not locked. We can
253 therefore move the list_head into the object itself.
254- On the other hand the dynamic object list construction also means that the -EALREADY return
255 code can't be propagated.
256
257Note also that methods #1 and #2 and method #3 can be combined, e.g. to first lock a
258list of starting nodes (passed in from userspace) using one of the above
259methods. And then lock any additional objects affected by the operations using
260method #3 below. The backoff/retry procedure will be a bit more involved, since
261when the dynamic locking step hits -EDEADLK we also need to unlock all the
262objects acquired with the fixed list. But the w/w mutex debug checks will catch
263any interface misuse for these cases.
264
265Also, method 3 can't fail the lock acquisition step since it doesn't return
266-EALREADY. Of course this would be different when using the _interruptible
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300267variants, but that's outside of the scope of these examples here::
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200268
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300269 struct obj {
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200270 struct ww_mutex ww_mutex;
271 struct list_head locked_list;
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300272 };
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200273
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300274 static DEFINE_WW_CLASS(ww_class);
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200275
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300276 void __unlock_objs(struct list_head *list)
277 {
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200278 struct obj *entry, *temp;
279
280 list_for_each_entry_safe (entry, temp, list, locked_list) {
281 /* need to do that before unlocking, since only the current lock holder is
282 allowed to use object */
283 list_del(&entry->locked_list);
284 ww_mutex_unlock(entry->ww_mutex)
285 }
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300286 }
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200287
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300288 void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
289 {
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200290 struct obj *obj;
291
292 ww_acquire_init(ctx, &ww_class);
293
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300294 retry:
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200295 /* re-init loop start state */
296 loop {
297 /* magic code which walks over a graph and decides which objects
298 * to lock */
299
300 ret = ww_mutex_lock(obj->ww_mutex, ctx);
301 if (ret == -EALREADY) {
302 /* we have that one already, get to the next object */
303 continue;
304 }
305 if (ret == -EDEADLK) {
306 __unlock_objs(list);
307
308 ww_mutex_lock_slow(obj, ctx);
309 list_add(&entry->locked_list, list);
310 goto retry;
311 }
312
313 /* locked a new object, add it to the list */
314 list_add_tail(&entry->locked_list, list);
315 }
316
317 ww_acquire_done(ctx);
318 return 0;
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300319 }
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200320
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300321 void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
322 {
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200323 __unlock_objs(list);
324 ww_acquire_fini(ctx);
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300325 }
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200326
327Method 4: Only lock one single objects. In that case deadlock detection and
328prevention is obviously overkill, since with grabbing just one lock you can't
329produce a deadlock within just one class. To simplify this case the w/w mutex
330api can be used with a NULL context.
331
332Implementation Details
333----------------------
334
335Design:
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300336^^^^^^^
337
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200338 ww_mutex currently encapsulates a struct mutex, this means no extra overhead for
339 normal mutex locks, which are far more common. As such there is only a small
340 increase in code size if wait/wound mutexes are not used.
341
Nicolai Hähnle27bd57a2016-12-21 19:46:40 +0100342 We maintain the following invariants for the wait list:
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300343
Nicolai Hähnle27bd57a2016-12-21 19:46:40 +0100344 (1) Waiters with an acquire context are sorted by stamp order; waiters
345 without an acquire context are interspersed in FIFO order.
Thomas Hellstrom08295b32018-06-15 10:17:38 +0200346 (2) For Wait-Die, among waiters with contexts, only the first one can have
347 other locks acquired already (ctx->acquired > 0). Note that this waiter
348 may come after other waiters without contexts in the list.
349
350 The Wound-Wait preemption is implemented with a lazy-preemption scheme:
351 The wounded status of the transaction is checked only when there is
352 contention for a new lock and hence a true chance of deadlock. In that
353 situation, if the transaction is wounded, it backs off, clears the
354 wounded status and retries. A great benefit of implementing preemption in
355 this way is that the wounded transaction can identify a contending lock to
356 wait for before restarting the transaction. Just blindly restarting the
357 transaction would likely make the transaction end up in a situation where
358 it would have to back off again.
Nicolai Hähnle27bd57a2016-12-21 19:46:40 +0100359
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200360 In general, not much contention is expected. The locks are typically used to
Thomas Hellstrom08295b32018-06-15 10:17:38 +0200361 serialize access to resources for devices, and optimization focus should
362 therefore be directed towards the uncontended cases.
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200363
364Lockdep:
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300365^^^^^^^^
366
Maarten Lankhorst040a0a32013-06-24 10:30:04 +0200367 Special care has been taken to warn for as many cases of api abuse
368 as possible. Some common api abuses will be caught with
369 CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.
370
371 Some of the errors which will be warned about:
372 - Forgetting to call ww_acquire_fini or ww_acquire_init.
373 - Attempting to lock more mutexes after ww_acquire_done.
374 - Attempting to lock the wrong mutex after -EDEADLK and
375 unlocking all mutexes.
376 - Attempting to lock the right mutex after -EDEADLK,
377 before unlocking all mutexes.
378
379 - Calling ww_mutex_lock_slow before -EDEADLK was returned.
380
381 - Unlocking mutexes with the wrong unlock function.
382 - Calling one of the ww_acquire_* twice on the same context.
383 - Using a different ww_class for the mutex than for the ww_acquire_ctx.
384 - Normal lockdep errors that can result in deadlocks.
385
386 Some of the lockdep errors that can result in deadlocks:
387 - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
388 having called ww_acquire_fini on the first.
389 - 'normal' deadlocks that can occur.
390
Mauro Carvalho Chehab387b1462019-04-10 08:32:41 -0300391FIXME:
392 Update this section once we have the TASK_DEADLOCK task state flag magic
393 implemented.