Mauro Carvalho Chehab | db4df48 | 2017-05-14 13:35:29 -0300 | [diff] [blame] | 1 | ================ |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 2 | Futex Requeue PI |
Mauro Carvalho Chehab | db4df48 | 2017-05-14 13:35:29 -0300 | [diff] [blame] | 3 | ================ |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 4 | |
| 5 | Requeueing of tasks from a non-PI futex to a PI futex requires |
| 6 | special handling in order to ensure the underlying rt_mutex is never |
| 7 | left without an owner if it has waiters; doing so would break the PI |
Alexander Aring | 8c7a729 | 2021-08-23 14:31:43 -0400 | [diff] [blame] | 8 | boosting logic [see rt-mutex-design.rst] For the purposes of |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 9 | brevity, this action will be referred to as "requeue_pi" throughout |
| 10 | this document. Priority inheritance is abbreviated throughout as |
| 11 | "PI". |
| 12 | |
| 13 | Motivation |
| 14 | ---------- |
| 15 | |
| 16 | Without requeue_pi, the glibc implementation of |
| 17 | pthread_cond_broadcast() must resort to waking all the tasks waiting |
| 18 | on a pthread_condvar and letting them try to sort out which task |
| 19 | gets to run first in classic thundering-herd formation. An ideal |
| 20 | implementation would wake the highest-priority waiter, and leave the |
| 21 | rest to the natural wakeup inherent in unlocking the mutex |
| 22 | associated with the condvar. |
| 23 | |
Mauro Carvalho Chehab | db4df48 | 2017-05-14 13:35:29 -0300 | [diff] [blame] | 24 | Consider the simplified glibc calls:: |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 25 | |
Mauro Carvalho Chehab | db4df48 | 2017-05-14 13:35:29 -0300 | [diff] [blame] | 26 | /* caller must lock mutex */ |
| 27 | pthread_cond_wait(cond, mutex) |
| 28 | { |
| 29 | lock(cond->__data.__lock); |
| 30 | unlock(mutex); |
| 31 | do { |
| 32 | unlock(cond->__data.__lock); |
| 33 | futex_wait(cond->__data.__futex); |
| 34 | lock(cond->__data.__lock); |
| 35 | } while(...) |
| 36 | unlock(cond->__data.__lock); |
| 37 | lock(mutex); |
| 38 | } |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 39 | |
Mauro Carvalho Chehab | db4df48 | 2017-05-14 13:35:29 -0300 | [diff] [blame] | 40 | pthread_cond_broadcast(cond) |
| 41 | { |
| 42 | lock(cond->__data.__lock); |
| 43 | unlock(cond->__data.__lock); |
| 44 | futex_requeue(cond->data.__futex, cond->mutex); |
| 45 | } |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 46 | |
| 47 | Once pthread_cond_broadcast() requeues the tasks, the cond->mutex |
| 48 | has waiters. Note that pthread_cond_wait() attempts to lock the |
| 49 | mutex only after it has returned to user space. This will leave the |
| 50 | underlying rt_mutex with waiters, and no owner, breaking the |
| 51 | previously mentioned PI-boosting algorithms. |
| 52 | |
| 53 | In order to support PI-aware pthread_condvar's, the kernel needs to |
| 54 | be able to requeue tasks to PI futexes. This support implies that |
| 55 | upon a successful futex_wait system call, the caller would return to |
| 56 | user space already holding the PI futex. The glibc implementation |
Mauro Carvalho Chehab | db4df48 | 2017-05-14 13:35:29 -0300 | [diff] [blame] | 57 | would be modified as follows:: |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 58 | |
| 59 | |
Mauro Carvalho Chehab | db4df48 | 2017-05-14 13:35:29 -0300 | [diff] [blame] | 60 | /* caller must lock mutex */ |
| 61 | pthread_cond_wait_pi(cond, mutex) |
| 62 | { |
| 63 | lock(cond->__data.__lock); |
| 64 | unlock(mutex); |
| 65 | do { |
| 66 | unlock(cond->__data.__lock); |
| 67 | futex_wait_requeue_pi(cond->__data.__futex); |
| 68 | lock(cond->__data.__lock); |
| 69 | } while(...) |
| 70 | unlock(cond->__data.__lock); |
| 71 | /* the kernel acquired the mutex for us */ |
| 72 | } |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 73 | |
Mauro Carvalho Chehab | db4df48 | 2017-05-14 13:35:29 -0300 | [diff] [blame] | 74 | pthread_cond_broadcast_pi(cond) |
| 75 | { |
| 76 | lock(cond->__data.__lock); |
| 77 | unlock(cond->__data.__lock); |
| 78 | futex_requeue_pi(cond->data.__futex, cond->mutex); |
| 79 | } |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 80 | |
| 81 | The actual glibc implementation will likely test for PI and make the |
| 82 | necessary changes inside the existing calls rather than creating new |
| 83 | calls for the PI cases. Similar changes are needed for |
| 84 | pthread_cond_timedwait() and pthread_cond_signal(). |
| 85 | |
| 86 | Implementation |
| 87 | -------------- |
| 88 | |
| 89 | In order to ensure the rt_mutex has an owner if it has waiters, it |
| 90 | is necessary for both the requeue code, as well as the waiting code, |
| 91 | to be able to acquire the rt_mutex before returning to user space. |
| 92 | The requeue code cannot simply wake the waiter and leave it to |
| 93 | acquire the rt_mutex as it would open a race window between the |
| 94 | requeue call returning to user space and the waiter waking and |
| 95 | starting to run. This is especially true in the uncontended case. |
| 96 | |
| 97 | The solution involves two new rt_mutex helper routines, |
| 98 | rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which |
| 99 | allow the requeue code to acquire an uncontended rt_mutex on behalf |
| 100 | of the waiter and to enqueue the waiter on a contended rt_mutex. |
| 101 | Two new system calls provide the kernel<->user interface to |
Michael Kerrisk | 40a3550 | 2015-01-16 20:27:57 +0100 | [diff] [blame] | 102 | requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI. |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 103 | |
| 104 | FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() |
| 105 | and pthread_cond_timedwait()) to block on the initial futex and wait |
| 106 | to be requeued to a PI-aware futex. The implementation is the |
| 107 | result of a high-speed collision between futex_wait() and |
| 108 | futex_lock_pi(), with some extra logic to check for the additional |
| 109 | wake-up scenarios. |
| 110 | |
Michael Kerrisk | 40a3550 | 2015-01-16 20:27:57 +0100 | [diff] [blame] | 111 | FUTEX_CMP_REQUEUE_PI is called by the waker |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 112 | (pthread_cond_broadcast() and pthread_cond_signal()) to requeue and |
| 113 | possibly wake the waiting tasks. Internally, this system call is |
| 114 | still handled by futex_requeue (by passing requeue_pi=1). Before |
| 115 | requeueing, futex_requeue() attempts to acquire the requeue target |
| 116 | PI futex on behalf of the top waiter. If it can, this waiter is |
| 117 | woken. futex_requeue() then proceeds to requeue the remaining |
| 118 | nr_wake+nr_requeue tasks to the PI futex, calling |
| 119 | rt_mutex_start_proxy_lock() prior to each requeue to prepare the |
| 120 | task as a waiter on the underlying rt_mutex. It is possible that |
| 121 | the lock can be acquired at this stage as well, if so, the next |
| 122 | waiter is woken to finish the acquisition of the lock. |
| 123 | |
Michael Kerrisk | 40a3550 | 2015-01-16 20:27:57 +0100 | [diff] [blame] | 124 | FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 125 | their sum is all that really matters. futex_requeue() will wake or |
| 126 | requeue up to nr_wake + nr_requeue tasks. It will wake only as many |
| 127 | tasks as it can acquire the lock for, which in the majority of cases |
| 128 | should be 0 as good programming practice dictates that the caller of |
| 129 | either pthread_cond_broadcast() or pthread_cond_signal() acquire the |
Michael Kerrisk | 40a3550 | 2015-01-16 20:27:57 +0100 | [diff] [blame] | 130 | mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that |
Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 131 | nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for |
| 132 | signal. |