===============
BPF ring buffer
===============

This document describes BPF ring buffer design, API, and implementation details.

.. contents::
    :local:
    :depth: 2

Motivation
----------

Two distinct motivations, neither satisfied by the existing perf buffer,
prompted the creation of a new ring buffer implementation:

- more efficient memory utilization by sharing ring buffer across CPUs;
- preserving ordering of events that happen sequentially in time, even across
  multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but the perf buffer satisfies neither.
Both are the result of the choice to have a per-CPU perf ring buffer, and both
can be solved by an MPSC (multi-producer, single-consumer) implementation of
a ring buffer. The ordering problem could technically be solved for the perf
buffer with some in-kernel counting, but given that the first problem requires
an MPSC buffer anyway, the same solution solves the second problem
automatically.

Semantics and APIs
------------------

A single ring buffer is presented to BPF programs as an instance of a BPF map
of type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives were considered, but
were ultimately rejected.

One way would be to, similarly to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, make
``BPF_MAP_TYPE_RINGBUF`` represent an array of ring buffers, but without
enforcing the "same CPU only" rule. This would be a more familiar interface,
compatible with existing perf buffer use in BPF, but it would fail if an
application needed more advanced logic to look up a ring buffer by an
arbitrary key; with the current approach, ``BPF_MAP_TYPE_HASH_OF_MAPS``
addresses that need. Additionally, given the performance of BPF ringbuf, many
use cases would just opt for a simple single ring buffer shared among all
CPUs, for which an array of ring buffers would be overkill.

Another approach could introduce a new concept, alongside BPF map, to represent
a generic "container" object, which doesn't necessarily have a key/value
interface with lookup/update/delete operations. This approach would add a lot
of extra infrastructure that has to be built for observability and verifier
support. It would also add another concept that BPF developers would have to
familiarize themselves with, new syntax in libbpf, etc., yet would provide no
real additional benefit over using a map. ``BPF_MAP_TYPE_RINGBUF`` doesn't
support lookup/update/delete operations, but neither do a few other map types
(e.g., queue and stack; array doesn't support delete, etc.).

The chosen approach has the advantages of re-using existing BPF map
infrastructure (introspection APIs in the kernel, libbpf support, etc.), being
a familiar concept (no need to teach users a new type of object in a BPF
program), and utilizing existing tooling (bpftool). For the common scenario of
using a single ring buffer for all CPUs, it is as simple and straightforward
as a dedicated "container" object would be. On the other hand, by being a map,
it can be combined with ``ARRAY_OF_MAPS`` and ``HASH_OF_MAPS`` map-in-maps to
implement a wide variety of topologies, from one ring buffer for each CPU
(e.g., as a replacement for perf buffer use cases) to complicated
application-level hashing/sharding of ring buffers (e.g., a small pool of ring
buffers with a hashed task's tgid as the lookup key, preserving ordering while
reducing contention); a sketch of the latter follows below.
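
As an illustration, a minimal sketch of such a pool of ring buffers, using
libbpf's BTF-style map definition syntax (the map names, pool size, and buffer
size are illustrative only; the static inner-map initialization is modeled on
the ``test_ringbuf_multi`` selftest):

.. code-block:: c

    /* Inner ring buffer "template" map. */
    struct ringbuf_map {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 64 * 1024);
    } rb1 SEC(".maps");

    /* Pool of ring buffers, keyed by an application-defined value,
     * e.g., a hash of the task's tgid.
     */
    struct {
            __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
            __uint(max_entries, 8);
            __type(key, int);
            __array(values, struct ringbuf_map);
    } rb_pool SEC(".maps") = {
            .values = { [0] = &rb1 },
    };

A BPF program would then use ``bpf_map_lookup_elem(&rb_pool, &key)`` to pick
a ring buffer and reserve/submit records into it.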

Key and value sizes are enforced to be zero. ``max_entries`` is used to
specify the size of the ring buffer in bytes and has to be a power of 2.
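
For example, a minimal BPF-side map definition might look as follows (the
256 KiB size is an arbitrary example value):

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 256 * 1024); /* size in bytes, power of 2 */
    } rb SEC(".maps");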

There are quite a few similarities between the perf buffer
(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and the new BPF ring buffer semantics:

- variable-length records;
- if there is no more space left in the ring buffer, reservation fails and
  there is no blocking;
- memory-mappable data area for user-space applications, for ease of
  consumption and high performance;
- epoll notifications for new incoming data;
- but still the ability to do busy polling for new data, to achieve the
  lowest latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:

- ``bpf_ringbuf_output()`` allows *copying* data from one place into the ring
  buffer, similarly to ``bpf_perf_event_output()``;
- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_submit()``/``bpf_ringbuf_discard()``
  split the whole process into two steps. First, a fixed amount of space is
  reserved. If successful, a pointer to data inside the ring buffer data area
  is returned, which BPF programs can use similarly to data inside array/hash
  maps. Once ready, this piece of memory is either submitted or discarded.
  Discard is similar to submit, but makes the consumer ignore the record (see
  the sketch after this list).
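
A minimal sketch of the reserve/submit flow, assuming the ``rb`` map defined
above and a hypothetical ``struct event`` payload:

.. code-block:: c

    struct event {
            int pid;
            char comm[16];
    };

    SEC("tp/sched/sched_process_exec")
    int handle_exec(void *ctx)
    {
            struct event *e;

            /* Reserve fixed-size space; returns NULL if the buffer is full. */
            e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
            if (!e)
                    return 0;

            e->pid = bpf_get_current_pid_tgid() >> 32;
            bpf_get_current_comm(e->comm, sizeof(e->comm));

            /* Make the record visible to the consumer. */
            bpf_ringbuf_submit(e, 0);
            return 0;
    }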

``bpf_ringbuf_output()`` has the disadvantage of incurring an extra memory
copy, because the record has to be prepared in some other place first. But it
allows submitting records of a length that is not known to the verifier
beforehand. It also closely matches ``bpf_perf_event_output()``, which will
simplify migration significantly.

``bpf_ringbuf_reserve()`` avoids the extra copy by returning a pointer
directly into the ring buffer memory. In a lot of cases records are larger
than BPF stack space allows, so many programs have to use an extra per-CPU
array as a temporary heap for preparing a sample. ``bpf_ringbuf_reserve()``
avoids this need completely. In exchange, it only allows reserving a known,
constant size of memory, so that the verifier can check that the BPF program
can't access memory outside its reserved record space.
``bpf_ringbuf_output()``, while slightly slower due to the extra memory copy,
covers some use cases that are not suitable for ``bpf_ringbuf_reserve()``.
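
For comparison, a sketch of the single-call copy with ``bpf_ringbuf_output()``,
reusing the hypothetical ``struct event`` from above:

.. code-block:: c

    struct event e = {};

    e.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(e.comm, sizeof(e.comm));

    /* Copy the stack-prepared record into the ring buffer in one call. */
    bpf_ringbuf_output(&rb, &e, sizeof(e), 0);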

The difference between submit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by consumer
code. Discard is useful for some advanced use cases, such as ensuring
all-or-nothing multi-record submission (sketched below), or emulating
temporary ``malloc()``/``free()`` within a single BPF program invocation.
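
A sketch of the all-or-nothing pattern, with hypothetical
``fill_part1()``/``fill_part2()`` helpers standing in for real record
preparation:

.. code-block:: c

    struct event *a, *b;

    a = bpf_ringbuf_reserve(&rb, sizeof(*a), 0);
    if (!a)
            return 0;
    b = bpf_ringbuf_reserve(&rb, sizeof(*b), 0);
    if (!b) {
            /* Can't get both records: roll back the first reservation. */
            bpf_ringbuf_discard(a, 0);
            return 0;
    }
    fill_part1(a);
    fill_part2(b);
    bpf_ringbuf_submit(a, 0);
    bpf_ringbuf_submit(b, 0);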

Each reserved record is tracked by the verifier through the existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record but forget to submit or discard it.

The ``bpf_ringbuf_query()`` helper allows querying various properties of the
ring buffer. Currently 4 are supported:

- ``BPF_RB_AVAIL_DATA`` returns the amount of unconsumed data in the ring
  buffer;
- ``BPF_RB_RING_SIZE`` returns the size of the ring buffer;
- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` return the current logical position
  of the consumer/producer, respectively.
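
For instance, a hypothetical backpressure heuristic might look like this (the
half-full threshold is an arbitrary example):

.. code-block:: c

    __u64 avail = bpf_ringbuf_query(&rb, BPF_RB_AVAIL_DATA);
    __u64 size = bpf_ringbuf_query(&rb, BPF_RB_RING_SIZE);

    /* Buffer more than half full: shed low-priority events. */
    if (avail > size / 2)
            return 0;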

The returned values are momentary snapshots of the ring buffer state and could
be off by the time the helper returns, so this should be used only for
debugging/reporting purposes or for implementing various heuristics that take
into account the highly changeable nature of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in the ring buffer. Together with
the ``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for the
output/submit/discard helpers, it allows a BPF program a high degree of
control and, e.g., more efficient batched notifications (see the sketch
below). The default self-balancing strategy, though, should be adequate for
most applications and will already work reliably and efficiently.
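
A sketch of manually batched notifications, assuming a hypothetical global
counter ``nr_submitted`` and an arbitrary batch size of 64:

.. code-block:: c

    static __u64 nr_submitted;

    /* After filling record "e": suppress per-record wakeups, forcing
     * one notification at the end of each batch of 64 records.
     */
    __u64 flags = (++nr_submitted % 64 == 0) ? BPF_RB_FORCE_WAKEUP
                                             : BPF_RB_NO_WAKEUP;
    bpf_ringbuf_submit(e, flags);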

Design and Implementation
-------------------------

The reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if a BPF program was interrupted by another BPF program sharing the
same ring buffer, both will get a record reserved (provided there is enough
space left) and can work with it and submit it independently. This applies to
NMI context as well, except that, due to a spinlock being taken during
reservation, ``bpf_ringbuf_reserve()`` in NMI context might fail to acquire
the lock, in which case the reservation will fail even if the ring buffer is
not full.

The ring buffer itself is internally implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures; as shown below, that's not a problem):

- the consumer counter shows up to which logical position the consumer has
  consumed the data;
- the producer counter denotes the amount of data reserved by all producers.
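
Wrap-around is harmless because modular unsigned arithmetic on the counters
still yields correct differences (``prod_pos``/``cons_pos`` stand for the two
counters):

.. code-block:: c

    /* Amount of unconsumed data; correct even if both counters wrapped. */
    unsigned long unconsumed = prod_pos - cons_pos;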

Each time a record is reserved, the producer that "owns" the record will
successfully advance the producer counter. At that point, the data is still
not yet ready to be consumed, though. Each record has an 8-byte header, which
contains the length of the reserved record, as well as two extra bits: the
busy bit, denoting that the record is still being worked on, and the discard
bit, which might be set at commit time if the record is discarded. In the
latter case, the consumer is supposed to skip the record and move on to the
next one. The record header also encodes the record's relative offset from the
beginning of the ring buffer data area (in pages). This allows
``bpf_ringbuf_submit()``/``bpf_ringbuf_discard()`` to accept only the pointer
to the record itself, without also requiring a pointer to the ring buffer: the
ring buffer memory location is restored from the record metadata header. This
significantly simplifies the verifier, as well as improving API usability.
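
A sketch of the header layout as described (the authoritative definitions live
in ``kernel/bpf/ringbuf.c`` and ``include/uapi/linux/bpf.h``):

.. code-block:: c

    #define BPF_RINGBUF_BUSY_BIT    (1U << 31) /* record still being written */
    #define BPF_RINGBUF_DISCARD_BIT (1U << 30) /* consumer must skip record */

    struct bpf_ringbuf_hdr {
            __u32 len;    /* record length, plus the two bits above */
            __u32 pg_off; /* offset back to ring buffer start, in pages */
    };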

Producer counter increments are serialized under a spinlock, so there is
strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to the
consumer in the order of reservations, but only after all previous records
were already committed. It is thus possible for slow producers to temporarily
hold off submitted records that were reserved later.

One interesting implementation bit that significantly simplifies (and thus
speeds up) the implementation of both producers and consumers is that the data
area is mapped twice contiguously back-to-back in virtual memory. This removes
the need for any special measures for samples that have to wrap around at the
end of the circular buffer data area, because the next page after the last
data page would be the first data page again, and thus the sample will still
appear completely contiguous in virtual memory. See the comment and a simple
ASCII diagram showing this visually in ``bpf_ringbuf_area_alloc()``, as well
as the consumer-side sketch below.
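
A simplified consumer-side sketch, assuming the header layout sketched earlier
and hypothetical ``data`` (a ``char *`` to the double-mapped data area),
``size`` (data area size), and ``cons_pos`` variables:

.. code-block:: c

    size_t off = cons_pos & (size - 1); /* logical position -> data offset */
    struct bpf_ringbuf_hdr *hdr = (struct bpf_ringbuf_hdr *)(data + off);
    __u32 len = hdr->len & ~(BPF_RINGBUF_BUSY_BIT | BPF_RINGBUF_DISCARD_BIT);
    void *sample = hdr + 1;

    /* The sample is contiguous even if it crosses the end of the first
     * mapping: bytes past data + size land in the second mapping of the
     * same physical pages.
     */
    consume(sample, len); /* hypothetical per-sample callback */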

Another feature that distinguishes BPF ringbuf from perf ring buffer is
self-pacing notification of new data availability. The internal
``bpf_ringbuf_commit()`` step, shared by submit and discard, will send
a notification of a new record being available only if the consumer has
already caught up right up to the record being committed. If not, the consumer
still has to catch up and thus will see the new data anyway, without needing
an extra poll notification. Benchmarks (see
tools/testing/selftests/bpf/benchs/bench_ringbufs.c) show that this allows
achieving very high throughput without having to resort to tricks like "notify
only every Nth sample", which are necessary with the perf buffer. For extreme
cases, when a BPF program wants more manual control of notifications, the
submit/discard/output helpers accept the ``BPF_RB_NO_WAKEUP`` and
``BPF_RB_FORCE_WAKEUP`` flags, which give full control over notifications of
data availability, but require extra caution and diligence in using this API.