============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP and UDP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective for writes larger than roughly 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds its ulimit on locked pages.

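One way to handle the ENOBUFS case is to retry the same buffer as an
ordinary copying send. The following is a minimal sketch, assuming that
SO_ZEROCOPY was already set on the socket. The helper name
send_zerocopy_or_copy and its used_zerocopy out-parameter are
illustrative only and not part of any kernel interface.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/*
 * Hypothetical helper: try a zerocopy send and fall back to a plain
 * copying send if the kernel refuses with ENOBUFS.
 *
 * *used_zerocopy is set to 1 only when the MSG_ZEROCOPY call itself
 * sent data. In that case the caller must leave buf untouched until
 * the matching completion notification arrives.
 */
static ssize_t send_zerocopy_or_copy(int fd, const void *buf, size_t len,
				     int *used_zerocopy)
{
	ssize_t ret;

	*used_zerocopy = 0;

	ret = send(fd, buf, len, MSG_ZEROCOPY);
	if (ret >= 0) {
		/* A zero-length send does not produce a notification. */
		*used_zerocopy = ret > 0;
		return ret;
	}
	if (errno != ENOBUFS)
		return -1;	/* unrelated error: report it unchanged */

	/* Zerocopy was refused; the data is copied during this call. */
	return send(fd, buf, len, 0);
}
```

Whether retrying is the right policy depends on the application. A
process that hits ENOBUFS repeatedly may prefer to raise the relevant
limit or stop passing the flag.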

Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls that pass the
flag with calls that do not.

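Because the flag is chosen per call, the decision can be made from the
buffer length alone. The sketch below uses a hypothetical cut-over
constant, ZC_MIN_BYTES, set from the rough 10 KB estimate given earlier
in this document. The right value is workload and hardware dependent
and should be measured.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/* Assumed threshold, not a kernel constant. Tune by measurement. */
#define ZC_MIN_BYTES ((size_t)10 * 1024)

/* Return the send flags to use for a buffer of the given length. */
static int send_flags_for_len(size_t len)
{
	return len >= ZC_MIN_BYTES ? MSG_ZEROCOPY : 0;
}
```

A caller would then write send(fd, buf, len, send_flags_for_len(len))
and expect a completion notification only for the calls that carried
the flag.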

Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.

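To match notifications to buffers, a process can mirror this counter in
user space. The sketch below is one possible way to do so. The struct
and function names are hypothetical, and it assumes that the first
counted send on a socket is reported as id 0, which is how the selftest
mentioned under Testing tracks completions.

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical user-space mirror of the per-socket counter. */
struct zc_counter {
	uint32_t next_id;	/* id the next counted send will receive */
};

/*
 * Call after every send with MSG_ZEROCOPY, passing its return value.
 * Returns 1 and stores the id of this call in *id if the call counted,
 * or 0 if it did not (failure or zero length).
 */
static int zc_counter_note_send(struct zc_counter *c, ssize_t ret,
				uint32_t *id)
{
	if (ret <= 0)
		return 0;

	/* Unsigned arithmetic wraps to 0 after UINT32_MAX. */
	*id = c->next_id++;
	return 1;
}
```

The returned id is what later appears inside a notification range, so
it can be stored next to the buffer that was passed to the send call.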

Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.

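A non-blocking read of the error queue can be sketched as a loop that
stops once recvmsg reports that nothing is queued. The helper name
drain_error_queue, the callback parameter and the control buffer size
are assumptions made for this illustration.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: read all queued notifications without blocking.
 * Calls handle(&msg) for each message read from the error queue.
 * Returns the number of messages read, or -1 on an unexpected
 * recvmsg error.
 */
static int drain_error_queue(int fd, void (*handle)(struct msghdr *msg))
{
	/* The union keeps the buffer aligned for struct cmsghdr. */
	union {
		char buf[128];
		struct cmsghdr align;
	} control;
	struct msghdr msg;
	int count = 0;

	for (;;) {
		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control.buf;
		msg.msg_controllen = sizeof(control.buf);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return count;	/* queue is empty */
			return -1;
		}
		if (handle)
			handle(&msg);
		count++;
	}
}
```

A sender could call this after every few send calls, passing a parser
such as read_notification as the callback.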

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, it checks whether the
new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.

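Since a notification carries an inclusive range that may wrap and may
arrive out of order, it helps to handle ranges with unsigned 32-bit
arithmetic and to record completions per id rather than as a single
high-water mark. The sketch below assumes a hypothetical done[] table
indexed by id modulo a power-of-two slot count. None of these names
come from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of send calls covered by the inclusive range [lo, hi]. */
static uint32_t zc_range_len(uint32_t lo, uint32_t hi)
{
	return hi - lo + 1;	/* modulo 2^32, so a wrapped range works */
}

/*
 * Mark every send id in [lo, hi] as completed. done[] is a hypothetical
 * table of nslots entries indexed by id & (nslots - 1). nslots must be
 * a power of two so the index stays continuous across counter wrap.
 */
static void zc_mark_completed(bool *done, size_t nslots,
			      uint32_t lo, uint32_t hi)
{
	uint32_t n = zc_range_len(lo, hi);

	for (uint32_t i = 0; i < n; i++)
		done[(uint32_t)(lo + i) & (nslots - 1)] = true;
}
```

The table must have at least as many slots as there can be sends in
flight, otherwise two outstanding ids would share a slot.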

Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR for IPv4, or SOL_IPV6 with
IPV6_RECVERR for IPv6.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, bar for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	/* This example handles the IPv4 case only. */
	cm = CMSG_FIRSTHDR(msg);
	if (cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	/* Inclusive range of completed send calls. */
	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);

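The snippet above checks for IPv4 only. A program that uses both
address families could accept either level and type pair, as in this
sketch. The helper name is_recverr_cmsg is hypothetical, and the
fallback defines are only there for builds where the libc headers hide
SOL_IP and SOL_IPV6.

```c
#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SOL_IP
#define SOL_IP 0	/* same value as IPPROTO_IP */
#endif
#ifndef SOL_IPV6
#define SOL_IPV6 41	/* same value as IPPROTO_IPV6 */
#endif

/* Return true if cm is an IPv4 or IPv6 error queue control message. */
static bool is_recverr_cmsg(const struct cmsghdr *cm)
{
	if (!cm)
		return false;

	return (cm->cmsg_level == SOL_IP &&
		cm->cmsg_type == IP_RECVERR) ||
	       (cm->cmsg_level == SOL_IPV6 &&
		cm->cmsg_type == IPV6_RECVERR);
}
```

The NULL check also covers the case where CMSG_FIRSTHDR finds no
control data in the message.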

Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.