============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective for writes larger than roughly 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after the system call
returns without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can corrupt its own data stream.

The kernel returns a notification when it is safe to modify the data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information, see that paper and talk,
the excellent reporting at LWN.net, or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
        error(1, errno, "setsockopt zerocopy");

Setting the socket option only works when the socket is in its initial
(TCP_CLOSED) state. Trying to set the option for a socket returned by accept(),
for example, will lead to an EBUSY error. In this case, the option should be set
on the listening socket and it will be inherited by the accepted sockets.
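
The setup step can be wrapped in a small helper that a server calls on
its listening socket before accept(). This is a minimal sketch, not code
from the kernel tree: the name zc_enable() is hypothetical, and the
fallback value for SO_ZEROCOPY is the asm-generic one, which a few
architectures override.

```c
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60   /* asm-generic UAPI value; a few architectures differ */
#endif

/*
 * Hypothetical helper: opt a socket in to MSG_ZEROCOPY.
 * Call it on a fresh or listening socket, before connect() or accept().
 * Returns 0 on success, -1 with errno set on failure.
 */
int zc_enable(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}
```

A server would call zc_enable(listen_fd) once; sockets returned by
accept() then inherit the option.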

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

    ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds their ulimit on locked pages.
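
One way to handle ENOBUFS is to retry the same data as an ordinary
copying send. The sketch below illustrates that policy; it is an
assumption about what an application might want, not behavior mandated
by the interface. The helper name zc_send_or_copy() and the used_zc
out-parameter are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/*
 * Hypothetical helper: try a zerocopy send and fall back to a copying
 * send if the kernel reports ENOBUFS. *used_zc is set to 1 only when the
 * MSG_ZEROCOPY call sent data, so the caller knows whether buf must stay
 * untouched until a completion notification arrives.
 */
ssize_t zc_send_or_copy(int fd, const void *buf, size_t len, int *used_zc)
{
    ssize_t ret;

    *used_zc = 0;
    ret = send(fd, buf, len, MSG_ZEROCOPY);
    if (ret >= 0) {
        /* zero-length sends do not produce a notification */
        *used_zc = ret > 0;
        return ret;
    }
    if (errno != ENOBUFS)
        return -1;

    /* Zerocopy was refused: retry as an ordinary copying send. */
    return send(fd, buf, len, 0);
}
```

Any error other than ENOBUFS is passed back to the caller unchanged.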


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls that pass the
flag with calls that do not.
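
Because the flag is chosen per call, a sender can decide from the
payload size alone. The following sketch assumes a fixed cutoff of
10 KB, taken from the rule of thumb given earlier; the constant
ZC_MIN_LEN and the helper zc_flags_for_len() are hypothetical, and the
best threshold depends on the workload and hardware.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Assumed cutoff; the ~10 KB figure is a rule of thumb, so measure and tune. */
#define ZC_MIN_LEN (10 * 1024)

/* Hypothetical helper: choose send flags per call based on payload size. */
int zc_flags_for_len(size_t len)
{
    return len >= ZC_MIN_LEN ? MSG_ZEROCOPY : 0;
}
```

A call site would then read send(fd, buf, len, zc_flags_for_len(len)).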


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.
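
To match notifications to buffers, a process can mirror the counter in
user space. This sketch applies the rules above: only a send that
returns a positive byte count consumes an ID. It assumes that the first
ID on a fresh socket is zero, which the text above does not state. The
struct zc_counter and zc_counter_record() are hypothetical names.

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical user-space mirror of the per-socket counter. */
struct zc_counter {
    uint32_t next_id;   /* ID the next successful zerocopy send will get
                         * (assumed to start at 0 on a fresh socket) */
};

/*
 * Record the result of one send() made with MSG_ZEROCOPY.
 * Returns 1 and stores the ID in *id if the call consumed an ID,
 * 0 if it failed or sent nothing.
 */
int zc_counter_record(struct zc_counter *c, ssize_t ret, uint32_t *id)
{
    if (ret <= 0)
        return 0;
    *id = c->next_id++;   /* unsigned arithmetic wraps past UINT32_MAX to 0 */
    return 1;
}
```

The returned ID can be stored next to the buffer pointer so that the
buffer is released when a notification covering that ID arrives.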


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

    pfd.fd = fd;
    pfd.events = 0;
    /* POLLERR is reported even though no events were requested */
    if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
        error(1, errno, "poll");

    ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
    if (ret == -1)
        error(1, errno, "recvmsg");

    read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.
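
The non-blocking variant can be sketched as a loop that reads the error
queue until it is empty. The helper name zc_drain_errqueue() and the
128-byte control buffer are assumptions for illustration; parsing of
each message is omitted here. The sketch is meant for sockets that
implement MSG_ERRQUEUE, such as TCP.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: read every notification currently queued, without
 * blocking. Returns the number of messages read, or -1 on a real error.
 * Parsing is left out; a real caller would inspect msg.msg_control.
 */
int zc_drain_errqueue(int fd)
{
    char control[128];   /* assumed large enough for one notification */
    struct msghdr msg;
    int count = 0;

    for (;;) {
        memset(&msg, 0, sizeof(msg));
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
            if (errno == EAGAIN)
                return count;   /* queue is empty */
            return -1;
        }
        count++;
    }
}
```

A sender might call this after every few send calls instead of
blocking in poll after each one.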

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, it checks whether the
new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.
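
Since a notification carries an inclusive range of 32-bit IDs and the
counter wraps, range arithmetic is best done modulo 2^32. The two
helpers below are a hypothetical sketch of that arithmetic. They do not
handle a single range that covers all 2^32 IDs.

```c
#include <stdint.h>

/* Number of send calls covered by the inclusive range [lo, hi]. */
uint32_t zc_range_count(uint32_t lo, uint32_t hi)
{
    return hi - lo + 1;   /* modulo 2^32, so a range spanning the wrap works */
}

/* Returns 1 if id falls inside the inclusive, possibly wrapped, range. */
int zc_range_contains(uint32_t lo, uint32_t hi, uint32_t id)
{
    return (uint32_t)(id - lo) <= (uint32_t)(hi - lo);
}
```

Because notifications may arrive out of order, a receiver should test
each pending buffer ID against the range rather than assume that the
range starts at the oldest pending ID.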


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR, or SOL_IPV6 with IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

    struct sock_extended_err *serr;
    struct cmsghdr *cm;

    /* msg is the struct msghdr pointer passed to read_notification() */
    cm = CMSG_FIRSTHDR(msg);
    /* reject the message if either the level or the type is wrong */
    if (cm->cmsg_level != SOL_IP ||
        cm->cmsg_type != IP_RECVERR)
        error(1, 0, "cmsg");

    serr = (void *) CMSG_DATA(cm);
    if (serr->ee_errno != 0 ||
        serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
        error(1, 0, "serr");

    printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.
261skb_orphan_frags_rx identical to skb_orphan_frags.