.. SPDX-License-Identifier: GPL-2.0

=============================
Kernel Connection Multiplexor
=============================

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below::

    +------------+   +------------+   +------------+   +------------+
    | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
    +------------+   +------------+   +------------+   +------------+
          |                |                |                |
          +---------+      |                |   +------------+
                    |      |                |   |
                +----------------------------------+
                |           Multiplexor            |
                +----------------------------------+
                    |  |             |         |  |
         +----------+  |             |         |  +--------------+
         |             |             |         |                 |
    +----------+  +----------+  +----------+  +----------+  +----------+
    |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |
    +----------+  +----------+  +----------+  +----------+  +----------+
         |             |             |             |             |
    +----------+  +----------+  +----------+  +----------+  +----------+
    | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |
    +----------+  +----------+  +----------+  +----------+  +----------+

KCM sockets
===========

The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations in different sockets may be done in parallel without the need for
synchronization between threads in userspace.

Multiplexor
===========

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
====================

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket. This structure holds the state for constructing
messages on receive as well as other connection-specific information for KCM.

Connected mode semantics
========================

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages from the KCM socket.

Socket types
============

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor, a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.
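
The length computation that such a parser performs can be illustrated in plain
userspace C. The following is a minimal sketch, not KCM code: it assumes a
hypothetical framing with a 4-byte big-endian length field followed by the
payload, and the function name frame_total_len is invented for this example.
Returning 0 while the header is still incomplete is a convention of the sketch
only::

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  /*
   * Return the total length (header plus payload) of the message that
   * starts at buf, or 0 if fewer than 4 header bytes have arrived.
   */
  uint64_t frame_total_len(const uint8_t *buf, size_t avail)
  {
          uint32_t payload;

          assert(buf != NULL);
          if (avail < 4)
                  return 0;

          /* Assemble the big-endian (network order) length field. */
          payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                    ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

          return (uint64_t)payload + 4;
  }

With the bytes 00 00 01 02 at the start of the stream, the function reports
262 bytes: 258 bytes of payload plus the 4-byte length field.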

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket. When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).

KCM limits the maximum receive message size to be the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete, an error
(ETIMEDOUT) is posted on the socket.
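
Both limits are ordinary socket options on the TCP socket. As a minimal sketch,
assuming the options are set before the socket is attached (the helper name
kcm_tune_tcp and its parameters are invented for this example)::

  #include <assert.h>
  #include <sys/socket.h>
  #include <sys/time.h>

  /*
   * Set the receive buffer size and the receive timeout on a TCP
   * socket. Returns 0 on success, -1 with errno set on failure.
   */
  int kcm_tune_tcp(int tcpfd, int rcvbuf_bytes, long timeout_sec)
  {
          struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

          assert(rcvbuf_bytes > 0 && timeout_sec >= 0);

          /* Bounds the largest message KCM will assemble. */
          if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
                         &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
                  return -1;

          /* Bounds how long assembly of one message may take. */
          return setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO,
                            &tv, sizeof(tv));
  }

Note that Linux doubles the value passed to SO_RCVBUF and caps it at the
system-wide maximum, so the effective size is best read back with getsockopt.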

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket are created by a socket call::

  socket(AF_KCM, type, protocol)

- type is either SOCK_DGRAM or SOCK_SEQPACKET
- protocol is KCMPROTO_CONNECTED
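
A minimal sketch of a wrapper around this call follows. The function name
kcm_open is invented for this example, and the fallback definition of AF_KCM
is only there for older C library headers that may lack it::

  #include <errno.h>
  #include <sys/socket.h>
  #include <linux/kcm.h>          /* KCMPROTO_CONNECTED */

  #ifndef AF_KCM
  #define AF_KCM 41               /* value used by the Linux headers */
  #endif

  /*
   * Create a new multiplexor and its first KCM socket.
   * Returns the socket descriptor, or -1 with errno set.
   */
  int kcm_open(int type)
  {
          /* Only the two socket types listed above are accepted. */
          if (type != SOCK_DGRAM && type != SOCK_SEQPACKET) {
                  errno = EINVAL;
                  return -1;
          }

          return socket(AF_KCM, type, KCMPROTO_CONNECTED);
  }

On a kernel without KCM support the socket call itself is expected to fail,
so callers should check the return value before using the descriptor.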

Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket. This is accomplished by an ioctl on a KCM socket::

  /* From linux/kcm.h */
  struct kcm_clone {
          int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  if (!err)
          newkcmfd = info.fd;

Attach transport sockets
------------------------

Transport sockets are attached to a multiplexor by calling an ioctl on one of
the multiplexor's KCM sockets, for example::

  /* From linux/kcm.h */
  struct kcm_attach {
          int fd;
          int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:

  - fd: file descriptor for the TCP socket being attached
  - bpf_fd: file descriptor for the compiled BPF program that has been loaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward. An
"unattach" ioctl is done with the kcm_unattach structure as the argument::

  /* From linux/kcm.h */
  struct kcm_unattach {
          int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  info.fd = cfd;          /* TCP socket to unattach */

  ioctl(fd, SIOCKCMUNATTACH, &info);      /* fd is a KCM socket */

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use::

  int val = 1;

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is::

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
          /* Length field at offset 0, plus 4 bytes for the field itself */
          return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols. Specifically, it allows
applications to use a message based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized). In an application, a KCM socket can be opened for each
processing thread and added to an epoll set (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).

In an MxN configuration, multiple connections are established to the
same destination. These are used for simple load balancing.
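
For the one-socket-per-thread arrangement described for the Nx1 case, the
extra KCM sockets can be produced with the clone ioctl. The following is a
minimal sketch; the helper name kcm_clone_n is invented for this example, and
it assumes that linux/sockios.h must be included to supply the base value
used by the ioctl numbers in linux/kcm.h::

  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/sockios.h>      /* SIOCPROTOPRIVATE */
  #include <linux/kcm.h>          /* struct kcm_clone, SIOCKCMCLONE */

  /*
   * Clone n additional KCM sockets from kcmfd into fds[]. On failure,
   * close the clones made so far and return -1; otherwise return 0.
   */
  int kcm_clone_n(int kcmfd, int *fds, int n)
  {
          struct kcm_clone info;
          int i;

          for (i = 0; i < n; i++) {
                  memset(&info, 0, sizeof(info));
                  if (ioctl(kcmfd, SIOCKCMCLONE, &info) < 0) {
                          while (i-- > 0)
                                  close(fds[i]);
                          return -1;
                  }
                  fds[i] = info.fd;
          }

          return 0;
  }

Each worker thread would then take one descriptor from fds[], add it to its
own epoll set, and receive messages independently of the other threads.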

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.

  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.
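
The second method can be sketched as follows. This is an illustration only:
the helper name kcm_send_batch is invented, each iovec entry is assumed to
hold one complete message, and the fallback definition of MSG_BATCH is only
for C library headers that may lack it::

  #include <stddef.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  #ifndef MSG_BATCH
  #define MSG_BATCH 0x40000       /* value used by the Linux headers */
  #endif

  /*
   * Send n messages, one per iovec entry, flagging all but the last
   * with MSG_BATCH. Returns 0 on success, -1 on the first failure.
   */
  int kcm_send_batch(int kcmfd, struct iovec *msgs, size_t n)
  {
          size_t i;

          for (i = 0; i < n; i++) {
                  struct msghdr mh = {
                          .msg_iov = &msgs[i],
                          .msg_iovlen = 1,
                  };
                  /* The last message of the group carries no flag. */
                  int flags = (i + 1 < n) ? MSG_BATCH : 0;

                  if (sendmsg(kcmfd, &mh, flags) < 0)
                          return -1;
          }

          return 0;
  }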

On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback. The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connections. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect), it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).
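
The two steps of this procedure, watching a TCP socket and dropping it after
an error, might be sketched as below. The helper names kcm_watch_tcp and
kcm_drop_tcp are invented for this example, which uses the epoll spelling
EPOLLERR for the error event and again assumes linux/sockios.h is needed
alongside linux/kcm.h::

  #include <string.h>
  #include <unistd.h>
  #include <sys/epoll.h>
  #include <sys/ioctl.h>
  #include <linux/sockios.h>      /* SIOCPROTOPRIVATE */
  #include <linux/kcm.h>          /* struct kcm_unattach */

  /* Register an attached TCP socket with the monitor's epoll set. */
  int kcm_watch_tcp(int epfd, int tcpfd)
  {
          struct epoll_event ev;

          memset(&ev, 0, sizeof(ev));
          ev.events = EPOLLERR;
          ev.data.fd = tcpfd;

          return epoll_ctl(epfd, EPOLL_CTL_ADD, tcpfd, &ev);
  }

  /*
   * Unattach a failed TCP socket from the multiplexor, then close it.
   * Returns the result of the unattach ioctl.
   */
  int kcm_drop_tcp(int kcmfd, int tcpfd)
  {
          struct kcm_unattach info;
          int err;

          memset(&info, 0, sizeof(info));
          info.fd = tcpfd;

          err = ioctl(kcmfd, SIOCKCMUNATTACH, &info);
          close(tcpfd);

          return err;
  }

The monitoring thread would call epoll_wait on epfd and pass the descriptor
from each reported event to kcm_drop_tcp, reconnecting afterwards if needed.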

TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).
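
One way to read such statistics is the TCP_INFO socket option. The sketch
below is an assumption about how an application might do this rather than
part of the KCM interface; the helper name tcp_total_retrans is invented for
this example::

  #include <string.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>        /* struct tcp_info, TCP_INFO */

  /*
   * Read the cumulative retransmission count of a TCP socket.
   * Returns 0 and stores the count in *total, or -1 on error.
   */
  int tcp_total_retrans(int tcpfd, unsigned int *total)
  {
          struct tcp_info ti;
          socklen_t len = sizeof(ti);

          memset(&ti, 0, sizeof(ti));
          if (getsockopt(tcpfd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
                  return -1;

          *total = ti.tcpi_total_retrans;
          return 0;
  }

Sampling this counter periodically for each attached TCP socket gives a rough
per-connection health signal even though individual messages cannot be traced
to a connection.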