.. SPDX-License-Identifier: GPL-2.0

=============================
Kernel Connection Multiplexor
=============================

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message-based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below::

    +------------+   +------------+   +------------+   +------------+
    | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
    +------------+   +------------+   +------------+   +------------+
          |                |                |                |
          +-----------+    |                |    +-----------+
                      |    |                |    |
                  +----------------------------------+
                  |           Multiplexor            |
                  +----------------------------------+
                    |   |            |            | |
         +----------+   |            |            | +------------+
         |              |            |            |              |
    +----------+  +----------+  +----------+  +----------+  +----------+
    |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |
    +----------+  +----------+  +----------+  +----------+  +----------+
         |              |            |            |              |
    +----------+  +----------+  +----------+  +----------+  +----------+
    | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |
    +----------+  +----------+  +----------+  +----------+  +----------+

KCM sockets
===========

The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations in different sockets may be done in parallel without the need for
synchronization between threads in userspace.

Multiplexor
===========

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
====================

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket. This structure holds the state for constructing
messages on receive as well as other connection-specific information for KCM.

Connected mode semantics
========================

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages on the KCM socket.

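As a minimal sketch of these semantics, the two helpers below wrap plain
send() and recv() so that each call carries exactly one application message.
The names msg_send and msg_recv are hypothetical and exist only for this
illustration. On a KCM socket the file descriptor would come from
socket(AF_KCM, ...).

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send one complete application message; returns bytes sent or -1. */
ssize_t msg_send(int fd, const void *msg, size_t len)
{
    return send(fd, msg, len, 0);
}

/* Receive one complete application message; returns its length or -1. */
ssize_t msg_recv(int fd, void *buf, size_t cap)
{
    return recv(fd, buf, cap, 0);
}
```

The same helpers can be tried without KCM on an AF_UNIX datagram socketpair,
which also delivers one message per recv call.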
Socket types
============

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.

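To make the length calculation concrete, here is a userspace C sketch of what
such a parser computes. It is not a BPF program. It assumes a hypothetical
protocol whose header is a 4-byte big-endian payload length, and frame_length
is an illustrative name that is not part of KCM.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical framing: 4-byte big-endian payload length, then the payload.
 * Returns the total message length (header plus payload), or 0 when fewer
 * than 4 bytes are available and the length cannot be determined yet.
 */
size_t frame_length(const uint8_t *buf, size_t avail)
{
    uint32_t payload;

    if (avail < 4)
        return 0;

    payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
              ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

    return (size_t)payload + 4;
}
```

For a header of 00 00 01 02, the payload is 0x102 bytes and the function
returns 0x106, the size of the whole message on the wire.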
TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket. When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).

KCM limits the maximum receive message size to be the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete, an error
(ETIMEDOUT) is posted on the socket.

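Both limits are ordinary socket options on the TCP socket. The sketch below
sets them with setsockopt. The helper name tcp_set_receive_limits is
hypothetical, and calling it before the socket is attached is an assumption of
this example rather than a documented requirement.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Set the receive buffer size (bounds the maximum message size) and the
 * receive timeout (bounds message assembly time) on a TCP socket.
 * Returns 0 on success, -1 on error with errno set.
 */
int tcp_set_receive_limits(int tcpfd, int rcvbuf_bytes, long timeout_sec)
{
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
        return -1;

    return setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

Note that Linux may adjust the requested SO_RCVBUF value, so the effective
limit is best read back with getsockopt.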
User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket are created by a socket call::

  socket(AF_KCM, type, protocol)

- type is either SOCK_DGRAM or SOCK_SEQPACKET
- protocol is KCMPROTO_CONNECTED

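A minimal sketch of this call with error reporting left to the caller is shown
below. The wrapper name kcm_open is hypothetical. The fallback definition of
AF_KCM is only for C libraries whose headers predate it. The call fails with
-1 when the running kernel has no KCM support.

```c
#include <sys/socket.h>
#include <linux/kcm.h>      /* KCMPROTO_CONNECTED */

#ifndef AF_KCM
#define AF_KCM 41           /* value from the kernel's linux/socket.h */
#endif

/*
 * Create a new multiplexor and its first KCM socket.
 * type is SOCK_DGRAM or SOCK_SEQPACKET.
 * Returns the KCM socket fd, or -1 on error with errno set.
 */
int kcm_open(int type)
{
    return socket(AF_KCM, type, KCMPROTO_CONNECTED);
}
```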
Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket. This is accomplished by an ioctl on a KCM socket::

  /* From linux/kcm.h */
  struct kcm_clone {
      int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  if (!err)
      newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching of transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor, e.g.::

  /* From linux/kcm.h */
  struct kcm_attach {
      int fd;
      int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:

- fd: file descriptor for the TCP socket being attached
- bpf_fd: file descriptor for the compiled BPF program that has been loaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward. An
"unattach" ioctl is done on a KCM socket with the kcm_unattach structure as
the argument::

  /* From linux/kcm.h */
  struct kcm_unattach {
      int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  /* fd of the TCP socket to unattach */
  info.fd = tcpfd;

  ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use::

  int val = 1;

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is::

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
      return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols. Specifically, it allows
applications to use a message-based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message-based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance, copyin and copyout of data are
parallelized). In an application, a KCM socket can be opened for each
processing thread and added to an epoll set (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).

In an MxN configuration, multiple connections are established to the
same destination. These are used for simple load balancing.

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is, steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.

1) Send multiple messages in a single sendmmsg.
2) Send a group of messages each with a sendmsg call, where all messages
   except the last have MSG_BATCH in the flags of the sendmsg call.
3) Create a "super message" composed of multiple messages and send this
   with a single sendmsg.

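The second option can be sketched as follows. The helper names batch_flags and
send_batch are hypothetical, each iovec is assumed to hold one already-framed
message, and the fallback definition of MSG_BATCH is only for C libraries
whose headers do not provide it.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_BATCH
#define MSG_BATCH 0x40000   /* value from the kernel's linux/socket.h */
#endif

/* Flags for message i of n: MSG_BATCH on all but the last message. */
int batch_flags(size_t i, size_t n)
{
    return (i + 1 < n) ? MSG_BATCH : 0;
}

/*
 * Send n already-framed messages on a KCM socket as one batch.
 * Returns 0 on success, -1 on the first sendmsg() error with errno set.
 */
int send_batch(int kcmfd, const struct iovec *msgs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        struct msghdr hdr = {
            .msg_iov = (struct iovec *)&msgs[i],
            .msg_iovlen = 1,
        };

        if (sendmsg(kcmfd, &hdr, batch_flags(i, n)) < 0)
            return -1;
    }
    return 0;
}
```

Clearing MSG_BATCH on the final message marks the end of the batch.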
On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback. The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).

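A monitoring thread along these lines could be built from the three helpers
sketched below. The names monitor_add, monitor_wait and monitor_drop are
hypothetical, and treating EPOLLHUP the same as EPOLLERR is a choice made for
this example.

```c
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/sockios.h>  /* SIOCPROTOPRIVATE, used by SIOCKCMUNATTACH */
#include <linux/kcm.h>

/* Register an attached TCP socket so that error events wake the monitor. */
int monitor_add(int epfd, int tcpfd)
{
    struct epoll_event ev = { .events = EPOLLERR, .data.fd = tcpfd };

    return epoll_ctl(epfd, EPOLL_CTL_ADD, tcpfd, &ev);
}

/* Block until a registered socket reports an error; return its fd or -1. */
int monitor_wait(int epfd)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, -1);

    return (n == 1 && (ev.events & (EPOLLERR | EPOLLHUP))) ? ev.data.fd : -1;
}

/* Unattach a failed TCP socket from the multiplexor, then close it. */
void monitor_drop(int kcmfd, int tcpfd)
{
    struct kcm_unattach info = { .fd = tcpfd };

    ioctl(kcmfd, SIOCKCMUNATTACH, &info);
    close(tcpfd);
}
```

The monitor thread would loop on monitor_wait and pass each returned
descriptor to monitor_drop, reconnecting afterwards if the application needs
a replacement connection.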
TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).