| 1 | Scaling in the Linux Networking Stack |
| 2 | |
| 3 | |
| 4 | Introduction |
| 5 | ============ |
| 6 | |
| 7 | This document describes a set of complementary techniques in the Linux |
| 8 | networking stack to increase parallelism and improve performance for |
| 9 | multi-processor systems. |
| 10 | |
| 11 | The following technologies are described: |
| 12 | |
| 13 | RSS: Receive Side Scaling |
| 14 | RPS: Receive Packet Steering |
| 15 | RFS: Receive Flow Steering |
| 16 | Accelerated Receive Flow Steering |
| 17 | XPS: Transmit Packet Steering |
| 18 | |
| 19 | |
| 20 | RSS: Receive Side Scaling |
| 21 | ========================= |
| 22 | |
| 23 | Contemporary NICs support multiple receive and transmit descriptor queues |
| 24 | (multi-queue). On reception, a NIC can send different packets to different |
| 25 | queues to distribute processing among CPUs. The NIC distributes packets by |
| 26 | applying a filter to each packet that assigns it to one of a small number |
| 27 | of logical flows. Packets for each flow are steered to a separate receive |
| 28 | queue, which in turn can be processed by separate CPUs. This mechanism is |
| 29 | generally known as “Receive-side Scaling” (RSS). The goal of RSS and |
| 30 | the other scaling techniques is to increase performance uniformly. |
| 31 | Multi-queue distribution can also be used for traffic prioritization, but |
| 32 | that is not the focus of these techniques. |
| 33 | |
| 34 | The filter used in RSS is typically a hash function over the network |
| 35 | and/or transport layer headers; for example, a 4-tuple hash over |
| 36 | IP addresses and TCP ports of a packet. The most common hardware |
| 37 | implementation of RSS uses a 128-entry indirection table where each entry |
| 38 | stores a queue number. The receive queue for a packet is determined |
| 39 | by extracting the low-order seven bits of the computed hash for the |
| 40 | packet (usually a Toeplitz hash), taking this number as a key into the |
| 41 | indirection table and reading the corresponding value. |
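The lookup described above can be sketched in a few lines. This is an illustrative model only, not driver or firmware code: the function names and the round-robin fill are assumptions, and the hash is taken as an already computed integer rather than a real Toeplitz hash.

```python
# Hypothetical model of an RSS indirection table lookup (not driver code).
TABLE_SIZE = 128  # the common 128-entry table mentioned in the text


def build_even_table(num_queues, size=TABLE_SIZE):
    """Fill the table round-robin so every queue gets an equal share."""
    return [i % num_queues for i in range(size)]


def rss_queue(flow_hash, table):
    """Pick a receive queue from the low-order bits of the flow hash."""
    # For a 128-entry table the mask is 0x7f, i.e. the low-order seven bits.
    index = flow_hash & (len(table) - 1)
    return table[index]


table = build_even_table(num_queues=4)
queue = rss_queue(0x9E3779B9, table)
```

Because the index depends only on the hash, all packets of one flow land on the same queue, which is what keeps a flow in order.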
| 42 | |
| 43 | Some advanced NICs allow steering packets to queues based on |
| 44 | programmable filters. For example, webserver bound TCP port 80 packets |
| 45 | can be directed to their own receive queue. Such “n-tuple” filters can |
| 46 | be configured from ethtool (--config-ntuple). |
| 47 | |
| 48 | ==== RSS Configuration |
| 49 | |
| 50 | The driver for a multi-queue capable NIC typically provides a kernel |
| 51 | module parameter for specifying the number of hardware queues to |
| 52 | configure. In the bnx2x driver, for instance, this parameter is called |
| 53 | num_queues. A typical RSS configuration would be to have one receive queue |
| 54 | for each CPU if the device supports enough queues, or otherwise at least |
| 55 | one for each memory domain, where a memory domain is a set of CPUs that |
| 56 | share a particular memory level (L1, L2, NUMA node, etc.). |
| 57 | |
| 58 | The indirection table of an RSS device, which resolves a queue by masked |
| 59 | hash, is usually programmed by the driver at initialization. The |
| 60 | default mapping is to distribute the queues evenly in the table, but the |
| 61 | indirection table can be retrieved and modified at runtime using ethtool |
| 62 | commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the |
| 63 | indirection table could be done to give different queues different |
| 64 | relative weights. |
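As a rough illustration of how relative weights translate into table contents, the sketch below fills a 128-entry table so that each queue owns a share of entries proportional to its weight. The function name and the way leftover entries are handed out are assumptions made for this example; a real driver or tool may distribute entries differently.

```python
def weighted_indirection_table(weights, size=128):
    """Return a table where queue q owns roughly weights[q]/sum(weights) entries."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    table = []
    for queue, weight in enumerate(weights):
        table.extend([queue] * (size * weight // total))
    # Integer division can leave a few entries unassigned; hand them out
    # round-robin to the queues that have a non-zero weight.
    nonzero = [q for q, w in enumerate(weights) if w > 0]
    i = 0
    while len(table) < size:
        table.append(nonzero[i % len(nonzero)])
        i += 1
    return table


# Queue 0 should receive about three times as many flows as queue 1.
table = weighted_indirection_table([3, 1])
```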
| 65 | |
| 66 | == RSS IRQ Configuration |
| 67 | |
| 68 | Each receive queue has a separate IRQ associated with it. The NIC triggers |
| 69 | this to notify a CPU when new packets arrive on the given queue. The |
| 70 | signaling path for PCIe devices uses message signaled interrupts (MSI-X), |
| 71 | which can route each interrupt to a particular CPU. The active mapping |
| 72 | of queues to IRQs can be determined from /proc/interrupts. By default, |
| 73 | an IRQ may be handled on any CPU. Because a non-negligible part of packet |
| 74 | processing takes place in receive interrupt handling, it is advantageous |
| 75 | to spread receive interrupts between CPUs. To manually adjust the IRQ |
| 76 | affinity of each interrupt, see Documentation/IRQ-affinity.txt. Some systems |
| 77 | will be running irqbalance, a daemon that dynamically optimizes IRQ |
| 78 | assignments and as a result may override any manual settings. |
| 79 | |
| 80 | == Suggested Configuration |
| 81 | |
| 82 | RSS should be enabled when latency is a concern or whenever receive |
| 83 | interrupt processing forms a bottleneck. Spreading load between CPUs |
| 84 | decreases queue length. For low latency networking, the optimal setting |
| 85 | is to allocate as many queues as there are CPUs in the system (or the |
| 86 | NIC maximum, if lower). The most efficient high-rate configuration |
| 87 | is likely the one with the smallest number of receive queues where no |
| 88 | receive queue overflows due to a saturated CPU, because in default |
| 89 | mode with interrupt coalescing enabled, the aggregate number of |
| 90 | interrupts (and thus work) grows with each additional queue. |
| 91 | |
| 92 | Per-cpu load can be observed using the mpstat utility, but note that on |
| 93 | processors with hyperthreading (HT), each hyperthread is represented as |
| 94 | a separate CPU. For interrupt handling, HT has shown no benefit in |
| 95 | initial tests, so limit the number of queues to the number of CPU cores |
| 96 | in the system. |
| 97 | |
| 98 | |
| 99 | RPS: Receive Packet Steering |
| 100 | ============================ |
| 101 | |
| 102 | Receive Packet Steering (RPS) is logically a software implementation of |
| 103 | RSS. Being in software, it is necessarily called later in the datapath. |
| 104 | Whereas RSS selects the queue and hence CPU that will run the hardware |
| 105 | interrupt handler, RPS selects the CPU to perform protocol processing |
| 106 | above the interrupt handler. This is accomplished by placing the packet |
| 107 | on the desired CPU’s backlog queue and waking up the CPU for processing. |
| 108 | RPS has some advantages over RSS: 1) it can be used with any NIC, |
| 109 | 2) software filters can easily be added to hash over new protocols, |
| 110 | 3) it does not increase hardware device interrupt rate (although it does |
| 111 | introduce inter-processor interrupts (IPIs)). |
| 112 | |
| 113 | RPS is called during the bottom half of the receive interrupt handler, when |
| 114 | a driver sends a packet up the network stack with netif_rx() or |
| 115 | netif_receive_skb(). These call the get_rps_cpu() function, which |
| 116 | selects the queue that should process a packet. |
| 117 | |
| 118 | The first step in determining the target CPU for RPS is to calculate a |
| 119 | flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash |
| 120 | depending on the protocol). This serves as a consistent hash of the |
| 121 | associated flow of the packet. The hash is either provided by hardware |
| 122 | or will be computed in the stack. Capable hardware can pass the hash in |
| 123 | the receive descriptor for the packet; this would usually be the same |
| 124 | hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in |
| 125 | skb->hash and can be used elsewhere in the stack as a hash of the |
| 126 | packet’s flow. |
| 127 | |
| 128 | Each receive hardware queue has an associated list of CPUs to which |
| 129 | RPS may enqueue packets for processing. For each received packet, |
| 130 | an index into the list is computed from the flow hash modulo the size |
| 131 | of the list. The indexed CPU is the target for processing the packet, |
| 132 | and the packet is queued to the tail of that CPU’s backlog queue. At |
| 133 | the end of the bottom half routine, IPIs are sent to any CPUs for which |
| 134 | packets have been queued to their backlog queue. The IPI wakes backlog |
| 135 | processing on the remote CPU, and any queued packets are then processed |
| 136 | up the networking stack. |
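The CPU selection step can be modelled as follows. This is a simplified sketch of the rule as stated in the text (flow hash modulo list size); the function name is hypothetical and the kernel's actual index computation may differ in detail.

```python
def rps_target_cpu(flow_hash, rps_cpus):
    """Return the CPU that should process a flow, or None if RPS is disabled.

    rps_cpus is the ordered list of CPUs configured for the receive queue.
    """
    if not rps_cpus:
        return None  # empty list: process on the interrupting CPU
    return rps_cpus[flow_hash % len(rps_cpus)]


# Packets of the same flow always map to the same CPU.
cpu = rps_target_cpu(0x1234, [2, 3, 6, 7])
```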
| 137 | |
| 138 | ==== RPS Configuration |
| 139 | |
| 140 | RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on |
| 141 | by default for SMP). Even when compiled in, RPS remains disabled until |
| 142 | explicitly configured. The list of CPUs to which RPS may forward traffic |
| 143 | can be configured for each receive queue using a sysfs file entry: |
| 144 | |
| 145 | /sys/class/net/<dev>/queues/rx-<n>/rps_cpus |
| 146 | |
| 147 | This file implements a bitmap of CPUs. RPS is disabled when it is zero |
| 148 | (the default), in which case packets are processed on the interrupting |
| 149 | CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to |
| 150 | the bitmap. |
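A small helper can turn a list of CPU numbers into the hexadecimal bitmap string that such files take. This is a sketch under the assumption that the file accepts the usual kernel CPU mask notation (comma-separated 32-bit hex groups, most significant group first); the function name is made up for this example.

```python
def cpu_list_to_mask(cpus):
    """Format CPU indices as a hex CPU bitmap string, e.g. [0, 1, 2, 3] -> 'f'."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    # Split into 32-bit groups, least significant first.
    groups = []
    while True:
        groups.append(format(mask & 0xFFFFFFFF, "08x"))
        mask >>= 32
        if mask == 0:
            break
    # Most significant group first; drop leading zeros of the top group only.
    text = ",".join(reversed(groups))
    return text.lstrip("0") or "0"


mask = cpu_list_to_mask([0, 1, 2, 3])
# The resulting string would then be written (as root) to
# /sys/class/net/<dev>/queues/rx-<n>/rps_cpus for the chosen device and queue.
```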
| 151 | |
| 152 | == Suggested Configuration |
| 153 | |
| 154 | For a single queue device, a typical RPS configuration would be to set |
| 155 | the rps_cpus to the CPUs in the same memory domain of the interrupting |
| 156 | CPU. If NUMA locality is not an issue, this could also be all CPUs in |
| 157 | the system. At high interrupt rate, it might be wise to exclude the |
| 158 | interrupting CPU from the map since that already performs much work. |
| 159 | |
| 160 | For a multi-queue system, if RSS is configured so that a hardware |
| 161 | receive queue is mapped to each CPU, then RPS is probably redundant |
| 162 | and unnecessary. If there are fewer hardware queues than CPUs, then |
| 163 | RPS might be beneficial if the rps_cpus for each queue are the ones that |
| 164 | share the same memory domain as the interrupting CPU for that queue. |
| 165 | |
| 166 | ==== RPS Flow Limit |
| 167 | |
| 168 | RPS scales kernel receive processing across CPUs without introducing |
| 169 | reordering. The trade-off to sending all packets from the same flow |
| 170 | to the same CPU is CPU load imbalance if flows vary in packet rate. |
| 171 | In the extreme case a single flow dominates traffic. Especially on |
| 172 | common server workloads with many concurrent connections, such |
| 173 | behavior indicates a problem such as a misconfiguration or spoofed |
| 174 | source Denial of Service attack. |
| 175 | |
| 176 | Flow Limit is an optional RPS feature that prioritizes small flows |
| 177 | during CPU contention by dropping packets from large flows slightly |
| 178 | ahead of those from small flows. It is active only when an RPS or RFS |
| 179 | destination CPU approaches saturation. Once a CPU's input packet |
| 180 | queue exceeds half the maximum queue length (as set by sysctl |
| 181 | net.core.netdev_max_backlog), the kernel starts a per-flow packet |
| 182 | count over the last 256 packets. If a flow exceeds a set ratio (by |
| 183 | default, half) of these packets when a new packet arrives, then the |
| 184 | new packet is dropped. Packets from other flows are still only |
| 185 | dropped once the input packet queue reaches netdev_max_backlog. |
| 186 | No packets are dropped when the input packet queue length is below |
| 187 | the threshold, so flow limit does not sever connections outright: |
| 188 | even large flows maintain connectivity. |
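The drop decision can be illustrated with a small model. It follows the numbers given in this document (a 256-packet history and a ratio of one half) and is not a transcription of the kernel implementation; the class name, its parameters and the bucket count default are assumptions for the example.

```python
from collections import deque


class FlowLimit:
    """Toy model of per-CPU flow limit accounting."""

    def __init__(self, max_backlog=1000, history_len=256, num_buckets=4096):
        self.max_backlog = max_backlog
        self.history_len = history_len
        self.history = deque()            # buckets of the most recent packets
        self.buckets = [0] * num_buckets  # per-bucket count within the history

    def should_drop(self, flow_hash, queue_len):
        # Below half of the maximum backlog nothing is counted or dropped.
        if queue_len < self.max_backlog // 2:
            return False
        bucket = flow_hash % len(self.buckets)
        # Forget the oldest packet once the history window is full.
        if len(self.history) == self.history_len:
            self.buckets[self.history.popleft()] -= 1
        self.history.append(bucket)
        self.buckets[bucket] += 1
        # Drop when this flow accounts for more than half of the window.
        return self.buckets[bucket] > self.history_len // 2


limiter = FlowLimit()
drops = [limiter.should_drop(flow_hash=1, queue_len=600) for _ in range(129)]
```

In this model a single dominant flow starts losing packets once it fills more than half of the window, while a packet from any other flow is still accepted.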
| 189 | |
| 190 | == Interface |
| 191 | |
| 192 | Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not |
| 193 | turned on. It is implemented for each CPU independently (to avoid lock |
| 194 | and cache contention) and toggled per CPU by setting the relevant bit |
| 195 | in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU |
| 196 | bitmap interface as rps_cpus (see above) when accessed through procfs: |
| 197 | |
| 198 | /proc/sys/net/core/flow_limit_cpu_bitmap |
| 199 | |
| 200 | Per-flow rate is calculated by hashing each packet into a hashtable |
| 201 | bucket and incrementing a per-bucket counter. The hash function is |
| 202 | the same that selects a CPU in RPS, but as the number of buckets can |
| 203 | be much larger than the number of CPUs, flow limit has finer-grained |
| 204 | identification of large flows and fewer false positives. The default |
| 205 | table has 4096 buckets. This value can be modified through sysctl: |
| 206 | |
| 207 | net.core.flow_limit_table_len |
| 208 | |
| 209 | The value is only consulted when a new table is allocated. Modifying |
| 210 | it does not update active tables. |
| 211 | |
| 212 | == Suggested Configuration |
| 213 | |
| 214 | Flow limit is useful on systems with many concurrent connections, |
| 215 | where a single connection taking up 50% of a CPU indicates a problem. |
| 216 | In such environments, enable the feature on all CPUs that handle |
| 217 | network rx interrupts (as set in /proc/irq/N/smp_affinity). |
| 218 | |
| 219 | The feature depends on the input packet queue length exceeding |
| 220 | the flow limit threshold (50%) + the flow history length (256). |
| 221 | Setting net.core.netdev_max_backlog to either 1000 or 10000 |
| 222 | performed well in experiments. |
| 223 | |
| 224 | |
| 225 | RFS: Receive Flow Steering |
| 226 | ========================== |
| 227 | |
| 228 | While RPS steers packets solely based on hash, and thus generally |
| 229 | provides good load distribution, it does not take into account |
| 230 | application locality. This is accomplished by Receive Flow Steering |
| 231 | (RFS). The goal of RFS is to increase data cache hit rate by steering |
| 232 | kernel processing of packets to the CPU where the application thread |
| 233 | consuming the packet is running. RFS relies on the same RPS mechanisms |
| 234 | to enqueue packets onto the backlog of another CPU and to wake up that |
| 235 | CPU. |
| 236 | |
| 237 | In RFS, packets are not forwarded directly by the value of their hash, |
| 238 | but the hash is used as an index into a flow lookup table. This table maps |
| 239 | flows to the CPUs where those flows are being processed. The flow hash |
| 240 | (see RPS section above) is used to calculate the index into this table. |
| 241 | The CPU recorded in each entry is the one which last processed the flow. |
| 242 | If an entry does not hold a valid CPU, then packets mapped to that entry |
| 243 | are steered using plain RPS. Multiple table entries may point to the |
| 244 | same CPU. Indeed, with many flows and few CPUs, it is very likely that |
| 245 | a single application thread handles flows with many different flow hashes. |
| 246 | |
| 247 | rps_sock_flow_table is a global flow table that contains the *desired* CPU |
| 248 | for flows: the CPU that is currently processing the flow in userspace. |
| 249 | Each table value is a CPU index that is updated during calls to recvmsg |
| 250 | and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage() |
| 251 | and tcp_splice_read()). |
| 252 | |
| 253 | When the scheduler moves a thread to a new CPU while it has outstanding |
| 254 | receive packets on the old CPU, packets may arrive out of order. To |
| 255 | avoid this, RFS uses a second flow table to track outstanding packets |
| 256 | for each flow: rps_dev_flow_table is a table specific to each hardware |
| 257 | receive queue of each device. Each table value stores a CPU index and a |
| 258 | counter. The CPU index represents the *current* CPU onto which packets |
| 259 | for this flow are enqueued for further kernel processing. Ideally, kernel |
| 260 | and userspace processing occur on the same CPU, and hence the CPU index |
| 261 | in both tables is identical. This is likely false if the scheduler has |
| 262 | recently migrated a userspace thread while the kernel still has packets |
| 263 | enqueued for kernel processing on the old CPU. |
| 264 | |
| 265 | The counter in rps_dev_flow_table values records the length of the current |
| 266 | CPU's backlog when a packet in this flow was last enqueued. Each backlog |
| 267 | queue has a head counter that is incremented on dequeue. A tail counter |
| 268 | is computed as head counter + queue length. In other words, the counter |
| 269 | in rps_dev_flow[i] records the last element in flow i that has |
| 270 | been enqueued onto the currently designated CPU for flow i (of course, |
| 271 | entry i is actually selected by hash and multiple flows may hash to the |
| 272 | same entry i). |
| 273 | |
| 274 | And now the trick for avoiding out of order packets: when selecting the |
| 275 | CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table |
| 276 | and the rps_dev_flow table of the queue that the packet was received on |
| 277 | are compared. If the desired CPU for the flow (found in the |
| 278 | rps_sock_flow table) matches the current CPU (found in the rps_dev_flow |
| 279 | table), the packet is enqueued onto that CPU’s backlog. If they differ, |
| 280 | the current CPU is updated to match the desired CPU if one of the |
| 281 | following is true: |
| 282 | |
| 283 | - The current CPU's queue head counter >= the recorded tail counter |
| 284 | value in rps_dev_flow[i] |
| 285 | - The current CPU is unset (>= nr_cpu_ids) |
| 286 | - The current CPU is offline |
| 287 | |
| 288 | After this check, the packet is sent to the (possibly updated) current |
| 289 | CPU. These rules aim to ensure that a flow only moves to a new CPU when |
| 290 | there are no packets outstanding on the old CPU, as the outstanding |
| 291 | packets could arrive later than those about to be processed on the new |
| 292 | CPU. |
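The three conditions can be expressed as a short decision function. This is a simplified model, not the body of get_rps_cpu(): the DevFlow class, the use of None for an unset CPU, and the dictionary of per-CPU head counters are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DevFlow:
    """One rps_dev_flow entry: current CPU plus recorded tail counter."""
    cpu: Optional[int] = None  # None models "unset"
    last_qtail: int = 0


def rfs_select_cpu(desired_cpu, flow, head_counters, online_cpus):
    """Return the CPU to enqueue on, moving the flow only when it is safe."""
    current = flow.cpu
    if desired_cpu != current:
        if (current is None                      # current CPU is unset
                or current not in online_cpus    # current CPU is offline
                or head_counters[current] >= flow.last_qtail):  # backlog drained
            flow.cpu = desired_cpu
    return flow.cpu


def record_enqueue(flow, head_counter, queue_len):
    """After enqueueing, remember the tail position: head counter + queue length."""
    flow.last_qtail = head_counter + queue_len


flow = DevFlow(cpu=0, last_qtail=10)
# CPU 0 has only dequeued 5 packets, so packets of this flow may still be queued.
chosen = rfs_select_cpu(1, flow, {0: 5, 1: 0}, {0, 1})
```

In this model the flow stays on CPU 0 until that CPU's head counter reaches the recorded tail, at which point the next packet moves it to the desired CPU.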
| 293 | |
| 294 | ==== RFS Configuration |
| 295 | |
| 296 | RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on |
| 297 | by default for SMP). The functionality remains disabled until explicitly |
| 298 | configured. The number of entries in the global flow table is set through: |
| 299 | |
| 300 | /proc/sys/net/core/rps_sock_flow_entries |
| 301 | |
| 302 | The number of entries in the per-queue flow table is set through: |
| 303 | |
| 304 | /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt |
| 305 | |
| 306 | == Suggested Configuration |
| 307 | |
| 308 | Both of these need to be set before RFS is enabled for a receive queue. |
| 309 | Values for both are rounded up to the nearest power of two. The |
| 310 | suggested flow count depends on the expected number of active connections |
| 311 | at any given time, which may be significantly less than the number of open |
| 312 | connections. We have found that a value of 32768 for rps_sock_flow_entries |
| 313 | works fairly well on a moderately loaded server. |
| 314 | |
| 315 | For a single queue device, the rps_flow_cnt value for the single queue |
| 316 | would normally be configured to the same value as rps_sock_flow_entries. |
| 317 | For a multi-queue device, the rps_flow_cnt for each queue might be |
| 318 | configured as rps_sock_flow_entries / N, where N is the number of |
| 319 | queues. So for instance, if rps_sock_flow_entries is set to 32768 and there |
| 320 | are 16 configured receive queues, rps_flow_cnt for each queue might be |
| 321 | configured as 2048. |
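The sizing rule of thumb is simple arithmetic and can be checked with a few lines. The helper names are hypothetical; the rounding mirrors the statement above that values are rounded up to the nearest power of two.

```python
def round_up_pow2(n):
    """Round n up to the nearest power of two (minimum 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def rps_flow_cnt_per_queue(sock_flow_entries, num_queues):
    """Suggested per-queue rps_flow_cnt: rps_sock_flow_entries / N, rounded up."""
    per_queue = -(-sock_flow_entries // num_queues)  # ceiling division
    return round_up_pow2(per_queue)


flow_cnt = rps_flow_cnt_per_queue(32768, 16)  # the example from the text
```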
| 322 | |
| 323 | |
| 324 | Accelerated RFS |
| 325 | =============== |
| 326 | |
| 327 | Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load |
| 328 | balancing mechanism that uses soft state to steer flows based on where |
| 329 | the application thread consuming the packets of each flow is running. |
| 330 | Accelerated RFS should perform better than RFS since packets are sent |
| 331 | directly to a CPU local to the thread consuming the data. The target CPU |
| 332 | will either be the same CPU where the application runs, or at least a CPU |
| 333 | which is local to the application thread’s CPU in the cache hierarchy. |
| 334 | |
| 335 | To enable accelerated RFS, the networking stack calls the |
| 336 | ndo_rx_flow_steer driver function to communicate the desired hardware |
| 337 | queue for packets matching a particular flow. The network stack |
| 338 | automatically calls this function every time a flow entry in |
| 339 | rps_dev_flow_table is updated. The driver in turn uses a device specific |
| 340 | method to program the NIC to steer the packets. |
| 341 | |
| 342 | The hardware queue for a flow is derived from the CPU recorded in |
| 343 | rps_dev_flow_table. The stack consults a CPU to hardware queue map which |
| 344 | is maintained by the NIC driver. This is an auto-generated reverse map of |
| 345 | the IRQ affinity table shown by /proc/interrupts. Drivers can use |
| 346 | functions in the cpu_rmap (“CPU affinity reverse map”) kernel library |
| 347 | to populate the map. For each CPU, the corresponding queue in the map is |
| 348 | set to be one whose processing CPU is closest in cache locality. |
| 349 | |
| 350 | ==== Accelerated RFS Configuration |
| 351 | |
| 352 | Accelerated RFS is only available if the kernel is compiled with |
| 353 | CONFIG_RFS_ACCEL and support is provided by the NIC device and driver. |
| 354 | It also requires that ntuple filtering is enabled via ethtool. The map |
| 355 | of CPU to queues is automatically deduced from the IRQ affinities |
| 356 | configured for each receive queue by the driver, so no additional |
| 357 | configuration should be necessary. |
| 358 | |
| 359 | == Suggested Configuration |
| 360 | |
| 361 | This technique should be enabled whenever one wants to use RFS and the |
| 362 | NIC supports hardware acceleration. |
| 363 | |
| 364 | XPS: Transmit Packet Steering |
| 365 | ============================= |
| 366 | |
| 367 | Transmit Packet Steering is a mechanism for intelligently selecting |
| 368 | which transmit queue to use when transmitting a packet on a multi-queue |
| 369 | device. This can be accomplished by recording two kinds of maps, either |
| 370 | a mapping of CPU to hardware queue(s) or a mapping of receive queue(s) |
| 371 | to hardware transmit queue(s). |
| 372 | |
| 373 | 1. XPS using CPUs map |
| 374 | |
| 375 | The goal of this mapping is usually to assign queues |
| 376 | exclusively to a subset of CPUs, where the transmit completions for |
| 377 | these queues are processed on a CPU within this set. This choice |
| 378 | provides two benefits. First, contention on the device queue lock is |
| 379 | significantly reduced since fewer CPUs contend for the same queue |
| 380 | (contention can be eliminated completely if each CPU has its own |
| 381 | transmit queue). Secondly, cache miss rate on transmit completion is |
| 382 | reduced, in particular for data cache lines that hold the sk_buff |
| 383 | structures. |
| 384 | |
| 385 | 2. XPS using receive queues map |
| 386 | |
| 387 | This mapping is used to pick transmit queue based on the receive |
| 388 | queue(s) map configuration set by the administrator. A set of receive |
| 389 | queues can be mapped to a set of transmit queues (many:many), although |
| 390 | the common use case is a 1:1 mapping. This will enable sending packets |
| 391 | on the same queue associations for transmit and receive. This is useful for |
| 392 | busy polling multi-threaded workloads where there are challenges in |
| 393 | associating a given CPU to a given application thread. The application |
| 394 | threads are not pinned to CPUs and each thread handles packets |
| 395 | received on a single queue. The receive queue number is cached in the |
| 396 | socket for the connection. In this model, sending the packets on the same |
| 397 | transmit queue corresponding to the associated receive queue has benefits |
| 398 | in keeping the CPU overhead low. Transmit completion work is locked into |
| 399 | the same queue-association that a given application is polling on. This |
| 400 | avoids the overhead of triggering an interrupt on another CPU. When the |
| 401 | application cleans up the packets during the busy poll, transmit completion |
| 402 | may be processed along with it in the same thread context and so result in |
| 403 | reduced latency. |
| 404 | |
| 405 | XPS is configured per transmit queue by setting a bitmap of |
| 406 | CPUs/receive-queues that may use that queue to transmit. The reverse |
| 407 | mapping, from CPUs to transmit queues or from receive-queues to transmit |
| 408 | queues, is computed and maintained for each network device. When |
| 409 | transmitting the first packet in a flow, the function get_xps_queue() is |
| 410 | called to select a queue. This function uses the ID of the receive queue |
| 411 | for the socket connection for a match in the receive queue-to-transmit queue |
| 412 | lookup table. Alternatively, this function can also use the ID of the |
| 413 | running CPU as a key into the CPU-to-queue lookup table. If the |
| 414 | ID matches a single queue, that is used for transmission. If multiple |
| 415 | queues match, one is selected by using the flow hash to compute an index |
| 416 | into the set. When selecting the transmit queue based on receive queue(s) |
| 417 | map, the transmit device is not validated against the receive device, as that |
| 418 | would require an expensive lookup operation in the datapath. |
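The selection order described above can be sketched as follows. This is an illustrative model rather than the kernel's get_xps_queue(): the function name, the dictionary-based maps and the modulo step used to index into the candidate set are assumptions for the example.

```python
def xps_select_queue(flow_hash, cpu, rx_queue, cpu_map, rxq_map):
    """Return a transmit queue, or None when XPS has no mapping that applies.

    cpu_map maps a CPU to its list of transmit queues; rxq_map maps a
    receive queue to its list of transmit queues.
    """
    candidates = None
    # Prefer the receive-queue map when the socket has a recorded rx queue.
    if rx_queue is not None:
        candidates = rxq_map.get(rx_queue)
    # Otherwise (or if nothing matched) fall back to the CPU map.
    if not candidates:
        candidates = cpu_map.get(cpu)
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # Several queues match: use the flow hash to pick one of them.
    return candidates[flow_hash % len(candidates)]


cpu_map = {0: [0], 1: [1, 2]}
rxq_map = {3: [2]}
txq = xps_select_queue(flow_hash=5, cpu=1, rx_queue=None,
                       cpu_map=cpu_map, rxq_map=rxq_map)
```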
| 419 | |
| 420 | The queue chosen for transmitting a particular flow is saved in the |
| 421 | corresponding socket structure for the flow (e.g. a TCP connection). |
| 422 | This transmit queue is used for subsequent packets sent on the flow to |
| 423 | prevent out of order (ooo) packets. The choice also amortizes the cost |
| 424 | of calling get_xps_queue() over all packets in the flow. To avoid |
| 425 | ooo packets, the queue for a flow can subsequently only be changed if |
| 426 | skb->ooo_okay is set for a packet in the flow. This flag indicates that |
| 427 | there are no outstanding packets in the flow, so the transmit queue can |
| 428 | change without the risk of generating out of order packets. The |
| 429 | transport layer is responsible for setting ooo_okay appropriately. TCP, |
| 430 | for instance, sets the flag when all data for a connection has been |
| 431 | acknowledged. |
| 432 | |
| 433 | ==== XPS Configuration |
| 434 | |
| 435 | XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by |
| 436 | default for SMP). The functionality remains disabled until explicitly |
| 437 | configured. To enable XPS, the bitmap of CPUs/receive-queues that may |
| 438 | use a transmit queue is configured using the sysfs file entry: |
| 439 | |
| 440 | For selection based on CPUs map: |
| 441 | /sys/class/net/<dev>/queues/tx-<n>/xps_cpus |
| 442 | |
| 443 | For selection based on receive-queues map: |
| 444 | /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs |
| 445 | |
| 446 | == Suggested Configuration |
| 447 | |
| 448 | For a network device with a single transmission queue, XPS configuration |
| 449 | has no effect, since there is no choice in this case. In a multi-queue |
| 450 | system, XPS is preferably configured so that each CPU maps onto one queue. |
| 451 | If there are as many queues as there are CPUs in the system, then each |
| 452 | queue can also map onto one CPU, resulting in exclusive pairings that |
| 453 | experience no contention. If there are fewer queues than CPUs, then the |
| 454 | best CPUs to share a given queue are probably those that share the cache |
| 455 | with the CPU that processes transmit completions for that queue |
| 456 | (transmit interrupts). |
| 457 | |
| 458 | For transmit queue selection based on receive queue(s), XPS has to be |
| 459 | explicitly configured mapping receive-queue(s) to transmit queue(s). If the |
| 460 | user configuration for receive-queue map does not apply, then the transmit |
| 461 | queue is selected based on the CPUs map. |
| 462 | |
| 463 | Per TX Queue rate limitation: |
| 464 | ============================= |
| 465 | |
| 466 | These are rate-limitation mechanisms implemented by HW, where currently |
| 467 | a max-rate attribute is supported, by setting a Mbps value to |
| 468 | |
| 469 | /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate |
| 470 | |
| 471 | A value of zero means disabled, and this is the default. |
| 472 | |
| 473 | Further Information |
| 474 | =================== |
| 475 | RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into |
| 476 | 2.6.38. Original patches were submitted by Tom Herbert |
| 477 | (therbert@google.com) |
| 478 | |
| 479 | Accelerated RFS was introduced in 2.6.35. Original patches were |
| 480 | submitted by Ben Hutchings (bwh@kernel.org) |
| 481 | |
| 482 | Authors: |
| 483 | Tom Herbert (therbert@google.com) |
| 484 | Willem de Bruijn (willemb@google.com) |