.. SPDX-License-Identifier: GPL-2.0

=================
Checksum Offloads
=================


Introduction
============

This document describes a set of techniques in the Linux networking stack to
take advantage of checksum offload capabilities of various NICs.

The following technologies are described:

* TX Checksum Offload
* LCO: Local Checksum Offload
* RCO: Remote Checksum Offload

Things that should be documented here but aren't yet:

* RX Checksum Offload
* CHECKSUM_UNNECESSARY conversion


TX Checksum Offload
===================

The interface for offloading a transmit checksum to a device is explained in
detail in comments near the top of include/linux/skbuff.h.

In brief, it allows the stack to request that the device fill in a single
ones-complement checksum defined by the sk_buff fields skb->csum_start and
skb->csum_offset. The device should compute the 16-bit ones-complement
checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the
packet, and fill in the result at (csum_start + csum_offset).

Because csum_offset cannot be negative, the previous value of the checksum
field is always included in the checksum computation, so it can be used to
supply any needed corrections to the checksum (such as the sum of the
pseudo-header for UDP or TCP).
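
The device's side of this contract can be modelled in a few lines of userspace
C. The following is a minimal sketch, not kernel code: oc_sum(),
model_tx_csum() and tx_csum_demo() are hypothetical names, the packet is a
hand-built UDP datagram that begins at csum_start, and the UDP rule that a
computed checksum of zero is sent as 0xffff is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones-complement sum of len bytes added to 'sum', folded to 16 bits.
 * Words are read big-endian; an odd trailing byte is zero-padded.
 */
static uint16_t oc_sum(const uint8_t *p, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)p[i] << 8 | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Hypothetical model of what the device does for one packet. */
static void model_tx_csum(uint8_t *pkt, size_t len,
			  size_t csum_start, size_t csum_offset)
{
	uint16_t csum;

	assert(csum_start + csum_offset + 2 <= len);
	/* The seed already in the checksum field is part of this sum. */
	csum = (uint16_t)~oc_sum(pkt + csum_start, len - csum_start, 0);
	pkt[csum_start + csum_offset] = csum >> 8;
	pkt[csum_start + csum_offset + 1] = csum & 0xff;
}

/* Returns the receiver's verification sum, which should be 0xffff. */
uint16_t tx_csum_demo(void)
{
	/* UDP header (checksum field at offset 6) plus 4 payload bytes */
	uint8_t udp[12] = { 0x30, 0x39, 0x00, 0x35, 0x00, 0x0c, 0x00, 0x00,
			    'p', 'i', 'n', 'g' };
	/* IPv4 pseudo-header: src, dst, zero, protocol 17, UDP length 12 */
	const uint8_t pseudo[12] = { 192, 0, 2, 1, 192, 0, 2, 2,
				     0, 17, 0x00, 0x0c };
	uint16_t seed = oc_sum(pseudo, sizeof(pseudo), 0);

	/* The stack seeds the field with the pseudo-header sum ... */
	udp[6] = seed >> 8;
	udp[7] = seed & 0xff;
	/* ... and the device sums from csum_start (0 here) to the end. */
	model_tx_csum(udp, sizeof(udp), 0, 6);

	return oc_sum(udp, sizeof(udp), seed);
}
```

Calling tx_csum_demo() from a test program should yield 0xffff, which is what
a receiver expects when it sums the pseudo-header together with a correctly
checksummed datagram.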

This interface only allows a single checksum to be offloaded. Where
encapsulation is used, the packet may have multiple checksum fields in
different header layers, and the rest will have to be handled by another
mechanism such as LCO or RCO.

CRC32c can also be offloaded using this interface, by means of filling
skb->csum_start and skb->csum_offset as described above, and setting
skb->csum_not_inet: see the skbuff.h comment (section 'D') for more details.

No offloading of the IP header checksum is performed; it is always done in
software. This is OK because when we build the IP header, we obviously have it
in cache, so summing it isn't expensive. It's also rather short.

The requirements for GSO are more complicated, because when segmenting an
encapsulated packet both the inner and outer checksums may need to be edited or
recomputed for each resulting segment. See the skbuff.h comment (section 'E')
for more details.

A driver declares its offload capabilities in netdev->hw_features; see
Documentation/networking/netdev-features.rst for more. Note that a device
which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start and
csum_offset given in the SKB; if it tries to deduce these itself in hardware
(as some NICs do) the driver should check that the values in the SKB match
those which the hardware will deduce, and if not, fall back to checksumming in
software instead (with skb_csum_hwoffload_help() or one of the
skb_checksum_help() / skb_crc32c_csum_help() functions, as mentioned in
include/linux/skbuff.h).
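
As an illustration of the check described above, the following userspace
sketch models a NIC that deduces the checksum location itself for plain TCP or
UDP over IPv4. struct model_skb, hw_deduce(), can_offload() and
driver_check_demo() are hypothetical names, and the parsing rules are an
assumption about one possible device, not a description of any real hardware.
In a real driver, a failed match would lead to one of the software helpers
named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few sk_buff fields this check needs. */
struct model_skb {
	const uint8_t *data;     /* start of the frame */
	size_t network_offset;   /* offset of the IPv4 header */
	size_t csum_start;       /* requested by the stack */
	size_t csum_offset;      /* requested by the stack */
};

/* Where an assumed self-parsing NIC would place the checksum.
 * Returns false if it cannot parse the packet at all.
 */
static bool hw_deduce(const struct model_skb *skb, size_t *start,
		      size_t *offset)
{
	const uint8_t *ip;

	assert(skb->data != NULL);
	ip = skb->data + skb->network_offset;
	if ((ip[0] >> 4) != 4)
		return false;
	*start = skb->network_offset + (size_t)(ip[0] & 0x0f) * 4;
	switch (ip[9]) {
	case 6:			/* TCP: checksum at offset 16 */
		*offset = 16;
		return true;
	case 17:		/* UDP: checksum at offset 6 */
		*offset = 6;
		return true;
	default:
		return false;
	}
}

/* True if the hardware's guess matches what the stack asked for. */
static bool can_offload(const struct model_skb *skb)
{
	size_t start, offset;

	return hw_deduce(skb, &start, &offset) &&
	       start == skb->csum_start && offset == skb->csum_offset;
}

/* Returns 1 if plain UDP is offloadable and a tunnelled inner
 * checksum is not.
 */
int driver_check_demo(void)
{
	uint8_t frame[128] = { 0 };
	struct model_skb skb = { .data = frame, .network_offset = 14 };
	bool plain, tunnelled;

	frame[14] = 0x45;	/* IPv4, header length 5 words */
	frame[14 + 9] = 17;	/* protocol: UDP */

	skb.csum_start = 34;	/* 14 + 20: the UDP header */
	skb.csum_offset = 6;
	plain = can_offload(&skb);

	skb.csum_start = 84;	/* an inner header further into the frame */
	tunnelled = can_offload(&skb);

	return plain && !tunnelled;
}
```

The second case is the interesting one: the stack asked for an inner checksum,
the modelled hardware would have filled in the outer one, and so the driver
has to checksum in software rather than hand the frame over unchanged.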

The stack should, for the most part, assume that checksum offload is supported
by the underlying device. The only place that should check is
validate_xmit_skb(), and the functions it calls directly or indirectly. That
function compares the offload features requested by the SKB (which may include
other offloads besides TX Checksum Offload) with those supported and enabled on
the device (determined by netdev->features) and, if any are missing, performs
the corresponding offload in software. In the case of TX Checksum Offload, that
means calling skb_csum_hwoffload_help(skb, features).


LCO: Local Checksum Offload
===========================

LCO is a technique for efficiently computing the outer checksum of an
encapsulated datagram when the inner checksum is due to be offloaded.

The ones-complement sum of a correctly checksummed TCP or UDP packet is equal
to the complement of the sum of the pseudo header, because everything else gets
'cancelled out' by the checksum field. This is because the sum was
complemented before being written to the checksum field.

More generally, this holds in any case where the 'IP-style' ones-complement
checksum is used, and thus any checksum that TX Checksum Offload supports.

That is, if we have set up TX Checksum Offload with a start/offset pair, we
know that after the device has filled in that checksum, the ones-complement sum
from csum_start to the end of the packet will be equal to the complement of
whatever value we put in the checksum field beforehand. This allows us to
compute the outer checksum without looking at the payload: we simply stop
summing when we get to csum_start, then add the complement of the 16-bit word
at (csum_start + csum_offset).

Then, when the true inner checksum is filled in (either by hardware or by
skb_checksum_help()), the outer checksum will become correct by virtue of the
arithmetic.
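
The arithmetic can be checked with a small userspace model. This is a sketch
under stated assumptions, not the kernel's lco_csum(): fold(), oc_sum(),
put16(), model_lco_sum() and lco_demo() are hypothetical names, the packet
layout is simplified (an outer UDP header, eight bytes standing in for the
tunnel and inner network headers, then an inner UDP datagram), and the two
pseudo-header sums are arbitrary stand-in values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones-complement sum. */
static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones-complement sum of an even number of bytes, big-endian words. */
static uint16_t oc_sum(const uint8_t *p, size_t len, uint32_t sum)
{
	size_t i;

	assert((len & 1) == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)p[i] << 8 | p[i + 1];
	return fold(sum);
}

static void put16(uint8_t *p, uint16_t v)
{
	p[0] = v >> 8;
	p[1] = v & 0xff;
}

/* Sum the headers from l4 up to csum_start, then add the complement of
 * the seed currently sitting in the inner checksum field.
 */
static uint16_t model_lco_sum(const uint8_t *pkt, size_t l4,
			      size_t csum_start, size_t csum_offset)
{
	const uint8_t *field = pkt + csum_start + csum_offset;
	uint16_t seed = (uint16_t)(field[0] << 8 | field[1]);

	return oc_sum(pkt + l4, csum_start - l4, (uint16_t)~seed);
}

/* Returns the receiver's sum over the outer datagram; expect 0xffff. */
uint16_t lco_demo(void)
{
	enum { L4 = 0, CSUM_START = 16, CSUM_OFFSET = 6, LEN = 28 };
	const uint16_t outer_pseudo = 0x8d5e;	/* stand-in value */
	uint8_t pkt[LEN] = {
		/* outer UDP header, checksum field (bytes 6-7) zero */
		0xc0, 0x00, 0x12, 0xb5, 0x00, 0x1c, 0x00, 0x00,
		/* stand-in for tunnel and inner network headers */
		0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x2a, 0x00,
		/* inner UDP header, checksum field seeded with 0x1234 */
		0x30, 0x39, 0x00, 0x35, 0x00, 0x0c, 0x12, 0x34,
		/* inner payload */
		'p', 'i', 'n', 'g',
	};
	uint16_t outer;

	/* Outer checksum via LCO: the payload is never read. */
	outer = (uint16_t)~fold((uint32_t)outer_pseudo +
				model_lco_sum(pkt, L4, CSUM_START, CSUM_OFFSET));
	put16(pkt + L4 + 6, outer);

	/* Later, the device fills in the true inner checksum. */
	put16(pkt + CSUM_START + CSUM_OFFSET,
	      (uint16_t)~oc_sum(pkt + CSUM_START, LEN - CSUM_START, 0));

	return oc_sum(pkt + L4, LEN - L4, outer_pseudo);
}
```

The outer checksum is written before the inner one exists, yet the final sum
over the whole outer datagram still comes out as 0xffff once the inner
checksum has been filled in.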

LCO is performed by the stack when constructing an outer UDP header for an
encapsulation such as VXLAN or GENEVE, in udp_set_csum(). The IPv6 equivalents
are handled similarly, in udp6_set_csum().

It is also performed when constructing an IPv4 GRE header, in
net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
constructing an IPv6 GRE header; the GRE checksum is computed over the whole
packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be possible to use
LCO here as IPv6 GRE still uses an IP-style checksum.

All of the LCO implementations use a helper function lco_csum(), in
include/linux/skbuff.h.

LCO can safely be used for nested encapsulations; in this case, the outer
encapsulation layer will sum over both its own header and the 'middle' header.
This does mean that the 'middle' header will get summed multiple times, but
there doesn't seem to be a way to avoid that without incurring bigger costs
(e.g. in SKB bloat).


RCO: Remote Checksum Offload
============================

RCO is a technique for eliding the inner checksum of an encapsulated datagram,
allowing the outer checksum to be offloaded. It does, however, involve a
change to the encapsulation protocols, which the receiver must also support.
For this reason, it is disabled by default.

RCO is detailed in the following Internet-Drafts:

* https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
* https://tools.ietf.org/html/draft-herbert-vxlan-rco-00

In Linux, RCO is implemented individually in each encapsulation protocol, and
most tunnel types have flags controlling its use. For instance, VXLAN has the
flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that RCO should be
used when transmitting to a given remote destination.