.. SPDX-License-Identifier: GPL-2.0

============================
BPF_PROG_TYPE_FLOW_DISSECTOR
============================

Overview
========

The flow dissector is a routine that parses metadata out of packets. It is
used in various places in the networking subsystem (RFS, flow hash, etc).

The BPF flow dissector is an attempt to reimplement the C-based flow dissector
logic in BPF to gain all the benefits of the BPF verifier (namely, limits on
the number of instructions and tail calls).

API
===

BPF flow dissector programs operate on an ``__sk_buff``. However, only a
limited set of fields is allowed: ``data``, ``data_end`` and ``flow_keys``.
``flow_keys`` is a ``struct bpf_flow_keys`` and contains the flow dissector
input and output arguments.

The inputs are:

* ``nhoff`` - initial offset of the networking header
* ``thoff`` - initial offset of the transport header, initialized to ``nhoff``
* ``n_proto`` - L3 protocol type, parsed out of the L2 header
* ``flags`` - optional flags

The flow dissector BPF program should fill out the rest of the ``struct
bpf_flow_keys`` fields. The input arguments ``nhoff``, ``thoff`` and
``n_proto`` should also be adjusted accordingly.

The return code of the BPF program is either ``BPF_OK`` to indicate successful
dissection, or ``BPF_DROP`` to indicate a parsing error.

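The input/output contract above can be illustrated with a small userspace
model. This is only a sketch and not a loadable BPF program:
``struct model_flow_keys``, ``model_dissect``, ``MODEL_OK`` and ``MODEL_DROP``
are hypothetical stand-ins for ``struct bpf_flow_keys``, the program entry
point, ``BPF_OK`` and ``BPF_DROP``. It assumes IPv4 only and keeps ``n_proto``
in host byte order for readability.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Hypothetical stand-ins for BPF_OK / BPF_DROP. */
  enum { MODEL_OK, MODEL_DROP };

  /* Hypothetical subset of struct bpf_flow_keys (not the real layout). */
  struct model_flow_keys {
      uint16_t nhoff;     /* in: network header offset */
      uint16_t thoff;     /* in: equals nhoff; out: transport header offset */
      uint16_t n_proto;   /* in: L3 protocol, host byte order in this model */
      uint8_t  ip_proto;  /* out: L4 protocol number */
      uint32_t ipv4_src;  /* out: source address, bytes as seen on the wire */
      uint32_t ipv4_dst;  /* out: destination address, bytes as on the wire */
  };

  /* Fill the output fields from an IPv4 header located at data + nhoff. */
  static int model_dissect(const uint8_t *data, size_t len,
                           struct model_flow_keys *keys)
  {
      const uint8_t *ip;
      size_t ihl;

      if (keys->n_proto != 0x0800)          /* this sketch handles IPv4 only */
          return MODEL_DROP;
      if ((size_t)keys->nhoff + 20 > len)   /* mirrors a data_end bounds check */
          return MODEL_DROP;

      ip = data + keys->nhoff;
      ihl = (size_t)(ip[0] & 0x0f) * 4;     /* header length in bytes */
      if ((ip[0] >> 4) != 4 || ihl < 20 || (size_t)keys->nhoff + ihl > len)
          return MODEL_DROP;

      keys->ip_proto = ip[9];
      memcpy(&keys->ipv4_src, ip + 12, 4);
      memcpy(&keys->ipv4_dst, ip + 16, 4);
      keys->thoff = (uint16_t)(keys->nhoff + ihl);  /* adjust the input field */
      return MODEL_OK;
  }

Calling ``model_dissect`` on a 34-byte frame (14 bytes of Ethernet header
followed by a 20-byte IPv4 header) with ``nhoff = thoff = 14`` leaves
``thoff`` at 34; a truncated buffer yields ``MODEL_DROP``.
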
__sk_buff->data
===============

In the VLAN-less case, this is what the initial state of the BPF flow
dissector looks like::

  +------+------+------------+-----------+
  | DMAC | SMAC | ETHER_TYPE | L3_HEADER |
  +------+------+------------+-----------+
                              ^
                              |
                              +-- flow dissector starts here


.. code:: c

  skb->data + flow_keys->nhoff point to the first byte of L3_HEADER
  flow_keys->thoff = nhoff
  flow_keys->n_proto = ETHER_TYPE

In the VLAN case, the flow dissector can be called in two different states.

Pre-VLAN parsing::

  +------+------+------+-----+-----------+-----------+
  | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
  +------+------+------+-----+-----------+-----------+
                        ^
                        |
                        +-- flow dissector starts here

.. code:: c

  skb->data + flow_keys->nhoff point to the first byte of TCI
  flow_keys->thoff = nhoff
  flow_keys->n_proto = TPID

Please note that TPID can be 802.1AD and, hence, the BPF program would
have to parse VLAN information twice for double-tagged packets.


Post-VLAN parsing::

  +------+------+------+-----+-----------+-----------+
  | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
  +------+------+------+-----+-----------+-----------+
                                          ^
                                          |
                                          +-- flow dissector starts here

.. code:: c

  skb->data + flow_keys->nhoff point to the first byte of L3_HEADER
  flow_keys->thoff = nhoff
  flow_keys->n_proto = ETHER_TYPE

In this case the VLAN information has been processed before the flow dissector
is called, and the BPF flow dissector is not required to handle it.


The takeaway here is as follows: a BPF flow dissector program can be called
with or without a VLAN header left to parse and should handle both cases
gracefully: when a single or double VLAN tag is present and when it is not
present. The same program can be called for both cases and has to be written
carefully to handle both.


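One way to cover both entry states is to test ``n_proto`` for a VLAN TPID
before looking at L3. The following userspace model is a sketch of that idea
only: ``struct model_state`` and ``model_skip_vlan`` are hypothetical names,
``n_proto`` is kept in host byte order, and rejecting a third tag is an
assumption of this sketch. A plain bounded loop is used for brevity.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  #define MODEL_ETH_P_8021Q  0x8100  /* single tag TPID */
  #define MODEL_ETH_P_8021AD 0x88A8  /* outer tag TPID for double tagging */

  /* Hypothetical subset of the dissector state. */
  struct model_state {
      uint16_t nhoff;    /* pre-VLAN: offset of TCI; otherwise offset of L3 */
      uint16_t thoff;
      uint16_t n_proto;  /* TPID in the pre-VLAN state, else ETHER_TYPE */
  };

  static int model_is_vlan(uint16_t proto)
  {
      return proto == MODEL_ETH_P_8021Q || proto == MODEL_ETH_P_8021AD;
  }

  /*
   * Skip up to two VLAN tags. Returns 0 on success and -1 on a truncated
   * packet or a third tag. When n_proto is not a VLAN TPID (VLAN-less or
   * post-VLAN state) the state is left untouched.
   */
  static int model_skip_vlan(const uint8_t *data, size_t len,
                             struct model_state *st)
  {
      int tags = 0;

      while (model_is_vlan(st->n_proto)) {
          const uint8_t *vlan;

          if (tags == 2)                     /* assumption: no triple tagging */
              return -1;
          if ((size_t)st->nhoff + 4 > len)   /* TCI + encapsulated protocol */
              return -1;

          vlan = data + st->nhoff;
          st->n_proto = (uint16_t)((vlan[2] << 8) | vlan[3]);
          st->nhoff += 4;
          st->thoff = st->nhoff;
          tags++;
      }
      return 0;
  }

For a double-tagged frame entered in the pre-VLAN state (``nhoff = 14``,
``n_proto = 0x88A8``) the model ends with ``nhoff = 22`` and ``n_proto`` set
to the inner ETHER_TYPE, which is the same state the post-VLAN entry starts
from.
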
Flags
=====

``flow_keys->flags`` might contain optional input flags that work as follows:

* ``BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG`` - tells the BPF flow dissector to
  continue parsing the first fragment; the default expected behavior is that
  the flow dissector returns as soon as it finds out that the packet is
  fragmented; used by ``eth_get_headlen`` to estimate the length of all
  headers for GRO.
* ``BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL`` - tells the BPF flow dissector
  to stop parsing as soon as it reaches the IPv6 flow label; used by
  ``___skb_get_hash`` and ``__skb_get_hash_symmetric`` to get the flow hash.
* ``BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP`` - tells the BPF flow dissector to
  stop parsing as soon as it reaches encapsulated headers; used by the routing
  infrastructure.


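The fragment handling described for the first flag can be modelled in
userspace as below. This is a hedged sketch: ``MODEL_F_PARSE_1ST_FRAG``,
``struct model_frag_info`` and ``model_continue_past_ipv4`` are hypothetical
names, the flag value is local to the example, and ``frag_off`` is assumed to
be the IPv4 fragment field already converted to host byte order.

.. code:: c

  #include <stdint.h>

  /* Hypothetical flag bit; stands in for BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG. */
  #define MODEL_F_PARSE_1ST_FRAG (1u << 0)

  #define MODEL_IP_MF     0x2000u  /* "more fragments" bit */
  #define MODEL_IP_OFFSET 0x1FFFu  /* fragment offset mask */

  struct model_frag_info {
      int is_frag;        /* packet is a fragment */
      int is_first_frag;  /* fragment offset is zero */
  };

  /* Returns 1 if parsing should continue into L4, 0 if it should stop. */
  static int model_continue_past_ipv4(uint16_t frag_off, uint32_t flags,
                                      struct model_frag_info *info)
  {
      info->is_frag = 0;
      info->is_first_frag = 0;

      if (!(frag_off & (MODEL_IP_MF | MODEL_IP_OFFSET)))
          return 1;                  /* not fragmented: keep parsing */

      info->is_frag = 1;
      if (frag_off & MODEL_IP_OFFSET)
          return 0;                  /* later fragment: no L4 header here */

      info->is_first_frag = 1;
      /* Default is to stop; continue only when explicitly asked to. */
      return (flags & MODEL_F_PARSE_1ST_FRAG) != 0;
  }

With ``frag_off = 0x2000`` (first fragment) the model stops by default and
continues only when the flag is set; any non-zero fragment offset stops
parsing regardless of the flag.
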
Reference Implementation
========================

See ``tools/testing/selftests/bpf/progs/bpf_flow.c`` for the reference
implementation and ``tools/testing/selftests/bpf/flow_dissector_load.[hc]``
for the loader. bpftool can be used to load a BPF flow dissector program as
well.

The reference implementation is organized as follows:

* ``jmp_table`` map that contains sub-programs for each supported L3 protocol
* ``_dissect`` routine - entry point; it parses the input ``n_proto`` and
  does a ``bpf_tail_call`` to the appropriate L3 handler

Since BPF at this point doesn't support looping (or any jumping back),
``jmp_table`` is used instead to handle multiple levels of encapsulation (and
IPv6 options).


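The entry-point-plus-table structure can be pictured with a userspace
analogue built on function pointers. This is a sketch under loud assumptions:
all ``model_*`` names are hypothetical, the handlers only advance ``nhoff`` by
a fixed header size, and a function pointer call returns to its caller whereas
a real ``bpf_tail_call`` does not.

.. code:: c

  #include <stdint.h>

  /* Hypothetical stand-ins for BPF_OK / BPF_DROP. */
  enum { MODEL_OK, MODEL_DROP };

  /* Hypothetical table slots, one per supported L3 protocol. */
  enum model_slot { MODEL_SLOT_IPV4, MODEL_SLOT_IPV6, MODEL_SLOT_COUNT };

  typedef int (*model_handler)(uint16_t *nhoff);

  /* Assumes an IPv4 header without options. */
  static int model_ipv4(uint16_t *nhoff)
  {
      *nhoff += 20;
      return MODEL_OK;
  }

  /* Assumes the fixed IPv6 header only, no extension headers. */
  static int model_ipv6(uint16_t *nhoff)
  {
      *nhoff += 40;
      return MODEL_OK;
  }

  /* Plays the role of the jmp_table map. */
  static const model_handler model_jmp_table[MODEL_SLOT_COUNT] = {
      [MODEL_SLOT_IPV4] = model_ipv4,
      [MODEL_SLOT_IPV6] = model_ipv6,
  };

  /* Plays the role of the entry point: map n_proto to a slot and dispatch. */
  static int model_entry(uint16_t n_proto, uint16_t *nhoff)
  {
      switch (n_proto) {
      case 0x0800:
          return model_jmp_table[MODEL_SLOT_IPV4](nhoff);
      case 0x86DD:
          return model_jmp_table[MODEL_SLOT_IPV6](nhoff);
      default:
          return MODEL_DROP;  /* unsupported L3 protocol */
      }
  }

A handler that meets another encapsulation level could dispatch through the
same table again, which is how the table substitutes for a loop.
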
Current Limitations
===================

The BPF flow dissector doesn't support exporting all the metadata that the
in-kernel C-based implementation can export. A notable example is single VLAN
(802.1Q) and double VLAN (802.1AD) tags. Please refer to ``struct
bpf_flow_keys`` for the set of information that can currently be exported from
the BPF context.

When a BPF flow dissector is attached to the root network namespace
(machine-wide policy), users can't override it in their child network
namespaces.