.. SPDX-License-Identifier: GPL-2.0

============================
BPF_PROG_TYPE_CGROUP_SOCKOPT
============================

The ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` program type can be attached to two
cgroup hooks:

* ``BPF_CGROUP_GETSOCKOPT`` - called every time a process executes the
  ``getsockopt`` system call.
* ``BPF_CGROUP_SETSOCKOPT`` - called every time a process executes the
  ``setsockopt`` system call.

The context (``struct bpf_sockopt``) carries the associated socket (``sk``)
and all input arguments: ``level``, ``optname``, ``optval`` and ``optlen``.

BPF_CGROUP_SETSOCKOPT
=====================

``BPF_CGROUP_SETSOCKOPT`` is triggered *before* the kernel handling of
sockopt and it has a writable context: it can modify the supplied arguments
before passing them down to the kernel. This hook has access to the cgroup
and socket local storage.

If a BPF program sets ``optlen`` to -1, control is returned to userspace
after all other BPF programs in the cgroup chain finish (i.e. the kernel
``setsockopt`` handling will *not* be executed).

Note that ``optlen`` cannot be increased beyond the user-supplied
value. It can only be decreased or set to -1. Any other value will
trigger ``EFAULT``.
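
The ``optlen`` rules above can be modelled in plain userspace C. This is
only a sketch of the behaviour as described in this document, not the kernel
implementation; the function and enum names are hypothetical.

.. code-block:: c

    #include <errno.h>

    /* Possible outcomes once the BPF programs have run (hypothetical names). */
    enum setsockopt_action {
        RUN_KERNEL_HANDLER,   /* pass the (possibly modified) arguments on */
        SKIP_KERNEL_HANDLER,  /* optlen == -1: return to userspace directly */
    };

    /*
     * Model of the optlen rule: returns 0 and fills *action on success,
     * or -EFAULT when the program left optlen outside the allowed range.
     */
    static int check_setsockopt_optlen(int user_optlen, int bpf_optlen,
                                       enum setsockopt_action *action)
    {
        if (bpf_optlen == -1) {
            *action = SKIP_KERNEL_HANDLER;
            return 0;
        }
        /* Neither below -1 nor above the user-supplied length is allowed. */
        if (bpf_optlen < -1 || bpf_optlen > user_optlen)
            return -EFAULT;
        *action = RUN_KERNEL_HANDLER;
        return 0;
    }

With a user-supplied ``optlen`` of 8, a program may leave 8, any smaller
non-negative value, or -1; in this model 9 and -2 both yield ``-EFAULT``.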

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to userspace.
* ``1`` - success, continue with the next BPF program in the cgroup chain.
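
The following sketch shows the shape of a ``setsockopt`` handler that uses
both return values and the ``optlen = -1`` bypass. To keep it compilable as
ordinary C, it uses a simplified stand-in for ``struct bpf_sockopt``;
``struct sockopt_ctx``, ``EXAMPLE_LEVEL`` and ``stored_value`` are
assumptions made for illustration and do not exist in the kernel.

.. code-block:: c

    #define EXAMPLE_LEVEL 1000  /* hypothetical level handled only in BPF */

    /* Simplified stand-in for struct bpf_sockopt (only the fields used). */
    struct sockopt_ctx {
        int level;
        int optname;
        int optlen;
        unsigned char *optval;
        unsigned char *optval_end;
    };

    static unsigned char stored_value;  /* stand-in for socket local storage */

    /* Returns 0 to reject (EPERM) or 1 to continue down the chain. */
    static int handle_setsockopt(struct sockopt_ctx *ctx)
    {
        if (ctx->level != EXAMPLE_LEVEL)
            return 1;  /* not ours: leave the arguments untouched */

        if (ctx->optval + 1 > ctx->optval_end)
            return 0;  /* need at least one byte of optval: reject */

        stored_value = ctx->optval[0];
        ctx->optlen = -1;  /* consumed here: skip kernel setsockopt handling */
        return 1;
    }

A real program would be compiled for the BPF target and would take
``struct bpf_sockopt *`` as its context instead.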

BPF_CGROUP_GETSOCKOPT
=====================

``BPF_CGROUP_GETSOCKOPT`` is triggered *after* the kernel handling of
sockopt. The BPF hook can observe ``optval``, ``optlen`` and ``retval``
if it's interested in whatever the kernel has returned. The BPF hook can
override the values above, adjust ``optlen`` and reset ``retval`` to 0. If
``optlen`` has been increased above the initial ``getsockopt`` value (i.e.
the userspace buffer is too small), ``EFAULT`` is returned.

This hook has access to the cgroup and socket local storage.

Note that the only acceptable values to set ``retval`` to are 0 and the
original value that the kernel returned. Any other value will trigger
``EFAULT``.
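
The two ``EFAULT`` conditions above can be summarised in a small userspace
model. It is a sketch of the rules as stated here, not the kernel code; the
function name is hypothetical, and the rejection of negative lengths is an
assumption that this document does not spell out.

.. code-block:: c

    #include <errno.h>

    /*
     * Model of the checks applied after a getsockopt BPF program returns.
     * Returns the value the syscall would return, or -EFAULT.
     */
    static int check_getsockopt_result(int initial_optlen, int kernel_retval,
                                       int bpf_optlen, int bpf_retval)
    {
        /* optlen must still fit the buffer userspace passed in. */
        if (bpf_optlen > initial_optlen)
            return -EFAULT;
        /* Assumption: a negative length is rejected as well. */
        if (bpf_optlen < 0)
            return -EFAULT;
        /* retval may only be reset to 0 or left at the kernel's value. */
        if (bpf_retval != 0 && bpf_retval != kernel_retval)
            return -EFAULT;
        return bpf_retval;
    }

For instance, if the kernel returned ``-ENOPROTOOPT``, a program may keep
that value or reset it to 0, but replacing it with ``-EINVAL`` is rejected
in this model.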

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to userspace.
* ``1`` - success: copy ``optval`` and ``optlen`` to userspace, return
  ``retval`` from the syscall (note that this can be overwritten by
  the BPF program from the parent cgroup).

Cgroup Inheritance
==================

Suppose there is the following cgroup hierarchy where each cgroup
has ``BPF_CGROUP_GETSOCKOPT`` attached with the
``BPF_F_ALLOW_MULTI`` flag::

  A (root, parent)
   \
    B (child)

When the application calls the ``getsockopt`` syscall from cgroup B,
the programs are executed from the bottom up: B, A. The first program
(B) sees the result of the kernel's ``getsockopt``. It can optionally
adjust ``optval``, ``optlen`` and reset ``retval`` to 0. After that,
control is passed to the second program (A), which sees the
same context as B including any potential modifications.

The same applies to ``BPF_CGROUP_SETSOCKOPT``: if the program is attached
to A and B, the trigger order is B, then A. If B makes any changes
to the input arguments (``level``, ``optname``, ``optval``, ``optlen``),
then the next program in the chain (A) will see those changes,
*not* the original input ``setsockopt`` arguments. The potentially
modified values will then be passed down to the kernel.
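
The bottom-up ordering and the shared context can be illustrated with a
userspace model. Everything here is hypothetical scaffolding
(``struct sockopt_ctx``, ``prog_a``, ``prog_b``, ``run_chain``); it only
demonstrates that A observes B's modifications.

.. code-block:: c

    /* Simplified stand-in for the fields of struct bpf_sockopt used here. */
    struct sockopt_ctx {
        int level;
        int optname;
        int optlen;
        int retval;
    };

    typedef int (*sockopt_prog)(struct sockopt_ctx *ctx);

    /* Program attached to child cgroup B: clears the kernel's error. */
    static int prog_b(struct sockopt_ctx *ctx)
    {
        ctx->retval = 0;
        ctx->optlen = 4;
        return 1;
    }

    /* Program attached to root cgroup A: acts on what B left behind. */
    static int prog_a(struct sockopt_ctx *ctx)
    {
        if (ctx->retval == 0 && ctx->optlen == 4)
            ctx->optlen = 2;  /* only reached because B already ran */
        return 1;
    }

    /* Run programs in the order given (bottom up), sharing one context. */
    static int run_chain(const sockopt_prog *progs, int n,
                         struct sockopt_ctx *ctx)
    {
        int verdict = 1;

        for (int i = 0; i < n; i++)
            verdict &= progs[i](ctx);  /* assumption: verdicts are ANDed */
        return verdict;
    }

Running the chain as ``{ prog_b, prog_a }`` on a context with ``optlen`` 16
and a non-zero ``retval`` leaves ``optlen`` at 2, because A sees B's changes.
In the reverse order A would see the untouched context and ``optlen`` would
end up as 4.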

Large optval
============

When the ``optval`` is greater than ``PAGE_SIZE``, the BPF program
can access only the first ``PAGE_SIZE`` of that data. So it has two options:

* Set ``optlen`` to zero, which indicates that the kernel should
  use the original buffer from userspace. Any modifications
  done by the BPF program to the ``optval`` are ignored.
* Set ``optlen`` to a value less than ``PAGE_SIZE``, which
  indicates that the kernel should use BPF's trimmed ``optval``.

When the BPF program returns with ``optlen`` greater than
``PAGE_SIZE``, userspace will receive the ``EFAULT`` errno.
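
The choices above can be expressed as a small decision function. This is a
userspace model under stated assumptions, not the kernel logic: the names
are hypothetical, a 4 KiB page is assumed, only non-negative ``optlen``
values are considered, and treating ``optlen`` equal to ``PAGE_SIZE`` as
acceptable is a guess, since the text only covers "less than" and
"greater than".

.. code-block:: c

    #include <errno.h>

    #define EXAMPLE_PAGE_SIZE 4096  /* assumption: 4 KiB pages */

    enum optval_source {
        USE_USER_BUFFER,  /* optlen == 0: original userspace buffer */
        USE_BPF_BUFFER,   /* BPF's trimmed copy of optval */
    };

    /*
     * Model of how the optlen left by the program is interpreted when the
     * original optval was larger than one page. Returns 0 or -EFAULT.
     */
    static int resolve_large_optval(int bpf_optlen, enum optval_source *src)
    {
        if (bpf_optlen > EXAMPLE_PAGE_SIZE)
            return -EFAULT;  /* program left an oversized length in place */
        if (bpf_optlen == 0) {
            *src = USE_USER_BUFFER;  /* BPF modifications are ignored */
            return 0;
        }
        *src = USE_BPF_BUFFER;
        return 0;
    }

A program that does not care about large options can therefore set
``optlen`` to zero whenever it sees a value above ``PAGE_SIZE`` and let the
kernel use the original buffer.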

Example
=======

See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example
of a BPF program that handles socket options.