Adding a New System Call
========================

This document describes what's involved in adding a new system call to the
Linux kernel, over and above the normal submission advice in
:ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.


System Call Alternatives
------------------------

The first thing to consider when adding a new system call is whether one of
the alternatives might be suitable instead.  Although system calls are the
most traditional and most obvious interaction points between userspace and the
kernel, there are other possibilities -- choose what fits best for your
interface.

 - If the operations involved can be made to look like a filesystem-like
   object, it may make more sense to create a new filesystem or device.  This
   also makes it easier to encapsulate the new functionality in a kernel module
   rather than requiring it to be built into the main kernel.

     - If the new functionality involves operations where the kernel notifies
       userspace that something has happened, then returning a new file
       descriptor for the relevant object allows userspace to use
       ``poll``/``select``/``epoll`` to receive that notification.
     - However, operations that don't map to
       :manpage:`read(2)`/:manpage:`write(2)`-like operations
       have to be implemented as :manpage:`ioctl(2)` requests, which can lead
       to a somewhat opaque API.

 - If you're just exposing runtime system information, a new node in sysfs
   (see ``Documentation/filesystems/sysfs.txt``) or the ``/proc`` filesystem may
   be more appropriate.  However, access to these mechanisms requires that the
   relevant filesystem is mounted, which might not always be the case (e.g.
   in a namespaced/sandboxed/chrooted environment).  Avoid adding any API to
   debugfs, as this is not considered a 'production' interface to userspace.
 - If the operation is specific to a particular file or file descriptor, then
   an additional :manpage:`fcntl(2)` command option may be more appropriate.  However,
   :manpage:`fcntl(2)` is a multiplexing system call that hides a lot of complexity, so
   this option is best for when the new function is closely analogous to
   existing :manpage:`fcntl(2)` functionality, or the new functionality is very simple
   (for example, getting/setting a simple flag related to a file descriptor).
 - If the operation is specific to a particular task or process, then an
   additional :manpage:`prctl(2)` command option may be more appropriate.  As
   with :manpage:`fcntl(2)`, this system call is a complicated multiplexor so
   is best reserved for near-analogs of existing ``prctl()`` commands or
   getting/setting a simple flag related to a process.


Designing the API: Planning for Extension
-----------------------------------------

A new system call forms part of the API of the kernel, and has to be supported
indefinitely.  As such, it's a very good idea to explicitly discuss the
interface on the kernel mailing list, and it's important to plan for future
extensions of the interface.

(The syscall table is littered with historical examples where this wasn't done,
together with the corresponding follow-up system calls --
``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``,
``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so
learn from the history of the kernel and plan for extensions from the start.)

For simpler system calls that only take a couple of arguments, the preferred
way to allow for future extensibility is to include a flags argument to the
system call.  To make sure that userspace programs can safely use flags
between kernel versions, check whether the flags value holds any unknown
flags, and reject the system call (with ``EINVAL``) if it does::

    if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
        return -EINVAL;

(If no flags values are used yet, check that the flags argument is zero.)

For more sophisticated system calls that involve a larger number of arguments,
it's preferred to encapsulate the majority of the arguments into a structure
that is passed in by pointer.  Such a structure can cope with future extension
by including a size argument in the structure::

    struct xyzzy_params {
        u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
        u32 param_1;
        u64 param_2;
        u64 param_3;
    };

As long as any subsequently added field, say ``param_4``, is designed so that a
zero value gives the previous behaviour, then this allows both directions of
version mismatch:

 - To cope with a later userspace program calling an older kernel, the kernel
   code should check that any memory beyond the size of the structure that it
   expects is zero (effectively checking that ``param_4 == 0``).
 - To cope with an older userspace program calling a newer kernel, the kernel
   code can zero-extend a smaller instance of the structure (effectively
   setting ``param_4 = 0``).

See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
``kernel/events/core.c``) for an example of this approach.

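
The two mismatch rules above can be sketched as a userspace model.  This is
only an illustration, not the kernel's implementation: the helper name
``xyzzy_copy_params()`` is hypothetical, plain ``memcpy()`` stands in for the
user-memory accessors that real kernel code must use, and the choice of
``-E2BIG`` for non-zero trailing bytes is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parameter block, mirroring the layout shown above. */
struct xyzzy_params {
	uint32_t size;
	uint32_t param_1;
	uint64_t param_2;
	uint64_t param_3;
};

/*
 * Userspace model of a size-versioned copy: 'src' points at 'usize' bytes
 * supplied by the caller. Returns 0 on success or a negative errno value.
 */
static int xyzzy_copy_params(struct xyzzy_params *dst, const void *src,
			     size_t usize)
{
	const unsigned char *bytes = src;
	size_t ksize = sizeof(*dst);
	size_t i;

	if (usize < sizeof(uint32_t))
		return -EINVAL;	/* too small to hold even the size field */

	/* Newer caller: every byte this version does not know must be zero. */
	for (i = ksize; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	/* Older caller: zero-fill first, so missing fields read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, usize < ksize ? usize : ksize);
	dst->size = (uint32_t)ksize;
	return 0;
}
```

A caller built against a shorter version of the structure gets zeroes in the
fields it did not supply, while a caller that sets a field this version does
not understand is rejected instead of being silently misinterpreted.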

Designing the API: Other Considerations
---------------------------------------

If your new system call allows userspace to refer to a kernel object, it
should use a file descriptor as the handle for that object -- don't invent a
new type of userspace object handle when the kernel already has mechanisms and
well-defined semantics for using file descriptors.

If your new :manpage:`xyzzy(2)` system call does return a new file descriptor,
then the flags argument should include a value that is equivalent to setting
``O_CLOEXEC`` on the new FD.  This makes it possible for userspace to close
the timing window between ``xyzzy()`` and calling
``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and
``execve()`` in another thread could leak a descriptor to
the exec'ed program.  (However, resist the temptation to re-use the actual value
of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a
numbering space of ``O_*`` flags that is fairly full.)

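
The timing window can be seen from userspace with an existing pair of system
calls.  The sketch below uses ``pipe``/``pipe2`` purely as a stand-in for
``xyzzy()``; the helper names are hypothetical and the code assumes Linux with
glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if close-on-exec is set on fd, 0 if not, -1 on error. */
static int fd_is_cloexec(int fd)
{
	int fdflags = fcntl(fd, F_GETFD);

	if (fdflags < 0)
		return -1;
	return (fdflags & FD_CLOEXEC) ? 1 : 0;
}

/* Atomic: both descriptors are close-on-exec from the moment they exist. */
static int make_pipe_atomic(int fds[2])
{
	return pipe2(fds, O_CLOEXEC);
}

/* Racy: another thread may fork() and execve() between the two steps. */
static int make_pipe_racy(int fds[2])
{
	if (pipe(fds) < 0)
		return -1;
	/* A descriptor leaked to an exec'ed program would escape here. */
	if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) < 0 ||
	    fcntl(fds[1], F_SETFD, FD_CLOEXEC) < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	return 0;
}
```

Both helpers end in the same state; the difference is only whether an
intermediate state without close-on-exec is ever observable by another thread.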
If your system call returns a new file descriptor, you should also consider
what it means to use the :manpage:`poll(2)` family of system calls on that file
descriptor.  Making a file descriptor ready for reading or writing is the
normal way for the kernel to indicate to userspace that an event has
occurred on the corresponding kernel object.

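
As an illustration of that convention, the following userspace sketch uses an
existing descriptor type, ``eventfd``, whose descriptor becomes readable once
its counter is non-zero.  The helper names are hypothetical; the code assumes
Linux with glibc.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns 1 if fd is readable right now, 0 if not, -1 on error. */
static int fd_readable_now(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n = poll(&pfd, 1, 0);	/* timeout 0: do not block */

	if (n < 0)
		return -1;
	return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Signal one event by adding 1 to the eventfd counter. */
static int signal_event(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}
```

A descriptor returned by a new system call can follow the same pattern, which
lets it be combined with other descriptors in a single event loop.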
If your new :manpage:`xyzzy(2)` system call involves a filename argument::

    int sys_xyzzy(const char __user *path, ..., unsigned int flags);

you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate::

    int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);

This allows more flexibility for how userspace specifies the file in question;
in particular it allows userspace to request the functionality for an
already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively
giving an :manpage:`fxyzzy(3)` operation for free::

    - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path, ...)
    - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...)

(For more details on the rationale of the \*at() calls, see the
:manpage:`openat(2)` man page; for an example of ``AT_EMPTY_PATH``, see the
:manpage:`fstatat(2)` man page.)

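
The two equivalences can be tried out with :manpage:`fstatat(2)`, which
already follows this convention.  This is a userspace sketch with hypothetical
wrapper names, assuming Linux with glibc (``AT_EMPTY_PATH`` needs
``_GNU_SOURCE``).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* The "xyzzy(path, ...)" form: resolve path relative to the current directory. */
static int stat_by_path(const char *path, struct stat *st)
{
	return fstatat(AT_FDCWD, path, st, 0);
}

/* The "fxyzzy(fd, ...)" form: operate on the already-open descriptor itself. */
static int stat_by_fd(int fd, struct stat *st)
{
	return fstatat(fd, "", st, AT_EMPTY_PATH);
}
```

Both wrappers sit on top of a single system call, so no separate
path-based and descriptor-based entry points are needed.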
If your new :manpage:`xyzzy(2)` system call involves a parameter describing an
offset within a file, make its type ``loff_t`` so that 64-bit offsets can be
supported even on 32-bit architectures.

If your new :manpage:`xyzzy(2)` system call involves privileged functionality,
it needs to be governed by the appropriate Linux capability bit (checked with
a call to ``capable()``), as described in the :manpage:`capabilities(7)` man
page.  Choose an existing capability bit that governs related functionality,
but try to avoid combining lots of only vaguely related functions together
under the same bit, as this goes against capabilities' purpose of splitting
the power of root.  In particular, avoid adding new uses of the already
overly-general ``CAP_SYS_ADMIN`` capability.

If your new :manpage:`xyzzy(2)` system call manipulates a process other than
the calling process, it should be restricted (using a call to
``ptrace_may_access()``) so that only a calling process with the same
permissions as the target process, or with the necessary capabilities, can
manipulate the target process.

Finally, be aware that some non-x86 architectures have an easier time if
system call parameters that are explicitly 64-bit fall on odd-numbered
arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit
registers.  (This concern does not apply if the arguments are part of a
structure that's passed in by pointer.)


Proposing the API
-----------------

To make new system calls easy to review, it's best to divide up the patchset
into separate chunks.  These should include at least the following items as
distinct commits (each of which is described further below):

 - The core implementation of the system call, together with prototypes,
   generic numbering, Kconfig changes and fallback stub implementation.
 - Wiring up of the new system call for one particular architecture, usually
   x86 (including all of x86_64, x86_32 and x32).
 - A demonstration of the use of the new system call in userspace via a
   selftest in ``tools/testing/selftests/``.
 - A draft man-page for the new system call, either as plain text in the
   cover letter, or as a patch to the (separate) man-pages repository.

New system call proposals, like any change to the kernel's API, should always
be cc'ed to linux-api@vger.kernel.org.


Generic System Call Implementation
----------------------------------

The main entry point for your new :manpage:`xyzzy(2)` system call will be called
``sys_xyzzy()``, but you add this entry point with the appropriate
``SYSCALL_DEFINEn()`` macro rather than explicitly.  The 'n' indicates the
number of arguments to the system call, and the macro takes the system call name
followed by the (type, name) pairs for the parameters as arguments.  Using
this macro allows metadata about the new system call to be made available for
other tools.

The new entry point also needs a corresponding function prototype, in
``include/linux/syscalls.h``, marked as ``asmlinkage`` to match the way that
system calls are invoked::

    asmlinkage long sys_xyzzy(...);

Some architectures (e.g. x86) have their own architecture-specific syscall
tables, but several other architectures share a generic syscall table.  Add your
new system call to the generic list by adding an entry to the list in
``include/uapi/asm-generic/unistd.h``::

    #define __NR_xyzzy 292
    __SYSCALL(__NR_xyzzy, sys_xyzzy)

Also update the ``__NR_syscalls`` count to reflect the additional system call,
and note that if multiple new system calls are added in the same merge window,
your new syscall number may get adjusted to resolve conflicts.

The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each
system call, returning ``-ENOSYS``.  Add your new system call here too::

    COND_SYSCALL(xyzzy);

Your new kernel functionality, and the system call that controls it, should
normally be optional, so add a ``CONFIG`` option (typically to
``init/Kconfig``) for it.  As usual for new ``CONFIG`` options:

 - Include a description of the new functionality and system call controlled
   by the option.
 - Make the option depend on ``EXPERT`` if it should be hidden from normal users.
 - Make any new source files implementing the function dependent on the
   ``CONFIG`` option in the Makefile (e.g.
   ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.o``).
 - Double check that the kernel still builds with the new ``CONFIG`` option
   turned off.

To summarize, you need a commit that includes:

 - ``CONFIG`` option for the new function, normally in ``init/Kconfig``
 - ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point
 - corresponding prototype in ``include/linux/syscalls.h``
 - generic table entry in ``include/uapi/asm-generic/unistd.h``
 - fallback stub in ``kernel/sys_ni.c``


x86 System Call Implementation
------------------------------

To wire up your new system call for x86 platforms, you need to update the
master syscall tables.  Assuming your new system call isn't special in some
way (see below), this involves a "common" entry (for x86_64 and x32) in
``arch/x86/entry/syscalls/syscall_64.tbl``::

    333   common   xyzzy     sys_xyzzy

and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy

Again, these numbers are liable to be changed if there are conflicts in the
relevant merge window.


Compatibility System Calls (Generic)
------------------------------------

For most system calls the same 64-bit implementation can be invoked even when
the userspace program is itself 32-bit; even if the system call's parameters
include an explicit pointer, this is handled transparently.

However, there are a couple of situations where a compatibility layer is
needed to cope with size differences between 32-bit and 64-bit.

The first is if the 64-bit kernel also supports 32-bit userspace programs, and
so needs to parse areas of (``__user``) memory that could hold either 32-bit or
64-bit values.  In particular, this is needed whenever a system call argument
is:

 - a pointer to a pointer
 - a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``)
 - a pointer to a varying sized integral type (``time_t``, ``off_t``,
   ``long``, ...)
 - a pointer to a struct containing a varying sized integral type.

The second situation that requires a compatibility layer is if one of the
system call's arguments has a type that is explicitly 64-bit even on a 32-bit
architecture, for example ``loff_t`` or ``__u64``.  In this case, a value that
arrives at a 64-bit kernel from a 32-bit application will be split into two
32-bit values, which then need to be re-assembled in the compatibility layer.

(Note that a system call argument that's a pointer to an explicit 64-bit type
does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of
type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.)

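
The re-assembly step for the second situation amounts to a shift and an OR.
The helper below is a hypothetical userspace sketch of that arithmetic; which
of the two 32-bit arguments carries the high half depends on the architecture
and is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Join two 32-bit halves, as received from a 32-bit caller, into one value. */
static uint64_t join_halves(uint32_t lo, uint32_t hi)
{
	/* Widen before shifting so the high half is not shifted out of range. */
	return ((uint64_t)hi << 32) | lo;
}
```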
The compatibility version of the system call is called ``compat_sys_xyzzy()``,
and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to
``SYSCALL_DEFINEn()``.  This version of the implementation runs as part of a
64-bit kernel, but expects to receive 32-bit parameter values and does whatever
is needed to deal with them.  (Typically, the ``compat_sys_`` version converts
the values to 64-bit versions and either calls on to the ``sys_`` version, or
both of them call a common inner implementation function.)

The compat entry point also needs a corresponding function prototype, in
``include/linux/compat.h``, marked as ``asmlinkage`` to match the way that
system calls are invoked::

    asmlinkage long compat_sys_xyzzy(...);

If the system call involves a structure that is laid out differently on 32-bit
and 64-bit systems, say ``struct xyzzy_args``, then the
``include/linux/compat.h`` header file should also include a compat version of
the structure (``struct compat_xyzzy_args``) where each variable-size field has
the appropriate ``compat_`` type that corresponds to the type in
``struct xyzzy_args``.  The ``compat_sys_xyzzy()`` routine can then use this
``compat_`` structure to parse the arguments from a 32-bit invocation.

For example, if there are fields::

    struct xyzzy_args {
        const char __user *ptr;
        __kernel_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };

in ``struct xyzzy_args``, then ``struct compat_xyzzy_args`` would have::

    struct compat_xyzzy_args {
        compat_uptr_t ptr;
        compat_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };

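
The conversion that a compat routine performs on such a structure can be
modelled in userspace.  Everything prefixed ``model_`` below is a hypothetical
stand-in: the real ``compat_`` types live in kernel headers, and kernel code
converts 32-bit user pointers with ``compat_ptr()`` rather than a cast.  The
sketch assumes that ``compat_uptr_t`` is an unsigned 32-bit value and
``compat_long_t`` a signed 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t model_compat_uptr_t;	/* stand-in for compat_uptr_t */
typedef int32_t model_compat_long_t;	/* stand-in for compat_long_t */

/* Layout as seen by a 32-bit caller. */
struct model_compat_xyzzy_args {
	model_compat_uptr_t ptr;
	model_compat_long_t varying_val;
	uint64_t fixed_val;
};

/* Native layout used by the common implementation. */
struct model_xyzzy_args {
	const char *ptr;
	long varying_val;
	uint64_t fixed_val;
};

/* Widen each variable-size field; fixed-size fields are copied unchanged. */
static void model_xyzzy_args_from_compat(struct model_xyzzy_args *dst,
				const struct model_compat_xyzzy_args *src)
{
	dst->ptr = (const char *)(uintptr_t)src->ptr;	/* zero-extended */
	dst->varying_val = src->varying_val;		/* sign-extended */
	dst->fixed_val = src->fixed_val;
}
```

Note that the pointer is widened without sign extension while the ``long``
value keeps its sign, which is why the two fields need distinct ``compat_``
types rather than a shared 32-bit integer.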
The generic system call list also needs adjusting to allow for the compat
version; the entry in ``include/uapi/asm-generic/unistd.h`` should use
``__SC_COMP`` rather than ``__SYSCALL``::

    #define __NR_xyzzy 292
    __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)

To summarize, you need:

 - a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point
 - corresponding prototype in ``include/linux/compat.h``
 - (if needed) 32-bit mapping struct in ``include/linux/compat.h``
 - instance of ``__SC_COMP`` not ``__SYSCALL`` in
   ``include/uapi/asm-generic/unistd.h``


Compatibility System Calls (x86)
--------------------------------

To wire up the x86 architecture of a system call with a compatibility version,
the entries in the syscall tables need to be adjusted.

First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra
column to indicate that a 32-bit userspace program running on a 64-bit kernel
should hit the compat entry point::

    380   i386     xyzzy     sys_xyzzy    __ia32_compat_sys_xyzzy

Second, you need to figure out what should happen for the x32 ABI version of
the new system call.  There's a choice here: the layout of the arguments
should either match the 64-bit version or the 32-bit version.

If there's a pointer-to-a-pointer involved, the decision is easy: x32 is
ILP32, so the layout should match the 32-bit version, and the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit
the compatibility wrapper::

    333   64       xyzzy     sys_xyzzy
    ...
    555   x32      xyzzy     __x32_compat_sys_xyzzy

If no pointers are involved, then it is preferable to re-use the 64-bit system
call for the x32 ABI (and consequently the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is unchanged).

In either case, you should check that the types involved in your argument
layout do indeed map exactly from x32 (``-mx32``) to either the 32-bit
(``-m32``) or 64-bit (``-m64``) equivalents.


System Calls Returning Elsewhere
--------------------------------

For most system calls, once the system call is complete the user program
continues exactly where it left off -- at the next instruction, with the
stack the same and most of the registers the same as before the system call,
and with the same virtual memory space.

However, a few system calls do things differently.  They might return to a
different location (``rt_sigreturn``) or change the memory space
(``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``)
of the program.

To allow for this, the kernel implementation of the system call may need to
save and restore additional registers to the kernel stack, allowing complete
control of where and how execution continues after the system call.

This is arch-specific, but typically involves defining assembly entry points
that save/restore additional registers and invoke the real system call entry
point.

For x86_64, this is implemented as a ``stub_xyzzy`` entry point in
``arch/x86/entry/entry_64.S``, and the entry in the syscall table
(``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match::

    333   common   xyzzy     stub_xyzzy

The equivalent for 32-bit programs running on a 64-bit kernel is normally
called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``,
with the corresponding syscall table adjustment in
``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy    stub32_xyzzy

If the system call needs a compatibility layer (as in the previous section)
then the ``stub32_`` version needs to call on to the ``compat_sys_`` version
of the system call rather than the native 64-bit version.  Also, if the x32 ABI
implementation is not common with the x86_64 version, then its syscall
table will also need to invoke a stub that calls on to the ``compat_sys_``
version.

For completeness, it's also nice to set up a mapping so that user-mode Linux
still works -- its syscall table will reference ``stub_xyzzy``, but the UML
build doesn't include the ``arch/x86/entry/entry_64.S`` implementation (because
UML simulates registers etc).  Fixing this is as simple as adding a ``#define``
to ``arch/x86/um/sys_call_table_64.c``::

    #define stub_xyzzy sys_xyzzy


| 437 | Other Details |
| 438 | ------------- |
| 439 | |
| 440 | Most of the kernel treats system calls in a generic way, but there is the |
| 441 | occasional exception that may need updating for your particular system call. |

The audit subsystem is one such special case; it includes (arch-specific)
functions that classify some special types of system call -- specifically
file open (``open``/``openat``), program execution (``execve``/``execveat``) or
socket multiplexor (``socketcall``) operations. If your new system call is
analogous to one of these, then the audit system should be updated.
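
As a rough illustration of what such a classification looks like, the
following userspace sketch groups system call numbers into the categories
named above. The enum and the ``classify_syscall()`` name are hypothetical and
exist only for this example; the in-kernel functions are arch-specific and are
not reproduced here.

```c
#include <sys/syscall.h>

/* Hypothetical classes for this sketch; not the values the kernel uses. */
enum xyzzy_audit_class {
	XYZZY_CLASS_GENERIC = 0,
	XYZZY_CLASS_OPEN,
	XYZZY_CLASS_EXEC,
	XYZZY_CLASS_SOCKETCALL,
};

/* Map a system call number to one of the special classes, if any. */
enum xyzzy_audit_class classify_syscall(long nr)
{
	switch (nr) {
#ifdef SYS_open
	case SYS_open:		/* not provided by every architecture */
#endif
	case SYS_openat:
		return XYZZY_CLASS_OPEN;
	case SYS_execve:
	case SYS_execveat:
		return XYZZY_CLASS_EXEC;
#ifdef SYS_socketcall
	case SYS_socketcall:	/* only on ABIs that multiplex socket calls */
		return XYZZY_CLASS_SOCKETCALL;
#endif
	default:
		return XYZZY_CLASS_GENERIC;
	}
}
```

A new system call that behaves like ``openat`` would need its own ``case``
next to the existing file-open entries; otherwise it silently falls into the
generic class.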

More generally, if there is an existing system call that is analogous to your
new system call, it's worth doing a kernel-wide grep for the existing system
call to check there are no other special cases.


Testing
-------

A new system call should obviously be tested; it is also useful to provide
reviewers with a demonstration of how user space programs will use the system
call. A good way to combine these aims is to include a simple self-test
program in a new directory under ``tools/testing/selftests/``.

For a new system call, there will obviously be no libc wrapper function and so
the test will need to invoke it using ``syscall()``; also, if the system call
involves a new userspace-visible structure, the corresponding header will need
to be installed to compile the test.
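
A minimal sketch of such a test follows. It assumes a hypothetical
``xyzzy()`` that takes a single flags argument, and it uses -1 as a
placeholder for ``__NR_xyzzy`` because no real number exists; a real test
would pick the number up from the installed headers. The exit codes are
meant to follow the usual kselftest convention (0 for pass, 1 for fail,
4 for skip).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Assumption: placeholder number. -1 is never a valid system call number,
 * so the kernel normally rejects the call with ENOSYS.
 */
#ifndef __NR_xyzzy
#define __NR_xyzzy (-1)
#endif

enum { XYZZY_PASS = 0, XYZZY_FAIL = 1, XYZZY_SKIP = 4 };

/* Thin wrapper, since libc has no wrapper for a brand-new system call. */
long xyzzy(unsigned int flags)
{
	return syscall(__NR_xyzzy, flags);
}

/* Run one basic check and report the result as an exit code. */
int xyzzy_selftest(void)
{
	long ret;

	errno = 0;
	ret = xyzzy(0);
	if (ret == -1 && errno == ENOSYS) {
		printf("SKIP: xyzzy() is not implemented by this kernel\n");
		return XYZZY_SKIP;
	}
	if (ret < 0) {
		printf("FAIL: xyzzy(0) failed with errno %d\n", errno);
		return XYZZY_FAIL;
	}
	printf("PASS: xyzzy(0) returned %ld\n", ret);
	return XYZZY_PASS;
}
```

A ``main()`` that simply returns ``xyzzy_selftest()`` turns this into a
standalone program. Treating ``ENOSYS`` as a skip rather than a failure lets
the same binary run on kernels built without the new system call.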

Make sure the selftest runs successfully on all supported architectures. For
example, check that it works when compiled as an x86_64 (-m64), x86_32 (-m32)
and x32 (-mx32) ABI program.
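
When the same test source is built three times, it helps if the test states
which ABI it was compiled for. The sketch below relies only on the compiler's
predefined macros; the function names are made up for this example.

```c
#include <stdio.h>

/* Name the ABI this translation unit was compiled for. */
const char *build_abi(void)
{
#if defined(__x86_64__) && defined(__ILP32__)
	return "x32";		/* built with -mx32 */
#elif defined(__x86_64__)
	return "x86_64";	/* built with -m64 */
#elif defined(__i386__)
	return "x86_32";	/* built with -m32 */
#else
	return "other";
#endif
}

/* Print the ABI together with the type sizes that usually differ. */
void report_abi(void)
{
	printf("xyzzy selftest: %s ABI, sizeof(long)=%zu, sizeof(void *)=%zu\n",
	       build_abi(), sizeof(long), sizeof(void *));
}
```

Calling ``report_abi()`` at the start of the test makes it obvious in the
logs which of the three builds produced a given failure.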

For more extensive and thorough testing of new functionality, you should also
consider adding tests to the Linux Test Project, or to the xfstests project
for filesystem-related changes.

 - https://linux-test-project.github.io/
 - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git


Man Page
--------

All new system calls should come with a complete man page, ideally using groff
markup, but plain text will do. If groff is used, it's helpful to include a
pre-rendered ASCII version of the man page in the cover email for the
patchset, for the convenience of reviewers.

The man page should be cc'ed to linux-man@vger.kernel.org.
For more details, see https://www.kernel.org/doc/man-pages/patches.html


Do not call System Calls in the Kernel
--------------------------------------

System calls are, as stated above, interaction points between userspace and
the kernel. Therefore, system call functions such as ``sys_xyzzy()`` or
``compat_sys_xyzzy()`` should only be called from userspace via the syscall
table, and not from elsewhere in the kernel. If the syscall functionality is
also useful within the kernel, needs to be shared between an old and a
new syscall, or needs to be shared between a syscall and its compatibility
variant, it should be implemented by means of a "helper" function (such as
``kern_xyzzy()``). This kernel function may then be called within the
syscall stub (``sys_xyzzy()``), the compatibility syscall stub
(``compat_sys_xyzzy()``), and/or other kernel code.
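
The shape of this arrangement can be modelled in plain C. The sketch below is
a userspace stand-in, not kernel code: in the kernel the two entry points
would be defined with ``SYSCALL_DEFINEn()`` and ``COMPAT_SYSCALL_DEFINEn()``,
and the argument list (a file descriptor plus a 64-bit offset that the 32-bit
ABI passes as two halves) is an assumption made for this example.

```c
#include <errno.h>
#include <stdint.h>

/* Shared helper: holds the real logic and may be called by other code. */
long kern_xyzzy(int fd, int64_t offset)
{
	if (fd < 0)
		return -EBADF;
	if (offset < 0)
		return -EINVAL;
	return 0;	/* placeholder for the real work */
}

/* Native entry point: only forwards its arguments to the helper. */
long sys_xyzzy(int fd, int64_t offset)
{
	return kern_xyzzy(fd, offset);
}

/* Compat entry point: rebuilds the 64-bit offset, then calls the helper. */
long compat_sys_xyzzy(int fd, uint32_t off_lo, uint32_t off_hi)
{
	int64_t offset = (int64_t)(((uint64_t)off_hi << 32) | off_lo);

	return kern_xyzzy(fd, offset);
}
```

Neither entry point calls the other; both are thin layers over
``kern_xyzzy()``, which is also the only function that other code would use.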

At least on 64-bit x86, it will be a hard requirement from v4.17 onwards to not
call system call functions in the kernel. That architecture uses a different
calling convention for system calls, where ``struct pt_regs`` is decoded
on-the-fly in a syscall wrapper which then hands processing over to the actual
syscall function. This means that only those parameters which are actually
needed for a specific syscall are passed on during syscall entry, instead of
filling in six CPU registers with random user space content all the time (which
may cause serious trouble down the call chain).
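
The idea behind such a wrapper can be sketched in userspace C. Everything
here is a simplified model: ``struct model_regs`` stands in for the argument
registers saved at syscall entry on x86_64 and is not the kernel's
``struct pt_regs``, and the two-argument ``do_xyzzy()`` with its single known
flag is hypothetical.

```c
#include <errno.h>

/* Stand-in for the six argument registers saved at syscall entry. */
struct model_regs {
	unsigned long di, si, dx, r10, r8, r9;
};

#define XYZZY_KNOWN_FLAGS 0x1u	/* hypothetical flag mask */

/* The actual syscall function: takes exactly the arguments it needs. */
long do_xyzzy(int fd, unsigned int flags)
{
	if (fd < 0)
		return -EBADF;
	if (flags & ~XYZZY_KNOWN_FLAGS)
		return -EINVAL;	/* reject unknown flags */
	return 0;
}

/*
 * Wrapper: decodes only the first two registers and never looks at the
 * remaining four, whatever user space left in them.
 */
long wrap_xyzzy(const struct model_regs *regs)
{
	return do_xyzzy((int)regs->di, (unsigned int)regs->si);
}
```

Because the wrapper is the only place that touches the register block, stale
values in the unused registers cannot leak into the syscall function.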

Moreover, rules on how data may be accessed may differ between kernel data and
user data. This is another reason why calling ``sys_xyzzy()`` is generally a
bad idea.

Exceptions to this rule are only allowed in architecture-specific overrides,
architecture-specific compatibility wrappers, or other code in arch/.


References and Sources
----------------------

 - LWN article from Michael Kerrisk on use of flags argument in system calls:
   https://lwn.net/Articles/585415/
 - LWN article from Michael Kerrisk on how to handle unknown flags in a system
   call: https://lwn.net/Articles/588444/
 - LWN article from Jake Edge describing constraints on 64-bit system call
   arguments: https://lwn.net/Articles/311630/
 - Pair of LWN articles from David Drysdale that describe the system call
   implementation paths in detail for v3.14:

    - https://lwn.net/Articles/604287/
    - https://lwn.net/Articles/604515/

 - Architecture-specific requirements for system calls are discussed in the
   :manpage:`syscall(2)` man-page:
   http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
 - Collated emails from Linus Torvalds discussing the problems with ``ioctl()``:
   http://yarchive.net/comp/linux/ioctl.html
 - "How to not invent kernel interfaces", Arnd Bergmann,
   http://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
 - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN:
   https://lwn.net/Articles/486306/
 - Recommendation from Andrew Morton that all related information for a new
   system call should come in the same email thread:
   https://lkml.org/lkml/2014/7/24/641
 - Recommendation from Michael Kerrisk that a new system call should come with
   a man page: https://lkml.org/lkml/2014/6/13/309
 - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate
   commit: https://lkml.org/lkml/2014/11/19/254
 - Suggestion from Greg Kroah-Hartman that it's good for new system calls to
   come with a man-page & selftest: https://lkml.org/lkml/2014/3/19/710
 - Discussion from Michael Kerrisk of new system call vs. :manpage:`prctl(2)` extension:
   https://lkml.org/lkml/2014/6/3/411
 - Suggestion from Ingo Molnar that system calls that involve multiple
   arguments should encapsulate those arguments in a struct, which includes a
   size field for future extensibility: https://lkml.org/lkml/2015/7/30/117
 - Numbering oddities arising from (re-)use of O_* numbering space flags:

    - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness
      check")
    - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc
      conflict")
    - commit bb458c644a59 ("Safer ABI for O_TMPFILE")

 - Discussion from Matthew Wilcox about restrictions on 64-bit arguments:
   https://lkml.org/lkml/2008/12/12/187
 - Recommendation from Greg Kroah-Hartman that unknown flags should be
   policed: https://lkml.org/lkml/2014/7/17/577
 - Recommendation from Linus Torvalds that x32 system calls should prefer
   compatibility with 64-bit versions rather than 32-bit versions:
   https://lkml.org/lkml/2011/8/31/244