.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process. Seccomp
falls short on this task, since it has limited support for efficiently
filtering syscalls based on memory regions, and it doesn't support
removing filters. Therefore, a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace. The application is in control of a flip
switch, indicating the current personality of the process. A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable/disable the syscall redirection and execute
syscalls directly (disabled) or send them to be emulated in userspace
through a SIGSYS.

The goal of this design is to provide very quick compatibility layer
boundary crossings, which is achieved by not executing a syscall to
change personality every time the compatibility layer executes. Instead,
a userspace memory region exposed to the kernel indicates the current
personality, and the application simply modifies that variable to
configure the mechanism.

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux. Syscall User Dispatch therefore
doesn't rely on any part of the syscall ABI to do the filtering. It uses
only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
disable the mechanism globally for that thread. When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.

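The two operations could be wrapped as shown below. This is a minimal
sketch rather than code from the kernel tree: ``sud_enable()``,
``sud_disable()`` and ``sud_selector`` are hypothetical names, and the
fallback definitions assume the values used by the kernel's uapi
``prctl.h`` for userspace headers that predate the interface.

.. code-block:: c

    #define _GNU_SOURCE
    #include <sys/prctl.h>

    /* Fallbacks for older userspace headers (assumed uapi values). */
    #ifndef PR_SET_SYSCALL_USER_DISPATCH
    #define PR_SET_SYSCALL_USER_DISPATCH 59
    #endif
    #ifndef PR_SYS_DISPATCH_OFF
    #define PR_SYS_DISPATCH_OFF 0
    #endif
    #ifndef PR_SYS_DISPATCH_ON
    #define PR_SYS_DISPATCH_ON 1
    #endif
    #ifndef SYSCALL_DISPATCH_FILTER_ALLOW
    #define SYSCALL_DISPATCH_FILTER_ALLOW 0
    #endif

    /* Selector byte (hypothetical name); starts out allowing syscalls. */
    static volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

    /*
     * Enable the mechanism for the calling thread. Syscalls issued from
     * [offset, offset + length) always run directly; all others follow
     * the selector. Returns 0 on success, -1 with errno set on failure.
     */
    static int sud_enable(unsigned long offset, unsigned long length)
    {
        return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                     offset, length, &sud_selector);
    }

    /* Disable the mechanism; every other argument must be zero. */
    static int sud_disable(void)
    {
        return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                     0UL, 0UL, 0UL);
    }

Because the selector starts as SYSCALL_DISPATCH_FILTER_ALLOW, calling
``sud_enable(0UL, 0UL)`` should not change the behaviour of the thread
until the selector is flipped. On a kernel without support for the
interface, the call is expected to fail with -1 and set errno.
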
[<offset>, <offset>+<length>) delimits a memory region from which
syscalls are always executed directly, regardless of the userspace
selector. This provides a fast path for the C library, which includes
the most common syscall dispatchers in native code applications, and
also provides a way for the signal handler to return without triggering
a nested SIGSYS on (rt\_)sigreturn. Users of this interface should make
sure that at least the signal trampoline code is included in this
region. In addition, on architectures that implement the signal
trampoline in the vDSO, that trampoline is never intercepted.

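The half-open interval described above can be modelled in userspace as
follows. This is an illustrative sketch of the range test only, not the
kernel implementation; ``sud_ip_in_allowed_region()`` is a hypothetical
helper name.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Return true if the instruction pointer ip lies inside
     * [offset, offset + length). Unsigned subtraction wraps around, so
     * a single comparison rejects addresses below offset as well as
     * addresses at or beyond offset + length.
     */
    static bool sud_ip_in_allowed_region(uintptr_t ip, uintptr_t offset,
                                         uintptr_t length)
    {
        return ip - offset < length;
    }

With ``offset = 0x1000`` and ``length = 0x100``, the addresses 0x1000
and 0x10ff fall inside the region, while 0xfff and 0x1100 do not. A
zero ``length`` describes an empty region, so every syscall is then
subject to the selector.
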
[selector] is a pointer to a char-sized region in the process memory
that provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly. selector
can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK.
Any other value should terminate the program with a SIGSYS.

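Flipping the selector and receiving the resulting SIGSYS might look like
the sketch below. All ``sud_*`` names are hypothetical, a single global
selector is assumed (a multi-threaded program would need one per
thread), and reading the syscall number from ``si_syscall`` is shown
only as one plausible thing for a handler to do. The handler sets the
selector to SYSCALL_DISPATCH_FILTER_ALLOW before anything else, so that
its own syscalls and the final sigreturn are not redirected.

.. code-block:: c

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stddef.h>
    #include <string.h>
    #include <sys/prctl.h>

    /* Fallbacks for older userspace headers (assumed uapi values). */
    #ifndef SYSCALL_DISPATCH_FILTER_ALLOW
    #define SYSCALL_DISPATCH_FILTER_ALLOW 0
    #endif
    #ifndef SYSCALL_DISPATCH_FILTER_BLOCK
    #define SYSCALL_DISPATCH_FILTER_BLOCK 1
    #endif

    /* Byte whose address would be passed as [selector] to prctl(). */
    static volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

    /* Number of the most recently redirected syscall, or -1 if none. */
    static volatile sig_atomic_t sud_last_syscall = -1;

    /* Entering non-native code: redirect its syscalls to SIGSYS. */
    static inline void sud_enter_foreign(void)
    {
        sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
    }

    /* Returning to native code: execute syscalls directly again. */
    static inline void sud_enter_native(void)
    {
        sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    }

    static void sud_sigsys_handler(int sig, siginfo_t *info, void *ucontext)
    {
        (void)sig;
        (void)ucontext;

        /* Allow syscalls first, so the handler itself can use them. */
        sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
        sud_last_syscall = info->si_syscall;
        /* An emulator would service the redirected call at this point. */
    }

    /* Install the handler; returns 0 on success, -1 on failure. */
    static int sud_install_sigsys_handler(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sud_sigsys_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        return sigaction(SIGSYS, &sa, NULL);
    }

In this sketch the selector is still SYSCALL_DISPATCH_FILTER_ALLOW when
the handler returns, so the caller has to invoke ``sud_enter_foreign()``
again before resuming non-native code. The helpers have no effect until
the mechanism has been enabled with the prctl described above.
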
Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process. It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value. If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.