===================================
Light-weight System Calls for IA-64
===================================

Started: 13-Jan-2003

Last update: 27-Sep-2003

David Mosberger-Tang
<davidm@hpl.hp.com>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel.  We call this mode the
"fsys-mode".  To recap, the normal states of execution are:

  - kernel mode:
        Both the register stack and the memory stack have been
        switched over to kernel memory.  The user-level state is saved
        in a pt_regs structure at the top of the kernel memory stack.

  - user mode:
        Both the register stack and the memory stack are in
        user memory.  The user-level state is contained in the
        CPU registers.

  - bank 0 interruption-handling mode:
        This is the non-interruptible state in which all
        interruption-handlers start execution.  The user-level
        state remains in the CPU registers and some kernel state may
        be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  - execution is at privilege level 0 (most-privileged)

  - CPU registers may contain a mixture of user-level and kernel-level
    state (it is the responsibility of the kernel to ensure that no
    security-sensitive kernel-level state is leaked back to
    user-level)

  - execution is interruptible and preemptible (an fsys-mode handler
    can disable interrupts and avoid all other interruption-sources
    to avoid preemption)

  - neither the memory-stack nor the register-stack can be trusted while
    in fsys-mode (they point to the user-level stacks, which may
    be invalid, or completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode.  Of course, given that the
privilege level is at level 0, this means that fsys-mode requires some
care (see below).


How to tell fsys-mode
=====================

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
three macros::

    user_mode(regs)
    user_stack(task,regs)
    fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure.  The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs.  user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s).  Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression::

    !user_mode(regs) && user_stack(task,regs)

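The relationship between the three predicates can be sketched with a
small stand-alone C model.  This is only an illustration: the
``regs_model`` structure and the ``model_*`` functions are hypothetical
stand-ins, not the kernel's pt_regs or its actual macros, and the
privilege-level test simply follows the description above.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the saved CPU state.
 * The real struct pt_regs holds far more than this. */
struct regs_model {
    unsigned cpl;          /* privilege level at interruption (0 or 3) */
    bool on_user_stacks;   /* stacks not yet switched to kernel memory */
};

/* TRUE if the saved state was running at privilege level 3. */
static bool model_user_mode(const struct regs_model *regs)
{
    return regs->cpl == 3;
}

/* TRUE if the saved state was running on the user-level stack(s). */
static bool model_user_stack(const struct regs_model *regs)
{
    return regs->on_user_stacks;
}

/* fsys-mode: privileged, but still on the user-level stacks. */
static bool model_fsys_mode(const struct regs_model *regs)
{
    return !model_user_mode(regs) && model_user_stack(regs);
}
```

In this model, kernel mode is (cpl 0, kernel stacks), user mode is
(cpl 3, user stacks), and only (cpl 0, user stacks) counts as fsys-mode.
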
How to write an fsyscall handler
================================

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table).  This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall().  This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler.  For performance-critical system
calls, it is possible to write a hand-tuned fsyscall_handler.  For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.

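The table-with-default arrangement can be illustrated with a C model.
The real table lives in assembly in fsys.S; everything below (the
``model_*`` names, the tiny table size, the system call numbers and the
``path`` enum) is hypothetical and only shows the dispatch idea: every
slot falls back to the full kernel path unless a hand-tuned handler has
been installed.

```c
#include <stddef.h>

enum path { PATH_FAST, PATH_FULL_KERNEL, PATH_INVALID };

/* Hypothetical stand-in for the task structure. */
struct task_model { long pid; };

typedef enum path (*fsys_handler_t)(const struct task_model *cur, long *ret);

/* Models fsys_fallback_syscall(): hand over to the normal syscall path. */
static enum path model_fsys_fallback_syscall(const struct task_model *cur,
                                             long *ret)
{
    (void)cur;
    (void)ret;
    return PATH_FULL_KERNEL;
}

/* Models a hand-tuned handler such as fsys_getpid(). */
static enum path model_fsys_getpid(const struct task_model *cur, long *ret)
{
    *ret = cur->pid;
    return PATH_FAST;
}

#define MODEL_NR_SYSCALLS 4          /* hypothetical table size */
enum { MODEL_NR_getpid = 1 };        /* hypothetical syscall number */

/* Slots left NULL mean "no hand-tuned handler": use the fallback. */
static fsys_handler_t model_fsyscall_table[MODEL_NR_SYSCALLS] = {
    [MODEL_NR_getpid] = model_fsys_getpid,
};

static enum path model_dispatch(unsigned nr, const struct task_model *cur,
                                long *ret)
{
    fsys_handler_t handler;

    if (nr >= MODEL_NR_SYSCALLS)
        return PATH_INVALID;         /* invalid number: back to user-level */
    handler = model_fsyscall_table[nr];
    if (handler == NULL)
        handler = model_fsys_fallback_syscall;
    return handler(cur, ret);
}
```
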
The entry and exit state of an fsyscall handler are as follows:

Machine state on entry to fsyscall handler
------------------------------------------

========= ===============================================================
r10       0
r11       saved ar.pfs (a user-level value)
r15       system call number
r16       "current" task pointer (in normal kernel-mode, this is in r13)
r32-r39   system call arguments
b6        return address (a user-level value)
ar.pfs    previous frame-state (a user-level value)
PSR.be    cleared to zero (i.e., little-endian byte order is in effect)
-         all other registers may contain values passed in from user-mode
========= ===============================================================

Required machine state on exit from fsyscall handler
----------------------------------------------------

========= ===========================================================
r11       saved ar.pfs (as passed into the fsyscall handler)
r15       system call number (as passed into the fsyscall handler)
r32-r39   system call arguments (as passed into the fsyscall handler)
b6        return address (as passed into the fsyscall handler)
ar.pfs    previous frame-state (as passed into the fsyscall handler)
========= ===========================================================

Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:

* Fsyscall-handlers MUST check for any pending work in the flags
  member of the thread-info structure and if any of the
  TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
  doing a full system call (by calling fsys_fallback_syscall).

* Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
  r15, b6, and ar.pfs) because they will be needed in case of a
  system call restart.  Of course, all "preserved" registers also
  must be preserved, in accordance with the normal calling conventions.

* Fsyscall-handlers MUST check argument registers for NaT values
  before using them in any way that could trigger a
  NaT-consumption fault.  If a system call argument is found to
  contain a NaT value, an fsyscall-handler may return immediately
  with r8=EINVAL, r10=-1.

* Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
  any other operation that would trigger mandatory RSE
  (register-stack engine) traffic.

* Fsyscall-handlers MUST NOT write to any stacked registers because
  it is not safe to assume that user-level called a handler with the
  proper number of arguments.

* Fsyscall-handlers need to be careful when accessing per-CPU variables:
  unless proper safe-guards are taken (e.g., interruptions are avoided),
  execution may be pre-empted and resumed on another CPU at any given
  time.

* Fsyscall-handlers must be careful not to leak sensitive kernel
  information back to user-level.  In particular, before returning to
  user-level, care needs to be taken to clear any scratch registers
  that could contain sensitive information (note that the current
  task pointer is not considered sensitive: it's already exposed
  through ar.k6).

* Fsyscall-handlers MUST NOT access user-memory without first
  validating access-permission (this can be done typically via
  probe.r.fault and/or probe.w.fault) and without guarding against
  memory access exceptions (this can be done with the EX() macros
  defined by asmmacro.h).

The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead.  For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed.  In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

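The first and third restrictions (pending work and NaT arguments)
amount to a short pre-check before any fast-path work is done.  The C
sketch below models only that decision.  Real handlers are written in
ia64 assembly; the ``MODEL_TIF_ALLWORK_MASK`` value, the ``fsys_result``
structure and the order of the two checks are assumptions made for
illustration, not taken from fsys.S.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bit layout; the real TIF_ALLWORK_MASK is kernel-defined. */
#define MODEL_TIF_ALLWORK_MASK 0xffu

enum fsys_action {
    FSYS_FAST_PATH,     /* safe to run the hand-tuned handler body */
    FSYS_FALLBACK,      /* must call fsys_fallback_syscall instead */
    FSYS_FAIL_EINVAL    /* return to user-level with an error */
};

struct fsys_result {
    enum fsys_action action;
    long r8;            /* models the r8 return register */
    long r10;           /* models the r10 return register */
};

/* Decide how a handler should proceed before touching its arguments. */
static struct fsys_result fsys_precheck(unsigned thread_flags, bool arg_is_nat)
{
    struct fsys_result res = { FSYS_FAST_PATH, 0, 0 };

    if (thread_flags & MODEL_TIF_ALLWORK_MASK) {
        /* Pending work: let the full system call path deal with it. */
        res.action = FSYS_FALLBACK;
        return res;
    }
    if (arg_is_nat) {
        /* NaT argument: r8=EINVAL, r10=-1, as described above. */
        res.action = FSYS_FAIL_EINVAL;
        res.r8 = EINVAL;
        res.r10 = -1;
        return res;
    }
    return res;
}
```
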
Signal handling
===============

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited.  This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately.  When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur.  The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.

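The deferral sequence can be followed step by step in a toy state
machine.  This is a hedged model of the description above, not kernel
code: ``cpu_model`` and both functions are hypothetical, and the trap,
its handler and the kernel exit path are collapsed into plain function
calls.

```c
#include <stdbool.h>

/* Hypothetical model of the few bits of state involved. */
struct cpu_model {
    bool in_fsys_mode;
    bool psr_lp;            /* lower-privilege transfer trap armed */
    bool signal_pending;
    bool signal_delivered;
};

/* Models do_notify_resume_user(): defer while in fsys-mode. */
static void model_notify_resume(struct cpu_model *c)
{
    if (c->in_fsys_mode) {
        c->psr_lp = true;   /* arm the trap and return immediately */
        return;
    }
    if (c->signal_pending) {
        c->signal_pending = false;
        c->signal_delivered = true;
    }
}

/* Models the "br.ret" that lowers the privilege level. */
static void model_fsys_return(struct cpu_model *c)
{
    c->in_fsys_mode = false;
    if (c->psr_lp) {
        c->psr_lp = false;        /* trap handler clears PSR.lp */
        model_notify_resume(c);   /* exit path delivers pending signals */
    }
}
```

A signal that arrives while ``in_fsys_mode`` is set therefore stays
pending until ``model_fsys_return()`` runs.
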
PSR Handling
============

The "epc" instruction doesn't change the contents of PSR at all.  This
is in contrast to a regular interruption, which clears almost all
bits.  Because of that, some care needs to be taken to ensure things
work as expected.  The following discussion describes how each PSR bit
is handled.

======= =======================================================================
PSR.be  Cleared when entering fsys-mode.  A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed.  PSR.be is normally NOT
        restored upon return from an fsys-mode handler.  In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.mfh Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.ic  Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.dfh Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged.  The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than
        3 (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect.  If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.  Note: the
        taken branch would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe.  If the system call number is invalid,
        the fsys-mode handler will return directly to user-level.  This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect.  If set, "epc" will cause a Single Step Trap to
        be taken.  The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged.  Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted.  If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged.  Note: the ia64 linux kernel never sets this bit.
======= =======================================================================

Using fast system calls
=======================

To use fast system calls, userspace applications need only call
__kernel_syscall_via_epc().  For example:

-- example fgettimeofday() call --

-- fgettimeofday.S --

::

    #include <asm/asmmacro.h>

    GLOBAL_ENTRY(fgettimeofday)
    .prologue
    .save ar.pfs, r11
    mov r11 = ar.pfs            // save ar.pfs as the fsyscall ABI expects
    .body

    movl r2 = 0xa000000000020660;;  // gate address (64-bit immediate
                                // needs movl), found by inspection of
                                // System.map for the
                                // __kernel_syscall_via_epc() function.  See
                                // below for how to do this for real.

    mov b7 = r2
    mov r15 = 1087              // gettimeofday syscall
    ;;
    br.call.sptk.many b6 = b7   // return address goes in b6
    ;;

    .restore sp

    mov ar.pfs = r11
    br.ret.sptk.many rp;;       // return to caller
    END(fgettimeofday)

-- end fgettimeofday.S --

In reality, the gate address is obtained from two extra values passed
via the ELF auxiliary vector (include/asm-ia64/elf.h):

* AT_SYSINFO : is the address of __kernel_syscall_via_epc()
* AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page.  It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.
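
As a rough illustration of reading these two values from user space,
the sketch below uses getauxval(), a C library call (available in glibc
2.16 and later) that this document does not itself mention.  The
``gate_info`` structure and ``read_gate_info()`` are hypothetical names.
getauxval() returns 0 for an entry that is absent, so on platforms that
do not pass AT_SYSINFO the first field will simply be 0.

```c
#include <elf.h>        /* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <sys/auxv.h>   /* getauxval() */

/* Hypothetical holder for the two auxiliary-vector values. */
struct gate_info {
    unsigned long syscall_entry;   /* AT_SYSINFO: entry point address */
    unsigned long dso_base;        /* AT_SYSINFO_EHDR: gate DSO address */
};

/* Read both values; a field is 0 if the kernel did not supply it. */
static struct gate_info read_gate_info(void)
{
    struct gate_info info;

    info.syscall_entry = getauxval(AT_SYSINFO);
    info.dso_base = getauxval(AT_SYSINFO_EHDR);
    return info;
}
```

A caller would check each field for 0 before using it, and would
normally leave locating symbols inside the DSO to the dynamic loader.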