===================================
Light-weight System Calls for IA-64
===================================

                        Started: 13-Jan-2003

                    Last update: 27-Sep-2003

                      David Mosberger-Tang
                      <davidm@hpl.hp.com>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel.  We call this mode the
"fsys-mode".  To recap, the normal states of execution are:

  - kernel mode:
      Both the register stack and the memory stack have been
      switched over to kernel memory.  The user-level state is saved
      in a pt_regs structure at the top of the kernel memory stack.

  - user mode:
      Both the register stack and the memory stack are in
      user memory.  The user-level state is contained in the
      CPU registers.

  - bank 0 interruption-handling mode:
      This is the non-interruptible state in which all
      interruption-handlers start execution.  The user-level
      state remains in the CPU registers and some kernel state may
      be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  - execution is at privilege level 0 (most-privileged)

  - CPU registers may contain a mixture of user-level and kernel-level
    state (it is the responsibility of the kernel to ensure that no
    security-sensitive kernel-level state is leaked back to
    user-level)

  - execution is interruptible and preemptible (an fsys-mode handler
    can disable interrupts and avoid all other interruption-sources
    to avoid preemption)

  - neither the memory-stack nor the register-stack can be trusted while
    in fsys-mode (they point to the user-level stacks, which may
    be invalid or hold completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode.  Of course, given that the
privilege level is at level 0, this means that fsys-mode requires some
care (see below).


How to tell fsys-mode
=====================

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
three macros::

	user_mode(regs)
	user_stack(task,regs)
	fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure.  The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs.  user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s).  Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression::

	!user_mode(regs) && user_stack(task,regs)

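The relationship between the three predicates can be illustrated with a
small stand-alone model in C.  This is only a sketch: struct model_regs
and the model_*() helpers are hypothetical simplifications invented for
illustration, not the kernel's pt_regs or the real macros from
<asm-ia64/ptrace.h>, which derive the same facts from the saved PSR and
from the position of "regs" within the task's stack area.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the saved CPU state.
 * The real struct pt_regs and task structure carry far more state. */
struct model_regs {
	unsigned int cpl;      /* saved privilege level: 0 = most privileged, 3 = user */
	bool on_user_stack;    /* true if the stacks were not yet switched to kernel memory */
};

/* Models user_mode(): the interrupted code ran at privilege level 3. */
static bool model_user_mode(const struct model_regs *regs)
{
	return regs->cpl == 3;
}

/* Models user_stack(): the interrupted code ran on the user-level stack(s). */
static bool model_user_stack(const struct model_regs *regs)
{
	return regs->on_user_stack;
}

/* Mirrors the documented expression:
 *   !user_mode(regs) && user_stack(task,regs) */
static bool model_fsys_mode(const struct model_regs *regs)
{
	return !model_user_mode(regs) && model_user_stack(regs);
}
```

In this model only the combination "privilege level 0 while still on the
user stacks" counts as fsys-mode; plain user mode and full kernel mode
both yield false.
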

How to write an fsyscall handler
================================

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table).  This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall().  This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler.  For performance-critical system
calls, it is possible to write a hand-tuned fsyscall handler.  For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.

The entry and exit state of an fsyscall handler is as follows:

Machine state on entry to fsyscall handler
------------------------------------------

  ========= ===============================================================
  r10       0
  r11       saved ar.pfs (a user-level value)
  r15       system call number
  r16       "current" task pointer (in normal kernel-mode, this is in r13)
  r32-r39   system call arguments
  b6        return address (a user-level value)
  ar.pfs    previous frame-state (a user-level value)
  PSR.be    cleared to zero (i.e., little-endian byte order is in effect)
  -         all other registers may contain values passed in from user-mode
  ========= ===============================================================

Required machine state on exit from fsyscall handler
----------------------------------------------------

  ========= ===========================================================
  r11       saved ar.pfs (as passed into the fsyscall handler)
  r15       system call number (as passed into the fsyscall handler)
  r32-r39   system call arguments (as passed into the fsyscall handler)
  b6        return address (as passed into the fsyscall handler)
  ar.pfs    previous frame-state (as passed into the fsyscall handler)
  ========= ===========================================================

Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:

 * Fsyscall-handlers MUST check for any pending work in the flags
   member of the thread-info structure and if any of the
   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
   doing a full system call (by calling fsys_fallback_syscall).

 * Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
   r15, b6, and ar.pfs) because they will be needed in case of a
   system call restart.  Of course, all "preserved" registers also
   must be preserved, in accordance with the normal calling conventions.

 * Fsyscall-handlers MUST check whether argument registers contain a
   NaT value before using them in any way that could trigger a
   NaT-consumption fault.  If a system call argument is found to
   contain a NaT value, an fsyscall-handler may return immediately
   with r8=EINVAL, r10=-1.

 * Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
   any other operation that would trigger mandatory RSE
   (register-stack engine) traffic.

 * Fsyscall-handlers MUST NOT write to any stacked registers because
   it is not safe to assume that user-level called a handler with the
   proper number of arguments.

 * Fsyscall-handlers need to be careful when accessing per-CPU variables:
   unless proper safe-guards are taken (e.g., interruptions are avoided),
   execution may be pre-empted and resumed on another CPU at any given
   time.

 * Fsyscall-handlers must be careful not to leak sensitive kernel
   information back to user-level.  In particular, before returning to
   user-level, care needs to be taken to clear any scratch registers
   that could contain sensitive information (note that the current
   task pointer is not considered sensitive: it's already exposed
   through ar.k6).

 * Fsyscall-handlers MUST NOT access user-memory without first
   validating access-permission (this can typically be done via
   probe.r.fault and/or probe.w.fault) and without guarding against
   memory access exceptions (this can be done with the EX() macros
   defined by asmmacro.h).

The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead.  For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed.  In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

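The first and third restrictions above amount to a pre-check that runs
before any hand-tuned work.  The following C model sketches that
decision only.  Real handlers are written in IA-64 assembly in fsys.S;
the names fsys_precheck(), struct fsys_result and enum fsys_action, the
bitmask encoding of NaT bits, and the order of the two checks are all
assumptions made for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical outcome of the pre-check; not a kernel type. */
enum fsys_action {
	FSYS_FAST_PATH,      /* safe to continue with the hand-tuned body */
	FSYS_FALLBACK,       /* pending work: take the full system call path */
	FSYS_RETURN_EINVAL   /* NaT argument: return with r8=EINVAL, r10=-1 */
};

struct fsys_result {
	enum fsys_action action;
	long r8;             /* modeled return value register */
	long r10;            /* modeled error indicator register */
};

/*
 * thread_flags: modeled flags member of the thread-info structure
 * allwork_mask: modeled TIF_ALLWORK_MASK (actual value not assumed here)
 * nat_bits:     bit i set if argument register r(32+i) holds a NaT
 * used_args:    bit i set if the handler actually consumes r(32+i)
 */
static struct fsys_result fsys_precheck(uint32_t thread_flags,
					uint32_t allwork_mask,
					uint8_t nat_bits,
					uint8_t used_args)
{
	struct fsys_result res = { FSYS_FAST_PATH, 0, 0 };

	/* Any pending work forces the heavy-weight path. */
	if (thread_flags & allwork_mask) {
		res.action = FSYS_FALLBACK;
		return res;
	}

	/* Only arguments the handler will consume need to be NaT-free. */
	if (nat_bits & used_args) {
		res.action = FSYS_RETURN_EINVAL;
		res.r8 = EINVAL;
		res.r10 = -1;
		return res;
	}

	return res;
}
```

A handler that consumes only its first two arguments would pass
used_args = 0x3; a NaT in a later, unused argument register then does
not prevent the fast path in this model.
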

Signal handling
===============

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited.  This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately.  When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur.  The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.

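The deferral sequence described above can be followed step by step in a
small C model.  This is a sketch under simplifying assumptions: struct
model_task and both functions are hypothetical, and the real logic is
spread across do_notify_resume_user(), the lower-privilege transfer
trap handler, and the kernel exit path.

```c
#include <stdbool.h>

/* Hypothetical model of the state relevant to signal deferral. */
struct model_task {
	bool in_fsys_mode;      /* interrupted code was in fsys-mode */
	bool psr_lp;            /* lower-privilege transfer trap armed */
	bool signal_pending;    /* an asynchronous signal awaits delivery */
	int  signals_delivered; /* count of deliveries, for observation */
};

/* Models the check made when the kernel is about to resume the task. */
static void model_notify_resume(struct model_task *t)
{
	if (t->in_fsys_mode) {
		/* Do not deliver now; arrange for a trap when privilege drops. */
		t->psr_lp = true;
		return;
	}
	if (t->signal_pending) {
		t->signal_pending = false;
		t->signals_delivered++;
	}
}

/* Models the "br.ret" that leaves fsys-mode and lowers the privilege level. */
static void model_fsys_return(struct model_task *t)
{
	t->in_fsys_mode = false;
	if (t->psr_lp) {
		t->psr_lp = false;        /* trap handler clears PSR.lp again */
		model_notify_resume(t);   /* exit path delivers pending signals */
	}
}
```

In the model, a signal that arrives during fsys-mode stays pending
until model_fsys_return() runs, at which point it is delivered exactly
once.
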

PSR Handling
============

The "epc" instruction doesn't change the contents of PSR at all.  This
is in contrast to a regular interruption, which clears almost all
bits.  Because of that, some care needs to be taken to ensure things
work as expected.  The following discussion describes how each PSR bit
is handled.

======= =======================================================================
PSR.be  Cleared when entering fsys-mode.  A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed.  PSR.be is normally NOT
        restored upon return from an fsys-mode handler.  In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.mfh Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.ic  Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.dfh Unchanged.  Note: fsys-mode handlers must not write-registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged.  The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than
        3 (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect.  If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.  Note: the
        taken-branch trap would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe.  If the system call number is invalid,
        the fsys-mode handler will return directly to user-level.  This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect.  If set, "epc" will cause a Single Step Trap to
        be taken.  The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged.  Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted.  If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged.  Note: the ia64 linux kernel never sets this bit.
======= =======================================================================

Using fast system calls
=======================

To use fast system calls, userspace applications simply need to call
__kernel_syscall_via_epc().  For example:

-- example fgettimeofday() call --

-- fgettimeofday.S --

::

  #include <asm/asmmacro.h>

  GLOBAL_ENTRY(fgettimeofday)
  .prologue
  .save ar.pfs, r11
  mov r11 = ar.pfs
  .body

  mov r2 = 0xa000000000020660;;  // gate address
                                 // found by inspection of System.map for the
                                 // __kernel_syscall_via_epc() function.  See
                                 // below for how to do this for real.

  mov b7 = r2
  mov r15 = 1087                 // gettimeofday syscall
  ;;
  br.call.sptk.many b6 = b7
  ;;

  .restore sp

  mov ar.pfs = r11
  br.ret.sptk.many rp;;          // return to caller
  END(fgettimeofday)

-- end fgettimeofday.S --

In reality, getting the gate address is accomplished by two extra
values passed via the ELF auxiliary vector (include/asm-ia64/elf.h):

 * AT_SYSINFO : is the address of __kernel_syscall_via_epc()
 * AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page.  It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.
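
From user space, the two auxiliary-vector entries can be read without
parsing the vector by hand.  The sketch below assumes a glibc-based
system that provides getauxval() in <sys/auxv.h> (glibc 2.16 or later);
struct gate_info and lookup_gate() are hypothetical names introduced
for illustration.  getauxval() returns 0 for an entry that is not
present, which may be the case for AT_SYSINFO on architectures other
than ia64.

```c
#include <elf.h>        /* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval(), assumed available (glibc >= 2.16) */

/* Hypothetical container for the two values discussed above. */
struct gate_info {
	unsigned long syscall_entry;  /* AT_SYSINFO value, 0 if absent */
	unsigned long dso_base;       /* AT_SYSINFO_EHDR value, 0 if absent */
};

/* Reads both entries from the ELF auxiliary vector of this process. */
static struct gate_info lookup_gate(void)
{
	struct gate_info info;

	info.syscall_entry = getauxval(AT_SYSINFO);
	info.dso_base = getauxval(AT_SYSINFO_EHDR);
	return info;
}

/* Prints what was found; an absent entry is reported rather than used. */
static void print_gate(const struct gate_info *info)
{
	if (info->syscall_entry)
		printf("AT_SYSINFO      = 0x%lx\n", info->syscall_entry);
	else
		printf("AT_SYSINFO      not present\n");

	if (info->dso_base)
		printf("AT_SYSINFO_EHDR = 0x%lx\n", info->dso_base);
	else
		printf("AT_SYSINFO_EHDR not present\n");
}
```

A caller would obtain the values with lookup_gate() and check them for
0 before branching to the entry point or handing the DSO base to a
loader, instead of hard-coding an address taken from System.map.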