unshare system call
===================

This document describes the new system call, unshare(). The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log
----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents
--------
    1) Overview
    2) Benefits
    3) Cost
    4) Requirements
    5) Functional Specification
    6) High Level Design
    7) Low Level Design
    8) Test Specification
    9) Future Work

1) Overview
-----------

Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, makes no distinction
between processes and "threads". The kernel allows processes to share
resources and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel. The
power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.

The unshare() system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare() was conceptualized by
Al Viro in August 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare() augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process. unshare() is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.

2) Benefits
-----------

unshare() would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare() can benefit
even non-threaded applications if they have a need to disassociate
from the default shared namespace. The following lists two use-cases
where unshare() can be used.

2.1 Per-security context namespaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

unshare() can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instance of /tmp, /var/tmp or
per-security context instance of a user's home directory, isolate user
processes when working with these directories. Using unshare(), a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with Labeled System Protection Profile; however, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.
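
As a rough illustration of the login-time setup described above, the
following sketch unshares the namespace and bind-mounts a per-user
directory over /tmp. The ``/tmp-inst/<user>`` layout and both function
names are assumptions made for this example, not part of PAM or the
kernel. Validation of the user name is omitted, and the mount calls
need CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>      /* unshare(), CLONE_NEWNS */
#include <stdio.h>      /* snprintf() */
#include <sys/mount.h>  /* mount(), MS_* */

/*
 * Hypothetical layout: each user's private /tmp lives in /tmp-inst/<user>.
 * Returns the path length, or -1 with errno set if buf is too small.
 */
int instance_path(char *buf, size_t len, const char *user)
{
    int n;

    assert(buf && user);
    n = snprintf(buf, len, "/tmp-inst/%s", user);
    if (n < 0 || (size_t)n >= len) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return n;
}

/* Returns 0 on success, -1 with errno set on failure. */
int polyinstantiate_tmp(const char *user)
{
    char src[256];

    if (instance_path(src, sizeof(src), user) < 0)
        return -1;
    /* Give the caller its own copy of the namespace. */
    if (unshare(CLONE_NEWNS) != 0)
        return -1;
    /* Keep later mounts from propagating back to the original namespace. */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
        return -1;
    /* Cover /tmp with the per-user instance, visible only to this namespace. */
    return mount(src, "/tmp", NULL, MS_BIND, NULL);
}
```

A login helper would call ``polyinstantiate_tmp("alice")`` before
starting the user's session; processes outside the new namespace keep
seeing the original /tmp.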

2.2 Unsharing of virtual memory and/or open files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare(), the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare() gives the server the ability to
disassociate parts of the context during the servicing of the
request. For large and complex middleware application frameworks, this
ability to unshare() after the process was created can be very
useful.
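
A minimal sketch of such a worker, assuming it was created with
clone(2) and CLONE_FILES and/or CLONE_FS: partway through a request it
stops sharing those parts of its context. The function name
``worker_detach`` is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>    /* unshare(), CLONE_FILES, CLONE_FS */

/*
 * Hypothetical worker hook: stop sharing the descriptor table and/or the
 * file system information with the processes this worker was cloned
 * alongside. Returns 0 on success or a negative errno value on failure.
 */
int worker_detach(int flags)
{
    /* This sketch only handles the two contexts named above. */
    assert((flags & ~(CLONE_FILES | CLONE_FS)) == 0);

    if (unshare(flags) != 0)
        return -errno;
    return 0;
}
```

After ``worker_detach(CLONE_FILES)`` succeeds, descriptors the worker
opens or closes are no longer visible to, or affected by, its former
peers.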

3) Cost
-------

In order to not duplicate code and to handle the fact that unshare()
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare() had to make minor reorganizational
changes to copy_* functions utilized by the clone/fork system calls.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare() test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements
---------------

unshare() reverses sharing that was done using the clone(2) system call,
so unshare() should have a similar interface as clone(2). That is,
since flags in clone(int flags, void \*stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in the future without an ABI change.

The unshare() interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare() design should allow
incremental unsharing of those resources on an as-needed basis.

5) Functional Specification
---------------------------

NAME
    unshare - disassociate parts of the process execution context

SYNOPSIS
    #include <sched.h>

    int unshare(int flags);

DESCRIPTION
    unshare() allows a process to disassociate parts of its execution
    context that are currently being shared with other processes. Part
    of the execution context, such as the namespace, is shared by default
    when a new process is created using fork(2), while other parts,
    such as the virtual memory, open file descriptors, etc., may be
    shared by explicit request to share them when creating a process
    using clone(2).

    The main use of unshare() is to allow a process to control its
    shared execution context without creating a new process.

    The flags argument specifies one of the following constants, or
    several of them bitwise-or'ed together.

    CLONE_FS
        If CLONE_FS is set, file system information of the caller
        is disassociated from the shared file system information.

    CLONE_FILES
        If CLONE_FILES is set, the file descriptor table of the
        caller is disassociated from the shared file descriptor
        table.

    CLONE_NEWNS
        If CLONE_NEWNS is set, the namespace of the caller is
        disassociated from the shared namespace.

    CLONE_VM
        If CLONE_VM is set, the virtual memory of the caller is
        disassociated from the shared virtual memory.

RETURN VALUE
    On success, zero is returned. On failure, -1 is returned and errno is
    set to indicate the error.

ERRORS
    EPERM   CLONE_NEWNS was specified by a non-root process (process
            without CAP_SYS_ADMIN).

    ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
            context that need to be unshared.

    EINVAL  Invalid flag was specified as an argument.

CONFORMING TO
    The unshare() call is Linux-specific and should not be used
    in programs intended to be portable.

SEE ALSO
    clone(2), fork(2)
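
The return-value and error conventions above could be handled along
the following lines. This is only a sketch: the helper names
``unshare_errno_hint`` and ``unshare_or_report`` are made up for the
example, and the hint strings paraphrase the ERRORS entries.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>   /* unshare(), CLONE_* */
#include <stdio.h>   /* fprintf() */

/* Map the errno values listed under ERRORS to a short explanation. */
const char *unshare_errno_hint(int err)
{
    switch (err) {
    case EPERM:
        return "CLONE_NEWNS requires CAP_SYS_ADMIN";
    case ENOMEM:
        return "not enough memory to copy the context";
    case EINVAL:
        return "invalid flag in the flags argument";
    default:
        return "unexpected error";
    }
}

/* Call unshare() and report a failure on stderr. Returns 0 or -1. */
int unshare_or_report(int flags)
{
    int saved;

    assert(flags != 0);  /* this sketch treats "nothing to unshare" as a bug */
    if (unshare(flags) == 0)
        return 0;
    saved = errno;       /* fprintf() may overwrite errno */
    fprintf(stderr, "unshare(0x%x): %s\n", (unsigned int)flags,
            unshare_errno_hint(saved));
    errno = saved;
    return -1;
}
```

For instance, ``unshare_or_report(CLONE_FS | CLONE_FILES)`` requests
both contexts in a single call by bitwise-or'ing the flags.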

6) High Level Design
--------------------

Depending on the flags argument, the unshare() system call allocates
appropriate process context structures, populates them with values from
the current shared version, associates newly duplicated structures
with the current task structure and releases corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare() for the following two reasons.

  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare() operates on the current active
     task. Therefore unshare() has to take the appropriate task_lock()
     before associating newly duplicated context structures.

  2) unshare() has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     the new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to the older, shared, vm
     structure, which may not exist anymore.

Therefore code from copy_* functions that allocated and duplicated
current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed. The unshare() system call,
on the other hand, performs the following:

  1) Check flags to force missing, but implied, flags

  2) For each context structure, call the corresponding unshare()
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.

  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.

  4) Appropriately release older, shared, context structures.
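
The ordering in the steps above (duplicate everything first, then
commit, then release) can be modelled in user space. The sketch below
is a toy model under stated assumptions: ``struct ctx``,
``struct task``, the ``TOY_*`` flags and the ``fail_files`` switch are
invented for illustration and are not the kernel's types or code.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for reference-counted kernel context structures. */
struct ctx  { int refs; int data; };
struct task { struct ctx *fs; struct ctx *files; };

#define TOY_FS    0x1
#define TOY_FILES 0x2

static struct ctx *dup_ctx(const struct ctx *old, int fail)
{
    struct ctx *c = fail ? NULL : malloc(sizeof(*c));

    if (c) {
        c->refs = 1;
        c->data = old->data;
    }
    return c;
}

static void put_ctx(struct ctx *c)
{
    if (c && --c->refs == 0)
        free(c);
}

/* fail_files lets a caller simulate an allocation failure. */
int toy_unshare(struct task *t, int flags, int fail_files)
{
    struct ctx *new_fs = NULL, *new_files = NULL, *old;

    assert(t && t->fs && t->files);

    /* Allocate and duplicate everything; the task is not touched yet. */
    if ((flags & TOY_FS) && !(new_fs = dup_ctx(t->fs, 0)))
        goto fail;
    if ((flags & TOY_FILES) && !(new_files = dup_ctx(t->files, fail_files)))
        goto fail;

    /* Commit: swap pointers (the kernel does this under task_lock()). */
    if (new_fs) {
        old = t->fs;
        t->fs = new_fs;
        new_fs = old;
    }
    if (new_files) {
        old = t->files;
        t->files = new_files;
        new_files = old;
    }

    /* Release this task's references to the older, shared structures. */
    put_ctx(new_fs);
    put_ctx(new_files);
    return 0;

fail:
    /* Only fresh copies exist here, so there is nothing to roll back. */
    put_ctx(new_fs);
    put_ctx(new_files);
    return -1;
}
```

Because the task keeps pointing at the shared structures until every
duplicate exists, a failure in the second allocation leaves the task
exactly as it was.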

7) Low Level Design
-------------------

Implementation of unshare() can be grouped into the following four
items:

  a) Reorganization of existing copy_* functions

  b) unshare() system call service function

  c) unshare() helper functions for each different process context

  d) Registration of system call number for different architectures

7.1) Reorganization of copy_* functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each copy function such as copy_mm, copy_namespace, copy_files,
etc., had roughly two components. The first component allocated
and duplicated the appropriate structure and the second component
linked it to the task structure passed in as an argument to the copy
function. The first component was split into its own function.
These dup_* functions allocated and duplicated the appropriate
context structure. The reorganized copy_* functions invoked
their corresponding dup_* functions and then linked the newly
duplicated structures to the task structure with which the
copy function was called.
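
The shape of that split might look as follows in a user-space toy.
``struct toy_fs``, ``struct toy_task`` and both function names are
assumptions for illustration; the kernel's structures and signatures
differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy types for illustration only. */
struct toy_fs   { int umask; };
struct toy_task { struct toy_fs *fs; };

/* First component: allocate a new structure and copy the old contents. */
struct toy_fs *dup_toy_fs(const struct toy_fs *old)
{
    struct toy_fs *fs = malloc(sizeof(*fs));

    if (fs)
        *fs = *old;
    return fs;
}

/* Second component: link the duplicate to the task being constructed. */
int copy_toy_fs(struct toy_task *child, const struct toy_task *parent)
{
    assert(child && parent && parent->fs);

    child->fs = dup_toy_fs(parent->fs);
    return child->fs ? 0 : -1;
}
```

With the allocation step isolated in ``dup_toy_fs()``, a second caller
that works on an already running task can reuse it without going
through the linking step.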

7.2) unshare() system call service function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  * Check flags
    Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
    If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
    set and signals are also being shared, force CLONE_THREAD. If
    CLONE_NEWNS is set, force CLONE_FS.

  * For each context flag, invoke the corresponding unshare_*
    helper routine with flags passed into the system call and a
    reference to a pointer that will point to the new unshared
    structure.

  * If any new structures are created by unshare_* helper
    functions, take the task_lock() on the current task,
    modify appropriate context pointers, and release the
    task lock.

  * For all newly unshared structures, release the corresponding
    older, shared, structures.
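
The flag-forcing rules in the first step can be written down directly.
This is a user-space sketch of the rules as stated above, not the
kernel's code; the ``signals_shared`` parameter is an assumption that
stands in for the kernel's own check of whether the caller's signal
state is shared with another task.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>   /* CLONE_* constants */

/* Return flags with every implied flag added. */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;
    return flags;
}
```

The order of the checks matters: CLONE_THREAD implies CLONE_VM, which
in turn implies CLONE_SIGHAND, so each rule must see the result of the
one before it.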

7.3) unshare_* helper functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
and CLONE_THREAD, return -EINVAL since they are not implemented yet.
For others, check the flag value to see if the unsharing is
required for that structure. If it is, invoke the corresponding
dup_* function to allocate and duplicate the structure and return
a pointer to it.
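
A toy version of the two kinds of helper is sketched below. The names
``toy_unshare_sighand`` and ``toy_unshare_fs`` and the
``struct toy_fs`` type are hypothetical, and ``malloc()`` plus a
structure copy plays the role of the dup_* function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>    /* CLONE_* constants */
#include <stdlib.h>

struct toy_fs { int umask; };   /* toy stand-in, not the kernel's type */

/* Helper for a context whose unsharing is not implemented. */
int toy_unshare_sighand(unsigned long flags)
{
    return (flags & CLONE_SIGHAND) ? -EINVAL : 0;
}

/*
 * Helper for a supported context: duplicate only when the flag is set.
 * *new_fsp is left NULL when no unsharing was requested.
 */
int toy_unshare_fs(unsigned long flags, const struct toy_fs *cur,
                   struct toy_fs **new_fsp)
{
    assert(cur && new_fsp);

    *new_fsp = NULL;
    if (!(flags & CLONE_FS))
        return 0;
    *new_fsp = malloc(sizeof(**new_fsp));   /* plays the role of dup_*() */
    if (!*new_fsp)
        return -ENOMEM;
    **new_fsp = *cur;
    return 0;
}
```

Returning the new structure through a pointer argument lets the caller
collect every duplicate first and decide afterwards whether anything
needs to be committed.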

7.4) Finally
~~~~~~~~~~~~

Appropriately modify architecture specific code to register the
new system call.

8) Test Specification
---------------------

The test for unshare() should test the following:

  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.

  2) Missing/implied flags: Test to make sure that unsharing the
     namespace without specifying unsharing of the filesystem correctly
     unshares both namespace and filesystem information.

  3) For each of the four (namespace, filesystem, files and vm)
     supported kinds of unsharing, verify that the system call
     correctly unshares the appropriate structure. Verify that
     unsharing them individually as well as in combination with each
     other works as expected.

  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple _exit and the rest unshare with different combinations
     of flags. Verify that unsharing is performed as expected and
     that there are no oops or hangs.
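
One way the "files" case of item 3 could be checked from user space is
sketched below; it is not the LTP test. A child created with clone(2)
and CLONE_FILES unshares its descriptor table and closes a descriptor,
and the parent verifies that its own descriptor is still open. The
function names and the result codes are choices made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>      /* clone(), unshare(), CLONE_FILES */
#include <signal.h>     /* SIGCHLD */
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Child: unshare the descriptor table, then close fd in the private copy. */
static int child_fn(void *arg)
{
    int fd = *(int *)arg;

    if (unshare(CLONE_FILES) != 0)
        return 2;               /* unshare() unavailable or refused */
    close(fd);
    return 0;
}

/*
 * Returns 0 if the parent's fd survived the child's close(), 1 if it did
 * not, 2 if the child could not call unshare(), and -1 on setup errors.
 */
int check_unshare_files(void)
{
    int fd = open("/dev/null", O_RDONLY);
    char *stack = malloc(STACK_SIZE);
    int status, result = -1;
    pid_t pid;

    if (fd < 0 || !stack)
        goto out;
    /* The stack grows down on most architectures, so pass its top. */
    pid = clone(child_fn, stack + STACK_SIZE, CLONE_FILES | SIGCHLD, &fd);
    if (pid < 0)
        goto out;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        goto out;
    if (WEXITSTATUS(status) != 0)
        result = WEXITSTATUS(status);
    else
        result = (fcntl(fd, F_GETFD) != -1) ? 0 : 1;
out:
    if (fd >= 0)
        close(fd);
    free(stack);
    return result;
}
```

Removing the unshare() call from ``child_fn()`` should make the check
return 1, since the close() would then act on the table shared with
the parent.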

9) Future Work
--------------

The current implementation of unshare() does not allow unsharing of
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare() without affecting legacy
applications using unshare().