.. SPDX-License-Identifier: GPL-2.0

=====================================================
Mandatory File Locking For The Linux Operating System
=====================================================

        Andy Walker <andy@lysaker.kvaerner.no>

                    15 April 1996

              (Updated September 2007)

0. Why you should avoid mandatory locking
-----------------------------------------

The Linux implementation is subject to several difficult-to-fix race
conditions which, in practice, make it unreliable:

  - The write system call checks for a mandatory lock only once
    at its start. It is therefore possible for a lock request to
    be granted after this check but before the data is modified.
    A process may then see file data change even while a mandatory
    lock is held.
  - Similarly, an exclusive lock may be granted on a file after
    the kernel has decided to proceed with a read, but before the
    read has actually completed, and the reading process may see
    the file data in a state which should not have been visible
    to it.
  - Comparable races make the claimed mutual exclusion between locks
    and mmap() equally unreliable.

1. What is mandatory locking?
------------------------------

Mandatory locking is kernel-enforced file locking, as opposed to the more usual
cooperative file locking used to guarantee sequential access to files among
processes. File locks are applied using the flock() and fcntl() system calls
(and the lockf() library routine, which is a wrapper around fcntl()). It is
normally a process's responsibility to check for locks on a file it wishes to
update, before applying its own lock, updating the file and unlocking it again.
The most commonly used example of this (and in the case of sendmail, the most
troublesome) is access to a user's mailbox. The mail user agent and the mail
transfer agent must guard against updating the mailbox at the same time, and
prevent reading the mailbox while it is being updated.

In a perfect world all processes would use and honour a cooperative, or
"advisory", locking scheme. However, the world isn't perfect, and there's
a lot of poorly written code out there.

In trying to address this problem, the designers of System V UNIX came up
with a "mandatory" locking scheme, whereby the operating system kernel would
block attempts by a process to write to a file that another process holds a
"read" -or- "shared" lock on, and block attempts either to read from or to
write to a file that a process holds a "write" -or- "exclusive" lock on.

The System V mandatory locking scheme was intended to have as little impact as
possible on existing user code. The scheme is based on marking individual files
as candidates for mandatory locking, and using the existing fcntl()/lockf()
interface for applying locks just as if they were normal, advisory locks.

.. Note::

   1. In saying "file" in the paragraphs above I am actually not telling
      the whole truth. System V locking is based on fcntl(). The granularity of
      fcntl() is such that it allows the locking of byte ranges in files, in
      addition to entire files, so the mandatory locking rules also have byte
      level granularity.

   2. POSIX.1 does not specify any scheme for mandatory locking, despite
      borrowing the fcntl() locking scheme from System V. The mandatory locking
      scheme is defined by the System V Interface Definition (SVID) Version 3.

2. Marking a file for mandatory locking
---------------------------------------

A file is marked as a candidate for mandatory locking by setting the
set-group-ID (setgid) bit in its file mode while clearing the group-execute
bit. This is an otherwise meaningless combination, and was chosen by the
System V implementors so as not to break existing user programs.

Note that the setgid bit is usually automatically cleared by the kernel when
a setgid file is written to. This is a security measure. The kernel has been
modified to recognize the special case of a mandatory lock candidate and to
refrain from clearing this bit. Similarly, the kernel has been modified not
to run mandatory lock candidates with setgid privileges.

3. Available implementations
----------------------------

I have considered the implementations of mandatory locking available with
SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.

Generally I have tried to make the most sense out of the behaviour exhibited
by these three reference systems. There are many anomalies.

All the reference systems reject all calls to open() for a file on which
another process has outstanding mandatory locks. This is in direct
contravention of SVID 3, which states that only calls to open() with the
O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
definition, which is the "Right Thing", since only calls with O_TRUNC can
modify the contents of the file.

HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
just mandatory locks. That would appear to contravene POSIX.1.

mmap() is another interesting case. All the operating systems mentioned
prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
also disallows advisory locks for such a file. SVID actually specifies the
paranoid HP-UX behaviour.

In my opinion only MAP_SHARED mappings should be immune from locking, and then
only from mandatory locks - that is what is currently implemented.

SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
mandatory locks, so reads and writes to locked files always block when they
should return EAGAIN.

I'm afraid that this is such an esoteric area that the semantics described
below are just as valid as any others, so long as the main points seem to
agree.

4. Semantics
------------

1. Mandatory locks can only be applied via the fcntl()/lockf() locking
   interface - in other words the System V/POSIX interface. BSD-style
   locks using flock() never result in a mandatory lock.

2. If a process has locked a region of a file with a mandatory read lock, then
   other processes are permitted to read from that region. If any of these
   processes attempts to write to the region it will block until the lock is
   released, unless the process has opened the file with the O_NONBLOCK
   flag, in which case the system call will return immediately with the error
   status EAGAIN.

3. If a process has locked a region of a file with a mandatory write lock, all
   attempts to read or write to that region block until the lock is released,
   unless a process has opened the file with the O_NONBLOCK flag, in which case
   the system call will return immediately with the error status EAGAIN.

4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has
   any mandatory locks owned by other processes will be rejected with the
   error status EAGAIN.

5. Attempts to apply a mandatory lock to a file that is memory mapped and
   shared (via mmap() with MAP_SHARED) will be rejected with the error status
   EAGAIN.

6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
   that has any mandatory locks in effect will be rejected with the error status
   EAGAIN.
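
From the point of view of a process that does not hold the lock, rules 2 and 3
mean that a write() on a descriptor opened with O_NONBLOCK either succeeds or
fails with EAGAIN. The sketch below shows one way a caller might separate those
outcomes. The names open_nonblocking(), try_write() and enum write_result are
invented for this example, and on a kernel or filesystem without mandatory
locking enabled the WRITE_WOULD_BLOCK branch is not expected to be taken for a
regular file.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Outcome of one non-blocking write attempt (hypothetical type). */
enum write_result { WRITE_DONE, WRITE_WOULD_BLOCK, WRITE_FAILED };

/* Open an existing file for reading and writing with O_NONBLOCK set. */
int open_nonblocking(const char *path)
{
    return open(path, O_RDWR | O_NONBLOCK);
}

/*
 * Try a single write() on fd. On success the byte count is stored in
 * *written. EAGAIN is reported separately so the caller can retry
 * later; under rules 2 and 3 that is how a conflicting mandatory lock
 * appears on a descriptor opened with O_NONBLOCK.
 */
enum write_result try_write(int fd, const void *buf, size_t len,
                            ssize_t *written)
{
    ssize_t n = write(fd, buf, len);

    if (n >= 0) {
        *written = n;
        return WRITE_DONE;
    }
    return errno == EAGAIN ? WRITE_WOULD_BLOCK : WRITE_FAILED;
}
```

A caller that receives WRITE_WOULD_BLOCK can back off and retry, which avoids
being suspended indefinitely behind a lock held by another process.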

5. Which system calls are affected?
-----------------------------------

Those which access or modify a file's contents, not just the inode. That gives
read(), write(), readv(), writev(), open(), creat(), mmap(), truncate() and
ftruncate(). truncate() and ftruncate() are considered to be "write" actions
for the purposes of mandatory locking.

The affected region is usually defined as stretching from the current position
for the total number of bytes read or written. For the truncate calls it is
defined as the bytes of a file removed or added (we must also consider bytes
added, as a lock can specify just "the whole file", rather than a specific
range of bytes).
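
The region rule can be made concrete with a little range arithmetic. The
sketch below only illustrates the definition given above; it is not taken from
the kernel, and struct byte_range, RANGE_TO_EOF and the three functions are
names assumed for this example.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Half-open byte range [start, end). RANGE_TO_EOF used as `end` means
 * "through the end of the file, however large it grows", comparable to
 * a whole-file lock.
 */
#define RANGE_TO_EOF LLONG_MAX

struct byte_range {
    long long start;
    long long end;
};

/* Region touched by a read or write of count bytes at offset pos. */
struct byte_range io_region(long long pos, long long count)
{
    struct byte_range r = { pos, pos + count };
    return r;
}

/*
 * Region touched by changing a file's size from old_size to new_size:
 * the bytes removed when shrinking, or the bytes added when growing.
 */
struct byte_range truncate_region(long long old_size, long long new_size)
{
    struct byte_range r;

    r.start = old_size < new_size ? old_size : new_size;
    r.end = old_size < new_size ? new_size : old_size;
    return r;
}

/* True if both ranges are non-empty and share at least one byte. */
bool ranges_overlap(struct byte_range a, struct byte_range b)
{
    return a.start < a.end && b.start < b.end &&
           a.start < b.end && b.start < a.end;
}
```

With a whole-file lock, growing a file from 100 to 200 bytes touches the range
[100, 200), which overlaps the lock. With a lock covering only bytes 0 to 49
the same call touches nothing that is locked.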

Note 3: I may have overlooked some system calls that need mandatory lock
checking in my eagerness to get this code out the door. Please let me know, or
better still fix the system calls yourself and submit a patch to me or Linus.

6. Warning!
-----------

Not even root can override a mandatory lock, so runaway processes can wreak
havoc if they lock crucial files. The way around it is to change the file
permissions (remove the setgid bit) before trying to read or write to it.
Of course, that might be a bit tricky if the system is hung :-(

7. The "mand" mount option
--------------------------
Mandatory locking is disabled on all filesystems by default, and must be
administratively enabled by mounting with "-o mand". That mount option
is only allowed if the mounting task has the CAP_SYS_ADMIN capability.

Since kernel v4.5, it is possible to disable mandatory locking
altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel
built with this option disabled will reject attempts to mount filesystems
with the "mand" mount option, returning the error status EPERM.
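
A program can check whether a given mount point was mounted with "mand" by
reading the mount table. The sketch below assumes the glibc <mntent.h>
interface and a mount table in the usual format, such as /proc/self/mounts.
Both function names are invented for this example.

```c
#include <mntent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if the entry's option string contains the "mand" option. */
bool entry_allows_mandatory_locks(const struct mntent *ent)
{
    return hasmntopt(ent, "mand") != NULL;
}

/*
 * Look up mount_dir in the mount table at `table`.
 * Returns 1 if it is mounted with "mand", 0 if it is mounted without,
 * and -1 if the table cannot be read or has no entry for mount_dir.
 * Hypothetical helper, assuming the glibc getmntent() family.
 */
int mount_allows_mandatory_locks(const char *table, const char *mount_dir)
{
    FILE *fp = setmntent(table, "r");
    struct mntent *ent;
    int result = -1;

    if (fp == NULL)
        return -1;
    while ((ent = getmntent(fp)) != NULL) {
        /* Keep the last match: a later mount shadows an earlier one. */
        if (strcmp(ent->mnt_dir, mount_dir) == 0)
            result = entry_allows_mandatory_locks(ent) ? 1 : 0;
    }
    endmntent(fp);
    return result;
}
```

For example, mount_allows_mandatory_locks("/proc/self/mounts", "/home")
reports whether /home was mounted with the option, which is worth checking
before relying on any of the semantics described in section 4.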