=======
Locking
=======

The text below describes the locking rules for VFS-related methods.
It is (believed to be) up-to-date. *Please*, if you change anything in
prototypes or locking protocols - update this file. And update the relevant
instances in the tree, don't leave that to maintainers of filesystems/devices/
etc. At the very least, put the list of dubious cases at the end of this file.
Don't turn it into a log - maintainers of out-of-the-tree code are supposed to
be able to use diff(1).

Thing currently missing here: socket operations. Alexey?

dentry_operations
=================

prototypes::

	int (*d_revalidate)(struct dentry *, unsigned int);
	int (*d_weak_revalidate)(struct dentry *, unsigned int);
	int (*d_hash)(const struct dentry *, struct qstr *);
	int (*d_compare)(const struct dentry *,
			unsigned int, const char *, const struct qstr *);
	int (*d_delete)(struct dentry *);
	int (*d_init)(struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_prune)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);
	char *(*d_dname)(struct dentry *dentry, char *buffer, int buflen);
	struct vfsmount *(*d_automount)(struct path *path);
	int (*d_manage)(const struct path *, bool);
	struct dentry *(*d_real)(struct dentry *, const struct inode *);

locking rules:

================== =========== ======== ============== ========
ops                rename_lock ->d_lock may block      rcu-walk
================== =========== ======== ============== ========
d_revalidate:      no          no       yes (ref-walk) maybe
d_weak_revalidate: no          no       yes            no
d_hash:            no          no       no             maybe
d_compare:         yes         no       no             maybe
d_delete:          no          yes      no             no
d_init:            no          no       yes            no
d_release:         no          no       yes            no
d_prune:           no          yes      no             no
d_iput:            no          no       yes            no
d_dname:           no          no       no             no
d_automount:       no          no       yes            no
d_manage:          no          no       yes (ref-walk) maybe
d_real:            no          no       yes            no
================== =========== ======== ============== ========
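
The "maybe" entries in the rcu-walk column mean that the method can be
invoked during an RCU path walk, where it must not block. The following is a
minimal userspace sketch, not kernel code, of the usual pattern for
``->d_revalidate()``: decide from cached state if possible, and otherwise
return ``-ECHILD`` so that the VFS retries in ref-walk mode, where blocking is
allowed. The ``demo_*`` names and the ``LOOKUP_RCU`` value defined here are
stand-ins for illustration only; the ``-ECHILD`` convention comes from the
kernel's general VFS documentation rather than from this file.

```c
#include <errno.h>

/* Stand-in for the kernel's LOOKUP_RCU lookup flag; value is illustrative. */
#define LOOKUP_RCU 0x0040u

/* Hypothetical per-dentry state for this sketch (not struct dentry). */
struct demo_dentry {
	int cached_valid;	/* 1 if validity is known without blocking */
	int server_valid;	/* what a blocking round trip would report */
};

/* Hypothetical helper standing in for work that may sleep. */
static int demo_ask_server(const struct demo_dentry *d)
{
	return d->server_valid;
}

/* Returns 1 (valid), 0 (invalid) or -ECHILD (retry me in ref-walk mode). */
static int demo_d_revalidate(const struct demo_dentry *d, unsigned int flags)
{
	if (d->cached_valid)
		return 1;		/* decided without blocking */
	if (flags & LOOKUP_RCU)
		return -ECHILD;		/* would have to block: bail out */
	return demo_ask_server(d);	/* ref-walk: blocking is allowed */
}
```

A caller modelling the VFS would first try with ``LOOKUP_RCU`` set and, on
``-ECHILD``, call again with the flag cleared.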

inode_operations
================

prototypes::

	int (*create) (struct inode *,struct dentry *,umode_t, bool);
	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct inode *,struct dentry *,const char *);
	int (*mkdir) (struct inode *,struct dentry *,umode_t);
	int (*rmdir) (struct inode *,struct dentry *);
	int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
	int (*rename) (struct inode *, struct dentry *,
			struct inode *, struct dentry *, unsigned int);
	int (*readlink) (struct dentry *, char __user *,int);
	const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
	void (*truncate) (struct inode *);
	int (*permission) (struct inode *, int, unsigned int);
	struct posix_acl * (*get_acl)(struct inode *, int, bool);
	int (*setattr) (struct dentry *, struct iattr *);
	int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
	ssize_t (*listxattr) (struct dentry *, char *, size_t);
	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
	void (*update_time)(struct inode *, struct timespec *, int);
	int (*atomic_open)(struct inode *, struct dentry *,
				struct file *, unsigned open_flag,
				umode_t create_mode);
	int (*tmpfile) (struct inode *, struct dentry *, umode_t);
	int (*fileattr_set)(struct user_namespace *mnt_userns,
			    struct dentry *dentry, struct fileattr *fa);
	int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);

locking rules:
	all may block

============= ==================================================
ops           i_rwsem(inode)
============= ==================================================
lookup:       shared
create:       exclusive
link:         exclusive (both)
mknod:        exclusive
symlink:      exclusive
mkdir:        exclusive
unlink:       exclusive (both)
rmdir:        exclusive (both) (see below)
rename:       exclusive (all) (see below)
readlink:     no
get_link:     no
setattr:      exclusive
permission:   no (may not block if called in rcu-walk mode)
get_acl:      no
getattr:      no
listxattr:    no
fiemap:       no
update_time:  no
atomic_open:  shared (exclusive if O_CREAT is set in open flags)
tmpfile:      no
fileattr_get: no or exclusive
fileattr_set: exclusive
============= ==================================================


	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
	exclusive on victim.
	Cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.

See Documentation/filesystems/directory-locking.rst for a more detailed
discussion of the locking scheme for directory operations.
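
Two of the rules above depend on the call's arguments rather than on the
method alone. The sketch below restates them as plain userspace C, purely as
a reading aid: it covers only what the table and the note say about
``->atomic_open()`` and ``->rename()``, and every ``demo_*`` name is
hypothetical. It is not how the VFS computes or takes these locks.

```c
#include <fcntl.h>
#include <stdbool.h>

/* How a lock is held in this model; names are illustrative, not kernel API. */
enum demo_lock_mode { DEMO_SHARED, DEMO_EXCLUSIVE };

/* Directory i_rwsem mode around ->atomic_open(), as given in the table. */
static enum demo_lock_mode demo_atomic_open_mode(int open_flags)
{
	return (open_flags & O_CREAT) ? DEMO_EXCLUSIVE : DEMO_SHARED;
}

/* Locks the text above says are held when ->rename() is entered. */
struct demo_rename_locks {
	bool s_vfs_rename_sem;	/* per-superblock, cross-directory only */
	int parents_exclusive;	/* distinct parent directories held exclusive */
	bool victim_exclusive;	/* i_rwsem of an existing target, if any */
};

static struct demo_rename_locks
demo_rename_lock_set(const void *old_dir, const void *new_dir, bool has_victim)
{
	struct demo_rename_locks held;
	bool cross_dir = (old_dir != new_dir);

	held.s_vfs_rename_sem = cross_dir;
	held.parents_exclusive = cross_dir ? 2 : 1;
	held.victim_exclusive = has_victim;
	return held;
}
```

The lock ordering between the two parents is deliberately left out here; that
is the subject of directory-locking.rst.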

xattr_handler operations
========================

prototypes::

	bool (*list)(struct dentry *dentry);
	int (*get)(const struct xattr_handler *handler, struct dentry *dentry,
		   struct inode *inode, const char *name, void *buffer,
		   size_t size);
	int (*set)(const struct xattr_handler *handler,
		   struct user_namespace *mnt_userns,
		   struct dentry *dentry, struct inode *inode, const char *name,
		   const void *buffer, size_t size, int flags);

locking rules:
	all may block

===== ==============
ops   i_rwsem(inode)
===== ==============
list: no
get:  no
set:  exclusive
===== ==============

super_operations
================

prototypes::

	struct inode *(*alloc_inode)(struct super_block *sb);
	void (*free_inode)(struct inode *);
	void (*destroy_inode)(struct inode *);
	void (*dirty_inode) (struct inode *, int flags);
	int (*write_inode) (struct inode *, struct writeback_control *wbc);
	int (*drop_inode) (struct inode *);
	void (*evict_inode) (struct inode *);
	void (*put_super) (struct super_block *);
	int (*sync_fs)(struct super_block *sb, int wait);
	int (*freeze_fs) (struct super_block *);
	int (*unfreeze_fs) (struct super_block *);
	int (*statfs) (struct dentry *, struct kstatfs *);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*umount_begin) (struct super_block *);
	int (*show_options)(struct seq_file *, struct dentry *);
	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);

locking rules:
	All may block [not true, see below]

====================== ============ ========================
ops                    s_umount     note
====================== ============ ========================
alloc_inode:
free_inode:                         called from RCU callback
destroy_inode:
dirty_inode:
write_inode:
drop_inode:                         !!!inode->i_lock!!!
evict_inode:
put_super:             write
sync_fs:               read
freeze_fs:             write
unfreeze_fs:           write
statfs:                maybe(read)  (see below)
remount_fs:            write
umount_begin:          no
show_options:          no           (namespace_sem)
quota_read:            no           (see below)
quota_write:           no           (see below)
====================== ============ ========================

->statfs() has s_umount (shared) when called by ustat(2) (native or
compat), but that's an accident of bad API; s_umount is used to pin
the superblock down when we only have dev_t given us by userland to
identify the superblock. Everything else (statfs(), fstatfs(), etc.)
doesn't hold it when calling ->statfs() - superblock is pinned down
by resolving the pathname passed to syscall.

->quota_read() and ->quota_write() functions are both guaranteed to
be the only ones operating on the quota file by the quota code (via
dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking
see also dquot_operations section.

file_system_type
================

prototypes::

	struct dentry *(*mount) (struct file_system_type *, int,
		       const char *, void *);
	void (*kill_sb) (struct super_block *);

locking rules:

======= =========
ops     may block
======= =========
mount   yes
kill_sb yes
======= =========

->mount() returns ERR_PTR or the root dentry; its superblock should be locked
on return.

->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference.

address_space_operations
========================

prototypes::

	int (*writepage)(struct page *page, struct writeback_control *wbc);
	int (*readpage)(struct file *, struct page *);
	int (*writepages)(struct address_space *, struct writeback_control *);
	int (*set_page_dirty)(struct page *page);
	void (*readahead)(struct readahead_control *);
	int (*readpages)(struct file *filp, struct address_space *mapping,
			struct list_head *pages, unsigned nr_pages);
	int (*write_begin)(struct file *, struct address_space *mapping,
				loff_t pos, unsigned len, unsigned flags,
				struct page **pagep, void **fsdata);
	int (*write_end)(struct file *, struct address_space *mapping,
				loff_t pos, unsigned len, unsigned copied,
				struct page *page, void *fsdata);
	sector_t (*bmap)(struct address_space *, sector_t);
	void (*invalidatepage) (struct page *, unsigned int, unsigned int);
	int (*releasepage) (struct page *, int);
	void (*freepage)(struct page *);
	int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
	bool (*isolate_page) (struct page *, isolate_mode_t);
	int (*migratepage)(struct address_space *, struct page *, struct page *);
	void (*putback_page) (struct page *);
	int (*launder_page)(struct page *);
	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
	int (*error_remove_page)(struct address_space *, struct page *);
	int (*swap_activate)(struct file *);
	int (*swap_deactivate)(struct file *);

locking rules:
	All except set_page_dirty and freepage may block

====================== ======================== ========= ===============
ops                    PageLocked(page)         i_rwsem   invalidate_lock
====================== ======================== ========= ===============
writepage:             yes, unlocks (see below)
readpage:              yes, unlocks                       shared
writepages:
set_page_dirty:        no
readahead:             yes, unlocks                       shared
readpages:             no                                 shared
write_begin:           locks the page           exclusive
write_end:             yes, unlocks             exclusive
bmap:
invalidatepage:        yes                                exclusive
releasepage:           yes
freepage:              yes
direct_IO:
isolate_page:          yes
migratepage:           yes (both)
putback_page:          yes
launder_page:          yes
is_partially_uptodate: yes
error_remove_page:     yes
swap_activate:         no
swap_deactivate:       no
====================== ======================== ========= ===============

->write_begin(), ->write_end() and ->readpage() may be called from
the request handler (/dev/loop).

->readpage() unlocks the page, either synchronously or via I/O
completion.

->readahead() unlocks the pages that I/O is attempted on like ->readpage().

->readpages() populates the pagecache with the passed pages and starts
I/O against them. They come unlocked upon I/O completion.

->writepage() is used for two purposes: for "memory cleansing" and for
"sync". These are quite different operations and the behaviour may differ
depending upon the mode.

If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then
it *must* start I/O against the page, even if that would involve
blocking on in-progress I/O.

If writepage is called for memory cleansing (sync_mode ==
WBC_SYNC_NONE) then its role is to get as much writeout underway as
possible. So writepage should try to avoid blocking against
currently-in-progress I/O.

If the filesystem is not called for "sync" and it determines that it
would need to block against in-progress I/O to be able to start new I/O
against the page, the filesystem should redirty the page with
redirty_page_for_writepage(), then unlock the page and return zero.
This may also be done to avoid internal deadlocks, but rarely.

If the filesystem is called for sync then it must wait on any
in-progress I/O and then start new I/O.

The filesystem should unlock the page synchronously, before returning to the
caller, unless ->writepage() returns the special WRITEPAGE_ACTIVATE
value. WRITEPAGE_ACTIVATE means that the page cannot really be written out
currently, and the VM should stop calling ->writepage() on this page for some
time. The VM does this by moving the page to the head of the active list,
hence the name.

Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
and return zero, writepage *must* run set_page_writeback() against the page,
followed by unlocking it. Once set_page_writeback() has been run against the
page, write I/O can be submitted and the write I/O completion handler must run
end_page_writeback() once the I/O is complete. If no I/O is submitted, the
filesystem must run end_page_writeback() against the page before returning from
writepage.

That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
if the filesystem needs the page to be locked during writeout, that is ok, too,
the page is allowed to be unlocked at any point in time between the calls to
set_page_writeback() and end_page_writeback().

Note, failure to run either redirty_page_for_writepage() or the combination of
set_page_writeback()/end_page_writeback() on a page submitted to writepage
will leave the page itself marked clean but it will be tagged as dirty in the
radix tree. This incoherency can lead to all sorts of hard-to-debug problems
in the filesystem like having dirty inodes at umount and losing written data.
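
The ->writepage() rules above can be condensed into a small state model. What
follows is a userspace sketch under stated assumptions, not a kernel
implementation: ``struct demo_page`` and the ``demo_*`` helpers are
hypothetical stand-ins for the page flags and for
redirty_page_for_writepage(), set_page_writeback(), end_page_writeback() and
unlocking; I/O completion is modelled as immediate; and the caller is assumed
to have locked the page and cleared its dirty flag before the call.

```c
#include <assert.h>
#include <stdbool.h>

/* Models wbc->sync_mode: WBC_SYNC_NONE versus a sync call. */
enum demo_sync_mode { DEMO_SYNC_NONE, DEMO_SYNC_ALL };

struct demo_page {
	bool locked;
	bool dirty;
	bool writeback;
	bool io_in_progress;	/* earlier I/O that would have to be waited on */
};

/* Hypothetical stand-ins for the kernel helpers named in the text. */
static void demo_redirty(struct demo_page *p)       { p->dirty = true; }
static void demo_set_writeback(struct demo_page *p) { p->writeback = true; }
static void demo_end_writeback(struct demo_page *p) { p->writeback = false; }
static void demo_unlock(struct demo_page *p)        { p->locked = false; }
static void demo_wait_for_io(struct demo_page *p)   { p->io_in_progress = false; }

/* Entered with the page locked; always leaves it unlocked in this model. */
static int demo_writepage(struct demo_page *p, enum demo_sync_mode mode)
{
	assert(p->locked);

	if (p->io_in_progress) {
		if (mode == DEMO_SYNC_NONE) {
			/* Memory cleansing: do not block, hand it back dirty. */
			demo_redirty(p);
			demo_unlock(p);
			return 0;
		}
		/* Sync: wait for the old I/O, then start new I/O. */
		demo_wait_for_io(p);
	}

	demo_set_writeback(p);
	demo_unlock(p);
	/* "Submit" the write; completion runs straight away in this model. */
	demo_end_writeback(p);
	return 0;
}
```

Every path through ``demo_writepage()`` either redirties the page or pairs
the writeback start with a writeback end, which is the invariant the last
paragraph above warns about. WRITEPAGE_ACTIVATE is not modelled.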

->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least
``*nr_to_write`` pages. ``*nr_to_write`` must be decremented for each page
which is written. The address_space implementation may write more (or fewer)
pages than ``*nr_to_write`` asks for, but it should try to be reasonably close.
If nr_to_write is NULL, all dirty pages must be written.

writepages should *only* write pages which are present on
mapping->io_pages.
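
The ``*nr_to_write`` accounting can be illustrated with a toy loop. This is a
minimal sketch assuming a flat array of dirty flags in place of a real
mapping; ``demo_writepages()`` and its parameters are hypothetical, and it
stops exactly at the budget, whereas a real implementation is only asked to
stay reasonably close to it.

```c
#include <stddef.h>

/*
 * dirty[i] != 0 marks page i as needing writeout. Returns the number of
 * pages "written". A NULL nr_to_write means: write every dirty page.
 */
static size_t demo_writepages(int *dirty, size_t npages, long *nr_to_write)
{
	size_t written = 0;

	for (size_t i = 0; i < npages; i++) {
		if (!dirty[i])
			continue;
		if (nr_to_write && *nr_to_write <= 0)
			break;		/* budget used up */
		dirty[i] = 0;		/* "write" the page */
		written++;
		if (nr_to_write)
			(*nr_to_write)--;	/* one decrement per page written */
	}
	return written;
}
```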

->set_page_dirty() is called from various places in the kernel
when the target page is marked as needing writeback. It may be called
under spinlock (it cannot block) and is sometimes called with the page
not locked.

->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
filesystems and by the swapper. The latter will eventually go away. Please,
keep it that way and don't breed new callers.

->invalidatepage() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated. It
returns zero on success. If ->invalidatepage is zero, the kernel uses
block_invalidatepage() instead. The filesystem must exclusively acquire
invalidate_lock before invalidating page cache in truncate / hole punch path
(and thus calling into ->invalidatepage) to block races between page cache
invalidation and page cache filling functions (fault, read, ...).

->releasepage() is called when the kernel is about to try to drop the
buffers from the page in preparation for freeing it. It returns zero to
indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
the kernel assumes that the fs has no private interest in the buffers.

->freepage() is called when the kernel is done dropping the page
from the page cache.

->launder_page() may be called prior to releasing a page if
it is still found to be dirty. It returns zero if the page was successfully
cleaned, or an error value if not. Note that in order to prevent the page
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.

->swap_activate will be called with a non-zero argument on
files backing (non block device backed) swapfiles. A return value
of zero indicates success, in which case this file can be used for
backing swapspace. The swapspace operations will be proxied to the
address space operations.

->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success.

file_lock_operations
====================

prototypes::

	void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
	void (*fl_release_private)(struct file_lock *);


locking rules:

=================== ============= ==========
ops                 inode->i_lock may block
=================== ============= ==========
fl_copy_lock:       yes           no
fl_release_private: maybe         maybe [1]_
=================== ============= ==========

.. [1]
   ->fl_release_private for flock or POSIX locks is currently allowed
   to block. Leases however can still be freed while the i_lock is held and
   so fl_release_private called on a lease should not block.

lock_manager_operations
=======================

prototypes::

	void (*lm_notify)(struct file_lock *);  /* unblock callback */
	int (*lm_grant)(struct file_lock *, struct file_lock *, int);
	void (*lm_break)(struct file_lock *); /* break_lease callback */
	int (*lm_change)(struct file_lock **, int);
	bool (*lm_breaker_owns_lease)(struct file_lock *);

locking rules:

====================== ============= ================= =========
ops                    inode->i_lock blocked_lock_lock may block
====================== ============= ================= =========
lm_notify:             yes           yes               no
lm_grant:              no            no                no
lm_break:              yes           no                no
lm_change:             yes           no                no
lm_breaker_owns_lease: no            no                no
====================== ============= ================= =========

buffer_head
===========

prototypes::

	void (*b_end_io)(struct buffer_head *bh, int uptodate);

locking rules:

Called from interrupts. In other words, extreme care is needed here.
bh is locked, but that's all the guarantees we have here. Currently only RAID1,
highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
call this method upon I/O completion.

block_device_operations
=======================

prototypes::

	int (*open) (struct block_device *, fmode_t);
	int (*release) (struct gendisk *, fmode_t);
	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
	int (*direct_access) (struct block_device *, sector_t, void **,
				unsigned long *);
	void (*unlock_native_capacity) (struct gendisk *);
	int (*getgeo)(struct block_device *, struct hd_geometry *);
	void (*swap_slot_free_notify) (struct block_device *, unsigned long);

locking rules:

======================= ===================
ops                     open_mutex
======================= ===================
open:                   yes
release:                yes
ioctl:                  no
compat_ioctl:           no
direct_access:          no
unlock_native_capacity: no
getgeo:                 no
swap_slot_free_notify:  no (see below)
======================= ===================

swap_slot_free_notify is called with swap_lock and sometimes the page lock
held.
495
Linus Torvalds1da177e2005-04-16 15:20:36 -0700496
file_operations
===============

prototypes::

	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
	int (*iopoll) (struct kiocb *kiocb, bool spin);
	int (*iterate) (struct file *, struct dir_context *);
	int (*iterate_shared) (struct file *, struct dir_context *);
	__poll_t (*poll) (struct file *, struct poll_table_struct *);
	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, loff_t start, loff_t end, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
			loff_t *, int);
	unsigned long (*get_unmapped_area)(struct file *, unsigned long,
			unsigned long, unsigned long, unsigned long);
	int (*check_flags)(int);
	int (*flock) (struct file *, int, struct file_lock *);
	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *,
			size_t, unsigned int);
	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *,
			size_t, unsigned int);
	int (*setlease)(struct file *, long, struct file_lock **, void **);
	long (*fallocate)(struct file *, int, loff_t, loff_t);
	void (*show_fdinfo)(struct seq_file *m, struct file *f);
	unsigned (*mmap_capabilities)(struct file *);
	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
			loff_t, size_t, unsigned int);
	loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
			struct file *file_out, loff_t pos_out,
			loff_t len, unsigned int remap_flags);
	int (*fadvise)(struct file *, loff_t, loff_t, int);

locking rules:
	All may block.

->llseek() locking has moved from llseek to the individual llseek
implementations. If your fs is not using generic_file_llseek, you
need to acquire and release the appropriate locks in your ->llseek().
For many filesystems, it is probably safe to acquire the inode
mutex or just to use i_size_read() instead.
Note: this does not protect the file->f_pos against concurrent modifications
since this is something the userspace has to take care of.
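
As a rough illustration of the rule above, here is a userspace model, not
kernel code. A pthread mutex stands in for the inode mutex, and
``struct seek_inode``, ``struct seek_file`` and ``seek_llseek()`` are
hypothetical names invented for this sketch. It only shows the shape of an
->llseek() that takes its own lock around the size it samples; overflow checks
and per-filesystem offset limits are left out.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical userspace stand-ins; these are not kernel structures. */
struct seek_inode {
	pthread_mutex_t lock;	/* plays the role of the inode mutex */
	int64_t size;		/* plays the role of i_size */
};

struct seek_file {
	struct seek_inode *inode;
	int64_t pos;		/* plays the role of file->f_pos */
};

/*
 * Compute the new position with the inode lock held, so the size sampled
 * for SEEK_END cannot change halfway through the calculation.
 * Returns the new offset, or -EINVAL for a bad whence or negative result.
 */
int64_t seek_llseek(struct seek_file *file, int64_t offset, int whence)
{
	struct seek_inode *inode;
	int64_t base, ret;

	assert(file && file->inode);
	inode = file->inode;

	pthread_mutex_lock(&inode->lock);
	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = file->pos;
		break;
	case SEEK_END:
		base = inode->size;
		break;
	default:
		pthread_mutex_unlock(&inode->lock);
		return -EINVAL;
	}

	ret = base + offset;	/* overflow checking omitted in this model */
	if (ret < 0)
		ret = -EINVAL;
	else
		file->pos = ret;
	pthread_mutex_unlock(&inode->lock);
	return ret;
}
```

The lock is there to keep the size used for SEEK_END stable while the new
offset is computed. As noted above, it is not meant to protect the file
position against concurrent users of the same open file.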

->iterate() is called with i_rwsem exclusive.

->iterate_shared() is called with i_rwsem at least shared.

->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
Most instances call fasync_helper(), which does that maintenance, so it's
not normally something one needs to worry about. Return values > 0 will be
mapped to zero in the VFS layer.

->readdir() and ->ioctl() on directories must be changed. Ideally we would
move ->readdir() to inode_operations and use a separate method for directory
->ioctl() or kill the latter completely. One of the problems is that for
anything that resembles union-mount we won't have a struct file for all
components. And there are other reasons why the current interface is a mess...

->read on directories probably must go away - we should just enforce -EISDIR
in sys_read() and friends.

->setlease operations should call generic_setlease() before or after setting
the lease within the individual filesystem to record the result of the
operation.

A ->fallocate implementation must be careful to maintain page cache
consistency when punching holes or performing other operations that invalidate
page cache contents. Usually the filesystem needs to call
truncate_inode_pages_range() to invalidate the relevant range of the page
cache. However, the filesystem usually also needs to update its internal (and
on-disk) view of the file offset -> disk block mapping. Until this update is
finished, the filesystem needs to block page faults and reads from reloading
now-stale page cache contents from the disk. Since the VFS acquires
mapping->invalidate_lock in shared mode when loading pages from disk
(filemap_fault(), filemap_read(), readahead paths), the fallocate
implementation must take the invalidate_lock exclusively to prevent reloading.

->copy_file_range and ->remap_file_range implementations need to serialize
against modifications of file data while the operation is running. To block
changes through write(2) and similar operations, inode->i_rwsem can be
used. To block changes to file contents via a memory mapping during the
operation, the filesystem must take mapping->invalidate_lock to coordinate
with ->page_mkwrite.

dquot_operations
================

prototypes::

	int (*write_dquot) (struct dquot *);
	int (*acquire_dquot) (struct dquot *);
	int (*release_dquot) (struct dquot *);
	int (*mark_dirty) (struct dquot *);
	int (*write_info) (struct super_block *, int);

These operations are intended to be more or less wrapping functions that ensure
proper locking with respect to the filesystem and call the generic quota
operations.

What a filesystem should expect from the generic quota functions:

==============  ============  =========================
ops             FS recursion  Held locks when called
==============  ============  =========================
write_dquot:    yes           dqonoff_sem or dqptr_sem
acquire_dquot:  yes           dqonoff_sem or dqptr_sem
release_dquot:  yes           dqonoff_sem or dqptr_sem
mark_dirty:     no            -
write_info:     yes           dqonoff_sem
==============  ============  =========================

FS recursion means calling ->quota_read() and ->quota_write() from superblock
operations.

More details about quota locking can be found in fs/quota/dquot.c.

vm_operations_struct
====================

prototypes::

	void (*open)(struct vm_area_struct*);
	void (*close)(struct vm_area_struct*);
	vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
	vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
	vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);

locking rules:

=============   =========       ===========================
ops             mmap_lock       PageLocked(page)
=============   =========       ===========================
open:           yes
close:          yes
fault:          yes             can return with page locked
map_pages:      yes
page_mkwrite:   yes             can return with page locked
pfn_mkwrite:    yes
access:         yes
=============   =========       ===========================

->fault() is called when a previously not present pte is about to be faulted
in. The filesystem must find and return the page associated with the passed in
"pgoff" in the vm_fault structure. If it is possible that the page may be
truncated and/or invalidated, then the filesystem must take invalidate_lock,
then ensure the page is not already truncated (invalidate_lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.

->map_pages() is called when the VM asks to map easily accessible pages.
The filesystem should find and map pages associated with offsets from
"start_pgoff" through "end_pgoff". ->map_pages() is called with the page table
locked and must not block. If it is not possible to reach a page without
blocking, the filesystem should skip it. The filesystem should use
do_set_pte() to set up the page table entry. A pointer to the entry associated
with the page is passed in the "pte" field of the vm_fault structure. Pointers
to entries for other offsets should be calculated relative to "pte".

->page_mkwrite() is called when a previously read-only pte is about to become
writeable. The filesystem again must ensure that there are no
truncate/invalidate races or races with operations such as ->remap_file_range
or ->copy_file_range, and then return with the page locked. Usually
mapping->invalidate_lock is suitable for proper serialization. If the page has
been truncated, the filesystem should not look up a new page like the ->fault()
handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
retry the fault.

->pfn_mkwrite() is the same as ->page_mkwrite() but is called when the pte
belongs to a VM_PFNMAP or VM_MIXEDMAP mapping and is a page-less entry. The
expected return value is VM_FAULT_NOPAGE or one of the VM_FAULT_ERROR types.
The default behavior after this call is to make the pte read-write, unless
pfn_mkwrite returns an error.

->access() is called when get_user_pages() fails in
access_process_vm(), typically used to debug a process through
/proc/pid/mem or ptrace. This function is needed only for
VM_IO | VM_PFNMAP VMAs.

--------------------------------------------------------------------------------

			Dubious stuff

(if you break something or notice that it is broken and do not fix it yourself
- at least put it here)