=====================
Linux Filesystems API
=====================

The Linux VFS
=============

The Filesystem types
--------------------

.. kernel-doc:: include/linux/fs.h
   :internal:

The Directory Cache
-------------------

.. kernel-doc:: fs/dcache.c
   :export:

.. kernel-doc:: include/linux/dcache.h
   :internal:

Inode Handling
--------------

.. kernel-doc:: fs/inode.c
   :export:

.. kernel-doc:: fs/bad_inode.c
   :export:

Registration and Superblocks
----------------------------

.. kernel-doc:: fs/super.c
   :export:

File Locks
----------

.. kernel-doc:: fs/locks.c
   :export:

.. kernel-doc:: fs/locks.c
   :internal:

Other Functions
---------------

.. kernel-doc:: fs/mpage.c
   :export:

.. kernel-doc:: fs/namei.c
   :export:

.. kernel-doc:: fs/buffer.c
   :export:

.. kernel-doc:: block/bio.c
   :export:

.. kernel-doc:: fs/seq_file.c
   :export:

.. kernel-doc:: fs/filesystems.c
   :export:

.. kernel-doc:: fs/fs-writeback.c
   :export:

.. kernel-doc:: fs/block_dev.c
   :export:

The proc filesystem
===================

sysctl interface
----------------

.. kernel-doc:: kernel/sysctl.c
   :export:

proc filesystem interface
-------------------------

.. kernel-doc:: fs/proc/base.c
   :internal:

Events based on file descriptors
================================

.. kernel-doc:: fs/eventfd.c
   :export:

The Filesystem for Exporting Kernel Objects
===========================================

.. kernel-doc:: fs/sysfs/file.c
   :export:

.. kernel-doc:: fs/sysfs/symlink.c
   :export:

The debugfs filesystem
======================

debugfs interface
-----------------

.. kernel-doc:: fs/debugfs/inode.c
   :export:

.. kernel-doc:: fs/debugfs/file.c
   :export:

The Linux Journalling API
=========================

Overview
--------

Details
~~~~~~~

The journalling layer is easy to use. First of all you need to create a
journal_t data structure. There are two calls to do this, depending on
how you decide to allocate the physical media on which the journal
resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in
filesystem inodes, and the :c:func:`jbd2_journal_init_dev` call is for
journals stored on a raw device (in a contiguous range of blocks). A
journal_t is a typedef for a struct, and both calls return a pointer to
it, so when you are finally finished make sure you call
:c:func:`jbd2_journal_destroy` on it to free up any used kernel memory.

Once you have got your journal_t object you need to 'mount' or load the
journal file. The journalling layer expects the space for the journal
to have been allocated and initialized properly by the userspace tools.
When loading the journal you must call :c:func:`jbd2_journal_load` to process
the journal contents. If the client file system detects that the journal
contents do not need to be processed (or need not even be valid), it
may call :c:func:`jbd2_journal_wipe` to clear the journal contents before
calling :c:func:`jbd2_journal_load`.

Note that jbd2_journal_wipe(..,0) calls
:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding
transactions in the journal and similarly :c:func:`jbd2_journal_load` will
call :c:func:`jbd2_journal_recover` if necessary. I would advise reading
:c:func:`ext4_load_journal` in fs/ext4/super.c for examples of this stage.

Now you can go ahead and start modifying the underlying filesystem.
Almost.

You still need to actually journal your filesystem changes; this is done
by wrapping them into transactions. Additionally you need to wrap
the modification of each of the buffers with calls to the journal layer,
so that it knows which modifications you are actually making. To do
this use :c:func:`jbd2_journal_start`, which returns a transaction handle.

:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`,
which indicates the end of a transaction, are nestable calls, so you can
reenter a transaction if necessary, but remember you must call
:c:func:`jbd2_journal_stop` the same number of times as
:c:func:`jbd2_journal_start` before the transaction is completed (or more
accurately leaves the update phase). Ext4/VFS makes use of this feature to
simplify handling of inode dirtying, quota support, etc.
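
The nesting rule can be illustrated with a small userspace toy model. This
is only a sketch and not the jbd2 implementation: ``nest_handle``,
``nest_start``, ``nest_stop`` and ``nest_quota_update`` are hypothetical
names, and a single task is assumed. It shows that only the outermost stop
takes the task out of the update phase.

```c
/*
 * Toy userspace model of nestable start/stop calls.
 * NOT kernel code: every name here is hypothetical.
 */
#include <assert.h>
#include <stdbool.h>

struct nest_handle {
    int refcount;   /* number of nest_start() calls still open */
    bool in_update; /* true while the task is inside the update phase */
};

/* One "current handle" per task; only a single task is modelled. */
static struct nest_handle nest_current;

/* Enter the transaction, or re-enter it if the task already holds it. */
static struct nest_handle *nest_start(void)
{
    if (nest_current.refcount == 0)
        nest_current.in_update = true;
    nest_current.refcount++;
    return &nest_current;
}

/* Leave one nesting level; only the outermost stop ends the update phase. */
static void nest_stop(struct nest_handle *h)
{
    assert(h->refcount > 0);
    if (--h->refcount == 0)
        h->in_update = false;
}

/* Hypothetical helper that opens its own nested level, e.g. a quota update. */
static void nest_quota_update(void)
{
    struct nest_handle *h = nest_start();
    /* ... buffers would be modified here ... */
    nest_stop(h);
}
```

A caller that has already called ``nest_start()`` can invoke
``nest_quota_update()`` freely: the helper's matching stop only drops the
count back to one, and the update phase ends when the caller issues its own
stop.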

Inside each transaction you need to wrap the modifications to the
individual buffers (blocks). Before you start to modify a buffer you
need to call :c:func:`jbd2_journal_get_create_access()` /
:c:func:`jbd2_journal_get_write_access()` /
:c:func:`jbd2_journal_get_undo_access()` as appropriate; this allows the
journalling layer to copy the unmodified data if it needs to. After all,
the buffer may be part of a previous, uncommitted transaction. At this
point you are at last ready to modify a buffer, and once you have done so
you need to call :c:func:`jbd2_journal_dirty_metadata`. Or, if you've asked
for access to a buffer that you now know no longer needs to be pushed back
to the device, you can call :c:func:`jbd2_journal_forget` in much the same
way as you might have used :c:func:`bforget` in the past.

A :c:func:`jbd2_journal_flush` may be called at any time to commit and
checkpoint all your transactions.

Then at umount time, in your :c:func:`put_super`, you can call
:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.

Unfortunately there are a couple of ways the journal layer can cause a
deadlock. The first thing to note is that each task can only have a
single outstanding transaction at any one time; remember that nothing
commits until the outermost :c:func:`jbd2_journal_stop`. This means you must
complete the transaction at the end of each file/inode/address etc.
operation you perform, so that the journalling system isn't re-entered on
another journal. This matters because transactions can't be nested/batched
across differing journals, and a filesystem other than yours (say ext4)
may be modified in a later syscall.

The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
if there isn't enough space in the journal for your transaction (based
on the passed nblocks param). When it blocks it merely(!) needs to wait
for transactions from other tasks to complete and be committed, so
essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid
deadlocks you must treat :c:func:`jbd2_journal_start` /
:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
your semaphore ordering rules. Note that :c:func:`jbd2_journal_extend` has
similar blocking behaviour to :c:func:`jbd2_journal_start` so you can
deadlock here just as easily as on :c:func:`jbd2_journal_start`.
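
To make the semaphore analogy concrete, here is a userspace toy model of
block reservation. It is a sketch under loud assumptions: the ``credit_*``
names are hypothetical, the functions return ``false`` at the point where
the real calls would block, and journal space is released as soon as a
handle stops (the real layer only regains space after commit and
checkpoint).

```c
/*
 * Toy userspace model of journal space reservation.
 * NOT kernel code: every name here is hypothetical.
 */
#include <assert.h>
#include <stdbool.h>

struct credit_journal {
    int free_blocks; /* journal space not reserved by any handle */
};

struct credit_txn {
    struct credit_journal *journal;
    int reserved;    /* blocks reserved by this handle */
    bool active;
};

/* Try to reserve nblocks; false marks where the real call would block. */
static bool credit_try_start(struct credit_journal *j, struct credit_txn *t,
                             int nblocks)
{
    if (nblocks > j->free_blocks)
        return false;
    j->free_blocks -= nblocks;
    t->journal = j;
    t->reserved = nblocks;
    t->active = true;
    return true;
}

/* Ask for more blocks on an open handle; same space check as start. */
static bool credit_try_extend(struct credit_txn *t, int nblocks)
{
    assert(t->active);
    if (nblocks > t->journal->free_blocks)
        return false;
    t->journal->free_blocks -= nblocks;
    t->reserved += nblocks;
    return true;
}

/* Finish the handle; in this model its space is released immediately. */
static void credit_stop(struct credit_txn *t)
{
    assert(t->active);
    t->journal->free_blocks += t->reserved;
    t->reserved = 0;
    t->active = false;
}
```

In this model a second handle can only start once the first has stopped,
which is exactly the dependency that has to be included in your lock
ordering: a task that holds a lock the first handle needs, and then waits
in start, would never make progress.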

Try to reserve the right number of blocks the first time. ;-). This will
be the maximum number of blocks you are going to touch in this
transaction. I advise having a look at at least ext4_jbd.h to see the
basis on which ext4 makes these decisions.

Another wriggle to watch out for is your on-disk block allocation
strategy. Why? Because, if you do a delete, you need to ensure you
haven't reused any of the freed blocks until the transaction freeing
these blocks commits. If you reuse these blocks and a crash happens,
there is no way to restore the contents of the reallocated blocks as of
the end of the last fully committed transaction. One simple way of doing
this is to mark blocks as free in internal in-memory block allocation
structures only after the transaction freeing them commits. Ext4 uses
a journal commit callback for this purpose.

With journal commit callbacks you can ask the journalling layer to call
a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer to call the callback by simply setting the
``journal->j_commit_callback`` function pointer, and that function is
called after each transaction commit. You can also use
``transaction->t_private_list`` for attaching entries to a transaction
that need processing when the transaction commits.
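
The deferred-free idea can be sketched as a userspace toy model. Nothing
here is the ext4 or jbd2 code: the ``defer_*`` names are hypothetical, the
allocation map is a tiny fixed array, and ``defer_commit_callback`` merely
stands in for whatever function you would install as the commit callback.

```c
/*
 * Toy userspace model of "free blocks only after commit".
 * NOT kernel code: every name here is hypothetical.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFER_NBLOCKS     8
#define DEFER_MAX_PENDING 8

struct defer_alloc {
    bool in_use[DEFER_NBLOCKS];     /* in-memory allocation map */
};

struct defer_txn {
    int pending[DEFER_MAX_PENDING]; /* blocks freed by this transaction */
    size_t npending;
};

/* Return the lowest free block and mark it used, or -1 if none is free. */
static int defer_alloc_block(struct defer_alloc *a)
{
    for (int i = 0; i < DEFER_NBLOCKS; i++) {
        if (!a->in_use[i]) {
            a->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Record the free on the transaction; the map still shows the block in use. */
static bool defer_free_block(struct defer_txn *t, int blk)
{
    if (blk < 0 || blk >= DEFER_NBLOCKS || t->npending == DEFER_MAX_PENDING)
        return false;
    t->pending[t->npending++] = blk;
    return true;
}

/* Stand-in for a commit callback: only now do the blocks become reusable. */
static void defer_commit_callback(struct defer_alloc *a, struct defer_txn *t)
{
    for (size_t i = 0; i < t->npending; i++)
        a->in_use[t->pending[i]] = false;
    t->npending = 0;
}
```

Between ``defer_free_block()`` and the callback, an allocation skips the
freed block, so a crash before commit can never find its old contents
overwritten.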

JBD2 also provides a way to block all transaction updates via
:c:func:`jbd2_journal_lock_updates()` /
:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
window with a clean and stable fs for a moment. E.g.

::

    jbd2_journal_lock_updates()   // stop new stuff happening..
    jbd2_journal_flush()          // checkpoint everything.
    ..do stuff on stable fs
    jbd2_journal_unlock_updates() // carry on with filesystem use.

The opportunities for abuse and DOS attacks with this should be obvious,
if you allow unprivileged userspace to trigger codepaths containing
these calls.

Summary
~~~~~~~

Using the journal is a matter of wrapping the different context changes,
being each mount, each modification (transaction) and each changed
buffer, to tell the journalling layer about them.

Data Types
----------

The journalling layer uses typedefs to 'hide' the concrete definitions
of the structures used. As a client of the JBD2 layer you can just rely
on using the pointer as a magic cookie of some sort. Obviously the
hiding is not enforced as this is 'C'.

Structures
~~~~~~~~~~

.. kernel-doc:: include/linux/jbd2.h
   :internal:

Functions
---------

The functions here are split into two groups: those that affect a journal
as a whole, and those which are used to manage transactions.

Journal Level
~~~~~~~~~~~~~

.. kernel-doc:: fs/jbd2/journal.c
   :export:

.. kernel-doc:: fs/jbd2/recovery.c
   :internal:

Transaction Level
~~~~~~~~~~~~~~~~~

.. kernel-doc:: fs/jbd2/transaction.c

See also
--------

`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__

`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__

splice API
==========

splice is a method for moving blocks of data around inside the kernel,
without continually transferring them between the kernel and user space.

.. kernel-doc:: fs/splice.c

pipes API
=========

Pipe interfaces are all for in-kernel (builtin image) use. They are not
exported for use by modules.

.. kernel-doc:: include/linux/pipe_fs_i.h
   :internal:

.. kernel-doc:: fs/pipe.c
317.. kernel-doc:: fs/pipe.c