.. SPDX-License-Identifier: GPL-2.0

Index Nodes
-----------

In a regular UNIX filesystem, the inode stores all the metadata
pertaining to the file (time stamps, block maps, extended attributes,
etc.), not the directory entry. To find the information associated with a
file, one must traverse the directory files to find the directory entry
associated with a file, then load the inode to find the metadata for
that file. ext4 cheats a little (for performance reasons) by storing a
copy of the file type (normally stored in the inode) in the directory
entry. (Compare all this to FAT, which stores all the file
information directly in the directory entry, but does not support hard
links and is in general more seek-happy than ext4 due to its simpler
block allocator and extensive use of linked lists.)

The inode table is a linear array of ``struct ext4_inode``. The table is
sized to have enough blocks to store at least
``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
block group containing an inode can be calculated as
``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
is no inode 0.

The inode checksum is calculated against the FS UUID, the inode number,
and the inode structure itself.

The inode table entry is laid out in ``struct ext4_inode``.

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - i\_mode
     - File mode. See the table i_mode_ below.
   * - 0x2
     - \_\_le16
     - i\_uid
     - Lower 16-bits of Owner UID.
   * - 0x4
     - \_\_le32
     - i\_size\_lo
     - Lower 32-bits of size in bytes.
   * - 0x8
     - \_\_le32
     - i\_atime
     - Last access time, in seconds since the epoch. However, if the EA\_INODE
       inode flag is set, this inode stores an extended attribute value and
       this field contains the checksum of the value.
   * - 0xC
     - \_\_le32
     - i\_ctime
     - Last inode change time, in seconds since the epoch. However, if the
       EA\_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the lower 32 bits of the attribute value's
       reference count.
   * - 0x10
     - \_\_le32
     - i\_mtime
     - Last data modification time, in seconds since the epoch. However, if the
       EA\_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the number of the inode that owns the
       extended attribute.
   * - 0x14
     - \_\_le32
     - i\_dtime
     - Deletion Time, in seconds since the epoch.
   * - 0x18
     - \_\_le16
     - i\_gid
     - Lower 16-bits of GID.
   * - 0x1A
     - \_\_le16
     - i\_links\_count
     - Hard link count. Normally, ext4 does not permit an inode to have more
       than 65,000 hard links. This applies to files as well as directories,
       which means that there cannot be more than 64,998 subdirectories in a
       directory (each subdirectory's '..' entry counts as a hard link, as does
       the '.' entry in the directory itself). With the DIR\_NLINK feature
       enabled, ext4 supports more than 64,998 subdirectories by setting this
       field to 1 to indicate that the number of hard links is not known.
   * - 0x1C
     - \_\_le32
     - i\_blocks\_lo
     - Lower 32-bits of “block” count. If the huge\_file feature flag is not
       set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
       on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in
       ``inode.i_flags``, then the file consumes
       ``i_blocks_lo + (i_blocks_hi << 32)`` 512-byte blocks on disk. If
       huge\_file is set and EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``,
       then this file consumes ``i_blocks_lo + (i_blocks_hi << 32)``
       filesystem blocks on disk.
   * - 0x20
     - \_\_le32
     - i\_flags
     - Inode flags. See the table i_flags_ below.
   * - 0x24
     - 4 bytes
     - i\_osd1
     - See the table i_osd1_ for more details.
   * - 0x28
     - 60 bytes
     - i\_block[EXT4\_N\_BLOCKS=15]
     - Block map or extent tree. See the section “The Contents of inode.i\_block”.
   * - 0x64
     - \_\_le32
     - i\_generation
     - File version (for NFS).
   * - 0x68
     - \_\_le32
     - i\_file\_acl\_lo
     - Lower 32-bits of extended attribute block. ACLs are only one of many
       possible extended attributes; the field name is presumably a holdover
       from the first use of extended attributes, which was for ACLs.
   * - 0x6C
     - \_\_le32
     - i\_size\_high / i\_dir\_acl
     - Upper 32-bits of file/directory size. In ext2/3 this field was named
       i\_dir\_acl, though it was usually set to zero and never used.
   * - 0x70
     - \_\_le32
     - i\_obso\_faddr
     - (Obsolete) fragment address.
   * - 0x74
     - 12 bytes
     - i\_osd2
     - See the table i_osd2_ for more details.
   * - 0x80
     - \_\_le16
     - i\_extra\_isize
     - Size of this inode - 128. Alternately, the size of the extended inode
       fields beyond the original ext2 inode, including this field.
   * - 0x82
     - \_\_le16
     - i\_checksum\_hi
     - Upper 16-bits of the inode checksum.
   * - 0x84
     - \_\_le32
     - i\_ctime\_extra
     - Extra change time bits. This provides sub-second precision. See Inode
       Timestamps section.
   * - 0x88
     - \_\_le32
     - i\_mtime\_extra
     - Extra modification time bits. This provides sub-second precision.
   * - 0x8C
     - \_\_le32
     - i\_atime\_extra
     - Extra access time bits. This provides sub-second precision.
   * - 0x90
     - \_\_le32
     - i\_crtime
     - File creation time, in seconds since the epoch.
   * - 0x94
     - \_\_le32
     - i\_crtime\_extra
     - Extra file creation time bits. This provides sub-second precision.
   * - 0x98
     - \_\_le32
     - i\_version\_hi
     - Upper 32-bits for version number.
   * - 0x9C
     - \_\_le32
     - i\_projid
     - Project ID.

.. _i_mode:

The ``i_mode`` value is a combination of the following flags:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - S\_IXOTH (Others may execute)
   * - 0x2
     - S\_IWOTH (Others may write)
   * - 0x4
     - S\_IROTH (Others may read)
   * - 0x8
     - S\_IXGRP (Group members may execute)
   * - 0x10
     - S\_IWGRP (Group members may write)
   * - 0x20
     - S\_IRGRP (Group members may read)
   * - 0x40
     - S\_IXUSR (Owner may execute)
   * - 0x80
     - S\_IWUSR (Owner may write)
   * - 0x100
     - S\_IRUSR (Owner may read)
   * - 0x200
     - S\_ISVTX (Sticky bit)
   * - 0x400
     - S\_ISGID (Set GID)
   * - 0x800
     - S\_ISUID (Set UID)
   * -
     - These are mutually-exclusive file types:
   * - 0x1000
     - S\_IFIFO (FIFO)
   * - 0x2000
     - S\_IFCHR (Character device)
   * - 0x4000
     - S\_IFDIR (Directory)
   * - 0x6000
     - S\_IFBLK (Block device)
   * - 0x8000
     - S\_IFREG (Regular file)
   * - 0xA000
     - S\_IFLNK (Symbolic link)
   * - 0xC000
     - S\_IFSOCK (Socket)
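
Because the file types occupy the top four bits and are mutually exclusive,
while the permission bits can be combined freely, decoding ``i_mode`` amounts
to two masks. A minimal sketch, with ``decode_mode`` being a hypothetical
helper name:

```python
# File type values from the i_mode table (upper four bits of the field).
FILE_TYPES = {
    0x1000: "S_IFIFO",
    0x2000: "S_IFCHR",
    0x4000: "S_IFDIR",
    0x6000: "S_IFBLK",
    0x8000: "S_IFREG",
    0xA000: "S_IFLNK",
    0xC000: "S_IFSOCK",
}


def decode_mode(i_mode):
    """Split i_mode into (file type name, permission and set-id/sticky bits)."""
    file_type = FILE_TYPES.get(i_mode & 0xF000, "unknown")
    permissions = i_mode & 0x0FFF
    return file_type, permissions
```

For example, ``decode_mode(0x81A4)`` yields ``("S_IFREG", 0o644)``, a regular
file that is readable by everyone and writable by its owner.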

.. _i_flags:

The ``i_flags`` field is a combination of these values:

.. list-table::
   :widths: 1 79
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented)
   * - 0x2
     - This file should be preserved, should undeletion be desired
       (EXT4\_UNRM\_FL). (not implemented)
   * - 0x4
     - File is compressed (EXT4\_COMPR\_FL). (not really implemented)
   * - 0x8
     - All writes to the file must be synchronous (EXT4\_SYNC\_FL).
   * - 0x10
     - File is immutable (EXT4\_IMMUTABLE\_FL).
   * - 0x20
     - File can only be appended (EXT4\_APPEND\_FL).
   * - 0x40
     - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL).
   * - 0x80
     - Do not update access time (EXT4\_NOATIME\_FL).
   * - 0x100
     - Dirty compressed file (EXT4\_DIRTY\_FL). (not used)
   * - 0x200
     - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used)
   * - 0x400
     - Do not compress file (EXT4\_NOCOMPR\_FL). (not used)
   * - 0x800
     - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was
       EXT4\_ECOMPR\_FL (compression error), which was never used.
   * - 0x1000
     - Directory has hashed indexes (EXT4\_INDEX\_FL).
   * - 0x2000
     - AFS magic directory (EXT4\_IMAGIC\_FL).
   * - 0x4000
     - File data must always be written through the journal
       (EXT4\_JOURNAL\_DATA\_FL).
   * - 0x8000
     - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4)
   * - 0x10000
     - All directory entry data should be written synchronously (see
       ``dirsync``) (EXT4\_DIRSYNC\_FL).
   * - 0x20000
     - Top of directory hierarchy (EXT4\_TOPDIR\_FL).
   * - 0x40000
     - This is a huge file (EXT4\_HUGE\_FILE\_FL).
   * - 0x80000
     - Inode uses extents (EXT4\_EXTENTS\_FL).
   * - 0x200000
     - Inode stores a large extended attribute value in its data blocks
       (EXT4\_EA\_INODE\_FL).
   * - 0x400000
     - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL).
       (deprecated)
   * - 0x01000000
     - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
   * - 0x04000000
     - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
       mainline)
   * - 0x08000000
     - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
       mainline)
   * - 0x10000000
     - Inode has inline data (EXT4\_INLINE\_DATA\_FL).
   * - 0x20000000
     - Create children with the same project ID (EXT4\_PROJINHERIT\_FL).
   * - 0x80000000
     - Reserved for ext4 library (EXT4\_RESERVED\_FL).
   * -
     - Aggregate flags:
   * - 0x4BDFFF
     - User-visible flags.
   * - 0x4B80FF
     - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and
       EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's
       EXT4\_FL\_USER\_MODIFIABLE mask, since it needs to handle the setting of
       these flags in a special manner and they are masked out of the set of
       flags that are saved directly to i\_flags.
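
Since ``i_flags`` is a plain bitmask, a reader can test each bit in turn. The
sketch below names only a subset of the flags from the table to stay short;
``decode_flags`` and ``user_visible`` are hypothetical helper names, and the
mask value is the user-visible aggregate listed above.

```python
# A subset of the i_flags bits from the table; extend as needed.
I_FLAGS = {
    0x8: "EXT4_SYNC_FL",
    0x10: "EXT4_IMMUTABLE_FL",
    0x20: "EXT4_APPEND_FL",
    0x80: "EXT4_NOATIME_FL",
    0x800: "EXT4_ENCRYPT_FL",
    0x1000: "EXT4_INDEX_FL",
    0x40000: "EXT4_HUGE_FILE_FL",
    0x80000: "EXT4_EXTENTS_FL",
    0x200000: "EXT4_EA_INODE_FL",
    0x10000000: "EXT4_INLINE_DATA_FL",
}

USER_VISIBLE_MASK = 0x4BDFFF


def decode_flags(i_flags):
    """Return the names of the known flags set in i_flags, lowest bit first."""
    return [name for bit, name in sorted(I_FLAGS.items()) if i_flags & bit]


def user_visible(i_flags):
    """Keep only the bits covered by the user-visible aggregate mask."""
    return i_flags & USER_VISIBLE_MASK
```

An extent-mapped, immutable file with ``i_flags = 0x80010`` would decode to
``["EXT4_IMMUTABLE_FL", "EXT4_EXTENTS_FL"]``.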

.. _i_osd1:

The ``osd1`` field has multiple meanings depending on the creator:

Linux:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - l\_i\_version
     - Inode version. However, if the EA\_INODE inode flag is set, this inode
       stores an extended attribute value and this field contains the upper 32
       bits of the attribute value's reference count.

Hurd:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - h\_i\_translator
     - ??

Masix:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - m\_i\_reserved
     - ??

.. _i_osd2:

The ``osd2`` field has multiple meanings depending on the filesystem creator:

Linux:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - l\_i\_blocks\_high
     - Upper 16-bits of the block count. Please see the note attached to
       i\_blocks\_lo.
   * - 0x2
     - \_\_le16
     - l\_i\_file\_acl\_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location). See the Extended Attributes section below.
   * - 0x4
     - \_\_le16
     - l\_i\_uid\_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - \_\_le16
     - l\_i\_gid\_high
     - Upper 16-bits of the GID.
   * - 0x8
     - \_\_le16
     - l\_i\_checksum\_lo
     - Lower 16-bits of the inode checksum.
   * - 0xA
     - \_\_le16
     - l\_i\_reserved
     - Unused.
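
The Linux ``osd2`` fields mostly supply the upper halves of values whose lower
halves live in the base inode. The sketch below shows how the split UID/GID and
the block count might be reassembled, following the rules in the
``i_blocks_lo`` note. The function names are hypothetical, and the caller is
assumed to know whether the huge\_file feature is enabled and what the
filesystem block size is.

```python
EXT4_HUGE_FILE_FL = 0x40000


def join_id(low16, high16):
    """Combine i_uid/l_i_uid_high (or the GID pair) into a 32-bit ID."""
    return low16 | (high16 << 16)


def disk_bytes(i_blocks_lo, l_i_blocks_high, i_flags,
               huge_file_feature, fs_block_size):
    """Return the on-disk space consumed by a file, in bytes."""
    if not huge_file_feature:
        # Only the low 32 bits are meaningful; units are 512-byte blocks.
        return i_blocks_lo * 512
    count = i_blocks_lo | (l_i_blocks_high << 32)
    if i_flags & EXT4_HUGE_FILE_FL:
        # The count is expressed in filesystem blocks.
        return count * fs_block_size
    return count * 512
```

Returning bytes rather than a block count sidesteps the change of units
between the three cases.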

Hurd:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - h\_i\_reserved1
     - ??
   * - 0x2
     - \_\_u16
     - h\_i\_mode\_high
     - Upper 16-bits of the file mode.
   * - 0x4
     - \_\_le16
     - h\_i\_uid\_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - \_\_le16
     - h\_i\_gid\_high
     - Upper 16-bits of the GID.
   * - 0x8
     - \_\_u32
     - h\_i\_author
     - Author code?

Masix:

.. list-table::
   :widths: 1 1 1 77
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - h\_i\_reserved1
     - ??
   * - 0x2
     - \_\_u16
     - m\_i\_file\_acl\_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location).
   * - 0x4
     - \_\_u32
     - m\_i\_reserved2[2]
     - ??

Inode Size
~~~~~~~~~~

In ext2 and ext3, the inode structure size was fixed at 128 bytes
(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
128 bytes. Starting with ext4, it is possible to allocate a larger
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
number of bytes actually used by struct ext4\_inode beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
inode, which allows struct ext4\_inode to grow for a new kernel without
having to upgrade all of the on-disk inodes. Access to fields beyond
EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
of October 2013) the inode structure is 156 bytes
(``i_extra_isize = 28``). The extra space between the end of the inode
structure and the end of the inode record can be used to store extended
attributes. Each inode record can be as large as the filesystem block
size, though this is not terribly efficient.
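
The bounds check described above can be sketched as a single comparison: a
field is usable only if it ends at or before ``128 + i_extra_isize``. The
helper name ``field_fits`` is made up for this example; the offsets passed to
it come from the ``struct ext4_inode`` table.

```python
EXT2_GOOD_OLD_INODE_SIZE = 128


def field_fits(field_offset, field_size, i_extra_isize):
    """Return True if the field lies entirely within the used inode bytes."""
    end_of_field = field_offset + field_size
    return end_of_field <= EXT2_GOOD_OLD_INODE_SIZE + i_extra_isize
```

With ``i_extra_isize = 28`` (a 156-byte structure), ``i_version_hi`` at 0x98
still fits, whereas ``i_projid`` at 0x9C does not and must not be read from
such an inode.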

Finding an Inode
~~~~~~~~~~~~~~~~

Each block group contains ``sb->s_inodes_per_group`` inodes. Because
inode 0 is defined not to exist, this formula can be used to find the
block group that an inode lives in:
``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
can be found within the block group's inode table at
``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
address within the inode table, use
``offset = index * sb->s_inode_size``.
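
The three formulas above translate directly into code. In this sketch the
superblock values are passed in as plain integers, and ``locate_inode`` is a
hypothetical name:

```python
def locate_inode(inode_num, inodes_per_group, inode_size):
    """Return (block group, index in the group's table, byte offset)."""
    if inode_num < 1:
        raise ValueError("inode numbers start at 1; there is no inode 0")
    bg, index = divmod(inode_num - 1, inodes_per_group)
    offset = index * inode_size
    return bg, index, offset
```

With 8192 inodes per group and 256-byte inode records, inode 8193 is the first
entry of block group 1: ``locate_inode(8193, 8192, 256)`` gives ``(1, 0, 0)``.
The offset is relative to the start of that group's inode table, whose
location comes from the block group descriptor.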

Inode Timestamps
~~~~~~~~~~~~~~~~

Four timestamps are recorded in the lower 128 bytes of the inode
structure -- inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
January 2038. For inodes that are not linked from any directory but are
still open (orphan inodes), the dtime field is overloaded for use with
the orphan list. The superblock field ``s_last_orphan`` points to the
first inode in the orphan list; dtime is then the number of the next
orphaned inode, or zero if there are no more orphans.
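
The orphan list is therefore a singly linked list threaded through the dtime
fields. A minimal sketch of walking it follows; ``read_dtime`` is a
hypothetical callback that returns the dtime field of a given inode number,
and the cycle check is a defensive addition rather than something the format
requires.

```python
def walk_orphan_list(s_last_orphan, read_dtime):
    """Yield orphan inode numbers, starting from the superblock's list head."""
    seen = set()
    ino = s_last_orphan
    while ino != 0:
        if ino in seen:
            raise ValueError("orphan list contains a cycle at inode %d" % ino)
        seen.add(ino)
        yield ino
        # For an orphan inode, dtime holds the next inode number (0 = end).
        ino = read_dtime(ino)
```

For instance, with ``s_last_orphan = 12`` and dtime values ``12 -> 15``,
``15 -> 7``, ``7 -> 0``, the generator produces 12, 15, 7.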
503
504If the inode structure size ``sb->s_inode_size`` is larger than 128
505bytes and the ``i_inode_extra`` field is large enough to encompass the
506respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
507inode fields are widened to 64 bits. Within this “extra” 32-bit field,
508the lower two bits are used to extend the 32-bit seconds field to be 34
509bit wide; the upper 30 bits are used to provide nanosecond timestamp
510accuracy. Therefore, timestamps should not overflow until May 2446.
511dtime was not widened. There is also a fifth timestamp to record inode
512creation time (crtime); this field is 64-bits wide and decoded in the
513same manner as 64-bit [cma]time. Neither crtime nor dtime are accessible
514through the regular stat() interface, though debugfs will report them.
515
516We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)).
517In other words:

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Extra epoch bits
     - MSB of 32-bit time
     - Adjustment for signed 32-bit to 64-bit tv\_sec
     - Decoded 64-bit tv\_sec
     - valid time range
   * - 0 0
     - 1
     - 0
     - ``-0x80000000 - -0x00000001``
     - 1901-12-13 to 1969-12-31
   * - 0 0
     - 0
     - 0
     - ``0x000000000 - 0x07fffffff``
     - 1970-01-01 to 2038-01-19
   * - 0 1
     - 1
     - 0x100000000
     - ``0x080000000 - 0x0ffffffff``
     - 2038-01-19 to 2106-02-07
   * - 0 1
     - 0
     - 0x100000000
     - ``0x100000000 - 0x17fffffff``
     - 2106-02-07 to 2174-02-25
   * - 1 0
     - 1
     - 0x200000000
     - ``0x180000000 - 0x1ffffffff``
     - 2174-02-25 to 2242-03-16
   * - 1 0
     - 0
     - 0x200000000
     - ``0x200000000 - 0x27fffffff``
     - 2242-03-16 to 2310-04-04
   * - 1 1
     - 1
     - 0x300000000
     - ``0x280000000 - 0x2ffffffff``
     - 2310-04-04 to 2378-04-22
   * - 1 1
     - 0
     - 0x300000000
     - ``0x300000000 - 0x37fffffff``
     - 2378-04-22 to 2446-05-10

This is a somewhat odd encoding since there are effectively seven times
as many positive values as negative values. There have also been
long-standing bugs decoding and encoding dates beyond 2038, which don't
seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
incorrectly use the extra epoch bits 1,1 for dates between 1901 and
1970. At some point the kernel will be fixed and e2fsck will fix this
situation, assuming that it is run before 2310.
575situation, assuming that it is run before 2310.