Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 1 | .. SPDX-License-Identifier: GPL-2.0 |
| 2 | |
| 3 | Directory Entries |
| 4 | ----------------- |
| 5 | |
| 6 | In an ext4 filesystem, a directory is more or less a flat file that maps |
| 7 | an arbitrary byte string (usually ASCII) to an inode number on the |
| 8 | filesystem. There can be many directory entries across the filesystem |
| 9 | that reference the same inode number--these are known as hard links, and |
| 10 | that is why hard links cannot reference files on other filesystems. As |
| 11 | such, directory entries are found by reading the data block(s) |
| 12 | associated with a directory file for the particular directory entry that |
| 13 | is desired. |
| 14 | |
| 15 | Linear (Classic) Directories |
| 16 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 17 | |
| 18 | By default, each directory lists its entries in an “almost-linear” |
| 19 | array. I write “almost” because it's not a linear array in the memory |
| 20 | sense because directory entries are not split across filesystem blocks. |
| 21 | Therefore, it is more accurate to say that a directory is a series of |
| 22 | data blocks and that each block contains a linear array of directory |
| 23 | entries. The end of each per-block array is signified by reaching the |
| 24 | end of the block; the last entry in the block has a record length that |
| 25 | takes it all the way to the end of the block. The end of the entire |
| 26 | directory is of course signified by reaching the end of the file. Unused |
| 27 | directory entries are signified by inode = 0. By default the filesystem |
| 28 | uses ``struct ext4_dir_entry_2`` for directory entries unless the |
| 29 | “filetype” feature flag is not set, in which case it uses |
| 30 | ``struct ext4_dir_entry``. |
| 31 | |
| 32 | The original directory entry format is ``struct ext4_dir_entry``, which |
| 33 | is at most 263 bytes long, though on disk you'll need to reference |
| 34 | ``dirent.rec_len`` to know for sure. |
| 35 | |
| 36 | .. list-table:: |
Darrick J. Wong | de7abd7 | 2018-10-02 22:43:40 -0400 | [diff] [blame] | 37 | :widths: 8 8 24 40 |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 38 | :header-rows: 1 |
| 39 | |
| 40 | * - Offset |
| 41 | - Size |
| 42 | - Name |
| 43 | - Description |
| 44 | * - 0x0 |
| 45 | - \_\_le32 |
| 46 | - inode |
| 47 | - Number of the inode that this directory entry points to. |
| 48 | * - 0x4 |
| 49 | - \_\_le16 |
| 50 | - rec\_len |
| 51 | - Length of this directory entry. Must be a multiple of 4. |
| 52 | * - 0x6 |
| 53 | - \_\_le16 |
| 54 | - name\_len |
| 55 | - Length of the file name. |
| 56 | * - 0x8 |
| 57 | - char |
| 58 | - name[EXT4\_NAME\_LEN] |
| 59 | - File name. |
| 60 | |
| 61 | Since file names cannot be longer than 255 bytes, the new directory |
Ayush Ranjan | 219db95 | 2019-08-22 23:18:33 -0400 | [diff] [blame] | 62 | entry format shortens the name\_len field and uses the space for a file |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 63 | type flag, probably to avoid having to load every inode during directory |
| 64 | tree traversal. This format is ``ext4_dir_entry_2``, which is at most |
| 65 | 263 bytes long, though on disk you'll need to reference |
| 66 | ``dirent.rec_len`` to know for sure. |
| 67 | |
| 68 | .. list-table:: |
Darrick J. Wong | de7abd7 | 2018-10-02 22:43:40 -0400 | [diff] [blame] | 69 | :widths: 8 8 24 40 |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 70 | :header-rows: 1 |
| 71 | |
| 72 | * - Offset |
| 73 | - Size |
| 74 | - Name |
| 75 | - Description |
| 76 | * - 0x0 |
| 77 | - \_\_le32 |
| 78 | - inode |
| 79 | - Number of the inode that this directory entry points to. |
| 80 | * - 0x4 |
| 81 | - \_\_le16 |
| 82 | - rec\_len |
| 83 | - Length of this directory entry. |
| 84 | * - 0x6 |
| 85 | - \_\_u8 |
| 86 | - name\_len |
| 87 | - Length of the file name. |
| 88 | * - 0x7 |
| 89 | - \_\_u8 |
| 90 | - file\_type |
| 91 | - File type code, see ftype_ table below. |
| 92 | * - 0x8 |
| 93 | - char |
| 94 | - name[EXT4\_NAME\_LEN] |
| 95 | - File name. |
| 96 | |
| 97 | .. _ftype: |
| 98 | |
| 99 | The directory file type is one of the following values: |
| 100 | |
| 101 | .. list-table:: |
Darrick J. Wong | de7abd7 | 2018-10-02 22:43:40 -0400 | [diff] [blame] | 102 | :widths: 16 64 |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 103 | :header-rows: 1 |
| 104 | |
| 105 | * - Value |
| 106 | - Description |
| 107 | * - 0x0 |
| 108 | - Unknown. |
| 109 | * - 0x1 |
| 110 | - Regular file. |
| 111 | * - 0x2 |
| 112 | - Directory. |
| 113 | * - 0x3 |
| 114 | - Character device file. |
| 115 | * - 0x4 |
| 116 | - Block device file. |
| 117 | * - 0x5 |
| 118 | - FIFO. |
| 119 | * - 0x6 |
| 120 | - Socket. |
| 121 | * - 0x7 |
| 122 | - Symbolic link. |
| 123 | |
Daniel Rosenberg | 705a3e5 | 2019-10-07 18:12:23 -0700 | [diff] [blame] | 124 | To support directories that are both encrypted and casefolded directories, we |
| 125 | must also include hash information in the directory entry. We append |
| 126 | ``ext4_extended_dir_entry_2`` to ``ext4_dir_entry_2`` except for the entries |
| 127 | for dot and dotdot, which are kept the same. The structure follows immediately |
| 128 | after ``name`` and is included in the size listed by ``rec_len`` If a directory |
| 129 | entry uses this extension, it may be up to 271 bytes. |
| 130 | |
| 131 | .. list-table:: |
| 132 | :widths: 8 8 24 40 |
| 133 | :header-rows: 1 |
| 134 | |
| 135 | * - Offset |
| 136 | - Size |
| 137 | - Name |
| 138 | - Description |
| 139 | * - 0x0 |
| 140 | - \_\_le32 |
| 141 | - hash |
| 142 | - The hash of the directory name |
| 143 | * - 0x4 |
| 144 | - \_\_le32 |
| 145 | - minor\_hash |
| 146 | - The minor hash of the directory name |
| 147 | |
| 148 | |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 149 | In order to add checksums to these classic directory blocks, a phony |
| 150 | ``struct ext4_dir_entry`` is placed at the end of each leaf block to |
| 151 | hold the checksum. The directory entry is 12 bytes long. The inode |
| 152 | number and name\_len fields are set to zero to fool old software into |
| 153 | ignoring an apparently empty directory entry, and the checksum is stored |
| 154 | in the place where the name normally goes. The structure is |
| 155 | ``struct ext4_dir_entry_tail``: |
| 156 | |
| 157 | .. list-table:: |
Darrick J. Wong | de7abd7 | 2018-10-02 22:43:40 -0400 | [diff] [blame] | 158 | :widths: 8 8 24 40 |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 159 | :header-rows: 1 |
| 160 | |
| 161 | * - Offset |
| 162 | - Size |
| 163 | - Name |
| 164 | - Description |
| 165 | * - 0x0 |
| 166 | - \_\_le32 |
| 167 | - det\_reserved\_zero1 |
| 168 | - Inode number, which must be zero. |
| 169 | * - 0x4 |
| 170 | - \_\_le16 |
| 171 | - det\_rec\_len |
| 172 | - Length of this directory entry, which must be 12. |
| 173 | * - 0x6 |
| 174 | - \_\_u8 |
| 175 | - det\_reserved\_zero2 |
| 176 | - Length of the file name, which must be zero. |
| 177 | * - 0x7 |
| 178 | - \_\_u8 |
| 179 | - det\_reserved\_ft |
| 180 | - File type, which must be 0xDE. |
| 181 | * - 0x8 |
| 182 | - \_\_le32 |
| 183 | - det\_checksum |
| 184 | - Directory leaf block checksum. |
| 185 | |
| 186 | The leaf directory block checksum is calculated against the FS UUID, the |
| 187 | directory's inode number, the directory's inode generation number, and |
| 188 | the entire directory entry block up to (but not including) the fake |
| 189 | directory entry. |
| 190 | |
| 191 | Hash Tree Directories |
| 192 | ~~~~~~~~~~~~~~~~~~~~~ |
| 193 | |
| 194 | A linear array of directory entries isn't great for performance, so a |
| 195 | new feature was added to ext3 to provide a faster (but peculiar) |
| 196 | balanced tree keyed off a hash of the directory entry name. If the |
| 197 | EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a |
| 198 | hashed btree (htree) to organize and find directory entries. For |
| 199 | backwards read-only compatibility with ext2, this tree is actually |
| 200 | hidden inside the directory file, masquerading as “empty” directory data |
| 201 | blocks! It was stated previously that the end of the linear directory |
| 202 | entry table was signified with an entry pointing to inode 0; this is |
| 203 | (ab)used to fool the old linear-scan algorithm into thinking that the |
| 204 | rest of the directory block is empty so that it moves on. |
| 205 | |
| 206 | The root of the tree always lives in the first data block of the |
| 207 | directory. By ext2 custom, the '.' and '..' entries must appear at the |
| 208 | beginning of this first block, so they are put here as two |
| 209 | ``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of |
| 210 | the root node contains metadata about the tree and finally a hash->block |
| 211 | map to find nodes that are lower in the htree. If |
| 212 | ``dx_root.info.indirect_levels`` is non-zero then the htree has two |
| 213 | levels; the data block pointed to by the root node's map is an interior |
| 214 | node, which is indexed by a minor hash. Interior nodes in this tree |
| 215 | contains a zeroed out ``struct ext4_dir_entry_2`` followed by a |
| 216 | minor\_hash->block map to find leafe nodes. Leaf nodes contain a linear |
| 217 | array of all ``struct ext4_dir_entry_2``; all of these entries |
| 218 | (presumably) hash to the same value. If there is an overflow, the |
| 219 | entries simply overflow into the next leaf node, and the |
| 220 | least-significant bit of the hash (in the interior node map) that gets |
| 221 | us to this next leaf node is set. |
| 222 | |
| 223 | To traverse the directory as a htree, the code calculates the hash of |
| 224 | the desired file name and uses it to find the corresponding block |
| 225 | number. If the tree is flat, the block is a linear array of directory |
| 226 | entries that can be searched; otherwise, the minor hash of the file name |
| 227 | is computed and used against this second block to find the corresponding |
| 228 | third block number. That third block number will be a linear array of |
| 229 | directory entries. |
| 230 | |
| 231 | To traverse the directory as a linear array (such as the old code does), |
| 232 | the code simply reads every data block in the directory. The blocks used |
| 233 | for the htree will appear to have no entries (aside from '.' and '..') |
| 234 | and so only the leaf nodes will appear to have any interesting content. |
| 235 | |
| 236 | The root of the htree is in ``struct dx_root``, which is the full length |
| 237 | of a data block: |
| 238 | |
| 239 | .. list-table:: |
Darrick J. Wong | de7abd7 | 2018-10-02 22:43:40 -0400 | [diff] [blame] | 240 | :widths: 8 8 24 40 |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 241 | :header-rows: 1 |
| 242 | |
| 243 | * - Offset |
| 244 | - Type |
| 245 | - Name |
| 246 | - Description |
| 247 | * - 0x0 |
| 248 | - \_\_le32 |
| 249 | - dot.inode |
| 250 | - inode number of this directory. |
| 251 | * - 0x4 |
| 252 | - \_\_le16 |
| 253 | - dot.rec\_len |
| 254 | - Length of this record, 12. |
| 255 | * - 0x6 |
| 256 | - u8 |
| 257 | - dot.name\_len |
| 258 | - Length of the name, 1. |
| 259 | * - 0x7 |
| 260 | - u8 |
| 261 | - dot.file\_type |
| 262 | - File type of this entry, 0x2 (directory) (if the feature flag is set). |
| 263 | * - 0x8 |
| 264 | - char |
| 265 | - dot.name[4] |
| 266 | - “.\\0\\0\\0” |
| 267 | * - 0xC |
| 268 | - \_\_le32 |
| 269 | - dotdot.inode |
| 270 | - inode number of parent directory. |
| 271 | * - 0x10 |
| 272 | - \_\_le16 |
| 273 | - dotdot.rec\_len |
| 274 | - block\_size - 12. The record length is long enough to cover all htree |
| 275 | data. |
| 276 | * - 0x12 |
| 277 | - u8 |
| 278 | - dotdot.name\_len |
| 279 | - Length of the name, 2. |
| 280 | * - 0x13 |
| 281 | - u8 |
| 282 | - dotdot.file\_type |
| 283 | - File type of this entry, 0x2 (directory) (if the feature flag is set). |
| 284 | * - 0x14 |
| 285 | - char |
| 286 | - dotdot\_name[4] |
| 287 | - “..\\0\\0” |
| 288 | * - 0x18 |
| 289 | - \_\_le32 |
| 290 | - struct dx\_root\_info.reserved\_zero |
| 291 | - Zero. |
| 292 | * - 0x1C |
| 293 | - u8 |
| 294 | - struct dx\_root\_info.hash\_version |
| 295 | - Hash type, see dirhash_ table below. |
| 296 | * - 0x1D |
| 297 | - u8 |
| 298 | - struct dx\_root\_info.info\_length |
| 299 | - Length of the tree information, 0x8. |
| 300 | * - 0x1E |
| 301 | - u8 |
| 302 | - struct dx\_root\_info.indirect\_levels |
| 303 | - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR |
| 304 | feature is set; cannot be larger than 2 otherwise. |
| 305 | * - 0x1F |
| 306 | - u8 |
| 307 | - struct dx\_root\_info.unused\_flags |
| 308 | - |
| 309 | * - 0x20 |
| 310 | - \_\_le16 |
| 311 | - limit |
| 312 | - Maximum number of dx\_entries that can follow this header, plus 1 for |
| 313 | the header itself. |
| 314 | * - 0x22 |
| 315 | - \_\_le16 |
| 316 | - count |
| 317 | - Actual number of dx\_entries that follow this header, plus 1 for the |
| 318 | header itself. |
| 319 | * - 0x24 |
| 320 | - \_\_le32 |
| 321 | - block |
| 322 | - The block number (within the directory file) that goes with hash=0. |
| 323 | * - 0x28 |
| 324 | - struct dx\_entry |
| 325 | - entries[0] |
| 326 | - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. |
| 327 | |
| 328 | .. _dirhash: |
| 329 | |
| 330 | The directory hash is one of the following values: |
| 331 | |
| 332 | .. list-table:: |
Darrick J. Wong | de7abd7 | 2018-10-02 22:43:40 -0400 | [diff] [blame] | 333 | :widths: 16 64 |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 334 | :header-rows: 1 |
| 335 | |
| 336 | * - Value |
| 337 | - Description |
| 338 | * - 0x0 |
| 339 | - Legacy. |
| 340 | * - 0x1 |
| 341 | - Half MD4. |
| 342 | * - 0x2 |
| 343 | - Tea. |
| 344 | * - 0x3 |
| 345 | - Legacy, unsigned. |
| 346 | * - 0x4 |
| 347 | - Half MD4, unsigned. |
| 348 | * - 0x5 |
| 349 | - Tea, unsigned. |
Daniel Rosenberg | 705a3e5 | 2019-10-07 18:12:23 -0700 | [diff] [blame] | 350 | * - 0x6 |
| 351 | - Siphash. |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 352 | |
| 353 | Interior nodes of an htree are recorded as ``struct dx_node``, which is |
| 354 | also the full length of a data block: |
| 355 | |
| 356 | .. list-table:: |
Darrick J. Wong | de7abd7 | 2018-10-02 22:43:40 -0400 | [diff] [blame] | 357 | :widths: 8 8 24 40 |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 358 | :header-rows: 1 |
| 359 | |
| 360 | * - Offset |
| 361 | - Type |
| 362 | - Name |
| 363 | - Description |
| 364 | * - 0x0 |
| 365 | - \_\_le32 |
| 366 | - fake.inode |
| 367 | - Zero, to make it look like this entry is not in use. |
| 368 | * - 0x4 |
| 369 | - \_\_le16 |
| 370 | - fake.rec\_len |
| 371 | - The size of the block, in order to hide all of the dx\_node data. |
| 372 | * - 0x6 |
| 373 | - u8 |
| 374 | - name\_len |
| 375 | - Zero. There is no name for this “unused” directory entry. |
| 376 | * - 0x7 |
| 377 | - u8 |
| 378 | - file\_type |
| 379 | - Zero. There is no file type for this “unused” directory entry. |
| 380 | * - 0x8 |
| 381 | - \_\_le16 |
| 382 | - limit |
| 383 | - Maximum number of dx\_entries that can follow this header, plus 1 for |
| 384 | the header itself. |
| 385 | * - 0xA |
| 386 | - \_\_le16 |
| 387 | - count |
| 388 | - Actual number of dx\_entries that follow this header, plus 1 for the |
| 389 | header itself. |
| 390 | * - 0xE |
| 391 | - \_\_le32 |
| 392 | - block |
| 393 | - The block number (within the directory file) that goes with the lowest |
| 394 | hash value of this block. This value is stored in the parent block. |
| 395 | * - 0x12 |
| 396 | - struct dx\_entry |
| 397 | - entries[0] |
| 398 | - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. |
| 399 | |
| 400 | The hash maps that exist in both ``struct dx_root`` and |
| 401 | ``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes |
| 402 | long: |
| 403 | |
| 404 | .. list-table:: |
Darrick J. Wong | de7abd7 | 2018-10-02 22:43:40 -0400 | [diff] [blame] | 405 | :widths: 8 8 24 40 |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 406 | :header-rows: 1 |
| 407 | |
| 408 | * - Offset |
| 409 | - Type |
| 410 | - Name |
| 411 | - Description |
| 412 | * - 0x0 |
| 413 | - \_\_le32 |
| 414 | - hash |
| 415 | - Hash code. |
| 416 | * - 0x4 |
| 417 | - \_\_le32 |
| 418 | - block |
| 419 | - Block number (within the directory file, not filesystem blocks) of the |
| 420 | next node in the htree. |
| 421 | |
| 422 | (If you think this is all quite clever and peculiar, so does the |
| 423 | author.) |
| 424 | |
| 425 | If metadata checksums are enabled, the last 8 bytes of the directory |
| 426 | block (precisely the length of one dx\_entry) are used to store a |
| 427 | ``struct dx_tail``, which contains the checksum. The ``limit`` and |
| 428 | ``count`` entries in the dx\_root/dx\_node structures are adjusted as |
| 429 | necessary to fit the dx\_tail into the block. If there is no space for |
| 430 | the dx\_tail, the user is notified to run e2fsck -D to rebuild the |
| 431 | directory index (which will ensure that there's space for the checksum. |
| 432 | The dx\_tail structure is 8 bytes long and looks like this: |
| 433 | |
| 434 | .. list-table:: |
Darrick J. Wong | de7abd7 | 2018-10-02 22:43:40 -0400 | [diff] [blame] | 435 | :widths: 8 8 24 40 |
Darrick J. Wong | 60edae3 | 2018-07-29 15:46:00 -0400 | [diff] [blame] | 436 | :header-rows: 1 |
| 437 | |
| 438 | * - Offset |
| 439 | - Type |
| 440 | - Name |
| 441 | - Description |
| 442 | * - 0x0 |
| 443 | - u32 |
| 444 | - dt\_reserved |
| 445 | - Zero. |
| 446 | * - 0x4 |
| 447 | - \_\_le32 |
| 448 | - dt\_checksum |
| 449 | - Checksum of the htree directory block. |
| 450 | |
| 451 | The checksum is calculated against the FS UUID, the htree index header |
| 452 | (dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in |
| 453 | use, and the tail block (dx\_tail). |