.. SPDX-License-Identifier: GPL-2.0

Layout
------

The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):

.. list-table::
   :widths: 1 1 1 1 1 1 1 1
   :header-rows: 1

   * - Group 0 Padding
     - ext4 Super Block
     - Group Descriptors
     - Reserved GDT Blocks
     - Data Block Bitmap
     - inode Bitmap
     - inode Table
     - Data Blocks
   * - 1024 bytes
     - 1 block
     - many blocks
     - many blocks
     - 1 block
     - 1 block
     - many blocks
     - many more blocks

For the special case of block group 0, the first 1024 bytes are unused,
to allow for the installation of x86 boot sectors and other oddities.
The superblock starts at byte offset 1024, whichever block that
happens to be (usually 0). However, if the block size is 1024 bytes,
then block 0 is marked in use and the superblock goes in block 1.
For all other block groups, there is no padding.
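
The placement rule above amounts to integer division of the fixed
1024-byte offset by the block size. The following Python sketch
illustrates this. The function name ``primary_superblock_location`` is
hypothetical and not part of any ext4 tool, and the power-of-two check
is an assumption added for the example.

```python
SUPERBLOCK_OFFSET = 1024  # byte offset of the primary superblock in group 0


def primary_superblock_location(block_size):
    """Return (block_number, offset_within_block) of the primary superblock.

    Hypothetical helper for illustration only.
    """
    # Assumption for this sketch: block sizes are powers of two >= 1024.
    if block_size < 1024 or block_size & (block_size - 1):
        raise ValueError("block size must be a power of two >= 1024")
    return divmod(SUPERBLOCK_OFFSET, block_size)


for size in (1024, 2048, 4096):
    block, offset = primary_superblock_location(size)
    print(f"block size {size}: block {block}, offset {offset}")
```

With a 1024-byte block size the sketch reports block 1 at offset 0; for
larger block sizes it reports block 0 at offset 1024.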

The ext4 driver primarily works with the superblock and the group
descriptors that are found in block group 0. Redundant copies of the
superblock and group descriptors are written to some of the block groups
across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
when the filesystem is freshly formatted, mkfs will allocate “reserve
GDT block” space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to increase in size by a factor of
1024x over the original filesystem size.

The location of the inode table is given by ``grp.bg_inode_table_*``. It
is a contiguous range of blocks large enough to contain
``sb.s_inodes_per_group * sb.s_inode_size`` bytes.
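
As a rough illustration, the number of blocks occupied by one group's
inode table can be estimated by rounding that byte count up to a whole
number of blocks. The helper name and the sample values (8192 inodes per
group, 256-byte inodes, 4096-byte blocks) are assumptions chosen for the
example, not values read from a real filesystem.

```python
def inode_table_blocks(inodes_per_group, inode_size, block_size):
    """Blocks needed to hold inodes_per_group * inode_size bytes.

    Hypothetical helper; the arguments mirror sb.s_inodes_per_group,
    sb.s_inode_size and the filesystem block size.
    """
    table_bytes = inodes_per_group * inode_size
    # Ceiling division: a partially filled block still occupies a block.
    return -(-table_bytes // block_size)


# Illustrative values only.
print(inode_table_blocks(8192, 256, 4096))  # 512
```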

As for the ordering of items in a block group, it is generally
established that the super block and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
groups (flex\_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.

Flexible Block Groups
---------------------

Starting in ext4, there is a new feature called flexible block groups
(flex\_bg). In a flex\_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
first block group of the flex\_bg are expanded to include the bitmaps
and inode tables of all other block groups in the flex\_bg. For example,
if the flex\_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be contiguous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
if flex\_bg is enabled. The number of block groups that make up a
flex\_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
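
The relationship between ``sb.s_log_groups_per_flex`` and the flex\_bg
size can be sketched in Python as follows. The function names are
hypothetical, and the sketch assumes that a flex\_bg is made up of
consecutively numbered block groups, as in the four-group example above.

```python
def groups_per_flex(s_log_groups_per_flex):
    """Number of block groups in one flex_bg: 2 ^ s_log_groups_per_flex."""
    return 1 << s_log_groups_per_flex


def flex_group_of(block_group, s_log_groups_per_flex):
    """Index of the flex_bg containing block_group.

    Assumes flex_bgs consist of consecutive block groups.
    """
    return block_group >> s_log_groups_per_flex


def first_group_of_flex(block_group, s_log_groups_per_flex):
    """First block group of the flex_bg that contains block_group."""
    flex = flex_group_of(block_group, s_log_groups_per_flex)
    return flex << s_log_groups_per_flex


log = 2  # illustrative value: 2 ^ 2 = 4 groups per flex_bg
print(groups_per_flex(log))         # 4
print(flex_group_of(5, log))        # 1
print(first_group_of_flex(5, log))  # 4
```

Under these assumptions, block group 5 belongs to the second flex\_bg,
whose packed bitmaps and inode tables would live in block group 4.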

Meta Block Groups
-----------------

Without the META\_BG option, for safety reasons, all block group
descriptor copies are kept in the first block group. Given the default
128 MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27 / 64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 * 2^27 = 2^48 bytes, or 256 TiB.
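
The arithmetic behind this limit can be checked with a few lines of
Python. The constant names are made up for the example; the values are
the ones quoted in the paragraph above.

```python
BLOCK_GROUP_BYTES = 1 << 27  # default 128 MiB block group
DESC_SIZE = 64               # bytes per group descriptor

# All descriptors must fit inside the first block group.
max_groups = BLOCK_GROUP_BYTES // DESC_SIZE
max_fs_bytes = max_groups * BLOCK_GROUP_BYTES

print(max_groups == 1 << 21)       # True
print(max_fs_bytes == 1 << 48)     # True
print(max_fs_bytes // (1 << 40))   # 256 (TiB)
```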

The solution to this problem is to use the metablock group feature
(META\_BG), which is already in ext3 for all 2.6 releases. With the
META\_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with a 4 KB block size and 64-byte group descriptors, a
single metablock group includes 64 block groups, or 8 GiB of disk
space. The metablock group feature moves the location of the group
descriptors from the congested first block group of the whole
filesystem into the first group of each metablock group itself. The
backups are in the second and last group of each metablock group. This
raises the limit of 2^21 block groups to the hard limit of 2^32,
allowing support for a 512 PiB filesystem.
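
The following Python sketch works through these numbers and shows which
block groups of a metablock group would carry its descriptor block. The
function names are hypothetical, the defaults assume a 4 KB block size
with 64-byte descriptors, and the sketch ignores the case where the last
metablock group of a filesystem is only partially populated.

```python
BLOCK_GROUP_BYTES = 1 << 27  # default 128 MiB block group


def groups_per_meta_bg(block_size, desc_size):
    """Number of group descriptors that fit in a single block."""
    return block_size // desc_size


def descriptor_copy_groups(meta_bg, block_size=4096, desc_size=64):
    """Block groups of metablock group `meta_bg` holding its descriptor block.

    Returns (first, second, last): the primary copy is in the first
    group, the backups are in the second and last groups.
    Assumes the metablock group is fully populated.
    """
    n = groups_per_meta_bg(block_size, desc_size)
    first = meta_bg * n
    return first, first + 1, first + n - 1


n = groups_per_meta_bg(4096, 64)
print(n)                                            # 64
print(n * BLOCK_GROUP_BYTES // (1 << 30))           # 8 (GiB)
print(descriptor_copy_groups(2))                    # (128, 129, 191)
print((1 << 32) * BLOCK_GROUP_BYTES // (1 << 50))   # 512 (PiB)
```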

The change in the filesystem format replaces the current scheme where
the superblock is followed by a variable-length set of block group
descriptors. Instead, the superblock and a single block group descriptor
block are placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
block. When the size of the block group descriptor structure is 32
bytes (that is, without the 64bit feature), a meta-block group contains
32 block groups for filesystems with a 1 KB block size, and 128 block
groups for filesystems with a 4 KB block size. Filesystems can either be
created using this new block group descriptor layout, or existing
filesystems can be resized on-line, and the field s\_first\_meta\_bg in
the superblock will indicate the first block group using this new
layout.

Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.

Lazy Block Group Initialization
-------------------------------

A new feature for ext4 is a set of three block group descriptor flags
that enable mkfs to skip initializing parts of the block group
metadata. Specifically, the INODE\_UNINIT and BLOCK\_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
only fixed-location block group metadata. The INODE\_ZEROED flag means
that the inode table has been initialized; mkfs will unset this flag and
rely on the kernel to initialize the inode tables in the background.
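
A minimal Python sketch of how the three flags might be interpreted is
shown below. The bit values (0x1, 0x2, 0x4) and the ``EXT4_BG_`` prefix
follow the ext4 group descriptor definitions rather than anything stated
in this section, so treat them as an assumption; the function name and
the dictionary keys are hypothetical.

```python
# Assumed bit values for the group descriptor flags field.
EXT4_BG_INODE_UNINIT = 0x1  # inode bitmap is not initialized on disk
EXT4_BG_BLOCK_UNINIT = 0x2  # block bitmap is not initialized on disk
EXT4_BG_INODE_ZEROED = 0x4  # inode table has been zeroed


def describe_bg_flags(bg_flags):
    """Summarize which parts of a group's metadata are initialized on disk."""
    return {
        "inode_bitmap_initialized": not bg_flags & EXT4_BG_INODE_UNINIT,
        "block_bitmap_initialized": not bg_flags & EXT4_BG_BLOCK_UNINIT,
        "inode_table_zeroed": bool(bg_flags & EXT4_BG_INODE_ZEROED),
    }


# A freshly formatted, empty group: both bitmaps are computed on demand
# and the inode table has not been zeroed yet.
print(describe_bg_flags(EXT4_BG_INODE_UNINIT | EXT4_BG_BLOCK_UNINIT))
```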

By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note that the feature flag is
RO\_COMPAT\_GDT\_CSUM, but the dumpe2fs output prints this as
“uninit\_bg”. They are the same thing.