.. SPDX-License-Identifier: GPL-2.0

======================================
Enhanced Read-Only File System - EROFS
======================================

Overview
========

EROFS stands for Enhanced Read-Only File System. Unlike other read-only
file systems, it is designed for flexibility and scalability while
remaining simple and high-performance.

It is designed as a better filesystem solution for the following scenarios:

 - read-only storage media, or

 - part of a fully trusted read-only solution, which needs to be immutable
   and bit-for-bit identical to the official golden image of a release
   due to security and other considerations, and

 - setups that aim to minimize extra storage space with guaranteed
   end-to-end performance by using a compact layout, transparent file
   compression and direct access, especially embedded devices with limited
   memory and high-density hosts with numerous containers.

Here are the main features of EROFS:

 - Little endian on-disk design;

 - Currently 4KB block size (nobh) and therefore a maximum 16TB address space;

 - Metadata and data can be mixed by design;

 - 2 inode versions for different requirements:

   =====================  ============  =====================================
                          compact (v1)  extended (v2)
   =====================  ============  =====================================
   Inode metadata size    32 bytes      64 bytes
   Max file size          4 GB          16 EB (also limited by max. vol size)
   Max uids/gids          65536         4294967296
   File change time       no            yes (64 + 32-bit timestamp)
   Max hardlinks          65536         4294967296
   Metadata reserved      4 bytes       14 bytes
   =====================  ============  =====================================

 - Support extended attributes (xattrs) as an option;

 - Support inline xattrs and inline tail-end data for all files;

 - Support POSIX.1e ACLs by using xattrs;

 - Support transparent data compression as an option:
   LZ4 algorithm with fixed-sized output compression for high performance;

 - Multiple device support for multi-layer container images.

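The size limits listed above follow from simple arithmetic. The sketch below
is illustrative only: it assumes 32-bit on-disk block addresses, a 32-bit
file size field and 16-bit uid/gid/hardlink fields in the compact inode.
These widths are assumptions chosen because they reproduce the figures above;
the constant names are not taken from the kernel sources.

.. code-block:: python

   BLOCK_SIZE = 4096     # 4KB block size, as stated in the feature list
   BLKADDR_BITS = 32     # ASSUMPTION: width of an on-disk block address

   # Maximum addressable volume size: 2^32 blocks of 4KB each.
   max_volume_bytes = (1 << BLKADDR_BITS) * BLOCK_SIZE

   # Compact (v1) inode limits, ASSUMING a 32-bit size field and
   # 16-bit uid/gid/hardlink fields.
   compact_max_file_bytes = 1 << 32
   compact_max_ids = 1 << 16

   print(max_volume_bytes // 2**40, "TiB")        # 16 TiB
   print(compact_max_file_bytes // 2**30, "GiB")  # 4 GiB
   print(compact_max_ids)                         # 65536
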
The following git tree provides the file system user-space tools under
development (e.g. the formatting tool mkfs.erofs):

- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git

Bug reports and patches are welcome. Please send them to the linux-erofs
mailing list:

- linux-erofs mailing list <linux-erofs@lists.ozlabs.org>

Mount options
=============

===================    =========================================================
(no)user_xattr         Set up extended user attributes. Note: xattr is enabled
                       by default if CONFIG_EROFS_FS_XATTR is selected.
(no)acl                Set up POSIX access control lists. Note: acl is enabled
                       by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
cache_strategy=%s      Select a strategy for cached decompression from now on:

                       ==========  =============================================
                       disabled    In-place I/O decompression only;
                       readahead   Cache the last incomplete compressed physical
                                   cluster for further reading. It still does
                                   in-place I/O decompression for the remaining
                                   compressed physical clusters;
                       readaround  Cache both ends of incomplete compressed
                                   physical clusters for further reading.
                                   It still does in-place I/O decompression
                                   for the remaining compressed physical
                                   clusters.
                       ==========  =============================================
dax={always,never}     Use direct access (no page cache). See
                       Documentation/filesystems/dax.rst.
dax                    A legacy option which is an alias for ``dax=always``.
device=%s              Specify a path to an extra device to be used together.
===================    =========================================================

Sysfs Entries
=============

Information about mounted erofs file systems can be found in /sys/fs/erofs.
Each mounted filesystem will have a directory in /sys/fs/erofs based on its
device name (e.g., /sys/fs/erofs/sda).
See also Documentation/ABI/testing/sysfs-fs-erofs.

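As a rough illustration of the layout described above, the following sketch
walks /sys/fs/erofs and collects the attribute names found under each
per-device directory. The helper name ``list_erofs_sysfs`` is hypothetical,
and treating every subdirectory as a device entry is a simplification; the
authoritative list of entries is in the ABI document.

.. code-block:: python

   import os

   def list_erofs_sysfs(root="/sys/fs/erofs"):
       """Map each directory under `root` to its sorted attribute names."""
       result = {}
       if not os.path.isdir(root):
           return result  # erofs not available or nothing mounted
       for name in sorted(os.listdir(root)):
           path = os.path.join(root, name)
           if os.path.isdir(path):
               result[name] = sorted(os.listdir(path))
       return result

Calling ``list_erofs_sysfs()`` on a machine without any mounted EROFS volume
simply returns an empty dictionary.
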
On-disk details
===============

Summary
-------
Unlike other read-only file systems, an EROFS volume is designed
to be as simple as possible::

                                  |-> aligned with the block size
     ____________________________________________________________
    | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
    |_|__|_|_____|__________|_____|______|__________|_____|______|
    0 +1K

All data areas should be aligned with the block size, but metadata areas
need not be. All metadata can now be observed in two different spaces (views):

1. Inode metadata space

   Each valid inode should be aligned with an inode slot, which has a fixed
   size (32 bytes) and is designed to match the compact inode size.

   Each inode can be directly found with the following formula:
        inode offset = meta_blkaddr * block_size + 32 * nid

   ::

                                 |-> aligned with 8B
                                            |-> followed closely
     + meta_blkaddr blocks                                      |-> another slot
      _____________________________________________________________________
     |  ...   | inode |  xattrs  | extents  | data inline | ... | inode ...
     |________|_______|(optional)|(optional)|__(optional)_|_____|__________
              |-> aligned with the inode slot size
                   .                   .
                 .                         .
               .                              .
             .                                    .
           .                                         .
         .                                              .
       .____________________________________________________|-> aligned with 4B
       | xattr_ibody_header | shared xattrs | inline xattrs |
       |____________________|_______________|_______________|
       |->    12 bytes    <-|->x * 4 bytes<-|               .
                           .                .                 .
                     .                      .                   .
                .                           .                     .
            ._______________________________.______________________.
            | id | id | id | id |  ... | id | ent | ... | ent| ... |
            |____|____|____|____|______|____|_____|_____|____|_____|
                                            |-> aligned with 4B
                                                        |-> aligned with 4B

   An inode can be 32 or 64 bytes, which is distinguished by a field common
   to all inode versions, i_format::

        __________________               __________________
       |     i_format     |             |     i_format     |
       |__________________|             |__________________|
       |        ...       |             |        ...       |
       |                  |             |                  |
       |__________________| 32 bytes    |                  |
                                        |                  |
                                        |__________________| 64 bytes

   Xattrs, extents and inline data follow the corresponding inode with
   proper alignment, and each of them is optional depending on the data
   mapping. Currently, 5 data layouts are supported in total:

   ==  ====================================================================
   0   flat file data without data inline (no extent);
   1   fixed-sized output data compression (with non-compacted indexes);
   2   flat file data with tail packing data inline (no extent);
   3   fixed-sized output data compression (with compacted indexes, v5.3+);
   4   chunk-based file (v5.15+).
   ==  ====================================================================

   The size of the optional xattrs is indicated by i_xattr_icount in the
   inode header. Large xattrs or xattrs shared by many different files can
   be stored in the shared xattrs metadata rather than inlined right after
   the inode.

2. Shared xattrs metadata space

   The shared xattrs space is similar to the inode space above. It starts at
   a specific block indicated by xattr_blkaddr, with entries organized one
   after another with proper alignment.

   Each shared xattr can also be directly found with the following formula:
        xattr offset = xattr_blkaddr * block_size + 4 * xattr_id

   ::

             |-> aligned with 4 bytes
    + xattr_blkaddr blocks                     |-> aligned with 4 bytes
     _________________________________________________________________________
    |  ...   | xattr_entry |  xattr data | ... |  xattr_entry | xattr data  ...
    |________|_____________|_____________|_____|______________|_______________

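The two lookup formulas above can be written down directly. This is a minimal
sketch, assuming a 4KB block size; the function and constant names are
illustrative and do not come from the kernel sources.

.. code-block:: python

   BLOCK_SIZE = 4096       # assumed block size (4KB)
   INODE_SLOT_SIZE = 32    # inode slot size from the text above
   XATTR_ID_UNIT = 4       # shared xattr ids count in 4-byte units

   def inode_offset(meta_blkaddr, nid, block_size=BLOCK_SIZE):
       """Byte offset of inode `nid` in the inode metadata space."""
       return meta_blkaddr * block_size + INODE_SLOT_SIZE * nid

   def shared_xattr_offset(xattr_blkaddr, xattr_id, block_size=BLOCK_SIZE):
       """Byte offset of shared xattr `xattr_id` in the shared xattrs space."""
       return xattr_blkaddr * block_size + XATTR_ID_UNIT * xattr_id

   print(inode_offset(1, 36))         # 4096 + 32 * 36 = 5248
   print(shared_xattr_offset(2, 5))   # 8192 + 4 * 5 = 8212

Since a 64-byte extended inode covers two 32-byte slots, the inode following
it cannot start before ``nid + 2``.
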
Directories
-----------
All directories are now organized in a compact on-disk format. Note that
each directory block is divided into index and name areas in order to support
random file lookup, and all directory entries are _strictly_ recorded in
alphabetical order in order to support an improved prefix binary search
algorithm (refer to the related source code for details).

::

                     ___________________________
                    /                           |
                   /              ______________|________________
                  /              /              | nameoff1       | nameoffN-1
     ____________.______________._______________v________________v__________
    | dirent | dirent | ... | dirent | filename | filename | ... | filename |
    |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
         \                           ^
          \                          |                           * could have
           \                         |                             trailing '\0'
            \________________________| nameoff0
                                 Directory block

Note that apart from the offset of the first filename, nameoff0 also indicates
the total number of directory entries in this block, so there is no need to
introduce another on-disk field for it.

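The role of nameoff0 can be illustrated with a small sketch that packs and
searches one directory block. It assumes a 12-byte directory entry laid out
as a 64-bit nid, a 16-bit nameoff, an 8-bit file type and one reserved byte,
all little endian; check this layout against erofs_fs.h before relying on it.
The helpers are hypothetical, and the lookup is a plain binary search over
byte-wise sorted names rather than the prefix binary search used by the
kernel.

.. code-block:: python

   import struct

   # ASSUMED dirent layout: nid (u64), nameoff (u16), file_type (u8), reserved (u8)
   DIRENT = struct.Struct("<QHBB")

   def build_dir_block(entries):
       """Pack (name, nid, file_type) tuples into index and name areas."""
       entries = sorted(entries, key=lambda e: e[0])
       first_nameoff = len(entries) * DIRENT.size
       index, names = b"", b""
       for name, nid, ftype in entries:
           index += DIRENT.pack(nid, first_nameoff + len(names), ftype, 0)
           names += name
       return index + names

   def parse_dirents(block):
       """nameoff0 / sizeof(dirent) gives the number of entries in the block."""
       nameoff0 = DIRENT.unpack_from(block, 0)[1]
       count = nameoff0 // DIRENT.size
       return [DIRENT.unpack_from(block, i * DIRENT.size) for i in range(count)]

   def name_at(block, dirents, i):
       """A name ends where the next one starts, or at the end of the block."""
       start = dirents[i][1]
       end = dirents[i + 1][1] if i + 1 < len(dirents) else len(block)
       return block[start:end].rstrip(b"\0")  # the last name may be NUL-padded

   def lookup(block, target):
       """Return the nid of `target`, or None if it is not in this block."""
       dirents = parse_dirents(block)
       lo, hi = 0, len(dirents) - 1
       while lo <= hi:
           mid = (lo + hi) // 2
           name = name_at(block, dirents, mid)
           if name == target:
               return dirents[mid][0]
           if name < target:
               lo = mid + 1
           else:
               hi = mid - 1
       return None

For example, a block built from the names ``a``, ``b`` and ``etc`` has
nameoff0 equal to 36, which yields 36 / 12 = 3 entries.
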
Chunk-based file
----------------
In order to support chunk-based data deduplication, a new inode data layout has
been supported since Linux v5.15: files are split into equal-sized data chunks,
with the ``extents`` area of the inode metadata indicating how to get the chunk
data. This area can be either a simple 4-byte block address array or an array
in the 8-byte chunk index form (see struct erofs_inode_chunk_index in
erofs_fs.h for more details).

Note that chunk-based files are all uncompressed for now.

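The simpler 4-byte block address form can be sketched as follows. This is an
assumption-laden illustration: it supposes that each little-endian array
entry holds the starting block address of one chunk, that the chunk size is
a multiple of the block size, and it ignores holes and the 8-byte chunk index
form entirely. The function name is hypothetical.

.. code-block:: python

   import struct

   BLOCK_SIZE = 4096  # assumed block size (4KB)

   def chunk_read_offset(blkaddr_array, chunk_size, file_offset,
                         block_size=BLOCK_SIZE):
       """Map a file offset to a byte offset on the volume."""
       n_chunks = len(blkaddr_array) // 4
       chunk_idx, within_chunk = divmod(file_offset, chunk_size)
       if chunk_idx >= n_chunks:
           raise IndexError("offset beyond the last chunk")
       # ASSUMPTION: entry i is the first block of chunk i
       (start_blk,) = struct.unpack_from("<I", blkaddr_array, 4 * chunk_idx)
       return start_blk * block_size + within_chunk

   # Chunks 0 and 1 are deduplicated: both point at block 100.
   addrs = struct.pack("<3I", 100, 100, 250)
   print(chunk_read_offset(addrs, 8192, 8192 + 10))  # 100 * 4096 + 10 = 409610
   print(chunk_read_offset(addrs, 8192, 16384))      # 250 * 4096 = 1024000
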
Data compression
----------------
EROFS implements LZ4 fixed-sized output compression, which generates fixed-sized
compressed data blocks from variable-sized input, in contrast to other existing
fixed-sized input solutions. Fixed-sized output compression can achieve
relatively higher compression ratios, since today's popular data compression
algorithms are mostly LZ77-based and such a fixed-sized output approach can
benefit from the historical dictionary (aka. sliding window).

In detail, original (uncompressed) data is turned into several variable-sized
extents and, at the same time, compressed into physical clusters (pclusters).
In order to record each variable-sized extent, logical clusters (lclusters) are
introduced as the basic unit of compress indexes to indicate whether a new
extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are now
fixed in block size, as illustrated below::

             |<-    variable-sized extent    ->|<-       VLE         ->|
        clusterofs                        clusterofs              clusterofs
             |                                 |                       |
    _________v_________________________________v_______________________v________
    ... |    .         |              |        .     |              |  .   ...
    ____|____._________|______________|________._____|______________|__.________
        |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-|
            (HEAD)        (NONHEAD)        (HEAD)       (NONHEAD)     .
              .            CBLKCNT             .                    .
               .                               .                  .
                .                              .                .
          _______._____________________________.______________._________________
             ... |              |              |              | ...
          _______|______________|______________|______________|_________________
                 |->      big pcluster       <-|-> pcluster <-|

A physical cluster can be seen as a container of physical compressed blocks
which contain compressed data. Previously, only lcluster-sized (4KB) pclusters
were supported. With the big pcluster feature (available since Linux v5.13),
a pcluster can be a multiple of the lcluster size.

For each HEAD lcluster, clusterofs is recorded to indicate where a new extent
starts and blkaddr is used to locate the compressed data. For each NONHEAD
lcluster, delta0 and delta1 are available instead of blkaddr to indicate the
distance to its HEAD lcluster and to the next HEAD lcluster. A PLAIN lcluster
is also a HEAD lcluster except that its data is uncompressed. See the comments
around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.

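The roles of clusterofs and delta0 can be sketched with an in-memory model of
the compress indexes. The ``Lcluster`` class and ``find_extent_head`` helper
are hypothetical and only mirror the fields named above; the real on-disk
encoding is described in erofs_fs.h, and the sketch does not handle CBLKCNT
lclusters.

.. code-block:: python

   from dataclasses import dataclass

   @dataclass
   class Lcluster:
       kind: str            # "HEAD", "PLAIN" or "NONHEAD"
       clusterofs: int = 0  # HEAD/PLAIN: where the new extent starts
       blkaddr: int = 0     # HEAD/PLAIN: where the compressed data lives
       delta0: int = 0      # NONHEAD: distance back to its HEAD lcluster
       delta1: int = 0      # NONHEAD: distance to the next HEAD lcluster

   def find_extent_head(index, lcn, ofs):
       """Return the number of the HEAD lcluster owning byte `ofs` of `lcn`."""
       entry = index[lcn]
       if entry.kind != "NONHEAD":
           if ofs >= entry.clusterofs:
               return lcn
           # Bytes before clusterofs still belong to the previous extent.
           if lcn == 0:
               raise ValueError("no extent before the first lcluster")
           lcn -= 1
           entry = index[lcn]
           if entry.kind != "NONHEAD":
               return lcn
       return lcn - entry.delta0

   index = [
       Lcluster("HEAD", clusterofs=0, blkaddr=10),
       Lcluster("NONHEAD", delta0=1, delta1=1),
       Lcluster("HEAD", clusterofs=1000, blkaddr=11),
       Lcluster("NONHEAD", delta0=1, delta1=1),
   ]
   print(find_extent_head(index, 2, 500))   # 0: before clusterofs of lcluster 2
   print(find_extent_head(index, 2, 2000))  # 2: inside the extent starting there
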
If big pcluster is enabled, the pcluster size in lclusters needs to be recorded
as well. The delta0 field of the first NONHEAD lcluster stores the compressed
block count together with a special flag; such an lcluster is called a CBLKCNT
NONHEAD lcluster. No information is lost this way, since the delta0 of the
first NONHEAD lcluster is always 1, as illustrated below::

     __________________________________________________________
    | HEAD |  NONHEAD  | NONHEAD | ... | NONHEAD | HEAD | HEAD |
    |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_|
       |<----- a big pcluster (with CBLKCNT) ------>|<--  -->|
            a lcluster-sized pcluster (without CBLKCNT) ^

If another HEAD lcluster directly follows a HEAD lcluster, there is no room to
record CBLKCNT, but the size of such a pcluster is known to be 1 lcluster.
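
The CBLKCNT rule can be summarized in a few lines. This sketch models each
lcluster as a hypothetical ``(kind, cblkcnt)`` tuple, where ``cblkcnt`` is
``None`` unless the special flag is set; it is not the on-disk encoding, and
falling back to 1 for an unflagged NONHEAD lcluster is an assumption.

.. code-block:: python

   def pcluster_size_in_lclusters(index, head_lcn):
       """Size of the pcluster starting at HEAD lcluster `head_lcn`."""
       nxt = head_lcn + 1
       if nxt < len(index):
           kind, cblkcnt = index[nxt]
           if kind == "NONHEAD" and cblkcnt is not None:
               return cblkcnt  # big pcluster: count stored in place of delta0
       # HEAD followed by HEAD (or nothing): no room for CBLKCNT, size is 1
       return 1

   index = [
       ("HEAD", None), ("NONHEAD", 4), ("NONHEAD", None), ("NONHEAD", None),
       ("HEAD", None), ("HEAD", None),
   ]
   print(pcluster_size_in_lclusters(index, 0))  # 4
   print(pcluster_size_in_lclusters(index, 4))  # 1
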