.. SPDX-License-Identifier: GPL-2.0

======================================
Enhanced Read-Only File System - EROFS
======================================

Overview
========

EROFS stands for Enhanced Read-Only File System. Unlike other read-only
file systems, it is designed for flexibility and scalability while
remaining simple and delivering high performance.

It is designed as a better filesystem solution for the following scenarios:

 - read-only storage media or

 - part of a fully trusted read-only solution, which must be immutable
   and bit-for-bit identical to the official golden image of a release
   for security and other reasons, and

 - saving extra storage space with guaranteed end-to-end performance
   by using reduced metadata and transparent file compression, especially
   on embedded devices with limited memory (e.g. smartphones).

Here are the main features of EROFS:

 - Little endian on-disk design;

 - Currently 4KB block size (nobh) and therefore a maximum 16TB address space;

 - Metadata and data can be mixed by design;

 - 2 inode versions for different requirements:

   =====================  ============  =====================================
                          compact (v1)  extended (v2)
   =====================  ============  =====================================
   Inode metadata size    32 bytes      64 bytes
   Max file size          4 GB          16 EB (also limited by max. vol size)
   Max uids/gids          65536         4294967296
   File change time       no            yes (64 + 32-bit timestamp)
   Max hardlinks          65536         4294967296
   Metadata reserved      4 bytes       14 bytes
   =====================  ============  =====================================

 - Support extended attributes (xattrs) as an option;

 - Support xattr inline and tail-end data inline for all files;

 - Support POSIX.1e ACLs by using xattrs;

 - Support transparent file compression as an option:
   LZ4 algorithm with 4 KB fixed-sized output compression for high performance.

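The size limits in the feature list follow from field widths. As a minimal
sketch, the arithmetic can be checked in C. The helper names are illustrative
and not part of EROFS, and the 32-bit block address width is an assumption
inferred from 4KB x 2^32 = 16TB rather than something stated above.

```c
#include <stdint.h>

/*
 * Illustrative helper (not an EROFS API): the largest volume addressable
 * with `addr_bits`-wide block addresses and `block_size`-byte blocks.
 */
static uint64_t max_volume_bytes(unsigned int addr_bits, uint64_t block_size)
{
    return ((uint64_t)1 << addr_bits) * block_size;
}

/* Number of distinct values in a `bits`-wide field (bits must be < 64). */
static uint64_t field_range(unsigned int bits)
{
    return (uint64_t)1 << bits;
}
```

With these helpers, ``max_volume_bytes(32, 4096)`` gives 2^44 bytes (16TB),
``field_range(16)`` gives the 65536 limit of compact inodes, and
``field_range(32)`` gives 4294967296, which matches both the extended inode
limits and the 4 GB maximum file size of compact inodes.
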
The following git tree provides the file system user-space tools under
development (e.g. the formatting tool mkfs.erofs):

- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git

Bug reports and patches are welcome; please send them to the following
linux-erofs mailing list:

- linux-erofs mailing list <linux-erofs@lists.ozlabs.org>

Mount options
=============

===================    =========================================================
(no)user_xattr         Set up Extended User Attributes. Note: xattr is enabled
                       by default if CONFIG_EROFS_FS_XATTR is selected.
(no)acl                Set up POSIX Access Control List. Note: acl is enabled
                       by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
cache_strategy=%s      Select a strategy for cached decompression from now on:

                       ==========  =============================================
                       disabled    In-place I/O decompression only;
                       readahead   Cache the last incomplete compressed physical
                                   cluster for further reading. It still does
                                   in-place I/O decompression for the remaining
                                   compressed physical clusters;
                       readaround  Cache both ends of incomplete compressed
                                   physical clusters for further reading.
                                   It still does in-place I/O decompression
                                   for the remaining compressed physical
                                   clusters.
                       ==========  =============================================
===================    =========================================================

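As a rough illustration of how the options above combine, the following C
sketch builds a comma-separated option string of the kind passed as the
``data`` argument of mount(2). Both helper names are hypothetical; only the
option names and the three strategy values come from the table above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: 1 if `s` is a cache strategy listed in the table. */
static int erofs_valid_cache_strategy(const char *s)
{
    return !strcmp(s, "disabled") || !strcmp(s, "readahead") ||
           !strcmp(s, "readaround");
}

/*
 * Hypothetical helper: write "[no]user_xattr,[no]acl,cache_strategy=..."
 * into `buf`. Returns 0 on success, -1 if the strategy is unknown or the
 * buffer is too small.
 */
static int erofs_build_opts(char *buf, size_t len, int user_xattr, int acl,
                            const char *strategy)
{
    int n;

    if (!erofs_valid_cache_strategy(strategy))
        return -1;
    n = snprintf(buf, len, "%suser_xattr,%sacl,cache_strategy=%s",
                 user_xattr ? "" : "no", acl ? "" : "no", strategy);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string, for instance
``user_xattr,noacl,cache_strategy=readaround``, could then be handed to
``mount(source, target, "erofs", MS_RDONLY, opts)``.
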
On-disk details
===============

Summary
-------
Unlike other read-only file systems, an EROFS volume is designed
to be as simple as possible::

                                  |-> aligned with the block size
   ____________________________________________________________
  | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
  |_|__|_|_____|__________|_____|______|__________|_____|______|
  0 +1K

All data areas should be aligned with the block size, but metadata areas
need not be. All metadata can currently be observed in two different spaces
(views):

 1. Inode metadata space

    Each valid inode should be aligned with an inode slot, which is a fixed
    value (32 bytes) and designed to be kept in line with compact inode size.

    Each inode can be directly found with the following formula:
         inode offset = meta_blkaddr * block_size + 32 * nid

    ::

                                            |-> aligned with 8B
                                                       |-> followed closely
        + meta_blkaddr blocks                                      |-> another slot
        _____________________________________________________________________
        |  ...   | inode |  xattrs  | extents  | data inline | ... | inode ...
        |________|_______|(optional)|(optional)|__(optional)_|_____|__________
                 |-> aligned with the inode slot size
                      .                 .
                   .                        .
                .                               .
              .                                     .
           .                                            .
         .                                                 .
        .____________________________________________________|-> aligned with 4B
        | xattr_ibody_header | shared xattrs | inline xattrs |
        |____________________|_______________|_______________|
        |->    12 bytes    <-|->x * 4 bytes<-|               .
                         .                   .                 .
                    .                        .                  .
                .                           .                     .
            ._______________________________.______________________.
            | id | id | id | id |  ... | id | ent | ... | ent| ... |
            |____|____|____|____|______|____|_____|_____|____|_____|
                                            |-> aligned with 4B
                                                        |-> aligned with 4B

    An inode can be 32 or 64 bytes, which can be distinguished from a common
    field which all inode versions have -- i_format::

        __________________               __________________
       |     i_format     |             |     i_format     |
       |__________________|             |__________________|
       |        ...       |             |        ...       |
       |                  |             |                  |
       |__________________| 32 bytes    |                  |
                                        |                  |
                                        |__________________| 64 bytes

    Xattrs, extents, and inline data follow the corresponding inode with
    proper alignment, and they are optional depending on the data mapping.
    Currently, a total of 4 valid data mappings are supported:

    ==  ====================================================================
     0  flat file data without data inline (no extent);
     1  fixed-sized output data compression (with non-compacted indexes);
     2  flat file data with tail packing data inline (no extent);
     3  fixed-sized output data compression (with compacted indexes, v5.3+).
    ==  ====================================================================

    The size of the optional xattrs is indicated by i_xattr_count in the inode
    header. Large xattrs or xattrs shared by many different files can be
    stored in shared xattrs metadata rather than inlined right after the inode.

 2. Shared xattrs metadata space

    The shared xattrs space is similar to the above inode space. It starts at
    a specific block indicated by xattr_blkaddr, and entries are laid out one
    after another with proper alignment.

    Each shared xattr can also be directly found by the following formula:
         xattr offset = xattr_blkaddr * block_size + 4 * xattr_id

    ::

                 |-> aligned with 4 bytes
        + xattr_blkaddr blocks                     |-> aligned with 4 bytes
        _________________________________________________________________________
        |  ...   | xattr_entry |  xattr data | ... |  xattr_entry | xattr data  ...
        |________|_____________|_____________|_____|______________|_______________

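The two lookup formulas above can be written down directly. The following is
a minimal C sketch; the function and macro names are illustrative and are not
taken from the kernel sources, and the values passed in are assumed to have
been read from the superblock already.

```c
#include <stdint.h>

#define INODE_SLOT_SIZE   32  /* inode slot size given above */
#define SHARED_XATTR_UNIT  4  /* shared xattr alignment given above */

/* Byte offset of inode `nid`, per the inode metadata space formula. */
static uint64_t inode_offset(uint32_t meta_blkaddr, uint32_t block_size,
                             uint64_t nid)
{
    return (uint64_t)meta_blkaddr * block_size + INODE_SLOT_SIZE * nid;
}

/* Byte offset of shared xattr `xattr_id`, per the shared xattrs formula. */
static uint64_t shared_xattr_offset(uint32_t xattr_blkaddr,
                                    uint32_t block_size, uint32_t xattr_id)
{
    return (uint64_t)xattr_blkaddr * block_size +
           (uint64_t)SHARED_XATTR_UNIT * xattr_id;
}
```

For example, with ``meta_blkaddr = 1`` and 4KB blocks, nid 36 maps to byte
offset 4096 + 32 * 36 = 5248. Since slots are 32 bytes, a 64-byte extended
inode occupies two consecutive slots.
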
Directories
-----------
All directories are now organized in a compact on-disk format. Note that
each directory block is divided into index and name areas in order to support
random file lookup, and all directory entries are _strictly_ recorded in
alphabetical order in order to support an improved prefix binary search
algorithm (see the related source code).

::

                   ___________________________
                  /                           |
                 /              ______________|________________
                /              /              | nameoff1       | nameoffN-1
   ____________.______________._______________v________________v__________
  | dirent | dirent | ... | dirent | filename | filename | ... | filename |
  |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
       \                           ^
        \                          |                           * could have
         \                         |                             trailing '\0'
          \________________________| nameoff0

                               Directory block

Note that apart from the offset of the first filename, nameoff0 also indicates
the total number of directory entries in this block, so there is no need to
introduce another on-disk field for it.

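The relationship between nameoff0 and the entry count can be sketched in C as
follows. This is only an illustration under stated assumptions: the 12-byte
dirent size and the position of the little-endian 16-bit nameoff field at
byte 8 are taken from "struct erofs_dirent" in erofs_fs.h, not from the text
above; all function names are hypothetical; the lookup is a plain binary
search rather than the improved prefix binary search used by the kernel; and
no validation of untrusted on-disk values is attempted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRENT_SIZE 12 /* assumed sizeof(struct erofs_dirent) */
#define NAMEOFF_AT   8 /* assumed offset of the le16 nameoff field */

/* Read a little-endian 16-bit value regardless of host endianness. */
static unsigned int get_le16(const uint8_t *p)
{
    return p[0] | (p[1] << 8);
}

/* nameoff of entry `i` within the directory block. */
static unsigned int dirent_nameoff(const uint8_t *blk, unsigned int i)
{
    return get_le16(blk + i * DIRENT_SIZE + NAMEOFF_AT);
}

/* The dirent array ends where filename 0 begins, so nameoff0 gives N. */
static unsigned int dirent_count(const uint8_t *blk)
{
    return dirent_nameoff(blk, 0) / DIRENT_SIZE;
}

/* Return filename `i` and store its length, ignoring trailing '\0'. */
static const char *dirent_name(const uint8_t *blk, size_t blksz,
                               unsigned int i, size_t *len)
{
    unsigned int n = dirent_count(blk);
    unsigned int start = dirent_nameoff(blk, i);
    size_t end = (i + 1 < n) ? dirent_nameoff(blk, i + 1) : blksz;
    const char *name = (const char *)blk + start;
    const char *nul = memchr(name, '\0', end - start);

    *len = nul ? (size_t)(nul - name) : end - start;
    return name;
}

/* Plain binary search over the sorted names; returns the index or -1. */
static int dirent_find(const uint8_t *blk, size_t blksz, const char *want)
{
    size_t wlen = strlen(want);
    int lo = 0, hi = (int)dirent_count(blk) - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        size_t len;
        const char *name = dirent_name(blk, blksz, mid, &len);
        int cmp = memcmp(want, name, len < wlen ? len : wlen);

        if (!cmp)
            cmp = (wlen > len) - (wlen < len);
        if (!cmp)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}
```

Because entries are strictly sorted, each lookup within a block needs only
O(log N) name comparisons.
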
Compression
-----------
Currently, EROFS supports 4KB fixed-sized output transparent file compression,
as illustrated below::

             |---- Variant-Length Extent ----|-------- VLE --------|----- VLE -----
             clusterofs                      clusterofs            clusterofs
             |                               |                     |   logical data
    _________v_______________________________v_____________________v_______________
    ... |    .        |             |        .    |             |  .          | ...
    ____|____.________|_____________|________.____|_____________|__.__________|____
        |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|
             size          size          size          size          size
             .                           .                  .               .
             .                      .                 .              .
             .                  .              .              .
      _______._____________._____________._____________._____________________
         ... |             |             |             | ... physical data
      _______|_____________|_____________|_____________|_____________________
             |-> cluster <-|-> cluster <-|-> cluster <-|
                  size          size          size

Currently each on-disk physical cluster can contain at most 4KB of
(un)compressed data. For each logical cluster, there is a corresponding
on-disk index that describes its cluster type, physical cluster address, etc.

See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
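
Since there is one index per logical cluster, finding the index that covers a
given file position is simple arithmetic. The C sketch below assumes that
logical clusters are also 4KB, as drawn in the diagram above; the names are
hypothetical, and the sketch does not parse the on-disk index structure
itself.

```c
#include <stdint.h>

#define CLUSTER_BITS 12                  /* assumed 4KB logical clusters */
#define CLUSTER_SIZE (1u << CLUSTER_BITS)

/* Number of the logical cluster (and thus of the index) covering `pos`. */
static uint64_t logical_cluster(uint64_t pos)
{
    return pos >> CLUSTER_BITS;
}

/* Offset of `pos` inside its logical cluster. */
static uint32_t offset_in_cluster(uint64_t pos)
{
    return (uint32_t)(pos & (CLUSTER_SIZE - 1));
}
```

For instance, file position 10000 falls into logical cluster 2 at offset 1808.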