.. SPDX-License-Identifier: GPL-2.0

======================================
Enhanced Read-Only File System - EROFS
======================================

Overview
========

EROFS stands for Enhanced Read-Only File System. Unlike other read-only
file systems, it is designed for flexibility and scalability while
remaining simple and high-performance.

It is designed as a better filesystem solution for the following scenarios:

- read-only storage media, or

- part of a fully trusted read-only solution, which means it needs to be
  immutable and bit-for-bit identical to the official golden image for
  each release due to security and other considerations, and

- systems that need to minimize extra storage space with guaranteed
  end-to-end performance by using a compact layout, transparent file
  compression and direct access, especially embedded devices with limited
  memory and high-density hosts with numerous containers.

Here are the main features of EROFS:

- Little endian on-disk design;

- Currently a 4KB block size (nobh) and therefore a maximum 16TB address space;

- Metadata and data can be mixed by design;

- 2 inode versions for different requirements:

  =====================  ============  =====================================
                         compact (v1)  extended (v2)
  =====================  ============  =====================================
  Inode metadata size    32 bytes      64 bytes
  Max file size          4 GB          16 EB (also limited by max. vol size)
  Max uids/gids          65536         4294967296
  File change time       no            yes (64 + 32-bit timestamp)
  Max hardlinks          65536         4294967296
  Metadata reserved      4 bytes       14 bytes
  =====================  ============  =====================================

- Support extended attributes (xattrs) as an option;

- Support xattr inline and tail-end data inline for all files;

- Support POSIX.1e ACLs by using xattrs;

- Support transparent data compression as an option:
  LZ4 algorithm with the fixed-sized output compression for high performance;

- Multiple device support for multi-layer container images.
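
The address-space and per-inode limits listed above follow from field
widths. The following is only a worked sketch of that arithmetic; the 32-bit
block address width is an assumption made here (it is not stated in the
feature list), and all names are illustrative.

```python
# Sketch only: derive the limits quoted in the feature list from field widths.
BLOCK_SIZE = 4 * 1024      # 4KB block size, as stated above
BLKADDR_BITS = 32          # ASSUMPTION: block addresses are 32 bits wide

# Maximum addressable volume size = number of blocks * block size.
max_volume_bytes = (1 << BLKADDR_BITS) * BLOCK_SIZE

# Limits from the inode version table.
compact_max_file_bytes = 1 << 32   # 4 GB
compact_max_ids = 1 << 16          # 65536 uids/gids and hardlinks
extended_max_ids = 1 << 32         # 4294967296 uids/gids and hardlinks

print(max_volume_bytes // 2**40, "TiB addressable")
```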


The following git tree provides the file system user-space tools under
development (e.g., the formatting tool mkfs.erofs):

- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git

Bugs and patches are welcome. Please send them to the linux-erofs
mailing list:

- linux-erofs mailing list <linux-erofs@lists.ozlabs.org>

Mount options
=============

===================    =========================================================
(no)user_xattr         Setup Extended User Attributes. Note: xattr is enabled
                       by default if CONFIG_EROFS_FS_XATTR is selected.
(no)acl                Setup POSIX Access Control List. Note: acl is enabled
                       by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
cache_strategy=%s      Select a strategy for cached decompression from now on:

                       ==========  =============================================
                       disabled    In-place I/O decompression only;
                       readahead   Cache the last incomplete compressed physical
                                   cluster for further reading. It still does
                                   in-place I/O decompression for the remaining
                                   compressed physical clusters;
                       readaround  Cache both ends of incomplete compressed
                                   physical clusters for further reading.
                                   It still does in-place I/O decompression
                                   for the remaining compressed physical
                                   clusters.
                       ==========  =============================================
dax={always,never}     Use direct access (no page cache).  See
                       Documentation/filesystems/dax.rst.
dax                    A legacy option which is an alias for ``dax=always``.
device=%s              Specify a path to an extra device to be used together.
===================    =========================================================
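
As an illustration of how the options in the table combine, here is a
minimal sketch that assembles an option string suitable for ``mount -o``.
The helper name ``build_erofs_options`` and the device path are hypothetical,
and accepting more than one ``device=`` entry is an assumption; only the
option names and values themselves come from the table above.

```python
# Values taken from the mount option table above.
CACHE_STRATEGIES = ("disabled", "readahead", "readaround")
DAX_MODES = ("always", "never")


def build_erofs_options(user_xattr=None, acl=None, cache_strategy=None,
                        dax=None, devices=()):
    """Hypothetical helper: return a comma-separated `mount -o` string."""
    opts = []
    if user_xattr is not None:
        opts.append("user_xattr" if user_xattr else "nouser_xattr")
    if acl is not None:
        opts.append("acl" if acl else "noacl")
    if cache_strategy is not None:
        if cache_strategy not in CACHE_STRATEGIES:
            raise ValueError(f"unknown cache_strategy: {cache_strategy}")
        opts.append(f"cache_strategy={cache_strategy}")
    if dax is not None:
        if dax not in DAX_MODES:
            raise ValueError(f"unknown dax mode: {dax}")
        opts.append(f"dax={dax}")
    # ASSUMPTION: one device= entry per extra device.
    for path in devices:
        opts.append(f"device={path}")
    return ",".join(opts)


# "/dev/loop1" is a made-up path used purely for illustration.
print(build_erofs_options(acl=False, cache_strategy="readaround",
                          devices=["/dev/loop1"]))
```

The resulting string would be passed as ``mount -t erofs -o <options> ...``.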

Sysfs Entries
=============

Information about mounted erofs file systems can be found in /sys/fs/erofs.
Each mounted filesystem will have a directory in /sys/fs/erofs based on its
device name (e.g., /sys/fs/erofs/sda).
(see also Documentation/ABI/testing/sysfs-fs-erofs)
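
A minimal sketch of enumerating those per-device directories follows. The
function name is hypothetical, and skipping a ``features`` directory is an
assumption about non-device entries that may live next to the device
directories; check Documentation/ABI/testing/sysfs-fs-erofs for the
authoritative layout.

```python
import os


def list_erofs_mounts(root="/sys/fs/erofs", skip=("features",)):
    """Hypothetical helper: return device names of mounted erofs filesystems."""
    if not os.path.isdir(root):
        return []  # erofs not loaded, or sysfs not available
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name)) and name not in skip
    )


print(list_erofs_mounts())
```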

On-disk details
===============

Summary
-------

Unlike other read-only file systems, an EROFS volume is designed
to be as simple as possible::

                                |-> aligned with the block size
   ____________________________________________________________
  | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
  |_|__|_|_____|__________|_____|______|__________|_____|______|
  0 +1K

All data areas should be aligned with the block size, but metadata areas
may not be. All metadata can now be observed in two different spaces (views):

1. Inode metadata space

   Each valid inode should be aligned with an inode slot, which is a fixed
   value (32 bytes) and designed to be kept in line with compact inode size.

   Each inode can be directly found with the following formula:
        inode offset = meta_blkaddr * block_size + 32 * nid

   ::

                                   |-> aligned with 8B
                                              |-> followed closely
       + meta_blkaddr blocks                                      |-> another slot
       _____________________________________________________________________
       |  ...   | inode |  xattrs  | extents  | data inline | ... | inode ...
       |________|_______|(optional)|(optional)|__(optional)_|_____|__________
                |-> aligned with the inode slot size
                     .                   .
                   .                         .
                 .                              .
               .                                    .
             .                                         .
           .                                              .
       .____________________________________________________|-> aligned with 4B
       | xattr_ibody_header | shared xattrs | inline xattrs |
       |____________________|_______________|_______________|
       |->    12 bytes    <-|->x * 4 bytes<-|               .
                           .                .                 .
                     .                      .                   .
                .                           .                     .
           ._______________________________.______________________.
           | id | id | id | id |  ... | id | ent | ... | ent| ... |
           |____|____|____|____|______|____|_____|_____|____|_____|
                                           |-> aligned with 4B
                                                       |-> aligned with 4B

   An inode is either 32 or 64 bytes; the two versions can be distinguished
   by i_format, a field common to all inode versions::

        __________________               __________________
       |     i_format     |             |     i_format     |
       |__________________|             |__________________|
       |        ...       |             |        ...       |
       |                  |             |                  |
       |__________________| 32 bytes    |                  |
                                        |                  |
                                        |__________________| 64 bytes

   Xattrs, extents and inline data follow the corresponding inode with
   proper alignment, and they can be optional for different data mappings.
   Currently, a total of 5 data layouts are supported:

   ==  ====================================================================
    0  flat file data without data inline (no extent);
    1  fixed-sized output data compression (with non-compacted indexes);
    2  flat file data with tail packing data inline (no extent);
    3  fixed-sized output data compression (with compacted indexes, v5.3+);
    4  chunk-based file (v5.15+).
   ==  ====================================================================

   The size of the optional xattrs is indicated by i_xattr_count in the inode
   header. Large xattrs or xattrs shared by many different files can be
   stored in the shared xattrs metadata rather than inlined right after the
   inode.

2. Shared xattrs metadata space

   The shared xattrs space is similar to the above inode space: it starts at
   a specific block indicated by xattr_blkaddr, and its entries are organized
   one by one with proper alignment.

   Each shared xattr can also be directly found by the following formula:
        xattr offset = xattr_blkaddr * block_size + 4 * xattr_id

   ::

                              |-> aligned by  4 bytes
       + xattr_blkaddr blocks                     |-> aligned with 4 bytes
       _________________________________________________________________________
       |  ...   | xattr_entry |  xattr data | ... |  xattr_entry | xattr data  ...
       |________|_____________|_____________|_____|______________|_______________
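
The two lookup formulas above can be sketched as follows. The function names
are illustrative, and the 4KB default block size is taken from the feature
list. The two ``i_format`` helpers rest on an assumption that is not stated
in this document: that bit 0 of ``i_format`` selects the inode version and
the next three bits hold the data layout number. Verify this against
erofs_fs.h for your kernel version before relying on it.

```python
BLOCK_SIZE = 4096        # 4KB block size from the feature list
INODE_SLOT_SIZE = 32     # inode slot size in bytes
XATTR_ID_UNIT = 4        # shared xattr ids count 4-byte units


def inode_offset(meta_blkaddr, nid, block_size=BLOCK_SIZE):
    """inode offset = meta_blkaddr * block_size + 32 * nid"""
    return meta_blkaddr * block_size + INODE_SLOT_SIZE * nid


def shared_xattr_offset(xattr_blkaddr, xattr_id, block_size=BLOCK_SIZE):
    """xattr offset = xattr_blkaddr * block_size + 4 * xattr_id"""
    return xattr_blkaddr * block_size + XATTR_ID_UNIT * xattr_id


def inode_size_from_format(i_format):
    # ASSUMPTION: bit 0 clear means compact (32 bytes), set means extended (64).
    return 64 if i_format & 1 else 32


def data_layout_from_format(i_format):
    # ASSUMPTION: bits 1..3 hold the data layout number (0..4 in the table).
    return (i_format >> 1) & 0x7


print(inode_offset(meta_blkaddr=1, nid=36))
print(shared_xattr_offset(xattr_blkaddr=2, xattr_id=10))
```

Because a nid is simply a slot index, no inode table or bitmap lookup is
needed to locate an inode.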

Directories
-----------

All directories are now organized in a compact on-disk format. Note that
each directory block is divided into index and name areas in order to support
random file lookup, and all directory entries are _strictly_ recorded in
alphabetical order in order to support an improved prefix binary search
algorithm (see the related source code for details).

::

                  ___________________________
                 /                           |
                /              ______________|________________
               /              /              | nameoff1       | nameoffN-1
  ____________.______________._______________v________________v__________
 | dirent | dirent | ... | dirent | filename | filename | ... | filename |
 |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
      \                           ^
       \                          |                           * could have
        \                         |                             trailing '\0'
         \________________________| nameoff0
                             Directory block

Note that apart from the offset of the first filename, nameoff0 also indicates
the total number of directory entries in this block, since there is no need to
introduce another on-disk field at all.
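
The following sketch shows how a directory block could be decoded using only
nameoff0 to find the entry count. It assumes a 12-byte little-endian dirent
laid out as a 64-bit nid, a 16-bit nameoff, an 8-bit file type and one
reserved byte; that layout is not given in this document, so check it against
struct erofs_dirent in erofs_fs.h. The lookup is a plain binary search, not
the kernel's improved prefix variant, and all function names are
illustrative.

```python
import struct

# ASSUMED dirent layout: nid (u64), nameoff (u16), file_type (u8), reserved (u8)
DIRENT = struct.Struct("<QHBB")


def parse_dir_block(block):
    """Return a list of (name, nid, file_type) tuples from one directory block."""
    nameoff0 = DIRENT.unpack_from(block, 0)[1]
    count = nameoff0 // DIRENT.size          # nameoff0 doubles as the entry count
    dirents = [DIRENT.unpack_from(block, i * DIRENT.size) for i in range(count)]
    entries = []
    for i, (nid, nameoff, file_type, _reserved) in enumerate(dirents):
        # A name ends where the next one starts; the last runs to the block end.
        end = dirents[i + 1][1] if i + 1 < count else len(block)
        name = block[nameoff:end].split(b"\0", 1)[0]   # drop trailing '\0'
        entries.append((name, nid, file_type))
    return entries


def lookup(entries, name):
    """Plain binary search over the sorted names; returns the entry or None."""
    lo, hi = 0, len(entries) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if entries[mid][0] == name:
            return entries[mid]
        if entries[mid][0] < name:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```

Names are kept as ``bytes`` because on-disk filenames are not guaranteed to
be valid UTF-8.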

Chunk-based file
----------------

In order to support chunk-based data deduplication, a new inode data layout has
been supported since Linux v5.15: files are split into equal-sized data chunks,
with the ``extents`` area of the inode metadata indicating how to get the chunk
data. This area can be either a simple 4-byte block address array or an array
in the 8-byte chunk index form (see struct erofs_inode_chunk_index in
erofs_fs.h for more details).

Note that chunk-based files are all uncompressed for now.
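
For the simpler 4-byte block address array form, mapping a file offset to a
location on the volume could be sketched as below. The function name, the
sample addresses and the treatment of ``0xFFFFFFFF`` as a hole marker are
assumptions made for illustration; the 8-byte chunk index form is not
covered.

```python
BLOCK_SIZE = 4096
NULL_ADDR = 0xFFFFFFFF   # ASSUMPTION: this block address marks a hole


def chunk_lookup(blkaddrs, offset, chunk_size, block_size=BLOCK_SIZE):
    """Map a file offset to a byte address on the volume, or None for a hole.

    blkaddrs is the per-chunk array of starting block addresses.
    """
    chunk_idx, within_chunk = divmod(offset, chunk_size)
    start_blk = blkaddrs[chunk_idx]
    if start_blk == NULL_ADDR:
        return None
    return start_blk * block_size + within_chunk


# Chunks 0 and 1 share block 100: identical data is stored only once.
blkaddrs = [100, 100, NULL_ADDR, 7]
print(chunk_lookup(blkaddrs, 8192 + 5, chunk_size=8192))
```

Deduplication falls out of the indirection: any number of chunks, in the same
file or in different files, may point at the same starting block.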

Data compression
----------------

EROFS implements LZ4 fixed-sized output compression, which generates fixed-sized
compressed data blocks from variable-sized input, in contrast to other existing
fixed-sized input solutions. Relatively higher compression ratios can be achieved
with fixed-sized output compression, since today's popular data compression
algorithms are mostly LZ77-based and such a fixed-sized output approach can
benefit from the historical dictionary (aka. sliding window).

In detail, original (uncompressed) data is turned into several variable-sized
extents and, in the meantime, compressed into physical clusters (pclusters).
In order to record each variable-sized extent, logical clusters (lclusters) are
introduced as the basic unit of compress indexes to indicate whether a new
extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are now
fixed in block size, as illustrated below::

           |<-    variable-sized extent    ->|<-       VLE         ->|
         clusterofs                        clusterofs              clusterofs
           |                                 |                       |
  _________v_________________________________v_______________________v________
  ... |    .         |              |        .     |              |  .   ...
  ____|____._________|______________|________.___ _|______________|__.________
      |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-|
           (HEAD)        (NONHEAD)       (HEAD)        (NONHEAD)    .
            .             CBLKCNT            .                    .
             .                               .                  .
              .                              .                .
        _______._____________________________.______________._________________
           ... |              |              |              | ...
        _______|______________|______________|______________|_________________
               |->      big pcluster       <-|-> pcluster <-|
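
The HEAD/NONHEAD marking in the diagram can be sketched as below: given the
logical offsets at which extents start, each lcluster is HEAD if an extent
begins inside it (with clusterofs being the offset within that lcluster) and
NONHEAD otherwise. This is a simplified model with illustrative names. It
assumes at most one extent starts per lcluster and leaves out what the
on-disk index actually stores for NONHEAD lclusters.

```python
LCLUSTER_SIZE = 4096   # lclusters are fixed in block size


def classify_lclusters(extent_starts, file_size, lcluster_size=LCLUSTER_SIZE):
    """Return one (type, clusterofs) tuple per lcluster of the file."""
    count = -(-file_size // lcluster_size)        # ceiling division
    result = [("NONHEAD", None)] * count
    for start in extent_starts:
        idx, clusterofs = divmod(start, lcluster_size)
        result[idx] = ("HEAD", clusterofs)        # an extent begins in here
    return result


for i, entry in enumerate(classify_lclusters([0, 9000, 20000], 24576)):
    print(i, entry)
```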


A physical cluster can be seen as a container of physical compressed blocks
which contain compressed data. Previously, only lcluster-sized (4KB) pclusters
were supported. Since the big pcluster feature was introduced (available since
Linux v5.13), a pcluster can be a multiple of the lcluster size.
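
To make the idea of fixed-sized output compression concrete, here is a small
sketch that splits input into variable-sized extents such that each extent
compresses into at most one pcluster. It is not how EROFS or mkfs.erofs does
it: zlib from the Python standard library stands in for LZ4, the function
name is illustrative, and the binary search assumes that compressed size
grows roughly with input length, which is not strictly guaranteed (the
result is still valid because every chosen length is checked to fit).

```python
import zlib

PCLUSTER_SIZE = 4096   # one lcluster-sized (4KB) pcluster


def fixed_output_extents(data, pcluster_size=PCLUSTER_SIZE):
    """Split data into (offset, length) extents that each fit one pcluster."""
    extents = []
    pos = 0
    while pos < len(data):
        # Invariant: an extent of length `lo` is known to fit in one pcluster
        # (a single byte always does for any realistic pcluster size).
        lo, hi = 1, len(data) - pos
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if len(zlib.compress(data[pos:pos + mid])) <= pcluster_size:
                lo = mid          # still fits, try a longer extent
            else:
                hi = mid - 1      # too big, shrink
        extents.append((pos, lo))
        pos += lo
    return extents


sample = (b"erofs " * 3000) + bytes(range(256)) * 40
for offset, length in fixed_output_extents(sample):
    print(f"extent at {offset}: {length} bytes of input fit into one pcluster")
```

The output size is fixed while the input length varies with compressibility,
which is the opposite of fixed-sized input schemes that compress a fixed
amount of input into a variable amount of output.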
