.. SPDX-License-Identifier: GPL-2.0

==========================================
WHAT IS Flash-Friendly File System (F2FS)?
==========================================

NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, are
used in a variety of systems ranging from mobile devices to servers. Since
they are known to have different characteristics from conventional rotating
disks, a file system, as an upper layer to the storage device, should be
designed from scratch to adapt to those differences.

F2FS is a file system exploiting NAND flash memory-based storage devices, which
is based on Log-structured File System (LFS). The design has been focused on
addressing the fundamental issues in LFS, which are the snowball effect of the
wandering tree and high cleaning overhead.

Since a NAND flash memory-based storage device shows different characteristics
according to its internal geometry or flash memory management scheme, namely FTL,
F2FS and its tools support various parameters not only for configuring on-disk
layout, but also for selecting allocation and cleaning algorithms.

The following git tree provides the file system formatting tool (mkfs.f2fs),
a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs).

- git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git

For reporting bugs and sending patches, please use the following mailing list:

- linux-f2fs-devel@lists.sourceforge.net

Background and Design issues
============================

Log-structured File System (LFS)
--------------------------------
"A log-structured file system writes all modifications to disk sequentially in
a log-like structure, thereby speeding up both file writing and crash recovery.
The log is the only structure on disk; it contains indexing information so that
files can be read back from the log efficiently. In order to maintain large free
areas on disk for fast writing, we divide the log into segments and use a
segment cleaner to compress the live information from heavily fragmented
segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and
implementation of a log-structured file system", ACM Trans. Computer Systems
10, 1, 26–52.

Wandering Tree Problem
----------------------
In LFS, when file data is updated and written to the end of the log, its direct
pointer block is updated due to the changed location. Then the indirect pointer
block is also updated due to the direct pointer block update. In this manner,
the upper index structures such as inode, inode map, and checkpoint block are
also updated recursively. This problem is called the wandering tree problem [1],
and in order to enhance performance, a file system should eliminate or relax the
update propagation as much as possible.

[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/

Cleaning Overhead
-----------------
Since LFS is based on out-of-place writes, it produces many obsolete blocks
scattered across the whole storage. In order to provide new empty log space, it
needs to reclaim these obsolete blocks transparently to users. This job is
called the cleaning process.

The process consists of four operations as follows.

1. A victim segment is selected by referencing the segment usage table.
2. It loads parent index structures of all the data in the victim identified by
   segment summary blocks.
3. It checks the cross-reference between the data and its parent index structure.
4. It moves valid data selectively.

This cleaning job may cause unexpectedly long delays, so the most important goal
is to hide the latencies from users. It should also reduce the amount of valid
data to be moved, and move that data quickly.

Key Features
============

Flash Awareness
---------------
- Enlarge the random write area for better performance, but provide high
  spatial locality
- Align FS data structures to the operational units in FTL on a best-effort basis

Wandering Tree Problem
----------------------
- Use a term, “node”, that represents inodes as well as various pointer blocks
- Introduce Node Address Table (NAT) containing the locations of all the “node”
  blocks; this will cut off the update propagation.

Cleaning Overhead
-----------------
- Support a background cleaning process
- Support greedy and cost-benefit algorithms for victim selection policies
- Support multi-head logs for static/dynamic hot and cold data separation
- Introduce adaptive logging for efficient block allocation
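
As an illustration of the two victim selection policies named above, the
following is a minimal Python sketch, not the f2fs implementation. The
cost-benefit score uses the formula from the Rosenblum and Ousterhout paper
quoted earlier, (1 - u) * age / (1 + u). The ``Segment`` fields, the unit of
``age``, and the figure of 512 blocks per segment (a 2MB segment of 4KB blocks)
are assumptions made for the example.

```python
from dataclasses import dataclass

BLOCKS_PER_SEGMENT = 512  # assumption: 2MB segment / 4KB block


@dataclass
class Segment:
    segno: int
    valid_blocks: int  # live blocks that cleaning would have to move
    age: float         # hypothetical: time since the segment was last modified


def greedy_victim(segments):
    """Greedy policy: pick the segment with the fewest valid blocks."""
    return min(segments, key=lambda s: s.valid_blocks)


def cost_benefit_victim(segments):
    """Cost-benefit policy: maximize (1 - u) * age / (1 + u), u = utilization."""
    def score(s):
        u = s.valid_blocks / BLOCKS_PER_SEGMENT
        return (1.0 - u) * s.age / (1.0 + u)
    return max(segments, key=score)


segments = [
    Segment(segno=0, valid_blocks=100, age=1.0),   # nearly empty but very young
    Segment(segno=1, valid_blocks=200, age=50.0),  # fairly empty and old
    Segment(segno=2, valid_blocks=500, age=90.0),  # almost full, oldest
]
print("greedy picks segment", greedy_victim(segments).segno)
print("cost-benefit picks segment", cost_benefit_victim(segments).segno)
```

In this made-up data set the greedy policy prefers the emptiest segment, while
the cost-benefit policy prefers an older segment whose data is less likely to
change again soon.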

Mount Options
=============


======================== ============================================================
background_gc=%s         Turn on/off cleaning operations, namely garbage
                         collection, triggered in background when I/O subsystem is
                         idle. If background_gc=on, it will turn on the garbage
                         collection and if background_gc=off, garbage collection
                         will be turned off. If background_gc=sync, it will turn
                         on synchronous garbage collection running in background.
                         Default value for this option is on. So garbage
                         collection is on by default.
disable_roll_forward     Disable the roll-forward recovery routine
norecovery               Disable the roll-forward recovery routine, mounted
                         read-only (i.e., -o ro,disable_roll_forward)
discard/nodiscard        Enable/disable real-time discard in f2fs. If discard is
                         enabled, f2fs will issue discard/TRIM commands when a
                         segment is cleaned.
no_heap                  Disable heap-style segment allocation which finds free
                         segments for data from the beginning of main area, while
                         for node from the end of main area.
nouser_xattr             Disable Extended User Attributes. Note: xattr is enabled
                         by default if CONFIG_F2FS_FS_XATTR is selected.
noacl                    Disable POSIX Access Control List. Note: acl is enabled
                         by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
active_logs=%u           Support configuring the number of active logs. In the
                         current design, f2fs supports only 2, 4, and 6 logs.
                         Default number is 6.
disable_ext_identify     Disable the extension list configured by mkfs, so f2fs
                         is not aware of cold files such as media files.
inline_xattr             Enable the inline xattrs feature.
noinline_xattr           Disable the inline xattrs feature.
inline_xattr_size=%u     Support configuring inline xattr size. It depends on
                         the flexible inline xattr feature.
inline_data              Enable the inline data feature: Newly created small (<~3.4k)
                         files can be written into inode block.
inline_dentry            Enable the inline dir feature: data in newly created
                         directory entries can be written into inode block. The
                         space of inode block which is used to store inline
                         dentries is limited to ~3.4k.
noinline_dentry          Disable the inline dentry feature.
flush_merge              Merge concurrent cache_flush commands as much as possible
                         to eliminate redundant command issues. If the underlying
                         device handles the cache_flush command relatively slowly,
                         enabling this option is recommended.
nobarrier                This option can be used if underlying storage guarantees
                         its cached data should be written to the nonvolatile area.
                         If this option is set, no cache_flush commands are issued
                         but f2fs still guarantees the write ordering of all the
                         data writes.
fastboot                 This option is used when a system wants to reduce mount
                         time as much as possible, even though normal performance
                         can be sacrificed.
extent_cache             Enable an extent cache based on rb-tree. It can cache
                         as many extents as possible, each mapping a contiguous
                         logical address range to physical addresses per inode,
                         which increases the cache hit ratio. Set by default.
noextent_cache           Disable an extent cache based on rb-tree explicitly, see
                         the above extent_cache mount option.
noinline_data            Disable the inline data feature. The inline data feature
                         is enabled by default.
data_flush               Enable data flushing before checkpoint in order to
                         persist data of regular files and symlinks.
reserve_root=%d          Support configuring reserved space which is used for
                         allocation from a privileged user with specified uid or
                         gid, unit: 4KB, the default limit is 0.2% of user blocks.
resuid=%d                The user ID which may use the reserved blocks.
resgid=%d                The group ID which may use the reserved blocks.
fault_injection=%d       Enable fault injection in all supported types with
                         specified injection rate.
fault_type=%d            Support configuring fault injection type, should be
                         enabled with fault_injection option, fault type value
                         is shown below, it supports single or combined type.

                         ===================      ===========
                         Type_Name                Type_Value
                         ===================      ===========
                         FAULT_KMALLOC            0x000000001
                         FAULT_KVMALLOC           0x000000002
                         FAULT_PAGE_ALLOC         0x000000004
                         FAULT_PAGE_GET           0x000000008
                         FAULT_ALLOC_BIO          0x000000010
                         FAULT_ALLOC_NID          0x000000020
                         FAULT_ORPHAN             0x000000040
                         FAULT_BLOCK              0x000000080
                         FAULT_DIR_DEPTH          0x000000100
                         FAULT_EVICT_INODE        0x000000200
                         FAULT_TRUNCATE           0x000000400
                         FAULT_READ_IO            0x000000800
                         FAULT_CHECKPOINT         0x000001000
                         FAULT_DISCARD            0x000002000
                         FAULT_WRITE_IO           0x000004000
                         ===================      ===========
mode=%s                  Control block allocation mode which supports "adaptive"
                         and "lfs". In "lfs" mode, there should be no random
                         writes towards main area.
io_bits=%u               Set the bit size of write IO requests. It should be set
                         with "mode=lfs".
usrquota                 Enable plain user disk quota accounting.
grpquota                 Enable plain group disk quota accounting.
prjquota                 Enable plain project quota accounting.
usrjquota=<file>         Appoint specified file and type during mount, so that quota
grpjquota=<file>         information can be properly updated during recovery flow,
prjjquota=<file>         <quota file>: must be in root directory;
jqfmt=<quota type>       <quota type>: [vfsold,vfsv0,vfsv1].
offusrjquota             Turn off user journalled quota.
offgrpjquota             Turn off group journalled quota.
offprjjquota             Turn off project journalled quota.
quota                    Enable plain user disk quota accounting.
noquota                  Disable all plain disk quota option.
whint_mode=%s            Control which write hints are passed down to block
                         layer. This supports "off", "user-based", and
                         "fs-based". In "off" mode (default), f2fs does not pass
                         down hints. In "user-based" mode, f2fs tries to pass
                         down hints given by users. And in "fs-based" mode, f2fs
                         passes down hints with its policy.
alloc_mode=%s            Adjust block allocation policy, which supports "reuse"
                         and "default".
fsync_mode=%s            Control the policy of fsync. Currently supports "posix",
                         "strict", and "nobarrier". In "posix" mode, which is
                         default, fsync will follow POSIX semantics and does a
                         light operation to improve the filesystem performance.
                         In "strict" mode, fsync will be heavy and behaves in line
                         with xfs, ext4 and btrfs, where xfstest generic/342 will
                         pass, but the performance will regress. "nobarrier" is
                         based on "posix", but doesn't issue flush command for
                         non-atomic files, like the "nobarrier" mount option.
test_dummy_encryption
test_dummy_encryption=%s
                         Enable dummy encryption, which provides a fake fscrypt
                         context. The fake fscrypt context is used by xfstests.
                         The argument may be either "v1" or "v2", in order to
                         select the corresponding fscrypt policy version.
checkpoint=%s[:%u[%]]    Set to "disable" to turn off checkpointing. Set to "enable"
                         to reenable checkpointing. Checkpointing is enabled by
                         default. While disabled, any unmounting or unexpected
                         shutdowns will cause the filesystem contents to appear as
                         they did when the filesystem was mounted with that option.
                         While mounting with checkpoint=disable, the filesystem must
                         run garbage collection to ensure that all available space can
                         be used. If this takes too much time, the mount may return
                         EAGAIN. You may optionally add a value to indicate how much
                         of the disk you would be willing to temporarily give up to
                         avoid additional garbage collection. This can be given as a
                         number of blocks, or as a percent. For instance, mounting
                         with checkpoint=disable:100% would always succeed, but it may
                         hide up to all remaining free space. The actual space that
                         would be unusable can be viewed at /sys/fs/f2fs/<disk>/unusable.
                         This space is reclaimed once checkpoint=enable.
compress_algorithm=%s    Control compress algorithm, currently f2fs supports "lzo",
                         "lz4", "zstd" and "lzo-rle" algorithm.
compress_log_size=%u     Support configuring compress cluster size, the size will
                         be 4KB * (1 << %u). 16KB is the minimum size, and it is
                         also the default size.
compress_extension=%s    Support adding specified extension, so that f2fs can enable
                         compression on those corresponding files, e.g. if all files
                         with '.ext' have a high compression rate, we can set '.ext'
                         on the compression extension list and enable compression on
                         these files by default rather than enabling it via ioctl.
                         For other files, we can still enable compression via ioctl.
                         Note that there is one reserved special extension '*', it
                         can be set to enable compression for all files.
inlinecrypt              When possible, encrypt/decrypt the contents of encrypted
                         files using the blk-crypto framework rather than
                         filesystem-layer encryption. This allows the use of
                         inline encryption hardware. The on-disk format is
                         unaffected. For more details, see
                         Documentation/block/inline-encryption.rst.
atgc                     Enable age-threshold garbage collection, it provides high
                         effectiveness and efficiency on background GC.
======================== ============================================================
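
Since fault_type accepts a single or combined type, the value passed at mount
time is the bitwise OR of the Type_Value entries listed in the table. The
following Python sketch shows one way to compute it. The bit values are copied
from the table; the helper name ``fault_mount_options`` and the injection rate
of 100 are hypothetical and only serve the example.

```python
from enum import IntFlag


class FaultType(IntFlag):
    """Fault type bits, copied from the fault_type table."""
    FAULT_KMALLOC = 0x000000001
    FAULT_KVMALLOC = 0x000000002
    FAULT_PAGE_ALLOC = 0x000000004
    FAULT_PAGE_GET = 0x000000008
    FAULT_ALLOC_BIO = 0x000000010
    FAULT_ALLOC_NID = 0x000000020
    FAULT_ORPHAN = 0x000000040
    FAULT_BLOCK = 0x000000080
    FAULT_DIR_DEPTH = 0x000000100
    FAULT_EVICT_INODE = 0x000000200
    FAULT_TRUNCATE = 0x000000400
    FAULT_READ_IO = 0x000000800
    FAULT_CHECKPOINT = 0x000001000
    FAULT_DISCARD = 0x000002000
    FAULT_WRITE_IO = 0x000004000


def fault_mount_options(rate: int, types: FaultType) -> str:
    """Build a mount option string; both options take decimal integers (%d)."""
    return f"fault_injection={rate},fault_type={int(types)}"


# Combine three fault types into one mask (0x1 | 0x4 | 0x1000).
mask = (FaultType.FAULT_KMALLOC
        | FaultType.FAULT_PAGE_ALLOC
        | FaultType.FAULT_CHECKPOINT)
print(fault_mount_options(100, mask))
```

The resulting string could then be passed to ``mount -o``, on a kernel built
with fault injection support.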

Debugfs Entries
===============

/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as
f2fs. Each file shows the whole f2fs information.

/sys/kernel/debug/f2fs/status includes:

- major file system information managed by f2fs currently
- average SIT information about whole segments
- current memory footprint consumed by f2fs.

Sysfs Entries
=============

Information about mounted f2fs file systems can be found in
/sys/fs/f2fs. Each mounted filesystem will have a directory in
/sys/fs/f2fs based on its device name (e.g., /sys/fs/f2fs/sda).
The files in each per-device directory are shown in the table below.

Files in /sys/fs/f2fs/<devname>
(see also Documentation/ABI/testing/sysfs-fs-f2fs)

Usage
=====

1. Download userland tools and compile them.

2. Skip, if f2fs was compiled statically inside kernel.
   Otherwise, insert the f2fs.ko module::

        # insmod f2fs.ko

3. Create a directory to use when mounting::

        # mkdir /mnt/f2fs

4. Format the block device, and then mount as f2fs::

        # mkfs.f2fs -l label /dev/block_device
        # mount -t f2fs /dev/block_device /mnt/f2fs

mkfs.f2fs
---------
The mkfs.f2fs tool formats a partition as an f2fs filesystem, building a basic
on-disk layout.

The quick options consist of:

===============    ===========================================================
``-l [label]``     Give a volume label, up to 512 Unicode characters.
``-a [0 or 1]``    Split start location of each area for heap-based allocation.

                   1 is set by default, which performs this.
``-o [int]``       Set overprovision ratio in percent over volume size.

                   5 is set by default.
``-s [int]``       Set the number of segments per section.

                   1 is set by default.
``-z [int]``       Set the number of sections per zone.

                   1 is set by default.
``-e [str]``       Set basic extension list. e.g. "mp3,gif,mov"
``-t [0 or 1]``    Disable discard command or not.

                   1 is set by default, which conducts discard.
===============    ===========================================================

Note: please refer to the manpage of mkfs.f2fs(8) for the full option list.

fsck.f2fs
---------
The fsck.f2fs is a tool to check the consistency of an f2fs-formatted
partition, which examines whether the filesystem metadata and user-made data
are cross-referenced correctly or not.
Note that the initial version of the tool does not fix any inconsistency.

The quick options consist of::

  -d debug level [default:0]

Note: please refer to the manpage of fsck.f2fs(8) for the full option list.

dump.f2fs
---------
The dump.f2fs shows the information of a specific inode and dumps SSA and SIT
to files named dump_ssa and dump_sit, respectively.

The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem.
It shows on-disk inode information recognized by a given inode number, and is
able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and
./dump_sit respectively.

The options consist of::

  -d debug level [default:0]
  -i inode no (hex)
  -s [SIT dump segno from #1~#2 (decimal), for all 0~-1]
  -a [SSA dump segno from #1~#2 (decimal), for all 0~-1]

Examples::

    # dump.f2fs -i [ino] /dev/sdx
    # dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
    # dump.f2fs -a 0~-1 /dev/sdx (SSA dump)

Note: please refer to the manpage of dump.f2fs(8) for the full option list.

sload.f2fs
----------
The sload.f2fs gives a way to insert files and directories into an existing disk
image. This tool is useful when building f2fs images given compiled files.

Note: please refer to the manpage of sload.f2fs(8) for the full option list.

resize.f2fs
-----------
The resize.f2fs lets a user resize the f2fs-formatted disk image, while preserving
all the files and directories stored in the image.

Note: please refer to the manpage of resize.f2fs(8) for the full option list.

defrag.f2fs
-----------
The defrag.f2fs can be used to defragment scattered written data as well as
filesystem metadata across the disk. This can improve the write speed by giving
more free consecutive space.

Note: please refer to the manpage of defrag.f2fs(8) for the full option list.

f2fs_io
-------
The f2fs_io is a simple tool to issue various filesystem APIs as well as
f2fs-specific ones, which is very useful for QA tests.

Note: please refer to the manpage of f2fs_io(8) for the full option list.

Design
======

On-disk Layout
--------------

F2FS divides the whole volume into a number of segments, each of which is fixed
to 2MB in size. A section is composed of consecutive segments, and a zone
consists of a set of sections. By default, section and zone sizes are both set
to the size of one segment, but users can easily modify the sizes with mkfs.
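
The relationship between the three units can be written down as simple
arithmetic. The sketch below is illustrative only: the function name is
hypothetical, and its two parameters are meant to correspond to the segments
per section and sections per zone settings of mkfs.f2fs, both defaulting to 1.

```python
SEGMENT_BYTES = 2 * 1024 * 1024  # each segment is fixed to 2MB


def layout_unit_sizes(segs_per_sec: int = 1, secs_per_zone: int = 1):
    """Return (segment, section, zone) sizes in bytes."""
    if segs_per_sec < 1 or secs_per_zone < 1:
        raise ValueError("both ratios must be at least 1")
    section = segs_per_sec * SEGMENT_BYTES   # a section is consecutive segments
    zone = secs_per_zone * section           # a zone is a set of sections
    return SEGMENT_BYTES, section, zone


# Defaults: section and zone are both one segment in size.
print(layout_unit_sizes())
# Four segments per section, two sections per zone.
print(layout_unit_sizes(segs_per_sec=4, secs_per_zone=2))
```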

F2FS splits the entire volume into six areas, and all the areas except superblock
consist of multiple segments as described below::

                                             align with the zone size <-|
                 |-> align with the segment size
     _________________________________________________________________________
    |            |            |   Segment   |    Node     |   Segment  |      |
    | Superblock | Checkpoint |    Info.    |   Address   |   Summary  | Main |
    |    (SB)    |   (CP)     | Table (SIT) | Table (NAT) | Area (SSA) |      |
    |____________|_____2______|______N______|______N______|______N_____|__N___|
                                                                       .      .
                                                           .                  .
                                               .                              .
                                    ._________________________________________.
                                    |_Segment_|_..._|_Segment_|_..._|_Segment_|
                                    .           .
                                    ._________._________
                                    |_section_|__...__|_
                                    .            .
                                    .________.
                                    |__zone__|

- Superblock (SB)
   It is located at the beginning of the partition, and there exist two copies
   to avoid file system crash. It contains basic partition information and some
   default parameters of f2fs.

- Checkpoint (CP)
   It contains file system information, bitmaps for valid NAT/SIT sets, orphan
   inode lists, and summary entries of current active segments.

- Segment Information Table (SIT)
   It contains segment information such as valid block count and bitmap for the
   validity of all the blocks.

- Node Address Table (NAT)
   It is composed of a block address table for all the node blocks stored in
   Main area.

- Segment Summary Area (SSA)
   It contains summary entries which contain the owner information of all the
   data and node blocks stored in Main area.

- Main Area
   It contains file and directory data including their indices.

In order to avoid misalignment between file system and flash-based storage, F2FS
aligns the start block address of CP with the segment size. Also, it aligns the
start block address of Main area with the zone size by reserving some segments
in SSA area.
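
The alignment described above amounts to rounding a start address up to the
next multiple of the unit size. The following Python sketch illustrates the
idea only; it is not how mkfs.f2fs computes the layout. The function names are
hypothetical, addresses are counted in 4KB blocks, and a 2MB segment is assumed
to hold 512 such blocks.

```python
BLOCK_SIZE = 4096
BLOCKS_PER_SEGMENT = (2 * 1024 * 1024) // BLOCK_SIZE  # 512 blocks per 2MB segment


def align_up(addr: int, unit: int) -> int:
    """Round addr up to the next multiple of unit."""
    return (addr + unit - 1) // unit * unit


def align_main_area(start_blk: int, segs_per_sec: int = 1, secs_per_zone: int = 1):
    """Align a segment-aligned start block to the zone size.

    Returns the aligned start block and the number of whole segments
    that would have to be reserved in front of it as padding.
    """
    zone_blocks = BLOCKS_PER_SEGMENT * segs_per_sec * secs_per_zone
    aligned = align_up(start_blk, zone_blocks)
    padding_segments = (aligned - start_blk) // BLOCKS_PER_SEGMENT
    return aligned, padding_segments


# A start at segment 10 with zones of 8 segments moves to segment 16.
print(align_main_area(10 * BLOCKS_PER_SEGMENT, segs_per_sec=4, secs_per_zone=2))
```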

Refer to the following survey for additional technical details.
https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey

File System Metadata Structure
------------------------------

F2FS adopts the checkpointing scheme to maintain file system consistency. At
mount time, F2FS first tries to find the last valid checkpoint data by scanning
CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
One of them always indicates the last valid data, which is called the shadow
copy mechanism. In addition to CP, NAT and SIT also adopt the shadow copy
mechanism.

For file system consistency, each CP points to which NAT and SIT copies are
valid, as shown below::

                +--------+----------+---------+
                |   CP   |    SIT   |   NAT   |
                +--------+----------+---------+
                .         .          .          .
              .            .              .              .
            .               .                 .                 .
  +-------+-------+--------+--------+--------+--------+
  | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 |
  +-------+-------+--------+--------+--------+--------+
     |             ^                          ^
     |             |                          |
     `----------------------------------------'
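
The shadow copy idea can be sketched as follows: of the two CP copies, the one
that is intact and most recent is used, and it names the valid NAT and SIT
copies. This is a simplified Python illustration under stated assumptions, not
the on-disk format. The ``CheckpointCopy`` fields, the use of a single version
number, and the CRC32 check via ``zlib`` are all hypothetical simplifications.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointCopy:
    version: int     # hypothetical: increases with every checkpoint
    nat_copy: int    # which NAT copy (0 or 1) this checkpoint marks as valid
    sit_copy: int    # which SIT copy (0 or 1) this checkpoint marks as valid
    payload: bytes   # stand-in for the rest of the checkpoint data
    checksum: int    # CRC32 of payload, written last

    def is_valid(self) -> bool:
        # A torn or corrupted copy fails the checksum comparison.
        return zlib.crc32(self.payload) == self.checksum


def pick_checkpoint(cp0: CheckpointCopy, cp1: CheckpointCopy) -> Optional[CheckpointCopy]:
    """Return the newest intact checkpoint copy, or None if both are damaged."""
    valid = [cp for cp in (cp0, cp1) if cp.is_valid()]
    if not valid:
        return None
    return max(valid, key=lambda cp: cp.version)


good = CheckpointCopy(7, nat_copy=0, sit_copy=1, payload=b"cp-7",
                      checksum=zlib.crc32(b"cp-7"))
torn = CheckpointCopy(8, nat_copy=1, sit_copy=0, payload=b"cp-8",
                      checksum=0)  # simulated interrupted write
chosen = pick_checkpoint(good, torn)
print(chosen.version, chosen.nat_copy, chosen.sit_copy)
```

Because only two copies exist, the scan at mount time stays short: it never
has to inspect more than two candidates.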

Index Structure
---------------

The key data structure to manage the data locations is a "node". Similar to
traditional file structures, F2FS has three types of nodes: inode, direct node,
and indirect node. F2FS assigns 4KB to an inode block which contains 923 data
block indices, two direct node pointers, two indirect node pointers, and one
double indirect node pointer as described below. One direct node block addresses
1018 data blocks, and one indirect node block likewise addresses 1018 node
blocks. Thus, one inode block (i.e., a file) covers::

  4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.

  Inode block (4KB)
    |- data (923)
    |- direct node (2)
    |          `- data (1018)
    |- indirect node (2)
    |            `- direct node (1018)
    |                       `- data (1018)
    `- double indirect node (1)
                        `- indirect node (1018)
                                     `- direct node (1018)
                                                `- data (1018)
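The coverage figure above can be reproduced with a few lines of arithmetic.
The constant names below are chosen for this sketch; the values are the ones
given in the text, and "TB" is interpreted as 2^40 bytes, which is the reading
under which the 3.94 figure holds.

```python
BLOCK_SIZE = 4 * 1024    # 4KB block, as stated above
ADDRS_PER_INODE = 923    # data block indices held directly in the inode
ADDRS_PER_NODE = 1018    # entries in a direct or indirect node block


def max_file_blocks() -> int:
    """Number of data blocks reachable from one inode block."""
    in_inode = ADDRS_PER_INODE
    via_direct = 2 * ADDRS_PER_NODE
    via_indirect = 2 * ADDRS_PER_NODE ** 2
    via_double_indirect = ADDRS_PER_NODE ** 3
    return in_inode + via_direct + via_indirect + via_double_indirect


def max_file_bytes() -> int:
    """Bytes covered by one inode block."""
    return max_file_blocks() * BLOCK_SIZE
```

Dividing ``max_file_bytes()`` by ``2 ** 40`` gives roughly 3.94.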

Note that all the node blocks are mapped by NAT, which means the location of
each node is translated by the NAT table. With respect to the wandering tree
problem, this allows F2FS to cut off the propagation of node updates caused by
leaf data writes.

Directory Structure
-------------------

A directory entry occupies 11 bytes and consists of the following attributes.

- hash    hash value of the file name
- ino     inode number
- len     the length of the file name
- type    file type such as directory, symlink, etc.

A dentry block consists of 214 dentry slots and file names. Therein a bitmap is
used to represent whether each dentry is valid or not. A dentry block occupies
4KB with the following composition.

::

  Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
                      dentries(11 * 214 bytes) + file name (8 * 214 bytes)

                         [Bucket]
             +--------------------------------+
             |dentry block 1 | dentry block 2 |
             +--------------------------------+
             .               .
       .                             .
  .       [Dentry Block Structure: 4KB]       .
  +--------+----------+----------+------------+
  | bitmap | reserved | dentries | file names |
  +--------+----------+----------+------------+
  [Dentry Block: 4KB] .   .
                 .               .
            .                          .
            +------+------+-----+------+
            | hash | ino  | len | type |
            +------+------+-----+------+
            [Dentry Structure: 11 bytes]
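As a quick sanity check, the composition above adds up to exactly one 4KB
block, and the 27-byte bitmap has enough bits for all 214 slots. The sketch
below uses constant names invented for illustration. The ``slots_for_name``
helper assumes that a name longer than one 8-byte slot simply occupies the
next consecutive slots.

```python
DENTRY_SLOTS = 214    # dentry slots per block, from the text above
DENTRY_SIZE = 11      # bytes per dentry: hash, ino, len, type
SLOT_NAME_LEN = 8     # bytes of file name storage per slot
BITMAP_BYTES = 27
RESERVED_BYTES = 3


def dentry_block_bytes() -> int:
    """Total size of one dentry block according to the composition above."""
    per_slot = DENTRY_SIZE + SLOT_NAME_LEN
    return BITMAP_BYTES + RESERVED_BYTES + DENTRY_SLOTS * per_slot


def slots_for_name(name_len: int) -> int:
    """Consecutive slots needed for a name of name_len bytes (assumption)."""
    return -(-name_len // SLOT_NAME_LEN)  # ceiling division
```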

F2FS implements multi-level hash tables for the directory structure. Each level
has a hash table with a dedicated number of hash buckets as shown below. Note
that "A(2B)" means a bucket includes 2 data blocks.

::

    ----------------------
    A : bucket
    B : block
    N : MAX_DIR_HASH_DEPTH
    ----------------------

    level #0   | A(2B)
               |
    level #1   | A(2B) - A(2B)
               |
    level #2   | A(2B) - A(2B) - A(2B) - A(2B)
         .     |   .       .       .       .
    level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
         .     |   .       .       .       .
    level #N   | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)

The numbers of blocks and buckets are determined by::

                            ,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
  # of blocks in level #n = |
                            `- 4, Otherwise

                             ,- 2^(n + dir_level),
                             |        if n + dir_level < MAX_DIR_HASH_DEPTH / 2,
  # of buckets in level #n = |
                             `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1),
                                      Otherwise
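The two piecewise definitions above translate directly into code. In this
sketch the value 63 for ``MAX_DIR_HASH_DEPTH`` is an assumption (the text only
names the constant), and the division is taken as integer division.

```python
MAX_DIR_HASH_DEPTH = 63  # assumed value; the text above only names it


def blocks_per_bucket(level: int) -> int:
    """Number of blocks in each bucket of the given level."""
    return 2 if level < MAX_DIR_HASH_DEPTH // 2 else 4


def buckets_in_level(level: int, dir_level: int = 0) -> int:
    """Number of buckets in the hash table of the given level."""
    if level + dir_level < MAX_DIR_HASH_DEPTH // 2:
        return 1 << (level + dir_level)
    return 1 << (MAX_DIR_HASH_DEPTH // 2 - 1)
```

With ``dir_level`` left at 0, level #2 has 4 buckets of 2 blocks each, which
matches the diagram.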

When F2FS looks up a file name in a directory, it first calculates a hash value
of the file name. Then, F2FS scans the hash table in level #0 to find the
dentry consisting of the file name and its inode number. If not found, F2FS
scans the next hash table in level #1. In this way, F2FS scans the hash table
in each level incrementally from 0 to N. In each level, F2FS needs to scan only
one bucket determined by the following equation, which shows O(log(# of files))
complexity::

  bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
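The level-by-level lookup can be illustrated with a toy in-memory model. This
is not how dentry blocks are stored: the ``tables`` structure, the small
``MAX_DEPTH`` and the use of CRC32 as a stand-in for the real file name hash
are all assumptions made to keep the sketch short.

```python
import zlib
from typing import Dict, List, Optional, Tuple

MAX_DEPTH = 4  # toy depth for the sketch, far smaller than MAX_DIR_HASH_DEPTH

# tables[level][bucket] -> list of (hash, name, ino) entries
Tables = List[Dict[int, List[Tuple[int, str, int]]]]


def buckets_in_level(level: int) -> int:
    # Simplified: dir_level is taken as 0 and the upper cap is ignored.
    return 1 << level


def name_hash(name: str) -> int:
    # Stand-in hash (CRC32); F2FS uses its own file name hash.
    return zlib.crc32(name.encode())


def insert(tables: Tables, level: int, name: str, ino: int) -> None:
    """Place an entry in the single bucket its hash selects at this level."""
    h = name_hash(name)
    bucket = h % buckets_in_level(level)
    tables[level].setdefault(bucket, []).append((h, name, ino))


def lookup(tables: Tables, name: str) -> Optional[int]:
    """Scan one bucket per level, starting at level 0."""
    h = name_hash(name)
    for level in range(MAX_DEPTH):
        bucket = h % buckets_in_level(level)
        for entry_hash, entry_name, ino in tables[level].get(bucket, []):
            if entry_hash == h and entry_name == name:
                return ino
    return None
```

Only one bucket is examined per level, so the work grows with the number of
levels rather than with the number of entries in the directory.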

In the case of file creation, F2FS finds empty consecutive slots that cover the
file name. F2FS searches for the empty slots in the hash tables of all levels
from 0 to N in the same way as the lookup operation.

The following figure shows an example of two cases holding children::

       --------------> Dir <--------------
       |                                 |
    child                             child

    child - child                     [hole] - child

    child - child - child             [hole] - [hole] - child

   Case 1:                           Case 2:
   Number of children = 6,           Number of children = 3,
   File size = 7                     File size = 7

Default Block Allocation
------------------------

At runtime, F2FS manages six active logs inside the "Main" area: Hot/Warm/Cold
node and Hot/Warm/Cold data.

- Hot node     contains direct node blocks of directories.
- Warm node    contains direct node blocks except hot node blocks.
- Cold node    contains indirect node blocks.
- Hot data     contains dentry blocks.
- Warm data    contains data blocks except hot and cold data blocks.
- Cold data    contains multimedia data or migrated data blocks.
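The six-way split above can be read as two small decision functions, one for
node blocks and one for data blocks. The function names, the boolean
parameters and the order in which the data conditions are tested are
assumptions of this sketch; the text does not say how F2FS detects multimedia
or migrated data.

```python
from enum import Enum


class Log(Enum):
    HOT_NODE = "hot node"
    WARM_NODE = "warm node"
    COLD_NODE = "cold node"
    HOT_DATA = "hot data"
    WARM_DATA = "warm data"
    COLD_DATA = "cold data"


def node_log(is_direct: bool, of_directory: bool) -> Log:
    """Pick the active log for a node block."""
    if not is_direct:
        return Log.COLD_NODE          # indirect node blocks
    return Log.HOT_NODE if of_directory else Log.WARM_NODE


def data_log(is_dentry: bool, is_multimedia: bool, is_migrated: bool) -> Log:
    """Pick the active log for a data block."""
    if is_dentry:
        return Log.HOT_DATA
    if is_multimedia or is_migrated:
        return Log.COLD_DATA
    return Log.WARM_DATA              # everything else
```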

LFS has two schemes for free space management: threaded log and
copy-and-compaction. The copy-and-compaction scheme, which is known as cleaning,
is well-suited for devices showing very good sequential write performance, since
free segments are served all the time for writing new data. However, it suffers
from cleaning overhead under high utilization. Conversely, the threaded log
scheme suffers from random writes, but no cleaning process is needed. F2FS
adopts a hybrid scheme where the copy-and-compaction scheme is adopted by
default, but the policy is dynamically changed to the threaded log scheme
according to the file system status.

In order to align F2FS with underlying flash-based storage, F2FS allocates a
segment in a unit of section. F2FS expects the section size to be the same as
the unit size of garbage collection in FTL. Furthermore, with respect to the
mapping granularity in FTL, F2FS allocates each section of the active logs from
different zones as much as possible, since FTL can write the data in the active
logs into one allocation unit according to its mapping granularity.

Cleaning process
----------------

F2FS does cleaning both on demand and in the background. On-demand cleaning is
triggered when there are not enough free segments to serve VFS calls. The
background cleaner is operated by a kernel thread, and triggers the cleaning job
when the system is idle.

F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
In the greedy algorithm, F2FS selects a victim segment having the smallest number
of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment
according to the segment age and the number of valid blocks in order to address
the log block thrashing problem in the greedy algorithm. F2FS adopts the greedy
algorithm for the on-demand cleaner, while the background cleaner adopts the
cost-benefit algorithm.
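The difference between the two policies can be sketched as below. The scoring
function is the classic LFS cost-benefit ratio, used here as a stand-in; the
exact formula F2FS applies is not given in this text. The ``Segment`` class,
its ``age`` field and the 512-blocks-per-segment figure (2MB segment of 4KB
blocks) are assumptions of the sketch.

```python
from dataclasses import dataclass
from typing import List

BLOCKS_PER_SEGMENT = 512  # assumed: 2MB segment of 4KB blocks


@dataclass
class Segment:
    segno: int
    valid_blocks: int
    age: int  # assumed: time units since the segment was last modified


def pick_greedy(segments: List[Segment]) -> Segment:
    """Greedy: the segment with the fewest valid blocks."""
    return min(segments, key=lambda s: s.valid_blocks)


def cost_benefit_score(seg: Segment) -> float:
    """Classic LFS ratio: (1 - u) * age / (1 + u), u = utilization."""
    u = seg.valid_blocks / BLOCKS_PER_SEGMENT
    return (1.0 - u) * seg.age / (1.0 + u)


def pick_cost_benefit(segments: List[Segment]) -> Segment:
    """Cost-benefit: the segment with the highest benefit-to-cost ratio."""
    return max(segments, key=cost_benefit_score)
```

Given a young segment with 100 valid blocks and a much older one with 200, the
greedy policy picks the first while the cost-benefit policy prefers the second,
since old data is less likely to be invalidated soon.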

In order to identify whether the data in the victim segment are valid or not,
F2FS manages a bitmap. Each bit represents the validity of a block, and the
bitmap is composed of a bit stream covering all blocks in the main area.
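A per-block validity bitmap of this kind can be modelled in a few lines. The
class name and the bit ordering within each byte are assumptions of this
sketch, not the on-disk layout.

```python
from typing import List


class ValidityBitmap:
    """One bit per block; bit order within a byte is an assumption."""

    def __init__(self, total_blocks: int) -> None:
        self.total_blocks = total_blocks
        self.bits = bytearray((total_blocks + 7) // 8)

    def set_valid(self, blkno: int, valid: bool) -> None:
        byte, bit = divmod(blkno, 8)
        if valid:
            self.bits[byte] |= 1 << bit
        else:
            self.bits[byte] &= ~(1 << bit) & 0xFF

    def is_valid(self, blkno: int) -> bool:
        byte, bit = divmod(blkno, 8)
        return bool(self.bits[byte] & (1 << bit))

    def valid_in_range(self, start: int, count: int) -> List[int]:
        """Blocks a cleaner would have to migrate from [start, start+count)."""
        return [b for b in range(start, start + count) if self.is_valid(b)]
```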

Write-hint Policy
-----------------

1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.

2) whint_mode=user-based. F2FS tries to pass down hints given by
   users.

===================== ======================== ===================
User                  F2FS                     Block
===================== ======================== ===================
N/A                   META                     WRITE_LIFE_NOT_SET
N/A                   HOT_NODE                 "
N/A                   WARM_NODE                "
N/A                   COLD_NODE                "
ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
extension list        "                        "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        "
WRITE_LIFE_MEDIUM     "                        "
WRITE_LIFE_LONG       "                        "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG
===================== ======================== ===================

3) whint_mode=fs-based. F2FS passes down hints with its policy.

===================== ======================== ===================
User                  F2FS                     Block
===================== ======================== ===================
N/A                   META                     WRITE_LIFE_MEDIUM
N/A                   HOT_NODE                 WRITE_LIFE_NOT_SET
N/A                   WARM_NODE                "
N/A                   COLD_NODE                WRITE_LIFE_NONE
ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
extension list        "                        "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_LONG
WRITE_LIFE_NONE       "                        "
WRITE_LIFE_MEDIUM     "                        "
WRITE_LIFE_LONG       "                        "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG
===================== ======================== ===================
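For data writes that carry a user hint, the tables above reduce to a small
lookup. The sketch below encodes only the buffered io and direct io rows; the
META, node, ioctl(COLD) and extension list rows are left out. The function
names and the use of plain strings for hints are choices made for this
illustration.

```python
USER_HINTS = {
    "WRITE_LIFE_NOT_SET", "WRITE_LIFE_NONE", "WRITE_LIFE_SHORT",
    "WRITE_LIFE_MEDIUM", "WRITE_LIFE_LONG", "WRITE_LIFE_EXTREME",
}


def data_temperature(user_hint: str) -> str:
    """F2FS column: the data log a user hint maps to (same in both tables)."""
    if user_hint not in USER_HINTS:
        raise ValueError(f"unknown hint: {user_hint}")
    if user_hint == "WRITE_LIFE_EXTREME":
        return "COLD_DATA"
    if user_hint == "WRITE_LIFE_SHORT":
        return "HOT_DATA"
    return "WARM_DATA"


def block_hint(whint_mode: str, direct_io: bool, user_hint: str) -> str:
    """Block column: the hint passed down to the block layer."""
    if whint_mode == "off":
        return "WRITE_LIFE_NOT_SET"
    if whint_mode not in ("user-based", "fs-based"):
        raise ValueError(f"unknown whint_mode: {whint_mode}")
    temperature = data_temperature(user_hint)
    if direct_io:
        return user_hint  # direct io rows pass the user hint through
    if temperature == "COLD_DATA":
        return "WRITE_LIFE_EXTREME"
    if temperature == "HOT_DATA":
        return "WRITE_LIFE_SHORT"
    # Buffered warm data is the only case where the two modes differ.
    return "WRITE_LIFE_NOT_SET" if whint_mode == "user-based" else "WRITE_LIFE_LONG"
```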

Fallocate(2) Policy
-------------------

The default policy follows the POSIX rule below.

Allocating disk space
    The default operation (i.e., mode is zero) of fallocate() allocates
    the disk space within the range specified by offset and len.  The
    file size (as reported by stat(2)) will be changed if offset+len is
    greater than the file size.  Any subregion within the range specified
    by offset and len that did not contain data before the call will be
    initialized to zero.  This default behavior closely resembles the
    behavior of the posix_fallocate(3) library function, and is intended
    as a method of optimally implementing that function.

However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) prior to
fallocate(fd, DEFAULT_MODE), it allocates on-disk block addresses having
zero or random data, which is useful in the following scenario:

 1. create(fd)
 2. ioctl(fd, F2FS_IOC_SET_PIN_FILE)
 3. fallocate(fd, 0, 0, size)
 4. address = fibmap(fd, offset)
 5. open(blkdev)
 6. write(blkdev, address)
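Steps 1 to 4 of the scenario above might look as follows from user space. This
is a sketch under several assumptions that do not come from this text: the
ioctl number encoding used by most Linux architectures, magic 0xf5 and command
number 13 for F2FS_IOC_SET_PIN_FILE, a ``__u32`` argument, and FIBMAP being
ioctl number 1. ``pin_and_preallocate`` is a hypothetical helper name. Steps 5
and 6 (writing to the raw block device) are deliberately omitted.

```python
import fcntl
import os
import struct

# Assumptions: values believed to match the kernel headers on common
# architectures (x86, arm, arm64); verify against your headers before use.
F2FS_IOCTL_MAGIC = 0xF5
FIBMAP = 1


def iow(magic: int, nr: int, size: int) -> int:
    """Encode a write-direction ioctl number, like the C macro _IOW()."""
    ioc_write = 1
    return (ioc_write << 30) | (size << 16) | (magic << 8) | nr


F2FS_IOC_SET_PIN_FILE = iow(F2FS_IOCTL_MAGIC, 13, 4)  # 4 == sizeof(__u32)


def pin_and_preallocate(path: str, size: int) -> int:
    """Create, pin and preallocate a file, then map its first block.

    Returns the on-disk block address reported by FIBMAP for logical
    block 0. FIBMAP normally requires CAP_SYS_RAWIO.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.ioctl(fd, F2FS_IOC_SET_PIN_FILE, struct.pack("I", 1))
        os.posix_fallocate(fd, 0, size)
        result = fcntl.ioctl(fd, FIBMAP, struct.pack("i", 0))
        return struct.unpack("i", result)[0]
    finally:
        os.close(fd)
```

The helper only makes sense on a file that lives on an F2FS mount; on other
file systems the pin ioctl is expected to fail with an error.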

Compression implementation
--------------------------

- A new term, cluster, is defined as the basic unit of compression. A file can
  be divided logically into multiple clusters. One cluster includes 4 << n
  (n >= 0) logical pages, the compression size is also the cluster size, and
  each cluster can be compressed or not.

- In the cluster metadata layout, one special block address is used to indicate
  whether a cluster is a compressed one or a normal one. For a compressed
  cluster, the following metadata maps the cluster to [1, (4 << n) - 1] physical
  blocks, in which F2FS stores data including the compress header and the
  compressed data.

- In order to eliminate write amplification during overwrite, F2FS only
  supports compression on write-once files. Data can be compressed only when
  all logical blocks in a cluster contain valid data and the compress ratio of
  the cluster data is lower than the specified threshold.

- To enable compression on a regular inode, there are three ways:

  * chattr +c file
  * chattr +c dir; touch dir/file
  * mount w/ -o compress_extension=ext; touch file.ext
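The cluster arithmetic described in the first two items can be sketched as
follows. The function names are invented for this illustration, and the range
of physical blocks is read as [1, (4 << n) - 1], meaning a compressed cluster
always occupies at least one block fewer than its logical size.

```python
from typing import Tuple


def pages_per_cluster(n: int) -> int:
    """Logical pages in one cluster for a given n (n >= 0)."""
    if n < 0:
        raise ValueError("n must be >= 0")
    return 4 << n


def cluster_of_page(page_index: int, n: int) -> int:
    """Index of the cluster that holds a given logical page."""
    return page_index // pages_per_cluster(n)


def physical_block_range(n: int) -> Tuple[int, int]:
    """Inclusive range of physical blocks a compressed cluster may use."""
    return 1, pages_per_cluster(n) - 1


def fits_compressed_layout(compressed_blocks: int, n: int) -> bool:
    """True if a compression result fits the compressed cluster layout."""
    low, high = physical_block_range(n)
    return low <= compressed_blocks <= high
```

For n = 0 a cluster has 4 logical pages, so a result that still needs 4 blocks
does not fit the compressed layout and the cluster would stay a normal one.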

Compress metadata layout::

                                [Dnode Structure]
                +-----------------------------------------------+
                | cluster 1 | cluster 2 | ......... | cluster N |
                +-----------------------------------------------+
                .           .                       .           .
          .                      .                .                      .
    .         Compressed Cluster       .        .        Normal Cluster            .
    +----------+---------+---------+---------+  +---------+---------+---------+---------+
    |compr flag| block 1 | block 2 | block 3 |  | block 1 | block 2 | block 3 | block 4 |
    +----------+---------+---------+---------+  +---------+---------+---------+---------+
               .                             .
            .                                           .
        .                                                           .
        +-------------+-------------+----------+----------------------------+
        | data length | data chksum | reserved |      compressed data       |
        +-------------+-------------+----------+----------------------------+

NVMe Zoned Namespace devices
----------------------------

- ZNS defines a per-zone capacity which can be equal to or less than the
  zone-size. Zone-capacity is the number of usable blocks in the zone.
  F2FS checks if zone-capacity is less than zone-size; if it is, then any
  segment which starts after the zone-capacity is marked as not-free in
  the free segment bitmap at initial mount time. These segments are marked
  as permanently used so they are not allocated for writes and
  consequently do not need to be garbage collected. In case the
  zone-capacity is not aligned to the default segment size (2MB), a segment
  can start before the zone-capacity and span across the zone-capacity
  boundary. Such spanning segments are also considered usable segments. All
  blocks past the zone-capacity are considered unusable in these segments.
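The per-segment outcome of this rule can be computed as below. The sketch
assumes 4KB blocks (so a 2MB segment holds 512 blocks) and a zone size that is
a multiple of the segment size; the function name is hypothetical.

```python
from typing import List

BLOCKS_PER_SEGMENT = 512  # 2MB default segment, assuming 4KB blocks


def usable_blocks_per_segment(zone_size_blocks: int,
                              zone_capacity_blocks: int) -> List[int]:
    """Usable block count for each segment of one zone.

    0 means the segment starts at or past the zone-capacity and would be
    marked as permanently used; a value between 1 and 511 is a segment
    that spans the zone-capacity boundary.
    """
    if not 0 <= zone_capacity_blocks <= zone_size_blocks:
        raise ValueError("zone capacity must be within the zone size")
    usable = []
    for start in range(0, zone_size_blocks, BLOCKS_PER_SEGMENT):
        remaining = zone_capacity_blocks - start
        usable.append(max(0, min(BLOCKS_PER_SEGMENT, remaining)))
    return usable
```

For a zone of 2048 blocks with a capacity of 1300 blocks, the result is two
full segments, one spanning segment with 276 usable blocks, and one segment
with none.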