========
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

https://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as little as 5 zones will be used
internally for storing metadata and performing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata. dm-zoned can also use a regular block device together with the
zoned block device; in that case the regular block device will be split
logically into zones of the same size as the zones of the zoned block
device. These zones will be placed in front of the zones of the zoned
block device and will be handled just like conventional zones.

The zones of the device(s) are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may also be used for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the on-disk amount and position of metadata
blocks.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the size of the zones of the zoned
block device. The mapping table is indexed by chunk number and each
mapping entry indicates the zone number of the device storing the chunk
of data. Each mapping entry may also indicate the zone number of a
conventional zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is always valid only in the data zone mapping the
chunk or in the buffer zone of the chunk.

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk automatically
invalidates the same block in the sequential zone mapping the chunk. If
all blocks of the sequential zone become invalid, the zone is freed and
the chunk buffer zone becomes the primary zone mapping the chunk,
resulting in native random write performance similar to a regular block
device.

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or, if the chunk is buffered, from
its assigned buffer zone. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.

After some time, the limited number of conventional zones available may
be exhausted (all of them used to map chunks or to buffer sequential
zones) and unaligned writes to unbuffered chunks become impossible. To
avoid this situation, a reclaim process regularly scans used
conventional zones and tries to reclaim the least recently used zones by
copying the valid blocks of the buffer zone to a free sequential zone.
Once the copy completes, the chunk mapping is updated to point to the
sequential zone and the buffer zone is freed for reuse.

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block of the
secondary set; a generation counter is used to indicate that this set
contains the newest metadata. Once this operation completes, in-place
updates of metadata blocks can be done in the primary metadata set.
This ensures that one of the sets is always consistent (all
modifications committed or none at all). Flush operations are used as a
commit point: upon reception of a flush request, metadata modification
activity is temporarily blocked (for both incoming BIO processing and
the reclaim process) and all dirty metadata blocks are staged and
updated. Normal operation is then resumed. Flushing metadata thus only
temporarily delays write and discard requests. Read requests can be
processed concurrently while a metadata flush is being executed.
| 133 | |
Hannes Reinecke | bd5c403 | 2020-05-11 10:24:30 +0200 | [diff] [blame] | 134 | If a regular device is used in conjunction with the zoned block device, |
| 135 | a third set of metadata (without the zone bitmaps) is written to the |
| 136 | start of the zoned block device. This metadata has a generation counter of |
| 137 | '0' and will never be updated during normal operation; it just serves for |
| 138 | identification purposes. The first and second copy of the metadata |
| 139 | are located at the start of the regular block device. |
| 140 | |
Usage
=====

A zoned block device must first be formatted using the dmzadm tool.
This will analyze the device zone configuration, determine where to
place the metadata sets on the device, and initialize the metadata sets.

Ex::

	dmzadm --format /dev/sdxx


If two drives are to be used, both devices must be specified, with the
regular block device as the first device.

Ex::

	dmzadm --format /dev/sdxx /dev/sdyy

Formatted device(s) can also be started with the dmzadm utility.

Ex::

	dmzadm --start /dev/sdxx /dev/sdyy


Information about the internal layout and current usage of the zones
can be obtained with the 'status' callback from dmsetup.

Ex::

	dmsetup status /dev/dm-X

will return a line of the form::

	0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential

where <nr_zones> is the total number of zones, <nr_unmap_rnd> is the
number of unmapped (i.e. free) random zones, <nr_rnd> the total number
of random zones, <nr_unmap_seq> the number of unmapped sequential
zones, and <nr_seq> the total number of sequential zones.

Normally the reclaim process will be started once less than 50 percent
of the random zones are free. To start the reclaim process manually,
even before reaching this threshold, the 'dmsetup message' function can
be used.

Ex::

	dmsetup message /dev/dm-X 0 reclaim

will start the reclaim process and random zones will be moved to
sequential zones.