Blame - Documentation/device-mapper/dm-zoned.txt - SHIFTPHONES/mainline/linux

blob: 736fcc78d193e9f1c022b6ce89a98addb3bc17d9

Damien Le Moal, commit 3b1a94c, 2017-06-07 15:55:39 +0900

dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

http://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as little as 5 zones will be used
internally for storing metadata and performing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata.

The zones of the device are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may also be used for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
is later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).
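As a rough illustration of the metadata saving, the sketch below compares
the size of a validity bitmap for one zone when blocks are tracked at 4096
bytes versus 512 bytes. It assumes one bit per block and a zone of 256 MiB
(the zone size of the example disk above, read as a binary unit); the
helper name and the 512-byte comparison point are illustrative only and
are not part of dm-zoned.

```python
ZONE_SIZE = 256 * 1024 * 1024  # assumed zone size: 256 MiB


def bitmap_bytes_per_zone(block_size, zone_size=ZONE_SIZE):
    """Bytes needed for a one-bit-per-block validity bitmap of one zone."""
    blocks = zone_size // block_size
    return blocks // 8


# 4096-byte blocks: 65536 blocks per zone, i.e. two 4096-byte bitmap blocks.
print(bitmap_bytes_per_zone(4096))  # 8192
# 512-byte blocks would need a bitmap eight times larger.
print(bitmap_bytes_per_zone(512))   # 65536
```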

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the on-disk amount and position of metadata
blocks.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zone size of the zoned block
device. The mapping table is indexed by chunk number and each mapping
entry indicates the zone number of the device storing the chunk of data.
Each mapping entry may also indicate the zone number of a conventional
zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is only ever valid in one of the two zones: either
in the data zone mapping the chunk or in the buffer zone of the chunk.
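The chunk-indexed mapping table described in item 2 can be pictured with
the following minimal sketch. The names MapEntry and locate, the use of
None for "no zone", and the 256 MiB zone size are assumptions made for the
illustration; they do not describe the actual on-disk entry layout.

```python
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 4096
# Assumed zone size of 256 MiB, so one chunk spans 65536 logical blocks.
BLOCKS_PER_CHUNK = (256 * 1024 * 1024) // BLOCK_SIZE


@dataclass
class MapEntry:
    data_zone: Optional[int] = None    # zone storing the chunk, None if unmapped
    buffer_zone: Optional[int] = None  # conventional zone buffering random writes


def locate(block):
    """Split a logical block number into (chunk number, offset in chunk)."""
    return divmod(block, BLOCKS_PER_CHUNK)


# A four-chunk table in which only chunk 1 is mapped and buffered.
mapping = [MapEntry() for _ in range(4)]
mapping[1] = MapEntry(data_zone=17, buffer_zone=3)

chunk, offset = locate(BLOCKS_PER_CHUNK + 5)
entry = mapping[chunk]
print(chunk, offset, entry.data_zone, entry.buffer_zone)
```

Indexing by chunk number keeps the table small: one entry per zone-sized
chunk rather than one per block.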

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.
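The write path decision for a chunk mapped to a sequential zone can be
sketched as follows. This is a simplified model, not the driver code: it
handles single-block writes only, the class and attribute names are
hypothetical, and freeing a fully invalidated sequential zone is not
modeled. Invalidating the buffered copy on a direct write is inferred from
the rule that a block is valid in only one of the two zones.

```python
class Chunk:
    """Toy model of a chunk mapped to a sequential zone."""

    def __init__(self):
        self.write_pointer = 0   # next writable block offset in the zone
        self.seq_valid = set()   # valid block offsets in the sequential zone
        self.buf_valid = set()   # valid block offsets in the buffer zone
        self.has_buffer = False  # True once a conventional zone is assigned

    def write(self, offset):
        """Write one block and return the path that handled it."""
        if offset == self.write_pointer:
            # Aligned on the write pointer: write directly to the zone.
            self.seq_valid.add(offset)
            self.buf_valid.discard(offset)
            self.write_pointer += 1
            return "direct"
        # Unaligned: allocate a buffer zone if needed and write there,
        # invalidating the same block in the sequential zone.
        self.has_buffer = True
        self.buf_valid.add(offset)
        self.seq_valid.discard(offset)
        return "buffered"


c = Chunk()
paths = [c.write(0), c.write(5), c.write(1), c.write(0)]
print(paths, sorted(c.seq_valid), sorted(c.buf_valid))
```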

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation is terminated.
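A minimal sketch of this read resolution, using Python sets in place of
the validity bitmaps and dictionaries in place of zone contents. The
function name and its parameters are hypothetical; an unmapped chunk is
represented simply by empty sets.

```python
BLOCK_SIZE = 4096


def read_block(offset, seq_valid, buf_valid, seq_data, buf_data):
    """Return the content of one logical block of a chunk."""
    if offset in seq_valid:
        return seq_data[offset]  # valid in the sequential zone
    if offset in buf_valid:
        return buf_data[offset]  # valid in the buffer zone
    return bytes(BLOCK_SIZE)     # unmapped or invalid: zero-filled


seq_data = {0: b"S" * BLOCK_SIZE}
buf_data = {5: b"B" * BLOCK_SIZE}
from_seq = read_block(0, {0}, {5}, seq_data, buf_data)
from_buf = read_block(5, {0}, {5}, seq_data, buf_data)
zeroed = read_block(9, {0}, {5}, seq_data, buf_data)
print(from_seq[:1], from_buf[:1], zeroed[:1])
```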

After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone is freed for reuse.
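The reclaim step can be sketched as below, under simplifying assumptions:
an OrderedDict stands in for the least recently used ordering, each
conventional zone holds the only valid blocks of its chunk (merging with
blocks still valid in an existing sequential zone is not modeled), and all
names are hypothetical.

```python
from collections import OrderedDict


def reclaim_lru(conv_zones, free_seq_zones, chunk_map, free_conv_zones):
    """Reclaim the least recently used conventional zone, if possible.

    conv_zones maps zone id -> (chunk number, {block offset: data}) and is
    ordered from least to most recently used.
    """
    if not conv_zones or not free_seq_zones:
        return None
    zone_id, (chunk, blocks) = conv_zones.popitem(last=False)  # LRU entry
    target = free_seq_zones.pop()
    # Copy valid blocks in ascending offset order, since the target zone
    # must be written sequentially.
    copied = {off: blocks[off] for off in sorted(blocks)}
    chunk_map[chunk] = target        # mapping now points to the sequential zone
    free_conv_zones.append(zone_id)  # conventional zone is free for reuse
    return target, copied


conv = OrderedDict([(2, (7, {3: b"c", 1: b"a"})), (4, (9, {0: b"z"}))])
free_seq, chunk_map, free_conv = [20], {7: 2, 9: 4}, []
result = reclaim_lru(conv, free_seq, chunk_map, free_conv)
print(result, chunk_map, free_conv)
```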

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set. A generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place updates of
metadata blocks can be done in the primary metadata set. This ensures
that one of the sets is always consistent (all modifications committed
or none at all). Flush operations are used as a commit point. Upon
reception of a flush request, metadata modification activity is
temporarily blocked (for both incoming BIO processing and reclaim
process) and all dirty metadata blocks are staged and updated. Normal
operation is then resumed. Flushing metadata thus only temporarily
delays write and discard requests. Read requests can be processed
concurrently while metadata flush is being executed.
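The staging scheme can be illustrated with the toy model below. It is a
sketch under assumptions: the class, the dictionary layout and the
crash_after_stage switch are hypothetical, and the recovery rule (take the
set with the highest generation, preferring the primary set on a tie) is
an assumed reading of how the generation counter is used.

```python
class MetadataSets:
    """Toy model of the primary and secondary metadata sets."""

    def __init__(self):
        self.primary = {"gen": 0, "blocks": {}}
        self.secondary = {"gen": 0, "blocks": {}}

    def flush(self, dirty, crash_after_stage=False):
        new_gen = self.primary["gen"] + 1
        # 1) Stage dirty blocks in the secondary set, then validate them
        #    by updating its super block generation.
        self.secondary["blocks"].update(dirty)
        self.secondary["gen"] = new_gen
        if crash_after_stage:
            return  # simulated power loss before the in-place update
        # 2) Update the primary set in place.
        self.primary["blocks"].update(dirty)
        self.primary["gen"] = new_gen

    def recover(self):
        """Pick the set holding the newest committed metadata."""
        return max((self.primary, self.secondary), key=lambda s: s["gen"])


m = MetadataSets()
m.flush({1: "a"})
m.flush({1: "b"}, crash_after_stage=True)
newest = m.recover()
print(newest["gen"], newest["blocks"])
```

In this model a crash after staging leaves the primary set at the older
generation, so recovery falls back to the fully staged secondary set.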

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex:

dmzadm --format /dev/sdxx

For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex:

echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}`