Damien Le Moal | 3b1a94c | 2017-06-07 15:55:39 +0900 | 1 | dm-zoned |
| 2 | ======== |
| 3 | |
| 4 | The dm-zoned device mapper target exposes a zoned block device (ZBC and |
| 5 | ZAC compliant devices) as a regular block device without any write |
| 6 | pattern constraints. In effect, it implements a drive-managed zoned |
| 7 | block device which hides from the user (a file system or an application |
| 8 | doing raw block device accesses) the sequential write constraints of |
| 9 | host-managed zoned block devices and can mitigate the potential |
| 10 | device-side performance degradation due to excessive random writes on |
| 11 | host-aware zoned block devices. |
| 12 | |
| 13 | For a more detailed description of the zoned block device models and |
| 14 | their constraints see (for SCSI devices): |
| 15 | |
| 16 | http://www.t10.org/drafts.htm#ZBC_Family |
| 17 | |
| 18 | and (for ATA devices): |
| 19 | |
| 20 | http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf |
| 21 | |
| 22 | The dm-zoned implementation is simple and minimizes system overhead (CPU |
| 23 | and memory usage as well as storage capacity loss). For a 10TB |
| 24 | host-managed disk with 256 MB zones, dm-zoned memory usage per disk |
| 25 | instance is at most 4.5 MB and as few as 5 zones will be used |
| 26 | internally for storing metadata and performing reclaim operations. |
| 27 | |
| 28 | dm-zoned target devices are formatted and checked using the dmzadm |
| 29 | utility available at: |
| 30 | |
| 31 | https://github.com/hgst/dm-zoned-tools |
| 32 | |
| 33 | Algorithm |
| 34 | ========= |
| 35 | |
| 36 | dm-zoned implements an on-disk buffering scheme to handle non-sequential |
| 37 | write accesses to the sequential zones of a zoned block device. |
| 38 | Conventional zones are used for caching as well as for storing internal |
| 39 | metadata. |
| 40 | |
| 41 | The zones of the device are separated into 2 types: |
| 42 | |
| 43 | 1) Metadata zones: these are conventional zones used to store metadata. |
| 44 | Metadata zones are not reported as useable capacity to the user. |
| 45 | |
| 46 | 2) Data zones: all remaining zones, the vast majority of which will be |
| 47 | sequential zones used exclusively to store user data. The conventional |
| 48 | zones of the device may also be used for buffering user random writes. |
| 49 | Data in these zones may be directly mapped to the conventional zone, but |
| 50 | later moved to a sequential zone so that the conventional zone can be |
| 51 | reused for buffering incoming random writes. |
| 52 | |
| 53 | dm-zoned exposes a logical device with a sector size of 4096 bytes, |
| 54 | irrespective of the physical sector size of the backend zoned block |
| 55 | device being used. This allows reducing the amount of metadata needed to |
| 56 | manage valid blocks (blocks written). |
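The effect of the 4096-byte logical block size on metadata can be illustrated with a little arithmetic. The sketch below assumes one validity bit per block and a 256 MiB zone; the function name is made up for this illustration and is not part of dm-zoned.

```python
ZONE_SIZE = 256 * 2**20  # ASSUMPTION: 256 MiB zones


def bitmap_bytes_per_zone(block_size: int) -> int:
    """Bytes of validity bitmap for one zone, assuming one bit per block."""
    blocks_in_zone = ZONE_SIZE // block_size
    return blocks_in_zone // 8


# Compare 512-byte blocks with the 4096-byte blocks dm-zoned exposes.
for size in (512, 4096):
    print(size, bitmap_bytes_per_zone(size))
```

Under these assumptions, tracking validity at 4096-byte granularity needs one eighth of the bitmap space that 512-byte granularity would need.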
| 57 | |
| 58 | The on-disk metadata format is as follows: |
| 59 | |
| 60 | 1) The first block of the first conventional zone found contains the |
| 61 | super block which describes the on disk amount and position of metadata |
| 62 | blocks. |
| 63 | |
| 64 | 2) Following the super block, a set of blocks is used to describe the |
| 65 | mapping of the logical device blocks. The mapping is done per chunk of |
| 66 | blocks, with the chunk size equal to the device zone size. The |
| 67 | mapping table is indexed by chunk number and each mapping entry |
| 68 | indicates the zone number of the device storing the chunk of data. Each |
| 69 | mapping entry may also indicate the zone number of a conventional |
| 70 | zone used to buffer random modifications to the data zone. |
| 71 | |
| 72 | 3) A set of blocks used to store bitmaps indicating the validity of |
| 73 | blocks in the data zones follows the mapping table. A valid block is |
| 74 | defined as a block that was written and not discarded. For a buffered |
| 75 | data chunk, a block is valid either in the data zone mapping the chunk |
| 76 | or in the buffer zone of the chunk, but never in both. |
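As a rough illustration of how much space this format needs, the sketch below sizes one copy of the metadata for the 10TB disk with 256 MB zones mentioned earlier. Several values are assumptions made for this example, not taken from the on-disk format: 8 bytes per mapping entry, one bitmap bit per block, "10TB" read as 10 * 10^12 bytes, "256 MB" read as 256 MiB, and every zone counted as a potential chunk and data zone (an upper bound).

```python
BLOCK_SIZE = 4096        # dm-zoned logical block size, from the text
MAP_ENTRY_SIZE = 8       # ASSUMPTION: bytes per chunk mapping entry


def ceil_div(a: int, b: int) -> int:
    """Integer division rounding up."""
    return -(-a // b)


def metadata_layout(disk_bytes: int, zone_bytes: int) -> dict:
    """Estimate the block counts of one metadata copy (upper bound)."""
    zones = disk_bytes // zone_bytes
    blocks_per_zone = zone_bytes // BLOCK_SIZE
    # Mapping table: one entry per chunk, at most one chunk per zone.
    map_blocks = ceil_div(zones * MAP_ENTRY_SIZE, BLOCK_SIZE)
    # Validity bitmaps: one bit per block, rounded up to whole blocks per zone.
    bitmap_blocks = zones * ceil_div(blocks_per_zone, BLOCK_SIZE * 8)
    total_blocks = 1 + map_blocks + bitmap_blocks  # 1 block for the super block
    return {
        "zones": zones,
        "map_blocks": map_blocks,
        "bitmap_blocks": bitmap_blocks,
        "total_blocks": total_blocks,
        "metadata_zones": ceil_div(total_blocks, blocks_per_zone),
    }


layout = metadata_layout(10 * 10**12, 256 * 2**20)
print(layout)
```

With these assumptions the bitmaps dominate: about 74504 blocks out of 74578, so one copy of the metadata fits in 2 zones.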
| 77 | |
| 78 | For a logical chunk mapped to a conventional zone, all write operations |
| 79 | are processed by directly writing to the zone. If the mapping zone is a |
| 80 | sequential zone, the write operation is processed directly only if the |
| 81 | write offset within the logical chunk is equal to the write pointer |
| 82 | offset within the sequential data zone (i.e. the write operation is |
| 83 | aligned on the zone write pointer). Otherwise, write operations are |
| 84 | processed indirectly using a buffer zone. In that case, an unused |
| 85 | conventional zone is allocated and assigned to the chunk being |
| 86 | accessed. Writing a block to the buffer zone of a chunk will |
| 87 | automatically invalidate the same block in the sequential zone mapping |
| 88 | the chunk. If all blocks of the sequential zone become invalid, the zone |
| 89 | is freed and the chunk buffer zone becomes the primary zone mapping the |
| 90 | chunk, resulting in native random write performance similar to a regular |
| 91 | block device. |
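The write handling described above can be sketched as follows. This is a simplified model, not the kernel implementation: the Zone and Chunk classes, the use of a Python set in place of the validity bitmaps, and the free-zone lists are all assumptions made for this illustration. It handles one block per call and assumes the chunk is already mapped.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Zone:
    conventional: bool
    nr_blocks: int
    write_pointer: int = 0                    # used by sequential zones only
    valid: set = field(default_factory=set)   # stand-in for the validity bitmap


@dataclass
class Chunk:
    dzone: Zone                    # zone mapping the chunk
    bzone: Optional[Zone] = None   # conventional buffer zone, if any


def write_block(chunk: Chunk, offset: int,
                free_conv: List[Zone], free_seq: List[Zone]) -> str:
    """Process a one-block write at `offset` within the chunk."""
    dzone = chunk.dzone
    if not 0 <= offset < dzone.nr_blocks:
        raise ValueError("offset outside of the chunk")
    if dzone.conventional:
        dzone.valid.add(offset)               # random writes are allowed
        return "direct"
    if offset == dzone.write_pointer:         # aligned on the write pointer
        dzone.write_pointer += 1
        dzone.valid.add(offset)
        if chunk.bzone is not None:           # keep the block valid in one zone
            chunk.bzone.valid.discard(offset)
        return "direct"
    if chunk.bzone is None:
        chunk.bzone = free_conv.pop()         # IndexError if none are left
    chunk.bzone.valid.add(offset)
    dzone.valid.discard(offset)               # invalidate the sequential copy
    if not dzone.valid:
        # Nothing valid remains in the sequential zone: free it and let the
        # buffer zone become the zone mapping the chunk.
        dzone.write_pointer = 0
        free_seq.append(dzone)
        chunk.dzone, chunk.bzone = chunk.bzone, None
        return "promoted"
    return "buffered"
```

In this model, running out of free conventional zones shows up as an IndexError on an unaligned write, which is the situation reclaim is meant to prevent.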
| 92 | |
| 93 | Read operations are processed according to the block validity |
| 94 | information provided by the bitmaps. Valid blocks are read either from |
| 95 | the sequential zone mapping a chunk, or if the chunk is buffered, from |
| 96 | the buffer zone assigned. If the accessed chunk has no mapping, or the |
| 97 | accessed blocks are invalid, the read buffer is zeroed and the read |
| 98 | operation terminated. |
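A minimal sketch of this read logic is shown below. The data structures are assumptions chosen to keep the example short: the mapping table is a dict indexed by chunk number, and each zone is a dict that holds only its valid blocks, which stands in for the validity bitmaps.

```python
BLOCK_SIZE = 4096


def read_block(mapping: dict, chunk_nr: int, offset: int) -> bytes:
    """Return one block of a chunk, or zeroes if the block is not valid.

    mapping: chunk number -> (data_zone, buffer_zone_or_None)
    zone:    dict {block offset: block data}, valid blocks only
    """
    entry = mapping.get(chunk_nr)
    if entry is None:
        return bytes(BLOCK_SIZE)            # unmapped chunk: zero-filled read
    data_zone, buffer_zone = entry
    if buffer_zone is not None and offset in buffer_zone:
        return buffer_zone[offset]          # block valid in the buffer zone
    if offset in data_zone:
        return data_zone[offset]            # block valid in the data zone
    return bytes(BLOCK_SIZE)                # invalid block: zero-filled read


# Chunk 0 is buffered: block 1 lives in the data zone, block 7 in the buffer.
mapping = {0: ({1: b"S" * BLOCK_SIZE}, {7: b"B" * BLOCK_SIZE})}
print(read_block(mapping, 0, 7)[:1], read_block(mapping, 3, 0)[:1])
```

Because a block is valid in at most one of the two zones, the order in which the zones are checked does not change the result.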
| 99 | |
| 100 | After some time, the limited number of conventional zones available may |
| 101 | be exhausted (all used to map chunks or buffer sequential zones) and |
| 102 | unaligned writes to unbuffered chunks become impossible. To avoid this |
| 103 | situation, a reclaim process regularly scans used conventional zones and |
| 104 | tries to reclaim the least recently used zones by copying the valid |
| 105 | blocks of the buffer zone to a free sequential zone. Once the copy |
| 106 | completes, the chunk mapping is updated to point to the sequential zone |
| 107 | and the buffer zone freed for reuse. |
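One reclaim pass can be sketched as below. The structures are assumptions for illustration only: a chunk is a dict with a "seq" zone (or None) and a "conv" zone, a zone is a dict of its valid blocks, and the caller supplies the chunks ordered from least to most recently used. Merging in blocks that are still valid in the old sequential zone is also an assumption; the text above only describes copying the buffer zone.

```python
def reclaim_lru(lru_chunks: list, free_seq: list, free_conv: list) -> dict:
    """Move the least recently used chunk off its conventional zone.

    lru_chunks: chunks using a conventional zone, least recently used first
    free_seq:   empty sequential zones
    free_conv:  list receiving the freed conventional zone
    """
    chunk = lru_chunks.pop(0)
    target = free_seq.pop()                 # free sequential zone to fill
    old_seq, old_conv = chunk["seq"], chunk["conv"]

    # Gather every valid block of the chunk. A block is valid in at most
    # one of the two zones, so the merge never overwrites anything.
    blocks = dict(old_seq or {})
    blocks.update(old_conv)

    # Copy in ascending offset order, as a sequential zone requires. Gaps
    # between valid blocks are not modeled in this sketch.
    for offset in sorted(blocks):
        target[offset] = blocks[offset]

    # Update the chunk mapping, then release the old zones for reuse.
    chunk["seq"], chunk["conv"] = target, None
    old_conv.clear()
    free_conv.append(old_conv)
    if old_seq is not None:
        old_seq.clear()
        free_seq.append(old_seq)
    return chunk
```

The mapping is switched only after the copy loop finishes, mirroring the order given in the text: copy first, then update the chunk mapping and free the buffer zone.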
| 108 | |
| 109 | Metadata Protection |
| 110 | =================== |
| 111 | |
| 112 | To protect metadata against corruption in case of sudden power loss or |
| 113 | system crash, 2 sets of metadata zones are used. One set, the primary |
| 114 | set, is used as the main metadata region, while the secondary set is |
| 115 | used as a staging area. Modified metadata is first written to the |
| 116 | secondary set and validated by updating the super block in the secondary |
| 117 | set; a generation counter is used to indicate that this set contains the |
| 118 | newest metadata. Once this operation completes, in-place metadata |
| 119 | block updates can be done in the primary metadata set. This ensures that |
| 120 | one of the sets is always consistent (all modifications committed or none |
| 121 | at all). Flush operations are used as a commit point. Upon reception of |
| 122 | a flush request, metadata modification activity is temporarily blocked |
| 123 | (for both incoming BIO processing and reclaim process) and all dirty |
| 124 | metadata blocks are staged and updated. Normal operation is then |
| 125 | resumed. Flushing metadata thus only temporarily delays write and |
| 126 | discard requests. Read requests can be processed concurrently while |
| 127 | a metadata flush is being executed. |
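The staging scheme can be illustrated with a small crash simulation. Everything below is a hypothetical model: the MetadataSet class, the crash_at parameter and the recovery rule (use the set with the higher generation, prefer the primary set on a tie) are assumptions made for this sketch, not a description of the kernel code.

```python
class Crash(Exception):
    """Simulated power loss in the middle of a commit."""


class MetadataSet:
    def __init__(self, blocks: dict):
        self.generation = 0          # stored in the set's super block
        self.blocks = dict(blocks)


def commit(primary, secondary, dirty: dict, crash_at=None):
    """Commit dirty blocks: stage in secondary, then update primary in place."""
    new_gen = primary.generation + 1
    for i, (nr, data) in enumerate(dirty.items()):
        if crash_at == "staging" and i == 1:
            raise Crash                   # secondary set left half written
        secondary.blocks[nr] = data
    secondary.generation = new_gen        # super block update validates staging
    for i, (nr, data) in enumerate(dirty.items()):
        if crash_at == "primary" and i == 1:
            raise Crash                   # primary set left half written
        primary.blocks[nr] = data
    primary.generation = new_gen


def recover(primary, secondary):
    """Pick the consistent set after a crash and resynchronize the other."""
    if secondary.generation > primary.generation:
        good, stale = secondary, primary  # staging completed: newest metadata
    else:
        good, stale = primary, secondary  # staging incomplete: keep old state
    stale.blocks = dict(good.blocks)
    stale.generation = good.generation
    return good
```

In this model, a crash during staging leaves the primary set untouched, so none of the modifications survive. A crash during the in-place update leaves the secondary set complete with a higher generation, so all of them survive.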
| 128 | |
| 129 | Usage |
| 130 | ===== |
| 131 | |
| 132 | A zoned block device must first be formatted using the dmzadm tool. This |
| 133 | will analyze the device zone configuration, determine where to place the |
| 134 | metadata sets on the device and initialize the metadata sets. |
| 135 | |
| 136 | Ex: |
| 137 | |
| 138 | dmzadm --format /dev/sdxx |
| 139 | |
| 140 | For a formatted device, the target can be created normally with the |
| 141 | dmsetup utility. The only parameter that dm-zoned requires is the |
| 142 | underlying zoned block device name. Ex: |
| 143 | |
| 144 | echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}` |