========
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

https://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as little as 5 zones will be used
internally for storing metadata and performing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata. dm-zoned can also use a regular block device together with the
zoned block device; in that case the regular block device will be split
logically into zones of the same size as the zones of the zoned block
device. These zones will be placed in front of the zones of the zoned
block device and will be handled just like conventional zones.

The zones of the device(s) are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may also be used for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the on-disk amount and position of metadata
blocks.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the size of the zones of the zoned
block device. The mapping table is indexed by chunk number and each
mapping entry indicates the zone number of the device storing the chunk
of data. Each mapping entry may also indicate the zone number of a
conventional zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is always valid only in the data zone mapping the
chunk or in the buffer zone of the chunk.

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk automatically
invalidates the same block in the sequential zone mapping the chunk. If
all blocks of the sequential zone become invalid, the zone is freed and
the chunk buffer zone becomes the primary zone mapping the chunk,
resulting in native random write performance similar to a regular block
device.

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or, if the chunk is buffered, from
its assigned buffer zone. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.

After some time, the limited number of conventional zones available may
be exhausted (all of them used to map chunks or to buffer sequential
zones) and unaligned writes to unbuffered chunks become impossible. To
avoid this situation, a reclaim process regularly scans used
conventional zones and tries to reclaim the least recently used zones by
copying the valid blocks of the buffer zone to a free sequential zone.
Once the copy completes, the chunk mapping is updated to point to the
sequential zone and the buffer zone is freed for reuse.

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block of the
secondary set; a generation counter is used to indicate that this set
contains the newest metadata. Once this operation completes, in-place
updates of metadata blocks can be done in the primary metadata set.
This ensures that one of the sets is always consistent (all
modifications committed or none at all). Flush operations are used as a
commit point: upon reception of a flush request, metadata modification
activity is temporarily blocked (for both incoming BIO processing and
the reclaim process) and all dirty metadata blocks are staged and
updated. Normal operation is then resumed. Flushing metadata thus only
temporarily delays write and discard requests. Read requests can be
processed concurrently while a metadata flush is being executed.
| 133 | |
Hannes Reinecke | bd5c403 | 2020-05-11 10:24:30 +0200 | [diff] [blame] | 134 | If a regular device is used in conjunction with the zoned block device, |
| 135 | a third set of metadata (without the zone bitmaps) is written to the |
| 136 | start of the zoned block device. This metadata has a generation counter of |
| 137 | '0' and will never be updated during normal operation; it just serves for |
| 138 | identification purposes. The first and second copy of the metadata |
| 139 | are located at the start of the regular block device. |
| 140 | |
Usage
=====

A zoned block device must first be formatted using the dmzadm tool.
This will analyze the device zone configuration, determine where to
place the metadata sets on the device, and initialize the metadata sets.

Ex::

	dmzadm --format /dev/sdxx


If two drives are to be used, both devices must be specified, with the
regular block device as the first device.

Ex::

	dmzadm --format /dev/sdxx /dev/sdyy

Formatted device(s) can also be started with the dmzadm utility.

Ex::

	dmzadm --start /dev/sdxx /dev/sdyy


Information about the internal layout and current usage of the zones
can be obtained with the 'status' callback from dmsetup.

Ex::

	dmsetup status /dev/dm-X

will return a line of the form::

	0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential

where <nr_zones> is the total number of zones, <nr_unmap_rnd> is the
number of unmapped (i.e. free) random zones, <nr_rnd> the total number
of random zones, <nr_unmap_seq> the number of unmapped sequential
zones, and <nr_seq> the total number of sequential zones.

Normally the reclaim process will be started once less than 50 percent
of the random zones are free. To start the reclaim process manually,
even before reaching this threshold, the 'dmsetup message' function can
be used.

Ex::

	dmsetup message /dev/dm-X 0 reclaim

will start the reclaim process and random zones will be moved to
sequential zones.