==============
Data Integrity
==============

1. Introduction
===============

Modern filesystems feature checksumming of data and metadata to
protect against data corruption. However, the detection of the
corruption is done at read time, which could potentially be months
after the data was written. At that point the original data that the
application tried to write is most likely lost.

The solution is to ensure that the disk is actually storing what the
application meant it to. Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O. The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order. For
some protection schemes, it also ensures that the I/O is written to
the right place on disk.

Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing. But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path. The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected. This
allows not only corruption prevention but also isolation of the point
of failure.

2. The Data Integrity Extensions
================================

As written, the protocol extensions only protect the path between
controller and storage device. However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD). We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.

The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector. The data + integrity metadata is stored
in 520 byte sectors on disk. Data + IMD are interleaved when
transferred between the controller and target. The T13 proposal is
similar.

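As a rough illustration of the 8 bytes appended to each sector, the
sketch below packs and unpacks one protection tuple using the T10 DIF
layout: a 16-bit guard tag (the checksum), a 16-bit application tag and
a 32-bit reference tag (the incrementing counter), all big-endian. The
names ``pi_tuple``, ``pi_pack()`` and ``pi_unpack()`` are hypothetical
and exist only for this example; this is a userspace model, not kernel
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PI_TUPLE_SIZE 8   /* bytes of protection information per sector */

/* Hypothetical host-side view of one protection tuple. */
struct pi_tuple {
    uint16_t guard_tag;   /* checksum of the sector's data */
    uint16_t app_tag;     /* application tag */
    uint32_t ref_tag;     /* reference tag (the incrementing counter) */
};

/* Serialize a tuple into its 8-byte big-endian form. */
static void pi_pack(const struct pi_tuple *t, uint8_t out[PI_TUPLE_SIZE])
{
    assert(t != NULL && out != NULL);
    out[0] = (uint8_t)(t->guard_tag >> 8);
    out[1] = (uint8_t)(t->guard_tag & 0xff);
    out[2] = (uint8_t)(t->app_tag >> 8);
    out[3] = (uint8_t)(t->app_tag & 0xff);
    out[4] = (uint8_t)(t->ref_tag >> 24);
    out[5] = (uint8_t)((t->ref_tag >> 16) & 0xff);
    out[6] = (uint8_t)((t->ref_tag >> 8) & 0xff);
    out[7] = (uint8_t)(t->ref_tag & 0xff);
}

/* Parse the 8-byte big-endian form back into a tuple. */
static struct pi_tuple pi_unpack(const uint8_t in[PI_TUPLE_SIZE])
{
    struct pi_tuple t;

    assert(in != NULL);
    t.guard_tag = (uint16_t)((in[0] << 8) | in[1]);
    t.app_tag = (uint16_t)((in[2] << 8) | in[3]);
    t.ref_tag = ((uint32_t)in[4] << 24) | ((uint32_t)in[5] << 16) |
                ((uint32_t)in[6] << 8) | (uint32_t)in[7];
    return t;
}
```

Serializing byte by byte keeps the example independent of host
endianness and struct padding.
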
Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.

The controller will interleave the buffers on write and split them on
read. This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.

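The interleave-on-write and split-on-read step performed by the
controller can be modelled in a few lines of userspace C. This is only
a sketch assuming 512-byte sectors and 8-byte tuples; the function
names ``pi_interleave()`` and ``pi_split()`` are made up for
illustration and do not correspond to any kernel or firmware interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512                      /* data bytes per sector */
#define PI_SIZE     8                        /* metadata bytes per sector */
#define WIRE_SECTOR (SECTOR_SIZE + PI_SIZE)  /* 520 bytes on the wire */

/* Write path: merge separate data and metadata buffers into 520-byte sectors. */
static void pi_interleave(const uint8_t *data, const uint8_t *imd,
                          uint8_t *wire, size_t nr_sectors)
{
    assert(data != NULL && imd != NULL && wire != NULL);
    for (size_t i = 0; i < nr_sectors; i++) {
        uint8_t *dst = wire + i * WIRE_SECTOR;

        memcpy(dst, data + i * SECTOR_SIZE, SECTOR_SIZE);
        memcpy(dst + SECTOR_SIZE, imd + i * PI_SIZE, PI_SIZE);
    }
}

/* Read path: split 520-byte sectors back into data and metadata buffers. */
static void pi_split(const uint8_t *wire, uint8_t *data, uint8_t *imd,
                     size_t nr_sectors)
{
    assert(wire != NULL && data != NULL && imd != NULL);
    for (size_t i = 0; i < nr_sectors; i++) {
        const uint8_t *src = wire + i * WIRE_SECTOR;

        memcpy(data + i * SECTOR_SIZE, src, SECTOR_SIZE);
        memcpy(imd + i * PI_SIZE, src + SECTOR_SIZE, PI_SIZE);
    }
}
```

Because the host only ever sees the two separate buffers, the data
buffer keeps its usual 512-byte granularity.
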
Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software. Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads. Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system. Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa. This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).

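To make the cost difference between the two checksums concrete, here is
a minimal userspace sketch of both. ``crc16_t10dif()`` is a plain
bitwise CRC-16 using the T10 DIF polynomial 0x8BB7 with an initial
value of zero; ``ip_checksum()`` is the 16-bit one's complement
Internet checksum as described in RFC 1071. Both function names are
chosen for this example, and neither routine is optimized the way a
kernel implementation would be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16, polynomial 0x8BB7, initial value 0, no reflection. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x8BB7);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* One's complement Internet checksum; len must be even (true for sectors). */
static uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    assert(buf != NULL || len == 0);
    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    while (sum >> 16)                 /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The CRC needs eight shift-and-XOR steps per byte (or a lookup table),
whereas the IP checksum is one addition per 16-bit word, which is why
it is so much cheaper to compute in software.
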
The IP checksum is weaker than the CRC in terms of detecting bit
errors. However, the strength is really in the separation of the data
buffers and the integrity metadata. These two distinct buffers must
match up for an I/O to complete.

The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions. As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.

3. Kernel Changes
=================

The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.

The advantage of the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device. However, at the same time this is also the biggest
disadvantage. It means that the protection information must be in a
format that can be understood by the disk.

Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing. The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.

However, this level of detail is required when preparing the
protection information to send to a disk. Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware of
whether it is accessing a SCSI or SATA disk.

The data integrity support implemented in Linux attempts to hide this
from the application. As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.

The current implementation allows the block layer to automatically
generate the protection information for any I/O. Eventually the
intent is to move the integrity metadata calculation to userspace for
user data. Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.

Some storage devices allow each hardware sector to be tagged with a
16-bit value. The owner of this tag space is the owner of the block
device, i.e. the filesystem in most cases. The filesystem can use
this extra space to tag sectors as it sees fit. Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving. This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.

This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space. A passthrough
interface for this is being worked on.


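The tag interleaving described above can be pictured as scattering one
contiguous tag buffer across the 16-bit tag fields of consecutive
per-sector tuples. The sketch below assumes 8-byte tuples with the tag
in bytes 2 and 3 (the application tag position in the DIF layout); the
helper names ``tags_set()`` and ``tags_get()`` are hypothetical and are
not the kernel's interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PI_SIZE        8   /* bytes of protection information per sector */
#define APP_TAG_OFFSET 2   /* tag follows the 2-byte checksum (assumed layout) */
#define TAG_SIZE       2   /* tag bytes available per hardware sector */

/* Scatter a contiguous tag buffer across the tag fields of the tuples. */
static void tags_set(uint8_t *imd, const uint8_t *tags, size_t nr_sectors)
{
    assert(imd != NULL && tags != NULL);
    for (size_t i = 0; i < nr_sectors; i++)
        memcpy(imd + i * PI_SIZE + APP_TAG_OFFSET,
               tags + i * TAG_SIZE, TAG_SIZE);
}

/* Gather the tag fields of the tuples back into a contiguous buffer. */
static void tags_get(const uint8_t *imd, uint8_t *tags, size_t nr_sectors)
{
    assert(imd != NULL && tags != NULL);
    for (size_t i = 0; i < nr_sectors; i++)
        memcpy(tags + i * TAG_SIZE,
               imd + i * PI_SIZE + APP_TAG_OFFSET, TAG_SIZE);
}
```

With 512-byte sectors, a 4KB filesystem block spans 8 tuples, so
``tags_set()`` called with ``nr_sectors = 8`` stores 16 bytes, which is
the 8*16 bits mentioned above.
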
4. Block Layer Implementation Details
=====================================

4.1 Bio
-------

The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled. bio_integrity(bio) returns a
pointer to a struct bip which contains the bio integrity payload.
Essentially a bip is a trimmed down struct bio which holds a bio_vec
containing the integrity metadata and the required housekeeping
information (bvec pool, vector count, etc.).

A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio). This will allocate and attach the
bip to the bio.

Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().

bio_free() will automatically free the bip.


4.2 Block Device
----------------

Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity). This optional profile is registered
with the block layer using blk_integrity_register().

The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.

Layered block devices will need to pick a profile that's appropriate
for all subdevices. blk_integrity_compare() can help with that. DM
and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
will require extra work due to the application tag.


5.0 Block Layer Integrity API
=============================

5.1 Normal Filesystem
---------------------

The normal filesystem is unaware that the underlying block device
is capable of sending/receiving integrity metadata. The IMD will
be automatically generated by the block layer at submit_bio() time
in case of a WRITE. A READ request will cause the I/O integrity
to be verified upon completion.

IMD generation and verification can be toggled using the::

    /sys/block/<bdev>/integrity/write_generate

and::

    /sys/block/<bdev>/integrity/read_verify

flags.


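For completeness, a small userspace helper that toggles one of these
flags might look like the sketch below. It assumes the flags accept
``0`` and ``1``; the function names ``integrity_attr_path()`` and
``integrity_flag_set()`` are invented for this example, and the device
name is whatever appears under /sys/block on the system at hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/sys/block/<bdev>/integrity/<attr>" in buf.
 * Returns 0 on success, -1 if the path does not fit.
 */
static int integrity_attr_path(char *buf, size_t size,
                               const char *bdev, const char *attr)
{
    int n;

    assert(buf != NULL && bdev != NULL && attr != NULL);
    n = snprintf(buf, size, "/sys/block/%s/integrity/%s", bdev, attr);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Write "1" or "0" to the given flag (assumed value format).
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int integrity_flag_set(const char *bdev, const char *attr, int enable)
{
    char path[256];
    FILE *f;
    int rc;

    if (integrity_attr_path(path, sizeof(path), bdev, attr) != 0)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    rc = (fputs(enable ? "1\n" : "0\n", f) == EOF) ? -1 : 0;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

A call such as ``integrity_flag_set("sda", "read_verify", 0)`` would
then turn off verification on reads for that device, provided the
device exposes an integrity profile and the caller has permission to
write to sysfs.
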
5.2 Integrity-Aware Filesystem
------------------------------

A filesystem that is integrity-aware can prepare I/Os with IMD
attached. It can also use the application tag space if this is
supported by the block device.


`bool bio_integrity_prep(bio);`

To generate IMD for WRITE and to set up buffers for READ, the
filesystem must call bio_integrity_prep(bio).

Prior to calling this function, the bio data direction and start
sector must be set, and the bio should have all data pages
added. It is up to the caller to ensure that the bio does not
change while I/O is in progress.
If preparation fails for some reason, the bio is completed with an
error.


5.3 Passing Existing Integrity Metadata
---------------------------------------

Filesystems that either generate their own integrity metadata or
are capable of transferring IMD from user space can use the
following calls:


`struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);`

Allocates the bio integrity payload and hangs it off of the bio.
nr_pages indicates how many pages of protection data need to be
stored in the integrity bio_vec list (similar to bio_alloc()).

The integrity payload will be freed at bio_free() time.


`int bio_integrity_add_page(bio, page, len, offset);`

Attaches a page containing integrity metadata to an existing
bio. The bio must have an existing bip,
i.e. bio_integrity_alloc() must have been called. For a WRITE,
the integrity metadata in the pages must be in a format
understood by the target device with the notable exception that
the sector numbers will be remapped as the request traverses the
I/O stack. This implies that the pages added using this call
will be modified during I/O! The first reference tag in the
integrity metadata must have a value of bip->bip_sector.

Pages can be added using bio_integrity_add_page() as long as
there is room in the bip bio_vec array (nr_pages).

Upon completion of a READ operation, the attached pages will
contain the integrity metadata received from the storage device.
It is up to the receiver to process them and verify data
integrity upon completion.


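The remapping of sector numbers mentioned above is the reason the
attached pages get modified during I/O. As an illustration only, the
following userspace sketch rewrites the reference tags of a run of
8-byte tuples from a "virtual" starting value (playing the role of
bip->bip_sector) to the "physical" start sector of the request. The
tuple layout (big-endian reference tag in bytes 4 to 7) and all
function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PI_SIZE        8   /* bytes of protection information per sector */
#define REF_TAG_OFFSET 4   /* reference tag occupies bytes 4..7 (assumed) */

/* Read the big-endian reference tag of one tuple. */
static uint32_t ref_tag_load(const uint8_t *tuple)
{
    const uint8_t *p = tuple + REF_TAG_OFFSET;

    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write the big-endian reference tag of one tuple. */
static void ref_tag_store(uint8_t *tuple, uint32_t value)
{
    uint8_t *p = tuple + REF_TAG_OFFSET;

    p[0] = (uint8_t)(value >> 24);
    p[1] = (uint8_t)((value >> 16) & 0xff);
    p[2] = (uint8_t)((value >> 8) & 0xff);
    p[3] = (uint8_t)(value & 0xff);
}

/*
 * Rewrite every reference tag that holds the expected virtual value so
 * that the run counts up from phys_start instead. Tuples whose tag does
 * not match are left alone. Returns the number of tuples remapped.
 */
static size_t ref_tag_remap(uint8_t *imd, size_t nr_sectors,
                            uint32_t virt_start, uint32_t phys_start)
{
    size_t remapped = 0;

    assert(imd != NULL || nr_sectors == 0);
    for (size_t i = 0; i < nr_sectors; i++) {
        uint8_t *tuple = imd + i * PI_SIZE;

        if (ref_tag_load(tuple) == virt_start + (uint32_t)i) {
            ref_tag_store(tuple, phys_start + (uint32_t)i);
            remapped++;
        }
    }
    return remapped;
}
```

On a READ the same walk would run in the opposite direction, turning
physical reference tags back into the values the submitter expects.
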
5.4 Registering A Block Device As Capable Of Exchanging Integrity Metadata
--------------------------------------------------------------------------

To enable integrity exchange on a block device the gendisk must be
registered as capable:

`int blk_integrity_register(gendisk, blk_integrity);`

The blk_integrity struct is a template and should contain the
following::

    static struct blk_integrity my_profile = {
        .name                   = "STANDARDSBODY-TYPE-VARIANT-CSUM",
        .generate_fn            = my_generate_fn,
        .verify_fn              = my_verify_fn,
        .tuple_size             = sizeof(struct my_tuple_size),
        .tag_size               = <tag bytes per hw sector>,
    };

'name' is a text string which will be visible in sysfs. This is
part of the userland API so choose it carefully and never change
it. The format is standards body-type-variant-checksum.
E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.

'generate_fn' generates appropriate integrity metadata (for WRITE).

'verify_fn' verifies that the data buffer matches the integrity
metadata.

'tuple_size' must be set to match the size of the integrity
metadata per sector. I.e. 8 for DIF and EPP.

'tag_size' must be set to identify how many bytes of tag space
are available per hardware sector. For DIF this is either 2 or
0 depending on the value of the Control Mode Page ATO bit.

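What a generate/verify pair has to do can be shown with a userspace
model. The sketch below is not the kernel callback interface: the real
generate_fn and verify_fn take kernel-internal arguments that this
document does not spell out, so the plain-buffer signatures, the names
``model_generate()`` and ``model_verify()``, and the choice of a
CRC-based 8-byte tuple with a zero application tag are all assumptions
made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512   /* data bytes per sector */
#define TUPLE_SIZE  8     /* matches tuple_size = 8 for DIF */

/* Bitwise CRC-16, polynomial 0x8BB7, initial value 0. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Fill one tuple per sector: checksum, zero tag, counter starting at seed. */
static void model_generate(const uint8_t *data, uint8_t *imd,
                           size_t nr_sectors, uint32_t seed)
{
    assert(data != NULL && imd != NULL);
    for (size_t i = 0; i < nr_sectors; i++) {
        uint16_t crc = crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE);
        uint32_t ref = seed + (uint32_t)i;
        uint8_t *t = imd + i * TUPLE_SIZE;

        t[0] = (uint8_t)(crc >> 8);
        t[1] = (uint8_t)(crc & 0xff);
        t[2] = 0;
        t[3] = 0;
        t[4] = (uint8_t)(ref >> 24);
        t[5] = (uint8_t)((ref >> 16) & 0xff);
        t[6] = (uint8_t)((ref >> 8) & 0xff);
        t[7] = (uint8_t)(ref & 0xff);
    }
}

/* Return 0 if every sector matches its tuple, -1 on the first mismatch. */
static int model_verify(const uint8_t *data, const uint8_t *imd,
                        size_t nr_sectors, uint32_t seed)
{
    assert(data != NULL && imd != NULL);
    for (size_t i = 0; i < nr_sectors; i++) {
        const uint8_t *t = imd + i * TUPLE_SIZE;
        uint16_t crc = crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE);
        uint32_t ref = ((uint32_t)t[4] << 24) | ((uint32_t)t[5] << 16) |
                       ((uint32_t)t[6] << 8) | (uint32_t)t[7];

        if ((uint16_t)((t[0] << 8) | t[1]) != crc)
            return -1;    /* data does not match the checksum */
        if (ref != seed + (uint32_t)i)
            return -1;    /* sector is out of order or misplaced */
    }
    return 0;
}
```

Checking the counter as well as the checksum is what lets a verifier
catch sectors that are intact but were written to, or read from, the
wrong location.
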
----------------------------------------------------------------------

2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>