=====================================================
Notes on the Generic Block Layer Rewrite in Linux 2.5
=====================================================

.. note::

    It seems that there is a lot of outdated material here. This seems
    to have been written somewhat as a task list. Yet, eventually,
    something here might still be useful.

Notes Written on Jan 15, 2002:

    - Jens Axboe <jens.axboe@oracle.com>
    - Suparna Bhattacharya <suparna@in.ibm.com>

Last Updated May 2, 2002

September 2003: Updated I/O Scheduler portions

    - Nick Piggin <npiggin@kernel.dk>

Introduction
============

These are some notes describing some aspects of the 2.5 block layer in the
context of the bio rewrite. The idea is to bring out some of the key
changes and a glimpse of the rationale behind those changes.

Please mail corrections & suggestions to suparna@in.ibm.com.

Credits
=======

2.5 bio rewrite:

    - Jens Axboe <jens.axboe@oracle.com>

Many aspects of the generic block layer redesign were driven by and evolved
over discussions, prior patches and the collective experience of several
people. See sections 8 and 9 for a list of some related references.

The following people helped with review comments and inputs for this
document:

    - Christoph Hellwig <hch@infradead.org>
    - Arjan van de Ven <arjanv@redhat.com>
    - Randy Dunlap <rdunlap@xenotime.net>
    - Andre Hedrick <andre@linux-ide.org>

The following people helped with fixes/contributions to the bio patches
while it was still work-in-progress:

    - David S. Miller <davem@redhat.com>


.. Description of Contents:

   1. Scope for tuning of logic to various needs
     1.1 Tuning based on device or low level driver capabilities
       - Per-queue parameters
       - Highmem I/O support
       - I/O scheduler modularization
     1.2 Tuning based on high level requirements/capabilities
       1.2.1 Request Priority/Latency
     1.3 Direct access/bypass to lower layers for diagnostics and special
         device operations
       1.3.1 Pre-built commands
   2. New flexible and generic but minimalist i/o structure or descriptor
      (instead of using buffer heads at the i/o layer)
     2.1 Requirements/Goals addressed
     2.2 The bio struct in detail (multi-page io unit)
     2.3 Changes in the request structure
   3. Using bios
     3.1 Setup/teardown (allocation, splitting)
     3.2 Generic bio helper routines
       3.2.1 Traversing segments and completion units in a request
       3.2.2 Setting up DMA scatterlists
       3.2.3 I/O completion
       3.2.4 Implications for drivers that do not interpret bios (don't handle
             multiple segments)
     3.3 I/O submission
   4. The I/O scheduler
   5. Scalability related changes
     5.1 Granular locking: Removal of io_request_lock
     5.2 Prepare for transition to 64 bit sector_t
   6. Other Changes/Implications
     6.1 Partition re-mapping handled by the generic block layer
   7. A few tips on migration of older drivers
   8. A list of prior/related/impacted patches/ideas
   9. Other References/Discussion Threads


Bio Notes
=========

Let us discuss the changes in the context of how some overall goals for the
block layer are addressed.

1. Scope for tuning the generic logic to satisfy various requirements
=====================================================================

The block layer design supports adaptable abstractions to handle common
processing with the ability to tune the logic to an appropriate extent
depending on the nature of the device and the requirements of the caller.
One of the objectives of the rewrite was to increase the degree of tunability
and to enable higher level code to utilize underlying device/driver
capabilities to the maximum extent for better i/o performance. This is
important especially in the light of ever improving hardware capabilities
and application/middleware software designed to take advantage of these
capabilities.

1.1 Tuning based on low level device / driver capabilities
----------------------------------------------------------

Sophisticated devices with large built-in caches, intelligent i/o scheduling
optimizations, high memory DMA support, etc. may find some of the
generic processing an overhead, while for less capable devices the
generic functionality is essential for performance or correctness reasons.
Knowledge of some of the capabilities or parameters of the device should be
used at the generic block layer to take the right decisions on
behalf of the driver.

How is this achieved?

Tuning at a per-queue level:

i. Per-queue limits/values exported to the generic layer by the driver

Various parameters that the generic i/o scheduler logic uses are set at
a per-queue level (e.g. maximum request size, maximum number of segments in
a scatter-gather list, logical block size).

Some parameters that were earlier available as global arrays indexed by
major/minor are now directly associated with the queue. Some of these may
move into the block device structure in the future. Some characteristics
have been incorporated into a queue flags field rather than separate fields
in themselves. There are blk_queue_xxx functions to set the parameters,
rather than updating the fields directly.

Some new queue property settings:

    blk_queue_bounce_limit(q, u64 dma_address)
        Enable I/O to highmem pages, dma_address being the
        limit. No highmem default.

    blk_queue_max_sectors(q, max_sectors)
        Sets two variables that limit the size of the request.

        - The request queue's max_sectors, which is a soft size in
          units of 512 byte sectors, and could be dynamically varied
          by the core kernel.

        - The request queue's max_hw_sectors, which is a hard limit
          and reflects the maximum size request a driver can handle
          in units of 512 byte sectors.

        The default for both max_sectors and max_hw_sectors is
        255. The upper limit of max_sectors is 1024.

    blk_queue_max_phys_segments(q, max_segments)
        Maximum physical segments you can handle in a request. 128
        default (driver limit). (See 3.2.2)

    blk_queue_max_hw_segments(q, max_segments)
        Maximum dma segments the hardware can handle in a request. 128
        default (host adapter limit, after dma remapping).
        (See 3.2.2)

    blk_queue_max_segment_size(q, max_seg_size)
        Maximum size of a clustered segment, 64kB default.

    blk_queue_logical_block_size(q, logical_block_size)
        Lowest possible sector size that the hardware can operate
        on, 512 bytes default.

New queue flags:

    - QUEUE_FLAG_CLUSTER (see 3.2.2)
    - QUEUE_FLAG_QUEUED (see 3.2.4)
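
As a rough illustration of how the per-queue limits listed above fit together,
the following is a minimal userspace sketch. It is not the kernel code: the
struct and function names (queue_limits_model, limits_set_max_sectors,
request_fits) are hypothetical, the defaults are simply the ones quoted in the
list, and the clamping of the soft limit to 1024 is an assumption derived from
the description of blk_queue_max_sectors().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-queue limits; NOT struct request_queue. */
struct queue_limits_model {
    unsigned int max_sectors;        /* soft limit, 512-byte sectors */
    unsigned int max_hw_sectors;     /* hard limit, 512-byte sectors */
    unsigned int max_phys_segments;  /* driver limit */
    unsigned int max_hw_segments;    /* host adapter limit */
    unsigned int max_segment_size;   /* bytes */
    unsigned int logical_block_size; /* bytes */
};

/* Defaults as quoted in the text above. */
static void limits_init_defaults(struct queue_limits_model *l)
{
    l->max_sectors = 255;
    l->max_hw_sectors = 255;
    l->max_phys_segments = 128;
    l->max_hw_segments = 128;
    l->max_segment_size = 64 * 1024;
    l->logical_block_size = 512;
}

/*
 * Setter in the spirit of blk_queue_max_sectors(): the hard limit records
 * what the driver can handle, the soft limit is capped at 1024 (assumed).
 */
static void limits_set_max_sectors(struct queue_limits_model *l,
                                   unsigned int max_sectors)
{
    assert(max_sectors > 0);
    l->max_hw_sectors = max_sectors;
    l->max_sectors = max_sectors > 1024 ? 1024 : max_sectors;
}

/* Would a request of this size and segment count respect the limits? */
static bool request_fits(const struct queue_limits_model *l,
                         unsigned int nr_sectors, unsigned int nr_segments)
{
    return nr_sectors <= l->max_sectors &&
           nr_segments <= l->max_phys_segments &&
           nr_segments <= l->max_hw_segments;
}
```

The point of the sketch is only that the generic layer consults values the
driver exported once, instead of the driver re-checking every request.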


ii. High-mem i/o capabilities are now considered the default

The generic bounce buffer logic, present in 2.4, where the block layer would
by default copyin/out i/o requests on high-memory buffers to low-memory buffers
assuming that the driver wouldn't be able to handle it directly, has been
changed in 2.5. The bounce logic is now applied only for memory ranges
for which the device cannot handle i/o. A driver can specify this by
setting the queue bounce limit for the request queue for the device
(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
where a device is capable of handling high memory i/o.

In order to enable high-memory i/o where the device is capable of supporting
it, the pci dma mapping routines and associated data structures have now been
modified to accomplish a direct page -> bus translation, without requiring
a virtual address mapping (unlike the earlier scheme of virtual address
-> bus translation). So this works uniformly for high-memory pages (which
do not have a corresponding kernel virtual address space mapping) and
low-memory pages.

Note: Please refer to Documentation/core-api/dma-api-howto.rst for a discussion
on PCI high mem DMA aspects and mapping of scatter gather lists, and support
for 64 bit PCI.

Special handling is required only for cases where i/o needs to happen on
pages at physical memory addresses beyond what the device can support. In these
cases, a bounce bio representing a buffer from the supported memory range
is used for performing the i/o with copyin/copyout as needed depending on
the type of the operation. For example, in case of a read operation, the
data read has to be copied to the original buffer on i/o completion, so a
callback routine is set up to do this, while for write, the data is copied
from the original buffer to the bounce buffer prior to issuing the
operation. Since an original buffer may be in a high memory area that's not
mapped in kernel virtual addr, a kmap operation may be required for
performing the copy, and special care may be needed in the completion path
as it may not be in irq context. Special care is also required (by way of
GFP flags) when allocating bounce buffers, to avoid certain highmem
deadlock possibilities.
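
The bounce decision and the direction-dependent copy described above can be
sketched in plain userspace C. This is only a model under stated assumptions:
the names (needs_bounce, bounce_model, bounce_prepare, bounce_complete) are
hypothetical, physical addresses are plain integers, and the kmap and GFP
concerns mentioned in the text are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum io_dir_model { IO_READ_MODEL, IO_WRITE_MODEL };

/*
 * A buffer needs bouncing only if some byte of it lies above the highest
 * address the device can reach (the queue bounce limit).
 */
static bool needs_bounce(uint64_t phys_addr, uint64_t len,
                         uint64_t bounce_limit)
{
    assert(len > 0);
    return phys_addr + len - 1 > bounce_limit;
}

/* Hypothetical pairing of an original buffer with its bounce buffer. */
struct bounce_model {
    unsigned char *orig;    /* buffer the device cannot reach */
    unsigned char *bounce;  /* buffer inside the supported range */
    size_t len;
    enum io_dir_model dir;
};

/* Before issuing the i/o: a write copies the data into the bounce buffer. */
static void bounce_prepare(struct bounce_model *b)
{
    if (b->dir == IO_WRITE_MODEL)
        memcpy(b->bounce, b->orig, b->len);
}

/* Completion callback: a read copies the data back to the original buffer. */
static void bounce_complete(struct bounce_model *b)
{
    if (b->dir == IO_READ_MODEL)
        memcpy(b->orig, b->bounce, b->len);
}
```

A device limited to 32-bit addressing would correspond to a limit of
0xFFFFFFFF in this model; buffers wholly below that address bypass the copy.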

It is also possible that a bounce buffer may be allocated from high-memory
area that's not mapped in kernel virtual addr, but within the range that the
device can use directly; so the bounce page may need to be kmapped during
copy operations. [Note: This does not hold in the current implementation,
though]

There are some situations when pages from high memory may need to
be kmapped, even if bounce buffers are not necessary. For example a device
may need to abort DMA operations and revert to PIO for the transfer, in
which case a virtual mapping of the page is required. For SCSI it is also
done in some scenarios where the low level driver cannot be trusted to
handle a single sg entry correctly. The driver is expected to perform the
kmaps as needed on such occasions as appropriate. A driver could also use
the blk_queue_bounce() routine on its own to bounce highmem i/o to low
memory for specific requests if so desired.

iii. The i/o scheduler algorithm itself can be replaced/set as appropriate

As in 2.4, it is possible to plug in a brand new i/o scheduler for a particular
queue or pick from (copy) existing generic schedulers and replace/override
certain portions of it. The 2.5 rewrite provides improved modularization
of the i/o scheduler. There are more pluggable callbacks, e.g. for init,
add request, extract request, which makes it possible to abstract specific
i/o scheduling algorithm aspects and details outside of the generic loop.
It also makes it possible to completely hide the implementation details of
the i/o scheduler from block drivers.

I/O scheduler wrappers are to be used instead of accessing the queue directly.
See section 4 (The I/O scheduler) for details.
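
The idea of pluggable callbacks hidden behind wrappers can be illustrated with
a small userspace sketch. Everything here is hypothetical (sched_ops_model,
sched_attach, the FIFO and sector-sorted policies); it does not reproduce the
kernel's elevator interface, it only shows how a table of init/add/extract
callbacks lets the policy change without the caller touching the queue
directly.

```c
#include <assert.h>
#include <stddef.h>

struct rq_model {
    unsigned long sector;
    struct rq_model *next;
};

struct sched_model;

/* Hypothetical callback table: init, add request, extract request. */
struct sched_ops_model {
    void (*init)(struct sched_model *s);
    void (*add_request)(struct sched_model *s, struct rq_model *rq);
    struct rq_model *(*next_request)(struct sched_model *s);
};

struct sched_model {
    const struct sched_ops_model *ops;
    struct rq_model *head;  /* private to the scheduler */
};

static void list_init(struct sched_model *s)
{
    s->head = NULL;
}

/* FIFO policy: append at the tail. */
static void fifo_add(struct sched_model *s, struct rq_model *rq)
{
    struct rq_model **pp = &s->head;

    while (*pp)
        pp = &(*pp)->next;
    rq->next = NULL;
    *pp = rq;
}

/* Sorted policy: keep the list in ascending sector order. */
static void sorted_add(struct sched_model *s, struct rq_model *rq)
{
    struct rq_model **pp = &s->head;

    while (*pp && (*pp)->sector <= rq->sector)
        pp = &(*pp)->next;
    rq->next = *pp;
    *pp = rq;
}

/* Both policies extract from the head of the list. */
static struct rq_model *list_next(struct sched_model *s)
{
    struct rq_model *rq = s->head;

    if (rq) {
        s->head = rq->next;
        rq->next = NULL;
    }
    return rq;
}

static const struct sched_ops_model fifo_ops = { list_init, fifo_add, list_next };
static const struct sched_ops_model sorted_ops = { list_init, sorted_add, list_next };

/* Wrappers: the only entry points a driver in this model would use. */
static void sched_attach(struct sched_model *s, const struct sched_ops_model *ops)
{
    assert(ops != NULL);
    s->ops = ops;
    s->ops->init(s);
}

static void sched_add_request(struct sched_model *s, struct rq_model *rq)
{
    s->ops->add_request(s, rq);
}

static struct rq_model *sched_next_request(struct sched_model *s)
{
    return s->ops->next_request(s);
}
```

Swapping fifo_ops for sorted_ops changes the service order while the calling
code stays the same, which is the property the wrappers are meant to provide.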

1.2 Tuning Based on High level code capabilities
------------------------------------------------

i. Application capabilities for raw i/o

This comes from some of the high-performance database/middleware
requirements where an application prefers to make its own i/o scheduling
decisions based on an understanding of the access patterns and i/o
characteristics.

ii. High performance filesystems or other higher level kernel code's
capabilities

Kernel components like filesystems could also take their own i/o scheduling
decisions for optimizing performance. Journalling filesystems may need
some control over i/o ordering.

What kind of support exists at the generic block layer for this?

The flags and rw fields in the bio structure can be used for some tuning
from above, e.g. indicating that an i/o is just a readahead request, or priority
settings (currently unused). As far as user applications are concerned they
would need an additional mechanism either via open flags or ioctls, or some
other upper level mechanism to communicate such settings to block.

1.2.1 Request Priority/Latency
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Todo/Under discussion::

  Arjan's proposed request priority scheme allows higher levels some broad
  control (high/med/low) over the priority of an i/o request vs other pending
  requests in the queue. For example it allows reads for bringing in an
  executable page on demand to be given a higher priority over pending write
  requests which haven't aged too much on the queue. Potentially this priority
  could even be exposed to applications in some manner, providing higher level
  tunability. Time based aging avoids starvation of lower priority
  requests. Some bits in the bi_opf flags field in the bio structure are
  intended to be used for this priority information.
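
Since the scheme above is only a proposal, the following is purely a
hypothetical sketch of how a three-level priority combined with time based
aging might order two pending requests. The names, the age threshold
(MAX_AGE_MODEL) and the tie-breaking rule are all assumptions made for
illustration; none of them come from an actual implementation.

```c
#include <assert.h>

enum prio_model { PRIO_LOW_MODEL = 0, PRIO_MED_MODEL = 1, PRIO_HIGH_MODEL = 2 };

struct prio_rq_model {
    enum prio_model prio;
    unsigned long queued_at;  /* time the request entered the queue */
};

/* Assumed threshold after which a request counts as "aged". */
#define MAX_AGE_MODEL 100UL

/*
 * Returns nonzero if request a should be serviced before request b at
 * time now. An aged request beats a non-aged one regardless of priority,
 * which is what prevents starvation of low priority requests.
 */
static int serve_before(const struct prio_rq_model *a,
                        const struct prio_rq_model *b, unsigned long now)
{
    int a_aged = (now - a->queued_at) >= MAX_AGE_MODEL;
    int b_aged = (now - b->queued_at) >= MAX_AGE_MODEL;

    if (a_aged != b_aged)
        return a_aged;
    if (a->prio != b->prio)
        return a->prio > b->prio;
    return a->queued_at <= b->queued_at;  /* oldest first among equals */
}
```

In this model a high priority read overtakes a pending low priority write
until the write has waited longer than the threshold, after which the write
goes first.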


1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
-----------------------------------------------------------------------

(e.g. Diagnostics, Systems Management)

There are situations where high-level code needs to have direct access to
the low level device capabilities or requires the ability to issue commands
to the device bypassing some of the intermediate i/o layers.
These could, for example, be special control commands issued through ioctl
interfaces, or could be raw read/write commands that stress the drive's
capabilities for certain kinds of fitness tests. Having direct interfaces at
multiple levels without having to pass through upper layers makes
it possible to perform bottom up validation of the i/o path, layer by
layer, starting from the media.

The normal i/o submission interfaces, e.g. submit_bio, could be bypassed
for the specially crafted requests that such ioctl or diagnostics
interfaces would typically use. The elevator add_request routine
can instead be used to insert such requests directly in the queue or,
preferably, the blk_do_rq routine can be used to place the request on the
queue and wait for completion. Alternatively, sometimes the caller might just
invoke a lower level driver specific interface with the request as a
parameter.

If the request is a means for passing on special information associated with
the command, then such information is associated with the request->special
field (rather than misuse the request->buffer field which is meant for the
request data buffer's virtual mapping).

For passing request data, the caller must build up a bio descriptor
representing the concerned memory buffer if the underlying driver interprets
bio segments or uses the block layer end*request* functions for i/o
completion. Alternatively one could directly use the request->buffer field to
specify the virtual address of the buffer, if the driver expects buffer
addresses passed in this way and ignores bio entries for the request type
involved. In the latter case, the driver would modify and manage the
request->buffer, request->sector and request->nr_sectors or
request->current_nr_sectors fields itself rather than using the block layer
end_request or end_that_request_first completion interfaces.
(See 2.3 or Documentation/block/request.rst for a brief explanation of
the request structure fields)
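
For the bio-less case just described, where the driver manages the request
fields by hand, the bookkeeping on a partial transfer might look roughly like
the sketch below. It is a userspace model with hypothetical names
(direct_rq_model, direct_rq_advance); it assumes a single contiguous buffer
and 512 byte sectors, so current_nr_sectors simply tracks nr_sectors.

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SIZE_MODEL 512u

/* Hypothetical subset of the request fields named in the text. */
struct direct_rq_model {
    char *buffer;                    /* virtual address of the data buffer */
    unsigned long sector;            /* next sector to transfer */
    unsigned long nr_sectors;        /* sectors left in the request */
    unsigned long current_nr_sectors;/* sectors left in the current chunk */
};

/*
 * Account for 'done' sectors having been transferred.
 * Returns 1 if the request still has sectors pending, 0 if it is finished.
 */
static int direct_rq_advance(struct direct_rq_model *rq, unsigned long done)
{
    assert(done <= rq->nr_sectors);

    rq->buffer += (size_t)done * SECTOR_SIZE_MODEL;
    rq->sector += done;
    rq->nr_sectors -= done;
    /* single contiguous buffer assumed: the chunk is the whole remainder */
    rq->current_nr_sectors = rq->nr_sectors;

    return rq->nr_sectors != 0;
}
```

This is the work that the block layer completion helpers would otherwise do
for a request that carries bios.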

::

  [TBD: end_that_request_last should be usable even in this case;
  Perhaps an end_that_direct_request_first routine could be implemented to make
  handling direct requests easier for such drivers; Also for drivers that
  expect bios, a helper function could be provided for setting up a bio
  corresponding to a data buffer]

  <JENS: I don't understand the above, why is end_that_request_first() not
  usable? Or _last for that matter. I must be missing something>

  <SUP: What I meant here was that if the request doesn't have a bio, then
   end_that_request_first doesn't modify nr_sectors or current_nr_sectors,
   and hence can't be used for advancing request state settings on the
   completion of partial transfers. The driver has to modify these fields
   directly by hand.
   This is because end_that_request_first only iterates over the bio list,
   and always returns 0 if there are none associated with the request.
   _last works OK in this case, and is not a problem, as I mentioned earlier
  >

1.3.1 Pre-built Commands
^^^^^^^^^^^^^^^^^^^^^^^^

A request can be created with a pre-built custom command to be sent directly
to the device. The cmd block in the request structure has room for filling
in the command bytes. (i.e. rq->cmd is now 16 bytes in size, and meant for
command pre-building, and the type of the request is now indicated
through rq->flags instead of via rq->cmd)

The request structure flags can be set up to indicate the type of request
in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
packet command issued via blk_do_rq, REQ_SPECIAL: special request).

It can help to pre-build device commands for requests in advance.
Drivers can now specify a request prepare function (q->prep_rq_fn) that the
block layer would invoke to pre-build device commands for a given request,
or perform other preparatory processing for the request. This routine is
called by elv_next_request(), i.e. typically just before servicing a request.
(The prepare function would not be called for requests that have RQF_DONTPREP
enabled)
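
The interaction between the prepare hook and the "don't prepare" flag can be
modelled in a few lines of userspace C. All names below are hypothetical
stand-ins (prep_queue_model, next_request_model, RQ_DONTPREP_MODEL), and the
command encoding is made up for the example. Having the prepare function set
the flag itself, so that a request fetched twice is built only once, is an
assumption of this sketch rather than something the text states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CMD_LEN_MODEL 16          /* rq->cmd is 16 bytes per the text */
#define RQ_DONTPREP_MODEL 0x1u    /* hypothetical flag bit */

struct prep_rq_model {
    unsigned int flags;
    unsigned long sector;
    unsigned char cmd[CMD_LEN_MODEL];
    int prep_calls;               /* instrumentation for the example */
};

typedef int (*prep_fn_model)(struct prep_rq_model *rq);

struct prep_queue_model {
    prep_fn_model prep_rq_fn;     /* optional prepare hook */
    struct prep_rq_model *pending;/* request at the head of the queue */
};

/* Example hook: made-up layout, opcode byte then 32-bit big-endian sector. */
static int example_prep(struct prep_rq_model *rq)
{
    memset(rq->cmd, 0, sizeof(rq->cmd));
    rq->cmd[0] = 0x01;            /* hypothetical opcode */
    rq->cmd[1] = (unsigned char)(rq->sector >> 24);
    rq->cmd[2] = (unsigned char)(rq->sector >> 16);
    rq->cmd[3] = (unsigned char)(rq->sector >> 8);
    rq->cmd[4] = (unsigned char)rq->sector;
    rq->flags |= RQ_DONTPREP_MODEL;
    rq->prep_calls++;
    return 0;
}

/* Model of fetching the next request: prepare it unless flagged. */
static struct prep_rq_model *next_request_model(struct prep_queue_model *q)
{
    struct prep_rq_model *rq = q->pending;

    if (rq && q->prep_rq_fn && !(rq->flags & RQ_DONTPREP_MODEL))
        q->prep_rq_fn(rq);
    return rq;
}
```

Fetching the same request a second time in this model returns it unchanged,
because the flag suppresses a second call to the hook.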

Aside:
  Pre-building could possibly even be done early, i.e. before placing the
  request on the queue, rather than construct the command on the fly in the
  driver while servicing the request queue when it may affect latencies in
  interrupt context or responsiveness in general. One way to add early
  pre-building would be to do it whenever we fail to merge on a request.
  Now REQ_NOMERGE is set in the request flags to skip this one in the future,
  which means that it will not change before we feed it to the device. So
  the pre-builder hook can be invoked there.


2. Flexible and generic but minimalist i/o structure/descriptor
===============================================================

2.1 Reason for a new structure and requirements addressed
---------------------------------------------------------

Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
layer, and the low level request structure was associated with a chain of
buffer heads for a contiguous i/o request. This led to certain inefficiencies
when it came to large i/o requests and readv/writev style operations, as it
forced such requests to be broken up into small chunks before being passed
on to the generic block layer, only to be merged by the i/o scheduler
when the underlying device was capable of handling the i/o in one shot.
Also, using the buffer head as an i/o structure for i/os that didn't originate
from the buffer cache unnecessarily added to the weight of the descriptors
which were generated for each such chunk.

The following were some of the goals and expectations considered in the
redesign of the block i/o data structure in 2.5.

1.  Should be appropriate as a descriptor for both raw and buffered i/o -
    avoid cache related fields which are irrelevant in the direct/page i/o path,
    or filesystem block size alignment restrictions which may not be relevant
    for raw i/o.
2.  Ability to represent high-memory buffers (which do not have a virtual
    address mapping in kernel address space).
3.  Ability to represent large i/os w/o unnecessarily breaking them up (i.e.
    greater than PAGE_SIZE chunks in one shot)
4.  At the same time, ability to retain independent identity of i/os from
    different sources or i/o units requiring individual completion (e.g. for
    latency reasons)
5.  Ability to represent an i/o involving multiple physical memory segments
    (including non-page aligned page fragments, as specified via readv/writev)
    without unnecessarily breaking it up, if the underlying device is capable of
    handling it.
6.  Preferably should be based on a memory descriptor structure that can be
    passed around different types of subsystems or layers, maybe even
    networking, without duplication or extra copies of data/descriptor fields
    themselves in the process
7.  Ability to handle the possibility of splits/merges as the structure passes
    through layered drivers (lvm, md, evms), with minimal overhead.

The solution was to define a new structure (bio) for the block layer,
instead of using the buffer head structure (bh) directly, the idea being
avoidance of some associated baggage and limitations. The bio structure
is uniformly used for all i/o at the block layer; it forms a part of the
bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
mapped to bio structures.

2.2 The bio struct
------------------

The bio structure uses a vector representation pointing to an array of tuples
of <page, offset, len> to describe the i/o buffer, and has various other
fields describing i/o parameters and state that needs to be maintained for
performing the i/o.

Notice that this representation means that a bio has no virtual address
mapping at all (unlike buffer heads).

::

  struct bio_vec {
         struct page     *bv_page;
         unsigned short  bv_len;
         unsigned short  bv_offset;
  };

  /*
   * main unit of I/O for the block layer and lower layers (i.e. drivers)
   */
  struct bio {
         struct bio          *bi_next;    /* request queue link */
         struct block_device *bi_bdev;    /* target device */
         unsigned long       bi_flags;    /* status, command, etc */
         unsigned long       bi_opf;      /* low bits: r/w, high: priority */

         unsigned int        bi_vcnt;     /* how many bio_vec's */
         struct bvec_iter    bi_iter;     /* current index into bio_vec array */

         unsigned int        bi_size;     /* total size in bytes */
         unsigned short      bi_hw_segments; /* segments after DMA remapping */
         unsigned int        bi_max;      /* max bio_vecs we can hold
                                             used as index into pool */
         struct bio_vec      *bi_io_vec;  /* the actual vec list */
         bio_end_io_t        *bi_end_io;  /* bi_end_io (bio) */
         atomic_t            bi_cnt;      /* pin count: free when it hits zero */
         void                *bi_private;
  };
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 473 | |
With this multipage bio design:

- Large i/os can be sent down in one go using a bio_vec list consisting
  of an array of <page, offset, len> fragments (similar to the way fragments
  are represented in the zero-copy network code)
- Splitting of an i/o request across multiple devices (as in the case of
  lvm or raid) is achieved by cloning the bio (where the clone points to
  the same bi_io_vec array, but with the index and size accordingly modified)
- A linked list of bios is used as before for unrelated merges [#]_ - this
  avoids reallocs and makes independent completions easier to handle.
- Code that traverses the req list can find all the segments of a bio
  by using rq_for_each_segment. This handles the fact that a request
  has multiple bios, each of which can have multiple segments.
- Drivers which can't process a large bio in one shot can use the bi_iter
  field to keep track of the next bio_vec entry to process.
  (e.g. a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
  [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
  the bv_offset and bv_len fields]
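
The cloning approach to splitting can be illustrated with a small userspace
model. This is only a sketch: struct toy_bio, its bi_idx field and
toy_bio_split() are hypothetical stand-ins invented for the example, not
kernel interfaces, and the split is assumed to fall on a bio_vec boundary.
The point it shows is that both halves keep pointing at the same bi_io_vec
array and differ only in index and size.

```c
#include <assert.h>
#include <stddef.h>

struct page;                        /* opaque here, never dereferenced */

struct bio_vec {
	struct page    *bv_page;
	unsigned short  bv_len;
	unsigned short  bv_offset;
};

/* Hypothetical, simplified bio: just enough state to show a split. */
struct toy_bio {
	struct bio_vec *bi_io_vec;  /* shared by the original and its clones */
	unsigned int    bi_vcnt;    /* one past the last entry this bio covers */
	unsigned int    bi_idx;     /* first entry this bio covers */
	unsigned int    bi_size;    /* bytes this bio covers */
};

/* Sum the lengths of entries [first, first + count). */
static unsigned int toy_vec_bytes(const struct bio_vec *vec,
				  unsigned int first, unsigned int count)
{
	unsigned int bytes = 0;

	for (unsigned int i = first; i < first + count; i++)
		bytes += vec[i].bv_len;
	return bytes;
}

/*
 * Split 'orig' after 'nvecs' entries. 'front' and 'back' are clones that
 * share orig->bi_io_vec; no vector entry is copied or modified.
 */
static void toy_bio_split(const struct toy_bio *orig, unsigned int nvecs,
			  struct toy_bio *front, struct toy_bio *back)
{
	assert(orig->bi_idx + nvecs <= orig->bi_vcnt);

	*front = *orig;
	front->bi_vcnt = orig->bi_idx + nvecs;
	front->bi_size = toy_vec_bytes(orig->bi_io_vec, orig->bi_idx, nvecs);

	*back = *orig;
	back->bi_idx = orig->bi_idx + nvecs;
	back->bi_size = orig->bi_size - front->bi_size;
}
```

Because the vector itself is never touched, the original bio remains usable
as a description of the whole transfer after both clones complete.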

.. [#]

    unrelated merges -- a request ends up containing two or more bios that
    didn't originate from the same place.

The bi_end_io() i/o callback gets called on i/o completion of the entire bio.

At a lower level, drivers build a scatter gather list from the merged bios.
The scatter gather list is in the form of an array of <page, offset, len>
entries with their corresponding dma address mappings filled in at the
appropriate time. As an optimization, contiguous physical pages can be
covered by a single entry where <page> refers to the first page and <len>
covers the range of pages (up to 16 contiguous pages could be covered this
way). There is a helper routine (blk_rq_map_sg) which drivers can use to build
the sg list.
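
The contiguous-page optimization can be sketched in userspace as follows.
This is an illustration under stated assumptions, not the blk_rq_map_sg
implementation: struct toy_seg, toy_map_sg() and the max_len parameter are
hypothetical, and a page frame number stands in for the <page> pointer so
that physical adjacency can be computed directly.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096ULL

/* Hypothetical <page, offset, len> entry; pfn stands in for struct page *. */
struct toy_seg {
	unsigned long pfn;
	unsigned int  offset;
	unsigned int  len;
};

/*
 * Copy 'n' input segments into 'sg', folding a segment into the previous
 * entry when it starts exactly where that entry ends in physical memory
 * and the combined length stays within 'max_len'.
 * Returns the number of sg entries produced ('sg' must hold 'n' entries).
 */
static size_t toy_map_sg(const struct toy_seg *in, size_t n,
			 struct toy_seg *sg, unsigned int max_len)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		if (out > 0) {
			struct toy_seg *prev = &sg[out - 1];
			unsigned long long prev_end =
				prev->pfn * TOY_PAGE_SIZE + prev->offset + prev->len;
			unsigned long long cur_start =
				in[i].pfn * TOY_PAGE_SIZE + in[i].offset;

			if (prev_end == cur_start &&
			    prev->len + in[i].len <= max_len) {
				prev->len += in[i].len;   /* extend previous entry */
				continue;
			}
		}
		sg[out++] = in[i];                        /* start a new entry */
	}
	return out;
}
```

With a 64kB max_len (16 pages of 4kB), a run of adjacent pages collapses
into one entry whose <page> is the first page of the run.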

Note: Right now the only user of bios with more than one page is ll_rw_kio,
which in turn means that only raw I/O uses it (direct i/o may not work
right now). The intent however is to enable clustering of pages etc to
become possible. The pagebuf abstraction layer from SGI also uses multi-page
bios, but that is currently not included in the stock development kernels.
The same is true of Andrew Morton's work-in-progress multipage bio writeout
and readahead patches.

2.3 Changes in the Request Structure
------------------------------------

The request structure is the structure that gets passed down to low level
drivers. The block layer make_request function builds up a request structure,
places it on the queue and invokes the driver's request_fn. The driver makes
use of the block layer helper routine elv_next_request to pull the next request
off the queue. Control or diagnostic functions might bypass the block layer and
directly invoke underlying driver entry points, passing in a specially
constructed request structure.

Only some relevant fields (mainly those which changed or may be referred
to in some of the discussion here) are listed below, not necessarily in
the order in which they occur in the structure (see include/linux/blkdev.h).
Refer to Documentation/block/request.rst for details about all the request
structure fields and a quick reference about the layers which are
supposed to use or modify those fields::

  struct request {
	struct list_head queuelist;  /* Not meant to be directly accessed by
					the driver.
					Used by q->elv_next_request_fn
					rq->queue is gone
					*/
	.
	.
	unsigned char cmd[16]; /* prebuilt command data block */
	unsigned long flags;   /* also includes earlier rq->cmd settings */
	.
	.
	sector_t sector; /* this field is now of type sector_t instead of int
			    preparation for 64 bit sectors */
	.
	.

	/* Number of scatter-gather DMA addr+len pairs after
	 * physical address coalescing is performed.
	 */
	unsigned short nr_phys_segments;

	/* Number of scatter-gather addr+len pairs after
	 * physical and DMA remapping hardware coalescing is performed.
	 * This is the number of scatter-gather entries the driver
	 * will actually have to deal with after DMA mapping is done.
	 */
	unsigned short nr_hw_segments;

	/* Various sector counts */
	unsigned long nr_sectors;  /* no. of sectors left: driver modifiable */
	unsigned long hard_nr_sectors;  /* block internal copy of above */
	unsigned int current_nr_sectors; /* no. of sectors left in the
					   current segment:driver modifiable */
	unsigned long hard_cur_sectors; /* block internal copy of the above */
	.
	.
	int tag;	/* command tag associated with request */
	void *special;  /* same as before */
	char *buffer;   /* valid only for low memory buffers up to
			 current_nr_sectors */
	.
	.
	struct bio *bio, *biotail;  /* bio list instead of bh */
	struct request_list *rl;
  }

See the req_ops and req_flag_bits definitions for an explanation of the various
flags available. Some bits are used by the block layer or i/o scheduler.

The behaviour of the various sector counts is almost the same as before,
except that since we have multi-segment bios, current_nr_sectors refers
to the number of sectors in the current segment being processed which could
be one of the many segments in the current bio (i.e. i/o completion unit).
The nr_sectors value refers to the total number of sectors in the whole
request that remain to be transferred (no change). The purpose of the
hard_xxx values is for block to remember these counts every time it hands
over the request to the driver. These values are updated by block on
end_that_request_first, i.e. every time the driver completes a part of the
transfer and invokes block end*request helpers to mark this. The
driver should not modify these values. The block layer sets up the
nr_sectors and current_nr_sectors fields (based on the corresponding
hard_xxx values and the number of bytes transferred) and updates them on
every transfer that invokes end_that_request_first. It does the same for the
buffer, bio, bio->bi_iter fields too.

The buffer field is just a virtual address mapping of the current segment
of the i/o buffer in cases where the buffer resides in low-memory. For high
memory i/o, this field is not valid and must not be used by drivers.

Code that sets up its own request structures and passes them down to
a driver needs to be careful about interoperation with the block layer helper
functions which the driver uses. (Section 1.3)

3. Using bios
=============

3.1 Setup/Teardown
------------------

There are routines for managing the allocation, reference counting, and
freeing of bios (bio_alloc, bio_get, bio_put).

This makes use of Ingo Molnar's mempool implementation, which enables
subsystems like bio to maintain their own reserve memory pools for guaranteed
deadlock-free allocations during extreme VM load. For example, the VM
subsystem makes use of the block layer to writeout dirty pages in order to be
able to free up memory space, a case which needs careful handling. The
allocation logic draws from the preallocated emergency reserve in situations
where it cannot allocate through normal means. If the pool is empty and it
can wait, then it would trigger action that would help free up memory or
replenish the pool (without deadlocking) and wait for availability in the pool.
If it is in IRQ context, and hence not in a position to do this, allocation
could fail if the pool is empty. In general mempool always first tries to
perform allocation without having to wait, even if it means digging into the
pool as long as it is not less than 50% full.

On a free, memory is released to the pool or directly freed depending on
the current availability in the pool. The mempool interface lets the
subsystem specify the routines to be used for normal alloc and free. In the
case of bio, these routines make use of the standard slab allocator.

The caller of bio_alloc is expected to take certain steps to avoid
deadlocks, e.g. avoid trying to allocate more memory from the pool while
already holding memory obtained from the pool.

::

    [TBD: This is a potential issue, though a rare possibility
     in the bounce bio allocation that happens in the current code, since
     it ends up allocating a second bio from the same pool while
     holding the original bio ]

Memory allocated from the pool should be released back within a limited
amount of time (in the case of bio, that would be after the i/o is completed).
This ensures that if part of the pool has been used up, some work (in this
case i/o) must already be in progress and memory would be available when it
is over. If allocating from multiple pools in the same code path, the order
or hierarchy of allocation needs to be consistent, just the way one deals
with multiple locks.
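
The reserve-pool behaviour described above can be modelled in a few lines of
userspace C. This is a rough sketch and not the kernel mempool API: struct
toy_pool, TOY_POOL_MIN and the fail_normal knob (which simulates the normal
allocator failing under memory pressure) are all invented for illustration.
It does not model waiting callers or the 50% rule; it only shows the fallback
to the reserve on allocation and the refill of the reserve on free.

```c
#include <stdlib.h>

#define TOY_POOL_MIN 4              /* size of the emergency reserve */

struct toy_pool {
	void  *reserve[TOY_POOL_MIN];
	int    curr;                /* elements currently held in reserve */
	size_t elem_size;
	int    fail_normal;         /* hypothetical knob: force malloc "failure" */
};

/* Preallocate the reserve. Returns 0 on success, -1 on failure. */
static int toy_pool_init(struct toy_pool *p, size_t elem_size)
{
	p->curr = 0;
	p->elem_size = elem_size;
	p->fail_normal = 0;
	while (p->curr < TOY_POOL_MIN) {
		void *e = malloc(elem_size);

		if (!e) {
			while (p->curr > 0)
				free(p->reserve[--p->curr]);
			return -1;
		}
		p->reserve[p->curr++] = e;
	}
	return 0;
}

/* Try the normal allocator first, then fall back to the reserve. */
static void *toy_pool_alloc(struct toy_pool *p)
{
	void *e = p->fail_normal ? NULL : malloc(p->elem_size);

	if (e)
		return e;
	if (p->curr > 0)
		return p->reserve[--p->curr];
	return NULL;    /* a caller that may sleep would wait for a free here */
}

/* Refill the reserve if it is below its minimum, otherwise really free. */
static void toy_pool_free(struct toy_pool *p, void *e)
{
	if (p->curr < TOY_POOL_MIN)
		p->reserve[p->curr++] = e;
	else
		free(e);
}
```

The NULL return in toy_pool_alloc() corresponds to the IRQ-context case in
the text, where the allocation cannot wait for the pool to be replenished.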

The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
for a non-clone bio. There are 6 pools set up for different size biovecs,
so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
given size from these slabs.

The bio_get() routine may be used to hold an extra reference on a bio prior
to i/o submission, if the bio fields are likely to be accessed after the
i/o is issued (since the bio may otherwise get freed in case i/o completion
happens in the meantime).
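
The reason for taking the extra reference can be seen in a small userspace
model of the pin count. The names below (struct ref_bio, ref_bio_alloc,
ref_bio_get, ref_bio_put) are hypothetical and only mimic the idea behind
bio_get/bio_put using C11 atomics; they are not the kernel routines.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical bio reduced to its pin count and one data field. */
struct ref_bio {
	atomic_int   bi_cnt;    /* pin count: freed when it hits zero */
	unsigned int bi_size;
};

/* Allocate with a single reference owned by the caller. */
static struct ref_bio *ref_bio_alloc(unsigned int size)
{
	struct ref_bio *bio = malloc(sizeof(*bio));

	if (!bio)
		return NULL;
	atomic_init(&bio->bi_cnt, 1);
	bio->bi_size = size;
	return bio;
}

/* Take an extra reference, e.g. before submitting the i/o. */
static void ref_bio_get(struct ref_bio *bio)
{
	atomic_fetch_add(&bio->bi_cnt, 1);
}

/* Drop a reference. Returns 1 if this was the last one and the bio was freed. */
static int ref_bio_put(struct ref_bio *bio)
{
	if (atomic_fetch_sub(&bio->bi_cnt, 1) == 1) {
		free(bio);
		return 1;
	}
	return 0;
}
```

A submitter that wants to read bio fields after issuing the i/o would call
ref_bio_get() first; the put done on the completion path then leaves the bio
alive until the submitter drops its own reference.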

The bio_clone_fast() routine may be used to duplicate a bio, where the clone
shares the bio_vec_list with the original bio (i.e. both point to the
same bio_vec_list). This would typically be used for splitting i/o requests
in lvm or md.

3.2 Generic bio helper Routines
-------------------------------

3.2.1 Traversing segments and completion units in a request
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The macro rq_for_each_segment() should be used for traversing the bios
in the request list (drivers should avoid directly trying to do it
themselves). Using these helpers should also make it easier to cope
with block changes in the future.

::

	struct req_iterator iter;
	rq_for_each_segment(bio_vec, rq, iter)
		/* bio_vec is now current segment */
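
What such a traversal has to cover can be shown with a userspace model of a
request holding a chain of bios, each with its own segment array. The types
and the walk_request() function are hypothetical and exist only for this
illustration; real drivers should use rq_for_each_segment() as stated above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for bio_vec, bio and request. */
struct walk_vec {
	unsigned int len;
};

struct walk_bio {
	struct walk_bio       *bi_next;  /* next bio in the request */
	const struct walk_vec *vec;      /* this bio's segments */
	unsigned int           vcnt;     /* number of segments */
};

struct walk_rq {
	struct walk_bio *bio;            /* head of the bio chain */
};

struct walk_totals {
	unsigned int bios;               /* completion units seen */
	unsigned int segments;           /* segments seen across all bios */
	unsigned int bytes;              /* total payload */
};

/* Outer loop: bios (completion units). Inner loop: segments of each bio. */
static struct walk_totals walk_request(const struct walk_rq *rq)
{
	struct walk_totals t = { 0, 0, 0 };

	for (const struct walk_bio *bio = rq->bio; bio; bio = bio->bi_next) {
		t.bios++;
		for (unsigned int i = 0; i < bio->vcnt; i++) {
			t.segments++;
			t.bytes += bio->vec[i].len;
		}
	}
	return t;
}
```

The two counters make the distinction visible: a request with two bios and
three segments has two completion callbacks to expect, not three.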

I/O completion callbacks are per-bio rather than per-segment, so drivers
that traverse bio chains on completion need to keep that in mind. Drivers
which don't make a distinction between segments and completion units would
need to be reorganized to support multi-segment bios.

3.2.2 Setting up DMA scatterlists
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The blk_rq_map_sg() helper routine would be used for setting up scatter
gather lists from a request, so a driver need not do it on its own.

	nr_segments = blk_rq_map_sg(q, rq, scatterlist);

The helper routine provides a level of abstraction which makes it easier
to modify the internals of request to scatterlist conversion down the line
without breaking drivers. The blk_rq_map_sg routine takes care of several
things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
is set) and correct segment accounting to avoid exceeding the limits which
the i/o hardware can handle, based on various queue properties.

- Prevents a clustered segment from crossing a 4GB mem boundary
- Avoids building segments that would exceed the number of physical
  memory segments that the driver can handle (phys_segments) and the
  number that the underlying hardware can handle at once, accounting for
  DMA remapping (hw_segments) (i.e. IOMMU aware limits).

Routines which the low level driver can use to set up the segment limits:

blk_queue_max_hw_segments() : Sets an upper limit on the maximum number of
hw data segments in a request (i.e. the maximum number of address/length
pairs the host adapter can actually hand to the device at once)

blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
of physical data segments in a request (i.e. the largest sized scatter list
a driver could handle)

3.2.3 I/O completion
^^^^^^^^^^^^^^^^^^^^

The existing generic block layer helper routines end_request,
end_that_request_first and end_that_request_last can be used for i/o
completion (and setting things up so the rest of the i/o or the next
request can be kicked off) as before. With the introduction of multi-page
bio support, end_that_request_first requires an additional argument indicating
the number of sectors completed.

3.2.4 Implications for drivers that do not interpret bios
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(don't handle multiple segments)

Drivers that do not interpret bios, e.g. those which do not handle multiple
segments and do not support i/o into high memory addresses (require bounce
buffers) and expect only virtually mapped buffers, can access the rq->buffer
field. As before the driver should use current_nr_sectors to determine the
size of remaining data in the current segment (that is the maximum it can
transfer in one go unless it interprets segments), and rely on the block layer
end_request, or end_that_request_first/last to take care of all accounting
and transparent mapping of the next bio segment when a segment boundary
is crossed on completion of a transfer. (The end*request* functions should
be used if and only if the request has come down from the block/bio path, not
for direct access requests which only specify rq->buffer without a valid
rq->bio)

3.3 I/O Submission
------------------

The routine submit_bio() is used to submit a single io. Higher level i/o
routines make use of this:

(a) Buffered i/o:

The routine submit_bh() invokes submit_bio() on a bio corresponding to the
bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.

(b) Kiobuf i/o (for raw/direct i/o):

The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
maps the array to one or more multi-page bios, issuing submit_bio() to
perform the i/o on each of these.

The embedded bh array in the kiobuf structure has been removed and no
preallocation of bios is done for kiobufs. [The intent is to remove the
blocks array as well, but it's currently in there to kludge around direct i/o.]
Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.

Todo/Observation:

  A single kiobuf structure is assumed to correspond to a contiguous range
  of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
  So right now it wouldn't work for direct i/o on non-contiguous blocks.
  This is to be resolved. The eventual direction is to replace kiobuf
  by kvec's.

  Badari Pulavarty has a patch to implement direct i/o correctly using
  bio and kvec.


(c) Page i/o:

Todo/Under discussion:

  Andrew Morton's multi-page bio patches attempt to issue multi-page
  writeouts (and reads) from the page cache, by directly building up
  large bios for submission completely bypassing the usage of buffer
  heads. This work is still in progress.

  Christoph Hellwig had some code that uses bios for page-io (rather than
  bh). This isn't included in bio as yet. Christoph was also working on a
  design for representing virtual/real extents as an entity and modifying
  some of the address space ops interfaces to utilize this abstraction rather
  than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
  abstraction, but intended to be as lightweight as possible).

(d) Direct access i/o:

Direct access requests that do not contain bios would be submitted differently
as discussed earlier in section 1.3.

Aside:

  Kvec i/o:

  Ben LaHaise's aio code uses a slightly different structure instead
  of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
  tuples (very much like the networking code), together with a callback function
  and data pointer. This is embedded into a brw_cb structure when passed
  to brw_kvec_async().

  Now it should be possible to directly map these kvecs to a bio. Just as while
  cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
  array pointer to point to the veclet array in kvecs.

  TBD: In order for this to work, some changes are needed in the way multi-page
  bios are handled today. The values of the tuples in such a vector passed in
  from higher level code should not be modified by the block layer in the course
  of its request processing, since that would make it hard for the higher layer
  to continue to use the vector descriptor (kvec) after i/o completes. Instead,
  all such transient state should be maintained in the request structure,
  and passed on in some way to the endio completion routine.


4. The I/O scheduler
====================

The I/O scheduler, a.k.a. the elevator, is implemented in two layers: a generic
dispatch queue and specific I/O schedulers. Unless stated otherwise, elevator
is used to refer to both parts and I/O scheduler to specific I/O schedulers.

The block layer implements the generic dispatch queue in `block/*.c`.
The generic dispatch queue is responsible for requeueing, handling non-fs
requests and all other subtleties.

Specific I/O schedulers are responsible for ordering normal filesystem
requests. They can also choose to delay certain requests to improve
throughput or for some other purpose. As the plural form indicates, there are
multiple I/O schedulers. They can be built as modules but at least one should
be built inside the kernel. Each queue can choose a different one and can also
change to another one dynamically.

A block layer call to the i/o scheduler follows the convention elv_xxx(). This
calls elevator_xxx_fn in the elevator switch (block/elevator.c). Oh, xxx
and xxx might not match exactly, but use your imagination. If an elevator
doesn't implement a function, the switch does nothing or some minimal
housekeeping work.
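
The shape of such a switch can be sketched with a table of function pointers
in which optional hooks may be left NULL. Everything below is a simplified
userspace illustration: the sw_ prefixed types and wrappers are hypothetical,
only two hooks are modelled, and the real switch in block/elevator.c takes
different arguments and does more work.

```c
#include <stddef.h>

struct sw_queue;

struct sw_request {
	int id;
};

/* Hypothetical subset of the elevator ops table. */
struct sw_elevator_ops {
	void (*elevator_add_req_fn)(struct sw_queue *, struct sw_request *);
	void (*elevator_completed_req_fn)(struct sw_queue *, struct sw_request *);
};

struct sw_queue {
	const struct sw_elevator_ops *ops;
	int added;                       /* bookkeeping for the example only */
};

/* Mandatory hook: called unconditionally. */
static void sw_elv_add_request(struct sw_queue *q, struct sw_request *rq)
{
	q->ops->elevator_add_req_fn(q, rq);
}

/* Optional hook: the switch does nothing if the scheduler left it unset. */
static void sw_elv_completed_request(struct sw_queue *q, struct sw_request *rq)
{
	if (q->ops->elevator_completed_req_fn)
		q->ops->elevator_completed_req_fn(q, rq);
}

/* A trivial scheduler that implements only the add hook. */
static void sw_simple_add(struct sw_queue *q, struct sw_request *rq)
{
	(void)rq;
	q->added++;
}

static const struct sw_elevator_ops sw_simple_ops = {
	.elevator_add_req_fn = sw_simple_add,
	/* elevator_completed_req_fn intentionally left NULL */
};
```

Callers in the block layer only ever use the wrappers, so a scheduler can
omit an optional hook without any caller needing to know.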

4.1. I/O scheduler API
----------------------

The functions an elevator may implement are: (* are mandatory)

=============================== ================================================
elevator_merge_fn               called to query requests for merge with a bio

elevator_merge_req_fn           called when two requests get merged. The one
                                which gets merged into the other one will
                                never be seen by the I/O scheduler again. IOW,
                                after being merged, the request is gone.

elevator_merged_fn              called when a request in the scheduler has been
                                involved in a merge. It is used in the deadline
                                scheduler for example, to reposition the request
                                if its sorting order has changed.

elevator_allow_merge_fn         called whenever the block layer determines
                                that a bio can be merged into an existing
                                request safely. The io scheduler may still
                                want to stop a merge at this point if it
                                results in some sort of conflict internally;
                                this hook allows it to do that. Note however
                                that two *requests* can still be merged at a
                                later time. Currently the io scheduler has no
                                way to prevent that. It can only learn about
                                the fact from the elevator_merge_req_fn
                                callback.

elevator_dispatch_fn*           fills the dispatch queue with ready requests.
                                I/O schedulers are free to postpone requests by
                                not filling the dispatch queue unless @force
                                is non-zero. Once dispatched, I/O schedulers
                                are not allowed to manipulate the requests -
                                they belong to the generic dispatch queue.

elevator_add_req_fn*            called to add a new request into the scheduler

elevator_former_req_fn
elevator_latter_req_fn          These return the request before or after the
                                one specified in disk sort order. Used by the
                                block layer to find merge possibilities.

elevator_completed_req_fn       called when a request is completed.

elevator_set_req_fn
elevator_put_req_fn             Must be used to allocate and free any elevator
                                specific storage for a request.

elevator_activate_req_fn        Called when the device driver first sees a
                                request. I/O schedulers can use this callback
                                to determine when actual execution of a request
                                starts.

elevator_deactivate_req_fn      Called when the device driver decides to delay
                                a request by requeueing it.

elevator_init_fn*
elevator_exit_fn                Allocate and free any elevator specific storage
                                for a queue.
=============================== ================================================

4.2 Request flows seen by I/O schedulers
----------------------------------------

All requests seen by I/O schedulers strictly follow one of the following three
flows.

 set_req_fn ->

 i.   add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
      (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
 ii.  add_req_fn -> (merged_fn ->)* -> merge_req_fn
 iii. [none]

 -> put_req_fn

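The three flows above can be read as a small state machine. The following is
only a userspace sketch of that reading, not kernel code: the enum names and
the helper flow_is_valid() are hypothetical and exist purely to check whether a
sequence of callbacks matches one of the flows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event names mirroring the elevator callbacks listed above. */
enum ev { SET_REQ, ADD_REQ, MERGED, DISPATCH, ACTIVATE, DEACTIVATE,
          COMPLETED, MERGE_REQ, PUT_REQ };

/* States of the model. ST_FINISHED means only put_req_fn may follow. */
enum st { ST_START, ST_SET, ST_QUEUED, ST_DISPATCHED, ST_ACTIVE,
          ST_REQUEUED, ST_FINISHED, ST_END, ST_BAD };

static enum st step(enum st s, enum ev e)
{
    switch (s) {
    case ST_START:
        return e == SET_REQ ? ST_SET : ST_BAD;
    case ST_SET:
        if (e == ADD_REQ) return ST_QUEUED;
        if (e == PUT_REQ) return ST_END;        /* flow iii: [none] */
        return ST_BAD;
    case ST_QUEUED:
        if (e == MERGED) return ST_QUEUED;      /* (merged_fn ->)* */
        if (e == DISPATCH) return ST_DISPATCHED;
        if (e == MERGE_REQ) return ST_FINISHED; /* flow ii */
        return ST_BAD;
    case ST_DISPATCHED:
        return e == ACTIVATE ? ST_ACTIVE : ST_BAD;
    case ST_ACTIVE:
        if (e == DEACTIVATE) return ST_REQUEUED;
        if (e == COMPLETED) return ST_FINISHED; /* end of flow i */
        return ST_BAD;
    case ST_REQUEUED:
        return e == ACTIVATE ? ST_ACTIVE : ST_BAD;
    case ST_FINISHED:
        return e == PUT_REQ ? ST_END : ST_BAD;
    default:
        return ST_BAD;  /* nothing may follow put_req_fn or an error */
    }
}

/* Returns true if the event sequence is exactly one complete flow. */
bool flow_is_valid(const enum ev *evs, size_t n)
{
    enum st s = ST_START;

    assert(evs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s = step(s, evs[i]);
    return s == ST_END;
}
```

In this model a sequence such as set_req_fn, add_req_fn, merge_req_fn,
put_req_fn is accepted as flow ii, while a dispatch without a preceding add is
rejected.
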
4.3 I/O scheduler implementation
--------------------------------

The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
optimal disk scan and request servicing performance (based on generic
principles and device capabilities), optimized for:

i.   improved throughput
ii.  improved latency
iii. better utilization of h/w & CPU time

Characteristics:

i. Binary tree
AS and deadline i/o schedulers use red-black binary trees for disk position
sorting and searching, and a fifo linked list for time-based searching. This
gives good scalability and good availability of information. Requests are
almost always dispatched in disk sort order, so a cache is kept of the next
request in sort order to avoid binary tree lookups.

This arrangement is not a generic block layer characteristic, however, so
elevators may implement queues as they please.

ii. Merge hash
AS and deadline use a hash table indexed by the last sector of a request. This
enables merging code to quickly look up "back merge" candidates, even when
multiple I/O streams are being performed at once on one disk.

"Front merges", a new request being merged at the front of an existing request,
are far less common than "back merges" due to the nature of most I/O patterns.
Front merges are handled by the binary trees in AS and deadline schedulers.

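The back merge lookup described above can be sketched in userspace C as
follows. This is an illustrative model only: struct req, struct merge_hash and
the mh_* helpers are hypothetical names, the bucket count is arbitrary, and the
table is keyed on the sector just past the end of each request (an assumption
made here so that a new I/O can be looked up directly by its start sector).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 64              /* arbitrary size for this sketch */

struct req {
    uint64_t start;              /* first sector of the request */
    uint64_t nsect;              /* length in sectors */
    struct req *hnext;           /* hash chain link */
};

struct merge_hash {
    struct req *bucket[NBUCKETS];
};

static uint64_t req_end(const struct req *r) { return r->start + r->nsect; }
static size_t slot(uint64_t sector) { return (size_t)(sector % NBUCKETS); }

/* Index a queued request by the sector that follows its last sector. */
void mh_add(struct merge_hash *h, struct req *r)
{
    size_t s;

    assert(r->nsect > 0);
    s = slot(req_end(r));
    r->hnext = h->bucket[s];
    h->bucket[s] = r;
}

static void mh_del(struct merge_hash *h, struct req *r)
{
    struct req **pp = &h->bucket[slot(req_end(r))];

    while (*pp && *pp != r)
        pp = &(*pp)->hnext;
    if (*pp)
        *pp = r->hnext;
}

/*
 * Try to back merge new I/O covering [start, start + nsect) onto a queued
 * request that ends exactly where the new I/O begins. Returns the grown
 * request, or NULL if there is no back merge candidate.
 */
struct req *mh_back_merge(struct merge_hash *h, uint64_t start, uint64_t nsect)
{
    for (struct req *r = h->bucket[slot(start)]; r; r = r->hnext) {
        if (req_end(r) == start) {
            mh_del(h, r);        /* the end sector changes, so rehash */
            r->nsect += nsect;
            mh_add(h, r);
            return r;
        }
    }
    return NULL;
}
```

A front merge candidate (new I/O that ends where a queued request starts) is
deliberately not found by this table; as noted above, the sorted tree handles
that case.
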
iii. Plugging the queue to batch requests in anticipation of opportunities for
     merge/sort optimizations

Plugging is an approach that the current i/o scheduling algorithm uses to
collect enough requests in the queue to take advantage of the sorting/merging
logic in the elevator. If the queue is empty when a request comes in, then it
plugs the request queue (sort of like plugging the bath tub of a vessel to get
fluid to build up) till it fills up with a few more requests, before starting
to service the requests. This provides an opportunity to merge/sort the
requests before passing them down to the device. There are various conditions
under which the queue is unplugged (to open up the flow again), either through
a scheduled task or on demand. For example, wait_on_buffer sets the unplugging
going through sync_buffer() running blk_run_address_space(mapping). Or the
caller can do it explicitly through blk_unplug(bdev). So in the read case,
the queue gets explicitly unplugged as part of waiting for completion on that
buffer.

Aside:
  This is kind of controversial territory, as it's not clear if plugging is
  always the right thing to do. Devices typically have their own queues,
  and allowing a big queue to build up in software, while letting the device be
  idle for a while may not always make sense. The trick is to handle the fine
  balance between when to plug and when to open up. Also now that we have
  multi-page bios being queued in one shot, we may not need to wait to merge
  a big request from the broken up pieces coming by.

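To make the plug/unplug cycle concrete, here is a minimal userspace sketch. It
is not the kernel implementation: struct pq, pq_submit(), pq_unplug() and the
UNPLUG_THRESH value are all hypothetical, and a simple sort stands in for the
elevator's sort/merge logic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define QMAX          16   /* capacity of the toy queue */
#define UNPLUG_THRESH 4    /* hypothetical: unplug once this many are queued */

struct pq {
    uint64_t sector[QMAX]; /* start sectors of pending requests */
    size_t n;
    bool plugged;
};

static int cmp_sector(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/* Unplug: sort what has built up and hand it to out[] in disk order. */
size_t pq_unplug(struct pq *q, uint64_t *out)
{
    size_t n = q->n;

    qsort(q->sector, n, sizeof q->sector[0], cmp_sector);
    for (size_t i = 0; i < n; i++)
        out[i] = q->sector[i];
    q->n = 0;
    q->plugged = false;
    return n;
}

/*
 * Submit one request. A request arriving on an empty queue plugs it; nothing
 * is dispatched until the threshold is reached (or pq_unplug() is called on
 * demand). Returns the number of requests dispatched by this call.
 */
size_t pq_submit(struct pq *q, uint64_t sector, uint64_t *out)
{
    if (q->n == 0)
        q->plugged = true;
    q->sector[q->n++] = sector;
    if (q->n >= UNPLUG_THRESH)
        return pq_unplug(q, out);
    return 0;
}
```

Submitting sectors 900, 100, 500 and 300 dispatches nothing until the fourth
request, at which point all four go out in ascending sector order. The trade-off
raised in the aside is visible here: the device sits idle while the first three
requests wait.
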
4.4 I/O contexts
----------------

I/O contexts provide a dynamically allocated per process data area. They may
be used in I/O schedulers, and in the block layer (could be used for I/O
statistics or priorities, for example). See `*io_context` in block/ll_rw_blk.c,
and as-iosched.c for an example of usage in an i/o scheduler.


5. Scalability related changes
==============================

5.1 Granular Locking: io_request_lock replaced by a per-queue lock
------------------------------------------------------------------

The global io_request_lock has been removed as of 2.5, to avoid
the scalability bottleneck it was causing, and has been replaced by more
granular locking. The request queue structure has a pointer to the
lock to be used for that queue. As a result, locking can now be
per-queue, with a provision for sharing a lock across queues if
necessary (e.g. the scsi layer sets the queue lock pointers to the
corresponding adapter lock, which results in a per host locking
granularity). The locking semantics are the same, i.e. locking is
still imposed by the block layer, grabbing the lock before
request_fn execution, which means that lots of older drivers
should still be SMP safe. Drivers are free to drop the queue
lock themselves, if required. Drivers that explicitly used the
io_request_lock for serialization need to be modified accordingly.
Usually it's as easy as adding a global lock::

    static DEFINE_SPINLOCK(my_driver_lock);

and passing the address of that lock to blk_init_queue().

5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
----------------------------------------------------------------

The sector number used in the bio structure has been changed to sector_t,
which could be defined as 64 bit in preparation for 64 bit sector support.

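The arithmetic behind this change can be sketched as follows, assuming the
traditional 512-byte sector unit. The type sector64_t and the helper names are
hypothetical stand-ins used only for this illustration.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u        /* assumed bytes per sector */

typedef uint64_t sector64_t;    /* hypothetical stand-in for a 64 bit sector_t */

/* Largest device size, in bytes, addressable with 32 bit sector numbers. */
uint64_t max_bytes_32bit_sectors(void)
{
    return ((uint64_t)UINT32_MAX + 1) * SECTOR_SIZE;   /* 2^32 * 2^9 = 2^41 */
}

/* Convert a byte offset on the device into a sector number. */
sector64_t byte_to_sector(uint64_t byte_off)
{
    return byte_off / SECTOR_SIZE;
}

/* Non-zero if the sector number would be truncated in a 32 bit field. */
int sector_needs_64bit(sector64_t s)
{
    return s > UINT32_MAX;
}
```

With 512-byte sectors a 32 bit sector number tops out at 2 TiB, so an offset
3 TiB into a device already needs the wider type.
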
6. Other Changes/Implications
=============================

6.1 Partition re-mapping handled by the generic block layer
-----------------------------------------------------------

In 2.5 some of the gendisk/partition related code has been reorganized.
Now the generic block layer performs partition-remapping early and thus
provides drivers with a sector number relative to the whole device, rather
than having them take the partition number into account in order to arrive at
the true sector number. The routine blk_partition_remap() is invoked by
submit_bio_noacct even before invoking the queue specific ->submit_bio,
so the i/o scheduler also gets to operate on whole disk sector numbers. This
should typically not require changes to block drivers; a driver just never
gets to invoke its own partition sector offset calculations since all bios
sent are offset from the beginning of the device.

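The remapping itself is a bounds check plus an offset. The sketch below models
it in userspace; struct part and remap_to_disk() are hypothetical names and do
not reproduce the signature or internals of blk_partition_remap().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one partition, in sectors. */
struct part {
    uint64_t start_sect;   /* offset of the partition from start of disk */
    uint64_t nr_sects;     /* size of the partition */
};

/*
 * Translate a partition-relative sector into a whole-disk sector.
 * Returns false if the I/O would run past the end of the partition.
 */
bool remap_to_disk(const struct part *p, uint64_t part_sector,
                   uint64_t nsect, uint64_t *disk_sector)
{
    if (part_sector > p->nr_sects || nsect > p->nr_sects - part_sector)
        return false;
    *disk_sector = p->start_sect + part_sector;
    return true;
}
```

For a partition that starts 123128 sectors into the disk, partition sector 0
becomes disk sector 123128, which is the value a driver now sees directly.
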

7. A Few Tips on Migration of older drivers
===========================================

Old-style drivers that just use CURRENT and ignore clustered requests
may not need much change. The generic layer will automatically handle
clustered requests, multi-page bios, etc. for the driver.

For a low performance driver or hardware that is PIO driven or just doesn't
support scatter-gather, changes should be minimal too.

The following are some points to keep in mind when converting old drivers
to bio.

Drivers should use elv_next_request to pick up requests and are no longer
supposed to handle looping directly over the request list.
(struct request->queue has been removed)

Now end_that_request_first takes an additional number_of_sectors argument.
It used to always handle just the first buffer_head in a request; now
it will loop and handle as many sectors (on a bio-segment granularity)
as specified.

Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
right thing to use is bio_endio(bio) instead.

If the driver is dropping the io_request_lock from its request_fn strategy,
then it just needs to replace that with q->queue_lock instead.

As described in Sec 1.1, drivers can set max sector size, max segment size
etc per queue now. Drivers that used to define their own merge functions
to handle things like this can now just use the blk_queue_* functions at
blk_init_queue time.

Drivers no longer have to map a {partition, sector offset} into the
correct absolute location; this is done by the block layer. So
where a driver received a request like this before::

    rq->rq_dev = mk_kdev(3, 5);    /* /dev/hda5 */
    rq->sector = 0;                /* first sector on hda5 */

it will now see::

    rq->rq_dev = mk_kdev(3, 0);    /* /dev/hda */
    rq->sector = 123128;           /* offset from start of disk */

As mentioned, there is no virtual mapping of a bio. For DMA, this is
not a problem as the driver probably never will need a virtual mapping.
Instead it needs a bus mapping (dma_map_page for a single segment or
dma_map_sg for scatter gather) to be able to ship it to the device. For
PIO drivers (or drivers that need to revert to PIO transfer once in a
while (IDE for example)), where the CPU is doing the actual data
transfer, a virtual mapping is needed. If the driver supports highmem I/O
(Sec 1.1, (ii)), it needs to use kmap_atomic or similar to temporarily map
a bio into the virtual address space.


8. Prior/Related/Impacted patches
=================================

8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
-----------------------------------------------------

- orig kiobuf & raw i/o patches (now in 2.4 tree)
- direct kiobuf based i/o to devices (no intermediate bh's)
- page i/o using kiobuf
- kiobuf splitting for lvm (mkp)
- elevator support for kiobuf request merging (axboe)

8.2. Zero-copy networking (Dave Miller)
---------------------------------------

8.3. SGI XFS - pagebuf patches - use of kiobufs
-----------------------------------------------

8.4. Multi-page pioent patch for bio (Christoph Hellwig)
--------------------------------------------------------

8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
--------------------------------------------------------------------

8.6. Async i/o implementation patch (Ben LaHaise)
-------------------------------------------------

8.7. EVMS layering design (IBM EVMS team)
-----------------------------------------

8.8. Larger page cache size patch (Ben LaHaise) and Large page size (Daniel Phillips)
-------------------------------------------------------------------------------------

    => larger contiguous physical memory buffers

8.9. VM reservations patch (Ben LaHaise)
----------------------------------------

8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
----------------------------------------------------------

8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
---------------------------------------------------------------------------

8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, Badari)
-------------------------------------------------------------------------------

8.13 Priority based i/o scheduler - prepatches (Arjan van de Ven)
------------------------------------------------------------------

8.14 IDE Taskfile i/o patch (Andre Hedrick)
--------------------------------------------

8.15 Multi-page writeout and readahead patches (Andrew Morton)
---------------------------------------------------------------

8.16 Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarty)
-----------------------------------------------------------------------

9. Other References
===================

9.1 The Splice I/O Model
------------------------

Larry McVoy (and subsequent discussions on lkml, and Linus' comments - Jan 2001)

9.2 Discussions about kiobuf and bh design
------------------------------------------

On lkml between sct, linus, alan et al - Feb-March 2001 (many of the
initial thoughts that led to bio were brought up in this discussion thread)

9.3 Discussions on mempool on lkml - Dec 2001.
----------------------------------------------