Dave Chinner | a9a745d | 2010-05-14 21:43:11 +1000 | [diff] [blame^] | 1 | XFS Delayed Logging Design |
| 2 | -------------------------- |
| 3 | |
| 4 | Introduction to Re-logging in XFS |
| 5 | --------------------------------- |
| 6 | |
| 7 | XFS logging is a combination of logical and physical logging. Some objects, |
| 8 | such as inodes and dquots, are logged in logical format where the details |
| 9 | logged are made up of the changes to in-core structures rather than on-disk |
| 10 | structures. Other objects - typically buffers - have their physical changes |
| 11 | logged. The reason for these differences is to reduce the amount of log space |
| 12 | required for objects that are frequently logged. Some parts of inodes are more |
| 13 | frequently logged than others, and inodes are typically more frequently logged |
| 14 | than any other object (except maybe the superblock buffer) so keeping the |
| 15 | amount of metadata logged low is of prime importance. |
| 16 | |
| 17 | The reason that this is such a concern is that XFS allows multiple separate |
| 18 | modifications to a single object to be carried in the log at any given time. |
| 19 | This allows the log to avoid needing to flush each change to disk before |
| 20 | recording a new change to the object. XFS does this via a method called |
| 21 | "re-logging". Conceptually, this is quite simple - all it requires is that any |
| 22 | new change to the object is recorded with a *new copy* of all the existing |
| 23 | changes in the new transaction that is written to the log. |
| 24 | |
| 25 | That is, if we have a sequence of changes A through to F, and the object was |
| 26 | written to disk after change D, we would see in the log the following series |
| 27 | of transactions, their contents and the log sequence number (LSN) of the |
| 28 | transaction: |
| 29 | |
| 30 | Transaction Contents LSN |
| 31 | A A X |
| 32 | B A+B X+n |
| 33 | C A+B+C X+n+m |
| 34 | D A+B+C+D X+n+m+o |
| 35 | <object written to disk> |
| 36 | E E Y (> X+n+m+o) |
| 37 | F E+F Yٍ+p |
| 38 | |
| 39 | In other words, each time an object is relogged, the new transaction contains |
| 40 | the aggregation of all the previous changes currently held only in the log. |
| 41 | |
| 42 | This relogging technique also allows objects to be moved forward in the log so |
| 43 | that an object being relogged does not prevent the tail of the log from ever |
| 44 | moving forward. This can be seen in the table above by the changing |
| 45 | (increasing) LSN of each subsquent transaction - the LSN is effectively a |
| 46 | direct encoding of the location in the log of the transaction. |
| 47 | |
| 48 | This relogging is also used to implement long-running, multiple-commit |
| 49 | transactions. These transaction are known as rolling transactions, and require |
| 50 | a special log reservation known as a permanent transaction reservation. A |
| 51 | typical example of a rolling transaction is the removal of extents from an |
| 52 | inode which can only be done at a rate of two extents per transaction because |
| 53 | of reservation size limitations. Hence a rolling extent removal transaction |
| 54 | keeps relogging the inode and btree buffers as they get modified in each |
| 55 | removal operation. This keeps them moving forward in the log as the operation |
| 56 | progresses, ensuring that current operation never gets blocked by itself if the |
| 57 | log wraps around. |
| 58 | |
| 59 | Hence it can be seen that the relogging operation is fundamental to the correct |
| 60 | working of the XFS journalling subsystem. From the above description, most |
| 61 | people should be able to see why the XFS metadata operations writes so much to |
| 62 | the log - repeated operations to the same objects write the same changes to |
| 63 | the log over and over again. Worse is the fact that objects tend to get |
| 64 | dirtier as they get relogged, so each subsequent transaction is writing more |
| 65 | metadata into the log. |
| 66 | |
| 67 | Another feature of the XFS transaction subsystem is that most transactions are |
| 68 | asynchronous. That is, they don't commit to disk until either a log buffer is |
| 69 | filled (a log buffer can hold multiple transactions) or a synchronous operation |
| 70 | forces the log buffers holding the transactions to disk. This means that XFS is |
| 71 | doing aggregation of transactions in memory - batching them, if you like - to |
| 72 | minimise the impact of the log IO on transaction throughput. |
| 73 | |
| 74 | The limitation on asynchronous transaction throughput is the number and size of |
| 75 | log buffers made available by the log manager. By default there are 8 log |
| 76 | buffers available and the size of each is 32kB - the size can be increased up |
| 77 | to 256kB by use of a mount option. |
| 78 | |
| 79 | Effectively, this gives us the maximum bound of outstanding metadata changes |
| 80 | that can be made to the filesystem at any point in time - if all the log |
| 81 | buffers are full and under IO, then no more transactions can be committed until |
| 82 | the current batch completes. It is now common for a single current CPU core to |
| 83 | be to able to issue enough transactions to keep the log buffers full and under |
| 84 | IO permanently. Hence the XFS journalling subsystem can be considered to be IO |
| 85 | bound. |
| 86 | |
| 87 | Delayed Logging: Concepts |
| 88 | ------------------------- |
| 89 | |
| 90 | The key thing to note about the asynchronous logging combined with the |
| 91 | relogging technique XFS uses is that we can be relogging changed objects |
| 92 | multiple times before they are committed to disk in the log buffers. If we |
| 93 | return to the previous relogging example, it is entirely possible that |
| 94 | transactions A through D are committed to disk in the same log buffer. |
| 95 | |
| 96 | That is, a single log buffer may contain multiple copies of the same object, |
| 97 | but only one of those copies needs to be there - the last one "D", as it |
| 98 | contains all the changes from the previous changes. In other words, we have one |
| 99 | necessary copy in the log buffer, and three stale copies that are simply |
| 100 | wasting space. When we are doing repeated operations on the same set of |
| 101 | objects, these "stale objects" can be over 90% of the space used in the log |
| 102 | buffers. It is clear that reducing the number of stale objects written to the |
| 103 | log would greatly reduce the amount of metadata we write to the log, and this |
| 104 | is the fundamental goal of delayed logging. |
| 105 | |
| 106 | From a conceptual point of view, XFS is already doing relogging in memory (where |
| 107 | memory == log buffer), only it is doing it extremely inefficiently. It is using |
| 108 | logical to physical formatting to do the relogging because there is no |
| 109 | infrastructure to keep track of logical changes in memory prior to physically |
| 110 | formatting the changes in a transaction to the log buffer. Hence we cannot avoid |
| 111 | accumulating stale objects in the log buffers. |
| 112 | |
| 113 | Delayed logging is the name we've given to keeping and tracking transactional |
| 114 | changes to objects in memory outside the log buffer infrastructure. Because of |
| 115 | the relogging concept fundamental to the XFS journalling subsystem, this is |
| 116 | actually relatively easy to do - all the changes to logged items are already |
| 117 | tracked in the current infrastructure. The big problem is how to accumulate |
| 118 | them and get them to the log in a consistent, recoverable manner. |
| 119 | Describing the problems and how they have been solved is the focus of this |
| 120 | document. |
| 121 | |
| 122 | One of the key changes that delayed logging makes to the operation of the |
| 123 | journalling subsystem is that it disassociates the amount of outstanding |
| 124 | metadata changes from the size and number of log buffers available. In other |
| 125 | words, instead of there only being a maximum of 2MB of transaction changes not |
| 126 | written to the log at any point in time, there may be a much greater amount |
| 127 | being accumulated in memory. Hence the potential for loss of metadata on a |
| 128 | crash is much greater than for the existing logging mechanism. |
| 129 | |
| 130 | It should be noted that this does not change the guarantee that log recovery |
| 131 | will result in a consistent filesystem. What it does mean is that as far as the |
| 132 | recovered filesystem is concerned, there may be many thousands of transactions |
| 133 | that simply did not occur as a result of the crash. This makes it even more |
| 134 | important that applications that care about their data use fsync() where they |
| 135 | need to ensure application level data integrity is maintained. |
| 136 | |
| 137 | It should be noted that delayed logging is not an innovative new concept that |
| 138 | warrants rigorous proofs to determine whether it is correct or not. The method |
| 139 | of accumulating changes in memory for some period before writing them to the |
| 140 | log is used effectively in many filesystems including ext3 and ext4. Hence |
| 141 | no time is spent in this document trying to convince the reader that the |
| 142 | concept is sound. Instead it is simply considered a "solved problem" and as |
| 143 | such implementing it in XFS is purely an exercise in software engineering. |
| 144 | |
| 145 | The fundamental requirements for delayed logging in XFS are simple: |
| 146 | |
| 147 | 1. Reduce the amount of metadata written to the log by at least |
| 148 | an order of magnitude. |
| 149 | 2. Supply sufficient statistics to validate Requirement #1. |
| 150 | 3. Supply sufficient new tracing infrastructure to be able to debug |
| 151 | problems with the new code. |
| 152 | 4. No on-disk format change (metadata or log format). |
| 153 | 5. Enable and disable with a mount option. |
| 154 | 6. No performance regressions for synchronous transaction workloads. |
| 155 | |
| 156 | Delayed Logging: Design |
| 157 | ----------------------- |
| 158 | |
| 159 | Storing Changes |
| 160 | |
| 161 | The problem with accumulating changes at a logical level (i.e. just using the |
| 162 | existing log item dirty region tracking) is that when it comes to writing the |
| 163 | changes to the log buffers, we need to ensure that the object we are formatting |
| 164 | is not changing while we do this. This requires locking the object to prevent |
| 165 | concurrent modification. Hence flushing the logical changes to the log would |
| 166 | require us to lock every object, format them, and then unlock them again. |
| 167 | |
| 168 | This introduces lots of scope for deadlocks with transactions that are already |
| 169 | running. For example, a transaction has object A locked and modified, but needs |
| 170 | the delayed logging tracking lock to commit the transaction. However, the |
| 171 | flushing thread has the delayed logging tracking lock already held, and is |
| 172 | trying to get the lock on object A to flush it to the log buffer. This appears |
| 173 | to be an unsolvable deadlock condition, and it was solving this problem that |
| 174 | was the barrier to implementing delayed logging for so long. |
| 175 | |
| 176 | The solution is relatively simple - it just took a long time to recognise it. |
| 177 | Put simply, the current logging code formats the changes to each item into an |
| 178 | vector array that points to the changed regions in the item. The log write code |
| 179 | simply copies the memory these vectors point to into the log buffer during |
| 180 | transaction commit while the item is locked in the transaction. Instead of |
| 181 | using the log buffer as the destination of the formatting code, we can use an |
| 182 | allocated memory buffer big enough to fit the formatted vector. |
| 183 | |
| 184 | If we then copy the vector into the memory buffer and rewrite the vector to |
| 185 | point to the memory buffer rather than the object itself, we now have a copy of |
| 186 | the changes in a format that is compatible with the log buffer writing code. |
| 187 | that does not require us to lock the item to access. This formatting and |
| 188 | rewriting can all be done while the object is locked during transaction commit, |
| 189 | resulting in a vector that is transactionally consistent and can be accessed |
| 190 | without needing to lock the owning item. |
| 191 | |
| 192 | Hence we avoid the need to lock items when we need to flush outstanding |
| 193 | asynchronous transactions to the log. The differences between the existing |
| 194 | formatting method and the delayed logging formatting can be seen in the |
| 195 | diagram below. |
| 196 | |
| 197 | Current format log vector: |
| 198 | |
| 199 | Object +---------------------------------------------+ |
| 200 | Vector 1 +----+ |
| 201 | Vector 2 +----+ |
| 202 | Vector 3 +----------+ |
| 203 | |
| 204 | After formatting: |
| 205 | |
| 206 | Log Buffer +-V1-+-V2-+----V3----+ |
| 207 | |
| 208 | Delayed logging vector: |
| 209 | |
| 210 | Object +---------------------------------------------+ |
| 211 | Vector 1 +----+ |
| 212 | Vector 2 +----+ |
| 213 | Vector 3 +----------+ |
| 214 | |
| 215 | After formatting: |
| 216 | |
| 217 | Memory Buffer +-V1-+-V2-+----V3----+ |
| 218 | Vector 1 +----+ |
| 219 | Vector 2 +----+ |
| 220 | Vector 3 +----------+ |
| 221 | |
| 222 | The memory buffer and associated vector need to be passed as a single object, |
| 223 | but still need to be associated with the parent object so if the object is |
| 224 | relogged we can replace the current memory buffer with a new memory buffer that |
| 225 | contains the latest changes. |
| 226 | |
| 227 | The reason for keeping the vector around after we've formatted the memory |
| 228 | buffer is to support splitting vectors across log buffer boundaries correctly. |
| 229 | If we don't keep the vector around, we do not know where the region boundaries |
| 230 | are in the item, so we'd need a new encapsulation method for regions in the log |
| 231 | buffer writing (i.e. double encapsulation). This would be an on-disk format |
| 232 | change and as such is not desirable. It also means we'd have to write the log |
| 233 | region headers in the formatting stage, which is problematic as there is per |
| 234 | region state that needs to be placed into the headers during the log write. |
| 235 | |
| 236 | Hence we need to keep the vector, but by attaching the memory buffer to it and |
| 237 | rewriting the vector addresses to point at the memory buffer we end up with a |
| 238 | self-describing object that can be passed to the log buffer write code to be |
| 239 | handled in exactly the same manner as the existing log vectors are handled. |
| 240 | Hence we avoid needing a new on-disk format to handle items that have been |
| 241 | relogged in memory. |
| 242 | |
| 243 | |
| 244 | Tracking Changes |
| 245 | |
| 246 | Now that we can record transactional changes in memory in a form that allows |
| 247 | them to be used without limitations, we need to be able to track and accumulate |
| 248 | them so that they can be written to the log at some later point in time. The |
| 249 | log item is the natural place to store this vector and buffer, and also makes sense |
| 250 | to be the object that is used to track committed objects as it will always |
| 251 | exist once the object has been included in a transaction. |
| 252 | |
| 253 | The log item is already used to track the log items that have been written to |
| 254 | the log but not yet written to disk. Such log items are considered "active" |
| 255 | and as such are stored in the Active Item List (AIL) which is a LSN-ordered |
| 256 | double linked list. Items are inserted into this list during log buffer IO |
| 257 | completion, after which they are unpinned and can be written to disk. An object |
| 258 | that is in the AIL can be relogged, which causes the object to be pinned again |
| 259 | and then moved forward in the AIL when the log buffer IO completes for that |
| 260 | transaction. |
| 261 | |
| 262 | Essentially, this shows that an item that is in the AIL can still be modified |
| 263 | and relogged, so any tracking must be separate to the AIL infrastructure. As |
| 264 | such, we cannot reuse the AIL list pointers for tracking committed items, nor |
| 265 | can we store state in any field that is protected by the AIL lock. Hence the |
| 266 | committed item tracking needs it's own locks, lists and state fields in the log |
| 267 | item. |
| 268 | |
| 269 | Similar to the AIL, tracking of committed items is done through a new list |
| 270 | called the Committed Item List (CIL). The list tracks log items that have been |
| 271 | committed and have formatted memory buffers attached to them. It tracks objects |
| 272 | in transaction commit order, so when an object is relogged it is removed from |
| 273 | it's place in the list and re-inserted at the tail. This is entirely arbitrary |
| 274 | and done to make it easy for debugging - the last items in the list are the |
| 275 | ones that are most recently modified. Ordering of the CIL is not necessary for |
| 276 | transactional integrity (as discussed in the next section) so the ordering is |
| 277 | done for convenience/sanity of the developers. |
| 278 | |
| 279 | |
| 280 | Delayed Logging: Checkpoints |
| 281 | |
| 282 | When we have a log synchronisation event, commonly known as a "log force", |
| 283 | all the items in the CIL must be written into the log via the log buffers. |
| 284 | We need to write these items in the order that they exist in the CIL, and they |
| 285 | need to be written as an atomic transaction. The need for all the objects to be |
| 286 | written as an atomic transaction comes from the requirements of relogging and |
| 287 | log replay - all the changes in all the objects in a given transaction must |
| 288 | either be completely replayed during log recovery, or not replayed at all. If |
| 289 | a transaction is not replayed because it is not complete in the log, then |
| 290 | no later transactions should be replayed, either. |
| 291 | |
| 292 | To fulfill this requirement, we need to write the entire CIL in a single log |
| 293 | transaction. Fortunately, the XFS log code has no fixed limit on the size of a |
| 294 | transaction, nor does the log replay code. The only fundamental limit is that |
| 295 | the transaction cannot be larger than just under half the size of the log. The |
| 296 | reason for this limit is that to find the head and tail of the log, there must |
| 297 | be at least one complete transaction in the log at any given time. If a |
| 298 | transaction is larger than half the log, then there is the possibility that a |
| 299 | crash during the write of a such a transaction could partially overwrite the |
| 300 | only complete previous transaction in the log. This will result in a recovery |
| 301 | failure and an inconsistent filesystem and hence we must enforce the maximum |
| 302 | size of a checkpoint to be slightly less than a half the log. |
| 303 | |
| 304 | Apart from this size requirement, a checkpoint transaction looks no different |
| 305 | to any other transaction - it contains a transaction header, a series of |
| 306 | formatted log items and a commit record at the tail. From a recovery |
| 307 | perspective, the checkpoint transaction is also no different - just a lot |
| 308 | bigger with a lot more items in it. The worst case effect of this is that we |
| 309 | might need to tune the recovery transaction object hash size. |
| 310 | |
| 311 | Because the checkpoint is just another transaction and all the changes to log |
| 312 | items are stored as log vectors, we can use the existing log buffer writing |
| 313 | code to write the changes into the log. To do this efficiently, we need to |
| 314 | minimise the time we hold the CIL locked while writing the checkpoint |
| 315 | transaction. The current log write code enables us to do this easily with the |
| 316 | way it separates the writing of the transaction contents (the log vectors) from |
| 317 | the transaction commit record, but tracking this requires us to have a |
| 318 | per-checkpoint context that travels through the log write process through to |
| 319 | checkpoint completion. |
| 320 | |
| 321 | Hence a checkpoint has a context that tracks the state of the current |
| 322 | checkpoint from initiation to checkpoint completion. A new context is initiated |
| 323 | at the same time a checkpoint transaction is started. That is, when we remove |
| 324 | all the current items from the CIL during a checkpoint operation, we move all |
| 325 | those changes into the current checkpoint context. We then initialise a new |
| 326 | context and attach that to the CIL for aggregation of new transactions. |
| 327 | |
| 328 | This allows us to unlock the CIL immediately after transfer of all the |
| 329 | committed items and effectively allow new transactions to be issued while we |
| 330 | are formatting the checkpoint into the log. It also allows concurrent |
| 331 | checkpoints to be written into the log buffers in the case of log force heavy |
| 332 | workloads, just like the existing transaction commit code does. This, however, |
| 333 | requires that we strictly order the commit records in the log so that |
| 334 | checkpoint sequence order is maintained during log replay. |
| 335 | |
| 336 | To ensure that we can be writing an item into a checkpoint transaction at |
| 337 | the same time another transaction modifies the item and inserts the log item |
| 338 | into the new CIL, then checkpoint transaction commit code cannot use log items |
| 339 | to store the list of log vectors that need to be written into the transaction. |
| 340 | Hence log vectors need to be able to be chained together to allow them to be |
| 341 | detatched from the log items. That is, when the CIL is flushed the memory |
| 342 | buffer and log vector attached to each log item needs to be attached to the |
| 343 | checkpoint context so that the log item can be released. In diagrammatic form, |
| 344 | the CIL would look like this before the flush: |
| 345 | |
| 346 | CIL Head |
| 347 | | |
| 348 | V |
| 349 | Log Item <-> log vector 1 -> memory buffer |
| 350 | | -> vector array |
| 351 | V |
| 352 | Log Item <-> log vector 2 -> memory buffer |
| 353 | | -> vector array |
| 354 | V |
| 355 | ...... |
| 356 | | |
| 357 | V |
| 358 | Log Item <-> log vector N-1 -> memory buffer |
| 359 | | -> vector array |
| 360 | V |
| 361 | Log Item <-> log vector N -> memory buffer |
| 362 | -> vector array |
| 363 | |
| 364 | And after the flush the CIL head is empty, and the checkpoint context log |
| 365 | vector list would look like: |
| 366 | |
| 367 | Checkpoint Context |
| 368 | | |
| 369 | V |
| 370 | log vector 1 -> memory buffer |
| 371 | | -> vector array |
| 372 | | -> Log Item |
| 373 | V |
| 374 | log vector 2 -> memory buffer |
| 375 | | -> vector array |
| 376 | | -> Log Item |
| 377 | V |
| 378 | ...... |
| 379 | | |
| 380 | V |
| 381 | log vector N-1 -> memory buffer |
| 382 | | -> vector array |
| 383 | | -> Log Item |
| 384 | V |
| 385 | log vector N -> memory buffer |
| 386 | -> vector array |
| 387 | -> Log Item |
| 388 | |
| 389 | Once this transfer is done, the CIL can be unlocked and new transactions can |
| 390 | start, while the checkpoint flush code works over the log vector chain to |
| 391 | commit the checkpoint. |
| 392 | |
| 393 | Once the checkpoint is written into the log buffers, the checkpoint context is |
| 394 | attached to the log buffer that the commit record was written to along with a |
| 395 | completion callback. Log IO completion will call that callback, which can then |
| 396 | run transaction committed processing for the log items (i.e. insert into AIL |
| 397 | and unpin) in the log vector chain and then free the log vector chain and |
| 398 | checkpoint context. |
| 399 | |
| 400 | Discussion Point: I am uncertain as to whether the log item is the most |
| 401 | efficient way to track vectors, even though it seems like the natural way to do |
| 402 | it. The fact that we walk the log items (in the CIL) just to chain the log |
| 403 | vectors and break the link between the log item and the log vector means that |
| 404 | we take a cache line hit for the log item list modification, then another for |
| 405 | the log vector chaining. If we track by the log vectors, then we only need to |
| 406 | break the link between the log item and the log vector, which means we should |
| 407 | dirty only the log item cachelines. Normally I wouldn't be concerned about one |
| 408 | vs two dirty cachelines except for the fact I've seen upwards of 80,000 log |
| 409 | vectors in one checkpoint transaction. I'd guess this is a "measure and |
| 410 | compare" situation that can be done after a working and reviewed implementation |
| 411 | is in the dev tree.... |
| 412 | |
| 413 | Delayed Logging: Checkpoint Sequencing |
| 414 | |
| 415 | One of the key aspects of the XFS transaction subsystem is that it tags |
| 416 | committed transactions with the log sequence number of the transaction commit. |
| 417 | This allows transactions to be issued asynchronously even though there may be |
| 418 | future operations that cannot be completed until that transaction is fully |
| 419 | committed to the log. In the rare case that a dependent operation occurs (e.g. |
| 420 | re-using a freed metadata extent for a data extent), a special, optimised log |
| 421 | force can be issued to force the dependent transaction to disk immediately. |
| 422 | |
| 423 | To do this, transactions need to record the LSN of the commit record of the |
| 424 | transaction. This LSN comes directly from the log buffer the transaction is |
| 425 | written into. While this works just fine for the existing transaction |
| 426 | mechanism, it does not work for delayed logging because transactions are not |
| 427 | written directly into the log buffers. Hence some other method of sequencing |
| 428 | transactions is required. |
| 429 | |
| 430 | As discussed in the checkpoint section, delayed logging uses per-checkpoint |
| 431 | contexts, and as such it is simple to assign a sequence number to each |
| 432 | checkpoint. Because the switching of checkpoint contexts must be done |
| 433 | atomically, it is simple to ensure that each new context has a monotonically |
| 434 | increasing sequence number assigned to it without the need for an external |
| 435 | atomic counter - we can just take the current context sequence number and add |
| 436 | one to it for the new context. |
| 437 | |
| 438 | Then, instead of assigning a log buffer LSN to the transaction commit LSN |
| 439 | during the commit, we can assign the current checkpoint sequence. This allows |
| 440 | operations that track transactions that have not yet completed know what |
| 441 | checkpoint sequence needs to be committed before they can continue. As a |
| 442 | result, the code that forces the log to a specific LSN now needs to ensure that |
| 443 | the log forces to a specific checkpoint. |
| 444 | |
| 445 | To ensure that we can do this, we need to track all the checkpoint contexts |
| 446 | that are currently committing to the log. When we flush a checkpoint, the |
| 447 | context gets added to a "committing" list which can be searched. When a |
| 448 | checkpoint commit completes, it is removed from the committing list. Because |
| 449 | the checkpoint context records the LSN of the commit record for the checkpoint, |
| 450 | we can also wait on the log buffer that contains the commit record, thereby |
| 451 | using the existing log force mechanisms to execute synchronous forces. |
| 452 | |
| 453 | It should be noted that the synchronous forces may need to be extended with |
| 454 | mitigation algorithms similar to the current log buffer code to allow |
| 455 | aggregation of multiple synchronous transactions if there are already |
| 456 | synchronous transactions being flushed. Investigation of the performance of the |
| 457 | current design is needed before making any decisions here. |
| 458 | |
| 459 | The main concern with log forces is to ensure that all the previous checkpoints |
| 460 | are also committed to disk before the one we need to wait for. Therefore we |
| 461 | need to check that all the prior contexts in the committing list are also |
| 462 | complete before waiting on the one we need to complete. We do this |
| 463 | synchronisation in the log force code so that we don't need to wait anywhere |
| 464 | else for such serialisation - it only matters when we do a log force. |
| 465 | |
| 466 | The only remaining complexity is that a log force now also has to handle the |
| 467 | case where the forcing sequence number is the same as the current context. That |
| 468 | is, we need to flush the CIL and potentially wait for it to complete. This is a |
| 469 | simple addition to the existing log forcing code to check the sequence numbers |
| 470 | and push if required. Indeed, placing the current sequence checkpoint flush in |
| 471 | the log force code enables the current mechanism for issuing synchronous |
| 472 | transactions to remain untouched (i.e. commit an asynchronous transaction, then |
| 473 | force the log at the LSN of that transaction) and so the higher level code |
| 474 | behaves the same regardless of whether delayed logging is being used or not. |
| 475 | |
| 476 | Delayed Logging: Checkpoint Log Space Accounting |
| 477 | |
| 478 | The big issue for a checkpoint transaction is the log space reservation for the |
| 479 | transaction. We don't know how big a checkpoint transaction is going to be |
| 480 | ahead of time, nor how many log buffers it will take to write out, nor the |
| 481 | number of split log vector regions are going to be used. We can track the |
| 482 | amount of log space required as we add items to the commit item list, but we |
| 483 | still need to reserve the space in the log for the checkpoint. |
| 484 | |
| 485 | A typical transaction reserves enough space in the log for the worst case space |
| 486 | usage of the transaction. The reservation accounts for log record headers, |
| 487 | transaction and region headers, headers for split regions, buffer tail padding, |
| 488 | etc. as well as the actual space for all the changed metadata in the |
| 489 | transaction. While some of this is fixed overhead, much of it is dependent on |
| 490 | the size of the transaction and the number of regions being logged (the number |
| 491 | of log vectors in the transaction). |
| 492 | |
| 493 | An example of the differences would be logging directory changes versus logging |
| 494 | inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then |
| 495 | there are lots of transactions that only contain an inode core and an inode log |
| 496 | format structure. That is, two vectors totaling roughly 150 bytes. If we modify |
| 497 | 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each |
| 498 | vector is 12 bytes, so the total to be logged is approximately 1.75MB. In |
| 499 | comparison, if we are logging full directory buffers, they are typically 4KB |
| 500 | each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a |
| 501 | buffer format structure for each buffer - roughly 800 vectors or 1.51MB total |
| 502 | space. From this, it should be obvious that a static log space reservation is |
| 503 | not particularly flexible and is difficult to select the "optimal value" for |
| 504 | all workloads. |
| 505 | |
| 506 | Further, if we are going to use a static reservation, which bit of the entire |
| 507 | reservation does it cover? We account for space used by the transaction |
| 508 | reservation by tracking the space currently used by the object in the CIL and |
| 509 | then calculating the increase or decrease in space used as the object is |
| 510 | relogged. This allows for a checkpoint reservation to only have to account for |
| 511 | log buffer metadata used such as log header records. |
| 512 | |
| 513 | However, even using a static reservation for just the log metadata is |
| 514 | problematic. Typically log record headers use at least 16KB of log space per |
| 515 | 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be |
| 516 | large enough to handle arbitrary sized checkpoint transactions. This |
| 517 | reservation needs to be made before the checkpoint is started, and we need to |
| 518 | be able to reserve the space without sleeping. For a 8MB checkpoint, we need a |
| 519 | reservation of around 150KB, which is a non-trivial amount of space. |
| 520 | |
| 521 | A static reservation needs to manipulate the log grant counters - we can take a |
| 522 | permanent reservation on the space, but we still need to make sure we refresh |
| 523 | the write reservation (the actual space available to the transaction) after |
| 524 | every checkpoint transaction completion. Unfortunately, if this space is not |
| 525 | available when required, then the regrant code will sleep waiting for it. |
| 526 | |
| 527 | The problem with this is that it can lead to deadlocks as we may need to commit |
| 528 | checkpoints to be able to free up log space (refer back to the description of |
| 529 | rolling transactions for an example of this). Hence we *must* always have |
| 530 | space available in the log if we are to use static reservations, and that is |
| 531 | very difficult and complex to arrange. It is possible to do, but there is a |
| 532 | simpler way. |
| 533 | |
| 534 | The simpler way of doing this is tracking the entire log space used by the |
| 535 | items in the CIL and using this to dynamically calculate the amount of log |
| 536 | space required by the log metadata. If this log metadata space changes as a |
| 537 | result of a transaction commit inserting a new memory buffer into the CIL, then |
| 538 | the difference in space required is removed from the transaction that causes |
| 539 | the change. Transactions at this level will *always* have enough space |
| 540 | available in their reservation for this as they have already reserved the |
| 541 | maximal amount of log metadata space they require, and such a delta reservation |
| 542 | will always be less than or equal to the maximal amount in the reservation. |
| 543 | |
| 544 | Hence we can grow the checkpoint transaction reservation dynamically as items |
| 545 | are added to the CIL and avoid the need for reserving and regranting log space |
| 546 | up front. This avoids deadlocks and removes a blocking point from the |
| 547 | checkpoint flush code. |
| 548 | |
| 549 | As mentioned early, transactions can't grow to more than half the size of the |
| 550 | log. Hence as part of the reservation growing, we need to also check the size |
| 551 | of the reservation against the maximum allowed transaction size. If we reach |
| 552 | the maximum threshold, we need to push the CIL to the log. This is effectively |
| 553 | a "background flush" and is done on demand. This is identical to |
| 554 | a CIL push triggered by a log force, only that there is no waiting for the |
| 555 | checkpoint commit to complete. This background push is checked and executed by |
| 556 | transaction commit code. |
| 557 | |
| 558 | If the transaction subsystem goes idle while we still have items in the CIL, |
| 559 | they will be flushed by the periodic log force issued by the xfssyncd. This log |
| 560 | force will push the CIL to disk, and if the transaction subsystem stays idle, |
| 561 | allow the idle log to be covered (effectively marked clean) in exactly the same |
| 562 | manner that is done for the existing logging method. A discussion point is |
| 563 | whether this log force needs to be done more frequently than the current rate |
| 564 | which is once every 30s. |
| 565 | |
| 566 | |
| 567 | Delayed Logging: Log Item Pinning |
| 568 | |
| 569 | Currently log items are pinned during transaction commit while the items are |
| 570 | still locked. This happens just after the items are formatted, though it could |
| 571 | be done any time before the items are unlocked. The result of this mechanism is |
| 572 | that items get pinned once for every transaction that is committed to the log |
| 573 | buffers. Hence items that are relogged in the log buffers will have a pin count |
| 574 | for every outstanding transaction they were dirtied in. When each of these |
| 575 | transactions is completed, they will unpin the item once. As a result, the item |
| 576 | only becomes unpinned when all the transactions complete and there are no |
| 577 | pending transactions. Thus the pinning and unpinning of a log item is symmetric |
| 578 | as there is a 1:1 relationship with transaction commit and log item completion. |
| 579 | |
| 580 | For delayed logging, however, we have an assymetric transaction commit to |
| 581 | completion relationship. Every time an object is relogged in the CIL it goes |
| 582 | through the commit process without a corresponding completion being registered. |
| 583 | That is, we now have a many-to-one relationship between transaction commit and |
| 584 | log item completion. The result of this is that pinning and unpinning of the |
| 585 | log items becomes unbalanced if we retain the "pin on transaction commit, unpin |
| 586 | on transaction completion" model. |
| 587 | |
| 588 | To keep pin/unpin symmetry, the algorithm needs to change to a "pin on |
| 589 | insertion into the CIL, unpin on checkpoint completion". In other words, the |
| 590 | pinning and unpinning becomes symmetric around a checkpoint context. We have to |
| 591 | pin the object the first time it is inserted into the CIL - if it is already in |
| 592 | the CIL during a transaction commit, then we do not pin it again. Because there |
| 593 | can be multiple outstanding checkpoint contexts, we can still see elevated pin |
| 594 | counts, but as each checkpoint completes the pin count will retain the correct |
| 595 | value according to it's context. |
| 596 | |
| 597 | Just to make matters more slightly more complex, this checkpoint level context |
| 598 | for the pin count means that the pinning of an item must take place under the |
| 599 | CIL commit/flush lock. If we pin the object outside this lock, we cannot |
| 600 | guarantee which context the pin count is associated with. This is because of |
| 601 | the fact pinning the item is dependent on whether the item is present in the |
| 602 | current CIL or not. If we don't pin the CIL first before we check and pin the |
| 603 | object, we have a race with CIL being flushed between the check and the pin |
| 604 | (or not pinning, as the case may be). Hence we must hold the CIL flush/commit |
| 605 | lock to guarantee that we pin the items correctly. |
| 606 | |
| 607 | Delayed Logging: Concurrent Scalability |
| 608 | |
| 609 | A fundamental requirement for the CIL is that accesses through transaction |
| 610 | commits must scale to many concurrent commits. The current transaction commit |
| 611 | code does not break down even when there are transactions coming from 2048 |
| 612 | processors at once. The current transaction code does not go any faster than if |
| 613 | there was only one CPU using it, but it does not slow down either. |
| 614 | |
| 615 | As a result, the delayed logging transaction commit code needs to be designed |
| 616 | for concurrency from the ground up. It is obvious that there are serialisation |
| 617 | points in the design - the three important ones are: |
| 618 | |
| 619 | 1. Locking out new transaction commits while flushing the CIL |
| 620 | 2. Adding items to the CIL and updating item space accounting |
| 621 | 3. Checkpoint commit ordering |
| 622 | |
| 623 | Looking at the transaction commit and CIL flushing interactions, it is clear |
| 624 | that we have a many-to-one interaction here. That is, the only restriction on |
| 625 | the number of concurrent transactions that can be trying to commit at once is |
| 626 | the amount of space available in the log for their reservations. The practical |
| 627 | limit here is in the order of several hundred concurrent transactions for a |
| 628 | 128MB log, which means that it is generally one per CPU in a machine. |
| 629 | |
| 630 | The amount of time a transaction commit needs to hold out a flush is a |
| 631 | relatively long period of time - the pinning of log items needs to be done |
| 632 | while we are holding out a CIL flush, so at the moment that means it is held |
| 633 | across the formatting of the objects into memory buffers (i.e. while memcpy()s |
| 634 | are in progress). Ultimately a two pass algorithm where the formatting is done |
| 635 | separately to the pinning of objects could be used to reduce the hold time of |
| 636 | the transaction commit side. |
| 637 | |
| 638 | Because of the number of potential transaction commit side holders, the lock |
| 639 | really needs to be a sleeping lock - if the CIL flush takes the lock, we do not |
| 640 | want every other CPU in the machine spinning on the CIL lock. Given that |
| 641 | flushing the CIL could involve walking a list of tens of thousands of log |
| 642 | items, it will get held for a significant time and so spin contention is a |
| 643 | significant concern. Preventing lots of CPUs spinning doing nothing is the |
| 644 | main reason for choosing a sleeping lock even though nothing in either the |
| 645 | transaction commit or CIL flush side sleeps with the lock held. |
| 646 | |
| 647 | It should also be noted that CIL flushing is also a relatively rare operation |
| 648 | compared to transaction commit for asynchronous transaction workloads - only |
| 649 | time will tell if using a read-write semaphore for exclusion will limit |
| 650 | transaction commit concurrency due to cache line bouncing of the lock on the |
| 651 | read side. |
| 652 | |
| 653 | The second serialisation point is on the transaction commit side where items |
| 654 | are inserted into the CIL. Because transactions can enter this code |
| 655 | concurrently, the CIL needs to be protected separately from the above |
| 656 | commit/flush exclusion. It also needs to be an exclusive lock but it is only |
| 657 | held for a very short time and so a spin lock is appropriate here. It is |
| 658 | possible that this lock will become a contention point, but given the short |
| 659 | hold time once per transaction I think that contention is unlikely. |
| 660 | |
| 661 | The final serialisation point is the checkpoint commit record ordering code |
| 662 | that is run as part of the checkpoint commit and log force sequencing. The code |
| 663 | path that triggers a CIL flush (i.e. whatever triggers the log force) will enter |
| 664 | an ordering loop after writing all the log vectors into the log buffers but |
| 665 | before writing the commit record. This loop walks the list of committing |
| 666 | checkpoints and needs to block waiting for checkpoints to complete their commit |
| 667 | record write. As a result it needs a lock and a wait variable. Log force |
| 668 | sequencing also requires the same lock, list walk, and blocking mechanism to |
| 669 | ensure completion of checkpoints. |
| 670 | |
| 671 | These two sequencing operations can use the mechanism even though the |
| 672 | events they are waiting for are different. The checkpoint commit record |
| 673 | sequencing needs to wait until checkpoint contexts contain a commit LSN |
| 674 | (obtained through completion of a commit record write) while log force |
| 675 | sequencing needs to wait until previous checkpoint contexts are removed from |
| 676 | the committing list (i.e. they've completed). A simple wait variable and |
| 677 | broadcast wakeups (thundering herds) has been used to implement these two |
| 678 | serialisation queues. They use the same lock as the CIL, too. If we see too |
| 679 | much contention on the CIL lock, or too many context switches as a result of |
| 680 | the broadcast wakeups these operations can be put under a new spinlock and |
| 681 | given separate wait lists to reduce lock contention and the number of processes |
| 682 | woken by the wrong event. |
| 683 | |
| 684 | |
| 685 | Lifecycle Changes |
| 686 | |
| 687 | The existing log item life cycle is as follows: |
| 688 | |
| 689 | 1. Transaction allocate |
| 690 | 2. Transaction reserve |
| 691 | 3. Lock item |
| 692 | 4. Join item to transaction |
| 693 | If not already attached, |
| 694 | Allocate log item |
| 695 | Attach log item to owner item |
| 696 | Attach log item to transaction |
| 697 | 5. Modify item |
| 698 | Record modifications in log item |
| 699 | 6. Transaction commit |
| 700 | Pin item in memory |
| 701 | Format item into log buffer |
| 702 | Write commit LSN into transaction |
| 703 | Unlock item |
| 704 | Attach transaction to log buffer |
| 705 | |
| 706 | <log buffer IO dispatched> |
| 707 | <log buffer IO completes> |
| 708 | |
| 709 | 7. Transaction completion |
| 710 | Mark log item committed |
| 711 | Insert log item into AIL |
| 712 | Write commit LSN into log item |
| 713 | Unpin log item |
| 714 | 8. AIL traversal |
| 715 | Lock item |
| 716 | Mark log item clean |
| 717 | Flush item to disk |
| 718 | |
| 719 | <item IO completion> |
| 720 | |
| 721 | 9. Log item removed from AIL |
| 722 | Moves log tail |
| 723 | Item unlocked |
| 724 | |
| 725 | Essentially, steps 1-6 operate independently from step 7, which is also |
| 726 | independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 |
| 727 | at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur |
| 728 | at the same time. If the log item is in the AIL or between steps 6 and 7 |
| 729 | and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 |
| 730 | are entered and completed is the object considered clean. |
| 731 | |
| 732 | With delayed logging, there are new steps inserted into the life cycle: |
| 733 | |
| 734 | 1. Transaction allocate |
| 735 | 2. Transaction reserve |
| 736 | 3. Lock item |
| 737 | 4. Join item to transaction |
| 738 | If not already attached, |
| 739 | Allocate log item |
| 740 | Attach log item to owner item |
| 741 | Attach log item to transaction |
| 742 | 5. Modify item |
| 743 | Record modifications in log item |
| 744 | 6. Transaction commit |
| 745 | Pin item in memory if not pinned in CIL |
| 746 | Format item into log vector + buffer |
| 747 | Attach log vector and buffer to log item |
| 748 | Insert log item into CIL |
| 749 | Write CIL context sequence into transaction |
| 750 | Unlock item |
| 751 | |
| 752 | <next log force> |
| 753 | |
| 754 | 7. CIL push |
| 755 | lock CIL flush |
| 756 | Chain log vectors and buffers together |
| 757 | Remove items from CIL |
| 758 | unlock CIL flush |
| 759 | write log vectors into log |
| 760 | sequence commit records |
| 761 | attach checkpoint context to log buffer |
| 762 | |
| 763 | <log buffer IO dispatched> |
| 764 | <log buffer IO completes> |
| 765 | |
| 766 | 8. Checkpoint completion |
| 767 | Mark log item committed |
| 768 | Insert item into AIL |
| 769 | Write commit LSN into log item |
| 770 | Unpin log item |
| 771 | 9. AIL traversal |
| 772 | Lock item |
| 773 | Mark log item clean |
| 774 | Flush item to disk |
| 775 | <item IO completion> |
| 776 | 10. Log item removed from AIL |
| 777 | Moves log tail |
| 778 | Item unlocked |
| 779 | |
| 780 | From this, it can be seen that the only life cycle differences between the two |
| 781 | logging methods are in the middle of the life cycle - they still have the same |
| 782 | beginning and end and execution constraints. The only differences are in the |
| 783 | commiting of the log items to the log itself and the completion processing. |
| 784 | Hence delayed logging should not introduce any constraints on log item |
| 785 | behaviour, allocation or freeing that don't already exist. |
| 786 | |
| 787 | As a result of this zero-impact "insertion" of delayed logging infrastructure |
| 788 | and the design of the internal structures to avoid on disk format changes, we |
| 789 | can basically switch between delayed logging and the existing mechanism with a |
| 790 | mount option. Fundamentally, there is no reason why the log manager would not |
| 791 | be able to swap methods automatically and transparently depending on load |
| 792 | characteristics, but this should not be necessary if delayed logging works as |
| 793 | designed. |
| 794 | |
| 795 | Roadmap: |
| 796 | |
| 797 | 2.6.35 Inclusion in mainline as an experimental mount option |
| 798 | => approximately 2-3 months to merge window |
| 799 | => needs to be in xfs-dev tree in 4-6 weeks |
| 800 | => code is nearing readiness for review |
| 801 | |
| 802 | 2.6.37 Remove experimental tag from mount option |
| 803 | => should be roughly 6 months after initial merge |
| 804 | => enough time to: |
| 805 | => gain confidence and fix problems reported by early |
| 806 | adopters (a.k.a. guinea pigs) |
| 807 | => address worst performance regressions and undesired |
| 808 | behaviours |
| 809 | => start tuning/optimising code for parallelism |
| 810 | => start tuning/optimising algorithms consuming |
| 811 | excessive CPU time |
| 812 | |
| 813 | 2.6.39 Switch default mount option to use delayed logging |
| 814 | => should be roughly 12 months after initial merge |
| 815 | => enough time to shake out remaining problems before next round of |
| 816 | enterprise distro kernel rebases |