Thomas Gleixner | c6ae4c0 | 2019-05-22 09:51:37 +0200 | [diff] [blame] | 1 | /* SPDX-License-Identifier: GPL-2.0-or-later */ |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 2 | /* |
| 3 | lru_cache.c |
| 4 | |
| 5 | This file is part of DRBD by Philipp Reisner and Lars Ellenberg. |
| 6 | |
| 7 | Copyright (C) 2003-2008, LINBIT Information Technologies GmbH. |
| 8 | Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>. |
| 9 | Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>. |
| 10 | |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 11 | |
| 12 | */ |
| 13 | |
| 14 | #ifndef LRU_CACHE_H |
| 15 | #define LRU_CACHE_H |
| 16 | |
| 17 | #include <linux/list.h> |
| 18 | #include <linux/slab.h> |
| 19 | #include <linux/bitops.h> |
| 20 | #include <linux/string.h> /* for memset */ |
| 21 | #include <linux/seq_file.h> |
| 22 | |
| 23 | /* |
| 24 | This header file (and its .c file; kernel-doc of functions see there) |
| 25 | define a helper framework to easily keep track of index:label associations, |
| 26 | and changes to an "active set" of objects, as well as pending transactions, |
| 27 | to persistently record those changes. |
| 28 | |
| 29 | We use an LRU policy if it is necessary to "cool down" a region currently in |
| 30 | the active set before we can "heat" a previously unused region. |
| 31 | |
| 32 | Because of this later property, it is called "lru_cache". |
| 33 | As it actually Tracks Objects in an Active SeT, we could also call it |
| 34 | toast (incidentally that is what may happen to the data on the |
| 35 | backend storage uppon next resync, if we don't get it right). |
| 36 | |
| 37 | What for? |
| 38 | |
| 39 | We replicate IO (more or less synchronously) to local and remote disk. |
| 40 | |
| 41 | For crash recovery after replication node failure, |
| 42 | we need to resync all regions that have been target of in-flight WRITE IO |
Adam Buchbinder | 48fc7f7 | 2012-09-19 21:48:00 -0400 | [diff] [blame] | 43 | (in use, or "hot", regions), as we don't know whether or not those WRITEs |
| 44 | have made it to stable storage. |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 45 | |
| 46 | To avoid a "full resync", we need to persistently track these regions. |
| 47 | |
| 48 | This is known as "write intent log", and can be implemented as on-disk |
| 49 | (coarse or fine grained) bitmap, or other meta data. |
| 50 | |
| 51 | To avoid the overhead of frequent extra writes to this meta data area, |
| 52 | usually the condition is softened to regions that _may_ have been target of |
| 53 | in-flight WRITE IO, e.g. by only lazily clearing the on-disk write-intent |
| 54 | bitmap, trading frequency of meta data transactions against amount of |
Daniel Mack | 3ad2f3fb | 2010-02-03 08:01:28 +0800 | [diff] [blame] | 55 | (possibly unnecessary) resync traffic. |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 56 | |
| 57 | If we set a hard limit on the area that may be "hot" at any given time, we |
| 58 | limit the amount of resync traffic needed for crash recovery. |
| 59 | |
| 60 | For recovery after replication link failure, |
| 61 | we need to resync all blocks that have been changed on the other replica |
| 62 | in the mean time, or, if both replica have been changed independently [*], |
| 63 | all blocks that have been changed on either replica in the mean time. |
| 64 | [*] usually as a result of a cluster split-brain and insufficient protection. |
| 65 | but there are valid use cases to do this on purpose. |
| 66 | |
| 67 | Tracking those blocks can be implemented as "dirty bitmap". |
| 68 | Having it fine-grained reduces the amount of resync traffic. |
| 69 | It should also be persistent, to allow for reboots (or crashes) |
| 70 | while the replication link is down. |
| 71 | |
| 72 | There are various possible implementations for persistently storing |
| 73 | write intent log information, three of which are mentioned here. |
| 74 | |
| 75 | "Chunk dirtying" |
| 76 | The on-disk "dirty bitmap" may be re-used as "write-intent" bitmap as well. |
| 77 | To reduce the frequency of bitmap updates for write-intent log purposes, |
| 78 | one could dirty "chunks" (of some size) at a time of the (fine grained) |
| 79 | on-disk bitmap, while keeping the in-memory "dirty" bitmap as clean as |
| 80 | possible, flushing it to disk again when a previously "hot" (and on-disk |
| 81 | dirtied as full chunk) area "cools down" again (no IO in flight anymore, |
| 82 | and none expected in the near future either). |
| 83 | |
| 84 | "Explicit (coarse) write intent bitmap" |
| 85 | An other implementation could chose a (probably coarse) explicit bitmap, |
| 86 | for write-intent log purposes, additionally to the fine grained dirty bitmap. |
| 87 | |
| 88 | "Activity log" |
| 89 | Yet an other implementation may keep track of the hot regions, by starting |
| 90 | with an empty set, and writing down a journal of region numbers that have |
| 91 | become "hot", or have "cooled down" again. |
| 92 | |
| 93 | To be able to use a ring buffer for this journal of changes to the active |
| 94 | set, we not only record the actual changes to that set, but also record the |
| 95 | not changing members of the set in a round robin fashion. To do so, we use a |
| 96 | fixed (but configurable) number of slots which we can identify by index, and |
| 97 | associate region numbers (labels) with these indices. |
| 98 | For each transaction recording a change to the active set, we record the |
| 99 | change itself (index: -old_label, +new_label), and which index is associated |
| 100 | with which label (index: current_label) within a certain sliding window that |
| 101 | is moved further over the available indices with each such transaction. |
| 102 | |
| 103 | Thus, for crash recovery, if the ringbuffer is sufficiently large, we can |
| 104 | accurately reconstruct the active set. |
| 105 | |
| 106 | Sufficiently large depends only on maximum number of active objects, and the |
| 107 | size of the sliding window recording "index: current_label" associations within |
| 108 | each transaction. |
| 109 | |
| 110 | This is what we call the "activity log". |
| 111 | |
| 112 | Currently we need one activity log transaction per single label change, which |
| 113 | does not give much benefit over the "dirty chunks of bitmap" approach, other |
| 114 | than potentially less seeks. |
| 115 | |
| 116 | We plan to change the transaction format to support multiple changes per |
| 117 | transaction, which then would reduce several (disjoint, "random") updates to |
| 118 | the bitmap into one transaction to the activity log ring buffer. |
| 119 | */ |
| 120 | |
| 121 | /* this defines an element in a tracked set |
| 122 | * .colision is for hash table lookup. |
| 123 | * When we process a new IO request, we know its sector, thus can deduce the |
| 124 | * region number (label) easily. To do the label -> object lookup without a |
| 125 | * full list walk, we use a simple hash table. |
| 126 | * |
| 127 | * .list is on one of three lists: |
| 128 | * in_use: currently in use (refcnt > 0, lc_number != LC_FREE) |
| 129 | * lru: unused but ready to be reused or recycled |
Lars Ellenberg | 600942e | 2011-01-27 15:24:58 +0100 | [diff] [blame] | 130 | * (lc_refcnt == 0, lc_number != LC_FREE), |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 131 | * free: unused but ready to be recycled |
Lars Ellenberg | 600942e | 2011-01-27 15:24:58 +0100 | [diff] [blame] | 132 | * (lc_refcnt == 0, lc_number == LC_FREE), |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 133 | * |
| 134 | * an element is said to be "in the active set", |
| 135 | * if either on "in_use" or "lru", i.e. lc_number != LC_FREE. |
| 136 | * |
| 137 | * DRBD currently (May 2009) only uses 61 elements on the resync lru_cache |
| 138 | * (total memory usage 2 pages), and up to 3833 elements on the act_log |
Lucas De Marchi | 25985ed | 2011-03-30 22:57:33 -0300 | [diff] [blame] | 139 | * lru_cache, totalling ~215 kB for 64bit architecture, ~53 pages. |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 140 | * |
| 141 | * We usually do not actually free these objects again, but only "recycle" |
| 142 | * them, as the change "index: -old_label, +LC_FREE" would need a transaction |
| 143 | * as well. Which also means that using a kmem_cache to allocate the objects |
| 144 | * from wastes some resources. |
| 145 | * But it avoids high order page allocations in kmalloc. |
| 146 | */ |
| 147 | struct lc_element { |
| 148 | struct hlist_node colision; |
| 149 | struct list_head list; /* LRU list or free list */ |
| 150 | unsigned refcnt; |
Lars Ellenberg | 600942e | 2011-01-27 15:24:58 +0100 | [diff] [blame] | 151 | /* back "pointer" into lc_cache->element[index], |
| 152 | * for paranoia, and for "lc_element_to_index" */ |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 153 | unsigned lc_index; |
| 154 | /* if we want to track a larger set of objects, |
| 155 | * it needs to become arch independend u64 */ |
| 156 | unsigned lc_number; |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 157 | /* special label when on free list */ |
| 158 | #define LC_FREE (~0U) |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 159 | |
| 160 | /* for pending changes */ |
| 161 | unsigned lc_new_number; |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 162 | }; |
| 163 | |
| 164 | struct lru_cache { |
| 165 | /* the least recently used item is kept at lru->prev */ |
| 166 | struct list_head lru; |
| 167 | struct list_head free; |
| 168 | struct list_head in_use; |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 169 | struct list_head to_be_changed; |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 170 | |
| 171 | /* the pre-created kmem cache to allocate the objects from */ |
| 172 | struct kmem_cache *lc_cache; |
| 173 | |
| 174 | /* size of tracked objects, used to memset(,0,) them in lc_reset */ |
| 175 | size_t element_size; |
| 176 | /* offset of struct lc_element member in the tracked object */ |
| 177 | size_t element_off; |
| 178 | |
| 179 | /* number of elements (indices) */ |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 180 | unsigned int nr_elements; |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 181 | /* Arbitrary limit on maximum tracked objects. Practical limit is much |
| 182 | * lower due to allocation failures, probably. For typical use cases, |
| 183 | * nr_elements should be a few thousand at most. |
Lars Ellenberg | 600942e | 2011-01-27 15:24:58 +0100 | [diff] [blame] | 184 | * This also limits the maximum value of lc_element.lc_index, allowing the |
| 185 | * 8 high bits of .lc_index to be overloaded with flags in the future. */ |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 186 | #define LC_MAX_ACTIVE (1<<24) |
| 187 | |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 188 | /* allow to accumulate a few (index:label) changes, |
| 189 | * but no more than max_pending_changes */ |
| 190 | unsigned int max_pending_changes; |
| 191 | /* number of elements currently on to_be_changed list */ |
| 192 | unsigned int pending_changes; |
| 193 | |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 194 | /* statistics */ |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 195 | unsigned used; /* number of elements currently on in_use list */ |
| 196 | unsigned long hits, misses, starving, locked, changed; |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 197 | |
| 198 | /* see below: flag-bits for lru_cache */ |
| 199 | unsigned long flags; |
| 200 | |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 201 | |
| 202 | void *lc_private; |
| 203 | const char *name; |
| 204 | |
| 205 | /* nr_elements there */ |
| 206 | struct hlist_head *lc_slot; |
| 207 | struct lc_element **lc_element; |
| 208 | }; |
| 209 | |
| 210 | |
| 211 | /* flag-bits for lru_cache */ |
| 212 | enum { |
| 213 | /* debugging aid, to catch concurrent access early. |
| 214 | * user needs to guarantee exclusive access by proper locking! */ |
| 215 | __LC_PARANOIA, |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 216 | |
| 217 | /* annotate that the set is "dirty", possibly accumulating further |
| 218 | * changes, until a transaction is finally triggered */ |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 219 | __LC_DIRTY, |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 220 | |
| 221 | /* Locked, no further changes allowed. |
| 222 | * Also used to serialize changing transactions. */ |
| 223 | __LC_LOCKED, |
| 224 | |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 225 | /* if we need to change the set, but currently there is no free nor |
| 226 | * unused element available, we are "starving", and must not give out |
| 227 | * further references, to guarantee that eventually some refcnt will |
| 228 | * drop to zero and we will be able to make progress again, changing |
| 229 | * the set, writing the transaction. |
| 230 | * if the statistics say we are frequently starving, |
| 231 | * nr_elements is too small. */ |
| 232 | __LC_STARVING, |
| 233 | }; |
| 234 | #define LC_PARANOIA (1<<__LC_PARANOIA) |
| 235 | #define LC_DIRTY (1<<__LC_DIRTY) |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 236 | #define LC_LOCKED (1<<__LC_LOCKED) |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 237 | #define LC_STARVING (1<<__LC_STARVING) |
| 238 | |
| 239 | extern struct lru_cache *lc_create(const char *name, struct kmem_cache *cache, |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 240 | unsigned max_pending_changes, |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 241 | unsigned e_count, size_t e_size, size_t e_off); |
| 242 | extern void lc_reset(struct lru_cache *lc); |
| 243 | extern void lc_destroy(struct lru_cache *lc); |
| 244 | extern void lc_set(struct lru_cache *lc, unsigned int enr, int index); |
| 245 | extern void lc_del(struct lru_cache *lc, struct lc_element *element); |
| 246 | |
Lars Ellenberg | cbe5e61 | 2013-03-22 22:17:36 -0600 | [diff] [blame] | 247 | extern struct lc_element *lc_get_cumulative(struct lru_cache *lc, unsigned int enr); |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 248 | extern struct lc_element *lc_try_get(struct lru_cache *lc, unsigned int enr); |
| 249 | extern struct lc_element *lc_find(struct lru_cache *lc, unsigned int enr); |
| 250 | extern struct lc_element *lc_get(struct lru_cache *lc, unsigned int enr); |
| 251 | extern unsigned int lc_put(struct lru_cache *lc, struct lc_element *e); |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 252 | extern void lc_committed(struct lru_cache *lc); |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 253 | |
| 254 | struct seq_file; |
Roland Kammerer | bb649b3 | 2015-04-16 10:17:51 +0200 | [diff] [blame] | 255 | extern void lc_seq_printf_stats(struct seq_file *seq, struct lru_cache *lc); |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 256 | |
| 257 | extern void lc_seq_dump_details(struct seq_file *seq, struct lru_cache *lc, char *utext, |
| 258 | void (*detail) (struct seq_file *, struct lc_element *)); |
| 259 | |
| 260 | /** |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 261 | * lc_try_lock_for_transaction - can be used to stop lc_get() from changing the tracked set |
| 262 | * @lc: the lru cache to operate on |
| 263 | * |
| 264 | * Allows (expects) the set to be "dirty". Note that the reference counts and |
| 265 | * order on the active and lru lists may still change. Used to serialize |
| 266 | * changing transactions. Returns true if we aquired the lock. |
| 267 | */ |
| 268 | static inline int lc_try_lock_for_transaction(struct lru_cache *lc) |
| 269 | { |
| 270 | return !test_and_set_bit(__LC_LOCKED, &lc->flags); |
| 271 | } |
| 272 | |
| 273 | /** |
| 274 | * lc_try_lock - variant to stop lc_get() from changing the tracked set |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 275 | * @lc: the lru cache to operate on |
| 276 | * |
| 277 | * Note that the reference counts and order on the active and lru lists may |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 278 | * still change. Only works on a "clean" set. Returns true if we aquired the |
| 279 | * lock, which means there are no pending changes, and any further attempt to |
| 280 | * change the set will not succeed until the next lc_unlock(). |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 281 | */ |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 282 | extern int lc_try_lock(struct lru_cache *lc); |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 283 | |
| 284 | /** |
| 285 | * lc_unlock - unlock @lc, allow lc_get() to change the set again |
| 286 | * @lc: the lru cache to operate on |
| 287 | */ |
| 288 | static inline void lc_unlock(struct lru_cache *lc) |
| 289 | { |
| 290 | clear_bit(__LC_DIRTY, &lc->flags); |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 291 | clear_bit_unlock(__LC_LOCKED, &lc->flags); |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 292 | } |
| 293 | |
Lars Ellenberg | 46a15bc | 2011-02-21 13:21:01 +0100 | [diff] [blame] | 294 | extern bool lc_is_used(struct lru_cache *lc, unsigned int enr); |
Philipp Reisner | b411b36 | 2009-09-25 16:07:19 -0700 | [diff] [blame] | 295 | |
| 296 | #define lc_entry(ptr, type, member) \ |
| 297 | container_of(ptr, type, member) |
| 298 | |
| 299 | extern struct lc_element *lc_element_by_index(struct lru_cache *lc, unsigned i); |
| 300 | extern unsigned int lc_index_of(struct lru_cache *lc, struct lc_element *e); |
| 301 | |
| 302 | #endif |