The cluster MD is a shared-device RAID for a cluster. It supports
two levels: raid1 and raid10 (limited support).


1. On-disk format

Separate write-intent-bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |

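The layout above can be read as a simple offset calculation. The sketch
below is only an illustration of the diagram: it assumes 4k blocks and
that each node's bitmap (superblock plus bits) occupies exactly two
blocks, as drawn. On a real array the spacing depends on the bitmap
size, and the function name and parameters are hypothetical.

```python
BLOCK = 4096  # the 4k unit used in the diagram (assumption)


def bitmap_super_offset(slot, blocks_per_bitmap=2):
    """Byte offset of "bm super[slot]" under the diagram's spacing.

    blocks_per_bitmap=2 matches the picture only; it is not a fixed
    property of the on-disk format.
    """
    if slot < 0:
        raise ValueError("slot must be >= 0")
    # Block 0 is idle, block 1 holds the md superblock,
    # so the first bitmap superblock starts at block 2.
    first_bitmap = 2 * BLOCK
    return first_bitmap + slot * blocks_per_bitmap * BLOCK
```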
During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

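The three write steps can be sketched with a toy in-memory model. This
is not the md implementation: the class, its fields, the use of one bit
per fixed-size chunk, and the dictionaries standing in for mirrors are
all assumptions made for illustration.

```python
class WriteIntentBitmap:
    """Toy model of one node's write-intent bitmap (hypothetical)."""

    def __init__(self, chunk_sectors):
        self.chunk_sectors = chunk_sectors
        self.dirty = set()          # chunk numbers whose bit is set
        self.pending_clear = set()  # bits scheduled to be cleared

    def write(self, sector, nr_sectors, mirrors, data):
        first = sector // self.chunk_sectors
        last = (sector + nr_sectors - 1) // self.chunk_sectors
        chunks = range(first, last + 1)
        for chunk in chunks:
            # Step 1: set the bit if it is not already set.
            self.dirty.add(chunk)
            self.pending_clear.discard(chunk)
        for mirror in mirrors:
            # Step 2: commit the write to every mirror.
            mirror[sector] = data
        # Step 3: schedule the bits to be cleared after a timeout.
        self.pending_clear.update(chunks)

    def timeout(self):
        """Clear every bit that was scheduled and not re-dirtied."""
        self.dirty -= self.pending_clear
        self.pending_clear.clear()
```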
Reads are just handled normally. It is up to the filesystem to ensure
one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management

There are three groups of locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

 The bm_lockres protects individual node bitmaps. They are named in
 the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
 node joins the cluster, it acquires the lock in PW mode and holds it
 in that mode for as long as the node is part of the cluster. The lock
 resource number is based on the slot number returned by the DLM
 subsystem. Since DLM starts node count from one and bitmap slots
 start from zero, one is subtracted from the DLM slot number to arrive
 at the bitmap slot number.

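A minimal sketch of the slot-to-name mapping described above. The
function name is hypothetical, and the three-digit width is taken only
from the bitmap000/bitmap001 examples in this text; the width used by
the actual module is not confirmed here.

```python
def bitmap_lock_name(dlm_slot):
    """Map a 1-based DLM slot number to a bitmap lock resource name."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start from one")
    bitmap_slot = dlm_slot - 1  # bitmap slots start from zero
    # Width of 3 follows the examples in the text (assumption).
    return "bitmap%03d" % bitmap_slot
```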
 The LVB of the bitmap lock for a particular node records the range
 of sectors that are being re-synced by that node.  No other
 node may write to those sectors.  This is used when a new node
 joins the cluster.

2.2 Message passing locks

 Each node has to communicate with other nodes when starting or ending
 resync, and for metadata superblock updates.  This communication is
 managed through three locks: "token", "message", and "ack", together
 with the Lock Value Block (LVB) of the "message" lock.

2.3 new-device management

 A single lock: "no-new-dev" is used to co-ordinate the addition of
 new devices - this must be synchronized across the array.
 Normally all nodes hold a concurrent-read lock on this lock resource.

3. Communication

 Messages can be broadcast to all nodes, and the sender waits for all
 other nodes to acknowledge the message before proceeding.  Only one
 message can be processed at a time.

3.1 Message Types

 There are six types of messages which are passed:

 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has
   been updated, and the node must re-read the md superblock. This is
   performed synchronously. It is primarily used to signal device
   failure.

 3.1.2 RESYNCING: informs other nodes that a resync is initiated or
   ended so that each node may suspend or resume the region.  Each
   RESYNCING message identifies a range of the devices that the
   sending node is about to resync. This overrides any previous
   notification from that node: only one range can be resynced at a
   time per-node.

 3.1.3 NEWDISK: informs other nodes that a device is being added to
   the array. Message contains an identifier for that device.  See
   below for further details.

 3.1.4 REMOVE: A failed or spare device is being removed from the
   array. The slot-number of the device is included in the message.

 3.1.5 RE_ADD: A failed device is being re-activated - the assumption
   is that it has been determined to be working again.

 3.1.6 BITMAP_NEEDS_SYNC: if a node is stopped locally but the bitmap
   isn't clean, then another node is informed to take the ownership of
   resync.

3.2 Communication mechanism

 The DLM LVB is used to communicate between nodes of the cluster. There
 are three resources used for the purpose:

3.2.1 token: The resource which protects the entire communication
   system. The node having the token resource is allowed to
   communicate.

3.2.2 message: The lock resource which carries the data to
   communicate.

3.2.3 ack: The resource, acquiring which means the message has been
   acknowledged by all nodes in the cluster. The BAST of the resource
   is used to inform the receiving node that a node wants to
   communicate.

The algorithm is:

 1. receive status - all nodes have concurrent-reader lock on "ack".

   sender                         receiver                 receiver
   "ack":CR                       "ack":CR                 "ack":CR

 2. sender gets EX on "token"
    sender gets EX on "message"
    sender                        receiver                 receiver
    "token":EX                    "ack":CR                 "ack":CR
    "message":EX
    "ack":CR

    Sender checks that it still needs to send a message. Messages
    received or other events that happened while waiting for the
    "token" may have made this message inappropriate or redundant.

 3. sender writes LVB.
    sender down-converts "message" from EX to CW
    sender tries to get EX of "ack"
    [ wait until all receivers have *processed* the "message" ]

                                     [ triggered by bast of "ack" ]
                                     receiver gets CR on "message"
                                     receiver reads LVB
                                     receiver processes the message
                                     [ wait finish ]
                                     receiver releases "ack"
                                     receiver tries to get PR on "message"

   sender                         receiver                  receiver
   "token":EX                     "message":CR              "message":CR
   "message":CW
   "ack":EX

 4. triggered by grant of EX on "ack" (indicating all receivers
    have processed message)
    sender down-converts "ack" from EX to CR
    sender releases "message"
    sender releases "token"
                               receiver up-converts to PR on "message"
                               receiver gets CR of "ack"
                               receiver releases "message"

   sender                      receiver                   receiver
   "ack":CR                    "ack":CR                   "ack":CR


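The waits in the algorithm above follow from which DLM lock modes can
be held at the same time. The sketch below encodes the conventional DLM
mode compatibility table (an assumption brought in from general DLM
usage, not stated in this document) and a hypothetical helper that says
whether a requested mode can be granted given the modes other nodes
hold on the same resource.

```python
# For each mode, the set of modes it can coexist with on one resource.
# This is the conventional DLM table, assumed rather than quoted here.
COMPAT = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def can_grant(requested, held_by_others):
    """True if 'requested' is compatible with every mode already held."""
    return all(held in COMPAT[requested] for held in held_by_others)


# Step 3: the sender's EX on "ack" waits while any receiver holds CR.
sender_blocked = not can_grant("EX", ["CR", "CR"])
# Step 3: receivers may take CR on "message" while the sender holds CW.
receiver_can_read = can_grant("CR", ["CW"])
# Step 3/4: a receiver's PR on "message" waits until the sender's CW is gone.
receiver_pr_waits = not can_grant("PR", ["CW"])
```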
4. Handling Failures

4.1 Node Failure

 When a node fails, the DLM informs the cluster of the failed node's
 slot number. The node starts a cluster recovery thread. The cluster
 recovery thread:

    - acquires the bitmap<number> lock of the failed node
    - opens the bitmap
    - reads the bitmap of the failed node
    - copies the set bits to the local node's bitmap
    - clears the bitmap of the failed node
    - releases bitmap<number> lock of the failed node
    - initiates resync of the bitmap on the current node
        md_check_recovery is invoked within recover_bitmaps;
        md_check_recovery then calls metadata_update_start/finish,
        which locks the communication by lock_comm.
        This means that while one node is resyncing it blocks all
        other nodes from writing anywhere on the array.

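The copy and clear steps of the recovery thread amount to merging one
bitmap into another. The sketch below models each bitmap as a Python
set of dirty chunk numbers; the function name is hypothetical, and lock
handling and the resync itself are deliberately left out.

```python
def recover_failed_bitmap(local_bits, failed_bits):
    """Merge a failed node's dirty bits into the local bitmap.

    Both arguments are sets of chunk numbers and are modified in place.
    Acquiring and releasing the failed node's bitmap lock is omitted.
    """
    local_bits |= failed_bits  # copy the set bits to the local node
    failed_bits.clear()        # clear the failed node's bitmap
    return local_bits          # chunks the local node must now resync
```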
 The resync process is the regular md resync. However, in a clustered
 environment when a resync is performed, it needs to tell other nodes
 of the areas which are suspended. Before a resync starts, the node
 sends out RESYNCING with the (lo,hi) range of the area which needs to
 be suspended. Each node maintains a suspend_list, which contains the
 list of ranges which are currently suspended. On receiving RESYNCING,
 the node adds the range to the suspend_list. Similarly, when the node
 performing resync finishes, it sends RESYNCING with an empty range to
 other nodes and other nodes remove the corresponding entry from the
 suspend_list.

 A helper function, ->area_resyncing() can be used to check if a
 particular I/O range should be suspended or not.

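The suspend_list bookkeeping can be sketched as follows. This is a toy
model, not the kernel data structure: the class and method names are
hypothetical, ranges are treated as half-open [lo, hi) sector
intervals, and an "empty range" is taken to mean hi <= lo. It keeps one
entry per sending node because a new RESYNCING message overrides that
node's previous one.

```python
class SuspendList:
    """Toy model of a node's list of suspended ranges (hypothetical)."""

    def __init__(self):
        self.ranges = {}  # sending node's slot -> (lo, hi)

    def on_resyncing(self, slot, lo, hi):
        """Handle a RESYNCING message received from node 'slot'."""
        if hi <= lo:
            # Empty range: the sender has finished resyncing.
            self.ranges.pop(slot, None)
        else:
            # Overrides any earlier range announced by this node.
            self.ranges[slot] = (lo, hi)

    def area_resyncing(self, lo, hi):
        """True if [lo, hi) overlaps any currently suspended range."""
        return any(lo < s_hi and s_lo < hi
                   for s_lo, s_hi in self.ranges.values())
```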
4.2 Device Failure

 Device failures are handled and communicated with the metadata update
 routine.  When a node detects a device failure it does not allow
 any further writes to that device until the failure has been
 acknowledged by all other nodes.

5. Adding a new Device

 For adding a new device, it is necessary that all nodes "see" the new
 device to be added. For this, the following algorithm is used:

    1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
    2. Node 1 sends a NEWDISK message with uuid and slot number
    3. Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4,5 could be a udev rule)
    4. In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
    5. Other nodes issue either of the following depending on whether
       the disk was found:
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
             disc.number set to slot number)
       ioctl(CLUSTERED_DISK_NACK)
    6. Other nodes drop lock on "no-new-dev" (CR) if device is found
    7. Node 1 attempts EX lock on "no-new-dev"
    8. If node 1 gets the lock, it sends METADATA_UPDATED after
       unmarking the disk as SpareLocal
    9. If not (get "no-new-dev" lock), it fails the operation and sends
       METADATA_UPDATED.
    10. Other nodes get the information whether a disk is added or not
        by the following METADATA_UPDATED.

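Steps 6 to 9 reduce to one question: did every other node drop its CR
lock on "no-new-dev"? The sketch below models only that decision. The
function name, the dictionary input and the returned strings are
hypothetical, and it assumes an EX request cannot be granted while any
other node still holds CR.

```python
def add_new_disk_outcome(found_by_node):
    """Decide the outcome of the new-disk protocol on node 1.

    found_by_node maps each other node to True if it found the disk
    (and so dropped its CR lock on "no-new-dev") or False if not.
    """
    # Step 6: only nodes that found the device drop their CR lock.
    still_holding_cr = [node for node, found in found_by_node.items()
                        if not found]
    # Step 7: EX is assumed to be granted only if no CR holder remains.
    got_ex = not still_holding_cr
    # Steps 8 and 9: METADATA_UPDATED is sent in both cases.
    return "added" if got_ex else "failed"
```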
6. Module interface

 There are 17 call-backs which the md core can make to the cluster
 module.  Understanding these can give a good overview of the whole
 process.

6.1 join(nodes) and leave()

 These are called when an array is started with a clustered bitmap,
 and when the array is stopped.  join() ensures the cluster is
 available and initializes the various resources.
 Only the first 'nodes' nodes in the cluster can use the array.

6.2 slot_number()

 Reports the slot number advised by the cluster infrastructure.
 Range is from 0 to nodes-1.

6.3 resync_info_update()

 This updates the resync range that is stored in the bitmap lock.
 The starting point is updated as the resync progresses.  The
 end point is always the end of the array.
 It does *not* send a RESYNCING message.

6.4 resync_start(), resync_finish()

 These are called when resync/recovery/reshape starts or stops.
 They update the resyncing range in the bitmap lock and also
 send a RESYNCING message.  resync_start() reports the whole
 array as resyncing, resync_finish() reports none of it.

 resync_finish() also sends a BITMAP_NEEDS_SYNC message which
 allows some other node to take over.

6.5 metadata_update_start(), metadata_update_finish(),
    metadata_update_cancel()

 metadata_update_start() is used to get exclusive access to
 the metadata.  If a change is still needed once that access is
 gained, metadata_update_finish() will send a METADATA_UPDATED
 message to all other nodes, otherwise metadata_update_cancel()
 can be used to release the lock.

6.6 area_resyncing()

 This combines two elements of functionality.

 Firstly, it will check if any node is currently resyncing
 anything in a given range of sectors.  If any resync is found,
 then the caller will avoid writing or read-balancing in that
 range.

 Secondly, while node recovery is happening it reports that
 all areas are resyncing for READ requests.  This avoids races
 between the cluster-filesystem and the cluster-RAID handling
 a node failure.

6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()

 These are used to manage the new-disk protocol described above.
 When a new device is added, add_new_disk_start() is called before
 it is bound to the array and, if that succeeds, add_new_disk_finish()
 is called once the device is fully added.

 When a device is added in acknowledgement to a previous
 request, or when the device is declared "unavailable",
 new_disk_ack() is called.

6.8 remove_disk()

 This is called when a spare or failed device is removed from
 the array.  It causes a REMOVE message to be sent to other nodes.

6.9 gather_bitmaps()

 This sends a RE_ADD message to all other nodes and then
 gathers bitmap information from all bitmaps.  This combined
 bitmap is then used to recover the re-added device.

6.10 lock_all_bitmaps() and unlock_all_bitmaps()

 These are called when the bitmap is changed to none. If a node plans
 to clear the cluster raid's bitmap, it needs to make sure no other
 nodes are using the raid. This is achieved by locking all bitmap
 locks within the cluster; the locks are released again
 afterwards.

7. Unsupported features

There are some things which are not supported by cluster MD yet.

- change array_sectors.