Red-black Trees (rbtree) in Linux
January 18, 2007
Rob Landley <rob@landley.net>
=============================

What are red-black trees, and what are they for?
------------------------------------------------

Red-black trees are a type of self-balancing binary search tree, used for
storing sortable key/value data pairs. This differs from radix trees (which
are used to efficiently store sparse arrays and thus use long integer indexes
to insert/access/delete nodes) and hash tables (which are not kept sorted to
be easily traversed in order, and must be tuned for a specific size and
hash function, whereas rbtrees scale gracefully storing arbitrary keys).

Red-black trees are similar to AVL trees, but provide faster real-time bounded
worst case performance for insertion and deletion (at most two rotations and
three rotations, respectively, to balance the tree), with slightly slower
(but still O(log n)) lookup time.

To quote Linux Weekly News:

    There are a number of red-black trees in use in the kernel.
    The deadline and CFQ I/O schedulers employ rbtrees to
    track requests; the packet CD/DVD driver does the same.
    The high-resolution timer code uses an rbtree to organize outstanding
    timer requests. The ext3 filesystem tracks directory entries in a
    red-black tree. Virtual memory areas (VMAs) are tracked with red-black
    trees, as are epoll file descriptors, cryptographic keys, and network
    packets in the "hierarchical token bucket" scheduler.

This document covers use of the Linux rbtree implementation. For more
information on the nature and implementation of Red Black Trees, see:

  Linux Weekly News article on red-black trees
    http://lwn.net/Articles/184495/

  Wikipedia entry on red-black trees
    http://en.wikipedia.org/wiki/Red-black_tree

Linux implementation of red-black trees
---------------------------------------

Linux's rbtree implementation lives in the file "lib/rbtree.c". To use it,
"#include <linux/rbtree.h>".

The Linux rbtree implementation is optimized for speed, and thus has one
less layer of indirection (and better cache locality) than more traditional
tree implementations. Instead of using pointers to separate rb_node and data
structures, each instance of struct rb_node is embedded in the data structure
it organizes. And instead of using a comparison callback function pointer,
users are expected to write their own tree search and insert functions
which call the provided rbtree functions. Locking is also left up to the
user of the rbtree code.

Creating a new rbtree
---------------------

Data nodes in an rbtree are structures containing a struct rb_node member:

  struct mytype {
  	struct rb_node node;
  	char *keystring;
  };

When dealing with a pointer to the embedded struct rb_node, the containing data
structure may be accessed with the standard container_of() macro. In addition,
individual members may be accessed directly via rb_entry(node, type, member),
which is a thin wrapper around container_of().

At the root of each rbtree is an rb_root structure, which is initialized to be
empty via:

  struct rb_root mytree = RB_ROOT;

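As a rough illustration of how a pointer to the embedded node maps back to
its containing structure, here is a minimal userspace sketch. The kernel
provides struct rb_node, container_of() and rb_entry() itself; the stand-in
definitions below are simplified assumptions so that the example compiles
outside the kernel, and key_of() is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct rb_node (assumption: the
 * real structure also stores the parent pointer and the node color). */
struct rb_node {
	struct rb_node *rb_left, *rb_right;
};

/* Userspace approximations of the kernel macros: step back from the
 * member's address by the member's offset inside the container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define rb_entry(ptr, type, member) container_of(ptr, type, member)

struct mytype {
	struct rb_node node;
	char *keystring;
};

/* Hypothetical helper: recover the key from the embedded node pointer. */
static const char *key_of(struct rb_node *n)
{
	assert(n != NULL);
	return rb_entry(n, struct mytype, node)->keystring;
}
```

Because the node lives inside struct mytype, no separate allocation or
back-pointer is needed to get from the tree to the data.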
Searching for a value in an rbtree
----------------------------------

Writing a search function for your tree is fairly straightforward: start at the
root, compare each value, and follow the left or right branch as necessary.

Example:

  struct mytype *my_search(struct rb_root *root, char *string)
  {
  	struct rb_node *node = root->rb_node;

  	while (node) {
  		struct mytype *data = container_of(node, struct mytype, node);
  		int result;

  		result = strcmp(string, data->keystring);

  		if (result < 0)
  			node = node->rb_left;
  		else if (result > 0)
  			node = node->rb_right;
  		else
  			return data;
  	}
  	return NULL;
  }

Inserting data into an rbtree
-----------------------------

Inserting data in the tree involves first searching for the place to insert the
new node, then inserting the node and rebalancing ("recoloring") the tree.

The search for insertion differs from the previous search by finding the
location of the pointer on which to graft the new node. The new node also
needs a link to its parent node for rebalancing purposes.

Example:

  bool my_insert(struct rb_root *root, struct mytype *data)
  {
  	struct rb_node **new = &(root->rb_node), *parent = NULL;

  	/* Figure out where to put new node */
  	while (*new) {
  		struct mytype *this = container_of(*new, struct mytype, node);
  		int result = strcmp(data->keystring, this->keystring);

  		parent = *new;
  		if (result < 0)
  			new = &((*new)->rb_left);
  		else if (result > 0)
  			new = &((*new)->rb_right);
  		else
  			return false;	/* key already present */
  	}

  	/* Add new node and rebalance tree. */
  	rb_link_node(&data->node, parent, new);
  	rb_insert_color(&data->node, root);

  	return true;
  }

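The reason my_insert() tracks a pointer to a pointer is that the search ends
on the empty child slot itself, so linking the new node is a single store.
The following userspace sketch shows that walk on a plain, unbalanced binary
search tree. It is only an illustration under stated assumptions: struct
node, link_node() and bst_insert() are hypothetical stand-ins, and no
recoloring or rotation is done, unlike rb_insert_color():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in node for an unbalanced BST (assumption: no colors). */
struct node {
	struct node *left, *right, *parent;
	const char *key;
};

/* Hypothetical helper playing the role of rb_link_node(): store the new
 * node in the empty slot found by the search and record its parent. */
static void link_node(struct node *n, struct node *parent, struct node **slot)
{
	n->left = n->right = NULL;
	n->parent = parent;
	*slot = n;
}

/* Returns 1 if the node was inserted, 0 if the key is already present. */
static int bst_insert(struct node **root, struct node *n)
{
	struct node **slot = root, *parent = NULL;

	assert(n != NULL);
	while (*slot) {
		int result = strcmp(n->key, (*slot)->key);

		parent = *slot;
		if (result < 0)
			slot = &(*slot)->left;
		else if (result > 0)
			slot = &(*slot)->right;
		else
			return 0;
	}
	link_node(n, parent, slot);
	return 1;
}
```

Note that the same code handles the empty tree, because the first slot
examined is the root pointer itself.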
Removing or replacing existing data in an rbtree
------------------------------------------------

To remove an existing node from a tree, call:

  void rb_erase(struct rb_node *victim, struct rb_root *tree);

Example:

  struct mytype *data = my_search(&mytree, "walrus");

  if (data) {
  	rb_erase(&data->node, &mytree);
  	myfree(data);
  }

To replace an existing node in a tree with a new one with the same key, call:

  void rb_replace_node(struct rb_node *old, struct rb_node *new,
  			struct rb_root *tree);

Replacing a node this way does not re-sort the tree: If the new node doesn't
have the same key as the old node, the rbtree will probably become corrupted.

Iterating through the elements stored in an rbtree (in sort order)
------------------------------------------------------------------

Four functions are provided for iterating through an rbtree's contents in
sorted order. These work on arbitrary trees, and should not need to be
modified or wrapped (except for locking purposes):

  struct rb_node *rb_first(struct rb_root *tree);
  struct rb_node *rb_last(struct rb_root *tree);
  struct rb_node *rb_next(struct rb_node *node);
  struct rb_node *rb_prev(struct rb_node *node);

To start iterating, call rb_first() or rb_last() with a pointer to the root
of the tree, which will return a pointer to the node structure contained in
the first or last element in the tree. To continue, fetch the next or previous
node by calling rb_next() or rb_prev() on the current node. This will return
NULL when there are no more nodes left.

The iterator functions return a pointer to the embedded struct rb_node, from
which the containing data structure may be accessed with the container_of()
macro, and individual members may be accessed directly via
rb_entry(node, type, member).

Example:

  struct rb_node *node;
  for (node = rb_first(&mytree); node; node = rb_next(node))
  	printk("key=%s\n", rb_entry(node, struct mytype, node)->keystring);

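For readers curious about what a sorted walk involves, the sketch below
shows the usual in-order successor logic on a stand-in tree with explicit
parent pointers. It is not the kernel's code: struct node, tree_first() and
tree_next() are hypothetical names that only play the roles of rb_first()
and rb_next(), and the layout of the real struct rb_node is not assumed:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in node with an explicit parent pointer (assumption). */
struct node {
	struct node *left, *right, *parent;
	int key;
};

/* Leftmost node holds the smallest key; NULL for an empty tree. */
static struct node *tree_first(struct node *root)
{
	if (!root)
		return NULL;
	while (root->left)
		root = root->left;
	return root;
}

/* In-order successor of n, or NULL if n is the last node. */
static struct node *tree_next(struct node *n)
{
	assert(n != NULL);

	/* If there is a right subtree, its leftmost node comes next. */
	if (n->right) {
		n = n->right;
		while (n->left)
			n = n->left;
		return n;
	}

	/* Otherwise climb until we leave a left subtree. */
	while (n->parent && n == n->parent->right)
		n = n->parent;
	return n->parent;
}
```

No stack or recursion is needed because each node can reach its parent,
which is why a walk can resume from any node.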
Support for Augmented rbtrees
-----------------------------

An augmented rbtree is an rbtree with "some" additional data stored in
each node, where the additional data for node N must be a function of
the contents of all nodes in the subtree rooted at N. This data can
be used to add new functionality to the rbtree. Augmented rbtree
support is an optional feature built on top of the basic rbtree
infrastructure. An rbtree user who wants this feature will have to call
the augmentation functions with the user provided augmentation callback
when inserting and erasing nodes.

On insertion, the user must update the augmented information on the path
leading to the inserted node, then call rb_link_node() as usual and
rb_insert_augmented() instead of the usual rb_insert_color() call.
If rb_insert_augmented() rebalances the rbtree, it will call back into
a user provided function to update the augmented information on the
affected subtrees.

When erasing a node, the user must call rb_erase_augmented() instead of
rb_erase(). rb_erase_augmented() calls back into user provided functions
to update the augmented information on affected subtrees.

In both cases, the callbacks are provided through struct rb_augment_callbacks.
Three callbacks must be defined:

- A propagation callback, which updates the augmented value for a given
  node and its ancestors, up to a given stop point (or NULL to update
  all the way to the root).

- A copy callback, which copies the augmented value for a given subtree
  to a newly assigned subtree root.

- A tree rotation callback, which copies the augmented value for a given
  subtree to a newly assigned subtree root AND recomputes the augmented
  information for the former subtree root.

Sample usage:

An interval tree is an example of an augmented rbtree. Reference:
"Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein.
More details about interval trees:

A classical rbtree has a single key, so it cannot be directly used to store
interval ranges like [lo:hi] and do a quick lookup for any overlap with a new
lo:hi, or to find whether there is an exact match for a new lo:hi.

However, an rbtree can be augmented to store such interval ranges in a
structured way, making it possible to do efficient lookup and exact match.

This "extra information" stored in each node is the maximum hi
(max_hi) value among the node itself and all of its descendants (the
__subtree_last field in the code that follows). This information can be
maintained at each node just by looking at the node and its immediate
children. It is used for an O(log n) lookup of the lowest match
(lowest start address among all possible matches) with something like:

struct interval_tree_node *
interval_tree_first_match(struct rb_root *root,
			  unsigned long start, unsigned long last)
{
	struct interval_tree_node *node;

	if (!root->rb_node)
		return NULL;
	node = rb_entry(root->rb_node, struct interval_tree_node, rb);

	while (true) {
		if (node->rb.rb_left) {
			struct interval_tree_node *left =
				rb_entry(node->rb.rb_left,
					 struct interval_tree_node, rb);
			if (left->__subtree_last >= start) {
				/*
				 * Some nodes in left subtree satisfy Cond2.
				 * Iterate to find the leftmost such node N.
				 * If it also satisfies Cond1, that's the match
				 * we are looking for. Otherwise, there is no
				 * matching interval as nodes to the right of N
				 * can't satisfy Cond1 either.
				 */
				node = left;
				continue;
			}
		}
		if (node->start <= last) {		/* Cond1 */
			if (node->last >= start)	/* Cond2 */
				return node;	/* node is leftmost match */
			if (node->rb.rb_right) {
				node = rb_entry(node->rb.rb_right,
					struct interval_tree_node, rb);
				if (node->__subtree_last >= start)
					continue;
			}
		}
		return NULL;	/* No match */
	}
}

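Cond1 and Cond2 in the lookup above together form the usual overlap test for
two closed intervals. As a small standalone sketch (intervals_overlap() is a
hypothetical helper written for this illustration, not a kernel function):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Closed intervals [a_start, a_last] and [b_start, b_last] overlap when
 * each one starts no later than the other one ends. With a as the stored
 * node and b as the query range, these are Cond1 and Cond2.
 */
static bool intervals_overlap(unsigned long a_start, unsigned long a_last,
			      unsigned long b_start, unsigned long b_last)
{
	assert(a_start <= a_last && b_start <= b_last);
	return a_start <= b_last &&	/* Cond1 */
	       a_last >= b_start;	/* Cond2 */
}
```

Since the bounds are inclusive, [10:20] and [20:30] count as overlapping,
while [10:20] and [21:30] do not.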
Insertion/removal are defined using the following augmented callbacks:

static inline unsigned long
compute_subtree_last(struct interval_tree_node *node)
{
	unsigned long max = node->last, subtree_last;
	if (node->rb.rb_left) {
		subtree_last = rb_entry(node->rb.rb_left,
			struct interval_tree_node, rb)->__subtree_last;
		if (max < subtree_last)
			max = subtree_last;
	}
	if (node->rb.rb_right) {
		subtree_last = rb_entry(node->rb.rb_right,
			struct interval_tree_node, rb)->__subtree_last;
		if (max < subtree_last)
			max = subtree_last;
	}
	return max;
}

static void augment_propagate(struct rb_node *rb, struct rb_node *stop)
{
	while (rb != stop) {
		struct interval_tree_node *node =
			rb_entry(rb, struct interval_tree_node, rb);
		unsigned long subtree_last = compute_subtree_last(node);
		if (node->__subtree_last == subtree_last)
			break;
		node->__subtree_last = subtree_last;
		rb = rb_parent(&node->rb);
	}
}

static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new)
{
	struct interval_tree_node *old =
		rb_entry(rb_old, struct interval_tree_node, rb);
	struct interval_tree_node *new =
		rb_entry(rb_new, struct interval_tree_node, rb);

	new->__subtree_last = old->__subtree_last;
}

static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new)
{
	struct interval_tree_node *old =
		rb_entry(rb_old, struct interval_tree_node, rb);
	struct interval_tree_node *new =
		rb_entry(rb_new, struct interval_tree_node, rb);

	new->__subtree_last = old->__subtree_last;
	old->__subtree_last = compute_subtree_last(old);
}

static const struct rb_augment_callbacks augment_callbacks = {
	.propagate = augment_propagate,
	.copy = augment_copy,
	.rotate = augment_rotate,
};

void interval_tree_insert(struct interval_tree_node *node,
			  struct rb_root *root)
{
	struct rb_node **link = &root->rb_node, *rb_parent = NULL;
	unsigned long start = node->start, last = node->last;
	struct interval_tree_node *parent;

	while (*link) {
		rb_parent = *link;
		parent = rb_entry(rb_parent, struct interval_tree_node, rb);
		if (parent->__subtree_last < last)
			parent->__subtree_last = last;
		if (start < parent->start)
			link = &parent->rb.rb_left;
		else
			link = &parent->rb.rb_right;
	}

	node->__subtree_last = last;
	rb_link_node(&node->rb, rb_parent, link);
	rb_insert_augmented(&node->rb, root, &augment_callbacks);
}

void interval_tree_remove(struct interval_tree_node *node,
			  struct rb_root *root)
{
	rb_erase_augmented(&node->rb, root, &augment_callbacks);
}