Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 1 | ================= |
| 2 | Directory Locking |
| 3 | ================= |
| 4 | |
| 5 | |
| 6 | Locking scheme used for directory operations is based on two |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 7 | kinds of locks - per-inode (->i_rwsem) and per-filesystem |
Josef 'Jeff' Sipek | c2b3898 | 2007-05-24 12:21:43 -0400 | [diff] [blame] | 8 | (->s_vfs_rename_mutex). |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 9 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 10 | When taking the i_rwsem on multiple non-directory objects, we |
J. Bruce Fields | 6cedba8 | 2012-03-05 11:40:41 -0500 | [diff] [blame] | 11 | always acquire the locks in order by increasing address. We'll call |
| 12 | that "inode pointer" order in the following. |
| 13 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 14 | For our purposes all operations fall in 5 classes: |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 15 | |
| 16 | 1) read access. Locking rules: caller locks directory we are accessing. |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 17 | The lock is taken shared. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 18 | |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 19 | 2) object creation. Locking rules: same as above, but the lock is taken |
| 20 | exclusive. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 21 | |
| 22 | 3) object removal. Locking rules: caller locks parent, finds victim, |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 23 | locks victim and calls the method. Locks are exclusive. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 24 | |
| 25 | 4) rename() that is _not_ cross-directory. Locking rules: caller locks |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 26 | the parent and finds source and target. In case of exchange (with |
Miklos Szeredi | 18fc84d | 2016-09-27 11:03:58 +0200 | [diff] [blame] | 27 | RENAME_EXCHANGE in flags argument) lock both. In any case, |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 28 | if the target already exists, lock it. If the source is a non-directory, |
| 29 | lock it. If we need to lock both, lock them in inode pointer order. |
| 30 | Then call the method. All locks are exclusive. |
Randy Dunlap | 2f32295 | 2020-07-03 14:43:19 -0700 | [diff] [blame^] | 31 | NB: we might get away with locking the source (and target in exchange |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 32 | case) shared. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 33 | |
| 34 | 5) link creation. Locking rules: |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 35 | |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 36 | * lock parent |
| 37 | * check that source is not a directory |
| 38 | * lock source |
| 39 | * call the method. |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 40 | |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 41 | All locks are exclusive. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 42 | |
| 43 | 6) cross-directory rename. The trickiest in the whole bunch. Locking |
| 44 | rules: |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 45 | |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 46 | * lock the filesystem |
| 47 | * lock parents in "ancestors first" order. |
| 48 | * find source and target. |
| 49 | * if old parent is equal to or is a descendent of target |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 50 | fail with -ENOTEMPTY |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 51 | * if new parent is equal to or is a descendent of source |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 52 | fail with -ELOOP |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 53 | * If it's an exchange, lock both the source and the target. |
| 54 | * If the target exists, lock it. If the source is a non-directory, |
| 55 | lock it. If we need to lock both, do so in inode pointer order. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 56 | * call the method. |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 57 | |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 58 | All ->i_rwsem are taken exclusive. Again, we might get away with locking |
Randy Dunlap | 2f32295 | 2020-07-03 14:43:19 -0700 | [diff] [blame^] | 59 | the source (and target in exchange case) shared. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 60 | |
| 61 | The rules above obviously guarantee that all directories that are going to be |
| 62 | read, modified or removed by method will be locked by caller. |
| 63 | |
| 64 | |
| 65 | If no directory is its own ancestor, the scheme above is deadlock-free. |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 66 | |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 67 | Proof: |
| 68 | |
| 69 | First of all, at any moment we have a partial ordering of the |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 70 | objects - A < B iff A is an ancestor of B. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 71 | |
| 72 | That ordering can change. However, the following is true: |
| 73 | |
| 74 | (1) if object removal or non-cross-directory rename holds lock on A and |
| 75 | attempts to acquire lock on B, A will remain the parent of B until we |
| 76 | acquire the lock on B. (Proof: only cross-directory rename can change |
| 77 | the parent of object and it would have to lock the parent). |
| 78 | |
| 79 | (2) if cross-directory rename holds the lock on filesystem, order will not |
| 80 | change until rename acquires all locks. (Proof: other cross-directory |
| 81 | renames will be blocked on filesystem lock and we don't start changing |
| 82 | the order until we had acquired all locks). |
| 83 | |
J. Bruce Fields | 6cedba8 | 2012-03-05 11:40:41 -0500 | [diff] [blame] | 84 | (3) locks on non-directory objects are acquired only after locks on |
| 85 | directory objects, and are acquired in inode pointer order. |
| 86 | (Proof: all operations but renames take lock on at most one |
| 87 | non-directory object, except renames, which take locks on source and |
| 88 | target in inode pointer order in the case they are not directories.) |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 89 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 90 | Now consider the minimal deadlock. Each process is blocked on |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 91 | attempt to acquire some lock and already holds at least one lock. Let's |
| 92 | consider the set of contended locks. First of all, filesystem lock is |
| 93 | not contended, since any process blocked on it is not holding any locks. |
Al Viro | d42b386 | 2016-05-26 00:04:18 -0400 | [diff] [blame] | 94 | Thus all processes are blocked on ->i_rwsem. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 95 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 96 | By (3), any process holding a non-directory lock can only be |
J. Bruce Fields | 6cedba8 | 2012-03-05 11:40:41 -0500 | [diff] [blame] | 97 | waiting on another non-directory lock with a larger address. Therefore |
| 98 | the process holding the "largest" such lock can always make progress, and |
| 99 | non-directory objects are not included in the set of contended locks. |
| 100 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 101 | Thus link creation can't be a part of deadlock - it can't be |
J. Bruce Fields | 6cedba8 | 2012-03-05 11:40:41 -0500 | [diff] [blame] | 102 | blocked on source and it means that it doesn't hold any locks. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 103 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 104 | Any contended object is either held by cross-directory rename or |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 105 | has a child that is also contended. Indeed, suppose that it is held by |
| 106 | operation other than cross-directory rename. Then the lock this operation |
| 107 | is blocked on belongs to child of that object due to (1). |
| 108 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 109 | It means that one of the operations is cross-directory rename. |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 110 | Otherwise the set of contended objects would be infinite - each of them |
| 111 | would have a contended child and we had assumed that no object is its |
| 112 | own descendent. Moreover, there is exactly one cross-directory rename |
| 113 | (see above). |
| 114 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 115 | Consider the object blocking the cross-directory rename. One |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 116 | of its descendents is locked by cross-directory rename (otherwise we |
Paolo Ornati | 670e9f3 | 2006-10-03 22:57:56 +0200 | [diff] [blame] | 117 | would again have an infinite set of contended objects). But that |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 118 | means that cross-directory rename is taking locks out of order. Due |
| 119 | to (2) the order hadn't changed since we had acquired filesystem lock. |
| 120 | But locking rules for cross-directory rename guarantee that we do not |
| 121 | try to acquire lock on descendent before the lock on ancestor. |
| 122 | Contradiction. I.e. deadlock is impossible. Q.E.D. |
| 123 | |
| 124 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 125 | These operations are guaranteed to avoid loop creation. Indeed, |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 126 | the only operation that could introduce loops is cross-directory rename. |
| 127 | Since the only new (parent, child) pair added by rename() is (new parent, |
| 128 | source), such loop would have to contain these objects and the rest of it |
| 129 | would have to exist before rename(). I.e. at the moment of loop creation |
| 130 | rename() responsible for that would be holding filesystem lock and new parent |
| 131 | would have to be equal to or a descendent of source. But that means that |
| 132 | new parent had been equal to or a descendent of source since the moment when |
| 133 | we had acquired filesystem lock and rename() would fail with -ELOOP in that |
| 134 | case. |
| 135 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 136 | While this locking scheme works for arbitrary DAGs, it relies on |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 137 | ability to check that directory is a descendent of another object. Current |
| 138 | implementation assumes that directory graph is a tree. This assumption is |
| 139 | also preserved by all operations (cross-directory rename on a tree that would |
| 140 | not introduce a cycle will leave it a tree and link() fails for directories). |
| 141 | |
Mauro Carvalho Chehab | ec23eb5 | 2019-07-26 09:51:27 -0300 | [diff] [blame] | 142 | Notice that "directory" in the above == "anything that might have |
Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 143 | children", so if we are going to introduce hybrid objects we will need |
| 144 | either to make sure that link(2) doesn't work for them or to make changes |
| 145 | in is_subdir() that would make it work even in presence of such beasts. |