.. SPDX-License-Identifier: GPL-2.0

========
ORANGEFS
========

OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
for large storage problems faced by HPC, BigData, Streaming Video,
Genomics, and Bioinformatics.

OrangeFS, originally called PVFS, was first developed in 1993 by
Walt Ligon and Eric Blumer as a parallel file system for Parallel
Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
of parallel programs.

OrangeFS features include:

  * Distributes file data among multiple file servers
  * Supports simultaneous access by multiple clients
  * Stores file data and metadata on servers using local file system
    and access methods
  * Userspace implementation is easy to install and maintain
  * Direct MPI support
  * Stateless


Mailing List Archives
=====================

http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/


Mailing List Submissions
========================

devel@lists.orangefs.org


Documentation
=============

http://www.orangefs.org/documentation/

Running ORANGEFS On a Single Server
===================================

OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.

On Fedora, install orangefs and orangefs-server::

    dnf -y install orangefs orangefs-server

There is an example server configuration file in
/etc/orangefs/orangefs.conf. Change localhost to your hostname if
necessary.

To generate a filesystem to run xfstests against, see below.

There is an example client configuration file in /etc/pvfs2tab. It is a
single line. Uncomment it and change the hostname if necessary. This
controls clients which use libpvfs2. This does not control the
pvfs2-client-core.

Create the filesystem::

    pvfs2-server -f /etc/orangefs/orangefs.conf

Start the server::

    systemctl start orangefs-server

Test the server::

    pvfs2-ping -m /pvfsmnt

Start the client. The module must be compiled in or loaded before this
point::

    systemctl start orangefs-client

Mount the filesystem::

    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt

Userspace Filesystem Source
===========================

http://www.orangefs.org/download

OrangeFS versions prior to 2.9.3 are not compatible with the
upstream version of the kernel client.


Building ORANGEFS on a Single Server
====================================

Where OrangeFS cannot be installed from distribution packages, it may be
built from source.

You can omit --prefix if you don't care that things are sprinkled around
in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by
default; we will probably be changing the default to LMDB soon.

::

    ./configure --prefix=/opt/ofs --with-db-backend=lmdb --disable-usrint

    make

    make install

Create an orangefs config file by running pvfs2-genconfig and
specifying a target config file. pvfs2-genconfig will prompt you
through the process. Generally it works fine to take the defaults, but
you should use your server's hostname, rather than "localhost", when
it comes to that question::

    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf

Create an /etc/pvfs2tab file (localhost is fine)::

    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
        /etc/pvfs2tab

Create the mount point you specified in the tab file if needed::

    mkdir /pvfsmnt

Bootstrap the server::

    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf

Start the server::

    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server should be running. pvfs2-ls is a simple
test to verify that the server is running::

    /opt/ofs/bin/pvfs2-ls /pvfsmnt

If everything seems to be working, load the kernel module and
turn on the client core::

    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core

Mount your filesystem::

    mount -t pvfs2 tcp://`hostname`:3334/orangefs /pvfsmnt


Running xfstests
================

It is useful to use a scratch filesystem with xfstests. This can be
done with only one server.

Make a second copy of the FileSystem section in the server configuration
file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch.
Change the ID to something other than the ID of the first FileSystem
section (2 is usually a good choice).

Then there are two FileSystem sections: orangefs and scratch.

This change should be made before creating the filesystem.

::

    pvfs2-server -f /etc/orangefs/orangefs.conf

To run xfstests, create /etc/xfsqa.config::

    TEST_DIR=/orangefs
    TEST_DEV=tcp://localhost:3334/orangefs
    SCRATCH_MNT=/scratch
    SCRATCH_DEV=tcp://localhost:3334/scratch

Then xfstests can be run::

    ./check -pvfs2


Options
=======

The following mount options are accepted:

  acl
    Allow the use of Access Control Lists on files and directories.

  intr
    Some operations between the kernel client and the user space
    filesystem can be interruptible, such as changes in debug levels
    and the setting of tunable parameters.

  local_lock
    Enable POSIX locking from the perspective of "this" kernel. The
    default file_operations lock action is to return ENOSYS. POSIX
    locking kicks in if the filesystem is mounted with -o local_lock.
    Distributed locking is being worked on for the future.

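The effect of local_lock can be observed from an ordinary program. The
following is a minimal sketch, not part of OrangeFS: the function name
``probe_posix_lock`` and the file path in the usage note are hypothetical.
It takes and releases a whole-file POSIX write lock with fcntl(F_SETLK) and
reports the errno value, which, going by the description above, would be
expected to be ENOSYS on a mount made without -o local_lock.

.. code-block:: c

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Hypothetical helper: try to take and release a whole-file POSIX
     * write lock on "path". Returns 0 on success or a negative errno
     * value on failure.
     */
    int probe_posix_lock(const char *path)
    {
        struct flock fl = {
            .l_type = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start = 0,
            .l_len = 0,     /* 0 means "through end of file" */
        };
        int fd, ret = 0;

        fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0)
            return -errno;

        if (fcntl(fd, F_SETLK, &fl) < 0) {
            ret = -errno;   /* save before close() can change errno */
        } else {
            fl.l_type = F_UNLCK;
            if (fcntl(fd, F_SETLK, &fl) < 0)
                ret = -errno;
        }

        close(fd);
        return ret;
    }

For example, ``probe_posix_lock("/pvfsmnt/lockprobe")`` could be called
once after mounting without the option and once after mounting with it,
and the two return values compared.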

Debugging
=========

If you want the debug (GOSSIP) statements in a particular
source file (inode.c for example) to go to syslog::

    echo inode > /sys/kernel/debug/orangefs/kernel-debug

No debugging (the default)::

    echo none > /sys/kernel/debug/orangefs/kernel-debug

Debugging from several source files::

    echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug

All debugging::

    echo all > /sys/kernel/debug/orangefs/kernel-debug

Get a list of all debugging keywords::

    cat /sys/kernel/debug/orangefs/debug-help

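The same control file can be driven from a program. The sketch below is
illustrative only: ``join_debug_keywords`` and ``write_debug_mask`` are
hypothetical helper names, and the only interface assumed is the one shown
above, namely that a comma-separated keyword list is written to the
kernel-debug file.

.. code-block:: c

    #include <stdio.h>

    /*
     * Join "count" keywords with commas into "buf", so that
     * {"inode", "dir"} becomes "inode,dir". Returns 0 on success or
     * -1 if "buf" is too small.
     */
    int join_debug_keywords(const char *const *keywords, size_t count,
                            char *buf, size_t buflen)
    {
        size_t used = 0;
        size_t i;

        if (buflen == 0)
            return -1;
        buf[0] = '\0';

        for (i = 0; i < count; i++) {
            int n = snprintf(buf + used, buflen - used, "%s%s",
                             i ? "," : "", keywords[i]);

            /* snprintf reports the length it wanted to write */
            if (n < 0 || (size_t)n >= buflen - used)
                return -1;
            used += (size_t)n;
        }
        return 0;
    }

    /*
     * Write a keyword string to a debug control file. "path" would
     * normally be /sys/kernel/debug/orangefs/kernel-debug; it is a
     * parameter so the sketch can be tried against an ordinary file.
     */
    int write_debug_mask(const char *path, const char *mask)
    {
        FILE *f = fopen(path, "w");
        int ret = 0;

        if (!f)
            return -1;
        if (fprintf(f, "%s\n", mask) < 0)
            ret = -1;
        if (fclose(f) != 0)
            ret = -1;
        return ret;
    }

Writing to the real debugfs file requires root and a loaded orangefs
module; passing "none" restores the default.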

Protocol between Kernel Module and Userspace
============================================

OrangeFS is a user space filesystem and an associated kernel module.
We'll just refer to the user space part of OrangeFS as "userspace"
from here on out. OrangeFS descends from PVFS, and userspace code
still uses PVFS for function and variable names. Userspace typedefs
many of the important structures. Function and variable names in
the kernel module have been transitioned to "orangefs", and the Linux
kernel coding style avoids typedefs, so kernel module structures that
correspond to userspace structures are not typedefed.

The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.

The Bufmap
----------

At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers: one is used for IO and one is used for readdir
operations. The IO buffer is 41943040 bytes and the readdir buffer is
4194304 bytes. Each buffer contains logical chunks, or partitions, and
a pointer to each buffer is added to its own PVFS_dev_map_desc structure
which also describes its total size, as well as the size and number of
the partitions.

A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
mapping routine in the kernel module with an ioctl. The structure is
copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:

  * refcnt - a reference counter
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
    partition size, which represents the filesystem's block size and
    is used for s_blocksize in super blocks.
  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
    partitions in the IO buffer.
  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
  * total_size - the total size of the IO buffer.
  * page_count - the number of 4096 byte pages in the IO buffer.
  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
    of kcalloced memory. This memory is used as an array of pointers
    to each of the pages in the IO buffer through a call to get_user_pages.
  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
    bytes of kcalloced memory. This memory is further initialized:

      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
      structure. user_desc->ptr points to the IO buffer.

      ::

        pages_per_desc = bufmap->desc_size / PAGE_SIZE
        offset = 0

        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[0].array_count = pages_per_desc = 1024
        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
        offset += 1024
                           .
                           .
                           .
        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[9].array_count = pages_per_desc = 1024
        bufmap->desc_array[9].uaddr = (user_desc->ptr) +
                                               (9 * 1024 * 4096)
        offset += 1024

  * buffer_index_array - a desc_count sized array of ints, used to
    indicate which of the IO buffer's partitions are available to use.
  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
    int array used to indicate which of the readdir buffer's partitions are
    available to use.
  * readdir_index_lock - a spinlock to protect readdir_index_array during
    update.

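The arithmetic behind the bufmap fields can be checked with a small
userspace model. This is a sketch under stated assumptions rather than the
kernel code: all ``sketch_`` names are hypothetical, the constants are the
values quoted above (4096-byte pages, ten partitions of 4194304 bytes), and
``page_offset`` stands in for the ``&bufmap->page_array[offset]`` pointer.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    #define SKETCH_PAGE_SIZE   4096u      /* page size assumed by the text */
    #define SKETCH_DESC_SIZE   4194304u   /* IO partition size from the text */
    #define SKETCH_DESC_COUNT  10u        /* IO partition count from the text */

    /* Hypothetical stand-in for one entry of bufmap->desc_array. */
    struct sketch_desc {
        size_t page_offset;     /* index of the first page in page_array */
        size_t array_count;     /* number of pages in this partition */
        uintptr_t uaddr;        /* user address where the partition starts */
    };

    /* Hypothetical stand-in for the size-related bufmap fields. */
    struct sketch_bufmap {
        uint32_t desc_size;
        uint32_t desc_count;
        uint32_t desc_shift;
        uint32_t total_size;
        uint32_t page_count;
        struct sketch_desc desc_array[SKETCH_DESC_COUNT];
    };

    /* Integer log2, exact for powers of two. */
    static uint32_t sketch_log2(uint32_t v)
    {
        uint32_t shift = 0;

        while (v > 1) {
            v >>= 1;
            shift++;
        }
        return shift;
    }

    /* Derive the size fields and carve the buffer at "base" into partitions. */
    void sketch_bufmap_init(struct sketch_bufmap *b, uintptr_t base)
    {
        size_t pages_per_desc = SKETCH_DESC_SIZE / SKETCH_PAGE_SIZE;
        size_t offset = 0;
        uint32_t i;

        b->desc_size = SKETCH_DESC_SIZE;
        b->desc_count = SKETCH_DESC_COUNT;
        b->desc_shift = sketch_log2(b->desc_size);
        b->total_size = b->desc_size * b->desc_count;
        b->page_count = b->total_size / SKETCH_PAGE_SIZE;

        for (i = 0; i < b->desc_count; i++) {
            b->desc_array[i].page_offset = offset;
            b->desc_array[i].array_count = pages_per_desc;
            b->desc_array[i].uaddr =
                base + (uintptr_t)i * pages_per_desc * SKETCH_PAGE_SIZE;
            offset += pages_per_desc;
        }
    }

With these inputs the model yields 1024 pages per partition, a desc_shift
of 22, a total_size of 41943040 bytes and a page_count of 10240, which
matches the figures given in the list above.
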
Operations
----------

The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"
which expresses the request to userspace. Part of the op eventually
contains the "downcall" which expresses the results of the request.

The slab allocator is used to keep a cache of op structures handy.

At init time the kernel module defines and initializes a request list
and an in_progress hash table to keep track of all the ops that are
in flight at any given time.

Ops are stateful:

  * unknown - op was just initialized
  * waiting - op is on request_list (upward bound)
  * inprogr - op is in progress (waiting for downcall)
  * serviced - op has matching downcall; ok
  * purged - op has to start a timer since client-core
    exited uncleanly before servicing op
  * given up - submitter has given up waiting for it

When some arbitrary userspace program needs to perform a
filesystem operation on OrangeFS (readdir, I/O, create, whatever)
an op structure is initialized and tagged with a distinguishing ID
number. The upcall part of the op is filled out, and the op is
passed to the "service_operation" function.

service_operation changes the op's state to "waiting", puts
it on the request list, and signals the OrangeFS file_operations.poll
function through a wait queue. Userspace is polling the pseudo-device
and thus becomes aware of the upcall request that needs to be read.

When the OrangeFS file_operations.read function is triggered, the
request list is searched for an op that seems ready-to-process.
The op is removed from the request list. The tag from the op and
the filled-out upcall struct are copy_to_user'ed back to userspace.

If any of these (and some additional protocol) copy_to_users fail,
the op's state is set to "waiting" and the op is added back to
the request list. Otherwise, the op's state is changed to "in progress",
and the op is hashed on its tag and put onto the end of a list in the
in_progress hash table at the index the tag hashed to.

When userspace has assembled the response to the upcall, it
writes the response, which includes the distinguishing tag, back to
the pseudo device in a series of io_vecs. This triggers the OrangeFS
file_operations.write_iter function to find the op with the associated
tag and remove it from the in_progress hash table. As long as the op's
state is not "canceled" or "given up", its state is set to "serviced".
The file_operations.write_iter function returns to the waiting vfs,
and back to service_operation through wait_for_matching_downcall.

service_operation returns to its caller with the op's downcall
part (the response to the upcall) filled out.

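The life cycle described above can be modelled in a few lines of userspace
C. This is a simplified sketch, not the kernel implementation: every
``sketch_`` name and the table size are hypothetical, locking, wait queues
and the copy_to_user failure path are left out, and only the "given up"
check is modelled on the write side.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    enum sketch_op_state {
        OP_UNKNOWN,     /* just initialized */
        OP_WAITING,     /* on the request list */
        OP_INPROGR,     /* read by userspace, waiting for downcall */
        OP_SERVICED,    /* matching downcall arrived */
        OP_PURGED,      /* client-core exited before servicing */
        OP_GIVEN_UP,    /* submitter stopped waiting */
    };

    struct sketch_op {
        uint64_t tag;
        enum sketch_op_state state;
        struct sketch_op *next;
    };

    #define SKETCH_HASH_SIZE 16     /* hypothetical table size */

    struct sketch_core {
        struct sketch_op *request_list;
        struct sketch_op *in_progress[SKETCH_HASH_SIZE];
    };

    /* Append "op" to the end of the singly linked list at "head". */
    static void sketch_append(struct sketch_op **head, struct sketch_op *op)
    {
        while (*head)
            head = &(*head)->next;
        op->next = NULL;
        *head = op;
    }

    /* First half of service_operation: queue the op for userspace. */
    void sketch_submit(struct sketch_core *c, struct sketch_op *op)
    {
        op->state = OP_WAITING;
        sketch_append(&c->request_list, op);
    }

    /* Device read: move the oldest waiting op to the in_progress table. */
    struct sketch_op *sketch_dev_read(struct sketch_core *c)
    {
        struct sketch_op *op = c->request_list;

        if (!op)
            return NULL;
        c->request_list = op->next;
        op->state = OP_INPROGR;
        sketch_append(&c->in_progress[op->tag % SKETCH_HASH_SIZE], op);
        return op;
    }

    /* Device write: find the op for a downcall by tag and unhash it. */
    struct sketch_op *sketch_dev_write(struct sketch_core *c, uint64_t tag)
    {
        struct sketch_op **pp = &c->in_progress[tag % SKETCH_HASH_SIZE];

        for (; *pp; pp = &(*pp)->next) {
            struct sketch_op *op = *pp;

            if (op->tag != tag)
                continue;
            *pp = op->next;
            op->next = NULL;
            if (op->state != OP_GIVEN_UP)
                op->state = OP_SERVICED;
            return op;
        }
        return NULL;
    }

Two ops whose tags hash to the same bucket are chained in arrival order,
so a downcall for the second one can be matched without disturbing the
first.
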
The "client-core" is the bridge between the kernel module and
userspace. The client-core is a daemon. The client-core has an
associated watchdog daemon. If the client-core is ever signaled
to die, the watchdog daemon restarts the client-core. Even though
the client-core is restarted "right away", there is a period of
time during such an event that the client-core is dead. A dead client-core
can't be triggered by the OrangeFS file_operations.poll function.
Ops that pass through service_operation during a "dead spell" can time out
on the wait queue and one attempt is made to recycle them. Obviously,
if the client-core stays dead too long, the arbitrary userspace processes
trying to use OrangeFS will be negatively affected. Waiting ops
that can't be serviced will be removed from the request list and
have their states set to "given up". In-progress ops that can't
be serviced will be removed from the in_progress hash table and
have their states set to "given up".

Readdir and I/O ops are atypical with respect to their payloads.

  - readdir ops use the smaller of the two pre-allocated pre-partitioned
    memory buffers. The readdir buffer is only available to userspace.
    The kernel module obtains an index to a free partition before launching
    a readdir op. Userspace deposits the results into the indexed partition
    and then writes them back to the pvfs device.

  - io (read and write) ops use the larger of the two pre-allocated
    pre-partitioned memory buffers. The IO buffer is accessible from
    both userspace and the kernel module. The kernel module obtains an
    index to a free partition before launching an io op. The kernel module
    deposits write data into the indexed partition, to be consumed
    directly by userspace. Userspace deposits the results of read
    requests into the indexed partition, to be consumed directly
    by the kernel module.

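Obtaining "an index to a free partition" amounts to claiming a slot in one
of the index arrays described in the Bufmap section. The sketch below
shows one plausible way to do that. It is an assumption, not the kernel
code: the ``sketch_slot_`` names are hypothetical, 0 is taken to mean
"free" and 1 "in use", and the spinlock that protects the real arrays is
omitted.

.. code-block:: c

    #include <stddef.h>

    /*
     * Claim the first free partition in "index_array". Returns the
     * partition index, or -1 when every partition is busy.
     */
    int sketch_slot_get(int *index_array, size_t count)
    {
        size_t i;

        for (i = 0; i < count; i++) {
            if (!index_array[i]) {
                index_array[i] = 1;
                return (int)i;
            }
        }
        return -1;
    }

    /* Mark a partition free again once the op that used it completes. */
    void sketch_slot_put(int *index_array, size_t count, int slot)
    {
        if (slot >= 0 && (size_t)slot < count)
            index_array[slot] = 0;
    }

A caller would pass an array of ten ints for the IO buffer or five for the
readdir buffer, and would have to wait or fail when -1 is returned.
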
Responses to kernel requests are all packaged in pvfs2_downcall_t
structs. Besides a few other members, pvfs2_downcall_t contains a
union of structs, each of which is associated with a particular
response type.

The several members outside of the union are:

 ``int32_t type``
    - type of operation.
 ``int32_t status``
    - return code for the operation.
 ``int64_t trailer_size``
    - 0 unless readdir operation.
 ``char *trailer_buf``
    - initialized to NULL, used during readdir operations.

The appropriate member inside the union is filled out for any
particular response.

  PVFS2_VFS_OP_FILE_IO
    fill a pvfs2_io_response_t

  PVFS2_VFS_OP_LOOKUP
    fill a PVFS_object_kref

  PVFS2_VFS_OP_CREATE
    fill a PVFS_object_kref

  PVFS2_VFS_OP_SYMLINK
    fill a PVFS_object_kref

  PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
    fill in a string with the link target when the object is a symlink.

  PVFS2_VFS_OP_MKDIR
    fill a PVFS_object_kref

  PVFS2_VFS_OP_STATFS
    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
    us to know, in a timely fashion, these statistics about our
    distributed network filesystem.

  PVFS2_VFS_OP_FS_MOUNT
    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
    except its members are in a different order and "__pad1" is replaced
    with "id".

  PVFS2_VFS_OP_GETXATTR
    fill a pvfs2_getxattr_response_t

  PVFS2_VFS_OP_LISTXATTR
    fill a pvfs2_listxattr_response_t

  PVFS2_VFS_OP_PARAM
    fill a pvfs2_param_response_t

  PVFS2_VFS_OP_PERF_COUNT
    fill a pvfs2_perf_count_response_t

  PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t

  PVFS2_VFS_OP_READDIR
    jam everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.

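The overall shape of the downcall, four fixed members followed by a union
of per-operation responses, can be sketched as follows. The four leading
members and their types come from the list above. Everything else is a
placeholder: the ``sketch_`` names are hypothetical, only two union arms
are shown, and their opaque sizes are invented for illustration and do not
reflect the real pvfs2_io_response_t or PVFS_object_kref layouts.

.. code-block:: c

    #include <stdint.h>
    #include <string.h>

    /* Opaque placeholders; the real response structs are defined by OrangeFS. */
    struct sketch_io_resp   { char opaque[8]; };
    struct sketch_kref_resp { char opaque[24]; };

    struct sketch_downcall {
        int32_t type;           /* type of operation */
        int32_t status;         /* return code for the operation */
        int64_t trailer_size;   /* 0 unless readdir operation */
        char *trailer_buf;      /* NULL unless readdir operation */
        union {
            struct sketch_io_resp io;       /* e.g. for file IO */
            struct sketch_kref_resp kref;   /* e.g. for lookup or create */
        } resp;
    };

    /* Start every response from a known state: no trailer, zeroed union. */
    void sketch_downcall_init(struct sketch_downcall *d,
                              int32_t type, int32_t status)
    {
        memset(d, 0, sizeof(*d));
        d->type = type;
        d->status = status;
        d->trailer_buf = NULL;
    }

Only the union member that matches ``type`` is meaningful in a given
response; a readdir response would additionally set trailer_size and
trailer_buf.
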
Mike Marshall | 9f08cfe | 2016-02-26 14:39:08 -0500 | [diff] [blame] | 473 | Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests |
Mike Marshall | fcac9d5 | 2016-01-13 14:28:13 -0500 | [diff] [blame] | 474 | made by the kernel side. |
| 475 | |
| 476 | A buffer_list containing: |
Mauro Carvalho Chehab | 18ccb22 | 2020-02-17 17:12:17 +0100 | [diff] [blame] | 477 | |
Mike Marshall | fcac9d5 | 2016-01-13 14:28:13 -0500 | [diff] [blame] | 478 | - a pointer to the prepared response to the request from the |
| 479 | kernel (struct pvfs2_downcall_t). |
| 480 | - and also, in the case of a readdir request, a pointer to a |
| 481 | buffer containing descriptors for the objects in the target |
| 482 | directory. |
Mauro Carvalho Chehab | 18ccb22 | 2020-02-17 17:12:17 +0100 | [diff] [blame] | 483 | |
Mike Marshall | fcac9d5 | 2016-01-13 14:28:13 -0500 | [diff] [blame] | 484 | ... is sent to the function (PINT_dev_write_list) which performs |
| 485 | the writev. |
| 486 | |
PINT_dev_write_list has a local iovec array: struct iovec io_array[10];

The first four elements of io_array are initialized like this for all
responses::

  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
  io_array[0].iov_len = sizeof(int32_t)

  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
  io_array[1].iov_len = sizeof(int32_t)

  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
  io_array[2].iov_len = sizeof(int64_t)

  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
                         of global variable vfs_request (vfs_request_t)
  io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this::

  io_array[4].iov_base = contents of member trailer_buf (char *)
                         from out_downcall member of global variable
                         vfs_request
  io_array[4].iov_len = contents of member trailer_size (PVFS_size)
                        from out_downcall member of global variable
                        vfs_request

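The iovec layout described above can be sketched in userspace C roughly as
follows. This is only an illustration, not the client-core source: struct
demo_downcall and demo_write_list are hypothetical stand-ins for
pvfs2_downcall_t and PINT_dev_write_list, the real structure layout differs,
and the descriptor may be any writable file descriptor rather than
/dev/pvfs2-req. Only writev() and struct iovec are real POSIX interfaces here.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for pvfs2_downcall_t; the real layout differs. */
struct demo_downcall {
    int32_t type;
    int32_t status;
    int64_t trailer_size;   /* bytes in trailer_buf, 0 if there is none */
    char *trailer_buf;      /* readdir payload, NULL for other responses */
};

/*
 * Gather the three header fields, the downcall and, for readdir-style
 * responses, the trailer into a single writev() call. Returns the byte
 * count reported by writev(), or -1 on error.
 */
ssize_t demo_write_list(int fd, int32_t proto_ver, int32_t magic,
                        int64_t tag, struct demo_downcall *dc)
{
    struct iovec io_array[10];
    int count = 4;

    io_array[0].iov_base = &proto_ver;
    io_array[0].iov_len = sizeof(int32_t);

    io_array[1].iov_base = &magic;
    io_array[1].iov_len = sizeof(int32_t);

    io_array[2].iov_base = &tag;
    io_array[2].iov_len = sizeof(int64_t);

    io_array[3].iov_base = dc;
    io_array[3].iov_len = sizeof(*dc);

    /* Only responses that carry a trailer use the fifth element. */
    if (dc->trailer_buf != NULL && dc->trailer_size > 0) {
        io_array[4].iov_base = dc->trailer_buf;
        io_array[4].iov_len = (size_t)dc->trailer_size;
        count = 5;
    }

    return writev(fd, io_array, count);
}
```

Because writev() submits all elements in one call, the header fields, the
downcall and any trailer reach the device as a single message.
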
Orangefs exploits the dcache in order to avoid sending redundant
requests to userspace. We keep object inode attributes up-to-date with
orangefs_inode_getattr, which uses two arguments to help it decide
whether or not to update an inode: "new" and "bypass".
Orangefs keeps private data in an object's inode that includes a short
timeout value, getattr_time, which allows any invocation of
orangefs_inode_getattr to know how long it has been since the inode was
updated. When the object is not new (new == 0) and the bypass flag is not
set (bypass == 0), orangefs_inode_getattr returns without updating the inode
if getattr_time has not timed out. The getattr_time value is updated each
time the inode is updated.

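The decision described above can be modelled in plain C as shown below. This
is a sketch under stated assumptions rather than the kernel code: struct
demo_inode_info, demo_need_getattr and demo_mark_updated are hypothetical
names, the tick counter is passed in explicitly instead of reading jiffies,
the "new" argument is spelled is_new, and getattr_time is assumed to hold the
tick at which the cached attributes expire.

```c
#include <stdbool.h>

/* Hypothetical per-inode private data; only getattr_time is from the text. */
struct demo_inode_info {
    unsigned long getattr_time;   /* assumed: tick at which the cache expires */
};

/* Wrap-tolerant "a is later than b", the idea behind the kernel's time_after(). */
static bool demo_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Return true when attributes must be fetched from userspace,
 * false when the cached copy may still be used.
 */
bool demo_need_getattr(const struct demo_inode_info *info,
                       unsigned long now, int is_new, int bypass)
{
    if (is_new || bypass)
        return true;
    return demo_time_after(now, info->getattr_time);
}

/* Record a fresh expiry tick after the inode has been updated. */
void demo_mark_updated(struct demo_inode_info *info,
                       unsigned long now, unsigned long timeout)
{
    info->getattr_time = now + timeout;
}
```

With a timeout of 50 ticks set at tick 1000, a call at tick 1010 with both
flags clear reuses the cached attributes, while setting either flag or calling
after tick 1050 forces a refresh.
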
Creation of a new object (file, dir, sym-link) includes the evaluation of
its pathname, resulting in a negative directory entry for the object.
A new inode is allocated and associated with the dentry, turning it from
a negative dentry into a "productive full member of society". Orangefs
obtains the new inode from Linux with new_inode() and associates
the inode with the dentry by sending the pair back to Linux with
d_instantiate().

The evaluation of a pathname for an object resolves to its corresponding
dentry. If there is no corresponding dentry, one is created for it in
the dcache. Whenever a dentry is modified or verified, Orangefs stores a
short timeout value in the dentry's d_time, and the dentry will be trusted
for that amount of time. Orangefs is a network filesystem, and objects
can potentially change out-of-band with any particular Orangefs kernel module
instance, so trusting a dentry is risky. The alternative to trusting
dentries is to always obtain the needed information from userspace: at
least a trip to the client-core, maybe to the servers. Obtaining information
from a dentry is cheap; obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.

The timeout values d_time and getattr_time are jiffy-based, and the
code is designed to avoid the jiffy-wrap problem::

    "In general, if the clock may have wrapped around more than once, there
    is no way to tell how much time has elapsed. However, if the times t1
    and t2 are known to be fairly close, we can reliably compute the
    difference in a way that takes into account the possibility that the
    clock may have wrapped between times."

from course notes by instructor Andy Wang

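The wrap-tolerant comparison that the quote describes can be illustrated with
a fixed-width counter. The sketch below is not taken from the Orangefs
sources: demo_tick_after and demo_ticks_elapsed are hypothetical helpers, and
a 32-bit counter is assumed so that the wrap is easy to reproduce. The
technique is unsigned subtraction followed by a signed interpretation of the
result, which is the same idea the kernel's time_after() family relies on.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if tick a is later than tick b, even when the counter wrapped
 * between them. Valid only while the two ticks are less than 2^31 apart.
 * Relies on the usual two's complement conversion to int32_t.
 */
bool demo_tick_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Ticks elapsed from t1 to t2; modular arithmetic absorbs a single wrap. */
uint32_t demo_ticks_elapsed(uint32_t t1, uint32_t t2)
{
    return t2 - t1;
}
```

For example, tick 5 is treated as later than tick 0xFFFFFFF0 even though it is
numerically smaller, because the unsigned difference is interpreted as a small
signed distance.
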