.. SPDX-License-Identifier: GPL-2.0

=================================
Network Filesystem Helper Library
=================================

.. Contents:

 - Overview.
 - Buffered read helpers.
   - Read helper functions.
   - Read helper structures.
   - Read helper operations.
   - Read helper procedure.
   - Read helper cache API.


Overview
========

The network filesystem helper library is a set of functions designed to aid a
network filesystem in implementing VM/VFS operations. For the moment, that
just includes turning various VM buffered read operations into requests to read
from the server. The helper library, however, can also interpose other
services, such as local caching or local data encryption.

Note that the library module doesn't link against local caching directly, so
access must be provided by the netfs.


Buffered Read Helpers
=====================

The library provides a set of read helpers that handle the ->readpage(),
->readahead() and much of the ->write_begin() VM operations and translate them
into a common call framework.

The following services are provided:

 * Handle folios that span multiple pages.

 * Insulate the netfs from VM interface changes.

 * Allow the netfs to arbitrarily split reads up into pieces, even ones that
   don't match folio sizes or folio alignments and that may cross folios.

 * Allow the netfs to expand a readahead request in both directions to meet its
   needs.

 * Allow the netfs to partially fulfil a read, which will then be resubmitted.

 * Handle local caching, allowing cached data and server-read data to be
   interleaved for a single request.

 * Handle clearing of buffer regions that aren't on the server.

 * Handle retrying of reads that failed, switching reads from the cache to the
   server as necessary.

 * In the future, this is a place where other services can be performed, such as
   local encryption of data to be stored remotely or in the cache.

From the network filesystem, the helpers require a table of operations. This
includes a mandatory method to issue a read operation along with a number of
optional methods.


Read Helper Functions
---------------------

Three read helpers are provided::

	void netfs_readahead(struct readahead_control *ractl,
			     const struct netfs_read_request_ops *ops,
			     void *netfs_priv);
	int netfs_readpage(struct file *file,
			   struct folio *folio,
			   const struct netfs_read_request_ops *ops,
			   void *netfs_priv);
	int netfs_write_begin(struct file *file,
			      struct address_space *mapping,
			      loff_t pos,
			      unsigned int len,
			      unsigned int flags,
			      struct folio **_folio,
			      void **_fsdata,
			      const struct netfs_read_request_ops *ops,
			      void *netfs_priv);

Each corresponds to a VM operation, with the addition of a couple of parameters
for the use of the read helpers:

 * ``ops``

   A table of operations through which the helpers can talk to the filesystem.

 * ``netfs_priv``

   Filesystem private data (can be NULL).

Both of these values will be stored into the read request structure.

For ->readahead() and ->readpage(), the network filesystem should just jump
into the corresponding read helper; whereas for ->write_begin(), it may be a
little more complicated as the network filesystem might want to flush
conflicting writes or track dirty data and needs to put the acquired folio if
an error occurs after calling the helper.

The helpers manage the read request, calling back into the network filesystem
through the supplied table of operations. Waits will be performed as
necessary before returning for helpers that are meant to be synchronous.

If an error occurs and netfs_priv is non-NULL, ops->cleanup() will be called to
deal with it. If some parts of the request are in progress when an error
occurs, the request will get partially completed if sufficient data is read.

Additionally, there is::

	void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
				     ssize_t transferred_or_error,
				     bool was_async);

which should be called to complete a read subrequest. This is given the number
of bytes transferred or a negative error code, plus a flag indicating whether
the operation was asynchronous (i.e. whether the follow-on processing can be
done in the current context, given this may involve sleeping).
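
The meaning of the ``transferred_or_error`` argument can be illustrated with a
small userspace model. This is only a sketch: ``struct demo_subreq`` and
``demo_subreq_terminated()`` are hypothetical stand-ins invented for
illustration, not the kernel structure or function, and the ``was_async`` flag
is left out because the model never sleeps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical, simplified stand-in for struct netfs_read_subrequest. */
struct demo_subreq {
	size_t len;		/* Length of the slice */
	size_t transferred;	/* Bytes transferred so far */
	int error;		/* 0 or a negative error code */
	bool done;		/* True once the whole slice has been read */
};

/*
 * Model of a termination call: a non-negative value is a byte count that is
 * added to ->transferred; a negative value is recorded as an error code.
 */
static void demo_subreq_terminated(struct demo_subreq *subreq,
				   ssize_t transferred_or_error)
{
	assert(subreq != NULL);

	if (transferred_or_error < 0) {
		subreq->error = (int)transferred_or_error;
		return;
	}

	subreq->transferred += (size_t)transferred_or_error;
	if (subreq->transferred >= subreq->len)
		subreq->done = true;
}
```

In this model a short read leaves ``done`` false, which corresponds to the
case where the helpers would reissue the remainder of the slice.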


Read Helper Structures
----------------------

The read helpers make use of a couple of structures to maintain the state of
the read. The first is a structure that manages a read request as a whole::

	struct netfs_read_request {
		struct inode		*inode;
		struct address_space	*mapping;
		struct netfs_cache_resources cache_resources;
		void			*netfs_priv;
		loff_t			start;
		size_t			len;
		loff_t			i_size;
		const struct netfs_read_request_ops *netfs_ops;
		unsigned int		debug_id;
		...
	};

The above fields are the ones the netfs can use. They are:

 * ``inode``
 * ``mapping``

   The inode and the address space of the file being read from. The mapping
   may or may not point to inode->i_data.

 * ``cache_resources``

   Resources for the local cache to use, if present.

 * ``netfs_priv``

   The network filesystem's private data. The value for this can be passed in
   to the helper functions or set during the request. The ->cleanup() op will
   be called if this is non-NULL at the end.

 * ``start``
 * ``len``

   The file position of the start of the read request and the length. These
   may be altered by the ->expand_readahead() op.

 * ``i_size``

   The size of the file at the start of the request.

 * ``netfs_ops``

   A pointer to the operation table. The value for this is passed into the
   helper functions.

 * ``debug_id``

   A number allocated to this operation that can be displayed in trace lines
   for reference.


The second structure is used to manage individual slices of the overall read
request::

	struct netfs_read_subrequest {
		struct netfs_read_request *rreq;
		loff_t			start;
		size_t			len;
		size_t			transferred;
		unsigned long		flags;
		unsigned short		debug_index;
		...
	};

Each subrequest is expected to access a single source, though the helpers will
handle falling back from one source type to another. The members are:

 * ``rreq``

   A pointer to the read request.

 * ``start``
 * ``len``

   The file position of the start of this slice of the read request and the
   length.

 * ``transferred``

   The amount of data transferred so far of the length of this slice. The
   network filesystem or cache should start the operation this far into the
   slice. If a short read occurs, the helpers will call again, having updated
   this to reflect the amount read so far.

 * ``flags``

   Flags pertaining to the read. There are two of interest to the filesystem
   or cache:

   * ``NETFS_SREQ_CLEAR_TAIL``

     This can be set to indicate that the remainder of the slice, from
     transferred to len, should be cleared.

   * ``NETFS_SREQ_SEEK_DATA_READ``

     This is a hint to the cache that it might want to try skipping ahead to
     the next data (i.e. using SEEK_DATA).

 * ``debug_index``

   A number allocated to this slice that can be displayed in trace lines for
   reference.
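
The relationship between ``start``, ``len`` and ``transferred`` can be sketched
in userspace as follows. The ``struct demo_slice`` type, its ``buf`` member
(standing in for the folio contents) and the three functions are hypothetical
names used only for illustration; they are not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a subrequest plus its data buffer. */
struct demo_slice {
	long long start;	/* File position of the start of the slice */
	size_t len;		/* Length of the slice */
	size_t transferred;	/* Bytes already transferred */
	unsigned char *buf;	/* len bytes modelling the folio contents */
};

/* File position at which a (re)issued read should begin. */
static long long demo_resume_pos(const struct demo_slice *s)
{
	assert(s->transferred <= s->len);
	return s->start + (long long)s->transferred;
}

/* Number of bytes of the slice still to be obtained. */
static size_t demo_remaining(const struct demo_slice *s)
{
	assert(s->transferred <= s->len);
	return s->len - s->transferred;
}

/* Model of the clear-tail effect: zero everything from transferred to len. */
static void demo_clear_tail(struct demo_slice *s)
{
	memset(s->buf + s->transferred, 0, demo_remaining(s));
}
```

A reissued read in this model would start at ``demo_resume_pos()`` and ask for
``demo_remaining()`` bytes, whereas a slice with the clear-tail flag set would
have the same range zeroed instead.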


Read Helper Operations
----------------------

The network filesystem must provide the read helpers with a table of operations
through which it can issue requests and negotiate::

	struct netfs_read_request_ops {
		void (*init_rreq)(struct netfs_read_request *rreq, struct file *file);
		bool (*is_cache_enabled)(struct inode *inode);
		int (*begin_cache_operation)(struct netfs_read_request *rreq);
		void (*expand_readahead)(struct netfs_read_request *rreq);
		bool (*clamp_length)(struct netfs_read_subrequest *subreq);
		void (*issue_op)(struct netfs_read_subrequest *subreq);
		bool (*is_still_valid)(struct netfs_read_request *rreq);
		int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
					 struct folio *folio, void **_fsdata);
		void (*done)(struct netfs_read_request *rreq);
		void (*cleanup)(struct address_space *mapping, void *netfs_priv);
	};

The operations are as follows:

 * ``init_rreq()``

   [Optional] This is called to initialise the request structure. It is given
   the file for reference and can modify the ->netfs_priv value.

 * ``is_cache_enabled()``

   [Required] This is called by netfs_write_begin() to ask if the file is being
   cached. It should return true if it is being cached and false otherwise.

 * ``begin_cache_operation()``

   [Optional] This is called to ask the network filesystem to call into the
   cache (if present) to initialise the caching state for this read. The netfs
   library module cannot access the cache directly, so the network filesystem
   should call something like fscache_begin_read_operation() to do this.

   The cache gets to store its state in ->cache_resources and must set a table
   of operations of its own there (though of a different type).

   This should return 0 on success and an error code otherwise. If an error is
   reported, the operation may proceed anyway, just without local caching (only
   out of memory and interruption errors cause failure here).

 * ``expand_readahead()``

   [Optional] This is called to allow the filesystem to expand the size of a
   readahead read request. The filesystem gets to expand the request in both
   directions, though it's not permitted to reduce it as the numbers may
   represent an allocation already made. If local caching is enabled, it gets
   to expand the request first.

   Expansion is communicated by changing ->start and ->len in the request
   structure. Note that if any change is made, ->len must be increased by at
   least as much as ->start is reduced.

 * ``clamp_length()``

   [Optional] This is called to allow the filesystem to reduce the size of a
   subrequest. The filesystem can use this, for example, to chop up a request
   that has to be split across multiple servers or to put multiple reads in
   flight.

   As the prototype indicates, this returns a bool: true on success and false
   on error.

 * ``issue_op()``

   [Required] The helpers use this to dispatch a subrequest to the server for
   reading. In the subrequest, ->start, ->len and ->transferred indicate what
   data should be read from the server.

   There is no return value; the netfs_subreq_terminated() function should be
   called to indicate whether or not the operation succeeded and how much data
   it transferred. The filesystem also should not deal with setting folios
   uptodate, unlocking them or dropping their refs - the helpers need to deal
   with this as they have to coordinate with copying to the local cache.

   Note that the helpers have the folios locked, but not pinned. It is
   possible to use the ITER_XARRAY iov iterator to refer to the range of the
   inode that is being operated upon without the need to allocate large bvec
   tables.

 * ``is_still_valid()``

   [Optional] This is called to find out if the data just read from the local
   cache is still valid. It should return true if it is still valid and false
   if not. If it's not still valid, it will be reread from the server.

 * ``check_write_begin()``

   [Optional] This is called from the netfs_write_begin() helper once it has
   allocated/grabbed the folio to be modified to allow the filesystem to flush
   conflicting state before allowing it to be modified.

   It should return 0 if everything is now fine, -EAGAIN if the folio should be
   regrabbed and any other error code to abort the operation.

 * ``done()``

   [Optional] This is called after the folios in the request have all been
   unlocked (and marked uptodate if applicable).

 * ``cleanup()``

   [Optional] This is called as the request is being deallocated so that the
   filesystem can clean up ->netfs_priv.
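
As an illustration of two of the operations above, the following userspace
sketch models an expansion that rounds a request outwards to a granule size and
a clamp that limits a slice to a maximum read size, reporting success as a
bool to match the ``clamp_length`` prototype. The types ``struct demo_rreq``
and ``struct demo_clamp_subreq``, both functions and the ``granule`` and
``rsize`` parameters are assumptions made for the example and are not taken
from the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the request and subrequest. */
struct demo_rreq {
	long long start;
	size_t len;
};

struct demo_clamp_subreq {
	long long start;
	size_t len;
};

/*
 * Round the request outwards to multiples of granule.  The start only moves
 * down and the end only moves up, so len grows by at least as much as start
 * is reduced.
 */
static void demo_expand_readahead(struct demo_rreq *rreq, size_t granule)
{
	long long g, end, new_start, new_end;

	assert(granule > 0 && rreq->start >= 0);

	g = (long long)granule;
	end = rreq->start + (long long)rreq->len;
	new_start = rreq->start - (rreq->start % g);
	new_end = ((end + g - 1) / g) * g;

	rreq->start = new_start;
	rreq->len = (size_t)(new_end - new_start);
}

/* Limit a slice to rsize bytes; this model never fails, so return true. */
static bool demo_clamp_length(struct demo_clamp_subreq *subreq, size_t rsize)
{
	assert(rsize > 0);

	if (subreq->len > rsize)
		subreq->len = rsize;
	return true;
}
```

For example, a request for 3000 bytes at position 5000 expanded with a
4096-byte granule becomes a request for 4096 bytes at position 4096.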



Read Helper Procedure
---------------------

The read helpers work by the following general procedure:

 * Set up the request.

 * For readahead, allow the local cache and then the network filesystem to
   propose expansions to the read request. This is then proposed to the VM.
   If the VM cannot fully perform the expansion, a partially expanded read will
   be performed, though this may not get written to the cache in its entirety.

 * Loop around slicing chunks off of the request to form subrequests:

   * If a local cache is present, it gets to do the slicing, otherwise the
     helpers just try to generate maximal slices.

   * The network filesystem gets to clamp the size of each slice if it is to be
     the source. This allows rsize and chunking to be implemented.

   * The helpers issue a read from the cache or a read from the server or just
     clear the slice as appropriate.

   * The next slice begins at the end of the last one.

   * As slices finish being read, they terminate.

 * When all the subrequests have terminated, the subrequests are assessed and
   any that are short or have failed are reissued:

   * Failed cache requests are issued against the server instead.

   * Failed server requests just fail.

   * Short reads against either source will be reissued against that source
     provided they have transferred some more data:

     * The cache may need to skip holes that it can't do DIO from.

     * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
       end of the slice instead of reissuing.

 * Once the data is read, the folios that have been fully read/cleared:

   * Will be marked uptodate.

   * If a cache is present, will be marked with PG_fscache.

   * Will be unlocked.

 * Any folios that need writing to the cache will then have DIO writes issued.

 * Synchronous operations will wait for reading to be complete.

 * Writes to the cache will proceed asynchronously and the folios will have the
   PG_fscache mark removed when that completes.

 * The request structures will be cleaned up when everything has completed.
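
The slicing loop in the procedure above can be pictured with a minimal
userspace model. It only shows the "next slice begins at the end of the last
one" bookkeeping with a fixed maximum slice size; source selection, the cache
and reissuing are all omitted. ``struct demo_piece``, ``demo_slice_request()``
and the ``max_slice`` parameter are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one slice of a request. */
struct demo_piece {
	long long start;
	size_t len;
};

/*
 * Cut the range [start, start + len) into slices of at most max_slice bytes,
 * storing up to out_max of them in out.  Returns the number of slices stored.
 */
static size_t demo_slice_request(long long start, size_t len, size_t max_slice,
				 struct demo_piece *out, size_t out_max)
{
	size_t n = 0;

	assert(max_slice > 0);

	while (len > 0 && n < out_max) {
		size_t take = len < max_slice ? len : max_slice;

		out[n].start = start;
		out[n].len = take;
		n++;

		/* The next slice begins where this one ended. */
		start += (long long)take;
		len -= take;
	}
	return n;
}
```

With a 10000-byte request at position 0 and a 4096-byte maximum, this yields
three slices, the last of which is 1808 bytes long.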


Read Helper Cache API
---------------------

When implementing a local cache to be used by the read helpers, two things are
required: some way for the network filesystem to initialise the caching for a
read request and a table of operations for the helpers to call.

The network filesystem's ->begin_cache_operation() method is called to set up a
cache and this must call into the cache to do the work. If using fscache, for
example, the network filesystem would call::

	int fscache_begin_read_operation(struct netfs_read_request *rreq,
					 struct fscache_cookie *cookie);

passing in the request pointer and the cookie corresponding to the file.

The netfs_read_request object contains a place for the cache to hang its
state::

	struct netfs_cache_resources {
		const struct netfs_cache_ops	*ops;
		void				*cache_priv;
		void				*cache_priv2;
	};

This contains an operations table pointer and two private pointers. The
operation table looks like the following::

	struct netfs_cache_ops {
		void (*end_operation)(struct netfs_cache_resources *cres);

		void (*expand_readahead)(struct netfs_cache_resources *cres,
					 loff_t *_start, size_t *_len, loff_t i_size);

		enum netfs_read_source (*prepare_read)(struct netfs_read_subrequest *subreq,
						       loff_t i_size);

		int (*read)(struct netfs_cache_resources *cres,
			    loff_t start_pos,
			    struct iov_iter *iter,
			    bool seek_data,
			    netfs_io_terminated_t term_func,
			    void *term_func_priv);

		int (*prepare_write)(struct netfs_cache_resources *cres,
				     loff_t *_start, size_t *_len, loff_t i_size,
				     bool no_space_allocated_yet);

		int (*write)(struct netfs_cache_resources *cres,
			     loff_t start_pos,
			     struct iov_iter *iter,
			     netfs_io_terminated_t term_func,
			     void *term_func_priv);

		int (*query_occupancy)(struct netfs_cache_resources *cres,
				       loff_t start, size_t len, size_t granularity,
				       loff_t *_data_start, size_t *_data_len);
	};

With a termination handler function pointer::

	typedef void (*netfs_io_terminated_t)(void *priv,
					      ssize_t transferred_or_error,
					      bool was_async);

The methods defined in the table are:

 * ``end_operation()``

   [Required] Called to clean up the resources at the end of the read request.

 * ``expand_readahead()``

   [Optional] Called at the beginning of a netfs_readahead() operation to allow
   the cache to expand a request in either direction. This allows the cache to
   size the request appropriately for the cache granularity.

   The function is passed pointers to the start and length in its parameters,
   plus the size of the file for reference, and adjusts the start and length
   appropriately. It has no return value.

 * ``prepare_read()``

   [Required] Called to configure the next slice of a request. ->start and
   ->len in the subrequest indicate where and how big the next slice can be;
   the cache gets to reduce the length to match its granularity requirements.

   It should return one of:

   * ``NETFS_FILL_WITH_ZEROES``
   * ``NETFS_DOWNLOAD_FROM_SERVER``
   * ``NETFS_READ_FROM_CACHE``
   * ``NETFS_INVALID_READ``

   to indicate whether the slice should just be cleared or whether it should be
   downloaded from the server or read from the cache - or whether slicing
   should be given up at the current point.

 * ``read()``

   [Required] Called to read from the cache. The start file offset is given
   along with an iterator to read to, which gives the length also. It can be
   given a hint requesting that it seek forward from that start position for
   data.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function. The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``prepare_write()``

   [Required] Called to prepare a write to the cache to take place. This
   involves checking to see whether the cache has sufficient space to honour
   the write. ``*_start`` and ``*_len`` indicate the region to be written; the
   region can be shrunk or it can be expanded to a page boundary either way as
   necessary to align for direct I/O. i_size holds the size of the object and
   is provided for reference. no_space_allocated_yet is set to true if the
   caller is certain that no data has been written to that region - for example
   if it tried to do a read from there already.

 * ``write()``

   [Required] Called to write to the cache. The start file offset is given
   along with an iterator to write from, which gives the length also.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function. The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``query_occupancy()``

   [Required] Called to find out where the next piece of data is within a
   particular region of the cache. The start and length of the region to be
   queried are passed in, along with the granularity to which the answer needs
   to be aligned. The function passes back the start and length of the data,
   if any, available within that region. Note that there may be a hole at the
   front.

   It returns 0 if some data was found, -ENODATA if there was no usable data
   within the region or -ENOBUFS if there is no caching on this file.

Note that these methods are passed a pointer to the cache resource structure,
not the read request structure as they could be used in other situations where
there isn't a read request structure as well, such as writing dirty data to the
cache.
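
The contract described for ``query_occupancy()`` can be modelled in userspace
over an array of per-granule presence flags. This is a sketch under stated
assumptions: the ``present`` array, the ``ngranules`` count and the function
``demo_query_occupancy()`` are hypothetical, the start and length are assumed
to be multiples of the granularity, and the -ENOBUFS case (no caching on the
file) is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical cache model: present[i] says whether granule i holds data.
 * Finds the first run of present granules inside [start, start + len) and
 * passes back its position and length.  Returns 0 if data was found and
 * -ENODATA if the region holds no data at all.
 */
static int demo_query_occupancy(const bool *present, size_t ngranules,
				long long start, size_t len, size_t granularity,
				long long *_data_start, size_t *_data_len)
{
	size_t first, last, i, j;

	assert(granularity > 0 && start >= 0);

	first = (size_t)start / granularity;
	last = first + len / granularity;
	if (last > ngranules)
		last = ngranules;

	/* Skip any hole at the front of the region. */
	for (i = first; i < last && !present[i]; i++)
		;
	if (i >= last)
		return -ENODATA;

	/* Measure the run of data that follows the hole. */
	for (j = i; j < last && present[j]; j++)
		;

	*_data_start = (long long)(i * granularity);
	*_data_len = (j - i) * granularity;
	return 0;
}
```

In this model, a region whose first two granules are empty and whose next two
hold data reports the data as starting two granules in, which corresponds to
the "hole at the front" case mentioned above.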


API Function Reference
======================

.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/read_helper.c