Network Filesystem Services Library
Overview
The network filesystem services library, netfslib, is a set of functions designed to aid a network filesystem in implementing VM/VFS API operations. It takes over the normal buffered read, readahead, write and writeback and also handles unbuffered and direct I/O.
The library provides support for (re-)negotiation of I/O sizes and retrying failed I/O as well as local caching and will, in the future, provide content encryption.
It insulates the filesystem from VM interface changes as much as possible and handles VM features such as large multipage folios. The filesystem basically just has to provide a way to perform read and write RPC calls.
The way I/O is organised inside netfslib consists of a number of objects:
A request. A request is used to track the progress of the I/O overall and to hold on to resources. The collection of results is done at the request level. The I/O within a request is divided into a number of parallel streams of subrequests.
A stream. A non-overlapping series of subrequests. The subrequests within a stream do not have to be contiguous.
A subrequest. This is the basic unit of I/O. It represents a single RPC call or a single cache I/O operation. The library passes these to the filesystem and the cache to perform.
Requests and Streams
When actually performing I/O (as opposed to just copying into the pagecache), netfslib will create one or more requests to track the progress of the I/O and to hold resources.
A read operation will have a single stream and the subrequests within that stream may be of mixed origins, for instance mixing RPC subrequests and cache subrequests.
On the other hand, a write operation may have multiple streams, where each stream targets a different destination. For instance, there may be one stream writing to the local cache and one to the server. Currently, only two streams are allowed, but this could be increased if parallel writes to multiple servers are desired.
The subrequests within a write stream do not need to match alignment or size with the subrequests in another write stream and netfslib performs the tiling of subrequests in each stream over the source buffer independently. Further, each stream may contain holes that don’t correspond to holes in the other stream.
In addition, the subrequests do not need to correspond to the boundaries of the folios or vectors in the source/destination buffer. The library handles the collection of results and the wrangling of folio flags and references.
Subrequests
Subrequests are at the heart of the interaction between netfslib and the filesystem using it. Each subrequest is expected to correspond to a single read or write RPC or cache operation. The library will stitch together the results from a set of subrequests to provide a higher level operation.
Netfslib has two interactions with the filesystem or the cache when setting up a subrequest. First, there’s an optional preparatory step that allows the filesystem to negotiate the limits on the subrequest, both in terms of maximum number of bytes and maximum number of vectors (e.g. for RDMA). This may involve negotiating with the server (e.g. cifs needing to acquire credits).
And, secondly, there’s the issuing step in which the subrequest is handed off to the filesystem to perform.
Note that these two steps are done slightly differently between read and write:
For reads, the VM/VFS tells us how much is being requested up front, so the library can preset maximum values that first the cache and then the filesystem can reduce. The cache also gets consulted first on whether it wants to do a read before the filesystem is consulted.
For writeback, it is unknown how much there will be to write until the pagecache is walked, so no limit is set by the library.
Once a subrequest is completed, the filesystem or cache informs the library of the completion and then collection is invoked. Depending on whether the request is synchronous or asynchronous, the collection of results will be done in either the application thread or in a work queue.
Result Collection and Retry
As subrequests complete, the results are collected and collated by the library and folio unlocking is performed progressively (if appropriate). Once the request is complete, async completion will be invoked (again, if appropriate). It is possible for the filesystem to provide interim progress reports to the library to cause folio unlocking to happen earlier if possible.
If any subrequests fail, netfslib can retry them. It will wait until all subrequests are completed, offer the filesystem the opportunity to fiddle with the resources/state held by the request and poke at the subrequests before re-preparing and re-issuing the subrequests.
This allows the tiling of a contiguous set of failed subrequests within a stream to be changed, adding more subrequests or ditching excess as necessary (for instance, if the network sizes change or the server decides it wants smaller chunks).
Further, if one or more contiguous cache-read subrequests fail, the library will pass them to the filesystem to perform instead, renegotiating and retiling them as necessary to fit with the filesystem’s parameters rather than those of the cache.
Local Caching
One of the services netfslib provides, via fscache, is the option to cache on local disk a copy of the data obtained from/written to a network filesystem. The library will manage the storing, retrieval and some invalidation of data automatically on behalf of the filesystem if a cookie is attached to the netfs_inode.
Note that local caching used to use the PG_private_2 flag (aliased as PG_fscache) to keep track of a page that was being written to the cache, but this is now deprecated as PG_private_2 will be removed.
Instead, folios that are read from the server for which there was no data in the cache will be marked as dirty and will have folio->private set to a special value (NETFS_FOLIO_COPY_TO_CACHE) and left for writeback to write. If the folio is modified before that happens, the special value will be cleared and the folio will become normally dirty.
When writeback occurs, folios that are so marked will only be written to the cache and not to the server. Writeback handles mixed cache-only writes and server-and-cache writes by using two streams, sending one to the cache and one to the server. The server stream will have gaps in it corresponding to those folios.
Content Encryption (fscrypt)
Though it does not do so yet, at some point netfslib will acquire the ability to do client-side content encryption on behalf of the network filesystem (Ceph, for example). fscrypt can be used for this if appropriate (it may not be appropriate for cifs, for example).
The data will be stored encrypted in the local cache using the same manner of encryption as the data written to the server and the library will impose bounce buffering and RMW cycles as necessary.
Per-Inode Context
The network filesystem helper library needs a place to store a bit of state for its use on each netfs inode it is helping to manage. To this end, a context structure is defined:
struct netfs_inode {
struct inode inode;
const struct netfs_request_ops *ops;
struct fscache_cookie * cache;
loff_t remote_i_size;
unsigned long flags;
...
};
A network filesystem that wants to use netfslib must place one of these in its inode wrapper struct instead of the VFS struct inode. This can be done in a way similar to the following:
struct my_inode {
struct netfs_inode netfs; /* Netfslib context and vfs inode */
...
};
This allows netfslib to find its state by using container_of() from the inode pointer, thereby allowing the netfslib helper functions to be pointed to directly by the VFS/VM operation tables.
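For example, the filesystem's own inode accessor might look like this (a minimal sketch, assuming the my_inode layout above; MY_I() is a hypothetical helper, not part of the netfslib API):

static inline struct my_inode *MY_I(struct inode *inode)
{
        /* The VFS inode is embedded in netfs_inode, which is embedded
         * in my_inode, so container_of() recovers the wrapper. */
        return container_of(inode, struct my_inode, netfs.inode);
}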
The structure contains the following fields that are of interest to the filesystem:
inode
The VFS inode structure.
ops
The set of operations provided by the network filesystem to netfslib.
cache
Local caching cookie, or NULL if no caching is enabled. This field does not exist if fscache is disabled.
remote_i_size
The size of the file on the server. This differs from inode->i_size if local modifications have been made but not yet written back.
flags
A set of flags, some of which the filesystem might be interested in:
NETFS_ICTX_MODIFIED_ATTR
Set if netfslib modifies mtime/ctime. The filesystem is free to ignore this or clear it.
NETFS_ICTX_UNBUFFERED
Do unbuffered I/O upon the file. Like direct I/O but without the alignment limitations. RMW will be performed if necessary. The pagecache will not be used unless mmap() is also used.
NETFS_ICTX_WRITETHROUGH
Do writethrough caching upon the file. I/O will be set up and dispatched as buffered writes are made to the page cache. mmap() does the normal writeback thing.
NETFS_ICTX_SINGLE_NO_UPLOAD
Set if the file has a monolithic content that must be read entirely in a single go and must not be written back to the server, though it can be cached (e.g. AFS directories).
Inode Context Helper Functions
To help deal with the per-inode context, a number of helper functions are provided. Firstly, a function to perform basic initialisation on a context and set the operations table pointer:
void netfs_inode_init(struct netfs_inode *ctx,
                      const struct netfs_request_ops *ops,
                      bool use_zero_point);
then a function to cast from the VFS inode structure to the netfs context:
struct netfs_inode *netfs_inode(struct inode *inode);
and finally, a function to get the cache cookie pointer from the context attached to an inode (or NULL if fscache is disabled):
struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);
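As an illustration, these might be used from the filesystem's inode set-up path like so (a sketch, assuming the hypothetical MY_I() helper above and a hypothetical my_req_ops table, sketched under Filesystem Methods below; use_zero_point is simply passed as false here):

static void my_set_netfs_context(struct inode *inode)
{
        /* Attach the operations table; don't use the zero_point
         * read optimisation in this sketch. */
        netfs_inode_init(&MY_I(inode)->netfs, &my_req_ops, false);
}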
Inode Locking
A number of functions are provided to manage the locking of i_rwsem for I/O and to effectively extend it to provide more separate classes of exclusion:
int netfs_start_io_read(struct inode *inode);
void netfs_end_io_read(struct inode *inode);
int netfs_start_io_write(struct inode *inode);
void netfs_end_io_write(struct inode *inode);
int netfs_start_io_direct(struct inode *inode);
void netfs_end_io_direct(struct inode *inode);
The exclusion breaks down into four separate classes:
Buffered reads and writes.
Buffered reads can run concurrently with each other and with buffered writes, but buffered writes cannot run concurrently with each other.
Direct reads and writes.
Direct (and unbuffered) reads and writes can run concurrently since they do not share local buffering (i.e. the pagecache) and, in a network filesystem, are expected to have exclusion managed on the server (though this may not be the case for, say, Ceph).
Other major inode modifying operations (e.g. truncate, fallocate).
These should just access i_rwsem directly.
mmap().
mmap’d accesses might operate concurrently with any of the other classes. They might form the buffer for an intra-file loopback DIO read/write. They might be permitted on unbuffered files.
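As an illustration, a filesystem doing extra work under the buffered-read class might wrap its I/O like this (a sketch; my_do_buffered_read() is hypothetical):

static ssize_t my_read(struct kiocb *iocb, struct iov_iter *iter)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        ret = netfs_start_io_read(inode);       /* shared-mode exclusion */
        if (ret < 0)
                return ret;
        ret = my_do_buffered_read(iocb, iter);  /* hypothetical */
        netfs_end_io_read(inode);
        return ret;
}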
Inode Writeback
Netfslib will pin resources on an inode for future writeback (such as pinning use of an fscache cookie) when an inode is dirtied. However, this pinning needs careful management. To manage the pinning, the following sequence occurs:
1. An inode state flag I_PINNING_NETFS_WB is set by netfslib when the pinning begins (when a folio is dirtied, for example) if the cache is active, to stop the cache structures from being discarded and the cache space from being culled. This also prevents re-getting of cache resources if the flag is already set.
2. This flag is then cleared inside the inode lock during inode writeback in the VM - and the fact that it was set is transferred to ->unpinned_netfs_wb in struct writeback_control.
3. If ->unpinned_netfs_wb is now set, the write_inode procedure is forced.
4. The filesystem's ->write_inode() function is invoked to do the cleanup.
5. The filesystem invokes netfs to do its cleanup.
To do the cleanup, netfslib provides a function to do the resource unpinning:
int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);
If the filesystem doesn’t need to do anything else, this may be set as its .write_inode method.
Further, if an inode is deleted, the filesystem’s write_inode method may not get called, so:
void netfs_clear_inode_writeback(struct inode *inode, const void *aux);
must be called from ->evict_inode() before clear_inode() is called.
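Putting these together, a filesystem with no extra writeback state of its own might use something like the following (a sketch; the my_* names are hypothetical and NULL is passed as the coherency data here):

static void my_evict_inode(struct inode *inode)
{
        truncate_inode_pages_final(&inode->i_data);
        netfs_clear_inode_writeback(inode, NULL);       /* drop netfslib's pins */
        clear_inode(inode);
}

static const struct super_operations my_super_ops = {
        .write_inode    = netfs_unpin_writeback,
        .evict_inode    = my_evict_inode,
};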
High-Level VFS API
Netfslib provides a number of sets of API calls for the filesystem to delegate VFS operations to. Netfslib, in turn, will call out to the filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene at various times.
Unlocked Read/Write Iter
The first API set is for the delegation of operations to netfslib when the filesystem is called through the standard VFS read/write_iter methods:
ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);
They can be assigned directly to .read_iter and .write_iter. They perform the inode locking themselves and the first two will switch between buffered I/O and DIO as appropriate.
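For example (a sketch; my_file_ops is hypothetical and the generic VFS helpers are used for the other methods):

static const struct file_operations my_file_ops = {
        .llseek         = generic_file_llseek,
        .read_iter      = netfs_file_read_iter,
        .write_iter     = netfs_file_write_iter,
        .mmap           = generic_file_mmap,    /* a custom ->mmap may be needed
                                                 * to install ->page_mkwrite */
};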
Pre-Locked Read/Write Iter
The second API set is for the delegation of operations to netfslib when the filesystem is called through the standard VFS methods, but needs to do some other stuff before or after calling netfslib whilst still inside the locked section (e.g. Ceph negotiating caps). The unbuffered read function is:
ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter);
This must not be assigned directly to .read_iter and the filesystem is responsible for performing the inode locking before calling it. In the case of a buffered read, the filesystem should use filemap_read().
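As an illustration, a Ceph-like filesystem might bracket the unbuffered read like this (a sketch; my_get_caps() and my_put_caps() are hypothetical pre/post-I/O steps):

static ssize_t my_dio_read(struct kiocb *iocb, struct iov_iter *iter)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        ret = netfs_start_io_direct(inode);
        if (ret < 0)
                return ret;
        ret = my_get_caps(inode);               /* hypothetical */
        if (ret == 0) {
                ret = netfs_unbuffered_read_iter_locked(iocb, iter);
                my_put_caps(inode);             /* hypothetical */
        }
        netfs_end_io_direct(inode);
        return ret;
}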
There are three functions for writes:
ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
struct netfs_group *netfs_group);
ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
struct netfs_group *netfs_group);
ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
struct netfs_group *netfs_group);
These must not be assigned directly to .write_iter and the filesystem is responsible for performing the inode locking before calling them.
The first two functions are for buffered writes; the first just adds some standard write checks and jumps to the second, but if the filesystem wants to do the checks itself, it can use the second directly. The third function is for unbuffered or DIO writes.
On all three write functions, there is a writeback group pointer (which should be NULL if the filesystem doesn’t use this). Writeback groups are set on folios when they’re modified. If a folio to-be-modified is already marked with a different group, it is flushed first. The writeback API allows writing back of a specific group.
Memory-Mapped I/O API
An API for support of mmap()’d I/O is provided:
vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
This allows the filesystem to delegate .page_mkwrite to netfslib. The filesystem should not take the inode lock before calling it, but, as with the locked write functions above, this does take a writeback group pointer. If the page to be made writable is in a different group, it will be flushed first.
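For example (a sketch; my_vm_ops is hypothetical and no writeback group is used):

static vm_fault_t my_page_mkwrite(struct vm_fault *vmf)
{
        return netfs_page_mkwrite(vmf, NULL);   /* NULL: no writeback group */
}

static const struct vm_operations_struct my_vm_ops = {
        .fault          = filemap_fault,
        .map_pages      = filemap_map_pages,
        .page_mkwrite   = my_page_mkwrite,
};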
Monolithic Files API
There is also a special API set for files for which the content must be read in a single RPC (and not written back) and is maintained as a monolithic blob (e.g. an AFS directory), though it can be stored and updated in the local cache:
ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
void netfs_single_mark_inode_dirty(struct inode *inode);
int netfs_writeback_single(struct address_space *mapping,
struct writeback_control *wbc,
struct iov_iter *iter);
The first function reads from a file into the given buffer, reading from the cache in preference if the data is cached there; the second function allows the inode to be marked dirty, causing a later writeback; and the third function can be called from the writeback code to write the data to the cache, if there is one.
The inode should be marked NETFS_ICTX_SINGLE_NO_UPLOAD if this API is to be used. The writeback function requires the buffer to be of ITER_FOLIOQ type.
High-Level VM API
Netfslib also provides a number of sets of API calls for the filesystem to delegate VM operations to. Again, netfslib, in turn, will call out to the filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene at various times:
void netfs_readahead(struct readahead_control *);
int netfs_read_folio(struct file *, struct folio *);
int netfs_writepages(struct address_space *mapping,
struct writeback_control *wbc);
bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
bool netfs_release_folio(struct folio *folio, gfp_t gfp);
These are address_space_operations methods and can be set directly in the operations table.
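For example (a sketch; my_aops is hypothetical):

static const struct address_space_operations my_aops = {
        .read_folio     = netfs_read_folio,
        .readahead      = netfs_readahead,
        .dirty_folio    = netfs_dirty_folio,
        .writepages     = netfs_writepages,
        .invalidate_folio = netfs_invalidate_folio,
        .release_folio  = netfs_release_folio,
        .migrate_folio  = filemap_migrate_folio,
};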
Deprecated PG_private_2 API
There is also a deprecated function for filesystems that still use the ->write_begin method:
int netfs_write_begin(struct netfs_inode *inode, struct file *file,
struct address_space *mapping, loff_t pos, unsigned int len,
struct folio **_folio, void **_fsdata);
It uses the deprecated PG_private_2 flag and so should not be used.
I/O Request API
The I/O request API comprises a number of structures and a number of functions that the filesystem may need to use.
Request Structure
The request structure manages the request as a whole, holding some resources and state on behalf of the filesystem and tracking the collection of results:
struct netfs_io_request {
enum netfs_io_origin origin;
struct inode *inode;
struct address_space *mapping;
struct netfs_group *group;
struct netfs_io_stream io_streams[];
void *netfs_priv;
void *netfs_priv2;
unsigned long long start;
unsigned long long len;
unsigned long long i_size;
unsigned int debug_id;
unsigned long flags;
...
};
Many of the fields are for internal use, but the fields shown here are of interest to the filesystem:
origin
The origin of the request (readahead, read_folio, DIO read, writeback, ...).
inode
mapping
The inode and the address space of the file being accessed. The mapping may or may not point to inode->i_data.
group
The writeback group this request is dealing with or NULL. This holds a ref on the group.
io_streams
The parallel streams of subrequests available to the request. Currently two are available, but this may be made extensible in future. NR_IO_STREAMS indicates the size of the array.
netfs_priv
netfs_priv2
The network filesystem’s private data. The value for this can be passed in to the helper functions or set during the request.
start
len
The file position of the start of the read request and the length. These may be altered by the ->expand_readahead() op.
i_size
The size of the file at the start of the request.
debug_id
A number allocated to this operation that can be displayed in trace lines for reference.
flags
Flags for managing and controlling the operation of the request. Some of these may be of interest to the filesystem:
NETFS_RREQ_RETRYING
Netfslib sets this when generating retries.
NETFS_RREQ_PAUSE
The filesystem can set this to request to pause the library’s subrequest issuing loop - but care needs to be taken as netfslib may also set it.
NETFS_RREQ_NONBLOCK
NETFS_RREQ_BLOCKED
Netfslib sets the first to indicate that non-blocking mode was set by the caller and the filesystem can set the second to indicate that it would have had to block.
NETFS_RREQ_USE_PGPRIV2
The filesystem can set this if it wants to use PG_private_2 to track whether a folio is being written to the cache. This is deprecated as PG_private_2 is going to go away.
If the filesystem wants more private data than is afforded by this structure, then it should wrap it and provide its own allocator.
Stream Structure
A request is comprised of one or more parallel streams and each stream may be aimed at a different target.
For read requests, only stream 0 is used. This can contain a mixture of subrequests aimed at different sources. For write requests, stream 0 is used for the server and stream 1 is used for the cache. For buffered writeback, stream 0 is not enabled unless a normal dirty folio is encountered, at which point ->begin_writeback() will be invoked and the filesystem can mark the stream available.
The stream struct looks like:
struct netfs_io_stream {
unsigned char stream_nr;
bool avail;
size_t sreq_max_len;
unsigned int sreq_max_segs;
unsigned int submit_extendable_to;
...
};
A number of members are available for access/use by the filesystem:
stream_nr
The number of the stream within the request.
avail
True if the stream is available for use. The filesystem should set this on stream zero if in ->begin_writeback().
sreq_max_len
sreq_max_segs
These are set by the filesystem or the cache in ->prepare_read() or ->prepare_write() for each subrequest to indicate the maximum number of bytes and, optionally, the maximum number of segments (if not 0) that that subrequest can support.
submit_extendable_to
The size that a subrequest can be rounded up to beyond the EOF, given the available buffer. This allows the cache to work out if it can do a DIO read or write that straddles the EOF marker.
Subrequest Structure
Individual units of I/O are managed by the subrequest structure. These represent slices of the overall request and run independently:
struct netfs_io_subrequest {
struct netfs_io_request *rreq;
struct iov_iter io_iter;
unsigned long long start;
size_t len;
size_t transferred;
unsigned long flags;
short error;
unsigned short debug_index;
unsigned char stream_nr;
...
};
Each subrequest is expected to access a single source, though the library will handle falling back from one source type to another. The members are:
rreq
A pointer to the read request.
io_iter
An I/O iterator representing a slice of the buffer to be read into or written from.
start
len
The file position of the start of this slice of the read request and the length.
transferred
The amount of data transferred so far for this subrequest. This should be added to with the length of the transfer made by this issuance of the subrequest. If this is less than len then the subrequest may be reissued to continue.
flags
Flags for managing the subrequest. A number of these are of interest to the filesystem or cache:
NETFS_SREQ_MADE_PROGRESS
Set by the filesystem to indicate that at least one byte of data was read or written.
NETFS_SREQ_HIT_EOF
The filesystem should set this if a read hit the EOF on the file (in which case transferred should stop at the EOF). Netfslib may expand the subrequest out to the size of the folio containing the EOF on the off chance that a third party change happened or a DIO read may have asked for more than is available. The library will clear any excess pagecache.
NETFS_SREQ_CLEAR_TAIL
The filesystem can set this to indicate that the remainder of the slice, from transferred to len, should be cleared. Do not set if HIT_EOF is set.
NETFS_SREQ_NEED_RETRY
The filesystem can set this to tell netfslib to retry the subrequest.
NETFS_SREQ_BOUNDARY
This can be set by the filesystem on a subrequest to indicate that it ends at a boundary with the filesystem structure (e.g. at the end of a Ceph object). It tells netfslib not to retile subrequests across it.
NETFS_SREQ_SEEK_DATA_READ
This is a hint from netfslib to the cache that it might want to try skipping ahead to the next data (ie. using SEEK_DATA).
error
This is for the filesystem to store the result of the subrequest. It should be set to 0 if successful and a negative error code otherwise.
debug_index
stream_nr
A number allocated to this slice that can be displayed in trace lines for reference and the number of the request stream that it belongs to.
If necessary, the filesystem can get and put extra refs on the subrequest it is given:
void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
enum netfs_sreq_ref_trace what);
void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
enum netfs_sreq_ref_trace what);
using netfs trace codes to indicate the reason. Care must be taken, however, as once control of the subrequest is returned to netfslib, the same subrequest can be reissued/retried.
Filesystem Methods
The filesystem sets a table of operations in netfs_inode for netfslib to use:
struct netfs_request_ops {
mempool_t *request_pool;
mempool_t *subrequest_pool;
int (*init_request)(struct netfs_io_request *rreq, struct file *file);
void (*free_request)(struct netfs_io_request *rreq);
void (*free_subrequest)(struct netfs_io_subrequest *rreq);
void (*expand_readahead)(struct netfs_io_request *rreq);
int (*prepare_read)(struct netfs_io_subrequest *subreq);
void (*issue_read)(struct netfs_io_subrequest *subreq);
void (*done)(struct netfs_io_request *rreq);
void (*update_i_size)(struct inode *inode, loff_t i_size);
void (*post_modify)(struct inode *inode);
void (*begin_writeback)(struct netfs_io_request *wreq);
void (*prepare_write)(struct netfs_io_subrequest *subreq);
void (*issue_write)(struct netfs_io_subrequest *subreq);
void (*retry_request)(struct netfs_io_request *wreq,
struct netfs_io_stream *stream);
void (*invalidate_cache)(struct netfs_io_request *wreq);
};
The table starts with a pair of optional pointers to memory pools from which requests and subrequests can be allocated. If these are not given, netfslib has default pools that it will use instead. If the filesystem wraps the netfs structs in its own larger structs, then it will need to use its own pools. Netfslib will allocate directly from the pools.
The methods defined in the table are:
init_request()
free_request()
free_subrequest()
[Optional] A filesystem may implement these to initialise or clean up any resources that it attaches to the request or subrequest.
expand_readahead()
[Optional] This is called to allow the filesystem to expand the size of a readahead request. The filesystem gets to expand the request in both directions, though it must retain the initial region as that may represent an allocation already made. If local caching is enabled, it gets to expand the request first.
Expansion is communicated by changing ->start and ->len in the request structure. Note that if any change is made, ->len must be increased by at least as much as ->start is reduced.
prepare_read()
[Optional] This is called to allow the filesystem to limit the size of a subrequest. It may also limit the number of individual regions in the iterator, such as is required by RDMA. This information should be set on stream zero in:
rreq->io_streams[0].sreq_max_len
rreq->io_streams[0].sreq_max_segs
The filesystem can use this, for example, to chop up a request that has to be split across multiple servers or to put multiple reads in flight.
Zero should be returned on success and an error code otherwise.
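A minimal implementation might look like this (a sketch; MY_MAX_RPC_SIZE is a hypothetical per-RPC limit, not a netfslib constant):

#define MY_MAX_RPC_SIZE (256 * 1024)    /* hypothetical */

static int my_prepare_read(struct netfs_io_subrequest *subreq)
{
        struct netfs_io_request *rreq = subreq->rreq;

        rreq->io_streams[0].sreq_max_len  = MY_MAX_RPC_SIZE;
        rreq->io_streams[0].sreq_max_segs = 0;  /* 0 = no segment limit */
        return 0;
}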
issue_read()
[Required] Netfslib calls this to dispatch a subrequest to the server for reading. In the subrequest, ->start, ->len and ->transferred indicate what data should be read from the server and ->io_iter indicates the buffer to be used.
There is no return value; the netfs_read_subreq_terminated() function should be called to indicate that the subrequest completed either way. ->error, ->transferred and ->flags should be updated before completing. The termination can be done asynchronously.
Note: the filesystem must not deal with setting folios uptodate, unlocking them or dropping their refs - the library deals with this as it may have to stitch together the results of multiple subrequests that variously overlap the set of folios.
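A synchronous implementation might look something like this (a sketch; my_rpc_read() is a hypothetical helper that fills the iterator and returns a byte count, zero at EOF or a negative error):

static void my_issue_read(struct netfs_io_subrequest *subreq)
{
        ssize_t n;

        /* Read the next part of the slice described by ->io_iter. */
        n = my_rpc_read(subreq->rreq->inode,
                        subreq->start + subreq->transferred,
                        &subreq->io_iter);
        if (n > 0) {
                subreq->transferred += n;
                __set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
        } else if (n < 0) {
                subreq->error = n;
        } else {
                __set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
        }
        netfs_read_subreq_terminated(subreq);
}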
done()
[Optional] This is called after the folios in a read request have all been unlocked (and marked uptodate if applicable).
update_i_size()
[Optional] This is invoked by netfslib at various points during the write paths to ask the filesystem to update its idea of the file size. If not given, netfslib will set i_size and i_blocks and update the local cache cookie.
post_modify()
[Optional] This is called after netfslib writes to the pagecache or when it allows an mmap’d page to be marked as writable.
begin_writeback()
[Optional] Netfslib calls this when processing a writeback request if it finds a dirty page that isn’t simply marked NETFS_FOLIO_COPY_TO_CACHE, indicating it must be written to the server. This allows the filesystem to only set up writeback resources when it knows it’s going to have to perform a write.
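For example (a sketch):

static void my_begin_writeback(struct netfs_io_request *wreq)
{
        /* Stream 0 (to the server) is only needed once genuinely
         * dirty data, rather than copy-to-cache data, is found. */
        wreq->io_streams[0].avail = true;
}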
prepare_write()
[Optional] This is called to allow the filesystem to limit the size of a subrequest. It may also limit the number of individual regions in the iterator, such as is required by RDMA. This information should be set on the stream to which the subrequest belongs:
rreq->io_streams[subreq->stream_nr].sreq_max_len
rreq->io_streams[subreq->stream_nr].sreq_max_segs
The filesystem can use this, for example, to chop up a request that has to be split across multiple servers or to put multiple writes in flight.
This is not permitted to return an error. Instead, in the event of failure, netfs_prepare_write_failed() must be called.
issue_write()
[Required] This is used to dispatch a subrequest to the server for writing. In the subrequest, ->start, ->len and ->transferred indicate what data should be written to the server and ->io_iter indicates the buffer to be used.
There is no return value; the netfs_write_subrequest_terminated() function should be called to indicate that the subrequest completed either way. ->error, ->transferred and ->flags should be updated before completing. The termination can be done asynchronously.
Note: the filesystem must not deal with removing the dirty or writeback marks on folios involved in the operation and should not take refs or pins on them, but should leave retention to netfslib.
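A synchronous implementation might look like this (a sketch; my_rpc_write() is a hypothetical helper returning the byte count written or a negative error):

static void my_issue_write(struct netfs_io_subrequest *subreq)
{
        ssize_t n;

        /* Write the slice of the buffer described by ->io_iter. */
        n = my_rpc_write(subreq->rreq->inode,
                         subreq->start + subreq->transferred,
                         &subreq->io_iter);
        if (n > 0)
                __set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
        netfs_write_subrequest_terminated(subreq, n);
}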
retry_request()
[Optional] Netfslib calls this at the beginning of a retry cycle. This allows the filesystem to examine the state of the request, the subrequests in the indicated stream and of its own data and make adjustments or renegotiate resources.
invalidate_cache()
[Optional] This is called by netfslib to invalidate data stored in the local cache in the event that writing to the local cache fails, providing updated coherency data that netfs can’t provide.
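Putting the pieces together, a minimal table for a read-write filesystem might look like this (a sketch, using the hypothetical my_* functions from the examples above):

static const struct netfs_request_ops my_req_ops = {
        .prepare_read           = my_prepare_read,
        .issue_read             = my_issue_read,
        .begin_writeback        = my_begin_writeback,
        .issue_write            = my_issue_write,
};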
Terminating a subrequest
When a subrequest completes, there are a number of functions that the filesystem or the cache can call to inform netfslib of the status change. One function is provided to terminate a write subrequest at the preparation stage and acts synchronously:
void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);
Indicate that the ->prepare_write() call failed. The error field should have been updated.
Note that ->prepare_read() can return an error as a read can simply be aborted. Dealing with writeback failure is trickier.
The other functions are used for subrequests that got as far as being issued:
void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);
Tell netfslib that a read subrequest has terminated. The error, flags and transferred fields should have been updated.
void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);
Tell netfslib that a write subrequest has terminated. Either the amount of data processed or the negative error code can be passed in. This can be used as a kiocb completion function.
void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);
This is provided to optionally update netfslib on the incremental progress of a read, allowing some folios to be unlocked early, and does not actually terminate the subrequest. The transferred field should have been updated.
Local Cache API
Netfslib provides a separate API for a local cache to implement, though it is somewhat similar to the filesystem request API.
Firstly, the netfs_io_request object contains a place for the cache to hang its state:
struct netfs_cache_resources {
const struct netfs_cache_ops *ops;
void *cache_priv;
void *cache_priv2;
unsigned int debug_id;
unsigned int inval_counter;
};
This contains an operations table pointer and two private pointers plus the debug ID of the fscache cookie for tracing purposes and an invalidation counter that is cranked by calls to fscache_invalidate(), allowing cache subrequests to be invalidated after completion.
The cache operation table looks like the following:
struct netfs_cache_ops {
void (*end_operation)(struct netfs_cache_resources *cres);
void (*expand_readahead)(struct netfs_cache_resources *cres,
loff_t *_start, size_t *_len, loff_t i_size);
enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
loff_t i_size);
int (*read)(struct netfs_cache_resources *cres,
loff_t start_pos,
struct iov_iter *iter,
bool seek_data,
netfs_io_terminated_t term_func,
void *term_func_priv);
void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
void (*issue_write)(struct netfs_io_subrequest *subreq);
};
With a termination handler function pointer:
typedef void (*netfs_io_terminated_t)(void *priv,
ssize_t transferred_or_error,
bool was_async);
The methods defined in the table are:
end_operation()
[Required] Called to clean up the resources at the end of the read request.
expand_readahead()
[Optional] Called at the beginning of a readahead operation to allow the cache to expand a request in either direction. This allows the cache to size the request appropriately for the cache granularity.
prepare_read()
[Required] Called to configure the next slice of a request. ->start and ->len in the subrequest indicate where and how big the next slice can be; the cache gets to reduce the length to match its granularity requirements.
The function is also passed the size of the file for reference, and should adjust the subrequest's start and length appropriately. It should return one of:
NETFS_FILL_WITH_ZEROES
NETFS_DOWNLOAD_FROM_SERVER
NETFS_READ_FROM_CACHE
NETFS_INVALID_READ
to indicate whether the slice should just be cleared or whether it should be downloaded from the server or read from the cache - or whether slicing should be given up at the current point.
read()
[Required] Called to read from the cache. The start file offset is given along with an iterator to read to, which gives the length also. It can be given a hint requesting that it seek forward from that start position for data.
Also provided is a pointer to a termination handler function and private data to pass to that function. The termination function should be called with the number of bytes transferred or an error code, plus a flag indicating whether the termination is definitely happening in the caller’s context.
prepare_write_subreq()
[Required] This is called to allow the cache to limit the size of a subrequest. It may also limit the number of individual regions in the iterator, such as is required by DIO/DMA. This information should be set on the stream to which the subrequest belongs:
rreq->io_streams[subreq->stream_nr].sreq_max_len
rreq->io_streams[subreq->stream_nr].sreq_max_segs
The cache can use this, for example, to chop up a request to match its own granularity or to put multiple writes in flight.
This is not permitted to return an error. In the event of failure, netfs_prepare_write_failed() must be called.
issue_write()
[Required] This is used to dispatch a subrequest to the cache for writing. In the subrequest, ->start, ->len and ->transferred indicate what data should be written to the cache and ->io_iter indicates the buffer to be used.
There is no return value; the netfs_write_subrequest_terminated() function should be called to indicate that the subrequest completed either way. ->error, ->transferred and ->flags should be updated before completing. The termination can be done asynchronously.
API Function Reference
void folio_start_private_2(struct folio *folio)
Start an fscache write on a folio [DEPRECATED]
Parameters
struct folio *folio
The folio.
Description
Call this function before writing a folio to a local cache. Starting a second write before the first one finishes is not allowed.
Note that this should no longer be used.
struct netfs_inode *netfs_inode(struct inode *inode)
Get the netfs inode context from the inode
Parameters
struct inode *inode
The inode to query
Description
Get the netfs lib inode context from the network filesystem’s inode. The context struct is expected to directly follow on from the VFS inode struct.
void netfs_inode_init(struct netfs_inode *ctx, const struct netfs_request_ops *ops, bool use_zero_point)
Initialise a netfslib inode context
Parameters
struct netfs_inode *ctx
The netfs inode to initialise
const struct netfs_request_ops *ops
The netfs’s operations list
bool use_zero_point
True to use the zero_point read optimisation
Description
Initialise the netfs library context struct. This is expected to follow on directly from the VFS inode struct.
void netfs_resize_file(struct netfs_inode *ctx, loff_t new_i_size, bool changed_on_server)
Note that a file got resized
Parameters
struct netfs_inode *ctx
The netfs inode being resized
loff_t new_i_size
The new file size
bool changed_on_server
The change was applied to the server
Description
Inform the netfs lib that a file got resized so that it can adjust its state.
struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx)
Get the cache cookie from the inode
Parameters
struct netfs_inode *ctx
The netfs inode to query
Description
Get the caching cookie (if enabled) from the network filesystem’s inode.
void netfs_wait_for_outstanding_io(struct inode *inode)
Wait for outstanding I/O to complete
Parameters
struct inode *inode
The netfs inode to wait on
Description
Wait for outstanding I/O requests of any type to complete. This is intended to be called from inode eviction routines. This makes sure that any resources held by those requests are cleaned up before we let the inode get cleaned up.
void netfs_readahead(struct readahead_control *ractl)
Helper to manage a read request
Parameters
struct readahead_control *ractl
The description of the readahead request
Description
Fulfil a readahead request by drawing data from the cache if possible, or the netfs if not. Space beyond the EOF is zero-filled. Multiple I/O requests from different sources will get munged together. If necessary, the readahead window can be expanded in either direction to a more convenient alignment for RPC efficiency or to make storage in the cache feasible.
The calling netfs must initialise a netfs context contiguous to the vfs inode before calling this.
This is usable whether or not caching is enabled.
int netfs_read_folio(struct file *file, struct folio *folio)
Helper to manage a read_folio request
Parameters
struct file *file
The file to read from
struct folio *folio
The folio to read
Description
Fulfil a read_folio request by drawing data from the cache if possible, or the netfs if not. Space beyond the EOF is zero-filled. Multiple I/O requests from different sources will get munged together.
The calling netfs must initialise a netfs context contiguous to the vfs inode before calling this.
This is usable whether or not caching is enabled.
int netfs_write_begin(struct netfs_inode *ctx, struct file *file, struct address_space *mapping, loff_t pos, unsigned int len, struct folio **_folio, void **_fsdata)
Helper to prepare for writing [DEPRECATED]
Parameters
struct netfs_inode *ctx
The netfs context
struct file *file
The file to read from
struct address_space *mapping
The mapping to read from
loff_t pos
File position at which the write will begin
unsigned int len
The length of the write (may extend beyond the end of the folio chosen)
struct folio **_folio
Where to put the resultant folio
void **_fsdata
Place for the netfs to store a cookie
Description
Pre-read data for a write-begin request by drawing data from the cache if possible, or the netfs if not. Space beyond the EOF is zero-filled. Multiple I/O requests from different sources will get munged together.
The calling netfs must provide a table of operations, only one of which, issue_read, is mandatory.
The check_write_begin() operation can be provided to check for and flush conflicting writes once the folio is grabbed and locked. It is passed a pointer to the fsdata cookie that gets returned to the VM to be passed to write_end. It is permitted to sleep. It should return 0 if the request should go ahead or it may return an error. It may also unlock and put the folio, provided it sets *foliop to NULL, in which case a return of 0 will cause the folio to be re-got and the process to be retried.
The calling netfs must initialise a netfs context contiguous to the vfs inode before calling this.
This is usable whether or not caching is enabled.
Note that this should be considered deprecated and netfs_perform_write() used instead.
ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter)
Filesystem buffered I/O read routine
Parameters
struct kiocb *iocb
kernel I/O control block
struct iov_iter *iter
destination for the data read
Description
This is the ->read_iter() routine for all filesystems that can use the page cache directly.
The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall be returned when no data can be read without waiting for I/O requests to complete; it doesn’t prevent readahead.
The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/O requests shall be made for the read or for readahead. When no data can be read, -EAGAIN shall be returned. When readahead would be triggered, a partial, possibly empty read shall be returned.
Return
number of bytes copied, even for partial reads
negative error code (or 0 if IOCB_NOIO) if nothing was read
ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
Generic filesystem read routine
Parameters
struct kiocb *iocb
kernel I/O control block
struct iov_iter *iter
destination for the data read
Description
This is the ->read_iter() routine for all filesystems that can use the page cache directly.
The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall be returned when no data can be read without waiting for I/O requests to complete; it doesn’t prevent readahead.
The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/O requests shall be made for the read or for readahead. When no data can be read, -EAGAIN shall be returned. When readahead would be triggered, a partial, possibly empty read shall be returned.
Return
number of bytes copied, even for partial reads
negative error code (or 0 if IOCB_NOIO) if nothing was read