Memory Management APIs

User Space Memory Access

access_ok(type, addr, size)

Checks if a user space pointer is valid

Parameters

type
Type of access: VERIFY_READ or VERIFY_WRITE. Note that VERIFY_WRITE is a superset of VERIFY_READ - if it is safe to write to a block, it is always safe to read from it.
addr
User space pointer to start of block to check
size
Size of block to check

Context

User context only. This function may sleep if pagefaults are enabled.

Description

Checks if a pointer to a block of memory in user space is valid.

Returns true (nonzero) if the memory block may be valid, false (zero) if it is definitely invalid.

Note that, depending on architecture, this function probably just checks that the pointer is in the user space range - after calling this function, memory access functions may still return -EFAULT.

get_user(x, ptr)

Get a simple variable from user space.

Parameters

x
Variable to store result.
ptr
Source address, in user space.

Context

User context only. This function may sleep if pagefaults are enabled.

Description

This macro copies a single simple variable from user space to kernel space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and the result of dereferencing ptr must be assignable to x without a cast.

Returns zero on success, or -EFAULT on error. On error, the variable x is set to zero.

put_user(x, ptr)

Write a simple value into user space.

Parameters

x
Value to copy to user space.
ptr
Destination address, in user space.

Context

User context only. This function may sleep if pagefaults are enabled.

Description

This macro copies a single simple value from kernel space to user space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and x must be assignable to the result of dereferencing ptr.

Returns zero on success, or -EFAULT on error.
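As a hedged illustration (not taken from the kernel sources), a hypothetical ioctl-style helper that doubles a value supplied by user space might use get_user() and put_user() as follows:

  static long example_double_value(int __user *uarg)
  {
          int val;

          /* get_user() returns -EFAULT and zeroes val if the read faults */
          if (get_user(val, uarg))
                  return -EFAULT;

          val *= 2;

          /* put_user() returns -EFAULT if the write faults */
          if (put_user(val, uarg))
                  return -EFAULT;

          return 0;
  }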

__get_user(x, ptr)

Get a simple variable from user space, with less checking.

Parameters

x
Variable to store result.
ptr
Source address, in user space.

Context

User context only. This function may sleep if pagefaults are enabled.

Description

This macro copies a single simple variable from user space to kernel space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and the result of dereferencing ptr must be assignable to x without a cast.

Caller must check the pointer with access_ok() before calling this function.

Returns zero on success, or -EFAULT on error. On error, the variable x is set to zero.

__put_user(x, ptr)

Write a simple value into user space, with less checking.

Parameters

x
Value to copy to user space.
ptr
Destination address, in user space.

Context

User context only. This function may sleep if pagefaults are enabled.

Description

This macro copies a single simple value from kernel space to user space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and x must be assignable to the result of dereferencing ptr.

Caller must check the pointer with access_ok() before calling this function.

Returns zero on success, or -EFAULT on error.
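A minimal sketch of the intended pattern, using a hypothetical helper that sums an array of ints from user space: the pointers are validated once with access_ok() and then accessed with the unchecked __get_user()/__put_user() variants.

  static long example_sum_user(const int __user *uarr, int n, int __user *ures)
  {
          int i, v, sum = 0;

          if (!access_ok(VERIFY_READ, uarr, n * sizeof(*uarr)) ||
              !access_ok(VERIFY_WRITE, ures, sizeof(*ures)))
                  return -EFAULT;

          for (i = 0; i < n; i++) {
                  /* may still fault and return -EFAULT despite access_ok() */
                  if (__get_user(v, &uarr[i]))
                          return -EFAULT;
                  sum += v;
          }

          return __put_user(sum, ures);
  }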

unsigned long clear_user(void __user * to, unsigned long n)

Zero a block of memory in user space.

Parameters

void __user * to
Destination address, in user space.
unsigned long n
Number of bytes to zero.

Description

Zero a block of memory in user space.

Returns number of bytes that could not be cleared. On success, this will be zero.

unsigned long __clear_user(void __user * to, unsigned long n)

Zero a block of memory in user space, with less checking.

Parameters

void __user * to
Destination address, in user space.
unsigned long n
Number of bytes to zero.

Description

Zero a block of memory in user space. Caller must check the specified block with access_ok() before calling this function.

Returns number of bytes that could not be cleared. On success, this will be zero.

int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page ** pages)

pin user pages in memory

Parameters

unsigned long start
starting user address
int nr_pages
number of pages from start to pin
int write
whether pages will be written to
struct page ** pages
array that receives pointers to the pages pinned. Should be at least nr_pages long.

Description

Returns number of pages pinned. This may be fewer than the number requested. If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns -errno.

get_user_pages_fast provides equivalent functionality to get_user_pages, operating on current and current->mm, with force=0 and vma=NULL. However unlike get_user_pages, it must be called without mmap_sem held.

get_user_pages_fast may take mmap_sem and page table locks, so no assumptions can be made about lack of locking. get_user_pages_fast is to be implemented in a way that is advantageous (vs get_user_pages()) when the user memory area is already faulted in and present in ptes. However if the pages have to be faulted in, it may turn out to be slightly slower so callers need to carefully consider what to use. On many architectures, get_user_pages_fast simply falls back to get_user_pages.
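A sketch of a typical caller, assuming the four-argument prototype documented above; the helper name is hypothetical and the pages array is provided by the caller. Pinned pages must be released with put_page() when no longer needed.

  static int example_pin_user_buffer(unsigned long uaddr, int nr_pages,
                                     struct page **pages)
  {
          int pinned, i;

          pinned = get_user_pages_fast(uaddr, nr_pages, 1 /* write */, pages);
          if (pinned < 0)
                  return pinned;          /* -errno, nothing was pinned */

          /* ... access the pinned pages, e.g. set up DMA to them ... */

          for (i = 0; i < pinned; i++)
                  put_page(pages[i]);     /* drop the references taken above */

          return pinned == nr_pages ? 0 : -EFAULT;
  }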

Memory Allocation Controls

Functions which need to allocate memory often use GFP flags to express how that memory should be allocated. The GFP acronym stands for “get free pages”, the underlying memory allocation function. Not every GFP flag is allowed to every function which may allocate memory. Most users will want to use a plain GFP_KERNEL.

Page mobility and placement hints

These flags provide hints about how mobile the page is. Pages with similar mobility are placed within the same pageblocks to minimise problems due to external fragmentation.

__GFP_MOVABLE (also a zone modifier) indicates that the page can be moved by page migration during memory compaction or can be reclaimed.

__GFP_RECLAIMABLE is used for slab allocations that specify SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.

__GFP_WRITE indicates the caller intends to dirty the page. Where possible, these pages will be spread between local zones to avoid all the dirty pages being in one zone (fair zone allocation policy).

__GFP_HARDWALL enforces the cpuset memory allocation policy.

__GFP_THISNODE forces the allocation to be satisfied from the requested node with no fallbacks or placement policy enforcements.

__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.

Watermark modifiers – control access to emergency reserves

__GFP_HIGH indicates that the caller is high-priority and that granting the request is necessary before the system can make forward progress. For example, creating an IO context to clean pages.

__GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is high priority. Users are typically interrupt handlers. This may be used in conjunction with __GFP_HIGH.

__GFP_MEMALLOC allows access to all memory. This should only be used when the caller guarantees the allocation will allow more memory to be freed very shortly e.g. process exiting or swapping. Users either should be the MM or co-ordinating closely with the VM (e.g. swap over NFS).

__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves. This takes precedence over the __GFP_MEMALLOC flag if both are set.

Reclaim modifiers

__GFP_IO can start physical IO.

__GFP_FS can call down to the low-level FS. Clearing the flag avoids the allocator recursing into the filesystem which might already be holding locks.

__GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim. This flag can be cleared to avoid unnecessary delays when a fallback option is available.

__GFP_KSWAPD_RECLAIM indicates that the caller wants to wake kswapd when the low watermark is reached and have it reclaim pages until the high watermark is reached. A caller may wish to clear this flag when fallback options are available and the reclaim is likely to disrupt the system. The canonical example is THP allocation where a fallback is cheap but reclaim/compaction may cause indirect stalls.

__GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.

The default allocator behavior depends on the request size. We have a concept of so-called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER). Non-costly (!costly) allocations are too essential to fail, so they are implicitly non-failing by default (with some exceptions: OOM victims, for instance, may still fail, so the caller still has to check for failures), while costly requests try not to be disruptive and back off even without invoking the OOM killer. The following three modifiers may be used to override some of these implicit rules.

__GFP_NORETRY: The VM implementation will try only very lightweight memory direct reclaim to get some memory under memory pressure (thus it can sleep). It will avoid disruptive actions like the OOM killer. The caller must handle the failure, which is quite likely to happen under heavy memory pressure. The flag is suitable when failure can easily be handled at small cost, such as reduced throughput.

__GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim procedures that have previously failed if there is some indication that progress has been made elsewhere. It can wait for other tasks to attempt high level approaches to freeing memory such as compaction (which removes fragmentation) and page-out. There is still a definite limit to the number of retries, but it is a larger limit than with __GFP_NORETRY. Allocations with this flag may fail, but only when there is genuinely little unused memory. While these allocations do not directly trigger the OOM killer, their failure indicates that the system is likely to need to use the OOM killer soon. The caller must handle failure, but can reasonably do so by failing a higher-level request, or completing it only in a much less efficient manner. If the allocation does fail, and the caller is in a position to free some non-essential memory, doing so could benefit the system as a whole.

__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller cannot handle allocation failures. The allocation could block indefinitely but will never return with failure. Testing for failure is pointless. New users should be evaluated carefully (and the flag should be used only when there is no reasonable failure policy), but it is definitely preferable to use the flag rather than open-code an endless loop around the allocator. Using this flag for costly allocations is _highly_ discouraged.
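As a sketch of how these modifiers combine in practice (the sizes and fallback policy are hypothetical), a caller making a large but optional allocation might first try a cheap, non-disruptive request and fall back on failure:

  static void *example_optional_alloc(size_t large_size, size_t small_size)
  {
          void *buf;

          /* Cheap attempt: no retries, no OOM killer, no warning on failure. */
          buf = kmalloc(large_size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
          if (!buf)
                  /* Fall back to a smaller request; callers handle NULL. */
                  buf = kmalloc(small_size, GFP_KERNEL);

          return buf;
  }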

The Slab Cache

void * kmalloc(size_t size, gfp_t flags)

allocate memory

Parameters

size_t size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate.

Description

kmalloc is the normal method of allocating memory for objects smaller than page size in the kernel.

The flags argument may be one of:

GFP_USER - Allocate memory on behalf of user. May sleep.

GFP_KERNEL - Allocate normal kernel ram. May sleep.

GFP_ATOMIC - Allocation will not sleep. May use emergency pools.
For example, use this inside interrupt handlers.

GFP_HIGHUSER - Allocate pages from high memory.

GFP_NOIO - Do not do any I/O at all while trying to get memory.

GFP_NOFS - Do not make any fs calls while trying to get memory.

GFP_NOWAIT - Allocation will not sleep.

__GFP_THISNODE - Allocate node-local memory only.

GFP_DMA - Allocation suitable for DMA.
Should only be used for kmalloc() caches. Otherwise, use a slab created with SLAB_DMA.

Also it is possible to set different flags by OR’ing in one or more of the following additional flags:

__GFP_HIGH - This allocation has high priority and may use emergency pools.

__GFP_NOFAIL - Indicate that this allocation is in no way allowed to fail
(think twice before using).

__GFP_NORETRY - If memory is not immediately available, then give up at once.

__GFP_NOWARN - If allocation fails, don’t issue any warnings.

__GFP_RETRY_MAYFAIL - Try really hard to succeed the allocation but fail eventually.

There are other flags available as well, but these are not intended for general use, and so are not documented here. For a full list of potential flags, always refer to linux/gfp.h.
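A minimal sketch of the common patterns described above; struct foo and the surrounding helper are hypothetical:

  struct foo {
          int a;
          int b;
  };

  static struct foo *example_alloc_foo(void)
  {
          struct foo *f;

          /* Process context: GFP_KERNEL may sleep to satisfy the request. */
          f = kzalloc(sizeof(*f), GFP_KERNEL);
          if (!f)
                  return NULL;

          /*
           * In an interrupt handler the allocation must not sleep, so
           * GFP_ATOMIC would be used instead:  p = kmalloc(len, GFP_ATOMIC);
           */
          return f;      /* freed later with kfree(f) */
  }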

void * kmalloc_array(size_t n, size_t size, gfp_t flags)

allocate memory for an array.

Parameters

size_t n
number of elements.
size_t size
element size.
gfp_t flags
the type of memory to allocate (see kmalloc).

void * kcalloc(size_t n, size_t size, gfp_t flags)

allocate memory for an array. The memory is set to zero.

Parameters

size_t n
number of elements.
size_t size
element size.
gfp_t flags
the type of memory to allocate (see kmalloc).

void * kzalloc(size_t size, gfp_t flags)

allocate memory. The memory is set to zero.

Parameters

size_t size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate (see kmalloc).

void * kzalloc_node(size_t size, gfp_t flags, int node)

allocate zeroed memory from a particular memory node.

Parameters

size_t size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate (see kmalloc).
int node
memory node from which to allocate

void * kmem_cache_alloc(struct kmem_cache * cachep, gfp_t flags)

Allocate an object

Parameters

struct kmem_cache * cachep
The cache to allocate from.
gfp_t flags
See kmalloc().

Description

Allocate an object from this cache. The flags are only relevant if the cache has no available objects.

void * kmem_cache_alloc_node(struct kmem_cache * cachep, gfp_t flags, int nodeid)

Allocate an object on the specified node

Parameters

struct kmem_cache * cachep
The cache to allocate from.
gfp_t flags
See kmalloc().
int nodeid
node number of the target node.

Description

Identical to kmem_cache_alloc but it will allocate memory on the given node, which can improve the performance for cpu bound structures.

Fallback to another node is possible if __GFP_THISNODE is not set.

void kmem_cache_free(struct kmem_cache * cachep, void * objp)

Deallocate an object

Parameters

struct kmem_cache * cachep
The cache the allocation was from.
void * objp
The previously allocated object.

Description

Free an object which was previously allocated from this cache.
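A sketch of how these calls fit together: a driver keeping a dedicated cache for fixed-size objects. kmem_cache_create() and kmem_cache_destroy() are assumed from <linux/slab.h>; struct foo is hypothetical.

  struct foo {
          int id;
  };

  static struct kmem_cache *foo_cache;

  static int example_cache_setup(void)
  {
          foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                        0, SLAB_HWCACHE_ALIGN, NULL);
          return foo_cache ? 0 : -ENOMEM;
  }

  static void example_cache_use(void)
  {
          struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);

          if (!f)
                  return;
          /* ... use the object ... */
          kmem_cache_free(foo_cache, f);
  }

  /* On teardown: kmem_cache_destroy(foo_cache); */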

void kfree(const void * objp)

free previously allocated memory

Parameters

const void * objp
pointer returned by kmalloc.

Description

If objp is NULL, no operation is performed.

Don’t free memory not originally allocated by kmalloc() or you will run into trouble.

size_t ksize(const void * objp)

get the actual amount of memory allocated for a given object

Parameters

const void * objp
Pointer to the object

Description

kmalloc may internally round up allocations and return more memory than requested. ksize() can be used to determine the actual amount of memory allocated. The caller may use this additional memory, even though a smaller amount of memory was initially specified with the kmalloc call. The caller must guarantee that objp points to a valid object previously allocated with either kmalloc() or kmem_cache_alloc(). The object must not be freed during the duration of the call.
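A small sketch of the difference between the requested and the actual size; the helper is hypothetical:

  static void example_ksize(void)
  {
          char *buf = kmalloc(17, GFP_KERNEL);
          size_t usable;

          if (!buf)
                  return;

          usable = ksize(buf);            /* >= 17, e.g. 32 on many configs */
          memset(buf, 0, usable);         /* all of 'usable' may be used */
          kfree(buf);
  }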

void kfree_const(const void * x)

conditionally free memory

Parameters

const void * x
pointer to the memory

Description

This function calls kfree() only if x is not in the .rodata section.

void * kvmalloc_node(size_t size, gfp_t flags, int node)

attempt to allocate physically contiguous memory, but upon failure, fall back to non-contiguous (vmalloc) allocation.

Parameters

size_t size
size of the request.
gfp_t flags
gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
int node
numa node to allocate from

Description

Uses kmalloc to get the memory but if the allocation fails then falls back to the vmalloc allocator. Use kvfree for freeing the memory.

Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.

Please note that for any gfp flags other than GFP_KERNEL, the implementation is careful not to fall back to vmalloc.
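A sketch of typical usage through the kvmalloc() wrapper (which calls kvmalloc_node() with NUMA_NO_NODE); the table and helper are hypothetical:

  static unsigned long *example_alloc_table(size_t nr_entries)
  {
          unsigned long *table;

          /* Physically contiguous if kmalloc succeeds, vmalloc-backed otherwise. */
          table = kvmalloc(nr_entries * sizeof(*table), GFP_KERNEL);

          /* ... on success, release later with kvfree(table) ... */
          return table;
  }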

void kvfree(const void * addr)

Free memory.

Parameters

const void * addr
Pointer to allocated memory.

Description

kvfree frees memory allocated by any of vmalloc(), kmalloc() or kvmalloc(). It is slightly more efficient to use kfree() or vfree() if you are certain that you know which one to use.

Context

Either preemptible task context or not-NMI interrupt.

More Memory Management Functions

int read_cache_pages(struct address_space * mapping, struct list_head * pages, int (*filler) (void *, struct page *), void * data)

populate an address space with some pages & start reads against them

Parameters

struct address_space * mapping
the address_space
struct list_head * pages
The address of a list_head which contains the target pages. These pages have their ->index populated and are otherwise uninitialised.
int (*)(void *, struct page *) filler
callback routine for filling a single page.
void * data
private data for the callback routine.

Description

Hides the details of the LRU cache etc from the filesystems.

void page_cache_sync_readahead(struct address_space * mapping, struct file_ra_state * ra, struct file * filp, pgoff_t offset, unsigned long req_size)

generic file readahead

Parameters

struct address_space * mapping
address_space which holds the pagecache and I/O vectors
struct file_ra_state * ra
file_ra_state which holds the readahead state
struct file * filp
passed on to ->readpage() and ->readpages()
pgoff_t offset
start offset into mapping, in pagecache page-sized units
unsigned long req_size
hint: total size of the read which the caller is performing in pagecache pages

Description

page_cache_sync_readahead() should be called when a cache miss happened: it will submit the read. The readahead logic may decide to piggyback more pages onto the read request if access patterns suggest it will improve performance.

void page_cache_async_readahead(struct address_space * mapping, struct file_ra_state * ra, struct file * filp, struct page * page, pgoff_t offset, unsigned long req_size)

file readahead for marked pages

Parameters

struct address_space * mapping
address_space which holds the pagecache and I/O vectors
struct file_ra_state * ra
file_ra_state which holds the readahead state
struct file * filp
passed on to ->readpage() and ->readpages()
struct page * page
the page at offset which has the PG_readahead flag set
pgoff_t offset
start offset into mapping, in pagecache page-sized units
unsigned long req_size
hint: total size of the read which the caller is performing in pagecache pages

Description

page_cache_async_readahead() should be called when a page is used which has the PG_readahead flag; this is a marker to suggest that the application has used up enough of the readahead window that we should start pulling in more pages.

void delete_from_page_cache(struct page * page)

delete page from page cache

Parameters

struct page * page
the page which the kernel is trying to remove from page cache

Description

This must be called only on pages that have been verified to be in the page cache and locked. It will never put the page into the free list, the caller has a reference on the page.

int filemap_flush(struct address_space * mapping)

mostly a non-blocking flush

Parameters

struct address_space * mapping
target address_space

Description

This is a mostly non-blocking flush. Not suitable for data-integrity purposes - I/O may not be started against all dirty pages.

bool filemap_range_has_page(struct address_space * mapping, loff_t start_byte, loff_t end_byte)

check if a page exists in range.

Parameters

struct address_space * mapping
address space within which to check
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)

Description

Find at least one page in the range supplied, usually used to check if direct writing in this range will trigger a writeback.

int filemap_fdatawait_range(struct address_space * mapping, loff_t start_byte, loff_t end_byte)

wait for writeback to complete

Parameters

struct address_space * mapping
address space structure to wait for
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)

Description

Walk the list of under-writeback pages of the given address space in the given range and wait for all of them. Check error status of the address space and return it.

Since the error status of the address space is cleared by this function, callers are responsible for checking the return value and handling and/or reporting the error.

int file_fdatawait_range(struct file * file, loff_t start_byte, loff_t end_byte)

wait for writeback to complete

Parameters

struct file * file
file pointing to address space structure to wait for
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)

Description

Walk the list of under-writeback pages of the address space that file refers to, in the given range and wait for all of them. Check error status of the address space vs. the file->f_wb_err cursor and return it.

Since the error status of the file is advanced by this function, callers are responsible for checking the return value and handling and/or reporting the error.

int filemap_fdatawait_keep_errors(struct address_space * mapping)

wait for writeback without clearing errors

Parameters

struct address_space * mapping
address space structure to wait for

Description

Walk the list of under-writeback pages of the given address space and wait for all of them. Unlike filemap_fdatawait(), this function does not clear error status of the address space.

Use this function if callers don’t handle errors themselves. Expected call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), fsfreeze(8)

int filemap_write_and_wait_range(struct address_space * mapping, loff_t lstart, loff_t lend)

write out & wait on a file range

Parameters

struct address_space * mapping
the address_space for the pages
loff_t lstart
offset in bytes where the range starts
loff_t lend
offset in bytes where the range ends (inclusive)

Description

Write out and wait upon file offsets lstart->lend, inclusive.

Note that lend is inclusive (describes the last byte to be written) so that this function can be used to write to the very end-of-file (end = -1).

int file_check_and_advance_wb_err(struct file * file)

report wb error (if any) that was previously reported and advance wb_err to current one

Parameters

struct file * file
struct file on which the error is being reported

Description

When userland calls fsync (or something like nfsd does the equivalent), we want to report any writeback errors that occurred since the last fsync (or since the file was opened if there haven’t been any).

Grab the wb_err from the mapping. If it matches what we have in the file, then just quickly return 0. The file is all caught up.

If it doesn’t match, then take the mapping value, set the “seen” flag in it and try to swap it into place. If it works, or another task beat us to it with the new value, then update the f_wb_err and return the error portion. The error at this point must be reported via proper channels (a’la fsync, or NFS COMMIT operation, etc.).

While we handle mapping->wb_err with atomic operations, the f_wb_err value is protected by the f_lock since we must ensure that it reflects the latest value swapped in for this file descriptor.

int file_write_and_wait_range(struct file * file, loff_t lstart, loff_t lend)

write out & wait on a file range

Parameters

struct file * file
file pointing to address_space with pages
loff_t lstart
offset in bytes where the range starts
loff_t lend
offset in bytes where the range ends (inclusive)

Description

Write out and wait upon file offsets lstart->lend, inclusive.

Note that lend is inclusive (describes the last byte to be written) so that this function can be used to write to the very end-of-file (end = -1).

After writing out and waiting on the data, we check and advance the f_wb_err cursor to the latest value, and return any errors detected there.

int replace_page_cache_page(struct page * old, struct page * new, gfp_t gfp_mask)

replace a pagecache page with a new one

Parameters

struct page * old
page to be replaced
struct page * new
page to replace with
gfp_t gfp_mask
allocation mode

Description

This function replaces a page in the pagecache with a new one. On success it acquires the pagecache reference for the new page and drops it for the old page. Both the old and new pages must be locked. This function does not add the new page to the LRU, the caller must do that.

The remove + add is atomic. This function cannot fail.

int add_to_page_cache_locked(struct page * page, struct address_space * mapping, pgoff_t offset, gfp_t gfp_mask)

add a locked page to the pagecache

Parameters

struct page * page
page to add
struct address_space * mapping
the page’s address_space
pgoff_t offset
page index
gfp_t gfp_mask
page allocation mode

Description

This function is used to add a page to the pagecache. It must be locked. This function does not add the page to the LRU. The caller must do that.

void add_page_wait_queue(struct page * page, wait_queue_entry_t * waiter)

Add an arbitrary waiter to a page’s wait queue

Parameters

struct page * page
Page defining the wait queue of interest
wait_queue_entry_t * waiter
Waiter to add to the queue

Description

Add an arbitrary waiter to the wait queue for the nominated page.

void unlock_page(struct page * page)

unlock a locked page

Parameters

struct page * page
the page

Description

Unlocks the page and wakes up sleepers in ___wait_on_page_locked(). Also wakes sleepers in wait_on_page_writeback() because the wakeup mechanism between PageLocked pages and PageWriteback pages is shared. But that’s OK - sleepers in wait_on_page_writeback() just go back to sleep.

Note that this depends on PG_waiters being the sign bit in the byte that contains PG_locked - thus the BUILD_BUG_ON(). That allows us to clear the PG_locked bit and test PG_waiters at the same time fairly portably (architectures that do LL/SC can test any bit, while x86 can test the sign bit).

void end_page_writeback(struct page * page)

end writeback against a page

Parameters

struct page * page
the page

void __lock_page(struct page * __page)

get a lock on the page, assuming we need to sleep to get it

Parameters

struct page * __page
the page to lock

pgoff_t page_cache_next_miss(struct address_space * mapping, pgoff_t index, unsigned long max_scan)

Find the next gap in the page cache.

Parameters

struct address_space * mapping
Mapping.
pgoff_t index
Index.
unsigned long max_scan
Maximum range to search.

Description

Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the gap with the lowest index.

This function may be called under the rcu_read_lock. However, this will not atomically search a snapshot of the cache at a single point in time. For example, if a gap is created at index 5, then subsequently a gap is created at index 10, page_cache_next_miss covering both indices may return 10 if called under the rcu_read_lock.

Return

The index of the gap if found, otherwise an index outside the range specified (in which case ‘return - index >= max_scan’ will be true). In the rare case of index wrap-around, 0 will be returned.

pgoff_t page_cache_prev_miss(struct address_space * mapping, pgoff_t index, unsigned long max_scan)

Find the previous gap in the page cache.

Parameters

struct address_space * mapping
Mapping.
pgoff_t index
Index.
unsigned long max_scan
Maximum range to search.

Description

Search the range [max(index - max_scan + 1, 0), index] for the gap with the highest index.

This function may be called under the rcu_read_lock. However, this will not atomically search a snapshot of the cache at a single point in time. For example, if a gap is created at index 10, then subsequently a gap is created at index 5, page_cache_prev_miss() covering both indices may return 5 if called under the rcu_read_lock.

Return

The index of the gap if found, otherwise an index outside the range specified (in which case ‘index - return >= max_scan’ will be true). In the rare case of wrap-around, ULONG_MAX will be returned.

struct page * find_get_entry(struct address_space * mapping, pgoff_t offset)

find and get a page cache entry

Parameters

struct address_space * mapping
the address_space to search
pgoff_t offset
the page cache index

Description

Looks up the page cache slot at mapping & offset. If there is a page cache page, it is returned with an increased refcount.

If the slot holds a shadow entry of a previously evicted page, or a swap entry from shmem/tmpfs, it is returned.

Otherwise, NULL is returned.

struct page * find_lock_entry(struct address_space * mapping, pgoff_t offset)

locate, pin and lock a page cache entry

Parameters

struct address_space * mapping
the address_space to search
pgoff_t offset
the page cache index

Description

Looks up the page cache slot at mapping & offset. If there is a page cache page, it is returned locked and with an increased refcount.

If the slot holds a shadow entry of a previously evicted page, or a swap entry from shmem/tmpfs, it is returned.

Otherwise, NULL is returned.

find_lock_entry() may sleep.

struct page * pagecache_get_page(struct address_space * mapping, pgoff_t offset, int fgp_flags, gfp_t gfp_mask)

find and get a page reference

Parameters

struct address_space * mapping
the address_space to search
pgoff_t offset
the page index
int fgp_flags
PCG flags
gfp_t gfp_mask
gfp mask to use for the page cache data page allocation

Description

Looks up the page cache slot at mapping & offset.

PCG flags modify how the page is returned.

fgp_flags can be:

  • FGP_ACCESSED: the page will be marked accessed
  • FGP_LOCK: the page is returned locked
  • FGP_CREAT: If page is not present then a new page is allocated using gfp_mask and added to the page cache and the VM’s LRU list. The page is returned locked and with an increased refcount. Otherwise, NULL is returned.

If FGP_LOCK or FGP_CREAT are specified then the function may sleep even if the GFP flags specified for FGP_CREAT are atomic.

If there is a page cache page, it is returned with an increased refcount.
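A sketch of a find-or-create lookup, similar in spirit to the grab_cache_page()-style helpers; the surrounding context and error handling are hypothetical and kept minimal:

  static int example_grab_page(struct address_space *mapping, pgoff_t index)
  {
          struct page *page;

          page = pagecache_get_page(mapping, index,
                                    FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
                                    mapping_gfp_mask(mapping));
          if (!page)
                  return -ENOMEM;

          /* ... read or fill the locked page ... */

          unlock_page(page);
          put_page(page);
          return 0;
  }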

unsigned find_get_pages_contig(struct address_space * mapping, pgoff_t index, unsigned int nr_pages, struct page ** pages)

gang contiguous pagecache lookup

Parameters

struct address_space * mapping
The address_space to search
pgoff_t index
The starting page index
unsigned int nr_pages
The maximum number of pages
struct page ** pages
Where the resulting pages are placed

Description

find_get_pages_contig() works exactly like find_get_pages(), except that the returned number of pages are guaranteed to be contiguous.

find_get_pages_contig() returns the number of pages which were found.

unsigned find_get_pages_range_tag(struct address_space * mapping, pgoff_t * index, pgoff_t end, xa_mark_t tag, unsigned int nr_pages, struct page ** pages)

find and return pages in given range matching tag

Parameters

struct address_space * mapping
the address_space to search
pgoff_t * index
the starting page index
pgoff_t end
The final page index (inclusive)
xa_mark_t tag
the tag index
unsigned int nr_pages
the maximum number of pages
struct page ** pages
where the resulting pages are placed

Description

Like find_get_pages, except we only return pages which are tagged with tag. We update index to index the next page for the traversal.

unsigned find_get_entries_tag(struct address_space * mapping, pgoff_t start, xa_mark_t tag, unsigned int nr_entries, struct page ** entries, pgoff_t * indices)

find and return entries that match tag

Parameters

struct address_space * mapping
the address_space to search
pgoff_t start
the starting page cache index
xa_mark_t tag
the tag index
unsigned int nr_entries
the maximum number of entries
struct page ** entries
where the resulting entries are placed
pgoff_t * indices
the cache indices corresponding to the entries in entries

Description

Like find_get_entries, except we only return entries which are tagged with tag.

ssize_t generic_file_read_iter(struct kiocb * iocb, struct iov_iter * iter)

generic filesystem read routine

Parameters

struct kiocb * iocb
kernel I/O control block
struct iov_iter * iter
destination for the data read

Description

This is the “read_iter()” routine for all filesystems that can use the page cache directly.

vm_fault_t filemap_fault(struct vm_fault * vmf)

read in file data for page fault handling

Parameters

struct vm_fault * vmf
struct vm_fault containing details of the fault

Description

filemap_fault() is invoked via the vma operations vector for a mapped memory region to read in file data during a page fault.

The goto’s are kind of ugly, but this streamlines the normal case of having it in the page cache, and handles the special cases reasonably without having a lot of duplicated code.

vma->vm_mm->mmap_sem must be held on entry.

If our return value has VM_FAULT_RETRY set, it’s because lock_page_or_retry() returned 0. The mmap_sem has usually been released in this case. See __lock_page_or_retry() for the exception.

If our return value does not have VM_FAULT_RETRY set, the mmap_sem has not been released.

We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set.

struct page * read_cache_page(struct address_space * mapping, pgoff_t index, int (*filler) (void *, struct page *), void * data)

read into page cache, fill it if needed

Parameters

struct address_space * mapping
the page’s address_space
pgoff_t index
the page index
int (*)(void *, struct page *) filler
function to perform the read
void * data
first arg to filler(data, page) function, often left as NULL

Description

Read into the page cache. If a page already exists, and PageUptodate() is not set, try to fill the page and wait for it to become unlocked.

If the page does not get brought uptodate, return -EIO.

struct page * read_cache_page_gfp(struct address_space * mapping, pgoff_t index, gfp_t gfp)

read into page cache, using specified page allocation flags.

Parameters

struct address_space * mapping
the page’s address_space
pgoff_t index
the page index
gfp_t gfp
the page allocator flags to use if allocating

Description

This is the same as “read_mapping_page(mapping, index, NULL)”, but with any new page allocations done using the specified allocation flags.

If the page does not get brought uptodate, return -EIO.

ssize_t __generic_file_write_iter(struct kiocb * iocb, struct iov_iter * from)

write data to a file

Parameters

struct kiocb * iocb
IO state structure (file, offset, etc.)
struct iov_iter * from
iov_iter with data to write

Description

This function does all the work needed for actually writing data to a file. It does all basic checks, removes SUID from the file, updates modification times and calls proper subroutines depending on whether we do direct IO or a standard buffered write.

It expects i_mutex to be grabbed unless we work on a block device or similar object which does not need locking at all.

This function does not take care of syncing data in case of O_SYNC write. A caller has to handle it. This is mainly due to the fact that we want to avoid syncing under i_mutex.

ssize_t generic_file_write_iter(struct kiocb * iocb, struct iov_iter * from)

write data to a file

Parameters

struct kiocb * iocb
IO state structure
struct iov_iter * from
iov_iter with data to write

Description

This is a wrapper around __generic_file_write_iter() to be used by most filesystems. It takes care of syncing the file in case of O_SYNC file and acquires i_mutex as needed.

int try_to_release_page(struct page * page, gfp_t gfp_mask)

release old fs-specific metadata on a page

Parameters

struct page * page
the page which the kernel is trying to free
gfp_t gfp_mask
memory allocation flags (and I/O mode)

Description

The address_space is asked to try to release any data held against the page (presumably at page->private). If the release was successful, return ‘1’. Otherwise return zero.

This may also be called if PG_fscache is set on a page, indicating that the page is known to the local caching routines.

The gfp_mask argument specifies whether I/O may be performed to release this page (__GFP_IO), and whether the call may block (__GFP_RECLAIM & __GFP_FS).

void zap_vma_ptes(struct vm_area_struct * vma, unsigned long address, unsigned long size)

remove ptes mapping the vma

Parameters

struct vm_area_struct * vma
vm_area_struct holding ptes to be zapped
unsigned long address
starting address of pages to zap
unsigned long size
number of bytes to zap

Description

This function only unmaps ptes assigned to VM_PFNMAP vmas.

The entire address range must be fully contained within the vma.

int vm_insert_page(struct vm_area_struct * vma, unsigned long addr, struct page * page)

insert single page into user vma

Parameters

struct vm_area_struct * vma
user vma to map to
unsigned long addr
target user address of this page
struct page * page
source kernel page

Description

This allows drivers to insert individual pages they’ve allocated into a user vma.

The page has to be a nice clean _individual_ kernel allocation. If you allocate a compound page, you need to have marked it as such (__GFP_COMP), or manually just split the page up yourself (see split_page()).

NOTE! Traditionally this was done with “remap_pfn_range()” which took an arbitrary page protection parameter. This doesn’t allow that. Your vma protection will have to be set up correctly, which means that if you want a shared writable mapping, you’d better ask for a shared writable mapping!

The page does not need to be reserved.

Usually this function is called from f_op->mmap() handler under mm->mmap_sem write-lock, so it can change vma->vm_flags. Caller must set VM_MIXEDMAP on vma if it wants to call this function from other places, for example from page-fault handler.
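A sketch of a driver ->mmap() handler inserting a single page it allocated earlier; drv_page (from alloc_page(GFP_KERNEL)) and the file_operations wiring are hypothetical:

  static struct page *drv_page;   /* allocated elsewhere with alloc_page() */

  static int example_mmap(struct file *file, struct vm_area_struct *vma)
  {
          if (vma->vm_end - vma->vm_start < PAGE_SIZE)
                  return -EINVAL;

          /* Called under mmap_sem write-lock, so vm_flags may still change. */
          return vm_insert_page(vma, vma->vm_start, drv_page);
  }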

vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct * vma, unsigned long addr, unsigned long pfn, pgprot_t pgprot)

insert single pfn into user vma with specified pgprot

Parameters

struct vm_area_struct * vma
user vma to map to
unsigned long addr
target user address of this page
unsigned long pfn
source kernel pfn
pgprot_t pgprot
pgprot flags for the inserted page

Description

This is exactly like vmf_insert_pfn(), except that it allows drivers to override pgprot on a per-page basis.

This only makes sense for IO mappings, and it makes no sense for COW mappings. In general, using multiple VMAs is preferable; vmf_insert_pfn_prot() should only be used if using multiple VMAs is impractical.

Context

Process context. May allocate using GFP_KERNEL.

Return

vm_fault_t value.

vm_fault_t vmf_insert_pfn(struct vm_area_struct * vma, unsigned long addr, unsigned long pfn)

insert single pfn into user vma

Parameters

struct vm_area_struct * vma
user vma to map to
unsigned long addr
target user address of this page
unsigned long pfn
source kernel pfn

Description

Similar to vm_insert_page, this allows drivers to insert individual pages they’ve allocated into a user vma. Same comments apply.

This function should only be called from a vm_ops->fault handler, and in that case the handler should return the result of this function.

vma cannot be a COW mapping.

As this is called only for pages that do not currently exist, we do not need to flush old virtual caches or the TLB.

Context

Process context. May allocate using GFP_KERNEL.

Return

vm_fault_t value.

int remap_pfn_range(struct vm_area_struct * vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot)

remap kernel memory to userspace

Parameters

struct vm_area_struct * vma
user vma to map to
unsigned long addr
target user address to start at
unsigned long pfn
physical address of kernel memory
unsigned long size
size of map area
pgprot_t prot
page protection flags for this mapping

Note

this is only safe if the mm semaphore is held when called.
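A sketch of a driver ->mmap() handler exposing a device register window; phys_base stands in for a hypothetical physical base address of that window:

  static phys_addr_t phys_base;   /* hypothetical device register base */

  static int example_io_mmap(struct file *file, struct vm_area_struct *vma)
  {
          unsigned long size = vma->vm_end - vma->vm_start;

          /* Device registers: map them uncached. */
          vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

          return remap_pfn_range(vma, vma->vm_start,
                                 phys_base >> PAGE_SHIFT,
                                 size, vma->vm_page_prot);
  }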

int vm_iomap_memory(struct vm_area_struct * vma, phys_addr_t start, unsigned long len)

remap memory to userspace

Parameters

struct vm_area_struct * vma
user vma to map to
phys_addr_t start
start of area
unsigned long len
size of area

Description

This is a simplified io_remap_pfn_range() for common driver use. The driver just needs to give us the physical memory range to be mapped, we’ll figure out the rest from the vma information.

NOTE! Some drivers might want to tweak vma->vm_page_prot first to get whatever write-combining details or similar.

void unmap_mapping_range(struct address_space * mapping, loff_t const holebegin, loff_t const holelen, int even_cows)

unmap the portion of all mmaps in the specified address_space corresponding to the specified byte range in the underlying file.

Parameters

struct address_space * mapping
the address space containing mmaps to be unmapped.
loff_t const holebegin
byte in first page to unmap, relative to the start of the underlying file. This will be rounded down to a PAGE_SIZE boundary. Note that this is different from truncate_pagecache(), which must keep the partial page. In contrast, we must get rid of partial pages.
loff_t const holelen
size of prospective hole in bytes. This will be rounded up to a PAGE_SIZE boundary. A holelen of zero truncates to the end of the file.
int even_cows
1 when truncating a file, unmap even private COWed pages; but 0 when invalidating pagecache, don’t throw away private data.

int follow_pfn(struct vm_area_struct * vma, unsigned long address, unsigned long * pfn)

look up PFN at a user virtual address

Parameters

struct vm_area_struct * vma
memory mapping
unsigned long address
user virtual address
unsigned long * pfn
location to store found PFN

Description

Only IO mappings and raw PFN mappings are allowed.

Returns zero and stores the found PFN in pfn on success, or a negative error code otherwise.

void vm_unmap_aliases(void)

unmap outstanding lazy aliases in the vmap layer

Parameters

void
no arguments

Description

The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily to amortize TLB flushing overheads. What this means is that any page you have now, may, in a former life, have been mapped into kernel virtual address space by the vmap layer and so there might be some CPUs with TLB entries still referencing that page (additional to the regular 1:1 kernel mapping).

vm_unmap_aliases flushes all such lazy mappings. After it returns, we can be sure that none of the pages we have control over will have any aliases from the vmap layer.

void vm_unmap_ram(const void * mem, unsigned int count)

unmap linear kernel address space set up by vm_map_ram

Parameters

const void * mem
the pointer returned by vm_map_ram
unsigned int count
the count passed to that vm_map_ram call (cannot unmap partial)

void * vm_map_ram(struct page ** pages, unsigned int count, int node, pgprot_t prot)

map pages linearly into kernel virtual address (vmalloc space)

Parameters

struct page ** pages
an array of pointers to the pages to be mapped
unsigned int count
number of pages
int node
prefer to allocate data structures on this node
pgprot_t prot
memory protection to use. PAGE_KERNEL for regular RAM

Description

If you use this function for less than VMAP_MAX_ALLOC pages, it could be faster than vmap so it’s good. But if you mix long-life and short-life objects with vm_map_ram(), it could consume lots of address space through fragmentation (especially on a 32bit machine). You could see failures in the end. Please use this function for short-lived objects.

Return

a pointer to the address that has been mapped, or NULL on failure
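A sketch of a short-lived mapping of a page array; the pages array is assumed to have been filled in by the caller:

  static int example_with_mapping(struct page **pages, unsigned int nr_pages)
  {
          void *addr;

          addr = vm_map_ram(pages, nr_pages, NUMA_NO_NODE, PAGE_KERNEL);
          if (!addr)
                  return -ENOMEM;

          /* ... access the pages through the linear mapping at addr ... */

          vm_unmap_ram(addr, nr_pages);
          return 0;
  }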

void unmap_kernel_range_noflush(unsigned long addr, unsigned long size)

unmap kernel VM area

Parameters

unsigned long addr
start of the VM area to unmap
unsigned long size
size of the VM area to unmap

Description

Unmap PFN_UP(size) pages at addr. The VM area specified by addr and size should have been allocated using get_vm_area() and its friends.

NOTE

This function does NOT do any cache flushing. The caller is responsible for calling flush_cache_vunmap() on to-be-mapped areas before calling this function and flush_tlb_kernel_range() after.

void unmap_kernel_range(unsigned long addr, unsigned long size)

unmap kernel VM area and flush cache and TLB

Parameters

unsigned long addr
start of the VM area to unmap
unsigned long size
size of the VM area to unmap

Description

Similar to unmap_kernel_range_noflush() but flushes vcache before the unmapping and tlb after.

void vfree(const void * addr)

release memory allocated by vmalloc()

Parameters

const void * addr
memory base address

Description

Free the virtually continuous memory area starting at addr, as obtained from vmalloc(), vmalloc_32() or __vmalloc(). If addr is NULL, no operation is performed.

Must not be called in NMI context (strictly speaking, only if we don’t have CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG, but making the calling conventions for vfree() arch-dependent would be a really bad idea).

May sleep if not called from interrupt context.

NOTE

assumes that the object at addr has a size >= sizeof(llist_node)
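A minimal sketch of the usual pairing with vmalloc(), which is documented further below:

  static void *example_big_buffer(unsigned long nr_bytes)
  {
          /* Virtually contiguous; the pages need not be physically contiguous. */
          void *buf = vmalloc(nr_bytes);

          if (!buf)
                  return NULL;

          /* ... use buf ..., then release it with vfree(buf) */
          return buf;
  }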

void vunmap(const void * addr)

release virtual mapping obtained by vmap()

Parameters

const void * addr
memory base address

Description

Free the virtually contiguous memory area starting at addr, which was created from the page array passed to vmap().

Must not be called in interrupt context.

void * vmap(struct page ** pages, unsigned int count, unsigned long flags, pgprot_t prot)

map an array of pages into virtually contiguous space

Parameters

struct page ** pages
array of page pointers
unsigned int count
number of pages to map
unsigned long flags
vm_area->flags
pgprot_t prot
page protection for the mapping

Description

Maps count pages from pages into contiguous kernel virtual space.

void * vmalloc(unsigned long size)

allocate virtually contiguous memory

Parameters

unsigned long size
allocation size

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space.

For tight control over page level allocator and protection flags use __vmalloc() instead.

void * vzalloc(unsigned long size)

allocate virtually contiguous memory with zero fill

Parameters

unsigned long size
allocation size

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. The memory allocated is set to zero.

For tight control over page level allocator and protection flags use __vmalloc() instead.

void * vmalloc_user(unsigned long size)

allocate zeroed virtually contiguous memory for userspace

Parameters

unsigned long size
allocation size

Description

The resulting memory area is zeroed so it can be mapped to userspace without leaking data.

void * vmalloc_node(unsigned long size, int node)

allocate memory on a specific node

Parameters

unsigned long size
allocation size
int node
numa node

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space.

For tight control over page level allocator and protection flags use __vmalloc() instead.

void * vzalloc_node(unsigned long size, int node)

allocate memory on a specific node with zero fill

Parameters

unsigned long size
allocation size
int node
numa node

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. The memory allocated is set to zero.

For tight control over page level allocator and protection flags use __vmalloc_node() instead.

void * vmalloc_32(unsigned long size)

allocate virtually contiguous memory (32bit addressable)

Parameters

unsigned long size
allocation size

Description

Allocate enough 32bit PA addressable pages to cover size from the page level allocator and map them into contiguous kernel virtual space.

void * vmalloc_32_user(unsigned long size)

allocate zeroed virtually contiguous 32bit memory

Parameters

unsigned long size
allocation size

Description

The resulting memory area is 32bit addressable and zeroed so it can be mapped to userspace without leaking data.

int remap_vmalloc_range_partial(struct vm_area_struct * vma, unsigned long uaddr, void * kaddr, unsigned long size)

map vmalloc pages to userspace

Parameters

struct vm_area_struct * vma
vma to cover
unsigned long uaddr
target user address to start at
void * kaddr
virtual address of vmalloc kernel memory
unsigned long size
size of map area

Return

0 for success, -Exxx on failure

This function checks that kaddr is a valid vmalloc’ed area, and that it is big enough to cover the range starting at uaddr in vma. Will return failure if that criterion isn’t met.

Similar to remap_pfn_range() (see mm/memory.c)

int remap_vmalloc_range(struct vm_area_struct * vma, void * addr, unsigned long pgoff)

map vmalloc pages to userspace

Parameters

struct vm_area_struct * vma
vma to cover (map full range of vma)
void * addr
vmalloc memory
unsigned long pgoff
number of pages into addr before first page to map

Return

0 for success, -Exxx on failure

This function checks that addr is a valid vmalloc’ed area, and that it is big enough to cover the vma. Will return failure if that criterion isn’t met.

Similar to remap_pfn_range() (see mm/memory.c)
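A sketch of an ->mmap() handler exporting a vmalloc'ed buffer to user space; vbuf is hypothetical and assumed to come from vmalloc_user() so it is zeroed and safe to expose:

  static void *vbuf;      /* hypothetical, allocated with vmalloc_user() */

  static int example_vmalloc_mmap(struct file *file, struct vm_area_struct *vma)
  {
          /* Maps the full vma; fails if vbuf is too small to cover it. */
          return remap_vmalloc_range(vma, vbuf, vma->vm_pgoff);
  }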

struct vm_struct * alloc_vm_area(size_t size, pte_t ** ptes)

allocate a range of kernel address space

Parameters

size_t size
size of the area
pte_t ** ptes
returns the PTEs for the address space

Return

NULL on failure, vm_struct on success

This function reserves a range of kernel address space, and allocates pagetables to map that range. No actual mappings are created.

If ptes is non-NULL, pointers to the PTEs (in init_mm) allocated for the VM area are returned.

unsigned long __get_pfnblock_flags_mask(struct page * page, unsigned long pfn, unsigned long end_bitidx, unsigned long mask)

Return the requested group of flags for the pageblock_nr_pages block of pages

Parameters

struct page * page
The page within the block of interest
unsigned long pfn
The target page frame number
unsigned long end_bitidx
The last bit of interest to retrieve
unsigned long mask
mask of bits that the caller is interested in

Return

pageblock_bits flags

void set_pfnblock_flags_mask(struct page * page, unsigned long flags, unsigned long pfn, unsigned long end_bitidx, unsigned long mask)

Set the requested group of flags for a pageblock_nr_pages block of pages

Parameters

struct page * page
The page within the block of interest
unsigned long flags
The flags to set
unsigned long pfn
The target page frame number
unsigned long end_bitidx
The last bit of interest
unsigned long mask
mask of bits that the caller is interested in

void * alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)

allocate an exact number of physically-contiguous pages on a node.

Parameters

int nid
the preferred node ID where memory should be allocated
size_t size
the number of bytes to allocate
gfp_t gfp_mask
GFP flags for the allocation

Description

Like alloc_pages_exact(), but try to allocate on node nid first before falling back.

unsigned long nr_free_zone_pages(int offset)

count number of pages beyond high watermark

Parameters

int offset
The zone index of the highest zone

Description

nr_free_zone_pages() counts the number of pages which are beyond the high watermark within all zones at or below a given zone index. For each zone, the number of pages is calculated as:

nr_free_zone_pages = managed_pages - high_pages

unsigned long nr_free_pagecache_pages(void)

count number of pages beyond high watermark

Parameters

void
no arguments

Description

nr_free_pagecache_pages() counts the number of pages which are beyond the high watermark within all zones.

int find_next_best_node(int node, nodemask_t * used_node_mask)

find the next node that should appear in a given node’s fallback list

Parameters

int node
node whose fallback list we’re appending
nodemask_t * used_node_mask
nodemask_t of already used nodes

Description

We use a number of factors to determine which is the next node that should appear on a given node’s fallback list. The node should not have appeared already in node‘s fallback list, and it should be the next closest node according to the distance array (which contains arbitrary distance values from each node to each node in the system), and should also prefer nodes with no CPUs, since presumably they’ll have very little allocation pressure on them otherwise. It returns -1 if no node is found.

void free_bootmem_with_active_regions(int nid, unsigned long max_low_pfn)

Call memblock_free_early_nid for each active range

Parameters

int nid
The node to free memory on. If MAX_NUMNODES, all nodes are freed.
unsigned long max_low_pfn
The highest PFN that will be passed to memblock_free_early_nid

Description

If an architecture guarantees that all ranges registered contain no holes and may be freed, this function may be used instead of calling memblock_free_early_nid() manually.

void sparse_memory_present_with_active_regions(int nid)

Call memory_present for each active range

Parameters

int nid
The node to call memory_present for. If MAX_NUMNODES, all nodes will be used.

Description

If an architecture guarantees that all ranges registered contain no holes and may be freed, this function may be used instead of calling memory_present() manually.

void get_pfn_range_for_nid(unsigned int nid, unsigned long * start_pfn, unsigned long * end_pfn)

Return the start and end page frames for a node

Parameters

unsigned int nid
The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned.
unsigned long * start_pfn
Passed by reference. On return, it will have the node start_pfn.
unsigned long * end_pfn
Passed by reference. On return, it will have the node end_pfn.

Description

It returns the start and end page frame of a node based on information provided by memblock_set_node(). If called for a node with no available memory, a warning is printed and the start and end PFNs will be 0.

unsigned long absent_pages_in_range(unsigned long start_pfn, unsigned long end_pfn)

Return number of page frames in holes within a range

Parameters

unsigned long start_pfn
The start PFN to start searching for holes
unsigned long end_pfn
The end PFN to stop searching for holes

Description

It returns the number of page frames in memory holes within a range.

unsigned long node_map_pfn_alignment(void)

determine the maximum internode alignment

Parameters

void
no arguments

Description

This function should be called after node map is populated and sorted. It calculates the maximum power of two alignment which can distinguish all the nodes.

For example, if all nodes are 1GiB and aligned to 1GiB, the return value would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the nodes are shifted by 256MiB, 256MiB is returned. Note that if only the last node is shifted, 1GiB is enough and this function will indicate so.

This is used to test whether pfn -> nid mapping of the chosen memory model has fine enough granularity to avoid incorrect mapping for the populated node map.

Returns the determined alignment in pfn’s. 0 if there is no alignment requirement (single node).

unsigned long find_min_pfn_with_active_regions(void)

Find the minimum PFN registered

Parameters

void
no arguments

Description

It returns the minimum PFN based on information provided via memblock_set_node().

void free_area_init_nodes(unsigned long * max_zone_pfn)

Initialise all pg_data_t and zone data

Parameters

unsigned long * max_zone_pfn
an array of max PFNs for each zone

Description

This will call free_area_init_node() for each active node in the system. Using the page ranges provided by memblock_set_node(), the size of each zone in each node and their holes is calculated. If the maximum PFN between two adjacent zones match, it is assumed that the zone is empty. For example, if arch_max_dma_pfn == arch_max_dma32_pfn, it is assumed that arch_max_dma32_pfn has no pages. It is also assumed that a zone starts where the previous one ended. For example, ZONE_DMA32 starts at arch_max_dma_pfn.

void set_dma_reserve(unsigned long new_dma_reserve)

set the specified number of pages reserved in the first zone

Parameters

unsigned long new_dma_reserve
The number of pages to mark reserved

Description

The per-cpu batchsize and zone watermarks are determined by managed_pages. In the DMA zone, a significant percentage may be consumed by kernel image and other unfreeable allocations which can skew the watermarks badly. This function may optionally be used to account for unfreeable pages in the first zone (e.g., ZONE_DMA). The effect will be lower watermarks and smaller per-cpu batchsize.

void setup_per_zone_wmarks(void)

called when min_free_kbytes changes or when memory is hot-{added|removed}

Parameters

void
no arguments

Description

Ensures that the watermark[min,low,high] values for each zone are set correctly with respect to min_free_kbytes.

int alloc_contig_range(unsigned long start, unsigned long end, unsigned migratetype, gfp_t gfp_mask)
tries to allocate the given range of pages

Parameters

unsigned long start
start PFN to allocate
unsigned long end
one-past-the-last PFN to allocate
unsigned migratetype
migratetype of the underlying pageblocks (either MIGRATE_MOVABLE or MIGRATE_CMA). All pageblocks in the range must have the same migratetype and it must be either of the two.
gfp_t gfp_mask
GFP mask to use during compaction

Description

The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES aligned. The PFN range must belong to a single zone.

The first thing this routine does is attempt to MIGRATE_ISOLATE all pageblocks in the range. Once isolated, the pageblocks should not be modified by others.

Returns zero on success or a negative error code. On success, all pages whose PFNs are in [start, end) are allocated for the caller and need to be freed with free_contig_range().
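
A minimal caller sketch (start_pfn and nr_pages are assumed to describe MIGRATE_MOVABLE pageblocks within a single zone):

    int ret;

    ret = alloc_contig_range(start_pfn, start_pfn + nr_pages,
                             MIGRATE_MOVABLE, GFP_KERNEL);
    if (ret)
            return ret;     /* isolation or migration failed */

    /* The pages pfn_to_page(start_pfn) .. pfn_to_page(start_pfn + nr_pages - 1)
     * now belong to the caller. */

    free_contig_range(start_pfn, nr_pages);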

void mempool_exit(mempool_t * pool)

exit a mempool initialized with mempool_init()

Parameters

mempool_t * pool
pointer to the memory pool which was initialized with mempool_init().

Description

Free all reserved elements in pool and pool itself. This function only sleeps if the free_fn() function sleeps.

May be called on a zeroed but uninitialized mempool (i.e. allocated with kzalloc()).

void mempool_destroy(mempool_t * pool)

deallocate a memory pool

Parameters

mempool_t * pool
pointer to the memory pool which was allocated via mempool_create().

Description

Free all reserved elements in pool and pool itself. This function only sleeps if the free_fn() function sleeps.

int mempool_init(mempool_t * pool, int min_nr, mempool_alloc_t * alloc_fn, mempool_free_t * free_fn, void * pool_data)

initialize a memory pool

Parameters

mempool_t * pool
pointer to the memory pool that should be initialized
int min_nr
the minimum number of elements guaranteed to be allocated for this pool.
mempool_alloc_t * alloc_fn
user-defined element-allocation function.
mempool_free_t * free_fn
user-defined element-freeing function.
void * pool_data
optional private data available to the user-defined functions.

Description

Like mempool_create(), but initializes the pool in place (i.e. the mempool is embedded in another structure rather than allocated separately). Returns 0 on success, or a negative error code on failure.
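
A sketch of a driver embedding a mempool in its own state and backing it with an existing slab cache; struct my_dev and my_cache are assumptions, while mempool_alloc_slab()/mempool_free_slab() are the stock slab-backed callbacks:

    struct my_dev {
            mempool_t pool;
            /* ... */
    };

    static int my_dev_init_pool(struct my_dev *d, struct kmem_cache *my_cache)
    {
            /* Guarantee at least 16 preallocated elements. */
            return mempool_init(&d->pool, 16, mempool_alloc_slab,
                                mempool_free_slab, my_cache);
    }

    /* Teardown uses mempool_exit(&d->pool), since the pool is embedded. */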

mempool_t * mempool_create(int min_nr, mempool_alloc_t * alloc_fn, mempool_free_t * free_fn, void * pool_data)

create a memory pool

Parameters

int min_nr
the minimum number of elements guaranteed to be allocated for this pool.
mempool_alloc_t * alloc_fn
user-defined element-allocation function.
mempool_free_t * free_fn
user-defined element-freeing function.
void * pool_data
optional private data available to the user-defined functions.

Description

This function creates and allocates a guaranteed-size, preallocated memory pool. The pool can be used from the mempool_alloc() and mempool_free() functions. This function might sleep. Both the alloc_fn() and the free_fn() functions might sleep - as long as the mempool_alloc() function is not called from IRQ contexts.
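
A sketch using the stock slab callbacks (my_cache is an assumed kmem_cache created elsewhere):

    mempool_t *pool;

    pool = mempool_create(4, mempool_alloc_slab, mempool_free_slab, my_cache);
    if (!pool)
            return -ENOMEM;

    /* ... use the pool; release it later with mempool_destroy(pool) ... */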

int mempool_resize(mempool_t * pool, int new_min_nr)

resize an existing memory pool

Parameters

mempool_t * pool
pointer to the memory pool which was allocated via mempool_create().
int new_min_nr
the new minimum number of elements guaranteed to be allocated for this pool.

Description

This function shrinks/grows the pool. In the case of growing, it cannot be guaranteed that the pool will be grown to the new size immediately, but new mempool_free() calls will refill it. This function may sleep.

Note, the caller must guarantee that no mempool_destroy is called while this function is running. mempool_alloc() & mempool_free() might be called (eg. from IRQ contexts) while this function executes.

void * mempool_alloc(mempool_t * pool, gfp_t gfp_mask)

allocate an element from a specific memory pool

Parameters

mempool_t * pool
pointer to the memory pool which was allocated via mempool_create().
gfp_t gfp_mask
the usual allocation bitmask.

Description

This function only sleeps if the alloc_fn() function sleeps or returns NULL. Note that due to preallocation, this function never fails when called from process contexts. (It might fail if called from an IRQ context.)

Note

using __GFP_ZERO is not supported.

void mempool_free(void * element, mempool_t * pool)

return an element to the pool.

Parameters

void * element
pool element pointer.
mempool_t * pool
pointer to the memory pool which was allocated via mempool_create().

Description

This function only sleeps if the free_fn() function sleeps.
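
A sketch of the usual allocate/use/return cycle; pool is assumed to come from mempool_create() or mempool_init() above, and struct my_elem is hypothetical:

    struct my_elem *e;

    /* GFP_NOIO: safe to use from the block/IO path; the preallocated
     * reserve means this cannot fail in process context. */
    e = mempool_alloc(pool, GFP_NOIO);

    /* ... use e ... */

    mempool_free(e, pool);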

struct dma_pool * dma_pool_create(const char * name, struct device * dev, size_t size, size_t align, size_t boundary)

Creates a pool of consistent memory blocks, for dma.

Parameters

const char * name
name of pool, for diagnostics
struct device * dev
device that will be doing the DMA
size_t size
size of the blocks in this pool.
size_t align
alignment requirement for blocks; must be a power of two
size_t boundary
returned blocks won’t cross this power of two boundary

Context

!in_interrupt()

Description

Returns a dma allocation pool with the requested characteristics, or NULL if one can’t be created. Given one of these pools, dma_pool_alloc() may be used to allocate memory. Such memory will all have “consistent” DMA mappings, accessible by the device and its driver without using cache flushing primitives. The actual size of blocks allocated may be larger than requested because of alignment.

If boundary is nonzero, objects returned from dma_pool_alloc() won’t cross that size boundary. This is useful for devices which have addressing restrictions on individual DMA transfers, such as not crossing boundaries of 4KBytes.
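
A sketch for a device whose 64-byte descriptors must be 16-byte aligned and must not cross a 4 KiB boundary (the pool name and sizes are assumptions; dev is the struct device doing the DMA):

    struct dma_pool *pool;

    pool = dma_pool_create("my_descs", dev, 64, 16, 4096);
    if (!pool)
            return -ENOMEM;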

void dma_pool_destroy(struct dma_pool * pool)

destroys a pool of dma memory blocks.

Parameters

struct dma_pool * pool
dma pool that will be destroyed

Context

!in_interrupt()

Description

Caller guarantees that no more memory from the pool is in use, and that nothing will try to use the pool after this call.

void * dma_pool_alloc(struct dma_pool * pool, gfp_t mem_flags, dma_addr_t * handle)

get a block of consistent memory

Parameters

struct dma_pool * pool
dma pool that will produce the block
gfp_t mem_flags
GFP_* bitmask
dma_addr_t * handle
pointer to dma address of block

Description

This returns the kernel virtual address of a currently unused block, and reports its dma address through the handle. If such a memory block can’t be allocated, NULL is returned.

void dma_pool_free(struct dma_pool * pool, void * vaddr, dma_addr_t dma)

put block back into dma pool

Parameters

struct dma_pool * pool
the dma pool holding the block
void * vaddr
virtual address of block
dma_addr_t dma
dma address of block

Description

Caller promises neither device nor driver will again touch this block unless it is first re-allocated.
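
A sketch of one allocate/program/free cycle against a pool created with dma_pool_create():

    dma_addr_t dma;
    void *vaddr;

    vaddr = dma_pool_alloc(pool, GFP_KERNEL, &dma);
    if (!vaddr)
            return -ENOMEM;

    /* Hand 'dma' to the device; fill or read the block through 'vaddr'. */

    dma_pool_free(pool, vaddr, dma);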

struct dma_pool * dmam_pool_create(const char * name, struct device * dev, size_t size, size_t align, size_t allocation)

Managed dma_pool_create()

Parameters

const char * name
name of pool, for diagnostics
struct device * dev
device that will be doing the DMA
size_t size
size of the blocks in this pool.
size_t align
alignment requirement for blocks; must be a power of two
size_t allocation
returned blocks won’t cross this boundary (or zero)

Description

Managed dma_pool_create(). DMA pool created with this function is automatically destroyed on driver detach.
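
A sketch of use from a probe() routine; because the pool is device-managed, no explicit destroy is needed on the error or remove paths (my_probe and the sizes are assumptions):

    static int my_probe(struct device *dev)
    {
            struct dma_pool *pool;

            pool = dmam_pool_create("my_descs", dev, 64, 16, 0);
            if (!pool)
                    return -ENOMEM;

            /* ... pool is torn down automatically on driver detach ... */
            return 0;
    }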

void dmam_pool_destroy(struct dma_pool * pool)

Managed dma_pool_destroy()

Parameters

struct dma_pool * pool
dma pool that will be destroyed

Description

Managed dma_pool_destroy().

void balance_dirty_pages_ratelimited(struct address_space * mapping)

balance dirty memory state

Parameters

struct address_space * mapping
address_space which was dirtied

Description

Processes which are dirtying memory should call in here once for each page which was newly dirtied. The function will periodically check the system’s dirty state and will initiate writeback if needed.

On really big machines, get_writeback_state is expensive, so try to avoid calling it too often (ratelimiting). But once we’re over the dirty memory limit we decrease the ratelimiting by a lot, to prevent individual processes from overshooting the limit by (ratelimit_pages) each.
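
A sketch of a buffered-write loop body that dirties one page per iteration; the locking and copy details are elided, and the surrounding helpers are assumptions, but the final call mirrors how generic write paths throttle dirtiers:

    /* ... copy user data into 'page' ... */
    set_page_dirty(page);
    unlock_page(page);
    put_page(page);

    /* Let the VM throttle us if we are producing dirty pages too fast. */
    balance_dirty_pages_ratelimited(mapping);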

void tag_pages_for_writeback(struct address_space * mapping, pgoff_t start, pgoff_t end)

tag pages to be written by write_cache_pages

Parameters

struct address_space * mapping
address space structure to write
pgoff_t start
starting page index
pgoff_t end
ending page index (inclusive)

Description

This function scans the page range from start to end (inclusive) and tags all pages that have DIRTY tag set with a special TOWRITE tag. The idea is that write_cache_pages (or whoever calls this function) will then use TOWRITE tag to identify pages eligible for writeback. This mechanism is used to avoid livelocking of writeback by a process steadily creating new dirty pages in the file (thus it is important for this function to be quick so that it can tag pages faster than a dirtying process can create them).

int write_cache_pages(struct address_space * mapping, struct writeback_control * wbc, writepage_t writepage, void * data)

walk the list of dirty pages of the given address space and write all of them.

Parameters

struct address_space * mapping
address space structure to write
struct writeback_control * wbc
subtract the number of written pages from *wbc->nr_to_write
writepage_t writepage
function called for each page
void * data
data passed to writepage function

Description

If a page is already under I/O, write_cache_pages() skips it, even if it’s dirty. This is desirable behaviour for memory-cleaning writeback, but it is INCORRECT for data-integrity system calls such as fsync(). fsync() and msync() need to guarantee that all the data which was dirty at the time the call was made get new I/O started against them. If wbc->sync_mode is WB_SYNC_ALL then we were called for data integrity and we must wait for existing IO to complete.

To avoid livelocks (when other process dirties new pages), we first tag pages which should be written back with TOWRITE tag and only then start writing them. For data-integrity sync we have to be careful so that we do not miss some pages (e.g., because some other process has cleared TOWRITE tag we set). The rule we follow is that TOWRITE tag can be cleared only by the process clearing the DIRTY tag (and submitting the page for IO).

To avoid deadlocks between range_cyclic writeback and callers that hold pages in PageWriteback to aggregate IO until write_cache_pages() returns, we do not loop back to the start of the file. Doing so causes a page lock/page writeback access order inversion - we should only ever lock multiple pages in ascending page->index order, and looping back to the start of the file violates that rule and causes deadlocks.
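
A sketch of a filesystem ->writepages() built on this helper; my_writepage() is a hypothetical callback with the writepage_t signature, and my_fs_write_one_page() is an assumption:

    static int my_writepage(struct page *page, struct writeback_control *wbc,
                            void *data)
    {
            /* Start I/O on 'page': set PageWriteback, unlock the page, etc. */
            return my_fs_write_one_page(page, wbc, data);   /* assumption */
    }

    static int my_writepages(struct address_space *mapping,
                             struct writeback_control *wbc)
    {
            return write_cache_pages(mapping, wbc, my_writepage, mapping);
    }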

int generic_writepages(struct address_space * mapping, struct writeback_control * wbc)

walk the list of dirty pages of the given address space and writepage() all of them.

Parameters

struct address_space * mapping
address space structure to write
struct writeback_control * wbc
subtract the number of written pages from *wbc->nr_to_write

Description

This is a library function, which implements the writepages() address_space_operation.

int write_one_page(struct page * page)

write out a single page and wait on I/O

Parameters

struct page * page
the page to write

Description

The page must be locked by the caller and will be unlocked upon return.

Note that the mapping’s AS_EIO/AS_ENOSPC flags will be cleared when this function returns.

void wait_for_stable_page(struct page * page)

wait for writeback to finish, if necessary.

Parameters

struct page * page
The page to wait on.

Description

This function determines if the given page is related to a backing device that requires page contents to be held stable during writeback. If so, then it will wait for any pending writeback to complete.
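
A sketch of the typical call site, near the end of a filesystem's ->page_mkwrite() handler, once the page is locked and about to be redirtied through a shared mapping:

    lock_page(page);
    /* ... check the page is still attached to this file, reserve blocks ... */
    wait_for_stable_page(page);
    return VM_FAULT_LOCKED;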

void truncate_inode_pages_range(struct address_space * mapping, loff_t lstart, loff_t lend)

truncate range of pages specified by start & end byte offsets

Parameters

struct address_space * mapping
mapping to truncate
loff_t lstart
offset from which to truncate
loff_t lend
offset to which to truncate (inclusive)

Description

Truncate the page cache, removing the pages that are between specified offsets (and zeroing out partial pages if lstart or lend + 1 is not page aligned).

Truncate takes two passes - the first pass is nonblocking. It will not block on page locks and it will not block on writeback. The second pass will wait. This is to prevent as much IO as possible in the affected region. The first pass will remove most pages, so the search cost of the second pass is low.

We pass down the cache-hot hint to the page freeing code. Even if the mapping is large, it is probably the case that the final pages are the most recently touched, and freeing happens in ascending file offset order.

Note that since ->invalidatepage() accepts a range to invalidate, truncate_inode_pages_range() is able to handle cases where lend + 1 is not page aligned properly.

void truncate_inode_pages(struct address_space * mapping, loff_t lstart)

truncate all the pages from an offset

Parameters

struct address_space * mapping
mapping to truncate
loff_t lstart
offset from which to truncate

Description

Called under (and serialised by) inode->i_mutex.

Note

When this function returns, there can be a page in the process of deletion (inside __delete_from_page_cache()) in the specified range. Thus mapping->nrpages can be non-zero when this function returns even after truncation of the whole mapping.

void truncate_inode_pages_final(struct address_space * mapping)

truncate all pages before inode dies

Parameters

struct address_space * mapping
mapping to truncate

Description

Called under (and serialized by) inode->i_mutex.

Filesystems have to use this in the .evict_inode path to inform the VM that this is the final truncate and the inode is going away.
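
A sketch of an .evict_inode implementation (the filesystem-specific cleanup call is an assumption):

    static void my_evict_inode(struct inode *inode)
    {
            truncate_inode_pages_final(&inode->i_data);
            my_fs_release_inode_resources(inode);   /* assumption */
            clear_inode(inode);
    }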

unsigned long invalidate_mapping_pages(struct address_space * mapping, pgoff_t start, pgoff_t end)

Invalidate all the unlocked pages of one inode

Parameters

struct address_space * mapping
the address_space which holds the pages to invalidate
pgoff_t start
the offset ‘from’ which to invalidate
pgoff_t end
the offset ‘to’ which to invalidate (inclusive)

Description

This function only removes the unlocked pages; if you want to remove all the pages of one inode, you must call truncate_inode_pages().

invalidate_mapping_pages() will not block on IO activity. It will not invalidate pages which are dirty, locked, under writeback or mapped into pagetables.

int invalidate_inode_pages2_range(struct address_space * mapping, pgoff_t start, pgoff_t end)

remove range of pages from an address_space

Parameters

struct address_space * mapping
the address_space
pgoff_t start
the page offset ‘from’ which to invalidate
pgoff_t end
the page offset ‘to’ which to invalidate (inclusive)

Description

Any pages which are found to be mapped into pagetables are unmapped prior to invalidation.

Returns -EBUSY if any pages could not be invalidated.

int invalidate_inode_pages2(struct address_space * mapping)

remove all pages from an address_space

Parameters

struct address_space * mapping
the address_space

Description

Any pages which are found to be mapped into pagetables are unmapped prior to invalidation.

Returns -EBUSY if any pages could not be invalidated.

void truncate_pagecache(struct inode * inode, loff_t newsize)

unmap and remove pagecache that has been truncated

Parameters

struct inode * inode
inode
loff_t newsize
new file size

Description

inode’s new i_size must already be written before truncate_pagecache is called.

This function should typically be called before the filesystem releases resources associated with the freed range (eg. deallocates blocks). This way, pagecache will always stay logically coherent with on-disk format, and the filesystem would not have to deal with situations such as writepage being called for a page that has already had its underlying blocks deallocated.

void truncate_setsize(struct inode * inode, loff_t newsize)

update inode and pagecache for a new file size

Parameters

struct inode * inode
inode
loff_t newsize
new file size

Description

truncate_setsize() updates i_size and performs pagecache truncation (if necessary) to newsize. It will typically be called from the filesystem’s setattr function when ATTR_SIZE is passed in.

Must be called with a lock serializing truncates and writes (generally i_mutex but e.g. xfs uses a different lock) and before all filesystem specific block truncation has been performed.
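
A sketch of the size-shrinking branch of a ->setattr() handler; my_fs_free_blocks_beyond() is an assumption, and permission checks plus the extending case are elided:

    if ((attr->ia_valid & ATTR_SIZE) && attr->ia_size < i_size_read(inode)) {
            truncate_setsize(inode, attr->ia_size);
            my_fs_free_blocks_beyond(inode, attr->ia_size);   /* assumption */
    }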

void pagecache_isize_extended(struct inode * inode, loff_t from, loff_t to)

update pagecache after extension of i_size

Parameters

struct inode * inode
inode for which i_size was extended
loff_t from
original inode size
loff_t to
new inode size

Description

Handle extension of inode size either caused by extending truncate or by a write starting after the current i_size. We mark the page straddling the current i_size RO so that page_mkwrite() is called on the nearest write access to the page. This way the filesystem can be sure that page_mkwrite() is called on the page before a user writes to the page via mmap after the i_size has been changed.

The function must be called after i_size is updated so that a page fault coming after we unlock the page will already see the new i_size. The function must be called while we still hold i_mutex - this not only makes sure i_size is stable but also that userspace cannot observe the new i_size value before we are prepared to store mmap writes at the new inode size.

void truncate_pagecache_range(struct inode * inode, loff_t lstart, loff_t lend)

unmap and remove pagecache that is hole-punched

Parameters

struct inode * inode
inode
loff_t lstart
offset of beginning of hole
loff_t lend
offset of last byte of hole

Description

This function should typically be called before the filesystem releases resources associated with the freed range (eg. deallocates blocks). This way, pagecache will always stay logically coherent with on-disk format, and the filesystem would not have to deal with situations such as writepage being called for a page that has already had its underlying blocks deallocated.
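
A sketch of a hole-punch path; offset/len are the caller-supplied hole bounds and the block-deallocation helper is an assumption:

    /* Drop the cached pages over the hole first ... */
    truncate_pagecache_range(inode, offset, offset + len - 1);

    /* ... then let the filesystem free the underlying blocks. */
    my_fs_deallocate_blocks(inode, offset, len);   /* assumption */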