Memory Management APIs¶

User Space Memory Access¶

get_user¶

get_user (x, ptr)

Get a simple variable from user space.

Parameters

x: Variable to store result.
ptr: Source address, in user space.

Context

User context only. This function may sleep if pagefaults are enabled.

Description

This macro copies a single simple variable from user space to kernel space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and the result of dereferencing ptr must be assignable to x without a cast.

Return

zero on success, or -EFAULT on error. On error, the variable x is set to zero.
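The return convention above (zero on success, -EFAULT with x set to zero on failure) can be illustrated with a small user-space model. This is only a sketch, not kernel code: struct model_uspace, model_access_ok() and model_get_user_int() are hypothetical names, and a single valid byte range stands in for the real address-space and page-fault handling.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a user address space: one valid byte range. */
struct model_uspace {
	unsigned char *base;
	size_t len;
};

/* Model of a range check: is [ptr, ptr + size) inside the valid range? */
int model_access_ok(const struct model_uspace *us, const void *ptr, size_t size)
{
	uintptr_t start = (uintptr_t)us->base;
	uintptr_t p = (uintptr_t)ptr;

	if (p < start || size > us->len)
		return 0;
	return p - start <= us->len - size;
}

/* Model of get_user() for an int: 0 on success, -EFAULT (and *x = 0) on error. */
int model_get_user_int(const struct model_uspace *us, int *x, const int *uptr)
{
	if (!model_access_ok(us, uptr, sizeof(*x))) {
		*x = 0;
		return -EFAULT;
	}
	memcpy(x, uptr, sizeof(*x));
	return 0;
}
```

A caller of the real macro follows the same shape: check the return value, and do not rely on x holding anything but zero when the call fails.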

__get_user¶

__get_user (x, ptr)

Get a simple variable from user space, with less checking.

Parameters

x: Variable to store result.
ptr: Source address, in user space.

Context

User context only. This function may sleep if pagefaults are enabled.

Description

This macro copies a single simple variable from user space to kernel space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and the result of dereferencing ptr must be assignable to x without a cast.

Caller must check the pointer with access_ok() before calling this function.

Return

zero on success, or -EFAULT on error. On error, the variable x is set to zero.

put_user¶

put_user (x, ptr)

Write a simple value into user space.

Parameters

x: Value to copy to user space.
ptr: Destination address, in user space.

Context

User context only. This function may sleep if pagefaults are enabled.

Description

This macro copies a single simple value from kernel space to user space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and x must be assignable to the result of dereferencing ptr.

Return

zero on success, or -EFAULT on error.

__put_user¶

__put_user (x, ptr)

Write a simple value into user space, with less checking.

Parameters

x: Value to copy to user space.
ptr: Destination address, in user space.

Context

User context only. This function may sleep if pagefaults are enabled.

Description

This macro copies a single simple value from kernel space to user space. It supports simple types like char and int, but not larger data types like structures or arrays.

ptr must have pointer-to-simple-variable type, and x must be assignable to the result of dereferencing ptr.

Caller must check the pointer with access_ok() before calling this function.

Return

zero on success, or -EFAULT on error.

unsigned long clear_user(void __user *to, unsigned long n)¶: Zero a block of memory in user space.

Parameters

void __user *to: Destination address, in user space.
unsigned long n: Number of bytes to zero.

Description

Zero a block of memory in user space.

Return

number of bytes that could not be cleared. On success, this will be zero.
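The "bytes not cleared" return value is easy to misread as a byte count of work done, so here is a user-space sketch of the convention. It is a model under stated assumptions, not the kernel implementation: struct model_umap and model_clear_user() are hypothetical, and running off the end of the writable range plays the role of a fault on an unmapped page.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a user mapping: one writable byte range. */
struct model_umap {
	unsigned char *base;
	size_t len;
};

/*
 * Model of clear_user(): zero up to n bytes at 'to' and return how many
 * bytes could NOT be cleared (0 means complete success).
 */
unsigned long model_clear_user(const struct model_umap *um, void *to,
			       unsigned long n)
{
	uintptr_t start = (uintptr_t)um->base;
	uintptr_t p = (uintptr_t)to;
	size_t avail, done;

	if (p < start || p - start >= um->len)
		return n;			/* nothing writable at 'to' */

	avail = um->len - (p - start);		/* writable bytes from 'to' on */
	done = n < avail ? n : avail;
	memset(to, 0, done);
	return n - done;
}
```

Callers typically treat any non-zero result as a failure and map it to -EFAULT.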

unsigned long __clear_user(void __user *to, unsigned long n)¶: Zero a block of memory in user space, with less checking.

Parameters

void __user *to: Destination address, in user space.
unsigned long n: Number of bytes to zero.

Description

Zero a block of memory in user space. Caller must check the specified block with access_ok() before calling this function.

Return

number of bytes that could not be cleared. On success, this will be zero.

int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages)¶: pin user pages in memory

Parameters

unsigned long start: starting user address
int nr_pages: number of pages from start to pin
unsigned int gup_flags: flags modifying pin behaviour
struct page **pages: array that receives pointers to the pages pinned. Should be at least nr_pages long.

Description

Attempt to pin user pages in memory without taking mm->mmap_lock. If not successful, it will fall back to taking the lock and calling get_user_pages().

Returns number of pages pinned. This may be fewer than the number requested. If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns -errno.
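Because pages must hold at least nr_pages entries, callers usually need to turn a byte range into a page count first. The helper below is a minimal sketch of that arithmetic; model_pages_spanned() and MODEL_PAGE_SIZE are hypothetical names, and 4 KiB pages are an assumption.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Number of pages touched by the byte range [start, start + len). */
unsigned long model_pages_spanned(unsigned long start, unsigned long len)
{
	unsigned long offset = start & (MODEL_PAGE_SIZE - 1);

	if (len == 0)
		return 0;
	return (offset + len + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}
```

Since the call may pin fewer pages than requested, a caller has to compare the return value with the count computed this way and release any partially pinned pages on failure.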

Memory Allocation Controls¶

Page mobility and placement hints¶

These flags provide hints about how mobile the page is. Pages with similar mobility are placed within the same pageblocks to minimise problems due to external fragmentation.

__GFP_MOVABLE (also a zone modifier) indicates that the page can be moved by page migration during memory compaction or can be reclaimed.

__GFP_RECLAIMABLE is used for slab allocations that specify SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.

__GFP_WRITE indicates the caller intends to dirty the page. Where possible, these pages will be spread between local zones to avoid all the dirty pages being in one zone (fair zone allocation policy).

__GFP_HARDWALL enforces the cpuset memory allocation policy.

__GFP_THISNODE forces the allocation to be satisfied from the requested node with no fallbacks or placement policy enforcements.

__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.

__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.

Watermark modifiers -- controls access to emergency reserves¶

__GFP_HIGH indicates that the caller is high-priority and that granting the request is necessary before the system can make forward progress. For example creating an IO context to clean pages and requests from atomic context.

__GFP_MEMALLOC allows access to all memory. This should only be used when the caller guarantees the allocation will allow more memory to be freed very shortly e.g. process exiting or swapping. Users either should be the MM or co-ordinating closely with the VM (e.g. swap over NFS). Users of this flag have to be extremely careful to not deplete the reserve completely and implement a throttling mechanism which controls the consumption of the reserve based on the amount of freed memory. Usage of a pre-allocated pool (e.g. mempool) should be always considered before using this flag.

__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves. This takes precedence over the __GFP_MEMALLOC flag if both are set.

Reclaim modifiers¶

Please note that all the following flags are only applicable to sleepable allocations (e.g. GFP_NOWAIT and GFP_ATOMIC will ignore them).

__GFP_IO can start physical IO.

__GFP_FS can call down to the low-level FS. Clearing the flag avoids the allocator recursing into the filesystem which might already be holding locks.

__GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim. This flag can be cleared to avoid unnecessary delays when a fallback option is available.

__GFP_KSWAPD_RECLAIM indicates that the caller wants to wake kswapd when the low watermark is reached and have it reclaim pages until the high watermark is reached. A caller may wish to clear this flag when fallback options are available and the reclaim is likely to disrupt the system. The canonical example is THP allocation where a fallback is cheap but reclaim/compaction may cause indirect stalls.

__GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.

The default allocator behavior depends on the request size. Allocations with order > PAGE_ALLOC_COSTLY_ORDER are called costly. Non-costly allocations are considered too essential to fail, so they are implicitly non-failing by default (with some exceptions: for example, OOM victims might fail, so the caller still has to check for failure). Costly requests try not to be disruptive and back off even without invoking the OOM killer. The following three modifiers can be used to override some of these implicit rules. Please note that all of them must be used along with the __GFP_DIRECT_RECLAIM flag.

__GFP_NORETRY: The VM implementation will try only very lightweight memory direct reclaim to get some memory under memory pressure (thus it can sleep). It will avoid disruptive actions like OOM killer. The caller must handle the failure which is quite likely to happen under heavy memory pressure. The flag is suitable when failure can easily be handled at small cost, such as reduced throughput.

__GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim procedures that have previously failed if there is some indication that progress has been made elsewhere. It can wait for other tasks to attempt high-level approaches to freeing memory such as compaction (which removes fragmentation) and page-out. There is still a definite limit to the number of retries, but it is a larger limit than with __GFP_NORETRY. Allocations with this flag may fail, but only when there is genuinely little unused memory. While these allocations do not directly trigger the OOM killer, their failure indicates that the system is likely to need to use the OOM killer soon. The caller must handle failure, but can reasonably do so by failing a higher-level request, or completing it only in a much less efficient manner. If the allocation does fail, and the caller is in a position to free some non-essential memory, doing so could benefit the system as a whole.

__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller cannot handle allocation failures. The allocation could block indefinitely but will never return with failure. Testing for failure is pointless. It _must_ be blockable and used together with __GFP_DIRECT_RECLAIM. It should _never_ be used in non-sleepable contexts. New users should be evaluated carefully (and the flag should be used only when there is no reasonable failure policy), but it is definitely preferable to use the flag rather than open-code an endless loop around the allocator. Allocating pages from the buddy allocator with __GFP_NOFAIL and order > 1 is not supported. Please consider using kvmalloc() instead.

Useful GFP flag combinations¶

Useful GFP flag combinations that are commonly used. It is recommended that subsystems start with one of these combinations and then set/clear __GFP_FOO flags as necessary.

GFP_ATOMIC users cannot sleep and need the allocation to succeed. A lower watermark is applied to allow access to “atomic reserves”. The current implementation doesn’t support NMI and a few other strict non-preemptive contexts (e.g. raw_spin_lock). The same applies to GFP_NOWAIT.

GFP_KERNEL is typical for kernel-internal allocations. The caller requires ZONE_NORMAL or a lower zone for direct access but can direct reclaim.

GFP_KERNEL_ACCOUNT is the same as GFP_KERNEL, except the allocation is accounted to kmemcg.

GFP_NOWAIT is for kernel allocations that should not stall for direct reclaim, start physical IO or use any filesystem callback. It is very likely to fail to allocate memory, even for very small allocations.

GFP_NOIO will use direct reclaim to discard clean pages or slab pages that do not require the starting of any physical IO. Please try to avoid using this flag directly and instead use memalloc_noio_{save,restore} to mark the whole scope which cannot perform any IO with a short explanation why. All allocation requests will inherit GFP_NOIO implicitly.

GFP_NOFS will use direct reclaim but will not use any filesystem interfaces. Please try to avoid using this flag directly and instead use memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn’t recurse into the FS layer with a short explanation why. All allocation requests will inherit GFP_NOFS implicitly.

GFP_USER is for userspace allocations that also need to be directly accessible by the kernel or hardware. It is typically used by hardware for buffers that are mapped to userspace (e.g. graphics) that hardware still must DMA to. cpuset limits are enforced for these allocations.

GFP_DMA exists for historical reasons and should be avoided where possible. The flag indicates that the caller requires that the lowest zone be used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but it would require careful auditing as some users really require it and others use the flag to avoid lowmem reserves in ZONE_DMA and treat the lowest zone as a type of emergency reserve.

GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit address. Note that kmalloc(..., GFP_DMA32) does not return DMA32 memory because the DMA32 kmalloc cache array is not implemented. (Reason: there is no such user in kernel).

GFP_HIGHUSER is for userspace allocations that may be mapped to userspace, do not need to be directly accessible by the kernel but that cannot move once in use. An example may be a hardware allocation that maps data directly into userspace but has no addressing limitations.

GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not need direct access to but can use kmap() when access is required. They are expected to be movable via page reclaim or page migration. Typically, pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.

GFP_TRANSHUGE and GFP_TRANSHUGE_LIGHT are used for THP allocations. They are compound allocations that will generally fail quickly if memory is not available and will not wake kswapd/kcompactd on failure. The _LIGHT version does not attempt reclaim/compaction at all and is by default used in page fault path, while the non-light is used by khugepaged.
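The advice to start from a combination and then set or clear individual __GFP_FOO bits can be sketched with a toy bitmask model. Everything prefixed with M_ or model_ is hypothetical: the bit values are made up, and the compositions are simplified (they leave out flags such as __GFP_NOWARN and the zone and cpuset modifiers), so include/linux/gfp_types.h remains the reference for the real definitions.

```c
#include <stdbool.h>

/* Made-up bit positions; the real values live in include/linux/gfp_types.h. */
#define M_GFP_HIGH		0x01u
#define M_GFP_IO		0x02u
#define M_GFP_FS		0x04u
#define M_GFP_DIRECT_RECLAIM	0x08u
#define M_GFP_KSWAPD_RECLAIM	0x10u
#define M_GFP_RECLAIM	(M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM)

/* Simplified compositions of the common combinations (assumption). */
#define M_GFP_ATOMIC	(M_GFP_HIGH | M_GFP_KSWAPD_RECLAIM)
#define M_GFP_KERNEL	(M_GFP_RECLAIM | M_GFP_IO | M_GFP_FS)
#define M_GFP_NOWAIT	(M_GFP_KSWAPD_RECLAIM)
#define M_GFP_NOIO	(M_GFP_RECLAIM)
#define M_GFP_NOFS	(M_GFP_RECLAIM | M_GFP_IO)

/* An allocation may sleep only if it is allowed to enter direct reclaim. */
bool model_allow_blocking(unsigned int flags)
{
	return (flags & M_GFP_DIRECT_RECLAIM) != 0;
}
```

In this model, clearing the FS bit from M_GFP_KERNEL yields M_GFP_NOFS, which mirrors how the text describes GFP_NOFS as direct reclaim without filesystem callbacks.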

The Slab Cache¶

SLAB_HWCACHE_ALIGN¶

SLAB_HWCACHE_ALIGN

Align objects on cache line boundaries.

Description

Sufficiently large objects are aligned on cache line boundary. For object size smaller than a half of cache line size, the alignment is on the half of cache line size. In general, if object size is smaller than 1/2^n of cache line size, the alignment is adjusted to 1/2^n.

If explicit alignment is also requested by the respective struct kmem_cache_args field, the greater of both alignments is applied.
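The halving rule can be written down as a short loop. The function below is a sketch of the rule as described here, not the allocator's code: model_hwcache_align() is a hypothetical name, the treatment of an object that is exactly half of the current alignment (it is counted as fitting) is an assumption, and architecture minimum alignment is ignored.

```c
/*
 * Model of the SLAB_HWCACHE_ALIGN rule: start from the cache line size and
 * halve it while the object still fits in half of the current alignment.
 * An explicit alignment request wins if it is larger.
 */
unsigned int model_hwcache_align(unsigned int size, unsigned int cache_line,
				 unsigned int explicit_align)
{
	unsigned int align = cache_line;

	while (align > 1 && size <= align / 2)
		align /= 2;
	return explicit_align > align ? explicit_align : align;
}
```

With a 64-byte cache line, a 20-byte object ends up 32-byte aligned and a 100-byte object 64-byte aligned under this model.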

SLAB_TYPESAFE_BY_RCU¶

SLAB_TYPESAFE_BY_RCU

WARNING READ THIS!

Description

This delays freeing the SLAB page by a grace period; it does _NOT_ delay object freeing. This means that if you do kmem_cache_free(), that memory location is free to be reused at any time. Thus it may be possible to see another object there in the same RCU grace period.

This feature only ensures that the memory location backing the object stays valid. The trick to using it is to rely on an independent object validation pass, something like:
begin:
 rcu_read_lock();
 obj = lockless_lookup(key);
 if (obj) {
   if (!try_get_ref(obj)) { // might fail for free objects
     rcu_read_unlock();
     goto begin;
   }

   if (obj->key != key) { // not the object we expected
     put_ref(obj);
     rcu_read_unlock();
     goto begin;
   }
 }
 rcu_read_unlock();
This is useful if we need to approach a kernel structure obliquely, from its address obtained without the usual locking. We can lock the structure to stabilize it and check it’s still at the given address, only if we can be sure that the memory has not been meanwhile reused for some other kind of object (which our subsystem’s lock might corrupt).

rcu_read_lock before reading the address, then rcu_read_unlock after taking the spinlock within the structure expected at that address.

Note that object identity check has to be done after acquiring a reference, therefore user has to ensure proper ordering for loads. Similarly, when initializing objects allocated with SLAB_TYPESAFE_BY_RCU, the newly allocated object has to be fully initialized before its refcount gets initialized and proper ordering for stores is required. refcount_{add|inc}_not_zero_acquire() and refcount_set_release() are designed with the proper fences required for reference counting objects allocated with SLAB_TYPESAFE_BY_RCU.
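The "acquire a reference, then check identity" ordering can be sketched in user space with C11 atomics. This is a model, not the kernel's refcount API: struct model_obj, model_ref_inc_not_zero_acquire(), model_ref_put() and model_validate() are hypothetical, and a C11 compare-exchange with acquire ordering stands in for refcount_inc_not_zero_acquire().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_obj {
	atomic_uint refcount;	/* 0 means the object is free */
	int key;
};

/* Take a reference unless the count is zero; acquire ordering on success. */
bool model_ref_inc_not_zero_acquire(struct model_obj *obj)
{
	unsigned int old = atomic_load_explicit(&obj->refcount,
						memory_order_relaxed);

	do {
		if (old == 0)
			return false;	/* object is free, do not revive it */
	} while (!atomic_compare_exchange_weak_explicit(&obj->refcount, &old,
							old + 1,
							memory_order_acquire,
							memory_order_relaxed));
	return true;
}

void model_ref_put(struct model_obj *obj)
{
	atomic_fetch_sub_explicit(&obj->refcount, 1, memory_order_release);
}

/* Validation pass: reference first, identity check second. */
struct model_obj *model_validate(struct model_obj *obj, int key)
{
	if (!obj || !model_ref_inc_not_zero_acquire(obj))
		return NULL;
	if (obj->key != key) {		/* slot was reused for another object */
		model_ref_put(obj);
		return NULL;
	}
	return obj;
}
```

The acquire ordering on the successful increment is what makes the later read of key observe the initialization that preceded the matching release store of the refcount.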

Note that it is not possible to acquire a lock within a structure allocated with SLAB_TYPESAFE_BY_RCU without first acquiring a reference as described above. The reason is that SLAB_TYPESAFE_BY_RCU pages are not zeroed before being given to the slab, which means that any locks must be initialized after each and every kmem_cache_alloc(). Alternatively, make the ctor passed to kmem_cache_create() initialize the locks at page-allocation time, as is done in __i915_request_ctor(), sighand_ctor(), and anon_vma_ctor(). Such a ctor permits readers to safely acquire those ctor-initialized locks under rcu_read_lock() protection.

Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.

SLAB_ACCOUNT¶

SLAB_ACCOUNT

Account allocations to memcg.

Description

All object allocations from this cache will be memcg accounted, regardless of __GFP_ACCOUNT being or not being passed to individual allocations.

SLAB_RECLAIM_ACCOUNT¶

SLAB_RECLAIM_ACCOUNT

Objects are reclaimable.

Description

Use this flag for caches that have an associated shrinker. As a result, slab pages are allocated with __GFP_RECLAIMABLE, which affects grouping pages by mobility, and are accounted in the SReclaimable counter in /proc/meminfo.

struct kmem_cache_args¶: Less common arguments for kmem_cache_create()

Definition:

struct kmem_cache_args {
    unsigned int align;
    unsigned int useroffset;
    unsigned int usersize;
    unsigned int freeptr_offset;
    bool use_freeptr_offset;
    void (*ctor)(void *);
    unsigned int sheaf_capacity;
};

Members

align

The required alignment for the objects.

0 means no specific alignment is requested.

useroffset

Usercopy region offset.

0 is a valid offset, when usersize is non-0

usersize

Usercopy region size.

0 means no usercopy region is specified.

freeptr_offset

Custom offset for the free pointer in SLAB_TYPESAFE_BY_RCU caches

By default SLAB_TYPESAFE_BY_RCU caches place the free pointer outside of the object. This might cause the object to grow in size. Cache creators that have a reason to avoid this can specify a custom free pointer offset in their struct where the free pointer will be placed.

Note that placing the free pointer inside the object requires the caller to ensure that no fields are invalidated that are required to guard against object recycling (See SLAB_TYPESAFE_BY_RCU for details).

Using 0 as a value for freeptr_offset is valid. If freeptr_offset is specified, use_freeptr_offset must be set true.

Note that ctor currently isn’t supported with custom free pointers as a ctor requires an external free pointer.

use_freeptr_offset

Whether a freeptr_offset is used.

ctor

A constructor for the objects.

The constructor is invoked for each object in a newly allocated slab page. It is the cache user’s responsibility to free objects in the same state as after calling the constructor, or to deal appropriately with any differences between a freshly constructed and a reallocated object.

NULL means no constructor.

sheaf_capacity

Enable sheaves of given capacity for the cache.

With a non-zero value, allocations from the cache go through caching arrays called sheaves. Each cpu has a main sheaf that’s always present, and a spare sheaf that may be not present. When both become empty, there’s an attempt to replace an empty sheaf with a full sheaf from the per-node barn.

When no full sheaf is available, and gfp flags allow blocking, a sheaf is allocated and filled from slab(s) using bulk allocation. Otherwise the allocation falls back to the normal operation allocating a single object from a slab.

Analogously, when an object is freed and both per-CPU sheaves are full, the barn may replace a full sheaf with an empty one, unless the barn is over capacity. In that case a sheaf is bulk freed to slab pages.

The sheaves do not enforce NUMA placement of objects, so allocations via kmem_cache_alloc_node() with a node specified other than NUMA_NO_NODE will bypass them.

Bulk allocation and free operations also try to use the cpu sheaves and barn, but fall back to using slab pages directly.

When slub_debug is enabled for the cache, the sheaf_capacity argument is ignored.

0 means no sheaves will be created.

Description

Any uninitialized fields of the structure are interpreted as unused. The exception is freeptr_offset where 0 is a valid value, so use_freeptr_offset must be also set to true in order to interpret the field as used. For useroffset 0 is also valid, but only with non-0 usersize.

When NULL args is passed to kmem_cache_create(), it is equivalent to all fields unused.
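The "zero means unused, except where zero is a valid value" rules can be made concrete with a user-space copy of the structure. This is only a sketch: struct model_cache_args mirrors the fields listed above, and the two predicate helpers are hypothetical, not part of the slab API.

```c
#include <stdbool.h>
#include <stddef.h>

/* User-space copy of the fields shown in the definition above. */
struct model_cache_args {
	unsigned int align;
	unsigned int useroffset;
	unsigned int usersize;
	unsigned int freeptr_offset;
	bool use_freeptr_offset;
	void (*ctor)(void *);
	unsigned int sheaf_capacity;
};

/* A usercopy region exists only when usersize is non-zero. */
bool model_has_usercopy(const struct model_cache_args *args)
{
	return args && args->usersize != 0;
}

/* freeptr_offset counts only when use_freeptr_offset is set, as 0 is valid. */
bool model_has_custom_freeptr(const struct model_cache_args *args)
{
	return args && args->use_freeptr_offset;
}
```

With designated initializers, any field that is not named is zero-initialized and therefore reads as unused, which is why `{ .align = 64 }` requests only an alignment and nothing else.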

struct kmem_cache *kmem_cache_create_usercopy(const char *name, unsigned int size, unsigned int align, slab_flags_t flags, unsigned int useroffset, unsigned int usersize, void (*ctor)(void*))¶: Create a kmem cache with a region suitable for copying to userspace.

Parameters

const char *name: A string which is used in /proc/slabinfo to identify this cache.
unsigned int size: The size of objects to be created in this cache.
unsigned int align: The required alignment for the objects.
slab_flags_t flags: SLAB flags
unsigned int useroffset: Usercopy region offset
unsigned int usersize: Usercopy region size
void (*ctor)(void *): A constructor for the objects, or NULL.

Description

This is a legacy wrapper. New code should use either KMEM_CACHE_USERCOPY(), if whitelisting a single field is sufficient, or kmem_cache_create() with the necessary parameters passed via the args parameter (see struct kmem_cache_args).

Return

a pointer to the cache on success, NULL on failure.

kmem_cache_create¶

kmem_cache_create (__name, __object_size, __args, ...)

Create a kmem cache.

Parameters

__name: A string which is used in /proc/slabinfo to identify this cache.
__object_size: The size of objects to be created in this cache.
__args: Optional arguments, see struct kmem_cache_args. Passing NULL means defaults will be used for all the arguments.
...: variable arguments

Description

This is currently implemented as a macro using _Generic() to call either the new variant of the function, or a legacy one.

The new variant has 4 parameters: kmem_cache_create(name, object_size, args, flags)

See __kmem_cache_create_args() which implements this.

The legacy variant has 5 parameters: kmem_cache_create(name, object_size, align, flags, ctor)

The align and ctor parameters map to the respective fields of struct kmem_cache_args

Context

Cannot be called within an interrupt, but can be interrupted.

Return

a pointer to the cache on success, NULL on failure.

size_t ksize(const void *objp)¶: Report actual allocation size of associated object

Parameters

const void *objp: Pointer returned from a prior kmalloc()-family allocation.

Description

This should not be used for writing beyond the originally requested allocation size. Either use krealloc() or round up the allocation size with kmalloc_size_roundup() prior to allocation. If this is used to access beyond the originally requested allocation size, UBSAN_BOUNDS and/or FORTIFY_SOURCE may trip, since they only know about the originally allocated size via the __alloc_size attribute.

void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)¶: Allocate an object

Parameters

struct kmem_cache *cachep: The cache to allocate from.
gfp_t flags: See kmalloc().

Description

Allocate an object from this cache. See kmem_cache_zalloc() for a shortcut of adding __GFP_ZERO to flags.

Return

pointer to the new object or NULL in case of error

bool kmem_cache_charge(void *objp, gfp_t gfpflags)¶: memcg charge an already allocated slab memory

Parameters

void *objp: address of the slab object to memcg charge
gfp_t gfpflags: describe the allocation context

Description

kmem_cache_charge allows charging a slab object to the current memcg, primarily in cases where charging at allocation time might not be possible because the target memcg is not known (e.g. softirq context).

The objp should be pointer returned by the slab allocator functions like kmalloc (with __GFP_ACCOUNT in flags) or kmem_cache_alloc. The memcg charge behavior can be controlled through gfpflags parameter, which affects how the necessary internal metadata can be allocated. Including __GFP_NOFAIL denotes that overcharging is requested instead of failure, but is not applied for the internal metadata allocation.

There are several cases where it will return true even if the charging was not done. More specifically:

For !CONFIG_MEMCG or cgroup_disable=memory systems.
Already charged slab objects.
For slab objects from KMALLOC_NORMAL caches - allocated by kmalloc() without __GFP_ACCOUNT
Allocating internal metadata has failed

Return

true if charge was successful otherwise false.

void *kmalloc(size_t size, gfp_t flags)¶: allocate kernel memory

Parameters

size_t size: how many bytes of memory are required.
gfp_t flags: describe the allocation context

Description

kmalloc is the normal method of allocating memory for objects smaller than page size in the kernel.

The allocated object address is aligned to at least ARCH_KMALLOC_MINALIGN bytes. For sizes that are a power of two, the alignment is also guaranteed to be at least the size. For other sizes, the alignment is guaranteed to be at least the largest power-of-two divisor of size.
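The "largest power-of-two divisor" is simply the lowest set bit of the size. The helper below sketches the guarantee as stated in this paragraph; model_kmalloc_min_align() is a hypothetical name, and the architecture minimum is passed in as a parameter rather than taken from ARCH_KMALLOC_MINALIGN.

```c
#include <stddef.h>

/*
 * Model of the documented guarantee: at least 'arch_minalign', and at least
 * the largest power of two that divides 'size'.
 */
size_t model_kmalloc_min_align(size_t size, size_t arch_minalign)
{
	size_t pow2_divisor = size & (~size + 1);	/* lowest set bit */

	return pow2_divisor > arch_minalign ? pow2_divisor : arch_minalign;
}
```

For example, with an assumed minimum of 8 bytes, a 96-byte request is guaranteed 32-byte alignment under this rule, while a 24-byte request is only guaranteed 8.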

The flags argument may be one of the GFP flags defined at include/linux/gfp_types.h and described at Documentation/core-api/mm-api.rst

The recommended usage of the flags is described at Documentation/core-api/memory-allocation.rst

Below is a brief outline of the most useful GFP flags

GFP_KERNEL: Allocate normal kernel ram. May sleep.
GFP_NOWAIT: Allocation will not sleep.
GFP_ATOMIC: Allocation will not sleep. May use emergency pools.

Also it is possible to set different flags by OR’ing in one or more of the following additional flags:

__GFP_ZERO: Zero the allocated memory before returning. Also see kzalloc().
__GFP_HIGH: This allocation has high priority and may use emergency pools.
__GFP_NOFAIL: Indicate that this allocation is in no way allowed to fail (think twice before using).
__GFP_NORETRY: If memory is not immediately available, then give up at once.
__GFP_NOWARN: If allocation fails, don’t issue any warnings.
__GFP_RETRY_MAYFAIL: Try really hard to succeed the allocation but fail eventually.

void *kmalloc_array(size_t n, size_t size, gfp_t flags)¶: allocate memory for an array.

Parameters

size_t n: number of elements.
size_t size: element size.
gfp_t flags: the type of memory to allocate (see kmalloc).
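The main reason to prefer an array allocator over a hand-written n * size multiplication is overflow checking: the array helpers fail the allocation instead of silently allocating a too-small buffer. The sketch below shows that idea in user space; model_alloc_array() is a hypothetical name and malloc() stands in for the kernel allocator.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * User-space model: refuse the request when n * size would overflow
 * size_t, otherwise allocate n * size bytes with malloc().
 */
void *model_alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;		/* product does not fit in size_t */
	return malloc(n * size);
}
```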

void *krealloc_array(void *p, size_t new_n, size_t new_size, gfp_t flags)¶: reallocate memory for an array.

Parameters

void *p: pointer to the memory chunk to reallocate
size_t new_n: new number of elements to alloc
size_t new_size: new size of a single member of the array
gfp_t flags: the type of memory to allocate (see kmalloc)

Description

If __GFP_ZERO logic is requested, callers must ensure that, starting with the initial memory allocation, every subsequent call to this API for the same memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully honored by this API.

See krealloc_noprof() for further details.

In any case, the contents of the object pointed to are preserved up to the lesser of the new and old sizes.

kcalloc¶

kcalloc (n, size, flags)

allocate memory for an array. The memory is set to zero.

Parameters

n: number of elements.
size: element size.
flags: the type of memory to allocate (see kmalloc).

void *kzalloc(size_t size, gfp_t flags)¶: allocate memory. The memory is set to zero.

Parameters

size_t size: how many bytes of memory are required.
gfp_t flags: the type of memory to allocate (see kmalloc).

size_t kmalloc_size_roundup(size_t size)¶: Report allocation bucket size for the given size

Parameters

size_t size: Number of bytes to round up from.

Description

This returns the number of bytes that would be available in a kmalloc() allocation of size bytes. For example, a 126 byte request would be rounded up to the next sized kmalloc bucket, 128 bytes. (This is strictly for the general-purpose kmalloc()-based allocations, and is not for the pre-sized kmem_cache_alloc()-based allocations.)

Use this to kmalloc() the full bucket size ahead of time instead of using ksize() to query the size after an allocation.
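The rounding can be pictured as a lookup in a table of bucket sizes. The table below is an assumption made for illustration (the real set of kmalloc buckets depends on architecture and configuration), and model_size_roundup() is a hypothetical helper that returns 0 for sizes beyond its table instead of modelling large allocations.

```c
#include <stddef.h>

/* Assumed bucket sizes; the real set depends on architecture and config. */
static const size_t model_buckets[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Return the smallest assumed bucket that holds 'size' bytes, or 0 if none. */
size_t model_size_roundup(size_t size)
{
	size_t i;

	for (i = 0; i < sizeof(model_buckets) / sizeof(model_buckets[0]); i++)
		if (size <= model_buckets[i])
			return model_buckets[i];
	return 0;
}
```

Under this table a 126-byte request maps to the 128-byte bucket, matching the example in the text.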

void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)¶: Allocate an object on the specified node

Parameters

struct kmem_cache *s: The cache to allocate from.
gfp_t gfpflags: See kmalloc().
int node: node number of the target node.

Description

Identical to kmem_cache_alloc but it will allocate memory on the given node, which can improve the performance for cpu bound structures.

Fallback to other node is possible if __GFP_THISNODE is not set.

Return

pointer to the new object or NULL in case of error

void *kmalloc_nolock(size_t size, gfp_t gfp_flags, int node)¶: Allocate an object of given size from any context.

Parameters

size_t size: size to allocate
gfp_t gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO, __GFP_NO_OBJ_EXT allowed.
int node: node number of the target node.

Return

pointer to the new object or NULL in case of error. NULL does not mean EBUSY or EAGAIN. It means ENOMEM. There is no reason to call it again and expect !NULL.

void kmem_cache_free(struct kmem_cache *s, void *x)¶: Deallocate an object

Parameters

struct kmem_cache *s: The cache the allocation was from.
void *x: The previously allocated object.

Description

Free an object which was previously allocated from this cache.

void kfree(const void *object)¶: free previously allocated memory

Parameters

const void *object: pointer returned by kmalloc() or kmem_cache_alloc()

Description

If object is NULL, no operation is performed.

void *krealloc_node_align(const void *p, size_t new_size, unsigned long align, gfp_t flags, int nid)¶: reallocate memory. The contents will remain unchanged.

Parameters

const void *p: object to reallocate memory for.
size_t new_size: how many bytes of memory are required.
unsigned long align: desired alignment.
gfp_t flags: the type of memory to allocate.
int nid: NUMA node or NUMA_NO_NODE

Description

If p is NULL, krealloc() behaves exactly like kmalloc(). If new_size is 0 and p is not a NULL pointer, the object pointed to is freed.

Only alignments up to those guaranteed by kmalloc() will be honored. Please see Memory Allocation Guide for more details.

If __GFP_ZERO logic is requested, callers must ensure that, starting with the initial memory allocation, every subsequent call to this API for the same memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully honored by this API.

When slub_debug_orig_size() is off, krealloc() only knows about the bucket size of an allocation (but not the exact size it was allocated with) and hence implements the following semantics for shrinking and growing buffers with __GFP_ZERO:

        new             bucket
0       size             size
|--------|----------------|
|  keep  |      zero      |

Otherwise, the original allocation size ‘orig_size’ could be used to precisely clear the requested size, and the new size will also be stored as the new ‘orig_size’.

In any case, the contents of the object pointed to are preserved up to the lesser of the new and old sizes.
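The diagram for the bucket-size-only case translates into a single memset. The function below is a user-space sketch of that semantics, not the krealloc() implementation; model_resize_zero() is a hypothetical name and the caller supplies the bucket size explicitly.

```c
#include <stddef.h>
#include <string.h>

/*
 * Model of an in-place resize with zeroing when only the bucket size is
 * known: bytes [0, new_size) are kept, bytes [new_size, bucket_size) are
 * zeroed.
 */
void model_resize_zero(unsigned char *buf, size_t bucket_size, size_t new_size)
{
	if (new_size < bucket_size)
		memset(buf + new_size, 0, bucket_size - new_size);
}
```

This also shows why every call for the same allocation has to pass __GFP_ZERO: if an earlier call skipped the zeroing, stale bytes between the old size and the bucket size could later become visible when the buffer grows.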

Return

pointer to the allocated memory or NULL in case of error

void *__kvmalloc_node(size, b, unsigned long align, gfp_t flags, int node)¶: attempt to allocate physically contiguous memory, but upon failure, fall back to non-contiguous (vmalloc) allocation.

Parameters

size: size of the request.
b: which set of kmalloc buckets to allocate from.
unsigned long align: desired alignment.
gfp_t flags: gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
int node: numa node to allocate from

Description

Only alignments up to those guaranteed by kmalloc() will be honored. Please see Memory Allocation Guide for more details.

Uses kmalloc to get the memory but if the allocation fails then falls back to the vmalloc allocator. Use kvfree for freeing the memory.

GFP_NOWAIT and GFP_ATOMIC are supported; the __GFP_NORETRY modifier is not. __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.

Return

pointer to the allocated memory or NULL in case of failure

void kvfree(const void *addr)¶: Free memory.

Parameters

const void *addr: Pointer to allocated memory.

Description

kvfree frees memory allocated by any of vmalloc(), kmalloc() or kvmalloc(). It is slightly more efficient to use kfree() or vfree() if you are certain that you know which one to use.

Context

Either preemptible task context or non-NMI interrupt.

void kvfree_sensitive(const void *addr, size_t len)¶: Free a data object containing sensitive information.

Parameters

const void *addr: address of the data object to be freed.
size_t len: length of the data object.

Description

Use the special memzero_explicit() function to clear the content of a kvmalloc’ed object containing sensitive data to make sure that the compiler won’t optimize out the data clearing.
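The idea of clearing memory in a way the compiler cannot drop can be illustrated in user space. This is only a sketch under stated assumptions: model_zero_explicit() and model_free_sensitive() are hypothetical stand-ins, not the kernel's memzero_explicit() or kvfree_sensitive(), and the volatile stores are one common way to keep the clearing from being optimized out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for an explicit clear: every store goes through
 * a volatile pointer, so the compiler has to emit it even though the
 * buffer is about to be freed.
 */
static void model_zero_explicit(void *p, size_t len)
{
    volatile unsigned char *vp = p;

    assert(p != NULL || len == 0);
    while (len--)
        *vp++ = 0;
}

/* Clear len bytes at p, then free it. NULL is ignored in this model. */
static void model_free_sensitive(void *p, size_t len)
{
    if (!p)
        return;
    model_zero_explicit(p, len);
    free(p);
}
```

A plain memset() immediately before free() is a dead store from the compiler's point of view, which is the situation the description above warns about.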

void *kvrealloc_node_align(const void *p, size_t size, unsigned long align, gfp_t flags, int nid)¶: reallocate memory; contents remain unchanged

Parameters

const void *p: object to reallocate memory for
size_t size: the size to reallocate
unsigned long align: desired alignment
gfp_t flags: the flags for the page level allocator
int nid: NUMA node id

Description

If p is NULL, kvrealloc() behaves exactly like kvmalloc(). If size is 0 and p is not a NULL pointer, the object pointed to is freed.

Only alignments up to those guaranteed by kmalloc() will be honored. Please see Memory Allocation Guide for more details.

If __GFP_ZERO logic is requested, callers must ensure that, starting with the initial memory allocation, every subsequent call to this API for the same memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully honored by this API.

In any case, the contents of the object pointed to are preserved up to the lesser of the new and old sizes.

This function must not be called concurrently with itself or kvfree() for the same memory allocation.

Return

pointer to the allocated memory or NULL in case of error

struct kmem_cache *__kmem_cache_create_args(const char *name, unsigned int object_size, struct kmem_cache_args *args, slab_flags_t flags)¶: Create a kmem cache.

Parameters

const char *name: A string which is used in /proc/slabinfo to identify this cache.
unsigned int object_size: The size of objects to be created in this cache.
struct kmem_cache_args *args: Additional arguments for the cache creation (see struct kmem_cache_args).
slab_flags_t flags: See the descriptions of individual flags. The common ones are listed in the description below.

Description

Not to be called directly, use the kmem_cache_create() wrapper with the same parameters.

Commonly used flags:

SLAB_ACCOUNT - Account allocations to memcg.

SLAB_HWCACHE_ALIGN - Align objects on cache line boundaries.

SLAB_RECLAIM_ACCOUNT - Objects are reclaimable.

SLAB_TYPESAFE_BY_RCU - Slab page (not individual objects) freeing delayed by a grace period - see the full description before using.

Context

Cannot be called within an interrupt, but can be interrupted.

Return

a pointer to the cache on success, NULL on failure.

kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags, unsigned int useroffset, unsigned int usersize, void (*ctor)(void*))¶: Create a set of caches that handle dynamic sized allocations via kmem_buckets_alloc()

Parameters

const char *name: A prefix string which is used in /proc/slabinfo to identify this cache. The individual caches will have their sizes as the suffix.
slab_flags_t flags: SLAB flags (see kmem_cache_create() for details).
unsigned int useroffset: Starting offset within an allocation that may be copied to/from userspace.
unsigned int usersize: How many bytes, starting at useroffset, may be copied to/from userspace.
void (*ctor)(void *): A constructor for the objects, run when new allocations are made.

Description

Cannot be called within an interrupt, but can be interrupted.

Return

a pointer to the cache on success, NULL on failure. When CONFIG_SLAB_BUCKETS is not enabled, ZERO_SIZE_PTR is returned, and subsequent calls to kmem_buckets_alloc() will fall back to kmalloc(). (i.e. callers only need to check for NULL on failure.)

int kmem_cache_shrink(struct kmem_cache *cachep)¶: Shrink a cache.

Parameters

struct kmem_cache *cachep: The cache to shrink.

Description

Releases as many slabs as possible for a cache. To help debugging, a zero exit status indicates all slabs were released.

Return

0 if all slabs were released, non-zero otherwise

bool kmem_dump_obj(void *object)¶: Print available slab provenance information

Parameters

void *object: slab object for which to find provenance information.

Description

This function uses pr_cont(), so that the caller is expected to have printed out whatever preamble is appropriate. The provenance information depends on the type of object and on how much debugging is enabled. For a slab-cache object, the fact that it is a slab object is printed, and, if available, the slab name, return address, and stack trace from the allocation and last free path of that object.

Return

true if the pointer is to a not-yet-freed object from kmalloc() or kmem_cache_alloc(), either true or false if the pointer is to an already-freed object, and false otherwise.

void kfree_sensitive(const void *p)¶: Clear sensitive information in memory before freeing

Parameters

const void *p: object to free memory of

Description

The memory of the object p points to is zeroed before being freed. If p is NULL, kfree_sensitive() does nothing.

Note

this function zeroes the whole allocated buffer, which can be a good deal bigger than the requested buffer size passed to kmalloc(). So be careful when using this function in performance-sensitive code.

void kvfree_rcu_barrier(void)¶: Wait until all in-flight kvfree_rcu() complete.

Parameters

void: no arguments

Description

Note that a single-argument kvfree_rcu() call has a slow path that triggers synchronize_rcu() followed by freeing the pointer. This is done before the function returns. Therefore, for any single-argument call that will result in a kfree() to a cache that is to be destroyed during module exit, it is the developer’s responsibility to ensure that all such calls have returned before the call to kmem_cache_destroy().

void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)¶: Wait for in-flight kvfree_rcu() calls on a specific slab cache.

Parameters

struct kmem_cache *s: slab cache to wait for

Description

See the description of kvfree_rcu_barrier() for details.

void kfree_const(const void *x)¶: conditionally free memory

Parameters

const void *x: pointer to the memory

Description

Function calls kfree only if x is not in .rodata section.

Virtually Contiguous Mappings¶

void vm_unmap_aliases(void)¶: unmap outstanding lazy aliases in the vmap layer

Parameters

void: no arguments

Description

The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily to amortize TLB flushing overheads. What this means is that any page you have now, may, in a former life, have been mapped into kernel virtual address by the vmap layer and so there might be some CPUs with TLB entries still referencing that page (additional to the regular 1:1 kernel mapping).

vm_unmap_aliases flushes all such lazy mappings. After it returns, we can be sure that none of the pages we have control over will have any aliases from the vmap layer.

void vm_unmap_ram(const void *mem, unsigned int count)¶: unmap linear kernel address space set up by vm_map_ram

Parameters

const void *mem: the pointer returned by vm_map_ram
unsigned int count: the count passed to that vm_map_ram call (cannot unmap partial)

void *vm_map_ram(struct page **pages, unsigned int count, int node)¶: map pages linearly into kernel virtual address (vmalloc space)

Parameters

struct page **pages: an array of pointers to the pages to be mapped
unsigned int count: number of pages
int node: prefer to allocate data structures on this node

Description

If you use this function for fewer than VMAP_MAX_ALLOC pages, it can be faster than vmap(). However, if you mix long-lived and short-lived objects with vm_map_ram(), it can consume a lot of address space through fragmentation (especially on a 32-bit machine), and allocations may eventually fail. Please use this function for short-lived objects.

Return

a pointer to the address that has been mapped, or NULL on failure

void vfree(const void *addr)¶: Release memory allocated by vmalloc()

Parameters

const void *addr: Memory base address

Description

Free the virtually contiguous memory area starting at addr, as obtained from one of the vmalloc() family of APIs. This will usually also free the physical memory underlying the virtual allocation, but that memory is reference counted, so it will not be freed until the last user goes away.

If addr is NULL, no operation is performed.

Context

May sleep if not called from interrupt context. Must not be called in NMI context (strictly speaking, it could be if we have CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG, but making the calling conventions for vfree() arch-dependent would be a really bad idea).

void vunmap(const void *addr)¶: release virtual mapping obtained by vmap()

Parameters

const void *addr: memory base address

Description

Free the virtually contiguous memory area starting at addr, which was created from the page array passed to vmap().

Must not be called in interrupt context.

void *vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot)¶: map an array of pages into virtually contiguous space

Parameters

struct page **pages: array of page pointers
unsigned int count: number of pages to map
unsigned long flags: vm_area->flags
pgprot_t prot: page protection for the mapping

Description

Maps count pages from pages into contiguous kernel virtual space. If flags contains VM_MAP_PUT_PAGES, the ownership of the pages array itself (which must be kmalloc or vmalloc memory) and one reference per page in it are transferred from the caller to vmap(), and will be freed / dropped when vfree() is called on the return value.

Return

the address of the area or NULL on failure

void *vmap_pfn(unsigned long *pfns, unsigned int count, pgprot_t prot)¶: map an array of PFNs into virtually contiguous space

Parameters

unsigned long *pfns: array of PFNs
unsigned int count: number of pages to map
pgprot_t prot: page protection for the mapping

Description

Maps count PFNs from pfns into contiguous kernel virtual space and returns the start address of the mapping.

void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask, int node, const void *caller)¶: allocate virtually contiguous memory

Parameters

unsigned long size: allocation size
unsigned long align: desired alignment
gfp_t gfp_mask: flags for the page level allocator
int node: node to use for allocation or NUMA_NO_NODE
const void *caller: caller’s return address

Description

Allocate enough pages to cover size from the page level allocator with gfp_mask flags. Map them into contiguous kernel virtual space.

Semantics of gfp_mask (including reclaim/retry modifiers such as __GFP_NOFAIL) are the same as in __vmalloc_node_range_noprof().

Return

pointer to the allocated memory or NULL on error

void *vmalloc(unsigned long size)¶: allocate virtually contiguous memory

Parameters

unsigned long size: allocation size

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space.

For tight control over page level allocator and protection flags use __vmalloc() instead.

Return

pointer to the allocated memory or NULL on error
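"Enough pages to cover size" amounts to rounding the byte count up to a whole number of pages. The sketch below shows only that arithmetic; model_pages_to_cover() is a hypothetical helper, and the 4 KiB page size is an assumption, since the real PAGE_SIZE depends on the architecture and configuration.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

/*
 * Hypothetical helper: number of whole pages needed to hold size bytes,
 * i.e. size rounded up to the next page boundary, in pages.
 */
static unsigned long model_pages_to_cover(unsigned long size)
{
    /* Guard against overflow in the round-up below. */
    assert(size <= ULONG_MAX - (MODEL_PAGE_SIZE - 1));
    return (size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}
```

Under this assumption a 1-byte request and a 4096-byte request both occupy one page, while 4097 bytes need two, so small vmalloc-style allocations can waste most of a page.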

void *vmalloc_huge_node(unsigned long size, gfp_t gfp_mask, int node)¶: allocate virtually contiguous memory, allow huge pages

Parameters

unsigned long size: allocation size
gfp_t gfp_mask: flags for the page level allocator
int node: node to use for allocation or NUMA_NO_NODE

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. If size is greater than or equal to PMD_SIZE, allow using huge pages for the memory.

Return

pointer to the allocated memory or NULL on error

void *vzalloc(unsigned long size)¶: allocate virtually contiguous memory with zero fill

Parameters

unsigned long size: allocation size

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. The memory allocated is set to zero.

For tight control over page level allocator and protection flags use __vmalloc() instead.

Return

pointer to the allocated memory or NULL on error

void *vmalloc_user(unsigned long size)¶: allocate zeroed virtually contiguous memory for userspace

Parameters

unsigned long size: allocation size

Description

The resulting memory area is zeroed so it can be mapped to userspace without leaking data.

Return

pointer to the allocated memory or NULL on error

void *vmalloc_node(unsigned long size, int node)¶: allocate memory on a specific node

Parameters

unsigned long size: allocation size
int node: numa node

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space.

For tight control over page level allocator and protection flags use __vmalloc() instead.

Return

pointer to the allocated memory or NULL on error

void *vzalloc_node(unsigned long size, int node)¶: allocate memory on a specific node with zero fill

Parameters

unsigned long size: allocation size
int node: numa node

Description

Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. The memory allocated is set to zero.

Return

pointer to the allocated memory or NULL on error

void *vmalloc_32(unsigned long size)¶: allocate virtually contiguous memory (32bit addressable)

Parameters

unsigned long size: allocation size

Description

Allocate enough 32bit PA addressable pages to cover size from the page level allocator and map them into contiguous kernel virtual space.

Return

pointer to the allocated memory or NULL on error

void *vmalloc_32_user(unsigned long size)¶: allocate zeroed virtually contiguous 32bit memory

Parameters

unsigned long size: allocation size

Description

The resulting memory area is 32bit addressable and zeroed so it can be mapped to userspace without leaking data.

Return

pointer to the allocated memory or NULL on error

int remap_vmalloc_range(struct vm_area_struct *vma, void *addr, unsigned long pgoff)¶: map vmalloc pages to userspace

Parameters

struct vm_area_struct *vma: vma to cover (map full range of vma)
void *addr: vmalloc memory
unsigned long pgoff: number of pages into addr before first page to map

Return

0 for success, -Exxx on failure

Description

This function checks that addr is a valid vmalloc’ed area, and that it is big enough to cover the vma. It returns failure if these criteria are not met.

Similar to remap_pfn_range() (see mm/memory.c)
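The "big enough to cover the vma" condition can be pictured as a bounds check in pages. The following is a user-space sketch only: model_range_check() is a hypothetical name, all quantities are page counts supplied by the caller, and -EINVAL is merely this model's stand-in for the "-Exxx" mentioned under Return.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model of the size check: the vmalloc area has area_pages
 * pages, the mapping starts pgoff pages into it and must cover
 * vma_pages pages. Written to avoid overflow in pgoff + vma_pages.
 */
static int model_range_check(unsigned long area_pages,
                             unsigned long vma_pages,
                             unsigned long pgoff)
{
    assert(vma_pages > 0);  /* a vma covers at least one page */

    if (pgoff > area_pages)
        return -EINVAL;
    if (vma_pages > area_pages - pgoff)
        return -EINVAL;
    return 0;
}
```

For an 8-page area, a 4-page vma fits at pgoff 4 but not at pgoff 5 in this model.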

File Mapping and Page Cache¶

Filemap¶

int filemap_fdatawrite_range(struct address_space *mapping, loff_t start, loff_t end)¶: start writeback on mapping dirty pages in range

Parameters

struct address_space *mapping: address space structure to write
loff_t start: offset in bytes where the range starts
loff_t end: offset in bytes where the range ends (inclusive)

Description

Start writeback against all of a mapping’s dirty pages that lie within the byte offsets <start, end> inclusive.

This is a data integrity operation that waits upon dirty or in writeback pages.

Return

0 on success, negative error code otherwise.

int filemap_flush_range(struct address_space *mapping, loff_t start, loff_t end)¶: start writeback on a range

Parameters

struct address_space *mapping: target address_space
loff_t start: index to start writeback on
loff_t end: last (inclusive) index for writeback

Description

This is a non-integrity writeback helper, to start writing back folios for the indicated range.

Return

0 on success, negative error code otherwise.

int filemap_flush(struct address_space *mapping)¶: mostly a non-blocking flush

Parameters

struct address_space *mapping: target address_space

Description

This is a mostly non-blocking flush. Not suitable for data-integrity purposes - I/O may not be started against all dirty pages.

Return

0 on success, negative error code otherwise.

bool filemap_range_has_page(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶: check if a page exists in range.

Parameters

struct address_space *mapping: address space within which to check
loff_t start_byte: offset in bytes where the range starts
loff_t end_byte: offset in bytes where the range ends (inclusive)

Description

Find at least one page in the range supplied, usually used to check if direct writing in this range will trigger a writeback.

Return

true if at least one page exists in the specified range, false otherwise.

int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶: wait for writeback to complete

Parameters

struct address_space *mapping: address space structure to wait for
loff_t start_byte: offset in bytes where the range starts
loff_t end_byte: offset in bytes where the range ends (inclusive)

Description

Walk the list of under-writeback pages of the given address space in the given range and wait for all of them. Check error status of the address space and return it.

Since the error status of the address space is cleared by this function, callers are responsible for checking the return value and handling and/or reporting the error.

Return

error status of the address space.

int filemap_fdatawait_range_keep_errors(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶: wait for writeback to complete

Parameters

struct address_space *mapping: address space structure to wait for
loff_t start_byte: offset in bytes where the range starts
loff_t end_byte: offset in bytes where the range ends (inclusive)

Description

Walk the list of under-writeback pages of the given address space in the given range and wait for all of them. Unlike filemap_fdatawait_range(), this function does not clear error status of the address space.

Use this function if callers don’t handle errors themselves. Expected call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), fsfreeze(8)

int file_fdatawait_range(struct file *file, loff_t start_byte, loff_t end_byte)¶: wait for writeback to complete

Parameters

struct file *file: file pointing to address space structure to wait for
loff_t start_byte: offset in bytes where the range starts
loff_t end_byte: offset in bytes where the range ends (inclusive)

Description

Walk the list of under-writeback pages of the address space that file refers to, in the given range and wait for all of them. Check error status of the address space vs. the file->f_wb_err cursor and return it.

Since the error status of the file is advanced by this function, callers are responsible for checking the return value and handling and/or reporting the error.

Return

error status of the address space vs. the file->f_wb_err cursor.

int filemap_fdatawait_keep_errors(struct address_space *mapping)¶: wait for writeback without clearing errors

Parameters

struct address_space *mapping: address space structure to wait for

Description

Walk the list of under-writeback pages of the given address space and wait for all of them. Unlike filemap_fdatawait(), this function does not clear error status of the address space.

Use this function if callers don’t handle errors themselves. Expected call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), fsfreeze(8)

Return

error status of the address space.

int filemap_write_and_wait_range(struct address_space *mapping, loff_t lstart, loff_t lend)¶: write out & wait on a file range

Parameters

struct address_space *mapping: the address_space for the pages
loff_t lstart: offset in bytes where the range starts
loff_t lend: offset in bytes where the range ends (inclusive)

Description

Write out and wait upon file offsets lstart->lend, inclusive.

Note that lend is inclusive (describes the last byte to be written) so that this function can be used to write to the very end-of-file (end = -1).

Return

error status of the address space.
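Because the end offset is inclusive, a caller that starts from a position and a length has to subtract one when computing it. The sketch below illustrates that conversion and the page indices a byte range touches. The helper names are hypothetical, int64_t stands in for loff_t, and a 4 KiB page size (shift of 12) is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Hypothetical helper: last byte (inclusive) of a range of len bytes at pos. */
static int64_t model_inclusive_end(int64_t pos, int64_t len)
{
    assert(pos >= 0 && len > 0);
    assert(pos <= INT64_MAX - (len - 1));
    return pos + len - 1;
}

/* Hypothetical helper: page index that contains byte offset pos. */
static uint64_t model_page_index(int64_t pos)
{
    assert(pos >= 0);
    return (uint64_t)pos >> MODEL_PAGE_SHIFT;
}
```

With these assumptions, a 4096-byte range at offset 0 ends at byte 4095 and stays within page index 0, whereas passing pos + len as the end would wrongly pull in page index 1.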

int file_check_and_advance_wb_err(struct file *file)¶: report any previously unreported wb error and advance wb_err to the current one

Parameters

struct file *file: struct file on which the error is being reported

Description

When userland calls fsync (or something like nfsd does the equivalent), we want to report any writeback errors that occurred since the last fsync (or since the file was opened if there haven’t been any).

Grab the wb_err from the mapping. If it matches what we have in the file, then just quickly return 0. The file is all caught up.

If it doesn’t match, then take the mapping value, set the “seen” flag in it and try to swap it into place. If it works, or another task beat us to it with the new value, then update the f_wb_err and return the error portion. The error at this point must be reported via proper channels (à la fsync, or NFS COMMIT operation, etc.).

While we handle mapping->wb_err with atomic operations, the f_wb_err value is protected by the f_lock since we must ensure that it reflects the latest value swapped in for this file descriptor.

Return

0 on success, negative error code otherwise.

int file_write_and_wait_range(struct file *file, loff_t lstart, loff_t lend)¶: write out & wait on a file range

Parameters

struct file *file: file pointing to address_space with pages
loff_t lstart: offset in bytes where the range starts
loff_t lend: offset in bytes where the range ends (inclusive)

Description

Write out and wait upon file offsets lstart->lend, inclusive.

Note that lend is inclusive (describes the last byte to be written) so that this function can be used to write to the very end-of-file (end = -1).

After writing out and waiting on the data, we check and advance the f_wb_err cursor to the latest value, and return any errors detected there.

Return

0 on success, negative error code otherwise.

void replace_page_cache_folio(struct folio *old, struct folio *new)¶: replace a pagecache folio with a new one

Parameters

struct folio *old: folio to be replaced
struct folio *new: folio to replace with

Description

This function replaces a folio in the pagecache with a new one. On success it acquires the pagecache reference for the new folio and drops it for the old folio. Both the old and new folios must be locked. This function does not add the new folio to the LRU, the caller must do that.

The remove + add is atomic. This function cannot fail.

void folio_unlock(struct folio *folio)¶: Unlock a locked folio.

Parameters

struct folio *folio: The folio.

Description

Unlocks the folio and wakes up any thread sleeping on the page lock.

Context

May be called from interrupt or process context. May not be called from NMI context.

void folio_end_read(struct folio *folio, bool success)¶: End read on a folio.

Parameters

struct folio *folio: The folio.
bool success: True if all reads completed successfully.

Description

When all reads against a folio have completed, filesystems should call this function to let the pagecache know that no more reads are outstanding. This will unlock the folio and wake up any thread sleeping on the lock. The folio will also be marked uptodate if all reads succeeded.

Context

May be called from interrupt or process context. May not be called from NMI context.

void folio_end_private_2(struct folio *folio)¶: Clear PG_private_2 and wake any waiters.

Parameters

struct folio *folio: The folio.

Description

Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for it. The folio reference held for PG_private_2 being set is released.

This is, for example, used when a netfs folio is being written to a local disk cache, thereby allowing writes to the cache for the same folio to be serialised.

void folio_wait_private_2(struct folio *folio)¶: Wait for PG_private_2 to be cleared on a folio.

Parameters

struct folio *folio: The folio to wait on.

Description

Wait for PG_private_2 to be cleared on a folio.

int folio_wait_private_2_killable(struct folio *folio)¶: Wait for PG_private_2 to be cleared on a folio.

Parameters

struct folio *folio: The folio to wait on.

Description

Wait for PG_private_2 to be cleared on a folio or until a fatal signal is received by the calling task.

Return

0 if successful.
-EINTR if a fatal signal was encountered.

void folio_end_writeback_no_dropbehind(struct folio *folio)¶: End writeback against a folio.

Parameters

struct folio *folio: The folio.

Description

The folio must actually be under writeback. This call is intended for filesystems that need to defer dropbehind.

Context

May be called from process or interrupt context.

void folio_end_writeback(struct folio *folio)¶: End writeback against a folio.

Parameters

struct folio *folio: The folio.

Description

The folio must actually be under writeback.

Context

May be called from process or interrupt context.

void __folio_lock(struct folio *folio)¶: Get a lock on the folio, assuming we need to sleep to get it.

Parameters

struct folio *folio: The folio to lock

pgoff_t page_cache_next_miss(struct address_space *mapping, pgoff_t index, unsigned long max_scan)¶: Find the next gap in the page cache.

Parameters

struct address_space *mapping: Mapping.
pgoff_t index: Index.
unsigned long max_scan: Maximum range to search.

Description

Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the gap with the lowest index.

This function may be called under the rcu_read_lock. However, this will not atomically search a snapshot of the cache at a single point in time. For example, if a gap is created at index 5, then subsequently a gap is created at index 10, page_cache_next_miss covering both indices may return 10 if called under the rcu_read_lock.

Return

The index of the gap if found, otherwise an index outside the range specified (in which case ‘return - index >= max_scan’ will be true). In the rare case of index wrap-around, 0 will be returned.
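The search range and the "not found" test can be tried out on a toy model. This is a sketch, not the kernel's implementation: model_next_miss() is a hypothetical function, the cache is a plain array of booleans, and index wrap-around is deliberately excluded by the preconditions.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NPAGES 16UL  /* size of the toy cache */

/*
 * Hypothetical model: present[i] says whether index i is cached.
 * Scans [index, index + max_scan - 1] and returns the lowest gap.
 * Indices at or beyond MODEL_NPAGES count as gaps. If no gap is found,
 * returns index + max_scan, which lies outside the scanned range.
 */
static unsigned long model_next_miss(const bool *present,
                                     unsigned long index,
                                     unsigned long max_scan)
{
    unsigned long i;

    assert(index < MODEL_NPAGES && max_scan <= MODEL_NPAGES);

    for (i = 0; i < max_scan; i++) {
        unsigned long cur = index + i;

        if (cur >= MODEL_NPAGES || !present[cur])
            return cur;
    }
    return index + max_scan;
}
```

A caller can tell the two outcomes apart exactly as the Return text says: if ret - index >= max_scan, no gap was found within the range.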

pgoff_t page_cache_prev_miss(struct address_space *mapping, pgoff_t index, unsigned long max_scan)¶: Find the previous gap in the page cache.

Parameters

struct address_space *mapping: Mapping.
pgoff_t index: Index.
unsigned long max_scan: Maximum range to search.

Description

Search the range [max(index - max_scan + 1, 0), index] for the gap with the highest index.

This function may be called under the rcu_read_lock. However, this will not atomically search a snapshot of the cache at a single point in time. For example, if a gap is created at index 10, then subsequently a gap is created at index 5, page_cache_prev_miss() covering both indices may return 5 if called under the rcu_read_lock.

Return

The index of the gap if found, otherwise an index outside the range specified (in which case ‘index - return >= max_scan’ will be true). In the rare case of wrap-around, ULONG_MAX will be returned.

struct folio *__filemap_get_folio_mpol(struct address_space *mapping, pgoff_t index, fgf_t fgp_flags, gfp_t gfp, struct mempolicy *policy)¶: Find and get a reference to a folio.

Parameters

struct address_space *mapping: The address_space to search.
pgoff_t index: The page index.
fgf_t fgp_flags: FGP flags modify how the folio is returned.
gfp_t gfp: Memory allocation flags to use if FGP_CREAT is specified.
struct mempolicy *policy: NUMA memory allocation policy to follow.

Description

Looks up the page cache entry at mapping & index.

If FGP_LOCK or FGP_CREAT are specified then the function may sleep even if the GFP flags specified for FGP_CREAT are atomic.

If this function returns a folio, it is returned with an increased refcount.

Return

The found folio or an ERR_PTR() otherwise.

unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start, pgoff_t end, struct folio_batch *fbatch)¶: Get a batch of folios

Parameters

struct address_space *mapping: The address_space to search
pgoff_t *start: The starting page index
pgoff_t end: The final page index (inclusive)
struct folio_batch *fbatch: The batch to fill.

Description

Search for and return a batch of folios in the mapping starting at index start and up to index end (inclusive). The folios are returned in fbatch with an elevated reference count.

Return

The number of folios which were found. We also update start to index the next folio for the traversal.
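The "fill a batch, then continue from the updated start" pattern can be modelled in user space. The sketch below is a toy under stated assumptions: struct model_batch, model_get_batch() and model_count_all() are hypothetical, each entry is a single page index (real folios may span several indices), and the batch capacity of 4 is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NPAGES 32UL  /* size of the toy cache */
#define MODEL_BATCH  4U    /* arbitrary batch capacity */

struct model_batch {
    unsigned int nr;
    unsigned long idx[MODEL_BATCH];
};

/*
 * Hypothetical model: collect present indices in [*start, end] until
 * the batch is full, then set *start to where the next call resumes.
 */
static unsigned int model_get_batch(const bool *present,
                                    unsigned long *start,
                                    unsigned long end,
                                    struct model_batch *b)
{
    unsigned long i = *start;

    assert(present && start && b);
    b->nr = 0;
    while (i <= end && i < MODEL_NPAGES && b->nr < MODEL_BATCH) {
        if (present[i])
            b->idx[b->nr++] = i;
        i++;
    }
    *start = i;
    return b->nr;
}

/* Walk the whole range batch by batch and count the entries found. */
static unsigned long model_count_all(const bool *present,
                                     unsigned long start,
                                     unsigned long end)
{
    struct model_batch b;
    unsigned long total = 0;
    unsigned int n;

    while ((n = model_get_batch(present, &start, end, &b)) != 0)
        total += n;
    return total;
}
```

The loop ends when a call returns 0, which in this model only happens once the range is exhausted, so the caller never needs to track the position itself.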

unsigned filemap_get_folios_contig(struct address_space *mapping, pgoff_t *start, pgoff_t end, struct folio_batch *fbatch)¶: Get a batch of contiguous folios

Parameters

struct address_space *mapping: The address_space to search
pgoff_t *start: The starting page index
pgoff_t end: The final page index (inclusive)
struct folio_batch *fbatch: The batch to fill

Description

filemap_get_folios_contig() works exactly like filemap_get_folios(), except the returned folios are guaranteed to be contiguous. This may not return all contiguous folios if the batch gets filled up.

Return

The number of folios found. Also update start to be positioned for traversal of the next folio.

unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start, pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch)¶: Get a batch of folios matching tag

Parameters

struct address_space *mapping: The address_space to search
pgoff_t *start: The starting page index
pgoff_t end: The final page index (inclusive)
xa_mark_t tag: The tag index
struct folio_batch *fbatch: The batch to fill

Description

The first folio may start before start; if it does, it will contain start. The final folio may extend beyond end; if it does, it will contain end. The folios have ascending indices. There may be gaps between the folios if there are indices which have no folio in the page cache. If folios are added to or removed from the page cache while this is running, they may or may not be found by this call. Only returns folios that are tagged with tag.

Return

The number of folios found. Also update start to index the next folio for traversal.

ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, ssize_t already_read)¶: Read data from the page cache.

Parameters

struct kiocb *iocb: The iocb to read.
struct iov_iter *iter: Destination for the data.
ssize_t already_read: Number of bytes already read by the caller.

Description

Copies data from the page cache. If the data is not currently present, uses the readahead and read_folio address_space operations to fetch it.

Return

Total number of bytes copied, including those already read by the caller. If an error happens before any bytes are copied, returns a negative error number.

ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)¶: generic filesystem read routine

Parameters

struct kiocb *iocb: kernel I/O control block
struct iov_iter *iter: destination for the data read

Description

This is the “read_iter()” routine for all filesystems that can use the page cache directly.

The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall be returned when no data can be read without waiting for I/O requests to complete; it doesn’t prevent readahead.

The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/O requests shall be made for the read or for readahead. When no data can be read, -EAGAIN shall be returned. When readahead would be triggered, a partial, possibly empty read shall be returned.

Return

number of bytes copied, even for partial reads
negative error code (or 0 if IOCB_NOIO) if nothing was read

ssize_t filemap_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags)¶: Splice data from a file’s pagecache into a pipe

Parameters

struct file *in: The file to read from
loff_t *ppos: Pointer to the file position to read from
struct pipe_inode_info *pipe: The pipe to splice into
size_t len: The amount to splice
unsigned int flags: The SPLICE_F_* flags

Description

This function gets folios from a file’s pagecache and splices them into the pipe. Readahead will be called as necessary to fill more folios. This may be used for blockdevs also.

Return

On success, the number of bytes read will be returned and *ppos will be updated if appropriate; 0 will be returned if there is no more data to be read; -EAGAIN will be returned if the pipe had no space, and some other negative error code will be returned on error. A short read may occur if the pipe has insufficient space, we reach the end of the data or we hit a hole.

vm_fault_t filemap_fault(struct vm_fault *vmf)¶: read in file data for page fault handling

Parameters

struct vm_fault *vmf: struct vm_fault containing details of the fault

Description

filemap_fault() is invoked via the vma operations vector for a mapped memory region to read in file data during a page fault.

The goto’s are kind of ugly, but this streamlines the normal case of having it in the page cache, and handles the special cases reasonably without having a lot of duplicated code.

vma->vm_mm->mmap_lock must be held on entry.

If our return value has VM_FAULT_RETRY set, it’s because the mmap_lock may be dropped before doing I/O or by lock_folio_maybe_drop_mmap().

If our return value does not have VM_FAULT_RETRY set, the mmap_lock has not been released.

We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set.

Return

bitwise-OR of VM_FAULT_ codes.

struct folio *read_cache_folio(struct address_space *mapping, pgoff_t index, filler_t filler, struct file *file)¶: Read into page cache, fill it if needed.

Parameters

struct address_space *mapping: The address_space to read from.
pgoff_t index: The index to read.
filler_t filler: Function to perform the read, or NULL to use aops->read_folio().
struct file *file: Passed to filler function, may be NULL if not required.

Description

Read one page into the page cache. If it succeeds, the folio returned will contain index, but it may not be the first page of the folio.

If the filler function returns an error, it will be returned to the caller.

Context

May sleep. Expects mapping->invalidate_lock to be held.

Return

An uptodate folio on success, ERR_PTR() on failure.

struct folio *mapping_read_folio_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp)¶: Read into page cache, using specified allocation flags.

Parameters

struct address_space *mapping: The address_space for the folio.
pgoff_t index: The index that the allocated folio will contain.
gfp_t gfp: The page allocator flags to use if allocating.

Description

This is the same as “read_cache_folio(mapping, index, NULL, NULL)”, but with any new memory allocations done using the specified allocation flags.

The most likely error from this function is EIO, but ENOMEM is possible and so is EINTR. If ->read_folio returns another error, that will be returned to the caller.

The function expects mapping->invalidate_lock to be already held.

Return

Uptodate folio on success, ERR_PTR() on failure.

struct page *read_cache_page_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp)¶: read into page cache, using specified page allocation flags.

Parameters

struct address_space *mapping: the page’s address_space
pgoff_t index: the page index
gfp_t gfp: the page allocator flags to use if allocating

Description

This is the same as “read_mapping_page(mapping, index, NULL)”, but with any new page allocations done using the specified allocation flags.

If the page does not get brought uptodate, return -EIO.

The function expects mapping->invalidate_lock to be already held.

Return

up to date page on success, ERR_PTR() on failure.

ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)¶: write data to a file

Parameters

struct kiocb *iocb: IO state structure (file, offset, etc.)
struct iov_iter *from: iov_iter with data to write

Description

This function does all the work needed for actually writing data to a file. It does all basic checks, removes SUID from the file, updates modification times and calls proper subroutines depending on whether we do direct IO or a standard buffered write.

It expects i_rwsem to be grabbed unless we work on a block device or similar object which does not need locking at all.

This function does not take care of syncing data in case of O_SYNC write. A caller has to handle it. This is mainly due to the fact that we want to avoid syncing under i_rwsem.

Return

number of bytes written, even for truncated writes
negative error code if no data has been written at all

ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)¶: write data to a file

Parameters

struct kiocb *iocb: IO state structure
struct iov_iter *from: iov_iter with data to write

Description

This is a wrapper around __generic_file_write_iter() to be used by most filesystems. It takes care of syncing the file in case of O_SYNC file and acquires i_rwsem as needed.

Return

negative error code if no data has been written at all or vfs_fsync_range() failed for a synchronous write
number of bytes written, even for truncated writes

bool filemap_release_folio(struct folio *folio, gfp_t gfp)¶: Release fs-specific metadata on a folio.

Parameters

struct folio *folio: The folio which the kernel is trying to free.
gfp_t gfp: Memory allocation flags (and I/O mode).

Description

The address_space is trying to release any data attached to a folio (presumably at folio->private).

This will also be called if the private_2 flag is set on a page, indicating that the folio has other metadata associated with it.

The gfp argument specifies whether I/O may be performed to release this page (__GFP_IO), and whether the call may block (__GFP_RECLAIM & __GFP_FS).

Return

true if the release was successful, otherwise false.

int filemap_invalidate_inode(struct inode *inode, bool flush, loff_t start, loff_t end)¶: Invalidate/forcibly write back a range of an inode’s pagecache

Parameters

struct inode *inode: The inode to flush
bool flush: Set to write back rather than simply invalidate.
loff_t start: First byte in range.
loff_t end: Last byte in range (inclusive), or LLONG_MAX for everything from start onwards.

Description

Invalidate all the folios on an inode that contribute to the specified range, possibly writing them back first. Whilst the operation is undertaken, the invalidate lock is held to prevent new folios from being installed.

Readahead¶

Readahead is used to read content into the page cache before it is explicitly requested by the application. Readahead only ever attempts to read folios that are not yet in the page cache. If a folio is present but not up-to-date, readahead will not try to read it. In that case a simple ->read_folio() will be requested.

Readahead is triggered when an application read request (whether a system call or a page fault) finds that the requested folio is not in the page cache, or that it is in the page cache and has the readahead flag set. This flag indicates that the folio was read as part of a previous readahead request and now that it has been accessed, it is time for the next readahead.

Each readahead request is partly a synchronous read and partly an asynchronous readahead. This is reflected in the struct file_ra_state, which contains ->size, the total number of pages, and ->async_size, the number of pages in the async section. The readahead flag will be set on the first folio in this async section to trigger a subsequent readahead. Once a series of sequential reads has been established, there should be no need for a synchronous component and all readahead requests will be fully asynchronous.

When either of the triggers causes a readahead, three numbers need to be determined: the start of the region to read, the size of the region, and the size of the async tail.

The start of the region is simply the first page address at or after the accessed address, which is not currently populated in the page cache. This is found with a simple search in the page cache.

The size of the async tail is determined by subtracting the size that was explicitly requested from the determined request size, unless this would be less than zero - then zero is used. NOTE THIS CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED PAGE. ALSO THIS CALCULATION IS NOT USED CONSISTENTLY.

The size of the region is normally determined from the size of the previous readahead which loaded the preceding pages. This may be discovered from the struct file_ra_state for simple sequential reads, or from examining the state of the page cache when multiple sequential reads are interleaved. Specifically: where the readahead was triggered by the readahead flag, the size of the previous readahead is assumed to be the number of pages from the triggering page to the start of the new readahead. In these cases, the size of the previous readahead is scaled, often doubled, for the new readahead, though see get_next_ra_size() for details.

If the size of the previous read cannot be determined, the number of preceding pages in the page cache is used to estimate the size of a previous read. This estimate could easily be misled by random reads being coincidentally adjacent, so it is ignored unless it is larger than the current request, and it is not scaled up, unless it is at the start of file.

In general readahead is accelerated at the start of the file, as reads from there are often sequential. There are other minor adjustments to the readahead size in various special cases and these are best discovered by reading the code.

The above calculation, based on the previous readahead size, determines the size of the readahead, to which any requested read size may be added.
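For the simple sequential case, the sizing rules above can be put together in a few lines. This is a user-space sketch, not the kernel's code: `mock_next_ra_size()` and `mock_async_size()` are hypothetical names, and the ramp-up thresholds (quadruple small windows, double medium ones, cap at a maximum) are assumptions loosely modelled on get_next_ra_size() that may not match any particular kernel version.

```c
#include <assert.h>

/*
 * Scale the previous readahead size for the next request.
 * Assumed policy: grow fast while the window is small, then double,
 * and never exceed max. All sizes are in pages.
 */
static unsigned long mock_next_ra_size(unsigned long prev, unsigned long max)
{
	assert(max > 0);
	if (prev < max / 16)
		return 4 * prev;
	if (prev <= max / 2)
		return 2 * prev;
	return max;
}

/* Async tail: what is left of the window after the requested pages. */
static unsigned long mock_async_size(unsigned long ra_size,
				     unsigned long req_size)
{
	return ra_size > req_size ? ra_size - req_size : 0;
}
```

With a maximum of 32 pages, a previous window of 4 pages grows to 8, and a 2-page request inside that 8-page window leaves an async tail of 6 pages.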

Readahead requests are sent to the filesystem using the ->readahead() address space operation, for which mpage_readahead() is a canonical implementation. ->readahead() should normally initiate reads on all folios, but may fail to read any or all folios without causing an I/O error. The page cache reading code will issue a ->read_folio() request for any folio which ->readahead() did not read, and only an error from this will be final.

->readahead() will generally call readahead_folio() repeatedly to get each folio from those prepared for readahead. It may fail to read a folio by:

not calling readahead_folio() sufficiently many times, effectively ignoring some folios, as might be appropriate if the path to storage is congested,
failing to actually submit a read request for a given folio, possibly due to insufficient resources, or
getting an error during subsequent processing of a request.

In the last two cases, the folio should be unlocked by the filesystem to indicate that the read attempt has failed. In the first case the folio will be unlocked by the VFS.

Those folios not in the final async_size of the request should be considered to be important and ->readahead() should not fail them due to congestion or temporary resource unavailability, but should wait for necessary resources (e.g. memory or indexing information) to become available. Folios in the final async_size may be considered less urgent and failure to read them is more acceptable. In this case it is best to use filemap_remove_folio() to remove the folios from the page cache as is automatically done for folios that were not fetched with readahead_folio(). This will allow a subsequent synchronous readahead request to try them again. If they are left in the page cache, then they will be read individually using ->read_folio() which may be less efficient.
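The division of labour between ->readahead() and the VFS can be illustrated with a small user-space model. Everything here is a mock under stated assumptions: `mock_folio`, `mock_ractl`, `mock_readahead_folio()`, `mock_fs_readahead()`, `mock_vfs_cleanup()` and `mock_run()` are hypothetical stand-ins, and removal from the page cache is reduced to clearing a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_folio {
	bool locked;	/* set while a read is pending */
	bool uptodate;	/* set when the read succeeded */
	bool in_cache;	/* cleared when dropped from the page cache */
};

struct mock_ractl {
	struct mock_folio *folios;
	size_t nr;	/* folios prepared for this request */
	size_t next;	/* next folio to hand to the filesystem */
};

struct mock_result {
	size_t uptodate;
	size_t dropped;
	size_t locked;
};

/* Stand-in for readahead_folio(): hand out the next prepared folio. */
static struct mock_folio *mock_readahead_folio(struct mock_ractl *r)
{
	assert(r->next <= r->nr);
	return r->next < r->nr ? &r->folios[r->next++] : NULL;
}

/*
 * Hypothetical ->readahead(): take at most budget folios (as if the
 * path to storage were congested) and fail to submit folio fail_idx.
 */
static void mock_fs_readahead(struct mock_ractl *r, size_t budget,
			      size_t fail_idx)
{
	struct mock_folio *f;

	while (budget > 0 && (f = mock_readahead_folio(r)) != NULL) {
		budget--;
		if ((size_t)(f - r->folios) != fail_idx)
			f->uptodate = true;	/* read completed */
		f->locked = false;	/* filesystem unlocks either way */
	}
}

/* VFS side: folios never fetched are unlocked and dropped. */
static size_t mock_vfs_cleanup(struct mock_ractl *r)
{
	size_t dropped = 0;
	struct mock_folio *f;

	while ((f = mock_readahead_folio(r)) != NULL) {
		f->locked = false;
		f->in_cache = false;
		dropped++;
	}
	return dropped;
}

/* Run one request over four prepared folios and summarise the outcome. */
static struct mock_result mock_run(size_t budget, size_t fail_idx)
{
	struct mock_folio folios[4];
	struct mock_ractl r = { folios, 4, 0 };
	struct mock_result res = { 0, 0, 0 };
	size_t i;

	for (i = 0; i < 4; i++)
		folios[i] = (struct mock_folio){ true, false, true };
	mock_fs_readahead(&r, budget, fail_idx);
	res.dropped = mock_vfs_cleanup(&r);
	for (i = 0; i < 4; i++) {
		res.uptodate += folios[i].uptodate;
		res.locked += folios[i].locked;
	}
	return res;
}
```

In this model a failed folio ends up unlocked but not uptodate, which is the state that would lead the page cache reading code to fall back to ->read_folio().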

void page_cache_ra_unbounded(struct readahead_control *ractl, unsigned long nr_to_read, unsigned long lookahead_size)¶: Start unchecked readahead.

Parameters

struct readahead_control *ractl: Readahead control.
unsigned long nr_to_read: The number of pages to read.
unsigned long lookahead_size: Where to start the next readahead.

Description

This function is for filesystems to call when they want to start readahead beyond a file’s stated i_size. This is almost certainly not the function you want to call. Use page_cache_async_readahead() or page_cache_sync_readahead() instead.

Context

File is referenced by caller. Mutexes may be held by caller. May sleep, but will not reenter filesystem to reclaim memory.

void readahead_expand(struct readahead_control *ractl, loff_t new_start, size_t new_len)¶: Expand a readahead request

Parameters

struct readahead_control *ractl: The request to be expanded
loff_t new_start: The revised start
size_t new_len: The revised size of the request

Description

Attempt to expand a readahead request outwards from the current size to the specified size by inserting locked pages before and after the current window to increase the size to the new window. This may involve the insertion of THPs, in which case the window may get expanded even beyond what was requested.

The algorithm will stop if it encounters a conflicting page already in the pagecache and leave a smaller expansion than requested.

The caller must check for this by examining the revised ractl object for a different expansion than was requested.

Writeback¶

int balance_dirty_pages_ratelimited_flags(struct address_space *mapping, unsigned int flags)¶: Balance dirty memory state.

Parameters

struct address_space *mapping: address_space which was dirtied.
unsigned int flags: BDP flags.

Description

Processes which are dirtying memory should call in here once for each page which was newly dirtied. The function will periodically check the system’s dirty state and will initiate writeback if needed.

See balance_dirty_pages_ratelimited() for details.

Return

If flags contains BDP_ASYNC, it may return -EAGAIN to indicate that memory is out of balance and the caller must wait for I/O to complete. Otherwise, it will return 0 to indicate that either memory was already in balance, or it was able to sleep until the amount of dirty memory returned to balance.

void balance_dirty_pages_ratelimited(struct address_space *mapping)¶: balance dirty memory state.

Parameters

struct address_space *mapping: address_space which was dirtied.

Description

Processes which are dirtying memory should call in here once for each page which was newly dirtied. The function will periodically check the system’s dirty state and will initiate writeback if needed.

Once we’re over the dirty memory limit we decrease the ratelimiting by a lot, to prevent individual processes from overshooting the limit by (ratelimit_pages) each.

void tag_pages_for_writeback(struct address_space *mapping, pgoff_t start, pgoff_t end)¶: tag pages to be written by writeback

Parameters

struct address_space *mapping: address space structure to write
pgoff_t start: starting page index
pgoff_t end: ending page index (inclusive)

Description

This function scans the page range from start to end (inclusive) and tags all pages that have DIRTY tag set with a special TOWRITE tag. The caller can then use the TOWRITE tag to identify pages eligible for writeback. This mechanism is used to avoid livelocking of writeback by a process steadily creating new dirty pages in the file (thus it is important for this function to be quick so that it can tag pages faster than a dirtying process can create them).
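Why tagging avoids livelock can be seen in a small user-space model. The names below (`mock_mapping`, `mock_tag_for_writeback()`, `mock_write_tagged()`, `mock_demo_writeback()`) are hypothetical, and plain boolean arrays stand in for the page cache tags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_NR_PAGES 8

struct mock_mapping {
	bool dirty[MOCK_NR_PAGES];
	bool towrite[MOCK_NR_PAGES];
};

/* Snapshot: every page that is dirty right now gets the TOWRITE tag. */
static void mock_tag_for_writeback(struct mock_mapping *m, size_t start,
				   size_t end)
{
	size_t i;

	assert(start <= end && end < MOCK_NR_PAGES);
	for (i = start; i <= end; i++) {
		if (m->dirty[i])
			m->towrite[i] = true;
	}
}

/* Write only tagged pages; pages dirtied after the snapshot are skipped. */
static size_t mock_write_tagged(struct mock_mapping *m, size_t start,
				size_t end)
{
	size_t written = 0;
	size_t i;

	assert(start <= end && end < MOCK_NR_PAGES);
	for (i = start; i <= end; i++) {
		if (!m->towrite[i])
			continue;
		m->towrite[i] = false;
		m->dirty[i] = false;
		written++;
	}
	return written;
}

/*
 * Pages 1 and 3 are dirty when writeback starts; page 5 is dirtied
 * after tagging. Returns the number of pages written in this pass.
 */
static size_t mock_demo_writeback(void)
{
	struct mock_mapping m = { { false }, { false } };
	size_t written;

	m.dirty[1] = true;
	m.dirty[3] = true;
	mock_tag_for_writeback(&m, 0, MOCK_NR_PAGES - 1);
	m.dirty[5] = true;	/* a writer dirties another page meanwhile */
	written = mock_write_tagged(&m, 0, MOCK_NR_PAGES - 1);
	assert(m.dirty[5]);	/* left for a later writeback pass */
	return written;
}
```

The pass writes exactly the two pages that were dirty at tagging time, so its work is bounded no matter how quickly new pages are dirtied.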

struct folio *writeback_iter(struct address_space *mapping, struct writeback_control *wbc, struct folio *folio, int *error)¶: iterate folio of a mapping for writeback

Parameters

struct address_space *mapping: address space structure to write
struct writeback_control *wbc: writeback context
struct folio *folio: previously iterated folio (NULL to start)
int *error: in-out pointer for writeback errors (see below)

Description

This function returns the next folio for the writeback operation described by wbc on mapping and should be called in a while loop in the ->writepages implementation.

To start the writeback operation, NULL is passed in the folio argument, and for every subsequent iteration the folio returned previously should be passed back in.

If there was an error in the per-folio writeback inside the writeback_iter() loop, error should be set to the error value.

Once the writeback described in wbc has finished, this function will return NULL and, if there was an error in any iteration, restore it to error.

Note

callers should not manually break out of the loop using break or goto but must keep calling writeback_iter() until it returns NULL.

Return

the folio to write or NULL if the loop is done.
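The calling convention is easiest to see as a loop. The following is a user-space mock, not the kernel implementation: `mock_folio`, `mock_wbc`, `mock_writeback_iter()`, `mock_writepages()` and `mock_demo_error()` are hypothetical, and keeping the first error rather than the latest one is an assumption of this model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mock_folio {
	int write_result;	/* what "writing" this folio returns */
};

struct mock_wbc {
	struct mock_folio *folios;
	size_t nr;
	size_t next;
	int saved_err;
};

/*
 * Stand-in for writeback_iter(): pass NULL to start, then pass back the
 * folio returned previously. An error reported through *error is
 * remembered and restored to *error when the iteration ends.
 */
static struct mock_folio *mock_writeback_iter(struct mock_wbc *wbc,
					      struct mock_folio *prev,
					      int *error)
{
	if (!prev) {
		wbc->next = 0;
		wbc->saved_err = 0;
	} else if (*error && !wbc->saved_err) {
		wbc->saved_err = *error;
	}
	if (wbc->next < wbc->nr)
		return &wbc->folios[wbc->next++];
	*error = wbc->saved_err;
	return NULL;
}

/* Shape of a ->writepages implementation: no break, no goto. */
static int mock_writepages(struct mock_wbc *wbc, size_t *visited)
{
	struct mock_folio *folio = NULL;
	int error = 0;

	while ((folio = mock_writeback_iter(wbc, folio, &error)) != NULL) {
		error = folio->write_result;	/* per-folio writeback */
		(*visited)++;
	}
	return error;
}

/* Three folios, the middle one fails with -EIO. */
static int mock_demo_error(void)
{
	struct mock_folio folios[3] = { { 0 }, { -EIO }, { 0 } };
	struct mock_wbc wbc = { folios, 3, 0, 0 };
	size_t visited = 0;
	int error = mock_writepages(&wbc, &visited);

	assert(visited == 3);	/* the loop ran to completion */
	return error;
}
```

Even though the last folio is written successfully and overwrites the local error with 0, the final call hands the remembered -EIO back, which is why the loop must run until NULL is returned.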

bool filemap_dirty_folio(struct address_space *mapping, struct folio *folio)¶: Mark a folio dirty for filesystems which do not use buffer_heads.

Parameters

struct address_space *mapping: Address space this folio belongs to.
struct folio *folio: Folio to be marked as dirty.

Description

Filesystems which do not use buffer heads should call this function from their dirty_folio address space operation. It ignores the contents of folio_get_private(), so if the filesystem marks individual blocks as dirty, the filesystem should handle that itself.

This is also sometimes used by filesystems which use buffer_heads when a single buffer is being dirtied: we want to set the folio dirty in that case, but not all the buffers. This is a “bottom-up” dirtying, whereas block_dirty_folio() is a “top-down” dirtying.

The caller must ensure this doesn’t race with truncation. Most will simply hold the folio lock, but e.g. zap_pte_range() calls with the folio mapped and the pte lock held, which also locks out truncation.

bool folio_redirty_for_writepage(struct writeback_control *wbc, struct folio *folio)¶: Decline to write a dirty folio.

Parameters

struct writeback_control *wbc: The writeback control.
struct folio *folio: The folio.

Description

When a writepage implementation decides that it doesn’t want to write folio for some reason, it should call this function, unlock folio and return 0.

Return

True if we redirtied the folio. False if someone else dirtied it first.

bool folio_mark_dirty(struct folio *folio)¶: Mark a folio as being modified.

Parameters

struct folio *folio: The folio.

Description

The folio may not be truncated while this function is running. Holding the folio lock is sufficient to prevent truncation, but some callers cannot acquire a sleeping lock. These callers instead hold the page table lock for a page table which contains at least one page in this folio. Truncation will block on the page table lock as it unmaps pages before removing the folio from its mapping.

Return

True if the folio was newly dirtied, false if it was already dirty.
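The return contract is that of an atomic test-and-set. As a user-space illustration only (the flag value and the names `MOCK_FLAG_DIRTY`, `mock_mark_dirty()` and `mock_count_newly_dirtied()` are hypothetical, and the real function does more than flip a bit):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MOCK_FLAG_DIRTY 0x1u

/*
 * Hypothetical model of the return contract: atomically set the dirty
 * bit and report whether this caller was the one that set it.
 */
static bool mock_mark_dirty(atomic_uint *flags)
{
	unsigned int old = atomic_fetch_or(flags, MOCK_FLAG_DIRTY);

	return !(old & MOCK_FLAG_DIRTY);
}

/* Returns how many of n back-to-back calls reported "newly dirtied". */
static int mock_count_newly_dirtied(int n)
{
	atomic_uint flags;
	int count = 0;
	int i;

	assert(n >= 0);
	atomic_init(&flags, 0u);
	for (i = 0; i < n; i++)
		count += mock_mark_dirty(&flags);
	return count;
}
```

However many callers race to dirty the same folio, only one of them sees true, so per-folio accounting tied to the return value happens once.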

void folio_wait_writeback(struct folio *folio)¶: Wait for a folio to finish writeback.

Parameters

struct folio *folio: The folio to wait for.

Description

If the folio is currently being written back to storage, wait for the I/O to complete.

Context

Sleeps. Must be called in process context and with no spinlocks held. Caller should hold a reference on the folio. If the folio is not locked, writeback may start again after writeback has finished.

int folio_wait_writeback_killable(struct folio *folio)¶: Wait for a folio to finish writeback.

Parameters

struct folio *folio: The folio to wait for.

Description

If the folio is currently being written back to storage, wait for the I/O to complete or a fatal signal to arrive.

Context

Sleeps. Must be called in process context and with no spinlocks held. Caller should hold a reference on the folio. If the folio is not locked, writeback may start again after writeback has finished.

Return

0 on success, -EINTR if we get a fatal signal while waiting.

void folio_wait_stable(struct folio *folio)¶: wait for writeback to finish, if necessary.

Parameters

struct folio *folio: The folio to wait on.

Description

This function determines if the given folio is related to a backing device that requires folio contents to be held stable during writeback. If so, then it will wait for any pending writeback to complete.

Context

Sleeps. Must be called in process context and with no spinlocks held. Caller should hold a reference on the folio. If the folio is not locked, writeback may start again after writeback has finished.

Truncate¶

void folio_invalidate(struct folio *folio, size_t offset, size_t length)¶: Invalidate part or all of a folio.

Parameters

struct folio *folio: The folio which is affected.
size_t offset: start of the range to invalidate
size_t length: length of the range to invalidate

Description

folio_invalidate() is called when all or part of the folio has become invalidated by a truncate operation.

folio_invalidate() does not have to release all buffers, but it must ensure that no dirty buffer is left outside offset and that no I/O is underway against any of the blocks which are outside the truncation point, because the caller is about to free (and possibly reuse) those blocks on-disk.

void truncate_inode_pages_range(struct address_space *mapping, loff_t lstart, uoff_t lend)¶: truncate range of pages specified by start & end byte offsets

Parameters

struct address_space *mapping: mapping to truncate
loff_t lstart: offset from which to truncate
uoff_t lend: offset to which to truncate (inclusive)

Description

Truncate the page cache, removing the pages that are between specified offsets (and zeroing out partial pages if lstart or lend + 1 is not page aligned).

Truncate takes two passes - the first pass is nonblocking. It will not block on page locks and it will not block on writeback. The second pass will wait. This is to prevent as much IO as possible in the affected region. The first pass will remove most pages, so the search cost of the second pass is low.

We pass down the cache-hot hint to the page freeing code. Even if the mapping is large, it is probably the case that the final pages are the most recently touched, and freeing happens in ascending file offset order.

Note that since ->invalidate_folio() accepts a range to invalidate, truncate_inode_pages_range() is able to properly handle cases where lend + 1 is not page aligned.
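How an inclusive byte range splits into whole pages to remove and partial pages to zero can be worked through in user space. This sketch assumes 4 KiB pages and single-page folios; `mock_plan_truncate()`, `struct mock_trunc_plan` and the `MOCK_PAGE_*` constants are hypothetical, and the special "truncate to end of file" value of lend is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MOCK_PAGE_SHIFT	12
#define MOCK_PAGE_SIZE	(UINT64_C(1) << MOCK_PAGE_SHIFT)

struct mock_trunc_plan {
	uint64_t first_full;	/* first page index removed entirely */
	uint64_t end_full;	/* one past the last page removed entirely */
	bool zero_head;		/* page holding lstart is only partly covered */
	bool zero_tail;		/* page holding lend is only partly covered */
};

/*
 * Split the inclusive byte range [lstart, lend] into whole pages and
 * partial pages. If end_full <= first_full, no page is removed whole
 * and only zeroing takes place.
 */
static struct mock_trunc_plan mock_plan_truncate(uint64_t lstart,
						 uint64_t lend)
{
	struct mock_trunc_plan p;

	assert(lstart <= lend);
	assert(lend != UINT64_MAX);	/* "to end of file" is not modelled */
	p.first_full = (lstart + MOCK_PAGE_SIZE - 1) >> MOCK_PAGE_SHIFT;
	p.end_full = (lend + 1) >> MOCK_PAGE_SHIFT;
	p.zero_head = (lstart & (MOCK_PAGE_SIZE - 1)) != 0;
	p.zero_tail = ((lend + 1) & (MOCK_PAGE_SIZE - 1)) != 0;
	return p;
}
```

For lstart = 1000 and lend = 10000, page 1 is removed whole, the tail of page 0 (from byte 1000) is zeroed, and the head of page 2 (through byte 10000) is zeroed.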

void truncate_inode_pages(struct address_space *mapping, loff_t lstart)¶: truncate all the pages from an offset

Parameters

struct address_space *mapping: mapping to truncate
loff_t lstart: offset from which to truncate

Description

Called under (and serialised by) inode->i_rwsem and mapping->invalidate_lock.

Note

When this function returns, there can be a page in the process of deletion (inside __filemap_remove_folio()) in the specified range. Thus mapping->nrpages can be non-zero when this function returns even after truncation of the whole mapping.

void truncate_inode_pages_final(struct address_space *mapping)¶: truncate all pages before inode dies

Parameters

struct address_space *mapping: mapping to truncate

Description

Called under (and serialized by) inode->i_rwsem.

Filesystems have to use this in the .evict_inode path to inform the VM that this is the final truncate and the inode is going away.

unsigned long invalidate_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t end)¶: Invalidate all clean, unlocked cache of one inode

Parameters

struct address_space *mapping: the address_space which holds the cache to invalidate
pgoff_t start: the offset ‘from’ which to invalidate
pgoff_t end: the offset ‘to’ which to invalidate (inclusive)

Description

This function removes pages that are clean, unmapped and unlocked, as well as shadow entries. It will not block on IO activity.

If you want to remove all the pages of one inode, regardless of their use and writeback state, use truncate_inode_pages().

Return

The number of indices that had their contents invalidated

int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t start, pgoff_t end)¶: remove range of pages from an address_space

Parameters

struct address_space *mapping: the address_space
pgoff_t start: the page offset ‘from’ which to invalidate
pgoff_t end: the page offset ‘to’ which to invalidate (inclusive)

Description

Any pages which are found to be mapped into pagetables are unmapped prior to invalidation.

Return

-EBUSY if any pages could not be invalidated.

int invalidate_inode_pages2(struct address_space *mapping)¶: remove all pages from an address_space

Parameters

struct address_space *mapping: the address_space

Description

Any pages which are found to be mapped into pagetables are unmapped prior to invalidation.

Return

-EBUSY if any pages could not be invalidated.

void truncate_pagecache(struct inode *inode, loff_t newsize)¶: unmap and remove pagecache that has been truncated

Parameters

struct inode *inode: inode
loff_t newsize: new file size

Description

inode’s new i_size must already be written before truncate_pagecache is called.

This function should typically be called before the filesystem releases resources associated with the freed range (e.g. deallocates blocks). This way, pagecache will always stay logically coherent with on-disk format, and the filesystem would not have to deal with situations such as writepage being called for a page that has already had its underlying blocks deallocated.

void truncate_setsize(struct inode *inode, loff_t newsize)¶: update inode and pagecache for a new file size

Parameters

struct inode *inode: inode
loff_t newsize: new file size

Description

truncate_setsize updates i_size and performs pagecache truncation (if necessary) to newsize. It will typically be called from the filesystem’s setattr function when ATTR_SIZE is passed in.

Must be called with a lock serializing truncates and writes (generally i_rwsem but e.g. xfs uses a different lock) and before all filesystem specific block truncation has been performed.

void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)¶: update pagecache after extension of i_size

Parameters

struct inode *inode: inode for which i_size was extended
loff_t from: original inode size
loff_t to: new inode size

Description

Handle extension of inode size either caused by extending truncate or by write starting after current i_size. We mark the page straddling current i_size RO so that page_mkwrite() is called on the first write access to the page. The filesystem will update its per-block information before user writes to the page via mmap after the i_size has been changed.

The function must be called after i_size is updated so that page fault coming after we unlock the folio will already see the new i_size. The function must be called while we still hold i_rwsem - this not only makes sure i_size is stable but also that userspace cannot observe new i_size value before we are prepared to store mmap writes at new inode size.

void truncate_pagecache_range(struct inode *inode, loff_t lstart, loff_t lend)¶: unmap and remove pagecache that is hole-punched

Parameters

struct inode *inode: inode
loff_t lstart: offset of beginning of hole
loff_t lend: offset of last byte of hole

Description

This function should typically be called before the filesystem releases resources associated with the freed range (e.g. deallocates blocks). This way, pagecache will always stay logically coherent with on-disk format, and the filesystem would not have to deal with situations such as writepage being called for a page that has already had its underlying blocks deallocated.

void filemap_set_wb_err(struct address_space *mapping, int err)¶: set a writeback error on an address_space

Parameters

struct address_space *mapping: mapping in which to set writeback error
int err: error to be set in mapping

Description

When writeback fails in some way, we must record that error so that userspace can be informed when fsync and the like are called. We endeavor to report errors on any file that was open at the time of the error. Some internal callers also need to know when writeback errors have occurred.

When a writeback error occurs, most filesystems will want to call filemap_set_wb_err to record the error in the mapping so that it will be automatically reported whenever fsync is called on the file.

int filemap_check_wb_err(struct address_space *mapping, errseq_t since)¶: has an error occurred since the mark was sampled?

Parameters

struct address_space *mapping: mapping to check for writeback errors
errseq_t since: previously-sampled errseq_t

Description

Grab the errseq_t value from the mapping, and see if it has changed “since” the given value was sampled.

If it has then report the latest error set, otherwise return 0.

errseq_t filemap_sample_wb_err(struct address_space *mapping)¶: sample the current errseq_t to test for later errors

Parameters

struct address_space *mapping: mapping to be sampled

Description

Writeback errors are always reported relative to a particular sample point in the past. This function provides those sample points.
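The sample-then-check pattern can be sketched in user space. This is a deliberately simplified model, not the kernel's errseq_t encoding: `struct mock_errseq` keeps a plain change counter next to the latest error, and `mock_set_wb_err()`, `mock_sample_wb_err()`, `mock_check_wb_err()` and `mock_demo_errseq()` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

/* Simplified stand-in for errseq_t: change counter plus latest error. */
struct mock_errseq {
	unsigned int seq;
	int err;
};

/* Record a writeback error in the mapping's error state. */
static void mock_set_wb_err(struct mock_errseq *m, int err)
{
	assert(err < 0);
	m->err = err;
	m->seq++;
}

/* Take a sample point to compare against later. */
static struct mock_errseq mock_sample_wb_err(const struct mock_errseq *m)
{
	return *m;
}

/* Report the latest error if anything changed since the sample, else 0. */
static int mock_check_wb_err(const struct mock_errseq *m,
			     struct mock_errseq since)
{
	return m->seq != since.seq ? m->err : 0;
}

/* Sample, optionally inject -EIO, then check against the sample. */
static int mock_demo_errseq(int inject)
{
	struct mock_errseq mapping = { 0, 0 };
	struct mock_errseq since = mock_sample_wb_err(&mapping);

	if (inject)
		mock_set_wb_err(&mapping, -EIO);
	return mock_check_wb_err(&mapping, since);
}
```

Because each observer compares against its own sample, an error recorded once can be reported to every observer that sampled before it occurred.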

errseq_t file_sample_sb_err(struct file *file)¶: sample the current errseq_t to test for later errors

Parameters

struct file *file: file pointer to be sampled

Description

Grab the most current superblock-level errseq_t value for the given struct file.

void mapping_set_error(struct address_space *mapping, int error)¶: record a writeback error in the address_space

Parameters

struct address_space *mapping: the mapping in which an error should be set
int error: the error to set in the mapping

Description

When writeback fails in some way, we must record that error so that userspace can be informed when fsync and the like are called. We endeavor to report errors on any file that was open at the time of the error. Some internal callers also need to know when writeback errors have occurred.

When a writeback error occurs, most filesystems will want to call mapping_set_error to record the error in the mapping so that it can be reported when the application calls fsync(2).

void mapping_set_large_folios(struct address_space *mapping)¶: Indicate the file supports large folios.

Parameters

struct address_space *mapping: The address space of the file.

Description

The filesystem should call this function in its inode constructor to indicate that the VFS can use large folios to cache the contents of the file.

Context

This should not be called while the inode is active as it is non-atomic.

pgoff_t mapping_align_index(const struct address_space *mapping, pgoff_t index)¶: Align index for this mapping.

Parameters

const struct address_space *mapping: The address_space.
pgoff_t index: The page index.

Description

The index of a folio must be naturally aligned. If you are adding a new folio to the page cache and need to know what index to give it, call this function.
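Natural alignment here is taken to mean that a folio of 2^order pages starts at an index that is a multiple of 2^order. A minimal user-space sketch of that rounding follows; `mock_align_index()` is a hypothetical helper that takes the order explicitly, whereas the real function derives the alignment from the mapping.

```c
#include <assert.h>
#include <limits.h>

/* Round index down to a multiple of 2^order pages. */
static unsigned long mock_align_index(unsigned long index, unsigned int order)
{
	assert(order < sizeof(unsigned long) * CHAR_BIT);
	return index & ~((1UL << order) - 1);
}
```

For example, with order 2 (four-page folios) index 13 aligns down to 12, while with order 0 every index is already aligned.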

struct address_space *folio_flush_mapping(struct folio *folio)¶: Find the file mapping this folio belongs to.

Parameters

struct folio *folio: The folio.

Description

For folios which are in the page cache, return the mapping that this folio belongs to. Anonymous folios return NULL, even if they’re in the swap cache. Other kinds of folio also return NULL.

This is ONLY used by architecture cache flushing code. If you aren’t writing cache flushing code, you want either folio_mapping() or folio_file_mapping().

struct inode *folio_inode(struct folio *folio)¶: Get the host inode for this folio.

Parameters

struct folio *folio: The folio.

Description

For folios which are in the page cache, return the inode that this folio belongs to.

Do not call this for folios which aren’t in the page cache.

void folio_attach_private(struct folio *folio, void *data)¶: Attach private data to a folio.

Parameters

struct folio *folio: Folio to attach data to.
void *data: Data to attach to folio.

Description

Attaching private data to a folio increments the folio’s reference count. The data must be detached before the folio will be freed.

void *folio_change_private(struct folio *folio, void *data)¶: Change private data on a folio.

Parameters

struct folio *folio: Folio to change the data on.
void *data: Data to set on the folio.

Description

Change the private data attached to a folio and return the old data. The folio must previously have had data attached and the data must be detached before the folio will be freed.

Return

Data that was previously attached to the folio.

void *folio_detach_private(struct folio *folio)¶: Detach private data from a folio.

Parameters

struct folio *folio: Folio to detach data from.

Description

Removes the data that was previously attached to the folio and decrements the refcount on the folio.

Return

Data that was attached to the folio.
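The reference-count bookkeeping described for the attach, change and detach helpers can be modelled in userspace. This is only a sketch: `struct toy_folio` and the `toy_*` functions are hypothetical, and the behaviour of detaching when nothing is attached (returning NULL without touching the count) is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

/* Toy folio: only the two fields this sketch needs (hypothetical layout). */
struct toy_folio {
    int refcount;
    void *private;
};

static void toy_attach_private(struct toy_folio *f, void *data)
{
    f->refcount++;          /* attaching takes a reference */
    f->private = data;
}

static void *toy_change_private(struct toy_folio *f, void *data)
{
    void *old = f->private;

    assert(old != NULL);    /* data must already be attached */
    f->private = data;      /* the reference count is unchanged */
    return old;
}

static void *toy_detach_private(struct toy_folio *f)
{
    void *old = f->private;

    if (old == NULL)
        return NULL;        /* nothing attached, nothing to drop */
    f->private = NULL;
    f->refcount--;          /* drop the reference taken at attach time */
    return old;
}
```

The point of the model is that attach and detach are a matched pair with respect to the reference count, while change swaps the pointer without touching it.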

type fgf_t¶: Flags for getting folios from the page cache.

Description

Most users of the page cache will not need to use these flags; there are convenience functions such as filemap_get_folio() and filemap_lock_folio(). For users which need more control over exactly what is done with the folios, these flags to __filemap_get_folio() are available.

FGP_ACCESSED - The folio will be marked accessed.
FGP_LOCK - The folio is returned locked.
FGP_CREAT - If no folio is present then a new folio is allocated, added to the page cache and the VM’s LRU list. The folio is returned locked.
FGP_FOR_MMAP - The caller wants to do its own locking dance if the folio is already in cache. If the folio was allocated, unlock it before returning so the caller can do the same dance.
FGP_WRITE - The folio will be written to by the caller.
FGP_NOFS - __GFP_FS will get cleared in gfp.
FGP_NOWAIT - Don’t block on the folio lock.
FGP_STABLE - Wait for the folio to be stable (finished writeback)
FGP_DONTCACHE - Uncached buffered IO
FGP_WRITEBEGIN - The flags to use in a filesystem write_begin() implementation.

fgf_t fgf_set_order(size_t size)¶: Encode a length in the fgf_t flags.

Parameters

size_t size: The suggested size of the folio to create.

Description

The caller of __filemap_get_folio() can use this to suggest a preferred size for the folio that is created. If there is already a folio at the index, it will be returned, no matter what its size. If a folio is freshly created, it may be of a different size than requested due to alignment constraints, memory pressure, or the presence of other folios at nearby indices.
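One way to picture "encoding a length in the flags" is to store the folio order (log2 of the size in pages) in the high bits of the flag word. The sketch below does that in userspace; the 4 KiB page size, the bit position 26 and all `toy_*` names are assumptions for illustration and should not be read as the kernel's actual layout.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT  12  /* assumption: 4 KiB pages */
#define TOY_ORDER_SHIFT 26  /* assumption: order is stored above the flag bits */

typedef unsigned int toy_fgf_t;

/* floor(log2(size)) for size > 0. */
static unsigned int toy_ilog2(size_t size)
{
    unsigned int shift = 0;

    assert(size > 0);
    while (size >>= 1)
        shift++;
    return shift;
}

/* Encode the suggested folio size as an order in the high bits. */
static toy_fgf_t toy_set_order(size_t size)
{
    unsigned int shift;

    if (size == 0)
        return 0;
    shift = toy_ilog2(size);
    if (shift <= TOY_PAGE_SHIFT)
        return 0;           /* one page or less: no size hint */
    return (toy_fgf_t)(shift - TOY_PAGE_SHIFT) << TOY_ORDER_SHIFT;
}

/* Recover the suggested order from the flag word. */
static unsigned int toy_get_order(toy_fgf_t flags)
{
    return flags >> TOY_ORDER_SHIFT;
}
```

Because the hint occupies its own bits, it can be OR-ed together with the other flags, and a size that is not a power of two is rounded down to the nearest order.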

struct folio *write_begin_get_folio(const struct kiocb *iocb, struct address_space *mapping, pgoff_t index, size_t len)¶: Get folio for write_begin with flags.

Parameters

const struct kiocb *iocb: The kiocb passed from write_begin (may be NULL).
struct address_space *mapping: The address space to search.
pgoff_t index: The page cache index.
size_t len: Length of data being written.

Description

This is a helper for filesystem write_begin() implementations. It wraps __filemap_get_folio(), setting appropriate flags in the write begin context.

Return

A folio or an ERR_PTR.

struct folio *filemap_get_folio(struct address_space *mapping, pgoff_t index)¶: Find and get a folio.

Parameters

struct address_space *mapping: The address_space to search.
pgoff_t index: The page index.

Description

Looks up the page cache entry at mapping & index. If a folio is present, it is returned with an increased refcount.

Return

A folio or ERR_PTR(-ENOENT) if there is no folio in the cache for this index. Will not return a shadow, swap or DAX entry.
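Since a miss is reported as an error pointer rather than NULL, callers have to test the result with an IS_ERR-style check. The following userspace sketch imitates that convention; the `toy_*` helpers and the four-slot cache are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Userspace imitations of the error-pointer helpers. */
static void *toy_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static long toy_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static int toy_is_err(const void *ptr)
{
    /* The top TOY_MAX_ERRNO addresses are reserved for error codes. */
    return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_folio {
    unsigned long index;
    int refcount;
};

#define TOY_CACHE_SLOTS 4
static struct toy_folio *toy_cache[TOY_CACHE_SLOTS];

/* Take a reference on a hit; return an error pointer on a miss. */
static struct toy_folio *toy_get_folio(unsigned long index)
{
    if (index >= TOY_CACHE_SLOTS || toy_cache[index] == NULL)
        return toy_err_ptr(-ENOENT);
    toy_cache[index]->refcount++;
    return toy_cache[index];
}
```

A caller in this model checks `toy_is_err()` before dereferencing the result and uses `toy_ptr_err()` to recover the error code; comparing the result against NULL would miss the failure.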

struct folio *filemap_lock_folio(struct address_space *mapping, pgoff_t index)¶: Find and lock a folio.

Parameters

struct address_space *mapping: The address_space to search.
pgoff_t index: The page index.

Description

Looks up the page cache entry at mapping & index. If a folio is present, it is returned locked with an increased refcount.

Context

May sleep.

Return

A folio or ERR_PTR(-ENOENT) if there is no folio in the cache for this index. Will not return a shadow, swap or DAX entry.

struct folio *filemap_grab_folio(struct address_space *mapping, pgoff_t index)¶: grab a folio from the page cache

Parameters

struct address_space *mapping: The address space to search
pgoff_t index: The page index

Description

Looks up the page cache entry at mapping & index. If no folio is found, a new folio is created. The folio is locked, marked as accessed, and returned.

Return

A found or created folio. ERR_PTR(-ENOMEM) if no folio is found and failed to create a folio.

struct page *find_get_page(struct address_space *mapping, pgoff_t offset)¶: find and get a page reference

Parameters

struct address_space *mapping: the address_space to search
pgoff_t offset: the page index

Description

Looks up the page cache slot at mapping & offset. If there is a page cache page, it is returned with an increased refcount.

Otherwise, NULL is returned.

struct page *find_lock_page(struct address_space *mapping, pgoff_t index)¶: locate, pin and lock a pagecache page

Parameters

struct address_space *mapping: the address_space to search
pgoff_t index: the page index

Description

Looks up the page cache entry at mapping & index. If there is a page cache page, it is returned locked and with an increased refcount.

Context

May sleep.

Return

A struct page or NULL if there is no page in the cache for this index.

struct page *find_or_create_page(struct address_space *mapping, pgoff_t index, gfp_t gfp_mask)¶: locate or add a pagecache page

Parameters

struct address_space *mapping: the page’s address_space
pgoff_t index: the page’s index into the mapping
gfp_t gfp_mask: page allocation mode

Description

Looks up the page cache slot at mapping & index. If there is a page cache page, it is returned locked and with an increased refcount.

If the page is not present, a new page is allocated using gfp_mask and added to the page cache and the VM’s LRU list. The page is returned locked and with an increased refcount.

On memory exhaustion, NULL is returned.

find_or_create_page() may sleep, even if gfp_mask specifies an atomic allocation!

struct page *grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)¶: returns locked page at given index in given cache

Parameters

struct address_space *mapping: target address_space
pgoff_t index: the page index

Description

Returns the locked page at the given index in the given cache, creating it if needed, but does not wait if the page is locked or to reclaim memory. This is intended for speculative data generators, where the data can be regenerated if the page couldn’t be grabbed. This routine should be safe to call while holding the lock for another page.

Clear __GFP_FS when allocating the page to avoid recursion into the fs and deadlock against the caller’s locked page.

pgoff_t folio_next_index(const struct folio *folio)¶: Get the index of the next folio.

Parameters

const struct folio *folio: The current folio.

Return

The index of the folio which follows this folio in the file.

loff_t folio_next_pos(const struct folio *folio)¶: Get the file position of the next folio.

Parameters

const struct folio *folio: The current folio.

Return

The position of the folio which follows this folio in the file.
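Both "next" helpers reduce to simple arithmetic on the folio's first index and its size in pages. A minimal userspace sketch, assuming 4 KiB pages and a hypothetical `struct toy_folio`:

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

struct toy_folio {
    unsigned long index;     /* index of the first page, in pages */
    unsigned long nr_pages;  /* number of pages in the folio */
};

/* Index of the first page after this folio. */
static unsigned long toy_next_index(const struct toy_folio *f)
{
    assert(f->nr_pages > 0);
    return f->index + f->nr_pages;
}

/* Byte position of the first byte after this folio. */
static long long toy_next_pos(const struct toy_folio *f)
{
    return (long long)toy_next_index(f) << TOY_PAGE_SHIFT;
}
```

For a four-page folio starting at index 8, the next index is 12 and the next byte position is 12 pages into the file.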

struct page *folio_file_page(struct folio *folio, pgoff_t index)¶: The page for a particular index.

Parameters

struct folio *folio: The folio which contains this index.
pgoff_t index: The index we want to look up.

Description

Sometimes after looking up a folio in the page cache, we need to obtain the specific page for an index (eg a page fault).

Return

The page containing the file data for this index.

bool folio_contains(const struct folio *folio, pgoff_t index)¶: Does this folio contain this index?

Parameters

const struct folio *folio: The folio.
pgoff_t index: The page index within the file.

Context

The caller should have the folio locked and ensure e.g., shmem did not move this folio to the swap cache.

Return

true if the folio contains index, false otherwise.
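A containment test of this kind can be written as a single unsigned comparison. The sketch below is a userspace model with a hypothetical `struct toy_folio`; whether the kernel uses exactly this expression is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_folio {
    unsigned long index;     /* index of the first page */
    unsigned long nr_pages;  /* number of pages in the folio */
};

/*
 * For index < f->index the unsigned subtraction wraps around to a very
 * large value, so one comparison rejects indices on either side.
 */
static bool toy_folio_contains(const struct toy_folio *f, unsigned long index)
{
    assert(f->nr_pages > 0);
    return index - f->index < f->nr_pages;
}
```

The trick relies on the index type being unsigned; with a signed type the lower bound would need its own check.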

pgoff_t page_pgoff(const struct folio *folio, const struct page *page)¶: Calculate the logical page offset of this page.

Parameters

const struct folio *folio: The folio containing this page.
const struct page *page: The page which we need the offset of.

Description

For file pages, this is the offset from the beginning of the file in units of PAGE_SIZE. For anonymous pages, this is the offset from the beginning of the anon_vma in units of PAGE_SIZE. This will return nonsense for KSM pages.

Context

Caller must have a reference on the folio or otherwise prevent it from being split or freed.

Return

The offset in units of PAGE_SIZE.

loff_t folio_pos(const struct folio *folio)¶: Returns the byte position of this folio in its file.

Parameters

const struct folio *folio: The folio.

bool folio_trylock(struct folio *folio)¶: Attempt to lock a folio.

Parameters

struct folio *folio: The folio to attempt to lock.

Description

Sometimes it is undesirable to wait for a folio to be unlocked (eg when the locks are being taken in the wrong order, or if making progress through a batch of folios is more important than processing them in order). Usually folio_lock() is the correct function to call.

Context

Any context.

Return

Whether the lock was successfully acquired.
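The "make progress through a batch" case can be illustrated with a userspace trylock built on a C11 `atomic_flag`. Everything here (`struct toy_folio`, the `toy_*` functions, the `processed` marker) is hypothetical and only models the control flow, not the kernel's locking primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio {
    atomic_flag locked;
    int processed;
};

static bool toy_trylock(struct toy_folio *f)
{
    /* test_and_set returns the previous state: false means we now own it. */
    return !atomic_flag_test_and_set(&f->locked);
}

static void toy_unlock(struct toy_folio *f)
{
    atomic_flag_clear(&f->locked);
}

/* Process every folio that can be locked without waiting; count the rest. */
static size_t toy_process_batch(struct toy_folio *batch, size_t n)
{
    size_t skipped = 0;

    for (size_t i = 0; i < n; i++) {
        if (!toy_trylock(&batch[i])) {
            skipped++;      /* held elsewhere: move on instead of waiting */
            continue;
        }
        batch[i].processed = 1;
        toy_unlock(&batch[i]);
    }
    return skipped;
}
```

A caller that cares about the skipped folios would revisit them later, possibly with a blocking lock once ordering is no longer a concern.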

void folio_lock(struct folio *folio)¶: Lock this folio.

Parameters

struct folio *folio: The folio to lock.

Description

The folio lock protects against many things, probably more than it should. It is primarily held while a folio is being brought uptodate, either from its backing file or from swap. It is also held while a folio is being truncated from its address_space, so holding the lock is sufficient to keep folio->mapping stable.

The folio lock is also held while write() is modifying the page to provide POSIX atomicity guarantees (as long as the write does not cross a page boundary). Other modifications to the data in the folio do not hold the folio lock and can race with writes, eg DMA and stores to mapped pages.

Context

May sleep. If you need to acquire the locks of two or more folios, they must be in order of ascending index, if they are in the same address_space. If they are in different address_spaces, acquire the lock of the folio which belongs to the address_space which has the lowest address in memory first.
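The ordering rule for taking two folio locks can be expressed as a comparison function. This is a userspace sketch with hypothetical types; it only encodes the rule stated above and does not take any locks.

```c
#include <assert.h>
#include <stdint.h>

struct toy_mapping {
    int unused;
};

struct toy_folio {
    struct toy_mapping *mapping;
    unsigned long index;
};

/*
 * Return nonzero if a must be locked before b:
 *  - same mapping: the folio with the lower index goes first;
 *  - different mappings: the folio whose mapping has the lower
 *    address in memory goes first.
 */
static int toy_lock_first(const struct toy_folio *a, const struct toy_folio *b)
{
    assert(a != b);     /* a folio is never locked twice */

    if (a->mapping == b->mapping)
        return a->index < b->index;
    return (uintptr_t)a->mapping < (uintptr_t)b->mapping;
}
```

Applying one global order like this to every pair is what prevents two tasks from each holding the lock the other one is waiting for.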

void lock_page(struct page *page)¶: Lock the folio containing this page.

Parameters

struct page *page: The page to lock.

Description

See folio_lock() for a description of what the lock protects. This is a legacy function and new code should probably use folio_lock() instead.

Context

May sleep. Pages in the same folio share a lock, so do not attempt to lock two pages which share a folio.

int folio_lock_killable(struct folio *folio)¶: Lock this folio, interruptible by a fatal signal.

Parameters

struct folio *folio: The folio to lock.

Description

Attempts to lock the folio, like folio_lock(), except that the sleep to acquire the lock is interruptible by a fatal signal.

Context

May sleep; see folio_lock().

Return

0 if the lock was acquired; -EINTR if a fatal signal was received.

bool filemap_range_needs_writeback(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶: check if range potentially needs writeback

Parameters

struct address_space *mapping: address space within which to check
loff_t start_byte: offset in bytes where the range starts
loff_t end_byte: offset in bytes where the range ends (inclusive)

Description

Find at least one page in the range supplied, usually used to check if direct writing in this range will trigger a writeback. Used by O_DIRECT read/write with IOCB_NOWAIT, to see if the caller needs to do filemap_write_and_wait_range() before proceeding.

Return

true if the caller should do filemap_write_and_wait_range() before doing O_DIRECT to a page in this range, false otherwise.

struct readahead_control¶: Describes a readahead request.

Definition:

struct readahead_control {
    struct file *file;
    struct address_space *mapping;
    struct file_ra_state *ra;
};

Members

file: The file, used primarily by network filesystems for authentication. May be NULL if invoked internally by the filesystem.
mapping: Readahead this filesystem object.
ra: File readahead state. May be NULL.

Description

A readahead request is for consecutive pages. Filesystems which implement the ->readahead method should call readahead_folio() or __readahead_batch() in a loop and attempt to start reads into each folio in the request.

Most of the fields in this struct are private and should be accessed by the functions below.

void page_cache_sync_readahead(struct address_space *mapping, struct file_ra_state *ra, struct file *file, pgoff_t index, unsigned long req_count)¶: generic file readahead

Parameters

struct address_space *mapping: address_space which holds the pagecache and I/O vectors
struct file_ra_state *ra: file_ra_state which holds the readahead state
struct file *file: Used by the filesystem for authentication.
pgoff_t index: Index of first page to be read.
unsigned long req_count: Total number of pages being read by the caller.

Description

page_cache_sync_readahead() should be called when a cache miss happened: it will submit the read. The readahead logic may decide to piggyback more pages onto the read request if access patterns suggest it will improve performance.

void page_cache_async_readahead(struct address_space *mapping, struct file_ra_state *ra, struct file *file, struct folio *folio, unsigned long req_count)¶: file readahead for marked pages

Parameters

struct address_space *mapping: address_space which holds the pagecache and I/O vectors
struct file_ra_state *ra: file_ra_state which holds the readahead state
struct file *file: Used by the filesystem for authentication.
struct folio *folio: The folio which triggered the readahead call.
unsigned long req_count: Total number of pages being read by the caller.

Description

page_cache_async_readahead() should be called when a page is used which is marked as PageReadahead; this is a marker to suggest that the application has used up enough of the readahead window that we should start pulling in more pages.

struct folio *readahead_folio(struct readahead_control *ractl)¶: Get the next folio to read.

Parameters

struct readahead_control *ractl: The current readahead request.

Context

The folio is locked. The caller should unlock the folio once all I/O to that folio has completed.

Return

A pointer to the next folio, or NULL if we are done.

loff_t readahead_pos(const struct readahead_control *rac)¶: The byte offset into the file of this readahead request.

Parameters

const struct readahead_control *rac: The readahead request.

size_t readahead_length(const struct readahead_control *rac)¶: The number of bytes in this readahead request.

Parameters

const struct readahead_control *rac: The readahead request.

pgoff_t readahead_index(const struct readahead_control *rac)¶: The index of the first page in this readahead request.

Parameters

const struct readahead_control *rac: The readahead request.

unsigned int readahead_count(const struct readahead_control *rac)¶: The number of pages in this readahead request.

Parameters

const struct readahead_control *rac: The readahead request.

size_t readahead_batch_length(const struct readahead_control *rac)¶: The number of bytes in the current batch.

Parameters

const struct readahead_control *rac: The readahead request.
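The byte-based accessors follow from the page-based ones by multiplying by the page size. A minimal userspace sketch, assuming 4 KiB pages and a hypothetical `struct toy_rac` (the corresponding fields of the real structure are private):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

/* Toy readahead request: first page index and number of pages. */
struct toy_rac {
    unsigned long index;
    unsigned int nr_pages;
};

/* Byte offset into the file at which the request starts. */
static long long toy_readahead_pos(const struct toy_rac *rac)
{
    assert(rac != NULL);
    return (long long)rac->index * TOY_PAGE_SIZE;
}

/* Number of bytes covered by the request. */
static size_t toy_readahead_length(const struct toy_rac *rac)
{
    assert(rac != NULL);
    return (size_t)rac->nr_pages * TOY_PAGE_SIZE;
}
```

A request for five pages starting at index 3 therefore starts three pages into the file and covers five pages' worth of bytes.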

ssize_t folio_mkwrite_check_truncate(const struct folio *folio, const struct inode *inode)¶: check if folio was truncated

Parameters

const struct folio *folio: the folio to check
const struct inode *inode: the inode to check the folio against

Return

the number of bytes in the folio up to EOF, or -EFAULT if the folio was truncated.
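The return value distinguishes three situations: the folio lies wholly before EOF, it straddles EOF, or it is no longer part of the file. The userspace model below sketches that decision, assuming 4 KiB pages; the `attached` flag and the `toy_*` names are hypothetical simplifications of the real checks.

```c
#include <assert.h>
#include <errno.h>

#define TOY_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

struct toy_folio {
    int attached;            /* nonzero while still part of the file's mapping */
    unsigned long index;     /* index of the first page */
    unsigned long nr_pages;  /* number of pages in the folio */
};

/* Bytes of the folio that lie before EOF, or -EFAULT if it was truncated. */
static long toy_mkwrite_check_truncate(const struct toy_folio *f, long long i_size)
{
    long long start = (long long)f->index << TOY_PAGE_SHIFT;
    long long end = start + ((long long)f->nr_pages << TOY_PAGE_SHIFT);

    assert(i_size >= 0);
    if (!f->attached)
        return -EFAULT;             /* removed from the mapping */
    if (end <= i_size)
        return (long)(end - start); /* wholly inside EOF */
    if (start >= i_size)
        return -EFAULT;             /* wholly past EOF */
    return (long)(i_size - start);  /* straddles EOF */
}
```

A page_mkwrite-style caller in this model would treat a negative result as "the folio is gone" and otherwise only touch the returned number of bytes.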

unsigned int i_blocks_per_folio(const struct inode *inode, const struct folio *folio)¶: How many blocks fit in this folio.

Parameters

const struct inode *inode: The inode which contains the blocks.
const struct folio *folio: The folio.

Description

If the block size is larger than the size of this folio, return zero.

Context

The caller should hold a refcount on the folio to prevent it from being split.

Return

The number of filesystem blocks covered by this folio.
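Counting blocks per folio is a shift by the block size in bits, which also explains the zero result for blocks larger than the folio. A userspace sketch with hypothetical parameter names (the kernel function takes the inode and the folio instead):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * blkbits:    log2 of the filesystem block size
 * folio_size: size of the folio in bytes
 */
static unsigned int toy_blocks_per_folio(unsigned int blkbits, size_t folio_size)
{
    assert(blkbits < sizeof(size_t) * CHAR_BIT);
    return (unsigned int)(folio_size >> blkbits);
}
```

A 16 KiB folio holds sixteen 1 KiB blocks, while a 4 KiB folio with 64 KiB blocks yields zero because the shift discards every bit of the folio size.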

Memory pools¶

void mempool_exit(struct mempool *pool)¶: exit a mempool initialized with mempool_init()

Parameters

struct mempool *pool: pointer to the memory pool which was initialized with mempool_init().

Description

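Free all reserved elements in pool and the array that holds them. The pool structure itself is not freed, since it was initialized in place with mempool_init(). This function only sleeps if the free_fn() function sleeps.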

May be called on a zeroed but uninitialized mempool (i.e. allocated with kzalloc()).

void mempool_destroy(struct mempool *pool)¶: deallocate a memory pool

Parameters

struct mempool *pool: pointer to the memory pool which was allocated via mempool_create().

Description

Free all reserved elements in pool and pool itself. This function only sleeps if the free_fn() function sleeps.

int mempool_init(struct mempool *pool, int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t *free_fn, void *pool_data)¶: initialize a memory pool

Parameters

struct mempool *pool: pointer to the memory pool that should be initialized
int min_nr: the minimum number of elements guaranteed to be allocated for this pool.
mempool_alloc_t *alloc_fn: user-defined element-allocation function.
mempool_free_t *free_fn: user-defined element-freeing function.
void *pool_data: optional private data available to the user-defined functions.

Description

Like mempool_create(), but initializes the pool in place (i.e. embedded in another structure).

Return

0 on success, negative error code otherwise.

struct mempool *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t *free_fn, void *pool_data, gfp_t gfp_mask, int node_id)¶: create a memory pool

Parameters

int min_nr: the minimum number of elements guaranteed to be allocated for this pool.
mempool_alloc_t *alloc_fn: user-defined element-allocation function.
mempool_free_t *free_fn: user-defined element-freeing function.
void *pool_data: optional private data available to the user-defined functions.
gfp_t gfp_mask: memory allocation flags
int node_id: numa node to allocate on

Description

This function creates and allocates a guaranteed-size, preallocated memory pool. The pool can be used from the mempool_alloc() and mempool_free() functions. This function might sleep. Both the alloc_fn() and the free_fn() functions might sleep, as long as the mempool_alloc() function is not called from IRQ contexts.

Return

pointer to the created memory pool object or NULL on error.

int mempool_resize(struct mempool *pool, int new_min_nr)¶: resize an existing memory pool

Parameters

struct mempool *pool: pointer to the memory pool which was allocated via mempool_create().
int new_min_nr: the new minimum number of elements guaranteed to be allocated for this pool.

Description

This function shrinks/grows the pool. In the case of growing, it cannot be guaranteed that the pool will be grown to the new size immediately, but new mempool_free() calls will refill it. This function may sleep.

Note, the caller must guarantee that no mempool_destroy is called while this function is running. mempool_alloc() & mempool_free() might be called (eg. from IRQ contexts) while this function executes.

Return

0 on success, negative error code otherwise.

int mempool_alloc_bulk(struct mempool *pool, void **elems, unsigned int count, unsigned int allocated)¶: allocate multiple elements from a memory pool

Parameters

struct mempool *pool: pointer to the memory pool
void **elems: partially or fully populated elements array
unsigned int count: number of entries in elems that need to be allocated
unsigned int allocated: number of entries in elems already allocated

Description

Allocate elements for each slot in elems that is still NULL (not yet populated). This is done by first calling into the alloc_fn supplied at pool initialization time, and dipping into the reserved pool when alloc_fn fails to allocate an element.

On return all count elements in elems will be populated.

Return

Always 0. If it wasn’t for %$#^$ alloc tags, it would return void.

void *mempool_alloc(struct mempool *pool, gfp_t gfp_mask)¶: allocate an element from a memory pool

Parameters

struct mempool *pool: pointer to the memory pool
gfp_t gfp_mask: GFP_* flags. __GFP_ZERO is not supported.

Description

Allocate an element from pool. This is done by first calling into the alloc_fn supplied at pool initialization time, and dipping into the reserved pool when alloc_fn fails to allocate an element.

This function only sleeps if the alloc_fn callback sleeps, or when waiting for elements to become available in the pool.

Return

pointer to the allocated element or NULL when failing to allocate an element. Allocation failure can only happen when gfp_mask does not include __GFP_DIRECT_RECLAIM.

void *mempool_alloc_preallocated(struct mempool *pool)¶: allocate an element from preallocated elements belonging to a memory pool

Parameters

struct mempool *pool: pointer to the memory pool

Description

This function is similar to mempool_alloc(), but it only attempts allocating an element from the preallocated elements. It only takes a single spinlock_t and immediately returns if no preallocated elements are available.

Return

pointer to the allocated element or NULL if no elements are available.

unsigned int mempool_free_bulk(struct mempool *pool, void **elems, unsigned int count)¶: return elements to a mempool

Parameters

struct mempool *pool: pointer to the memory pool
void **elems: elements to return
unsigned int count: number of elements to return

Description

Returns a number of elements from the start of elems to pool if pool needs replenishing and sets their slots in elems to NULL. Other elements are left in elems.

Return

number of elements transferred to pool. Elements are always transferred from the beginning of elems, so the return value can be used as an offset into elems for freeing the remaining elements in the caller.
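The "return value as an offset" convention can be seen in a small userspace model. `struct toy_bulk_pool`, its fixed reserve of two slots and `toy_free_bulk()` are hypothetical; the sketch only shows which array slots are consumed and what the caller is left with.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RESERVE_NR 2  /* assumption: the pool keeps two elements in reserve */

struct toy_bulk_pool {
    void *reserve[TOY_RESERVE_NR];
    unsigned int curr_nr;   /* elements currently held in reserve */
};

/* Move elements from the front of elems into the pool until it is full. */
static unsigned int toy_free_bulk(struct toy_bulk_pool *p, void **elems,
                                  unsigned int count)
{
    unsigned int moved = 0;

    assert(p->curr_nr <= TOY_RESERVE_NR);
    while (moved < count && p->curr_nr < TOY_RESERVE_NR) {
        p->reserve[p->curr_nr++] = elems[moved];
        elems[moved++] = NULL;      /* the pool owns this element now */
    }
    return moved;
}
```

After the call, the caller still owns the `count - moved` entries starting at `elems + moved` and has to release them by other means.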

void mempool_free(void *element, struct mempool *pool)¶: return an element to the pool.

Parameters

void *element: element to return
struct mempool *pool: pointer to the memory pool

Description

Returns element to pool if it needs replenishing, else frees it using the free_fn callback in pool.

This function only sleeps if the free_fn callback sleeps.
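The interplay between the allocator callback and the reserve can be modelled in userspace. The sketch below is not the kernel implementation: `struct toy_pool`, the fixed reserve of two elements, the `malloc()`-based allocator and the `fail_alloc` knob are all assumptions, and the model never waits for elements, so it corresponds to the non-blocking case only.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MIN_NR 2  /* assumption: guaranteed reserve of two elements */

struct toy_pool {
    void *reserve[TOY_MIN_NR];
    int curr_nr;     /* elements currently held in reserve */
    int fail_alloc;  /* knob for the sketch: make the allocator fail */
};

static void *toy_alloc_fn(struct toy_pool *p)
{
    return p->fail_alloc ? NULL : malloc(64);
}

/* Free the reserved elements; the pool structure itself is caller-owned. */
static void toy_pool_exit(struct toy_pool *p)
{
    while (p->curr_nr > 0)
        free(p->reserve[--p->curr_nr]);
}

/* Preallocate the reserve. Returns 0 on success, -1 on failure. */
static int toy_pool_init(struct toy_pool *p)
{
    p->curr_nr = 0;
    p->fail_alloc = 0;
    while (p->curr_nr < TOY_MIN_NR) {
        void *e = toy_alloc_fn(p);

        if (e == NULL) {
            toy_pool_exit(p);
            return -1;
        }
        p->reserve[p->curr_nr++] = e;
    }
    return 0;
}

static void *toy_pool_alloc(struct toy_pool *p)
{
    void *e = toy_alloc_fn(p);          /* normal allocator first */

    if (e == NULL && p->curr_nr > 0)
        e = p->reserve[--p->curr_nr];   /* then dip into the reserve */
    return e;                           /* NULL: nothing left, and we do not wait */
}

static void toy_pool_free(struct toy_pool *p, void *e)
{
    assert(e != NULL);
    if (p->curr_nr < TOY_MIN_NR)
        p->reserve[p->curr_nr++] = e;   /* reserve needs replenishing */
    else
        free(e);                        /* reserve is full: really free it */
}
```

In this model the reserve is only consumed while the allocator is failing, and frees refill it before anything is handed back to the underlying allocator.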

More Memory Management Functions¶

void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, unsigned long size)¶: remove ptes mapping the vma

Parameters

struct vm_area_struct *vma: vm_area_struct holding ptes to be zapped
unsigned long address: starting address of pages to zap
unsigned long size: number of bytes to zap

Description

This function only unmaps ptes assigned to VM_PFNMAP vmas.

The entire address range must be fully contained within the vma.

int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr, struct page **pages, unsigned long *num)¶: insert multiple pages into user vma, batching the pmd lock.

Parameters

struct vm_area_struct *vma: user vma to map to
unsigned long addr: target start user address of these pages
struct page **pages: source kernel pages
unsigned long *num: in: number of pages to map. out: number of pages that were not mapped. (0 means all pages were successfully mapped).

Description

Preferred over vm_insert_page() when inserting multiple pages.

In case of error, we may have mapped a subset of the provided pages. It is the caller’s responsibility to account for this case.

The same restrictions apply as in vm_insert_page().

int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, struct page *page)¶: insert single page into user vma

Parameters

struct vm_area_struct *vma: user vma to map to
unsigned long addr: target user address of this page
struct page *page: source kernel page

Description

This allows drivers to insert individual pages they’ve allocated into a user vma. The zeropage is supported in some VMAs, see vm_mixed_zeropage_allowed().

The page has to be a nice clean _individual_ kernel allocation. If you allocate a compound page, you need to have marked it as such (__GFP_COMP), or manually just split the page up yourself (see split_page()).

NOTE! Traditionally this was done with “remap_pfn_range()” which took an arbitrary page protection parameter. This doesn’t allow that. Your vma protection will have to be set up correctly, which means that if you want a shared writable mapping, you’d better ask for a shared writable mapping!

The page does not need to be reserved.

Usually this function is called from f_op->mmap() handler under mm->mmap_lock write-lock, so it can change vma->vm_flags. Caller must set VM_MIXEDMAP on vma if it wants to call this function from other places, for example from page-fault handler.

Return

0 on success, negative error code otherwise.

int vm_map_pages(struct vm_area_struct *vma, struct page **pages, unsigned long num)¶: map a range of kernel pages starting at a non-zero offset

Parameters

struct vm_area_struct *vma: user vma to map to
struct page **pages: pointer to array of source kernel pages
unsigned long num: number of pages in page array

Description

Maps an object consisting of num pages, catering for the user’s requested vm_pgoff.

If we fail to insert any page into the vma, the function will return immediately leaving any previously inserted pages present. Callers from the mmap handler may immediately return the error as their caller will destroy the vma, removing any successfully inserted pages. Other callers should make their own arrangements for calling unmap_region().

Context

Process context. Called by mmap handlers.

Return

0 on success and error code otherwise.
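"Catering for the user's requested vm_pgoff" implies a bounds check: the requested offset plus the size of the vma must fit inside the object. The userspace sketch below shows one way to write that check without overflow; the function name, its parameters and the choice of -ENXIO as the error code are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>

/*
 * offset: the user's requested offset into the object, in pages
 * count:  number of pages spanned by the vma
 * num:    number of pages in the object
 */
static int toy_check_map_range(unsigned long offset, unsigned long count,
                               unsigned long num)
{
    if (offset >= num)
        return -ENXIO;      /* offset is beyond the end of the object */
    if (count > num - offset)
        return -ENXIO;      /* vma is larger than what remains */
    return 0;
}
```

For example, `assert(toy_check_map_range(2, 3, 4) == -ENXIO);` holds because only two pages remain after an offset of two in a four-page object. Comparing `count` against `num - offset` rather than adding `offset + count` avoids wrap-around for large values.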

int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages, unsigned long num)¶: map a range of kernel pages starting at offset zero

Parameters

struct vm_area_struct *vma: user vma to map to
struct page **pages: pointer to array of source kernel pages
unsigned long num: number of pages in page array

Description

Similar to vm_map_pages(), except that it explicitly sets the offset to 0. This function is intended for the drivers that did not consider vm_pgoff.

Context

Process context. Called by mmap handlers.

Return

0 on success and error code otherwise.

vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, pgprot_t pgprot)¶: insert single pfn into user vma with specified pgprot

Parameters

struct vm_area_struct *vma: user vma to map to
unsigned long addr: target user address of this page
unsigned long pfn: source kernel pfn
pgprot_t pgprot: pgprot flags for the inserted page

Description

This is exactly like vmf_insert_pfn(), except that it allows drivers to override pgprot on a per-page basis.

This only makes sense for IO mappings, and it makes no sense for COW mappings. In general, using multiple vmas is preferable; vmf_insert_pfn_prot should only be used if using multiple VMAs is impractical.

pgprot typically only differs from vma->vm_page_prot when drivers set caching- and encryption bits different than those of vma->vm_page_prot, because the caching- or encryption mode may not be known at mmap() time.

This is ok as long as vma->vm_page_prot is not used by the core vm to set caching and encryption bits for those vmas (except for COW pages). This is ensured by core vm only modifying these page table entries using functions that don’t touch caching- or encryption bits, using pte_modify() if needed. (See for example mprotect()).

Also when new page-table entries are created, this is only done using the fault() callback, and never using the value of vma->vm_page_prot, except for page-table entries that point to anonymous pages as the result of COW.

Context

Process context. May allocate using GFP_KERNEL.

Return

vm_fault_t value.

vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)¶: insert single pfn into user vma

Parameters

struct vm_area_struct *vma: user vma to map to
unsigned long addr: target user address of this page
unsigned long pfn: source kernel pfn

Description

Similar to vm_insert_page, this allows drivers to insert individual pages they’ve allocated into a user vma. Same comments apply.

This function should only be called from a vm_ops->fault handler, and in that case the handler should return the result of this function.

vma cannot be a COW mapping.

As this is called only for pages that do not currently exist, we do not need to flush old virtual caches or the TLB.

Context

Process context. May allocate using GFP_KERNEL.

Return

vm_fault_t value.

int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot)¶: remap kernel memory to userspace

Parameters

struct vm_area_struct *vma: user vma to map to
unsigned long addr: target page aligned user address to start at
unsigned long pfn: page frame number of kernel physical memory address
unsigned long size: size of mapping area
pgprot_t prot: page protection flags for this mapping

Note

This is only safe if the mm semaphore is held when called.

Return

0 on success, negative error code otherwise.

int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len)¶: remap memory to userspace

Parameters

struct vm_area_struct *vma: user vma to map to
phys_addr_t start: start of the physical memory to be mapped
unsigned long len: size of area

Description

This is a simplified io_remap_pfn_range() for common driver use. The driver just needs to give us the physical memory range to be mapped, we’ll figure out the rest from the vma information.

NOTE! Some drivers might want to tweak vma->vm_page_prot first to get whatever write-combining details or similar.

Return

0 on success, negative error code otherwise.

void unmap_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t nr, bool even_cows)¶: Unmap pages from processes.

Parameters

struct address_space *mapping: The address space containing pages to be unmapped.
pgoff_t start: Index of first page to be unmapped.
pgoff_t nr: Number of pages to be unmapped. 0 to unmap to end of file.
bool even_cows: Whether to unmap even private COWed pages.

Description

Unmap the pages in this address space from any userspace process which has them mmaped. Generally, you want to remove COWed pages as well when a file is being truncated, but not when invalidating pages from the page cache.

void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows)¶: unmap the portion of all mmaps in the specified address_space corresponding to the specified byte range in the underlying file.

Parameters

struct address_space *mapping: the address space containing mmaps to be unmapped.
loff_t const holebegin: byte in first page to unmap, relative to the start of the underlying file. This will be rounded down to a PAGE_SIZE boundary. Note that this is different from truncate_pagecache(), which must keep the partial page. In contrast, we must get rid of partial pages.
loff_t const holelen: size of prospective hole in bytes. This will be rounded up to a PAGE_SIZE boundary. A holelen of zero truncates to the end of the file.
int even_cows: 1 when truncating a file, unmap even private COWed pages; but 0 when invalidating pagecache, don’t throw away private data.
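The rounding described for holebegin and holelen converts a byte range into a page range. A userspace sketch, assuming 4 KiB pages and hypothetical `toy_*` names; each value is rounded on its own, as the parameter descriptions state.

```c
#include <assert.h>
#include <limits.h>

#define TOY_PAGE_SHIFT 12  /* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)

struct toy_page_range {
    unsigned long long first;  /* index of the first page to unmap */
    unsigned long long nr;     /* number of pages; 0 means "to end of file" */
};

static struct toy_page_range toy_hole_to_pages(unsigned long long holebegin,
                                               unsigned long long holelen)
{
    struct toy_page_range r;

    assert(holelen <= ULLONG_MAX - (TOY_PAGE_SIZE - 1));

    r.first = holebegin >> TOY_PAGE_SHIFT;                  /* round down */
    r.nr = (holelen + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT; /* round up */
    return r;
}
```

A hole that begins in the middle of page 1 therefore starts at page 1, and a length of one byte more than a page covers two pages.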

int follow_pfnmap_start(struct follow_pfnmap_args *args)¶: Look up a pfn mapping at a user virtual address

Parameters

struct follow_pfnmap_args *args: Pointer to struct follow_pfnmap_args

Description

The caller needs to setup args->vma and args->address to point to the virtual address as the target of such lookup. On a successful return, the results will be put into other output fields.

After the caller has finished using the fields, the caller must invoke follow_pfnmap_end() to properly release the locks and resources of the lookup request.

Between the start() and end() calls, the results in args will be valid as proper locks will be held. After end() is called, none of the fields in follow_pfnmap_args may be accessed any further. Further use of such information after end() may require proper synchronization by the caller with page table updates, otherwise it can create a security bug.

If the PTE maps a refcounted page, callers are responsible for protecting against invalidation with MMU notifiers; otherwise access to the PFN at a later point in time can trigger use-after-free.

Only IO mappings and raw PFN mappings are allowed. The mmap semaphore should be taken for read, and the mmap semaphore cannot be released before the end() is invoked.

This function must not be used to modify PTE content.

Return

zero on success, negative otherwise.

void follow_pfnmap_end(struct follow_pfnmap_args *args)¶: End a follow_pfnmap_start() process

Parameters

struct follow_pfnmap_args *args: Pointer to struct follow_pfnmap_args

Description

Must be paired with follow_pfnmap_start(). See the start() function above for more information.

int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write)¶: generic implementation for iomem mmap access

Parameters

struct vm_area_struct *vma: the vma to access
unsigned long addr: userspace address, not relative offset within vma
void *buf: buffer to read/write
int len: length of transfer
int write: set to FOLL_WRITE when writing, otherwise reading

Description

This is a generic implementation for vm_operations_struct.access for an iomem mapping. This callback is used by access_process_vm() when the vma is not page based.

int copy_remote_vm_str(struct task_struct *tsk, unsigned long addr, void *buf, int len, unsigned int gup_flags)¶: copy a string from another process’s address space.

Parameters

struct task_struct *tsk: the task of the target address space
unsigned long addr: start address to read from
void *buf: destination buffer
int len: number of bytes to copy
unsigned int gup_flags: flags modifying lookup behaviour

Description

The caller must hold a reference on mm.

Return

number of bytes copied from addr (source) to buf (destination), not including the trailing NUL. The buffer is always left NUL-terminated. Returns -EFAULT on any error.
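The return convention can be modelled in user space. The sketch below is not the kernel function: model_copy_str() is a hypothetical helper, a NULL source stands in for an unreadable remote address, and truncation to len - 1 bytes is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Copy at most len - 1 bytes of a string, always NUL-terminate buf, and
 * return the number of bytes copied, not counting the NUL.
 */
static int model_copy_str(const char *src, char *buf, int len)
{
    int copied = 0;

    assert(buf != NULL && len > 0);

    if (src == NULL) {   /* stands in for a failed remote access */
        buf[0] = '\0';
        return -EFAULT;
    }

    while (copied < len - 1 && src[copied] != '\0') {
        buf[copied] = src[copied];
        copied++;
    }
    buf[copied] = '\0';  /* the buffer is always left terminated */
    return copied;
}
```

A caller following this convention only has to check for a negative value; any non-negative result is the length of a valid C string in buf.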

unsigned long __get_pfnblock_flags_mask(const struct page *page, unsigned long pfn, unsigned long mask)¶: Return the requested group of flags for a pageblock_nr_pages block of pages

Parameters

const struct page *page: The page within the block of interest
unsigned long pfn: The target page frame number
unsigned long mask: mask of bits that the caller is interested in

Return

pageblock_bits flags

bool get_pfnblock_bit(const struct page *page, unsigned long pfn, enum pageblock_bits pb_bit)¶: Check if a standalone bit of a pageblock is set

Parameters

const struct page *page: The page within the block of interest
unsigned long pfn: The target page frame number
enum pageblock_bits pb_bit: pageblock bit to check

Return

true if the bit is set, otherwise false

enum migratetype get_pfnblock_migratetype(const struct page *page, unsigned long pfn)¶: Return the migratetype of a pageblock

Parameters

const struct page *page: The page within the block of interest
unsigned long pfn: The target page frame number

Return

The migratetype of the pageblock

Description

Use get_pfnblock_migratetype() if caller already has both page and pfn to save a call to page_to_pfn().

void __set_pfnblock_flags_mask(struct page *page, unsigned long pfn, unsigned long flags, unsigned long mask)¶: Set the requested group of flags for a pageblock_nr_pages block of pages

Parameters

struct page *page: The page within the block of interest
unsigned long pfn: The target page frame number
unsigned long flags: The flags to set
unsigned long mask: mask of bits that the caller is interested in

void set_pfnblock_bit(const struct page *page, unsigned long pfn, enum pageblock_bits pb_bit)¶: Set a standalone bit of a pageblock

Parameters

const struct page *page: The page within the block of interest
unsigned long pfn: The target page frame number
enum pageblock_bits pb_bit: pageblock bit to set

void clear_pfnblock_bit(const struct page *page, unsigned long pfn, enum pageblock_bits pb_bit)¶: Clear a standalone bit of a pageblock

Parameters

const struct page *page: The page within the block of interest
unsigned long pfn: The target page frame number
enum pageblock_bits pb_bit: pageblock bit to clear

void set_pageblock_migratetype(struct page *page, enum migratetype migratetype)¶: Set the migratetype of a pageblock

Parameters

struct page *page: The page within the block of interest
enum migratetype migratetype: migratetype to set

bool __move_freepages_block_isolate(struct zone *zone, struct page *page, bool isolate)¶: move free pages in block for page isolation

Parameters

struct zone *zone: the zone
struct page *page: the pageblock page
bool isolate: to isolate the given pageblock or unisolate it

Description

This is similar to move_freepages_block(), but handles the special case encountered in page isolation, where the block of interest might be part of a larger buddy spanning multiple pageblocks.

Unlike the regular page allocator path, which moves pages while stealing buddies off the freelist, page isolation is interested in arbitrary pfn ranges that may have overlapping buddies on both ends.

This function handles that. Straddling buddies are split into individual pageblocks. Only the block of interest is moved.

Returns true if pages could be moved, false otherwise.

void __putback_isolated_page(struct page *page, unsigned int order, int mt)¶: Return a now-isolated page back where we got it

Parameters

struct page *page: Page that was isolated
unsigned int order: Order of the isolated page
int mt: The page’s pageblock’s migratetype

Description

This function is meant to return a page pulled from the free lists via __isolate_free_page back to the free list it was pulled from.

void __free_pages(struct page *page, unsigned int order)¶: Free pages allocated with alloc_pages().

Parameters

struct page *page: The page pointer returned from alloc_pages().
unsigned int order: The order of the allocation.

Description

This function can free multi-page allocations that are not compound pages. It does not check that the order passed in matches that of the allocation, so it is easy to leak memory. Freeing more memory than was allocated will probably emit a warning.

If the last reference to this page is speculative, it will be released by put_page() which only frees the first page of a non-compound allocation. To prevent the remaining pages from being leaked, we free the subsequent pages here. If you want to use the page’s reference count to decide when to free the allocation, you should allocate a compound page, and use put_page() instead of __free_pages().

Context

May be called in interrupt context or while holding a normal spinlock, but not in NMI context or while holding a raw spinlock.

void free_pages(unsigned long addr, unsigned int order)¶: Free pages allocated with __get_free_pages().

Parameters

unsigned long addr: The virtual address tied to a page returned from __get_free_pages().
unsigned int order: The order of the allocation.

Description

This function behaves the same as __free_pages(). Use this function to free pages when you only have a valid virtual address. If you have the page, call __free_pages() instead.

void *alloc_pages_exact(size_t size, gfp_t gfp_mask)¶: allocate an exact number of physically-contiguous pages.

Parameters

size_t size: the number of bytes to allocate
gfp_t gfp_mask: GFP flags for the allocation, must not contain __GFP_COMP

Description

This function is similar to alloc_pages(), except that it allocates the minimum number of pages to satisfy the request. alloc_pages() can only allocate memory in power-of-two pages.

This function is also limited by MAX_PAGE_ORDER.

Memory allocated by this function must be released by free_pages_exact().

Return

pointer to the allocated area or NULL in case of error.
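The difference between an exact allocation and a power-of-two allocation can be worked through with a little arithmetic. The following user-space sketch assumes a 4096-byte page; the helper names are hypothetical and only illustrate the page counts involved, not how the allocator does its work.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL /* assumed page size for this sketch */

/* Minimum number of pages that can hold size bytes. */
static unsigned long model_pages_exact(size_t size)
{
    assert(size > 0);
    return (size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/* Smallest order such that (1 << order) pages can hold size bytes. */
static unsigned int model_order_for(size_t size)
{
    unsigned long pages = model_pages_exact(size);
    unsigned int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}

/* Pages a power-of-two allocation hands out beyond the exact need. */
static unsigned long model_pages_saved(size_t size)
{
    return (1UL << model_order_for(size)) - model_pages_exact(size);
}
```

For a request of five pages, a power-of-two allocation needs order 3 (eight pages), so an exact allocation avoids holding on to three pages.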

void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)¶: allocate an exact number of physically-contiguous pages on a node.

Parameters

int nid: the preferred node ID where memory should be allocated
size_t size: the number of bytes to allocate
gfp_t gfp_mask: GFP flags for the allocation, must not contain __GFP_COMP

Description

Like alloc_pages_exact(), but try to allocate on node nid first before falling back.

Return

pointer to the allocated area or NULL in case of error.

void free_pages_exact(void *virt, size_t size)¶: release memory allocated via alloc_pages_exact()

Parameters

void *virt: the value returned by alloc_pages_exact.
size_t size: size of allocation, same value as passed to alloc_pages_exact().

Description

Release the memory allocated by a previous call to alloc_pages_exact.

unsigned long nr_free_zone_pages(int offset)¶: count number of pages beyond high watermark

Parameters

int offset: The zone index of the highest zone

Description

nr_free_zone_pages() counts the number of pages which are beyond the high watermark within all zones at or below a given zone index. For each zone, the number of pages is calculated as:

nr_free_zone_pages = managed_pages - high_pages

Return

number of pages beyond high watermark.
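The per-zone formula can be sketched as a sum over a simplified zone array. struct model_zone and model_nr_free_zone_pages() are hypothetical stand-ins; skipping zones whose managed page count does not exceed the watermark is an assumption made here to keep the unsigned subtraction from wrapping.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the per-zone counters used by the formula. */
struct model_zone {
    unsigned long managed_pages;
    unsigned long high_pages; /* high watermark, in pages */
};

/* Sum managed_pages - high_pages over zones 0..highest (inclusive). */
static unsigned long model_nr_free_zone_pages(const struct model_zone *zones,
                                              size_t nr_zones, size_t highest)
{
    unsigned long sum = 0;
    size_t i;

    assert(highest < nr_zones);

    for (i = 0; i <= highest; i++) {
        /* Only zones with pages above the watermark contribute. */
        if (zones[i].managed_pages > zones[i].high_pages)
            sum += zones[i].managed_pages - zones[i].high_pages;
    }
    return sum;
}
```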

unsigned long nr_free_buffer_pages(void)¶: count number of pages beyond high watermark

Parameters

void: no arguments

Description

nr_free_buffer_pages() counts the number of pages which are beyond the high watermark within ZONE_DMA and ZONE_NORMAL.

Return

number of pages beyond high watermark within ZONE_DMA and ZONE_NORMAL.

int find_next_best_node(int node, nodemask_t *used_node_mask)¶: find the next node that should appear in a given node’s fallback list

Parameters

int node: node whose fallback list we’re appending
nodemask_t *used_node_mask: nodemask_t of already used nodes

Description

We use a number of factors to determine which is the next node that should appear on a given node’s fallback list. The node should not have appeared already in node’s fallback list, and it should be the next closest node according to the distance array (which contains arbitrary distance values from each node to each node in the system), and should also prefer nodes with no CPUs, since presumably they’ll have very little allocation pressure on them otherwise.

Return

node id of the found node or NUMA_NO_NODE if no node is found.

void setup_per_zone_wmarks(void)¶: called when min_free_kbytes changes or when memory is hot-{added|removed}

Parameters

void: no arguments

Description

Ensures that the watermark[min,low,high] values for each zone are set correctly with respect to min_free_kbytes.

int alloc_contig_range(unsigned long start, unsigned long end, acr_flags_t alloc_flags, gfp_t gfp_mask)¶: tries to allocate given range of pages

Parameters

unsigned long start: start PFN to allocate
unsigned long end: one-past-the-last PFN to allocate
acr_flags_t alloc_flags: allocation information
gfp_t gfp_mask: GFP mask. Node/zone/placement hints are ignored; only some action and reclaim modifiers are supported. Reclaim modifiers control allocation behavior during compaction/migration/reclaim.

Description

The PFN range does not have to be pageblock aligned. The PFN range must belong to a single zone.

The first thing this routine does is attempt to MIGRATE_ISOLATE all pageblocks in the range. Once isolated, the pageblocks should not be modified by others.

Return

zero on success or negative error code. On success all pages whose PFN is in [start, end) are allocated for the caller and need to be freed with free_contig_range().

struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask, int nid, nodemask_t *nodemask)¶: tries to find and allocate contiguous range of pages

Parameters

unsigned long nr_pages: Number of contiguous pages to allocate
gfp_t gfp_mask: GFP mask. Node/zone/placement hints limit the search; only some action and reclaim modifiers are supported. Reclaim modifiers control allocation behavior during compaction/migration/reclaim.
int nid: Target node
nodemask_t *nodemask: Mask for other possible nodes

Description

This routine is a wrapper around alloc_contig_range(). It scans over zones on an applicable zonelist to find a contiguous pfn range which can then be tried for allocation with alloc_contig_range(). This routine is intended for allocation requests which can not be fulfilled with the buddy allocator.

The allocated memory is always aligned to a page boundary. If nr_pages is a power of two, then the allocated range is also guaranteed to be aligned to nr_pages (e.g. a 1GB request would be aligned to 1GB).

Allocated pages can be freed with free_contig_range() or by manually calling __free_page() on each allocated page.

Return

pointer to contiguous pages on success, or NULL if not successful.

struct page *alloc_pages_nolock(gfp_t gfp_flags, int nid, unsigned int order)¶: opportunistic reentrant allocation from any context

Parameters

gfp_t gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
int nid: node to allocate from
unsigned int order: allocation order size

Description

Allocates pages of a given order from the given node. This is safe to call from any context (from atomic, NMI, and also reentrant allocator -> tracepoint -> alloc_pages_nolock_noprof). Allocation is best effort and is expected to fail easily, so nobody should rely on its success. Failures are not reported via warn_alloc(). See always fail conditions below.

Return

allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN. It means ENOMEM. There is no reason to call it again and expect !NULL.

int numa_nearest_node(int node, unsigned int state)¶: Find nearest node by state

Parameters

int node: Node id to start the search
unsigned int state: State to filter the search

Description

Look up the closest node by distance if node is not in state.

Return

this node if it is in state, otherwise the closest node by distance

int nearest_node_nodemask(int node, nodemask_t *mask)¶: Find the node in mask at the nearest distance from node.

Parameters

int node: a valid node ID to start the search from.
nodemask_t *mask: a pointer to a nodemask representing the allowed nodes.

Description

This function iterates over all nodes in mask and calculates the distance from the starting node, then it returns the node ID that is the closest to node, or MAX_NUMNODES if no node is found.

Note that node must be a valid node ID usable with node_distance(), providing an invalid node ID (e.g., NUMA_NO_NODE) may result in crashes or unexpected behavior.
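The search described above amounts to a minimum-distance scan over the nodes set in the mask. The sketch below models it in user space: the four-node distance table, the plain bitmask used in place of nodemask_t, and the name model_nearest_node() are all assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_MAX_NUMNODES 4 /* assumed node count for this sketch */

/* Hypothetical distance table; entry [a][b] is the distance from a to b. */
static const int model_distance[MODEL_MAX_NUMNODES][MODEL_MAX_NUMNODES] = {
    {10, 20, 30, 40},
    {20, 10, 20, 30},
    {30, 20, 10, 20},
    {40, 30, 20, 10},
};

/*
 * Return the node set in mask that is closest to node, or
 * MODEL_MAX_NUMNODES if the mask is empty.
 */
static int model_nearest_node(int node, unsigned int mask)
{
    int best = MODEL_MAX_NUMNODES;
    int best_dist = INT_MAX;
    int n;

    /* An invalid starting node would index outside the table. */
    assert(node >= 0 && node < MODEL_MAX_NUMNODES);

    for (n = 0; n < MODEL_MAX_NUMNODES; n++) {
        if (!(mask & (1u << n)))
            continue; /* node n is not allowed */
        if (model_distance[node][n] < best_dist) {
            best_dist = model_distance[node][n];
            best = n;
        }
    }
    return best;
}
```

The assertion mirrors the warning above: the starting node is used as an index, so a value such as NUMA_NO_NODE cannot be passed in.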

bool folio_can_map_prot_numa(struct folio *folio, struct vm_area_struct *vma, bool is_private_single_threaded)¶: check whether the folio can map prot numa

Parameters

struct folio *folio: The folio whose mapping is considered for being made NUMA hintable
struct vm_area_struct *vma: The VMA that the folio belongs to.
bool is_private_single_threaded: Is this a single-threaded private VMA or not

Description

This function checks whether the folio actually needs its mapping changed to one which causes a NUMA hinting fault, as there are cases where this is simply unnecessary. If prot numa is needed, the folio’s access time is also adjusted for memory tiering.

Return

True if the mapping of the folio needs to be changed, false otherwise.

struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, struct mempolicy *pol, pgoff_t ilx, int nid)¶: Allocate pages according to NUMA mempolicy.

Parameters

gfp_t gfp: GFP flags.
unsigned int order: Order of the page allocation.
struct mempolicy *pol: Pointer to the NUMA mempolicy.
pgoff_t ilx: Index for interleave mempolicy (also distinguishes alloc_pages()).
int nid: Preferred node (usually numa_node_id() but mpol may override it).

Return

The page on success or NULL if allocation fails.

struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr)¶: Allocate a folio for a VMA.

Parameters

gfp_t gfp: GFP flags.
int order: Order of the folio.
struct vm_area_struct *vma: Pointer to VMA.
unsigned long addr: Virtual address of the allocation. Must be inside vma.

Description

Allocate a folio for a specific address in vma, using the appropriate NUMA policy. The caller must hold the mmap_lock of the mm_struct of the VMA to prevent it from going away. Should be used for all allocations for folios that will be mapped into user space, excepting hugetlbfs, and excepting where direct use of folio_alloc_mpol() is more appropriate.

Return

The folio on success or NULL if allocation fails.

struct page *alloc_pages(gfp_t gfp, unsigned int order)¶: Allocate pages.

Parameters

gfp_t gfp: GFP flags.
unsigned int order: Power of two of number of pages to allocate.

Description

Allocate 1 << order contiguous pages. The physical address of the first page is naturally aligned (eg an order-3 allocation will be aligned to a multiple of 8 * PAGE_SIZE bytes). The NUMA policy of the current process is honoured when in process context.

Context

Can be called from any context, providing the appropriate GFP flags are used.

Return

The page on success or NULL if allocation fails.
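The relationship between order, page count and natural alignment can be checked with a few lines of arithmetic. This is a user-space sketch assuming a 4096-byte page; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumed page size for this sketch */

/* Number of pages in an allocation of the given order. */
static uint64_t model_order_nr_pages(unsigned int order)
{
    return 1ULL << order;
}

/* Size in bytes of an allocation of the given order. */
static uint64_t model_order_bytes(unsigned int order)
{
    return model_order_nr_pages(order) * MODEL_PAGE_SIZE;
}

/* True if phys is a multiple of the allocation size for this order. */
static bool model_naturally_aligned(uint64_t phys, unsigned int order)
{
    return (phys & (model_order_bytes(order) - 1)) == 0;
}
```

With these definitions an order-3 allocation spans 8 pages (32768 bytes), so a physical address of 0x10000 is naturally aligned for it while 0x4000 is not.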

int mpol_misplaced(struct folio *folio, struct vm_fault *vmf, unsigned long addr)¶: check whether current folio node is valid in policy

Parameters

struct folio *folio: folio to be checked
struct vm_fault *vmf: structure describing the fault
unsigned long addr: virtual address in vma for shared policy lookup and interleave policy

Description

Lookup current policy node id for vma,addr and “compare to” folio’s node id. Policy determination “mimics” alloc_page_vma(). Called from fault path where we know the vma and faulting address.

Return

NUMA_NO_NODE if the page is in a node that is valid for this policy, or a suitable node ID to allocate a replacement folio from.

void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)¶: initialize shared policy for inode

Parameters

struct shared_policy *sp: pointer to inode shared policy
struct mempolicy *mpol: struct mempolicy to install

Description

Install non-NULL mpol in inode’s shared policy rb-tree. On entry, the current task has a reference on a non-NULL mpol. This must be released on exit. This is called at get_inode() calls and we can use GFP_KERNEL.

int mpol_parse_str(char *str, struct mempolicy **mpol)¶: parse string to mempolicy, for tmpfs mpol mount option.

Parameters

char *str: string containing mempolicy to parse
struct mempolicy **mpol: pointer to struct mempolicy pointer, returned on success.

Description

Format of input: <mode>[=<flags>][:<nodelist>]

Return

0 on success, else 1
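To make the input format concrete, the sketch below splits a string of that shape into its three parts in user space. It does not validate mode names, flags or node lists, and struct model_mpol_parts and model_split_mpol_str() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* The three parts of "<mode>[=<flags>][:<nodelist>]". */
struct model_mpol_parts {
    char *mode;
    char *flags;    /* NULL if no "=<flags>" part is present */
    char *nodelist; /* NULL if no ":<nodelist>" part is present */
};

/* Split str in place; the parts point into the caller's buffer. */
static void model_split_mpol_str(char *str, struct model_mpol_parts *out)
{
    char *colon, *eq;

    assert(str != NULL && out != NULL);

    /* Cut off the node list first so '=' is only searched before ':'. */
    colon = strchr(str, ':');
    if (colon)
        *colon++ = '\0';

    eq = strchr(str, '=');
    if (eq)
        *eq++ = '\0';

    out->mode = str;
    out->flags = eq;
    out->nodelist = colon;
}
```

Given the writable string "interleave=static:0-3", the model yields the mode "interleave", the flags "static" and the node list "0-3"; a bare "local" yields only a mode.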

void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)¶: format a mempolicy structure for printing

Parameters

char *buffer: to contain formatted mempolicy string
int maxlen: length of buffer
struct mempolicy *pol: pointer to mempolicy to be formatted

Description

Convert pol into a string. If buffer is too short, truncate the string. Recommend a maxlen of at least 51 for the longest mode, “weighted interleave”, plus the longest flags, “relative|balancing”, and to display at least a few node ids.

type softleaf_t¶: Describes a page table software leaf entry, abstracted from its architecture-specific encoding.

Description

Page table leaf entries are those which do not reference any descendent page tables but rather either reference a data page, are an empty (or ‘none’) entry, or contain a non-present entry.

If referencing another page table or a data page then the page table entry is pertinent to hardware - that is it tells the hardware how to decode the page table entry.

Otherwise it is a software-defined leaf page table entry, which this type describes. See leafops.h and specifically softleaf_type for a list of all possible kinds of software leaf entry.

A softleaf_t entry is abstracted from the hardware page table entry, so is not architecture-specific.

NOTE

While we transition from the confusing swp_entry_t type used for this purpose, we simply alias this type. This will be removed once the transition is complete.

struct folio¶: Represents a contiguous set of bytes.

Definition:

struct folio {
    memdesc_flags_t flags;
    union {
        struct list_head lru;
        unsigned int mlock_count;
        struct dev_pagemap *pgmap;
    };
    struct address_space *mapping;
    union {
        pgoff_t index;
        unsigned long share;
    };
    union {
        void *private;
        swp_entry_t swap;
    };
    atomic_t _mapcount;
    atomic_t _refcount;
#ifdef CONFIG_MEMCG
    unsigned long memcg_data;
#elif defined(CONFIG_SLAB_OBJ_EXT)
    unsigned long _unused_slab_obj_exts;
#endif
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;
#endif
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif
    atomic_t _large_mapcount;
    atomic_t _nr_pages_mapped;
#ifdef CONFIG_64BIT
    atomic_t _entire_mapcount;
    atomic_t _pincount;
#endif
    mm_id_mapcount_t _mm_id_mapcount[2];
    union {
        mm_id_t _mm_id[2];
        unsigned long _mm_ids;
    };
#ifdef NR_PAGES_IN_LARGE_FOLIO
    unsigned int _nr_pages;
#endif
    struct list_head _deferred_list;
#ifndef CONFIG_64BIT
    atomic_t _entire_mapcount;
    atomic_t _pincount;
#endif
    void *_hugetlb_subpool;
    void *_hugetlb_cgroup;
    void *_hugetlb_cgroup_rsvd;
    void *_hugetlb_hwpoison;
};

Members

flags: Identical to the page flags.
{unnamed_union}: anonymous
lru: Least Recently Used list; tracks how recently this folio was used.
mlock_count: Number of times this folio has been pinned by mlock().
pgmap: Metadata for ZONE_DEVICE mappings
mapping: The file this page belongs to, or refers to the anon_vma for anonymous memory.
{unnamed_union}: anonymous
index: Offset within the file, in units of pages. For anonymous memory, this is the index from the beginning of the mmap.
share: number of DAX mappings that reference this folio. See dax_associate_entry.
{unnamed_union}: anonymous
private: Filesystem per-folio data (see folio_attach_private()).
swap: Used for swp_entry_t if folio_test_swapcache().
_mapcount: Do not access this member directly. Use folio_mapcount() to find out how many times this folio is mapped by userspace.
_refcount: Do not access this member directly. Use folio_ref_count() to find how many references there are to this folio.
memcg_data: Memory Control Group data.
_unused_slab_obj_exts: Placeholder to match obj_exts in struct slab.
virtual: Virtual address in the kernel direct map.
_last_cpupid: IDs of last CPU and last process that accessed the folio.
_large_mapcount: Do not use directly, call folio_mapcount().
_nr_pages_mapped: Do not use outside of rmap and debug code.
_entire_mapcount: Do not use directly, call folio_entire_mapcount().
_pincount: Do not use directly, call folio_maybe_dma_pinned().
_mm_id_mapcount: Do not use outside of rmap code.
{unnamed_union}: anonymous
_mm_id: Do not use outside of rmap code.
_mm_ids: Do not use outside of rmap code.
_nr_pages: Do not use directly, call folio_nr_pages().
_deferred_list: Folios to be split under memory pressure.
_entire_mapcount: Do not use directly, call folio_entire_mapcount().
_pincount: Do not use directly, call folio_maybe_dma_pinned().
_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
_hugetlb_cgroup: Do not use directly, use accessor in hugetlb_cgroup.h.
_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head().

Description

A folio is a physically, virtually and logically contiguous set of bytes. It is a power-of-two in size, and it is aligned to that same power-of-two. It is at least as large as PAGE_SIZE. If it is in the page cache, it is at a file offset which is a multiple of that power-of-two. It may be mapped into userspace at an address which is at an arbitrary page offset, but its kernel virtual address is aligned to its size.

struct ptdesc¶: Memory descriptor for page tables.

Definition:

struct ptdesc {
    memdesc_flags_t pt_flags;
    union {
        struct rcu_head pt_rcu_head;
        struct list_head pt_list;
        struct {
            unsigned long _pt_pad_1;
            pgtable_t pmd_huge_pte;
        };
    };
    unsigned long __page_mapping;
    union {
        pgoff_t pt_index;
        struct mm_struct *pt_mm;
        atomic_t pt_frag_refcount;
#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
        atomic_t pt_share_count;
#endif
    };
    union {
        unsigned long _pt_pad_2;
#if ALLOC_SPLIT_PTLOCKS
        spinlock_t *ptl;
#else
        spinlock_t ptl;
#endif
    };
    unsigned int __page_type;
    atomic_t __page_refcount;
#ifdef CONFIG_MEMCG
    unsigned long pt_memcg_data;
#endif
};

Members

pt_flags: enum pt_flags plus zone/node/section.
{unnamed_union}: anonymous
pt_rcu_head: For freeing page table pages.
pt_list: List of used page tables. Used for s390 gmap shadow pages (which are not linked into the user page tables) and x86 pgds.
{unnamed_struct}: anonymous
_pt_pad_1: Padding that aliases with page’s compound head.
pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
__page_mapping: Aliases with page->mapping. Unused for page tables.
{unnamed_union}: anonymous
pt_index: Used for s390 gmap.
pt_mm: Used for x86 pgds.
pt_frag_refcount: For fragmented page table tracking. Powerpc only.
pt_share_count: Used for HugeTLB PMD page table share count.
{unnamed_union}: anonymous
_pt_pad_2: Padding to ensure proper alignment.
ptl: Lock for the page table.
ptl: Lock for the page table.
__page_type: Same as page->page_type. Unused for page tables.
__page_refcount: Same as page refcount.
pt_memcg_data: Memcg data. Tracked for page tables here.

Description

This struct overlays struct page for now. Do not modify without a good understanding of the issues.

type vm_fault_t¶: Return type for page fault handlers.

Description

Page fault handlers return a bitmask of VM_FAULT values.

enum vm_fault_reason¶: Page fault handlers return a bitmask of these values to tell the core VM what happened when handling the fault. Used to decide whether a process gets delivered SIGBUS or just gets major/minor fault counters bumped up.

Constants

VM_FAULT_OOM: Out Of Memory
VM_FAULT_SIGBUS: Bad access
VM_FAULT_MAJOR: Page read from storage
VM_FAULT_HWPOISON: Hit poisoned small page
VM_FAULT_HWPOISON_LARGE: Hit poisoned large page. Index encoded in upper bits
VM_FAULT_SIGSEGV: segmentation fault
VM_FAULT_NOPAGE: ->fault installed the pte and did not return a page
VM_FAULT_LOCKED: ->fault locked the returned page
VM_FAULT_RETRY: ->fault blocked, must retry
VM_FAULT_FALLBACK: huge page fault failed, fall back to small
VM_FAULT_DONE_COW: ->fault has fully handled COW
VM_FAULT_NEEDDSYNC: ->fault did not modify page tables and needs fsync() to complete (for synchronous page faults in DAX)
VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released
VM_FAULT_HINDEX_MASK: mask HINDEX value

enum fault_flag¶: Fault flag definitions.

Constants

FAULT_FLAG_WRITE: Fault was a write fault.
FAULT_FLAG_MKWRITE: Fault was mkwrite of existing PTE.
FAULT_FLAG_ALLOW_RETRY: Allow to retry the fault if blocked.
FAULT_FLAG_RETRY_NOWAIT: Don’t drop mmap_lock and wait when retrying.
FAULT_FLAG_KILLABLE: The fault task is in SIGKILL killable region.
FAULT_FLAG_TRIED: The fault has been tried once.
FAULT_FLAG_USER: The fault originated in userspace.
FAULT_FLAG_REMOTE: The fault is not for current task/mm.
FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
FAULT_FLAG_UNSHARE: The fault is an unsharing request to break COW in a COW mapping, making sure that an exclusive anon page is mapped after the fault.
FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached. We should only access orig_pte if this flag is set.
FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.

Description

About FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED: we can specify whether we would allow page faults to retry by specifying these two fault flags correctly. Currently there can be three legal combinations:

(a) ALLOW_RETRY and !TRIED: this means the page fault allows retry, and this is the first try
(b) ALLOW_RETRY and TRIED: this means the page fault allows retry, and we’ve already tried at least once
(c) !ALLOW_RETRY and !TRIED: this means the page fault does not allow retry

The unlisted combination (!ALLOW_RETRY && TRIED) is illegal and should never be used. Note that page faults can be allowed to retry multiple times, in which case we’ll have an initial fault with flags (a), followed by further faults with flags (b). We should always try to detect pending signals before a retry to make sure the repeated page faults can still be interrupted if necessary.

The combination FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE is illegal. FAULT_FLAG_UNSHARE is ignored and treated like an ordinary read fault when applied to mappings that are not COW mappings.
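The legality rules above can be expressed as a small predicate. In this user-space sketch the MODEL_FLAG_* bit values are illustrative assumptions and do not reflect the values of the kernel's enum fault_flag; the function names are hypothetical as well.

```c
#include <stdbool.h>

/* Illustrative bit assignments; not the kernel's actual values. */
enum model_fault_flag {
    MODEL_FLAG_WRITE       = 1u << 0,
    MODEL_FLAG_ALLOW_RETRY = 1u << 1,
    MODEL_FLAG_TRIED       = 1u << 2,
    MODEL_FLAG_UNSHARE     = 1u << 3,
};

/* Check the two documented restrictions on flag combinations. */
static bool model_fault_flags_legal(unsigned int flags)
{
    /* TRIED without ALLOW_RETRY is the unlisted, illegal combination. */
    if ((flags & MODEL_FLAG_TRIED) && !(flags & MODEL_FLAG_ALLOW_RETRY))
        return false;

    /* WRITE and UNSHARE must not be combined. */
    if ((flags & MODEL_FLAG_WRITE) && (flags & MODEL_FLAG_UNSHARE))
        return false;

    return true;
}

/* True for a fault that allows retry and has not been tried yet. */
static bool model_is_first_try(unsigned int flags)
{
    return (flags & MODEL_FLAG_ALLOW_RETRY) && !(flags & MODEL_FLAG_TRIED);
}
```

A retry loop built on this model would start with ALLOW_RETRY set and TRIED clear, then set TRIED for every later attempt while keeping ALLOW_RETRY set.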

int folio_is_file_lru(const struct folio *folio)¶: Should the folio be on a file LRU or anon LRU?

Parameters

const struct folio *folio: The folio to test.

Description

We would like to get this info without a page flag, but the state needs to survive until the folio is last deleted from the LRU, which could be as far down as __page_cache_release.

Return

An integer (not a boolean!) used to sort a folio onto the right LRU list and to account folios correctly. 1 if folio is a regular filesystem backed page cache folio or a lazily freed anonymous folio (e.g. via MADV_FREE). 0 if folio is a normal anonymous folio, a tmpfs folio or otherwise ram or swap backed folio.

void __folio_clear_lru_flags(struct folio *folio)¶: Clear page lru flags before releasing a page.

Parameters

struct folio *folio: The folio that was on lru and now has a zero reference.

enum lru_list folio_lru_list(const struct folio *folio)¶: Which LRU list should a folio be on?

Parameters

const struct folio *folio: The folio to test.

Return

The LRU list a folio should be on, as an index into the array of LRU lists.

size_t num_pages_contiguous(struct page **pages, size_t nr_pages)¶: determine the number of contiguous pages that represent contiguous PFNs

Parameters

struct page **pages: an array of page pointers
size_t nr_pages: length of the array, at least 1

Description

Determine the number of contiguous pages that represent contiguous PFNs in pages, starting from the first page.

In some kernel configs contiguous PFNs will not have contiguous struct pages. In these configurations num_pages_contiguous() will return a number smaller than the ideal one. The caller should continue to check for pfn contiguity after each call to num_pages_contiguous().

Returns the number of contiguous pages.
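The calling pattern, in which the caller keeps asking for the next contiguous run, can be sketched in user space with an array of PFNs standing in for the array of page pointers. Both helpers below are hypothetical models rather than kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the run of consecutive PFNs starting at pfns[0]. */
static size_t model_num_contiguous(const unsigned long *pfns, size_t nr)
{
    size_t i;

    assert(nr >= 1); /* the array must hold at least one entry */

    for (i = 1; i < nr; i++) {
        if (pfns[i] != pfns[0] + i)
            break;
    }
    return i;
}

/* Walk the whole array run by run and count the runs. */
static size_t model_count_runs(const unsigned long *pfns, size_t nr)
{
    size_t runs = 0;

    while (nr > 0) {
        size_t n = model_num_contiguous(pfns, nr);

        pfns += n; /* continue after the run just found */
        nr -= n;
        runs++;
    }
    return runs;
}
```

Because a run may be reported shorter than it really is in some configurations, a real caller would additionally compare the PFN at the start of each new run with the end of the previous one before treating them as separate ranges.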

page_folio¶

page_folio (p)

Converts from page to folio.

Parameters

p: The page.

Description

Every page is part of a folio. This function cannot be called on a NULL pointer.

Context

Neither a reference nor a lock is required on page. If the caller does not hold a reference, this call may race with a folio split, so it should re-check the folio still contains this page after gaining a reference on the folio.

Return

The folio which contains this page.

folio_page¶

folio_page (folio, n)

Return a page from a folio.

Parameters

folio: The folio.
n: The page number to return.

Description

n is relative to the start of the folio. This function does not check that the page number lies within folio; the caller is presumed to have a reference to the page.

bool folio_xor_flags_has_waiters(struct folio *folio, unsigned long mask)¶: Change some folio flags.

Parameters

struct folio *folio: The folio.
unsigned long mask: Bits set in this word will be changed.

Description

This must only be used for flags which are changed with the folio lock held. For example, it is unsafe to use for PG_dirty as that can be set without the folio lock held. It can also only be used on flags which are in the range 0-6 as some of the implementations only affect those bits.

Return

Whether there are tasks waiting on the folio.

bool folio_test_uptodate(const struct folio *folio)¶: Is this folio up to date?

Parameters

const struct folio *folio: The folio.

Description

The uptodate flag is set on a folio when every byte in the folio is at least as new as the corresponding bytes on storage. Anonymous and CoW folios are always uptodate. If the folio is not uptodate, some of the bytes in it may be; see the is_partially_uptodate() address_space operation.

bool folio_test_large(const struct folio *folio)¶: Does this folio contain more than one page?

Parameters

const struct folio *folio: The folio to test.

Return

True if the folio is larger than one page.

bool PageHuge(const struct page *page)¶: Determine if the page belongs to hugetlbfs

Parameters

const struct page *page: The page to test.

Context

Any context.

Return

True for hugetlbfs pages, false for anon pages or pages belonging to other filesystems.

bool page_has_movable_ops(const struct page *page)¶: test for a movable_ops page

Parameters

const struct page *page: The page to test.

Description

Test whether this is a movable_ops page. Such pages will stay that way until freed.

Returns true if this is a movable_ops page, otherwise false.

int folio_has_private(const struct folio *folio)¶: Determine if folio has private stuff

Parameters

const struct folio *folio: The folio to be checked

Description

Determine if a folio has private stuff, indicating that release routines should be invoked upon it.

unsigned long folio_page_idx(const struct folio *folio, const struct page *page)¶: Return the number of a page in a folio.

Parameters

const struct folio *folio: The folio.
const struct page *page: The folio page.

Description

This function expects that the page is actually part of the folio. The returned number is relative to the start of the folio.
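
The relationship between folio_page() and folio_page_idx() is plain pointer arithmetic relative to the first page. The following user-space sketch illustrates it; struct model_page, struct model_folio and the model_ helpers are hypothetical, and the folio is represented here simply by a pointer to its first page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and struct folio. */
struct model_page { unsigned long flags; };
struct model_folio { struct model_page *first; };

/* Model of folio_page(): the n-th page, counted from the folio start. */
static struct model_page *model_folio_page(const struct model_folio *folio,
					   size_t n)
{
	return folio->first + n;	/* no bounds check, as documented */
}

/* Model of folio_page_idx(): the inverse of model_folio_page(). */
static size_t model_folio_page_idx(const struct model_folio *folio,
				   const struct model_page *page)
{
	assert(page >= folio->first);	/* page must belong to the folio */
	return (size_t)(page - folio->first);
}
```

As in the real interfaces, neither helper verifies the upper bound; the caller is expected to know that the page lies within the folio.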

type vma_flag_t¶: specifies an individual VMA flag by bit number.

Description

This value is made type safe by sparse to avoid passing invalid flag values around.

bool fault_flag_allow_retry_first(enum fault_flag flags)¶: check ALLOW_RETRY the first time

Parameters

enum fault_flag flags: Fault flags.

Description

This is mostly used for places where we want to try to avoid taking the mmap_lock for too long a time when waiting for another condition to change, in which case we can try to be polite to release the mmap_lock in the first round to avoid potential starvation of other processes that would also want the mmap_lock.

Return

true if the page fault allows retry and this is the first attempt of the fault handling; false otherwise.
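
The return condition can be written as a two-flag test. In this user-space sketch the flag values and the MODEL_ and model_ names are assumptions made for illustration; the kernel's enum fault_flag defines the real values.

```c
#include <assert.h>
#include <stdbool.h>

/* Model flag values, chosen for this sketch only. */
enum model_fault_flag {
	MODEL_FAULT_FLAG_ALLOW_RETRY = 1 << 0,	/* fault handler may retry */
	MODEL_FAULT_FLAG_TRIED       = 1 << 1,	/* a previous attempt was made */
};

/* True only when retry is allowed and no attempt has been made yet. */
static bool model_allow_retry_first(unsigned int flags)
{
	return (flags & MODEL_FAULT_FLAG_ALLOW_RETRY) &&
	       !(flags & MODEL_FAULT_FLAG_TRIED);
}
```

A handler built on this test would drop the lock and ask for a retry on the first attempt, then wait with the lock held on later attempts.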

unsigned int folio_order(const struct folio *folio)¶: The allocation order of a folio.

Parameters

const struct folio *folio: The folio.

Description

A folio is composed of 2^order pages. See get_order() for the definition of order.

Return

The order of the folio.

void folio_reset_order(struct folio *folio)¶: Reset the folio order and derived _nr_pages

Parameters

struct folio *folio: The folio.

Description

Reset the order and derived _nr_pages to 0. Must only be used in the process of splitting large folios.

int folio_mapcount(const struct folio *folio)¶: Number of mappings of this folio.

Parameters

const struct folio *folio: The folio.

Description

The folio mapcount corresponds to the number of present user page table entries that reference any part of a folio. Each such present user page table entry must be paired with exactly one folio reference.

For ordinary folios, each user page table entry (PTE/PMD/PUD/...) counts exactly once.

For hugetlb folios, each abstracted “hugetlb” user page table entry that references the entire folio counts exactly once, even when such special page table entries are comprised of multiple ordinary page table entries.

Will report 0 for pages which cannot be mapped into userspace, such as slab, page tables and similar.

Return

The number of times this folio is mapped.

bool folio_mapped(const struct folio *folio)¶: Is this folio mapped into userspace?

Parameters

const struct folio *folio: The folio.

Return

True if any page in this folio is referenced by user page tables.

unsigned int thp_order(struct page *page)¶: Order of a transparent huge page.

Parameters

struct page *page: Head page of a transparent huge page.

unsigned long thp_size(struct page *page)¶: Size of a transparent huge page.

Parameters

struct page *page: Head page of a transparent huge page.

Return

Number of bytes in this page.

void folio_get(struct folio *folio)¶: Increment the reference count on a folio.

Parameters

struct folio *folio: The folio.

Context

May be called in any context, as long as you know that you have a refcount on the folio. If you do not already have one, folio_try_get() may be the right interface for you to use.

void folio_put(struct folio *folio)¶: Decrement the reference count on a folio.

Parameters

struct folio *folio: The folio.

Description

If the folio’s reference count reaches zero, the memory will be released back to the page allocator and may be used by another allocation immediately. Do not access the memory or the struct folio after calling folio_put() unless you can be sure that it wasn’t the last reference.

Context

May be called in process or interrupt context, but not in NMI context. May be called while holding a spinlock.

void folio_put_refs(struct folio *folio, int refs)¶: Reduce the reference count on a folio.

Parameters

struct folio *folio: The folio.
int refs: The amount to subtract from the folio’s reference count.

Description

If the folio’s reference count reaches zero, the memory will be released back to the page allocator and may be used by another allocation immediately. Do not access the memory or the struct folio after calling folio_put_refs() unless you can be sure that these weren’t the last references.

Context

May be called in process or interrupt context, but not in NMI context. May be called while holding a spinlock.

void folios_put(struct folio_batch *folios)¶: Decrement the reference count on an array of folios.

Parameters

struct folio_batch *folios: The folios.

Description

Like folio_put(), but for a batch of folios. This is more efficient than writing the loop yourself as it will optimise the locks which need to be taken if the folios are freed. The folios batch is returned empty and ready to be reused for another batch; there is no need to reinitialise it.

Context

May be called in process or interrupt context, but not in NMI context. May be called while holding a spinlock.

unsigned long folio_pfn(const struct folio *folio)¶: Return the Page Frame Number of a folio.

Parameters

const struct folio *folio: The folio.

Description

A folio may contain multiple pages. The pages have consecutive Page Frame Numbers.

Return

The Page Frame Number of the first page in the folio.

pte_t folio_mk_pte(const struct folio *folio, pgprot_t pgprot)¶: Create a PTE for this folio

Parameters

const struct folio *folio: The folio to create a PTE for
pgprot_t pgprot: The page protection bits to use

Description

Create a page table entry for the first page of this folio. This is suitable for passing to set_ptes().

Return

A page table entry suitable for mapping this folio.

pmd_t folio_mk_pmd(const struct folio *folio, pgprot_t pgprot)¶: Create a PMD for this folio

Parameters

const struct folio *folio: The folio to create a PMD for
pgprot_t pgprot: The page protection bits to use

Description

Create a page table entry for the first page of this folio. This is suitable for passing to set_pmd_at().

Return

A page table entry suitable for mapping this folio.

pud_t folio_mk_pud(const struct folio *folio, pgprot_t pgprot)¶: Create a PUD for this folio

Parameters

const struct folio *folio: The folio to create a PUD for
pgprot_t pgprot: The page protection bits to use

Description

Create a page table entry for the first page of this folio. This is suitable for passing to set_pud_at().

Return

A page table entry suitable for mapping this folio.

bool folio_maybe_dma_pinned(struct folio *folio)¶: Report if a folio may be pinned for DMA.

Parameters

struct folio *folio: The folio.

Description

This function checks if a folio has been pinned via a call to a function in the pin_user_pages() family.

For small folios, the return value is partially fuzzy: false is not fuzzy, because it means “definitely not pinned for DMA”, but true means “probably pinned for DMA, but possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS worth of normal folio references”.

False positives are OK, because: a) it’s unlikely for a folio to get that many refcounts, and b) all the callers of this routine are expected to be able to deal gracefully with a false positive.

For most large folios, the result will be exactly correct. That’s because we have more tracking data available: the _pincount field is used instead of the GUP_PIN_COUNTING_BIAS scheme.

For more information, please see pin_user_pages() and related calls.

Return

True, if it is likely that the folio has been “dma-pinned”. False, if the folio is definitely not dma-pinned.
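
The difference between the fuzzy small-folio answer and the exact large-folio answer can be modelled as follows. This is a user-space sketch: struct model_folio and its fields are hypothetical, and the value 1024 for the bias is an assumption of the model rather than something stated in this document.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value for this sketch; the kernel defines GUP_PIN_COUNTING_BIAS. */
#define MODEL_GUP_PIN_COUNTING_BIAS 1024u

struct model_folio {
	bool large;		/* more than one page */
	unsigned int refcount;	/* normal references plus bias per pin */
	unsigned int pincount;	/* exact pin count, large folios only */
};

static bool model_maybe_dma_pinned(const struct model_folio *folio)
{
	if (folio->large)
		return folio->pincount > 0;	/* exact */
	/* small folio: a refcount at or above the bias looks pinned */
	return folio->refcount >= MODEL_GUP_PIN_COUNTING_BIAS;
}
```

A small folio holding an unusually large number of ordinary references is reported as pinned by this model, which is the false positive the description refers to.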

bool is_zero_page(const struct page *page)¶: Query if a page is a zero page

Parameters

const struct page *page: The page to query

Description

This returns true if page is one of the permanent zero pages.

bool is_zero_folio(const struct folio *folio)¶: Query if a folio is a zero page

Parameters

const struct folio *folio: The folio to query

Description

This returns true if folio is one of the permanent zero pages.

unsigned long folio_nr_pages(const struct folio *folio)¶: The number of pages in the folio.

Parameters

const struct folio *folio: The folio.

Return

A positive power of two.

struct folio *folio_next(struct folio *folio)¶: Move to the next physical folio.

Parameters

struct folio *folio: The folio we’re currently operating on.

Description

If you have physically contiguous memory which may span more than one folio (e.g. a struct bio_vec), use this function to move from one folio to the next. Do not use it if the memory is only virtually contiguous, as the folios are almost certainly not adjacent to each other. This is the folio equivalent of writing page++.

Context

We assume that the folios are refcounted and/or locked at a higher level and do not adjust the reference counts.

Return

The next struct folio.

unsigned int folio_shift(const struct folio *folio)¶: The size of the memory described by this folio.

Parameters

const struct folio *folio: The folio.

Description

A folio represents a number of bytes which is a power-of-two in size. This function tells you which power-of-two the folio is. See also folio_size() and folio_order().

Context

The caller should have a reference on the folio to prevent it from being split. It is not necessary for the folio to be locked.

Return

The base-2 logarithm of the size of this folio.

size_t folio_size(const struct folio *folio)¶: The number of bytes in a folio.

Parameters

const struct folio *folio: The folio.

Context

The caller should have a reference on the folio to prevent it from being split. It is not necessary for the folio to be locked.

Return

The number of bytes in this folio.
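
folio_order(), folio_nr_pages(), folio_shift() and folio_size() all describe the same quantity in different units. The user-space sketch below spells out the arithmetic; it assumes a base page shift of 12 (4 KiB pages), and the MODEL_ and model_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12u	/* assumption: 4 KiB base pages */

/* Number of pages: 2^order. */
static unsigned long model_folio_nr_pages(unsigned int order)
{
	return 1UL << order;
}

/* Base-2 logarithm of the size in bytes. */
static unsigned int model_folio_shift(unsigned int order)
{
	return MODEL_PAGE_SHIFT + order;
}

/* Size in bytes: 2^(page shift + order). */
static size_t model_folio_size(unsigned int order)
{
	return (size_t)1 << model_folio_shift(order);
}
```

With these assumptions an order-9 folio covers 512 pages and 2 MiB.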

bool folio_maybe_mapped_shared(struct folio *folio)¶: Whether the folio is mapped into the page tables of more than one MM

Parameters

struct folio *folio: The folio.

Description

This function checks whether the folio may currently be mapped into more than one MM (“maybe mapped shared”), or whether the folio is certainly mapped into a single MM (“mapped exclusively”).

For KSM folios, this function also returns “mapped shared” when a folio is mapped multiple times into the same MM, because the individual page mappings are independent.

For small anonymous folios and anonymous hugetlb folios, the return value will be exactly correct: non-KSM folios can only be mapped at most once into an MM, and they cannot be partially mapped. KSM folios are considered shared even if mapped multiple times into the same MM.

For other folios, the result can be fuzzy:

For partially-mappable large folios (THP), the return value can wrongly indicate “mapped shared” (false positive) if a folio was mapped by more than two MMs at one point in time.
For pagecache folios (including hugetlb), the return value can wrongly indicate “mapped shared” (false positive) when two VMAs in the same MM cover the same file range.

Further, this function only considers current page table mappings that are tracked using the folio mapcount(s).

This function does not consider:

If the folio might get mapped in the (near) future (e.g., swapcache, pagecache, temporary unmapping for migration).
If the folio is mapped differently (VM_PFNMAP).
If hugetlb page table sharing applies. Callers might want to check hugetlb_pmd_shared().

Return

Whether the folio is estimated to be mapped into more than one MM.

int folio_expected_ref_count(const struct folio *folio)¶: calculate the expected folio refcount

Parameters

const struct folio *folio: the folio

Description

Calculate the expected folio refcount, taking references from the pagecache, swapcache, PG_private and page table mappings into account. Useful in combination with folio_ref_count() to detect unexpected references (e.g., GUP or other temporary references).

This function currently does not consider references from the LRU cache. If the folio was isolated from the LRU (which is the case during migration or split), the LRU cache does not apply.

Calling this function on an unmapped folio -- !folio_mapped() -- that is locked will return a stable result.

Calling this function on a mapped folio will not result in a stable result, because nothing stops additional page table mappings from coming (e.g., fork()) or going (e.g., munmap()).

Calling this function without the folio lock will also not result in a stable result: for example, the folio might get dropped from the swapcache concurrently.

However, even when called without the folio lock or on a mapped folio, this function can be used to detect unexpected references early (for example, if it makes sense to even lock the folio and unmap it).

The caller must add any reference (e.g., from folio_try_get()) it might be holding itself to the result.

Returns the expected folio refcount.
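
The comparison against folio_ref_count() can be sketched in user space. This model covers only a page cache folio and assumes one page cache reference per page, one reference for private data and one per page table mapping; struct model_folio and the model_ helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct model_folio {
	unsigned int order;	/* folio spans 2^order pages */
	bool in_pagecache;	/* assumed: one reference per page */
	bool has_private;	/* assumed: one reference in total */
	int mapcount;		/* one reference per page table mapping */
	int refcount;		/* actual reference count */
};

/* Model of folio_expected_ref_count() for a page cache folio. */
static int model_expected_ref_count(const struct model_folio *folio)
{
	int refs = 0;

	if (folio->in_pagecache)
		refs += 1 << folio->order;
	if (folio->has_private)
		refs += 1;
	return refs + folio->mapcount;
}

/* own_refs: references the caller itself holds, added as required. */
static bool model_has_unexpected_refs(const struct model_folio *folio,
				      int own_refs)
{
	return folio->refcount !=
	       model_expected_ref_count(folio) + own_refs;
}
```

A caller that holds one reference of its own passes own_refs = 1; any surplus then indicates a reference the model does not account for, such as a temporary one.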

void *ptdesc_address(const struct ptdesc *pt)¶: Virtual address of page table.

Parameters

const struct ptdesc *pt: Page table descriptor.

Return

The first byte of the page table described by pt.

void ptdesc_set_kernel(struct ptdesc *ptdesc)¶: Mark a ptdesc used to map the kernel

Parameters

struct ptdesc *ptdesc: The ptdesc to be marked

Description

Kernel page tables often need special handling. Set a flag so that the handling code knows this ptdesc will not be used for userspace.

void ptdesc_clear_kernel(struct ptdesc *ptdesc)¶: Mark a ptdesc as no longer used to map the kernel

Parameters

struct ptdesc *ptdesc: The ptdesc to be unmarked

Description

Use when the ptdesc is no longer used to map the kernel and no longer needs special handling.

bool ptdesc_test_kernel(const struct ptdesc *ptdesc)¶: Check if a ptdesc is used to map the kernel

Parameters

const struct ptdesc *ptdesc: The ptdesc being tested

Description

Call to tell whether the ptdesc is used to map the kernel.

struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)¶: Allocate pagetables

Parameters

gfp_t gfp: GFP flags
unsigned int order: desired pagetable order

Description

pagetable_alloc allocates memory for page tables as well as a page table descriptor to describe that memory.

Return

The ptdesc describing the allocated page tables.

void pagetable_free(struct ptdesc *pt)¶: Free pagetables

Parameters

struct ptdesc *pt: The page table descriptor

Description

pagetable_free frees the memory of all page tables described by a page table descriptor and the memory for the descriptor itself.

struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)¶: Find a VMA at a specific address

Parameters

struct mm_struct *mm: The process address space.
unsigned long addr: The user address.

Return

The vm_area_struct at the given address, NULL otherwise.

void mmap_action_remap(struct vm_area_desc *desc, unsigned long start, unsigned long start_pfn, unsigned long size)¶: helper for mmap_prepare hook to specify that a pure PFN remap is required.

Parameters

struct vm_area_desc *desc: The VMA descriptor for the VMA requiring remap.
unsigned long start: The virtual address to start the remap from, must be within the VMA.
unsigned long start_pfn: The first PFN in the range to remap.
unsigned long size: The size of the range to remap, in bytes, at most spanning to the end of the VMA.

void mmap_action_remap_full(struct vm_area_desc *desc, unsigned long start_pfn)¶: helper for mmap_prepare hook to specify that the entirety of a VMA should be PFN remapped.

Parameters

struct vm_area_desc *desc: The VMA descriptor for the VMA requiring remap.
unsigned long start_pfn: The first PFN in the range to remap.

void mmap_action_ioremap(struct vm_area_desc *desc, unsigned long start, unsigned long start_pfn, unsigned long size)¶: helper for mmap_prepare hook to specify that a pure PFN I/O remap is required.

Parameters

struct vm_area_desc *desc: The VMA descriptor for the VMA requiring remap.
unsigned long start: The virtual address to start the remap from, must be within the VMA.
unsigned long start_pfn: The first PFN in the range to remap.
unsigned long size: The size of the range to remap, in bytes, at most spanning to the end of the VMA.

void mmap_action_ioremap_full(struct vm_area_desc *desc, unsigned long start_pfn)¶: helper for mmap_prepare hook to specify that the entirety of a VMA should be PFN I/O remapped.

Parameters

struct vm_area_desc *desc: The VMA descriptor for the VMA requiring remap.
unsigned long start_pfn: The first PFN in the range to remap.

bool vma_is_special_huge(const struct vm_area_struct *vma)¶: Are transhuge page-table entries considered special?

Parameters

const struct vm_area_struct *vma: Pointer to the struct vm_area_struct to consider

Description

Whether transhuge page-table entries are considered “special” following the definition in vm_normal_page().

Return

true if transhuge page-table entries should be considered special, false otherwise.

int folio_ref_count(const struct folio *folio)¶: The reference count on this folio.

Parameters

const struct folio *folio: The folio.

Description

The refcount is usually incremented by calls to folio_get() and decremented by calls to folio_put(). Some typical users of the folio refcount:

Each reference from a page table
The page cache
Filesystem private data
The LRU list
Pipes
Direct IO which references this page in the process address space

Return

The number of references to this folio.

bool folio_try_get(struct folio *folio)¶: Attempt to increase the refcount on a folio.

Parameters

struct folio *folio: The folio.

Description

If you do not already have a reference to a folio, you can attempt to get one using this function. It may fail if, for example, the folio has been freed since you found a pointer to it, or it is frozen for the purposes of splitting or migration.

Return

True if the reference count was successfully incremented.
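
The "increment unless zero" behaviour can be modelled with C11 atomics. This is a user-space sketch, not the kernel implementation: struct model_folio, model_folio_try_get() and model_folio_put() are hypothetical, and a refcount of zero stands for both the freed and the frozen case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_folio { atomic_int refcount; };

/* Take a reference unless the count is zero (freed or frozen here). */
static bool model_folio_try_get(struct model_folio *folio)
{
	int old = atomic_load(&folio->refcount);

	while (old != 0) {
		/* on failure, old is reloaded with the current value */
		if (atomic_compare_exchange_weak(&folio->refcount,
						 &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; returns true if it was the last one. */
static bool model_folio_put(struct model_folio *folio)
{
	return atomic_fetch_sub(&folio->refcount, 1) == 1;
}
```

An unconditional increment would be wrong here because it could revive a count that has already reached zero; the compare-and-exchange loop avoids that.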

int is_highmem(const struct zone *zone)¶: helper function to quickly check if a struct zone is a highmem zone or not. This is an attempt to keep references to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum.

Parameters

const struct zone *zone: pointer to struct zone variable

Return

1 for a highmem zone, 0 otherwise

for_each_online_pgdat¶

for_each_online_pgdat (pgdat)

helper macro to iterate over all online nodes

Parameters

pgdat: pointer to a pg_data_t variable

for_each_zone¶

for_each_zone (zone)

helper macro to iterate over all memory zones

Parameters

zone: pointer to struct zone variable

Description

The user only needs to declare the zone variable, for_each_zone fills it in.

struct zoneref *next_zones_zonelist(struct zoneref *z, enum zone_type highest_zoneidx, nodemask_t *nodes)¶: Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point

Parameters

struct zoneref *z: The cursor used as a starting point for the search
enum zone_type highest_zoneidx: The zone index of the highest zone to return
nodemask_t *nodes: An optional nodemask to filter the zonelist with

Description

This function returns the next zone at or below a given zone index that is within the allowed nodemask using a cursor as the starting point for the search. The zoneref returned is a cursor that represents the current zone being examined. It should be advanced by one before calling next_zones_zonelist again.

Return

the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point

struct zoneref *first_zones_zonelist(struct zonelist *zonelist, enum zone_type highest_zoneidx, nodemask_t *nodes)¶: Returns the first zone at or below highest_zoneidx within the allowed nodemask in a zonelist

Parameters

struct zonelist *zonelist: The zonelist to search for a suitable zone
enum zone_type highest_zoneidx: The zone index of the highest zone to return
nodemask_t *nodes: An optional nodemask to filter the zonelist with

Description

This function returns the first zone at or below a given zone index that is within the allowed nodemask. The zoneref returned is a cursor that can be used to iterate the zonelist with next_zones_zonelist by advancing it by one before calling.

When no eligible zone is found, zoneref->zone is NULL (zoneref itself is never NULL). This may happen either genuinely, or because of a concurrent nodemask update caused by cpuset modification.

Return

Zoneref pointer for the first suitable zone found
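
The cursor protocol (find the first eligible entry, then advance by one before searching again, and stop when the zone pointer is NULL) can be sketched in user space. The model omits nodemask filtering, and struct model_zone, struct model_zoneref and the model_ helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct model_zone { int id; };

/* One zonelist entry; the list ends with an entry whose zone is NULL. */
struct model_zoneref {
	struct model_zone *zone;
	int zone_idx;
};

/* Model of next_zones_zonelist() without nodemask filtering. */
static struct model_zoneref *model_next_zones(struct model_zoneref *z,
					      int highest_zoneidx)
{
	while (z->zone && z->zone_idx > highest_zoneidx)
		z++;
	return z;	/* never NULL; z->zone is NULL at the end */
}

/* Count eligible zones by walking the list with the cursor. */
static size_t model_count_zones(struct model_zoneref *zonerefs,
				int highest_zoneidx)
{
	struct model_zoneref *z;
	size_t n = 0;

	for (z = model_next_zones(zonerefs, highest_zoneidx); z->zone;
	     z = model_next_zones(z + 1, highest_zoneidx))
		n++;
	return n;
}
```

The loop in model_count_zones() has the same shape as the for_each_zone_zonelist family of iterators: the first search starts at the head of the list and each later search starts one entry past the current cursor.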

for_each_zone_zonelist_nodemask¶

for_each_zone_zonelist_nodemask (zone, z, zlist, highidx, nodemask)

helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask

Parameters

zone: The current zone in the iterator
z: The current pointer within zonelist->_zonerefs being iterated
zlist: The zonelist being iterated
highidx: The zone index of the highest zone to return
nodemask: Nodemask allowed by the allocator

Description

This iterator iterates through all zones at or below a given zone index and within a given nodemask.

for_each_zone_zonelist¶

for_each_zone_zonelist (zone, z, zlist, highidx)

helper macro to iterate over valid zones in a zonelist at or below a given zone index

Parameters

zone: The current zone in the iterator
z: The current pointer within zonelist->_zonerefs being iterated
zlist: The zonelist being iterated
highidx: The zone index of the highest zone to return

Description

This iterator iterates through all zones at or below a given zone index.

int pfn_valid(unsigned long pfn)¶: check if there is a valid memory map entry for a PFN

Parameters

unsigned long pfn: the page frame number to check

Description

Check if there is a valid memory map entry, also known as a struct page, for the pfn. Note that the availability of the memory map entry does not imply that there is actual usable memory at that pfn. The struct page may represent a hole or an unusable page frame.

Return

1 for PFNs that have memory map entries and 0 otherwise

struct address_space *folio_mapping(const struct folio *folio)¶: Find the mapping where this folio is stored.

Parameters

const struct folio *folio: The folio.

Description

For folios which are in the page cache, return the mapping that this page belongs to. Folios in the swap cache return the swap mapping this page is stored in (which is different from the mapping for the swap file or swap device where the data is stored).

You can call this for folios which aren’t in the swap cache or page cache and it will return NULL.

int __anon_vma_prepare(struct vm_area_struct *vma)¶: attach an anon_vma to a memory region

Parameters

struct vm_area_struct *vma: the memory region in question

Description

This makes sure the memory mapping described by ‘vma’ has an ‘anon_vma’ attached to it, so that we can associate the anonymous pages mapped into it with that anon_vma.

The common case will be that we already have one, which is handled inline by anon_vma_prepare(). But if not we either need to find an adjacent mapping that we can re-use the anon_vma from (very common when the only reason for splitting a vma has been mprotect()), or we allocate a new one.

Anon-vma allocations are very subtle, because we may have optimistically looked up an anon_vma in folio_lock_anon_vma_read() and that may actually touch the rwsem even in the newly allocated vma (it depends on RCU to make sure that the anon_vma isn’t actually destroyed).

As a result, we need to do proper anon_vma locking even for the new allocation. At the same time, we do not want to do any locking for the common case of already having an anon_vma.

unsigned long page_address_in_vma(const struct folio *folio, const struct page *page, const struct vm_area_struct *vma)¶: The virtual address of a page in this VMA.

Parameters

const struct folio *folio: The folio containing the page.
const struct page *page: The page within the folio.
const struct vm_area_struct *vma: The VMA we need to know the address in.

Description

Calculates the user virtual address of this page in the specified VMA. It is the caller’s responsibility to check the page is actually within the VMA. There may not currently be a PTE pointing at this page, but if a page fault occurs at this address, this is the page which will be accessed.

Context

Caller should hold a reference to the folio. Caller should hold a lock (e.g. the i_mmap_lock or the mmap_lock) which keeps the VMA from being altered.

Return

The virtual address corresponding to this page in the VMA.
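
The address calculation is linear in the page offset. The user-space sketch below shows it for a page identified directly by its offset in the mapped object; it assumes 4 KiB pages, ignores arithmetic overflow, and uses hypothetical names (struct model_vma, model_page_address_in_vma()) rather than the kernel's types.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12u	/* assumption: 4 KiB base pages */

/* Hypothetical stand-in for the relevant fields of a VMA. */
struct model_vma {
	unsigned long vm_start;	/* first user address covered */
	unsigned long vm_end;	/* first user address past the end */
	unsigned long vm_pgoff;	/* page offset mapped at vm_start */
};

/*
 * Compute the user address at which page offset pgoff appears in vma.
 * Returns false when the offset lies outside the VMA.
 */
static bool model_page_address_in_vma(const struct model_vma *vma,
				      unsigned long pgoff,
				      unsigned long *addr)
{
	unsigned long a;

	if (pgoff < vma->vm_pgoff)
		return false;
	a = vma->vm_start + ((pgoff - vma->vm_pgoff) << MODEL_PAGE_SHIFT);
	if (a >= vma->vm_end)
		return false;
	*addr = a;
	return true;
}
```

The range checks in the model stand in for the caller's responsibility, stated above, to make sure the page actually lies within the VMA.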

int folio_referenced(struct folio *folio, int is_locked, struct mem_cgroup *memcg, vm_flags_t *vm_flags)¶: Test if the folio was referenced.

Parameters

struct folio *folio: The folio to test.
int is_locked: Caller holds lock on the folio.
struct mem_cgroup *memcg: target memory cgroup
vm_flags_t *vm_flags: A combination of all the vma->vm_flags which referenced the folio.

Description

Quick test_and_clear_referenced for all mappings of a folio.

Return

The number of mappings which referenced the folio. Return -1 if the function bailed out due to rmap lock contention.

int mapping_wrprotect_range(struct address_space *mapping, pgoff_t pgoff, unsigned long pfn, unsigned long nr_pages)¶: Write-protect all mappings in a specified range.

Parameters

struct address_space *mapping: The mapping whose reverse mapping should be traversed.
pgoff_t pgoff: The page offset at which pfn is mapped within mapping.
unsigned long pfn: The PFN of the page mapped in mapping at pgoff.
unsigned long nr_pages: The number of physically contiguous base pages spanned.

Description

Traverses the reverse mapping, finding all VMAs which contain a shared mapping of the pages in the specified range in mapping, and write-protects them (that is, updates the page tables to mark the mappings read-only such that a write protection fault arises when the mappings are written to).

The pfn value need not refer to a folio, but rather can reference a kernel allocation which is mapped into userland. We therefore do not require that the page maps to a folio with a valid mapping or index field, rather the caller specifies these in mapping and pgoff.

Return

the number of write-protected PTEs, or an error.

int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, struct vm_area_struct *vma)¶: Cleans the PTEs (including PMDs) mapped with range of [pfn, pfn + nr_pages) at the specific offset (pgoff) within the vma of shared mappings. And since clean PTEs should also be readonly, write protects them too.

Parameters

unsigned long pfn: start pfn.
unsigned long nr_pages: number of physically contiguous pages starting with pfn.
pgoff_t pgoff: page offset at which the pfn is mapped.
struct vm_area_struct *vma: vma within which the pfn is mapped.

Description

Returns the number of cleaned PTEs (including PMDs).

void folio_move_anon_rmap(struct folio *folio, struct vm_area_struct *vma)¶: move a folio to our anon_vma

Parameters

struct folio *folio: The folio to move to our anon_vma
struct vm_area_struct *vma: The vma the folio belongs to

Description

When a folio belongs exclusively to one process after a COW event, that folio can be moved into the anon_vma that belongs to just that process, so the rmap code will not search the parent or sibling processes.

void __folio_set_anon(struct folio *folio, struct vm_area_struct *vma, unsigned long address, bool exclusive)¶: set up a new anonymous rmap for a folio

Parameters

struct folio *folio: The folio to set up the new anonymous rmap for.
struct vm_area_struct *vma: VM area to add the folio to.
unsigned long address: User virtual address of the mapping
bool exclusive: Whether the folio is exclusive to the process.

void __page_check_anon_rmap(const struct folio *folio, const struct page *page, struct vm_area_struct *vma, unsigned long address)¶: sanity check anonymous rmap addition

Parameters

const struct folio *folio: The folio containing page.
const struct page *page: the page to check the mapping of
struct vm_area_struct *vma: the vm area in which the mapping is added
unsigned long address: the user virtual address mapped

void folio_add_anon_rmap_ptes(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *vma, unsigned long address, rmap_t flags)¶: add PTE mappings to a page range of an anon folio

Parameters

struct folio *folio: The folio to add the mappings to
struct page *page: The first page to add
int nr_pages: The number of pages which will be mapped
struct vm_area_struct *vma: The vm area in which the mappings are added
unsigned long address: The user virtual address of the first page to map
rmap_t flags: The rmap flags

Description

The page range of folio is defined by [page, page + nr_pages)

The caller needs to hold the page table lock, and the page must be locked in the anon_vma case: to serialize mapping and index checking after setting, and to ensure that an anon folio is not being upgraded racily to a KSM folio (but KSM folios are never downgraded).

void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page, struct vm_area_struct *vma, unsigned long address, rmap_t flags)¶: add a PMD mapping to a page range of an anon folio

Parameters

struct folio *folio: The folio to add the mapping to
struct page *page: The first page to add
struct vm_area_struct *vma: The vm area in which the mapping is added
unsigned long address: The user virtual address of the first page to map
rmap_t flags: The rmap flags

Description

The page range of folio is defined by [page, page + HPAGE_PMD_NR)

The caller needs to hold the page table lock, and the page must be locked in the anon_vma case: to serialize mapping and index checking after setting.

void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, unsigned long address, rmap_t flags)¶: Add mapping to a new anonymous folio.

Parameters

struct folio *folio: The folio to add the mapping to.
struct vm_area_struct *vma: the vm area in which the mapping is added
unsigned long address: the user virtual address mapped
rmap_t flags: The rmap flags

Description

Like folio_add_anon_rmap_*() but must only be called on new folios. This means the inc-and-test can be bypassed. The folio doesn’t necessarily need to be locked while it’s exclusive unless two threads map it concurrently. However, the folio must be locked if it’s shared.

If the folio is pmd-mappable, it is accounted as a THP.

void folio_add_file_rmap_ptes(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *vma)¶: add PTE mappings to a page range of a folio

Parameters

struct folio *folio: The folio to add the mappings to
struct page *page: The first page to add
int nr_pages: The number of pages that will be mapped using PTEs
struct vm_area_struct *vma: The vm area in which the mappings are added

Description

The page range of the folio is defined by [page, page + nr_pages)

The caller needs to hold the page table lock.

void folio_add_file_rmap_pmd(struct folio *folio, struct page *page, struct vm_area_struct *vma)¶: add a PMD mapping to a page range of a folio

Parameters

struct folio *folio: The folio to add the mapping to
struct page *page: The first page to add
struct vm_area_struct *vma: The vm area in which the mapping is added

Description

The page range of the folio is defined by [page, page + HPAGE_PMD_NR)

The caller needs to hold the page table lock.

void folio_add_file_rmap_pud(struct folio *folio, struct page *page, struct vm_area_struct *vma)¶: add a PUD mapping to a page range of a folio

Parameters

struct folio *folio: The folio to add the mapping to
struct page *page: The first page to add
struct vm_area_struct *vma: The vm area in which the mapping is added

Description

The page range of the folio is defined by [page, page + HPAGE_PUD_NR)

The caller needs to hold the page table lock.

void folio_remove_rmap_ptes(struct folio *folio, struct page *page, int nr_pages, struct vm_area_struct *vma)¶: remove PTE mappings from a page range of a folio

Parameters

struct folio *folio: The folio to remove the mappings from
struct page *page: The first page to remove
int nr_pages: The number of pages that will be removed from the mapping
struct vm_area_struct *vma: The vm area from which the mappings are removed

Description

The page range of the folio is defined by [page, page + nr_pages)

The caller needs to hold the page table lock.

void folio_remove_rmap_pmd(struct folio *folio, struct page *page, struct vm_area_struct *vma)¶: remove a PMD mapping from a page range of a folio

Parameters

struct folio *folio: The folio to remove the mapping from
struct page *page: The first page to remove
struct vm_area_struct *vma: The vm area from which the mapping is removed

Description

The page range of the folio is defined by [page, page + HPAGE_PMD_NR)

The caller needs to hold the page table lock.

void folio_remove_rmap_pud(struct folio *folio, struct page *page, struct vm_area_struct *vma)¶: remove a PUD mapping from a page range of a folio

Parameters

struct folio *folio: The folio to remove the mapping from
struct page *page: The first page to remove
struct vm_area_struct *vma: The vm area from which the mapping is removed

Description

The page range of the folio is defined by [page, page + HPAGE_PUD_NR)

The caller needs to hold the page table lock.

void try_to_unmap(struct folio *folio, enum ttu_flags flags)¶: Try to remove all page table mappings to a folio.

Parameters

struct folio *folio: The folio to unmap.
enum ttu_flags flags: action and flags

Description

Tries to remove all the page table entries which are mapping this folio. It is the caller’s responsibility to check if the folio is still mapped if needed (use TTU_SYNC to prevent accounting races).

Context

Caller must hold the folio lock.

void try_to_migrate(struct folio *folio, enum ttu_flags flags)¶: try to replace all page table mappings with swap entries

Parameters

struct folio *folio: the folio to replace page table entries for
enum ttu_flags flags: action and flags

Description

Tries to remove all the page table entries which are mapping this folio and replace them with special swap entries. Caller must hold the folio lock.

struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr, void *owner, struct folio **foliop)¶: Mark a page for exclusive use by a device

Parameters

struct mm_struct *mm: mm_struct of associated target process
unsigned long addr: the virtual address to mark for exclusive device access
void *owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
struct folio **foliop: folio pointer will be stored here on success.

Description

This function looks up the page mapped at the given address, grabs a folio reference, locks the folio and replaces the PTE with a special device-exclusive PFN swap entry, preventing access through the process page tables. The function will return with the folio locked and referenced.

On fault, the device-exclusive entries are replaced with the original PTE under folio lock, after calling MMU notifiers.

Only anonymous non-hugetlb folios are supported and the VMA must have write permissions such that we can fault in the anonymous page writable in order to mark it exclusive. The caller must hold the mmap_lock in read mode.

A driver using this to program access from a device must use an MMU notifier critical section to hold a device-specific lock during programming. Once programming is complete, it should drop the folio lock and reference, after which point CPU access to the page will revoke the exclusive access.

Notes

This function always operates on individual PTEs mapping individual pages. PMD-sized THPs are first remapped to be mapped by PTEs before the conversion happens on a single PTE corresponding to addr.

While concurrent access through the process page tables is prevented, concurrent access through other page references (e.g., earlier GUP invocation) is not handled and not supported.

Device-exclusive entries are considered “clean” and “old” by core-mm. Device drivers must update the folio state when informed by MMU notifiers.

Return

pointer to mapped page on success, otherwise a negative error.

void __rmap_walk_file(struct folio *folio, struct address_space *mapping, pgoff_t pgoff_start, unsigned long nr_pages, struct rmap_walk_control *rwc, bool locked)¶: Traverse the reverse mapping for a file-backed mapping of a page mapped within a specified page cache object at a specified offset.

Parameters

struct folio *folio: Either the folio whose mappings to traverse, or if NULL, the callbacks specified in rwc will be configured so as to be able to look up mappings correctly.
struct address_space *mapping: The page cache object whose mapping VMAs we intend to traverse. If folio is non-NULL, this should be equal to folio_mapping(folio).
pgoff_t pgoff_start: The offset within mapping of the page which we are looking up. If folio is non-NULL, this should be equal to folio_pgoff(folio).
unsigned long nr_pages: The number of pages mapped by the mapping. If folio is non-NULL, this should be equal to folio_nr_pages(folio).
struct rmap_walk_control *rwc: The reverse mapping walk control object describing how the traversal should proceed.
bool locked: Is the mapping already locked? If not, we acquire the lock.

bool isolate_movable_ops_page(struct page *page, isolate_mode_t mode)¶: isolate a movable_ops page for migration

Parameters

struct page *page: The page.
isolate_mode_t mode: The isolation mode.

Description

Try to isolate a movable_ops page for migration. Will fail if the page is not a movable_ops page, if the page is already isolated for migration, or if the page was just released by its owner.

Once isolated, the page cannot get freed until it is either putback or migrated.

Returns true if isolation succeeded, otherwise false.

void putback_movable_ops_page(struct page *page)¶: putback an isolated movable_ops page

Parameters

struct page *page: The isolated page.

Description

Putback an isolated movable_ops page.

After the page has been put back, it might get freed instantly.

int migrate_movable_ops_page(struct page *dst, struct page *src, enum migrate_mode mode)¶: migrate an isolated movable_ops page

Parameters

struct page *dst: The destination page.
struct page *src: The source page.
enum migrate_mode mode: The migration mode.

Description

Migrate an isolated movable_ops page.

If the src page was already released by its owner, the src page is un-isolated (putback) and migration succeeds; the migration core will be the owner of both pages.

If the src page was not released by its owner and the migration was successful, the owner of the src page and the dst page are swapped and the src page is un-isolated.

If migration fails, the ownership stays unmodified and the src page remains isolated: migration may be retried later or the page can be putback.

TODO: migration core will treat both pages as folios and lock them before this call to unlock them after this call. Further, the folio refcounts on src and dst are also released by migration core. These pages will not be folios in the future, so that must be reworked.

Returns 0 on success, otherwise a negative error code.

int migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode mode)¶: Simple folio migration.

Parameters

struct address_space *mapping: The address_space containing the folio.
struct folio *dst: The folio to migrate the data to.
struct folio *src: The folio containing the current data.
enum migrate_mode mode: How to migrate the page.

Description

Common logic to directly migrate a single LRU folio suitable for folios that do not have private data.

Folios are locked upon entry and exit.

int buffer_migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode mode)¶: Migration function for folios with buffers.

Parameters

struct address_space *mapping: The address space containing src.
struct folio *dst: The folio to migrate to.
struct folio *src: The folio to migrate from.
enum migrate_mode mode: How to migrate the folio.

Description

This function can only be used if the underlying filesystem guarantees that no other references to src exist. For example attached buffer heads are accessed only under the folio lock. If your filesystem cannot provide this guarantee, buffer_migrate_folio_norefs() may be more appropriate.

Return

0 on success or a negative errno on failure.

int buffer_migrate_folio_norefs(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode mode)¶: Migration function for folios with buffers.

Parameters

struct address_space *mapping: The address space containing src.
struct folio *dst: The folio to migrate to.
struct folio *src: The folio to migrate from.
enum migrate_mode mode: How to migrate the folio.

Description

Like buffer_migrate_folio() except that this variant is more careful and checks that there are also no buffer head references. This function is the right one for mappings where buffer heads are directly looked up and referenced (such as block device mappings).

Return

0 on success or a negative errno on failure.

unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, struct list_head *uf)¶: Perform a userland memory mapping into the current process address space of length len with protection bits prot, mmap flags flags (from which VMA flags will be inferred), and any additional VMA flags to apply vm_flags. If this is a file-backed mapping then the file is specified in file and page offset into the file via pgoff.

Parameters

struct file *file: An optional struct file pointer describing the file which is to be mapped, if a file-backed mapping.
unsigned long addr: If non-zero, hints at (or if flags has MAP_FIXED set, specifies) the address at which to perform this mapping. See mmap (2) for details. Must be page-aligned.
unsigned long len: The length of the mapping. Will be page-aligned and must be at least 1 page in size.
unsigned long prot: Protection bits describing access required to the mapping. See mmap (2) for details.
unsigned long flags: Flags specifying how the mapping should be performed, see mmap (2) for details.
vm_flags_t vm_flags: VMA flags which should be set by default, or 0 otherwise.
unsigned long pgoff: Page offset into the file if file-backed, should be 0 otherwise.
unsigned long *populate: A pointer to a value which will be set to 0 if no population of the range is required, or the number of bytes to populate if it is. Must be non-NULL. See mmap (2) for details as to under what circumstances population of the range occurs.
struct list_head *uf: An optional pointer to a list head to track userfaultfd unmap events should unmapping events arise. If provided, it is up to the caller to manage this.

Description

This function does not perform security checks on the file and assumes, if uf is non-NULL, the caller has provided a list head to track unmap events for userfaultfd uf.

It also simply indicates whether memory population is required by setting populate, which must be non-NULL, expecting the caller to actually perform this task itself if appropriate.

This function will invoke architecture-specific (and if provided and relevant, file system-specific) logic to determine the most appropriate unmapped area in which to place the mapping if not MAP_FIXED.

Callers which require userland mmap() behaviour should invoke vm_mmap(), which is also exported for module use.

Callers which require this behaviour without the security checks, userfaultfd handling and populate behaviour, and which handle the mmap write lock themselves, should call this function.

Note that the returned address may reside within a merged VMA if an appropriate merge were to take place, so it doesn’t necessarily specify the start of a VMA, rather only the start of a valid mapped range of length len bytes, rounded down to the nearest page size.

The caller must write-lock current->mm->mmap_lock.

Return

Either an error, or the address at which the requested mapping has been performed.
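
The requirement that len be page-aligned and at least one page can be illustrated with a small user-space sketch. This is not the kernel's implementation: TOY_PAGE_SIZE and toy_page_align_len() are hypothetical names, and a 4096-byte page is assumed purely for illustration.

```c
/* Assumption: 4 KiB pages. The real page size is architecture-specific. */
#define TOY_PAGE_SIZE 4096UL

/*
 * Round len up to a whole number of pages.
 * Returns 0 when len is 0 or when rounding up would overflow, i.e. when
 * no valid mapping length of at least one page exists.
 */
static unsigned long toy_page_align_len(unsigned long len)
{
    unsigned long aligned;

    if (len == 0)
        return 0;
    aligned = (len + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
    /* On overflow the sum wraps, so aligned ends up smaller than len. */
    if (aligned < len)
        return 0;
    return aligned;
}
```

In this model a request for 1 byte and a request for 4096 bytes both yield a one-page length, while 4097 bytes yields two pages.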

struct vm_area_struct *find_vma_intersection(struct mm_struct *mm, unsigned long start_addr, unsigned long end_addr)¶: Look up the first VMA which intersects the interval

Parameters

struct mm_struct *mm: The process address space.
unsigned long start_addr: The inclusive start user address.
unsigned long end_addr: The exclusive end user address.

Return

The first VMA within the provided range, NULL otherwise. Assumes start_addr < end_addr.
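
The inclusive-start, exclusive-end semantics can be sketched in user space. The kernel stores VMAs in its own data structure; the sorted array, struct toy_vma, toy_map and toy_find_intersection() below are hypothetical stand-ins used only to show the interval test.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VMA: covers [start, end). */
struct toy_vma {
    unsigned long start;
    unsigned long end;
};

/* Example address space: three sorted, non-overlapping areas. */
static const struct toy_vma toy_map[] = {
    { 0x1000, 0x2000 },
    { 0x3000, 0x5000 },
    { 0x8000, 0x9000 },
};

/*
 * Return the first entry of a sorted array that intersects
 * [start_addr, end_addr), or NULL if none does.
 * Assumes start_addr < end_addr, as the documented function does.
 */
static const struct toy_vma *
toy_find_intersection(const struct toy_vma *vmas, size_t n,
                      unsigned long start_addr, unsigned long end_addr)
{
    for (size_t i = 0; i < n; i++) {
        /* Half-open ranges overlap iff each starts before the other ends. */
        if (vmas[i].start < end_addr && start_addr < vmas[i].end)
            return &vmas[i];
    }
    return NULL;
}
```

Because both ranges are half-open, a query for [0x2000, 0x3000) against toy_map finds nothing: it touches the end of the first area and the start of the second without overlapping either.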

struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)¶: Find the VMA for a given address, or the next VMA.

Parameters

struct mm_struct *mm: The mm_struct to check
unsigned long addr: The address

Return

The VMA associated with addr, or the next VMA. May return NULL in the case of no VMA at addr or above.
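
Because the result may be the next VMA rather than one containing addr, callers usually have to check the start of the returned area themselves. The following user-space sketch models that behaviour; struct toy_area, toy_areas, toy_find_area() and toy_area_contains() are hypothetical names, and a linear scan over a sorted array stands in for the kernel's lookup.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VMA: covers [start, end). */
struct toy_area {
    unsigned long start;
    unsigned long end;
};

/* Example address space: three sorted, non-overlapping areas. */
static const struct toy_area toy_areas[] = {
    { 0x1000, 0x2000 },
    { 0x3000, 0x5000 },
    { 0x8000, 0x9000 },
};

/* Return the first area that ends above addr, or NULL if there is none. */
static const struct toy_area *
toy_find_area(const struct toy_area *areas, size_t n, unsigned long addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr < areas[i].end)
            return &areas[i];
    }
    return NULL;
}

/* Does the area returned by toy_find_area() actually contain addr? */
static int toy_area_contains(const struct toy_area *a, unsigned long addr)
{
    return a != NULL && a->start <= addr && addr < a->end;
}
```

Looking up 0x2000 in toy_areas returns the second area, which starts at 0x3000 and so does not contain the address; only the extra containment check distinguishes that case from a direct hit.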

struct vm_area_struct *find_vma_prev(struct mm_struct *mm, unsigned long addr, struct vm_area_struct **pprev)¶: Find the VMA for a given address, or the next VMA, and set pprev to the previous VMA, if any.

Parameters

struct mm_struct *mm: The mm_struct to check
unsigned long addr: The address
struct vm_area_struct **pprev: The pointer to set to the previous VMA

Description

Note that RCU lock is missing here since the external mmap_lock() is used instead.

Return

The VMA associated with addr, or the next VMA. May return NULL in the case of no VMA at addr or above.

void __ref kmemleak_alloc(const void *ptr, size_t size, int min_count, gfp_t gfp)¶: register a newly allocated object

Parameters

const void *ptr: pointer to beginning of the object
size_t size: size of the object
int min_count: minimum number of references to this object. If during memory scanning a number of references less than min_count is found, the object is reported as a memory leak. If min_count is 0, the object is never reported as a leak. If min_count is -1, the object is ignored (not scanned and not reported as a leak)
gfp_t gfp: kmalloc() flags used for kmemleak internal memory allocations

Description

This function is called from the kernel allocators when a new object (memory block) is allocated (kmem_cache_alloc, kmalloc etc.).
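
The min_count rules described in the parameter list can be summarised as a small decision function. This is a user-space sketch of the documented rules only, not kmemleak's scanning code; enum toy_verdict and toy_classify() are hypothetical names, and refs_found stands for the number of references a scan discovered.

```c
/* Possible outcomes for one object after a scan, in this sketch. */
enum toy_verdict {
    TOY_IGNORED,        /* min_count == -1: not scanned, not reported */
    TOY_NEVER_REPORTED, /* min_count == 0: never reported as a leak */
    TOY_LEAK,           /* fewer than min_count references were found */
    TOY_REFERENCED,     /* at least min_count references were found */
};

/* Apply the documented min_count rules to a scan result. */
static enum toy_verdict toy_classify(int min_count, int refs_found)
{
    if (min_count == -1)
        return TOY_IGNORED;
    if (min_count == 0)
        return TOY_NEVER_REPORTED;
    return refs_found < min_count ? TOY_LEAK : TOY_REFERENCED;
}
```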

void __ref kmemleak_alloc_percpu(const void __percpu *ptr, size_t size, gfp_t gfp)¶: register a newly allocated __percpu object

Parameters

const void __percpu *ptr: __percpu pointer to beginning of the object
size_t size: size of the object
gfp_t gfp: flags used for kmemleak internal memory allocations

Description

This function is called from the kernel percpu allocator when a new object (memory block) is allocated (alloc_percpu).

void __ref kmemleak_vmalloc(const struct vm_struct *area, size_t size, gfp_t gfp)¶: register a newly vmalloc’ed object

Parameters

const struct vm_struct *area: pointer to vm_struct
size_t size: size of the object
gfp_t gfp: __vmalloc() flags used for kmemleak internal memory allocations

Description

This function is called from the vmalloc() kernel allocator when a new object (memory block) is allocated.

void __ref kmemleak_free(const void *ptr)¶: unregister a previously registered object

Parameters

const void *ptr: pointer to beginning of the object

Description

This function is called from the kernel allocators when an object (memory block) is freed (kmem_cache_free, kfree, vfree etc.).

void __ref kmemleak_free_part(const void *ptr, size_t size)¶: partially unregister a previously registered object

Parameters

const void *ptr: pointer to the beginning or inside the object. This also represents the start of the range to be freed
size_t size: size to be unregistered

Description

This function is called when only a part of a memory block is freed (usually from the bootmem allocator).

void __ref kmemleak_free_percpu(const void __percpu *ptr)¶: unregister a previously registered __percpu object

Parameters

const void __percpu *ptr: __percpu pointer to beginning of the object

Description

This function is called from the kernel percpu allocator when an object (memory block) is freed (free_percpu).

void __ref kmemleak_update_trace(const void *ptr)¶: update object allocation stack trace

Parameters

const void *ptr: pointer to beginning of the object

Description

Override the object allocation stack trace for cases where the actual allocation place is not always useful.

void __ref kmemleak_not_leak(const void *ptr)¶: mark an allocated object as false positive

Parameters

const void *ptr: pointer to beginning of the object

Description

Calling this function on an object will cause the memory block to no longer be reported as a leak and always be scanned.

void __ref kmemleak_transient_leak(const void *ptr)¶: mark an allocated object as transient false positive

Parameters

const void *ptr: pointer to beginning of the object

Description

Calling this function on an object will cause the memory block to not be reported as a leak temporarily. This may happen, for example, if the object is part of a singly linked list and the ->next reference to it is changed.

void __ref kmemleak_ignore_percpu(const void __percpu *ptr)¶: similar to kmemleak_ignore but taking a percpu address argument

Parameters

const void __percpu *ptr: percpu address of the object

void __ref kmemleak_ignore(const void *ptr)¶: ignore an allocated object

Parameters

const void *ptr: pointer to beginning of the object

Description

Calling this function on an object will cause the memory block to be ignored (not scanned and not reported as a leak). This is usually done when it is known that the corresponding block is not a leak and does not contain any references to other allocated memory blocks.

void __ref kmemleak_scan_area(const void *ptr, size_t size, gfp_t gfp)¶: limit the range to be scanned in an allocated object

Parameters

const void *ptr: pointer to beginning or inside the object. This also represents the start of the scan area
size_t size: size of the scan area
gfp_t gfp: kmalloc() flags used for kmemleak internal memory allocations

Description

This function is used when it is known that only certain parts of an object contain references to other objects. Kmemleak will only scan these areas, reducing the number of false negatives.

void __ref kmemleak_no_scan(const void *ptr)¶: do not scan an allocated object

Parameters

const void *ptr: pointer to beginning of the object

Description

This function notifies kmemleak not to scan the given memory block. Useful in situations where it is known that the given object does not contain any references to other objects. Kmemleak will not scan such objects reducing the number of false negatives.

void __ref kmemleak_alloc_phys(phys_addr_t phys, size_t size, gfp_t gfp)¶: similar to kmemleak_alloc but taking a physical address argument

Parameters

phys_addr_t phys: physical address of the object
size_t size: size of the object
gfp_t gfp: kmalloc() flags used for kmemleak internal memory allocations

void __ref kmemleak_free_part_phys(phys_addr_t phys, size_t size)¶: similar to kmemleak_free_part but taking a physical address argument

Parameters

phys_addr_t phys: physical address of the beginning of, or inside, an object. This also represents the start of the range to be freed
size_t size: size to be unregistered

void __ref kmemleak_ignore_phys(phys_addr_t phys)¶: similar to kmemleak_ignore but taking a physical address argument

Parameters

phys_addr_t phys: physical address of the object

void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)¶: remap and provide memmap backing for the given resource

Parameters

struct device *dev: hosting device for the resource
struct dev_pagemap *pgmap: pointer to a struct dev_pagemap

Notes

1/ At a minimum the range and type members of pgmap must be initialized by the caller before passing it to this function.
2/ The altmap field may optionally be initialized, in which case PGMAP_ALTMAP_VALID must be set in pgmap->flags.
3/ The ref field may optionally be provided, in which case pgmap->ref must be ‘live’ on entry and will be killed and reaped at devm_memremap_pages_release() time, or if this routine fails.
4/ range is expected to be a host memory range that could feasibly be treated as a “System RAM” range, i.e. not a device mmio range, but this is not enforced.

struct dev_pagemap *get_dev_pagemap(unsigned long pfn)¶: take a new live reference on the dev_pagemap for pfn

Parameters

unsigned long pfn: page frame number to look up the dev_pagemap for

unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)¶: Page size granularity for this VMA.

Parameters

struct vm_area_struct *vma: The user mapping.

Description

Folios in this VMA will be aligned to, and at least the size of, the number of bytes returned by this function.

Return

The default size of the folios allocated when backing a VMA.

int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)¶: Unmap a pmd table if it is shared by multiple users

Parameters

struct mmu_gather *tlb: the current mmu_gather.
struct vm_area_struct *vma: the vma covering the pmd table.
unsigned long addr: the address we are trying to unshare.
pte_t *ptep: pointer into the (pmd) page table.

Description

Called with the page table lock held, the i_mmap_rwsem held in write mode and the hugetlb vma lock held in write mode.

Note

The caller must call huge_pmd_unshare_flush() before dropping the i_mmap_rwsem.

Return

1 if it was a shared PMD table and it got unmapped, or 0 if it was not a shared PMD table.

bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list)¶: try to isolate an allocated hugetlb folio

Parameters

struct folio *folio: the folio to isolate
struct list_head *list: the list to add the folio to on success

Description

Isolate an allocated (refcount > 0) hugetlb folio, marking it as isolated/non-migratable, and moving it from the active list to the given list.

Isolation will fail if folio is not an allocated hugetlb folio, or if it is already isolated/non-migratable.

On success, an additional folio reference is taken that must be dropped using folio_putback_hugetlb() to undo the isolation.

Return

true if isolation worked, otherwise false.

void folio_putback_hugetlb(struct folio *folio)¶: unisolate a hugetlb folio

Parameters

struct folio *folio: the isolated hugetlb folio

Description

Putback/un-isolate the hugetlb folio that was previously isolated using folio_isolate_hugetlb(): marking it non-isolated/migratable and putting it back onto the active list.

Will drop the additional folio reference obtained through folio_isolate_hugetlb().
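
The pairing of folio_isolate_hugetlb() and folio_putback_hugetlb() can be modelled in user space as follows. This is a sketch of the documented conditions and reference handling only: struct toy_hfolio, toy_example, toy_isolate() and toy_putback() are hypothetical names, and the movement between lists is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a hugetlb folio. */
struct toy_hfolio {
    bool is_hugetlb;
    bool isolated;
    int refcount;   /* > 0 means the folio is allocated */
};

/* Example folio: allocated hugetlb folio with one reference. */
static struct toy_hfolio toy_example = { true, false, 1 };

/* Isolate the folio; fails unless it is an allocated, non-isolated hugetlb folio. */
static bool toy_isolate(struct toy_hfolio *f)
{
    if (!f->is_hugetlb || f->refcount <= 0 || f->isolated)
        return false;
    f->isolated = true;
    f->refcount++;   /* additional reference held while isolated */
    return true;
}

/* Undo a successful toy_isolate(): clear the flag and drop its reference. */
static void toy_putback(struct toy_hfolio *f)
{
    f->isolated = false;
    f->refcount--;
}
```

A second toy_isolate() call on an already isolated folio fails without touching the reference count, which is why every successful isolation needs exactly one matching putback.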

void folio_mark_accessed(struct folio *folio)¶: Mark a folio as having seen activity.

Parameters

struct folio *folio: The folio to mark.

Description

This function will perform one of the following transitions:

inactive,unreferenced -> inactive,referenced
inactive,referenced -> active,unreferenced
active,unreferenced -> active,referenced

When a newly allocated folio is not yet visible, so safe for non-atomic ops, __folio_set_referenced() may be substituted for folio_mark_accessed().
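
The three transitions listed above form a small state machine, which can be sketched in user space. struct toy_state and toy_mark_accessed() are hypothetical names; leaving the active,referenced state unchanged is an assumption of this sketch, since the list above does not mention that case.

```c
#include <stdbool.h>

/* Hypothetical model of the two bits involved. */
struct toy_state {
    bool active;
    bool referenced;
};

/* Apply one "mark accessed" step and return the resulting state. */
static struct toy_state toy_mark_accessed(struct toy_state s)
{
    if (!s.referenced) {
        /*
         * inactive,unreferenced -> inactive,referenced
         * active,unreferenced   -> active,referenced
         */
        s.referenced = true;
    } else if (!s.active) {
        /* inactive,referenced -> active,unreferenced */
        s.active = true;
        s.referenced = false;
    }
    /* active,referenced: left unchanged (assumption, not listed above). */
    return s;
}
```

Starting from inactive,unreferenced, three calls are needed in this model to reach active,referenced.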

void folio_add_lru(struct folio *folio)¶: Add a folio to an LRU list.

Parameters

struct folio *folio: The folio to be added to the LRU.

Description

Queue the folio for addition to the LRU. The decision on whether to add the page to the [in]active [file|anon] list is deferred until the folio_batch is drained. This gives a chance for the caller of folio_add_lru() to have the folio added to the active list using folio_mark_accessed().

void folio_add_lru_vma(struct folio *folio, struct vm_area_struct *vma)¶: Add a folio to the appropriate LRU list for this VMA.

Parameters

struct folio *folio: The folio to be added to the LRU.
struct vm_area_struct *vma: VMA in which the folio is mapped.

Description

If the VMA is mlocked, folio is added to the unevictable list. Otherwise, it is treated the same way as folio_add_lru().

void deactivate_file_folio(struct folio *folio)¶: Deactivate a file folio.

Parameters

struct folio *folio: Folio to deactivate.

Description

This function hints to the VM that folio is a good reclaim candidate, for example if its invalidation fails due to the folio being dirty or under writeback.

Context

Caller holds a reference on the folio.

void folio_mark_lazyfree(struct folio *folio)¶: make an anon folio lazyfree

Parameters

struct folio *folio: folio to deactivate

Description

folio_mark_lazyfree() moves folio to the inactive file list. This is done to accelerate the reclaim of folio.

void folios_put_refs(struct folio_batch *folios, unsigned int *refs)¶: Reduce the reference count on a batch of folios.

Parameters

struct folio_batch *folios: The folios.
unsigned int *refs: The number of refs to subtract from each folio.

Description

Like folio_put(), but for a batch of folios. This is more efficient than writing the loop yourself as it will optimise the locks which need to be taken if the folios are freed. The folios batch is returned empty and ready to be reused for another batch; there is no need to reinitialise it. If refs is NULL, we subtract one from each folio refcount.

Context

May be called in process or interrupt context, but not in NMI context. May be called while holding a spinlock.
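
The refs convention (a per-folio count, or NULL meaning one reference each) and the emptied batch can be sketched in user space. This model ignores locking and LRU handling entirely; struct toy_folio, struct toy_batch, TOY_BATCH_MAX and toy_put_refs() are hypothetical names, and setting a freed flag stands in for actually freeing memory.

```c
#include <stddef.h>

#define TOY_BATCH_MAX 8

/* Hypothetical stand-in for a folio. */
struct toy_folio {
    int refcount;
    int freed;      /* set when the last reference is dropped */
};

/* Hypothetical stand-in for a folio batch. */
struct toy_batch {
    unsigned int nr;
    struct toy_folio *folios[TOY_BATCH_MAX];
};

/* Drop refs[i] references from each folio, or one each if refs is NULL. */
static void toy_put_refs(struct toy_batch *batch, const unsigned int *refs)
{
    for (unsigned int i = 0; i < batch->nr; i++) {
        struct toy_folio *f = batch->folios[i];
        unsigned int n = refs ? refs[i] : 1;

        f->refcount -= (int)n;
        if (f->refcount == 0)
            f->freed = 1;
    }
    batch->nr = 0;   /* the batch is returned empty, ready for reuse */
}
```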

void release_pages(release_pages_arg arg, int nr)¶: batched put_page()

Parameters

release_pages_arg arg: array of pages to release
int nr: number of pages

Description

Decrement the reference count on all the pages in arg. If it fell to zero, remove the page from the LRU and free it.

Note that the argument can be an array of pages, encoded pages, or folio pointers. We ignore any encoded bits, and turn any of them into just a folio that gets freed.

void folio_batch_remove_exceptionals(struct folio_batch *fbatch)¶: Prune non-folios from a batch.

Parameters

struct folio_batch *fbatch: The batch to prune

Description

find_get_entries() fills a batch with both folios and shadow/swap/DAX entries. This function prunes all the non-folio entries from fbatch without leaving holes, so that it can be passed on to folio-only batch operations.
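
Pruning "without leaving holes" amounts to an in-place compaction, which the following user-space sketch shows. The representation is an assumption made for illustration: entries are plain integers and an odd value marks a non-folio entry. struct toy_fbatch, TOY_SLOTS, toy_is_exceptional() and toy_remove_exceptionals() are hypothetical names.

```c
#include <stdint.h>

#define TOY_SLOTS 8

/* Hypothetical stand-in for a folio batch holding mixed entries. */
struct toy_fbatch {
    unsigned int nr;
    uintptr_t slots[TOY_SLOTS];
};

/* Sketch convention: an odd value marks a non-folio (exceptional) entry. */
static int toy_is_exceptional(uintptr_t entry)
{
    return (int)(entry & 1);
}

/* Keep only folio entries, preserving their order and closing all gaps. */
static void toy_remove_exceptionals(struct toy_fbatch *b)
{
    unsigned int i, j = 0;

    for (i = 0; i < b->nr; i++) {
        if (!toy_is_exceptional(b->slots[i]))
            b->slots[j++] = b->slots[i];
    }
    b->nr = j;
}
```

After the call, the first nr slots hold only folio entries in their original relative order, so the batch can be handed to code that expects folios only.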

struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio)¶: css of the memcg associated with a folio

Parameters

struct folio *folio: folio of interest

Description

If memcg is bound to the default hierarchy, css of the memcg associated with folio is returned. The returned css remains associated with folio until it is released.

If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup is returned.

ino_t page_cgroup_ino(struct page *page)¶: return inode number of the memcg a page is charged to

Parameters

struct page *page: the page

Description

Look up the closest online ancestor of the memory cgroup page is charged to and return its inode number or 0 if page is not charged to any cgroup. It is safe to call this function without holding a reference to page.

Note, this function is inherently racy, because there is nothing to prevent the cgroup inode from getting torn down and potentially reallocated a moment after page_cgroup_ino() returns, so it should only be used by callers that do not care (such as procfs interfaces).

void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, int val)¶: update cgroup memory statistics

Parameters

struct mem_cgroup *memcg: the memory cgroup
enum memcg_stat_item idx: the stat item - can be enum memcg_stat_item or enum node_stat_item
int val: delta to add to the counter, can be negative

void mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val)¶: update lruvec memory statistics

Parameters

struct lruvec *lruvec: the lruvec
enum node_stat_item idx: the stat item
int val: delta to add to the counter, can be negative

Description

The lruvec is the intersection of the NUMA node and a cgroup. This function updates all three counters that are affected by a change of state at this level: per-node, per-cgroup, per-lruvec.

void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, unsigned long count)¶: account VM events in a cgroup

Parameters

struct mem_cgroup *memcg: the memory cgroup
enum vm_event_item idx: the event item
unsigned long count: the number of events that occurred

struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)¶: Obtain a reference on given mm_struct’s memcg.

Parameters

struct mm_struct *mm: mm from which memcg should be extracted. It can be NULL.

Description

Obtains a reference on mm->memcg and returns it if successful. If mm is NULL, then the memcg is chosen as follows:

1) The active memcg, if set.
2) current->mm->memcg, if available.
3) The root memcg.

If mem_cgroup is disabled, NULL is returned.
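
The selection order can be expressed as a short chain of fallbacks. The sketch below is a user-space model of that order only: reference counting is not modelled, passing a NULL mm_memcg stands for calling the function with a NULL mm, and struct toy_memcg, struct toy_ctx and toy_pick() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup. */
struct toy_memcg {
    const char *name;
};

/* Hypothetical snapshot of the state the choice depends on. */
struct toy_ctx {
    struct toy_memcg *active;       /* active memcg, NULL if not set */
    struct toy_memcg *current_mm;   /* stands in for current->mm->memcg */
    struct toy_memcg *root;         /* root memcg */
    int disabled;                   /* non-zero: mem_cgroup is disabled */
};

/* Pick a memcg following the documented order of preference. */
static struct toy_memcg *toy_pick(const struct toy_ctx *ctx,
                                  struct toy_memcg *mm_memcg)
{
    if (ctx->disabled)
        return NULL;
    if (mm_memcg)
        return mm_memcg;            /* an mm was supplied */
    if (ctx->active)
        return ctx->active;         /* 1) the active memcg */
    if (ctx->current_mm)
        return ctx->current_mm;     /* 2) current->mm->memcg */
    return ctx->root;               /* 3) the root memcg */
}
```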

struct mem_cgroup *get_mem_cgroup_from_current(void)¶: Obtain a reference on current task’s memcg.

Parameters

void: no arguments

struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)¶: Obtain a reference on a given folio’s memcg.

Parameters

struct folio *folio: folio from which memcg should be extracted.

struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, struct mem_cgroup_reclaim_cookie *reclaim)¶: iterate over memory cgroup hierarchy

Parameters

struct mem_cgroup *root: hierarchy root
struct mem_cgroup *prev: previously returned memcg, NULL on first invocation
struct mem_cgroup_reclaim_cookie *reclaim: cookie for shared reclaim walks, NULL for full walks

Description

Returns references to children of the hierarchy below root, or root itself, or NULL after a full round-trip.

Caller must pass the return value in prev on subsequent invocations for reference counting, or use mem_cgroup_iter_break() to cancel a hierarchy walk before the round-trip is complete.

Reclaimers can specify a node in reclaim to divide up the memcgs in the hierarchy among all concurrent reclaimers operating on the same node.
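
The calling convention (NULL on the first call, the previous return value afterwards, NULL once the walk is complete) can be sketched with a toy tree. This user-space model assumes a pre-order walk and omits reference counting and the shared reclaim cookie; struct toy_cg and toy_iter() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cgroup in a hierarchy. */
struct toy_cg {
    struct toy_cg *parent;
    struct toy_cg *first_child;
    struct toy_cg *next_sibling;
};

/*
 * Return the node after prev in a pre-order walk of the subtree at root.
 * prev == NULL starts the walk at root; NULL is returned once every
 * node in the subtree has been visited.
 */
static struct toy_cg *toy_iter(struct toy_cg *root, struct toy_cg *prev)
{
    struct toy_cg *pos;

    if (!prev)
        return root;
    if (prev->first_child)
        return prev->first_child;
    /* Climb until a sibling exists, never leaving root's subtree. */
    for (pos = prev; pos != root; pos = pos->parent) {
        if (pos->next_sibling)
            return pos->next_sibling;
    }
    return NULL;
}
```

A full walk in this model is a loop of the form `for (pos = toy_iter(root, NULL); pos; pos = toy_iter(root, pos))`. In the real API, leaving such a loop early is the case mem_cgroup_iter_break() exists for.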

void mem_cgroup_iter_break(struct mem_cgroup *root, struct mem_cgroup *prev)¶: abort a hierarchy walk prematurely

Parameters

struct mem_cgroup *root: hierarchy root
struct mem_cgroup *prev: last visited hierarchy member as returned by mem_cgroup_iter()

void mem_cgroup_scan_tasks(struct mem_cgroup *memcg, int (*fn)(struct task_struct*, void*), void *arg)¶: iterate over tasks of a memory cgroup hierarchy

Parameters

struct mem_cgroup *memcg: hierarchy root
int (*fn)(struct task_struct *, void *): function to call for each task
void *arg: argument passed to fn

Description

This function iterates over tasks attached to memcg or to any of its descendants and calls fn for each task. If fn returns a non-zero value, the function breaks the iteration loop. Otherwise, it iterates over all tasks.
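The callback contract (a non-zero return from fn stops the walk) can be sketched in user space as follows. The names scan_tasks_model(), task_fn and stop_at() are hypothetical, and tasks are reduced to plain integer ids for illustration.

```c
#include <stddef.h>

/* Hypothetical callback type: returns non-zero to stop the iteration. */
typedef int (*task_fn)(int task_id, void *arg);

/*
 * Call fn for each task id in turn and stop as soon as fn returns
 * non-zero. Returns how many tasks were visited, which the real
 * function does not report; it is only here to make the model testable.
 */
static size_t scan_tasks_model(const int *tasks, size_t n, task_fn fn, void *arg)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++) {
        visited++;
        if (fn(tasks[i], arg))
            break;
    }
    return visited;
}

/* Example callback: stop once the task id stored in *arg is seen. */
static int stop_at(int task_id, void *arg)
{
    return task_id == *(int *)arg;
}
```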

This function must not be called for the root memory cgroup.

struct lruvec *folio_lruvec_lock(struct folio *folio)¶: Lock the lruvec for a folio.

Parameters

struct folio *folio: Pointer to the folio.

Description

These functions are safe to use under any of the following conditions:

folio locked
folio_test_lru false
folio frozen (refcount of 0)

Return

The lruvec this folio is on with its lock held.

struct lruvec *folio_lruvec_lock_irq(struct folio *folio)¶: Lock the lruvec for a folio.

Parameters

struct folio *folio: Pointer to the folio.

Description

These functions are safe to use under any of the following conditions:

folio locked
folio_test_lru false
folio frozen (refcount of 0)

Return

The lruvec this folio is on with its lock held and interrupts disabled.

struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags)¶: Lock the lruvec for a folio.

Parameters

struct folio *folio: Pointer to the folio.
unsigned long *flags: Pointer to irqsave flags.

Description

These functions are safe to use under any of the following conditions:

folio locked
folio_test_lru false
folio frozen (refcount of 0)

Return

The lruvec this folio is on with its lock held and interrupts disabled.

void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, int zid, int nr_pages)¶: account for adding or removing an lru page

Parameters

struct lruvec *lruvec: mem_cgroup per zone lru vector
enum lru_list lru: index of lru list the page is sitting on
int zid: zone id of the accounted pages
int nr_pages: positive when adding or negative when removing

Description

This function must be called under lru_lock, just before a page is added to or just after a page is removed from an lru list.

unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)¶: calculate chargeable space of a memory cgroup

Parameters

struct mem_cgroup *memcg: the memory cgroup

Description

Returns the maximum amount of memory memcg can be charged with, in pages.
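As a rough illustration, chargeable space is the gap between a limit and the current usage, clamped at zero. The sketch below assumes a single usage/limit pair, both in pages; margin_model() is a hypothetical name, and any additional counters the kernel may consult are ignored.

```c
/*
 * Chargeable space in pages for one usage/limit pair.
 * Returns 0 when usage has already reached or exceeded the limit.
 */
static unsigned long margin_model(unsigned long usage, unsigned long limit)
{
    return usage < limit ? limit - usage : 0;
}
```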

void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p)¶: Print OOM information relevant to memory controller.

Parameters

struct mem_cgroup *memcg: The memory cgroup that went over limit
struct task_struct *p: Task that is going to be killed

NOTE

memcg and p’s mem_cgroup can be different when hierarchy is enabled

void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)¶: Print OOM memory information relevant to memory controller.

Parameters

struct mem_cgroup *memcg: The memory cgroup that went over limit

struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim, struct mem_cgroup *oom_domain)¶: get a memory cgroup to clean up after OOM

Parameters

struct task_struct *victim: task to be killed by the OOM killer
struct mem_cgroup *oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM

Description

Returns a pointer to a memory cgroup, which has to be cleaned up by killing all belonging OOM-killable tasks.

Caller has to call mem_cgroup_put() on the returned non-NULL memcg.

bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)¶: Try to consume stocked charge on this cpu.

Parameters

struct mem_cgroup *memcg: memcg to consume from.
unsigned int nr_pages: how many pages to charge.

Description

Consume the cached charge if enough nr_pages are present; otherwise return failure. Also return failure for a charge request larger than MEMCG_CHARGE_BATCH or if the local lock is already taken.

Returns true if successful, false otherwise.
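The success and failure conditions can be modelled with a single cached stock. This is a simplified sketch: struct stock_model, consume_stock_model() and the batch size CHARGE_BATCH_MODEL are assumptions, and the "local lock already taken" failure case is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed batch limit for this model; not taken from the kernel headers. */
#define CHARGE_BATCH_MODEL 64U

/* Hypothetical per-CPU stock: pages pre-charged on behalf of one cgroup. */
struct stock_model {
    const void *cached;     /* cgroup the stocked pages belong to */
    unsigned int nr_pages;  /* pages currently stocked */
};

/*
 * Try to satisfy a charge of nr_pages from the stock. Fails for
 * oversized requests, for a different cgroup, or when the stock
 * does not hold enough pages.
 */
static bool consume_stock_model(struct stock_model *stock, const void *memcg,
                                unsigned int nr_pages)
{
    if (nr_pages > CHARGE_BATCH_MODEL)
        return false;
    if (stock->cached != memcg || stock->nr_pages < nr_pages)
        return false;
    stock->nr_pages -= nr_pages;
    return true;
}
```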

int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)¶: charge a kmem page to the current memory cgroup

Parameters

struct page *page: page to charge
gfp_t gfp: reclaim mode
int order: allocation order

Description

Returns 0 on success, an error code on failure.

void __memcg_kmem_uncharge_page(struct page *page, int order)¶: uncharge a kmem page

Parameters

struct page *page: page to uncharge
int order: allocation order

void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, unsigned long *pheadroom, unsigned long *pdirty, unsigned long *pwriteback)¶: retrieve writeback related stats from its memcg

Parameters

struct bdi_writeback *wb: bdi_writeback in question
unsigned long *pfilepages: out parameter for number of file pages
unsigned long *pheadroom: out parameter for number of allocatable pages according to memcg
unsigned long *pdirty: out parameter for number of dirty pages
unsigned long *pwriteback: out parameter for number of pages under writeback

Description

Determine the numbers of file, headroom, dirty, and writeback pages in wb’s memcg. File, dirty and writeback are self-explanatory. Headroom is a bit more involved.

A memcg’s headroom is “min(max, high) - used”. In the hierarchy, the headroom is calculated as the lowest headroom of itself and the ancestors. Note that this doesn’t consider the actual amount of available memory in the system. The caller should further cap *pheadroom accordingly.
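The headroom formula can be worked through with a small user-space model. This sketch assumes each cgroup carries max, high and used values in pages and that every ancestor up to the top of the chain is considered; struct cg_model and headroom_model() are hypothetical names.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical cgroup with the three values the formula needs, in pages. */
struct cg_model {
    struct cg_model *parent;
    unsigned long max;
    unsigned long high;
    unsigned long used;
};

/*
 * Lowest "min(max, high) - used" over the cgroup and its ancestors.
 * Each term is clamped at zero when usage exceeds the ceiling.
 */
static unsigned long headroom_model(const struct cg_model *cg)
{
    unsigned long headroom = ULONG_MAX;

    for (; cg; cg = cg->parent) {
        unsigned long ceiling = cg->max < cg->high ? cg->max : cg->high;
        unsigned long room = ceiling > cg->used ? ceiling - cg->used : 0;

        if (room < headroom)
            headroom = room;
    }
    return headroom;
}
```

As the description notes, the result says nothing about how much memory the system actually has free, so a caller would still cap it.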

struct mem_cgroup *mem_cgroup_from_id(unsigned short id)¶: look up a memcg from a memcg id

Parameters

unsigned short id: the memcg id to look up

Description

Caller must hold rcu_read_lock().

void mem_cgroup_css_reset(struct cgroup_subsys_state *css)¶: reset the states of a mem_cgroup

Parameters

struct cgroup_subsys_state *css: the target css

Description

Reset the states of the mem_cgroup associated with css. This is invoked when the userland requests disabling on the default hierarchy but the memcg is pinned through dependency. The memcg should stop applying policies and should revert to the vanilla state as it may be made visible again.

The current implementation only resets the essential configurations. This needs to be expanded to cover all the visible parts.

void mem_cgroup_calculate_protection(struct mem_cgroup *root, struct mem_cgroup *memcg)¶: check if memory consumption is in the normal range

Parameters

struct mem_cgroup *root: the top ancestor of the sub-tree being checked
struct mem_cgroup *memcg: the memory cgroup to check

Description

WARNING: This function is not stateless! It can only be used as part of a top-down tree iteration, not for isolated queries.

int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)¶: charge the memcg for a hugetlb folio

Parameters

struct folio *folio: folio being charged
gfp_t gfp: reclaim mode

Description

This function is called when allocating a huge page folio, after the page has already been obtained and charged to the appropriate hugetlb cgroup controller (if it is enabled).

Returns -ENOMEM if the memcg is already full. Returns 0 if either the charge was successful, or if we skip the charging.

int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)¶: Charge a newly allocated folio for swapin.

Parameters

struct folio *folio: folio to charge.
struct mm_struct *mm: mm context of the victim
gfp_t gfp: reclaim mode
swp_entry_t entry: swap entry for which the folio is allocated

Description

This function charges a folio allocated for swapin. Please call this before adding the folio to the swapcache.

Returns 0 on success. Otherwise, an error code is returned.

void mem_cgroup_replace_folio(struct folio *old, struct folio *new)¶: Charge a folio’s replacement.

Parameters

struct folio *old: Currently circulating folio.
struct folio *new: Replacement folio.

Description

Charge new as a replacement folio for old. old will be uncharged upon free.

Both folios must be locked, new->mapping must be set up.

void mem_cgroup_migrate(struct folio *old, struct folio *new)¶: Transfer the memcg data from the old to the new folio.

Parameters

struct folio *old: Currently circulating folio.
struct folio *new: Replacement folio.

Description

Transfer the memcg data from the old folio to the new folio for migration. The old folio’s memcg data will be cleared. Note that the memory counters will remain unchanged throughout the process.

Both folios must be locked, new->mapping must be set up.

bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages, gfp_t gfp_mask)¶: charge socket memory

Parameters

const struct sock *sk: socket in memcg to charge
unsigned int nr_pages: number of pages to charge
gfp_t gfp_mask: reclaim mode

Description

Charges nr_pages to memcg. Returns true if the charge fit within memcg’s configured limit, false if it doesn’t.

void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages)¶: uncharge socket memory

Parameters

const struct sock *sk: socket in memcg to uncharge
unsigned int nr_pages: number of pages to uncharge

int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)¶: try charging swap space for a folio

Parameters

struct folio *folio: folio being added to swap
swp_entry_t entry: swap entry to charge

Description

Try to charge folio’s memcg for the swap space at entry.

Returns 0 on success, -ENOMEM on failure.

void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)¶: uncharge swap space

Parameters

swp_entry_t entry: swap entry to uncharge
unsigned int nr_pages: the amount of swap space to uncharge

bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)¶: check if this cgroup can zswap

Parameters

struct obj_cgroup *objcg: the object cgroup

Description

Check if the hierarchical zswap limit has been reached.

This doesn’t check for specific headroom, and it is not atomic either. But with zswap, the size of the allocation is only known once compression has occurred, and this optimistic pre-check avoids spending cycles on compression when there is already no room left or zswap is disabled altogether somewhere in the hierarchy.

void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)¶: charge compression backend memory

Parameters

struct obj_cgroup *objcg: the object cgroup
size_t size: size of compressed object

Description

This forces the charge after obj_cgroup_may_zswap() allowed compression and storage in zswap for this cgroup to go ahead.

void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)¶: uncharge compression backend memory

Parameters

struct obj_cgroup *objcg: the object cgroup
size_t size: size of compressed object

Description

Uncharges zswap memory on page in.

bool shmem_recalc_inode(struct inode *inode, long alloced, long swapped)¶: recalculate the block usage of an inode

Parameters

struct inode *inode: inode to recalc
long alloced: the change in number of pages allocated to inode
long swapped: the change in number of pages swapped from inode

Description

We have to calculate the free blocks since the mm can drop undirtied hole pages behind our back.

But normally info->alloced == inode->i_mapping->nrpages + info->swapped.

So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped).
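The arithmetic above can be checked with a few numbers. In this sketch shmem_freed_model() is a hypothetical helper taking the three counters as plain values; clamping a non-positive result to zero is an assumption of the model.

```c
/*
 * Pages the mm dropped behind shmem's back:
 *   alloced - (nrpages + swapped)
 * A result of zero means the counters already agree.
 */
static long shmem_freed_model(long alloced, long nrpages, long swapped)
{
    long freed = alloced - (nrpages + swapped);

    return freed > 0 ? freed : 0;
}
```

For example, with 10 pages accounted as allocated, 6 in the page cache and 2 swapped out, 2 pages were freed by the mm.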

Return

true if swapped was incremented from 0, for shmem_writeout().

int shmem_writeout(struct folio *folio, struct swap_iocb **plug, struct list_head *folio_list)¶: Write the folio to swap

Parameters

struct folio *folio: The folio to write
struct swap_iocb **plug: swap plug
struct list_head *folio_list: list to put back folios on split

Description

Move the folio from the page cache to the swap cache.

int shmem_get_folio(struct inode *inode, pgoff_t index, loff_t write_end, struct folio **foliop, enum sgp_type sgp)¶: find, and lock a shmem folio.

Parameters

struct inode *inode: inode to search
pgoff_t index: the page index.
loff_t write_end: end of a write, could extend inode size
struct folio **foliop: pointer to the folio if found
enum sgp_type sgp: SGP_* flags to control behavior

Description

Looks up the page cache entry at inode & index. If a folio is present, it is returned locked with an increased refcount.

If the caller modifies data in the folio, it must call folio_mark_dirty() before unlocking the folio to ensure that the folio is not reclaimed. There is no need to reserve space before calling folio_mark_dirty().

When no folio is found, the behavior depends on sgp:

for SGP_READ, *foliop is NULL and 0 is returned
for SGP_NOALLOC, *foliop is NULL and -ENOENT is returned
for all other flags a new folio is allocated, inserted into the page cache and returned locked in foliop.
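The three outcomes listed above for a lookup miss can be summarised in a small decision function. This is only a model: enum sgp_model, struct miss_outcome and shmem_miss_model() are hypothetical, the enumerators do not reproduce the kernel's SGP_* values, and allocation failure on the last path is not modelled.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the SGP_* flags discussed above. */
enum sgp_model {
    SGP_MODEL_READ,
    SGP_MODEL_NOALLOC,
    SGP_MODEL_CACHE   /* represents "all other flags" */
};

/* What the caller observes when no folio was found in the page cache. */
struct miss_outcome {
    int ret;              /* return value */
    bool folio_returned;  /* true when *foliop would be non-NULL */
};

static struct miss_outcome shmem_miss_model(enum sgp_model sgp)
{
    struct miss_outcome out = { 0, false };

    switch (sgp) {
    case SGP_MODEL_READ:
        /* *foliop stays NULL and 0 is returned. */
        break;
    case SGP_MODEL_NOALLOC:
        /* *foliop stays NULL and -ENOENT is returned. */
        out.ret = -ENOENT;
        break;
    default:
        /* A new folio is allocated, inserted and returned locked. */
        out.folio_returned = true;
        break;
    }
    return out;
}
```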

Context

May sleep.

Return

0 if successful, else a negative error code.

struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned long flags)¶: get an unlinked file living in tmpfs which must be kernel internal. There will be NO LSM permission checks against the underlying inode. So users of this interface must do LSM checks at a higher layer. The users are the big_key and shm implementations. LSM checks are provided at the key or shm level rather than the inode.

Parameters

const char *name: name for dentry (to be seen in /proc/<pid>/maps)
loff_t size: size to be set for the file
unsigned long flags: VM_NORESERVE suppresses pre-accounting of the entire object size

struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags)¶: get an unlinked file living in tmpfs

Parameters

const char *name: name for dentry (to be seen in /proc/<pid>/maps)
loff_t size: size to be set for the file
unsigned long flags: VM_NORESERVE suppresses pre-accounting of the entire object size

struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt, const char *name, loff_t size, unsigned long flags)¶: get an unlinked file living in tmpfs

Parameters

struct vfsmount *mnt: the tmpfs mount where the file will be created
const char *name: name for dentry (to be seen in /proc/<pid>/maps)
loff_t size: size to be set for the file
unsigned long flags: VM_NORESERVE suppresses pre-accounting of the entire object size

int shmem_zero_setup(struct vm_area_struct *vma)¶: setup a shared anonymous mapping

Parameters

struct vm_area_struct *vma: the vma to be mmapped is prepared by do_mmap

Return

0 on success, or error

int shmem_zero_setup_desc(struct vm_area_desc *desc)¶: same as shmem_zero_setup, but determined by VMA descriptor for convenience.

Parameters

struct vm_area_desc *desc: Describes VMA

Return

0 on success, or error

struct folio *shmem_read_folio_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp)¶: read into page cache, using specified page allocation flags.

Parameters

struct address_space *mapping: the folio’s address_space
pgoff_t index: the folio index
gfp_t gfp: the page allocator flags to use if allocating

Description

This behaves as a tmpfs “read_cache_page_gfp(mapping, index, gfp)”, with any new page allocations done using the specified allocation flags. But read_cache_page_gfp() uses the ->read_folio() method, which does not suit tmpfs, since it may have pages in swapcache, and needs to find those for itself; although drivers/gpu/drm i915 and ttm rely upon this support.

i915_gem_object_get_pages_gtt() mixes __GFP_NORETRY | __GFP_NOWARN in with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily.

int migrate_vma_split_folio(struct folio *folio, struct page *fault_page)¶: Helper function to split a THP folio

Parameters

struct folio *folio: the folio to split
struct page *fault_page: struct page associated with the fault if any

Description

Returns 0 on success

int migrate_vma_setup(struct migrate_vma *args)¶: prepare to migrate a range of memory

Parameters

struct migrate_vma *args: contains the vma, start, and pfns arrays for the migration

Return

negative errno on failures, 0 when 0 or more pages were migrated without an error.

Description

Prepare to migrate a range of virtual addresses by collecting all the pages backing each virtual address in the range, saving them inside the src array. Then lock those pages and unmap them. Once the pages are locked and unmapped, check whether each page is pinned or not. Pages that aren’t pinned have the MIGRATE_PFN_MIGRATE flag set (by this function) in the corresponding src array entry. Finally, restore any pages that are pinned by remapping and unlocking them.

The caller should then allocate destination memory and copy source memory to it for all those entries (ie with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the caller must update each corresponding entry in the dst array with the pfn value of the destination page and with MIGRATE_PFN_VALID. Destination pages must be locked via lock_page().

Note that the caller does not have to migrate all the pages that are marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration from device memory to system memory. If the caller cannot migrate a device page back to system memory, then it must return VM_FAULT_SIGBUS, which has severe consequences for the userspace process, so it must be avoided if at all possible.

For empty entries inside CPU page table (pte_none() or pmd_none() is true) we do set MIGRATE_PFN_MIGRATE flag inside the corresponding source array thus allowing the caller to allocate device memory for those unbacked virtual addresses. For this the caller simply has to allocate device memory and properly set the destination entry like for regular migration. Note that this can still fail, and thus inside the device driver you must check if the migration was successful for those entries after calling migrate_vma_pages(), just like for regular migration.

After that, the caller must call migrate_vma_pages() to go over each entry in the src array that has the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set. If the corresponding entry in dst array has MIGRATE_PFN_VALID flag set, then migrate_vma_pages() migrates struct page information from the source struct page to the destination struct page. If it fails to migrate the struct page information, then it clears the MIGRATE_PFN_MIGRATE flag in the src array.

At this point all successfully migrated pages have an entry in the src array with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set and the dst array entry with MIGRATE_PFN_VALID flag set.

Once migrate_vma_pages() returns the caller may inspect which pages were successfully migrated, and which were not. Successfully migrated pages will have the MIGRATE_PFN_MIGRATE flag set for their src array entry.

It is safe to update device page table after migrate_vma_pages() because both destination and source page are still locked, and the mmap_lock is held in read mode (hence no one can unmap the range being migrated).

Once the caller is done cleaning up things and updating its page table (if it chose to do so, this is not an obligation) it finally calls migrate_vma_finalize() to update the CPU page table to point to new pages for successfully migrated pages or otherwise restore the CPU page table to point to the original source pages.
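The flag checks a driver performs on the src array between these steps can be sketched in user space. The bit values below are illustrative assumptions (the kernel's definitions live in its own headers), and needs_copy() and count_migrated() are hypothetical helpers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits for this model only. */
#define PFN_VALID_MODEL   (1UL << 0)
#define PFN_MIGRATE_MODEL (1UL << 1)

/*
 * After setup: an entry has source data to copy when both the VALID
 * and MIGRATE flags are set. Empty entries carry MIGRATE alone and
 * have nothing to copy.
 */
static bool needs_copy(unsigned long src)
{
    unsigned long both = PFN_VALID_MODEL | PFN_MIGRATE_MODEL;

    return (src & both) == both;
}

/*
 * After the pages step: entries that still carry the MIGRATE flag
 * were migrated successfully; the flag is cleared on failure.
 */
static size_t count_migrated(const unsigned long *src, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (src[i] & PFN_MIGRATE_MODEL)
            count++;
    }
    return count;
}
```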

int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate, unsigned long addr, struct page *page, unsigned long *src, pmd_t *pmdp)¶: Insert a huge folio into migrate->vma->vm_mm at addr. folio is already allocated as a part of the migration process with large page.

Parameters

struct migrate_vma *migrate: migrate_vma arguments
unsigned long addr: address where the folio will be inserted
struct page *page: page to be inserted at addr
unsigned long *src: src pfn which is being migrated
pmd_t *pmdp: pointer to the pmd

Description

page needs to be initialized and setup after it’s allocated. The code bits here follow closely the code in __do_huge_pmd_anonymous_page(). This API does not support THP zero pages.

void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns, unsigned long npages)¶: migrate meta-data from src page to dst page

Parameters

unsigned long *src_pfns: src_pfns returned from migrate_device_range()
unsigned long *dst_pfns: array of pfns allocated by the driver to migrate memory to
unsigned long npages: number of pages in the range

Description

Equivalent to migrate_vma_pages(). This is called to migrate struct page meta-data from source struct page to destination.

void migrate_vma_pages(struct migrate_vma *migrate)¶: migrate meta-data from src page to dst page

Parameters

struct migrate_vma *migrate: migrate struct containing all migration information

Description

This migrates struct page meta-data from source struct page to destination struct page. This effectively finishes the migration from source page to the destination page.

void migrate_vma_finalize(struct migrate_vma *migrate)¶: restore CPU page table entry

Parameters

struct migrate_vma *migrate: migrate struct containing all migration information

Description

This replaces the special migration pte entry with either a mapping to the new page if migration was successful for that page, or to the original page otherwise.

This also unlocks the pages and puts them back on the lru, or drops the extra refcount, for device pages.

int migrate_device_range(unsigned long *src_pfns, unsigned long start, unsigned long npages)¶: migrate device private pfns to normal memory.

Parameters

unsigned long *src_pfns: array large enough to hold migrating source device private pfns.
unsigned long start: starting pfn in the range to migrate.
unsigned long npages: number of pages to migrate.

Description

migrate_device_range() is similar in concept to migrate_vma_setup() except that instead of looking up pages based on virtual address mappings a range of device pfns that should be migrated to system memory is used instead.

This is useful when a driver needs to free device memory but doesn’t know the virtual mappings of every page that may be in device memory. For example this is often the case when a driver is being unloaded or unbound from a device.

Like migrate_vma_setup() this function will take a reference and lock any migrating pages that aren’t free before unmapping them. Drivers may then allocate destination pages and start copying data from the device to CPU memory before calling migrate_device_pages().

int migrate_device_pfns(unsigned long *src_pfns, unsigned long npages)¶: migrate device private pfns to normal memory.

Parameters

unsigned long *src_pfns: pre-populated array of source device private pfns to migrate.
unsigned long npages: number of pages to migrate.

Description

Similar to migrate_device_range() but supports non-contiguous pre-populated array of device pages to migrate.

struct wp_walk¶: Private struct for pagetable walk callbacks

Definition:

struct wp_walk {
    struct mmu_notifier_range range;
    unsigned long tlbflush_start;
    unsigned long tlbflush_end;
    unsigned long total;
};

Members

range: Range for mmu notifiers
tlbflush_start: Address of first modified pte
tlbflush_end: Address of last modified pte + 1
total: Total number of modified ptes

int wp_pte(pte_t *pte, unsigned long addr, unsigned long end, struct mm_walk *walk)¶: Write-protect a pte

Parameters

pte_t *pte: Pointer to the pte
unsigned long addr: The start of protecting virtual address
unsigned long end: The end of protecting virtual address
struct mm_walk *walk: pagetable walk callback argument

Description

The function write-protects a pte and records the range in virtual address space of touched ptes for efficient range TLB flushes.
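The range bookkeeping can be illustrated with a user-space model of the tlbflush_start, tlbflush_end and total members. This is a sketch under assumptions: a 4096-byte page size, a hypothetical wp_walk_model struct, and an initial state in which the range is inverted so that the first recorded pte defines it.

```c
/* Assumed page size for this model. */
#define PAGE_SIZE_MODEL 4096UL

/* Hypothetical mirror of the range-tracking members of struct wp_walk. */
struct wp_walk_model {
    unsigned long tlbflush_start;  /* address of first modified pte */
    unsigned long tlbflush_end;    /* address of last modified pte + 1 page */
    unsigned long total;           /* number of modified ptes */
};

/* Start with an empty (inverted) range covering nothing in [start, end). */
static void wp_walk_model_init(struct wp_walk_model *w,
                               unsigned long start, unsigned long end)
{
    w->tlbflush_start = end;
    w->tlbflush_end = start;
    w->total = 0;
}

/* Record that the pte mapping addr was write-protected. */
static void wp_walk_model_record(struct wp_walk_model *w, unsigned long addr)
{
    w->total++;
    if (addr < w->tlbflush_start)
        w->tlbflush_start = addr;
    if (addr + PAGE_SIZE_MODEL > w->tlbflush_end)
        w->tlbflush_end = addr + PAGE_SIZE_MODEL;
}
```

After the walk, a single ranged flush over [tlbflush_start, tlbflush_end) covers every touched pte, and an untouched walk leaves start above end so no flush is needed.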

struct clean_walk¶: Private struct for the clean_record_pte function.

Definition:

struct clean_walk {
    struct wp_walk base;
    pgoff_t bitmap_pgoff;
    unsigned long *bitmap;
    pgoff_t start;
    pgoff_t end;
};

Members

base: struct wp_walk we derive from
bitmap_pgoff: Address_space Page offset of the first bit in bitmap
bitmap: Bitmap with one bit for each page offset in the address_space range covered.
start: Address_space page offset of first modified pte relative to bitmap_pgoff
end: Address_space page offset of last modified pte relative to bitmap_pgoff

int clean_record_pte(pte_t *pte, unsigned long addr, unsigned long end, struct mm_walk *walk)¶: Clean a pte and record its address space offset in a bitmap

Parameters

pte_t *pte: Pointer to the pte
unsigned long addr: The start of virtual address to be clean
unsigned long end: The end of virtual address to be clean
struct mm_walk *walk: pagetable walk callback argument

Description

The function cleans a pte and records the range in virtual address space of touched ptes for efficient TLB flushes. It also records dirty ptes in a bitmap representing page offsets in the address_space, as well as the first and last of the bits touched.
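Recording a dirty pte in the bitmap, together with the first and last touched bits, might look like the following user-space model. The struct clean_model, its initial values, and keeping end as "last bit + 1" are assumptions of this sketch rather than details taken from the kernel.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_LONG_MODEL (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical mirror of the bitmap-tracking members of struct clean_walk. */
struct clean_model {
    unsigned long *bitmap;  /* one bit per page offset */
    size_t bitmap_pgoff;    /* page offset represented by bit 0 */
    size_t start;           /* lowest bit set so far */
    size_t end;             /* highest bit set so far, plus one */
};

static void clean_model_init(struct clean_model *c, unsigned long *bitmap,
                             size_t bitmap_pgoff)
{
    c->bitmap = bitmap;
    c->bitmap_pgoff = bitmap_pgoff;
    c->start = SIZE_MAX;  /* nothing recorded yet */
    c->end = 0;
}

/* Mark the page at address_space offset pgoff as having been dirty. */
static void clean_model_record(struct clean_model *c, size_t pgoff)
{
    size_t bit = pgoff - c->bitmap_pgoff;

    c->bitmap[bit / BITS_PER_LONG_MODEL] |= 1UL << (bit % BITS_PER_LONG_MODEL);
    if (bit < c->start)
        c->start = bit;
    if (bit + 1 > c->end)
        c->end = bit + 1;
}
```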

unsigned long wp_shared_mapping_range(struct address_space *mapping, pgoff_t first_index, pgoff_t nr)¶: Write-protect all ptes in an address space range

Parameters

struct address_space *mapping: The address_space we want to write protect
pgoff_t first_index: The first page offset in the range
pgoff_t nr: Number of incremental page offsets to cover

Note

This function currently skips transhuge page-table entries, since it’s intended for dirty-tracking on the PTE level. It will warn on encountering transhuge write-enabled entries, though, and can easily be extended to handle them as well.

Return

The number of ptes actually write-protected. Note that already write-protected ptes are not counted.

unsigned long clean_record_shared_mapping_range(struct address_space *mapping, pgoff_t first_index, pgoff_t nr, pgoff_t bitmap_pgoff, unsigned long *bitmap, pgoff_t *start, pgoff_t *end)¶: Clean and record all ptes in an address space range

Parameters

struct address_space *mapping: The address_space we want to clean
pgoff_t first_index: The first page offset in the range
pgoff_t nr: Number of incremental page offsets to cover
pgoff_t bitmap_pgoff: The page offset of the first bit in bitmap
unsigned long *bitmap: Pointer to a bitmap of at least nr bits. The bitmap needs to cover the whole range first_index..first_index + nr.
pgoff_t *start: Pointer to the number of the first set bit in bitmap. The value is modified as new bits are set by the function.
pgoff_t *end: Pointer to the number of the last set bit in bitmap. The value is modified as new bits are set by the function.

Description

When this function returns there is no guarantee that a CPU has not already dirtied new ptes. However it will not clean any ptes not reported in the bitmap. The guarantees are as follows:

All ptes dirty when the function starts executing will end up recorded in the bitmap.
All ptes dirtied after that will either remain dirty, be recorded in the bitmap or both.

If a caller needs to make sure all dirty ptes are picked up and none additional are added, it first needs to write-protect the address-space range and make sure new writers are blocked in page_mkwrite() or pfn_mkwrite(). Then, after a TLB flush following the write-protection, it can pick up all dirty bits.

This function currently skips transhuge page-table entries, since it’s intended for dirty-tracking on the PTE level. It will warn on encountering transhuge dirty entries, though, and can easily be extended to handle them as well.

Return

The number of dirty ptes actually cleaned.

bool pcpu_addr_in_chunk(struct pcpu_chunk *chunk, void *addr)¶: check if the address is served from this chunk

Parameters

struct pcpu_chunk *chunk: chunk of interest
void *addr: percpu address

Return

True if the address is served from this chunk.

bool pcpu_check_block_hint(struct pcpu_block_md *block, int bits, size_t align)¶: check against the contig hint

Parameters

struct pcpu_block_md *block: block of interest
int bits: size of allocation
size_t align: alignment of area (max PAGE_SIZE)

Description

Check to see if the allocation can fit in the block’s contig hint. Note, a chunk uses the same hints as a block so this can also check against the chunk’s contig hint.
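The fit check amounts to aligning the start of the contig hint and seeing whether the request still fits in what remains. The sketch below is a user-space model: struct block_model, align_up() and check_block_hint_model() are hypothetical, and both the hint and the alignment are assumed to be expressed in allocation units.

```c
#include <stdbool.h>

/* Hypothetical block metadata: a contiguous free run and where it starts. */
struct block_model {
    int contig_hint;        /* length of the largest known free run */
    int contig_hint_start;  /* offset at which that run begins */
};

/* Round x up to the next multiple of align (align must be positive). */
static int align_up(int x, int align)
{
    return (x + align - 1) / align * align;
}

/*
 * True when an allocation of the given size and alignment fits inside
 * the contig hint once the start has been aligned.
 */
static bool check_block_hint_model(const struct block_model *block,
                                   int bits, int align)
{
    int pad = align_up(block->contig_hint_start, align) - block->contig_hint_start;

    return pad + bits <= block->contig_hint;
}
```

A run of 8 units starting at offset 5 can hold 5 units at alignment 4 (3 units of padding, then 5), but not 6.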

void pcpu_next_md_free_region(struct pcpu_chunk *chunk, int *bit_off, int *bits)¶: finds the next hint free area

Parameters

struct pcpu_chunk *chunk: chunk of interest
int *bit_off: chunk offset
int *bits: size of free area

Description

Helper function for pcpu_for_each_md_free_region. It checks block->contig_hint and performs aggregation across blocks to find the next hint. It modifies bit_off and bits in-place to be consumed in the loop.

void pcpu_next_fit_region(struct pcpu_chunk *chunk, int alloc_bits, int align, int *bit_off, int *bits)¶: finds fit areas for a given allocation request

Parameters

struct pcpu_chunk *chunk: chunk of interest
int alloc_bits: size of allocation
int align: alignment of area (max PAGE_SIZE)
int *bit_off: chunk offset
int *bits: size of free area

Description

Finds the next free region that is viable for use with a given size and alignment. This only returns if there is a valid area to be used for this allocation. block->first_free is returned if the allocation request fits within the block to see if the request can be fulfilled prior to the contig hint.

void *pcpu_mem_zalloc(size_t size, gfp_t gfp)¶: allocate memory

Parameters

size_t size: bytes to allocate
gfp_t gfp: allocation flags

Description

Allocate size bytes. If size is smaller than PAGE_SIZE, kzalloc() is used; otherwise, the equivalent of vzalloc() is used. This is to facilitate passing through whitelisted flags. The returned memory is always zeroed.

Return

Pointer to the allocated area on success, NULL on failure.

void pcpu_mem_free(void *ptr)¶: free memory

Parameters

void *ptr: memory to free

Description

Free ptr. ptr should have been allocated using pcpu_mem_zalloc().

void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)¶: put chunk in the appropriate chunk slot

Parameters

struct pcpu_chunk *chunk: chunk of interest
int oslot: the previous slot it was on

Description

This function is called after an allocation or free changed chunk. New slot according to the changed state is determined and chunk is moved to the slot. Note that the reserved chunk is never put on chunk slots.

Context

pcpu_lock.

void pcpu_block_update(struct pcpu_block_md *block, int start, int end)¶: updates a block given a free area

Parameters

struct pcpu_block_md *block: block of interest
int start: start offset in block
int end: end offset in block

Description

Updates a block given a known free area. The region [start, end) is expected to be the entirety of the free area within a block. Chooses the best starting offset if the contig hints are equal.

void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk, bool full_scan)¶: updates metadata about a chunk

Parameters

struct pcpu_chunk *chunk: chunk of interest
bool full_scan: if we should scan from the beginning

Description

Iterates over the metadata blocks to find the largest contig area. A full scan can be avoided on the allocation path as this is triggered if we broke the contig_hint. In doing so, the scan_hint will be before the contig_hint or after if the scan_hint == contig_hint. This cannot be prevented on freeing as we want to find the largest area possibly spanning blocks.

void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)¶

Parameters

struct pcpu_chunk *chunk: chunk of interest
int index: index of the metadata block

Description

Scans over the block beginning at first_free and updates the block metadata accordingly.

void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, int bits)¶: update hint on allocation path

Parameters

struct pcpu_chunk *chunk: chunk of interest
int bit_off: chunk offset
int bits: size of request

Description

Updates metadata for the allocation path. The metadata only has to be refreshed by a full scan iff the chunk’s contig hint is broken. Block level scans are required if the block’s contig hint is broken.

void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off, int bits)¶: updates the block hints on the free path

Parameters

struct pcpu_chunk *chunk: chunk of interest
int bit_off: chunk offset
int bits: size of request

Description

Updates metadata for the free path. This avoids a blind block refresh by making use of the block contig hints. If this fails, it scans forward and backward to determine the extent of the free area. This is capped at the boundary of blocks.

A chunk update is triggered if a page becomes free, a block becomes free, or the free spans across blocks. This tradeoff is to minimize iterating over the block metadata to update chunk_md->contig_hint. chunk_md->contig_hint may be off by up to a page, but it will never be more than the available space. If the contig hint is contained in one block, it will be accurate.

bool pcpu_is_populated(struct pcpu_chunk *chunk, int bit_off, int bits, int *next_off)¶: determines if the region is populated

Parameters

struct pcpu_chunk *chunk: chunk of interest
int bit_off: chunk offset
int bits: size of area
int *next_off: return value for the next offset to start searching

Description

For atomic allocations, check if the backing pages are populated.

Return

true if the backing pages are populated, false otherwise. next_off is used to skip over unpopulated blocks in pcpu_find_block_fit().

int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits, size_t align, bool pop_only)¶: finds the block index to start searching

Parameters

struct pcpu_chunk *chunk: chunk of interest
int alloc_bits: size of request in allocation units
size_t align: alignment of area (max PAGE_SIZE bytes)
bool pop_only: use populated regions only

Description

Given a chunk and an allocation spec, find the offset to begin searching for a free region. This iterates over the bitmap metadata blocks to find an offset that will be guaranteed to fit the requirements. It is not quite first fit: if the allocation does not fit in the contig hint of a block or chunk, that block or chunk is skipped. This errs on the side of caution to prevent excess iteration. Poor alignment can cause the allocator to skip over blocks and chunks that have valid free areas.

Return

The offset in the bitmap to begin searching. -1 if no offset is found.
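
The "not quite first fit" behaviour can be sketched in userspace as follows. Everything here is an assumption made for illustration: struct block_md is a cut-down stand-in for the per-block metadata, BLOCK_BITS is an arbitrary block size, and alignment is expressed in allocation units rather than bytes.

```c
#include <stddef.h>

#define BLOCK_BITS 64	/* hypothetical number of allocation units per block */

/* Simplified stand-in for the per-block metadata; not the kernel struct. */
struct block_md {
	int first_free;		/* first free unit in the block */
	int contig_hint;	/* size of the largest free run, in units */
	int contig_hint_start;	/* block-relative start of that run */
};

/*
 * Return the chunk-relative offset at which a bitmap search should start,
 * or -1 if no block's contig hint can hold alloc_bits units at the given
 * alignment. align_bits must be a power of two.
 */
int find_block_fit(const struct block_md *blocks, size_t nr_blocks,
		   int alloc_bits, int align_bits)
{
	for (size_t i = 0; i < nr_blocks; i++) {
		const struct block_md *b = &blocks[i];
		int aligned, needed;

		if (!b->contig_hint)
			continue;	/* block is full */

		/* Units lost to alignment padding count against the hint. */
		aligned = (b->contig_hint_start + align_bits - 1) & ~(align_bits - 1);
		needed = aligned - b->contig_hint_start + alloc_bits;

		/* Skip the block rather than probing its smaller free areas. */
		if (b->contig_hint < needed)
			continue;

		return (int)(i * BLOCK_BITS) + b->first_free;
	}
	return -1;
}
```

In this model a block whose largest free run is 4 units starting at unit 10 accepts a 4-unit request with no alignment, but is skipped when the same request asks for 8-unit alignment, because the padding up to unit 16 no longer fits in the hint.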

int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits, size_t align, int start)¶: allocates an area from a pcpu_chunk

Parameters

struct pcpu_chunk *chunk: chunk of interest
int alloc_bits: size of request in allocation units
size_t align: alignment of area (max PAGE_SIZE)
int start: bit_off to start searching

Description

This function takes in a start offset to begin searching to fit an allocation of alloc_bits with alignment align. It needs to scan the allocation map because if it fits within the block’s contig hint, start will be block->first_free. This is an attempt to fill the allocation prior to breaking the contig hint. The allocation and boundary maps are updated accordingly if it confirms a valid free area.

Return

Allocated addr offset in chunk on success. -1 if no matching area is found.

int pcpu_free_area(struct pcpu_chunk *chunk, int off)¶: frees the corresponding offset

Parameters

struct pcpu_chunk *chunk: chunk of interest
int off: addr offset into chunk

Description

This function determines the size of an allocation to free using the boundary bitmap and clears the allocation map.

Return

Number of freed bytes.
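
A minimal userspace sketch of how a boundary map can recover the size of an allocation from its start offset alone. The layout is an assumption made for illustration: one byte per allocation unit instead of real bitmaps, a hypothetical MIN_ALLOC_SIZE of 4 bytes per unit, and a boundary entry set at both the start and the end of every allocation.

```c
#include <stddef.h>

#define MIN_ALLOC_SIZE 4	/* assumed bytes per allocation unit */

/*
 * Hypothetical byte-per-bit model of a chunk's maps.
 * alloc_map[i] != 0: unit i is allocated.
 * bound_map[i] != 0: an allocation starts or ends at unit i
 * (bound_map has nbits + 1 entries so the last end can be marked).
 * Returns the number of bytes freed.
 */
int free_area(unsigned char *alloc_map, const unsigned char *bound_map,
	      size_t nbits, int off)
{
	size_t bit_off = (size_t)off / MIN_ALLOC_SIZE;
	size_t end = bit_off + 1;

	/* The next boundary after the start marks the end of the area. */
	while (end < nbits && !bound_map[end])
		end++;

	for (size_t i = bit_off; i < end; i++)
		alloc_map[i] = 0;

	return (int)(end - bit_off) * MIN_ALLOC_SIZE;
}
```

The caller passes only the byte offset; no per-allocation size header is needed in this scheme.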

struct pcpu_chunk *pcpu_alloc_first_chunk(unsigned long tmp_addr, int map_size)¶: creates chunks that serve the first chunk

Parameters

unsigned long tmp_addr: the start of the region served
int map_size: size of the region served

Description

This is responsible for creating the chunks that serve the first chunk. The base_addr is page aligned down of tmp_addr while the region end is page aligned up. Offsets are kept track of to determine the region served. All this is done to appease the bitmap allocator in avoiding partial blocks.

Return

Chunk serving the region at tmp_addr of map_size.
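
The alignment arithmetic described above can be sketched as below. This is a simplified model: struct region and first_chunk_region() are hypothetical names, DEMO_PAGE_SIZE is an assumed 4 KiB page, and the region end is rounded to a page only, whereas the kernel may round to a coarser granularity.

```c
#define DEMO_PAGE_SIZE 4096UL	/* assumed page size */

struct region {
	unsigned long base_addr;	/* tmp_addr rounded down to a page */
	unsigned long start_offset;	/* unused bytes before the served area */
	unsigned long end_offset;	/* unused bytes after the served area */
	unsigned long region_size;	/* whole pages covering the served area */
};

/* Compute the page-aligned region around [tmp_addr, tmp_addr + map_size). */
struct region first_chunk_region(unsigned long tmp_addr, unsigned long map_size)
{
	struct region r;

	r.base_addr = tmp_addr & ~(DEMO_PAGE_SIZE - 1);
	r.start_offset = tmp_addr - r.base_addr;
	/* Round the end of the served area up to the next page boundary. */
	r.region_size = (r.start_offset + map_size + DEMO_PAGE_SIZE - 1) &
			~(DEMO_PAGE_SIZE - 1);
	r.end_offset = r.region_size - r.start_offset - map_size;
	return r;
}
```

The two offsets record which parts of the aligned region are padding, so the bitmap never has to describe a partial block.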

void pcpu_chunk_populated(struct pcpu_chunk *chunk, int page_start, int page_end)¶: post-population bookkeeping

Parameters

struct pcpu_chunk *chunk: pcpu_chunk which got populated
int page_start: the start page
int page_end: the end page

Description

Pages in [page_start, page_end) have been populated to chunk. Update the bookkeeping information accordingly. Must be called after each successful population.

void pcpu_chunk_depopulated(struct pcpu_chunk *chunk, int page_start, int page_end)¶: post-depopulation bookkeeping

Parameters

struct pcpu_chunk *chunk: pcpu_chunk which got depopulated
int page_start: the start page
int page_end: the end page

Description

Pages in [page_start, page_end) have been depopulated from chunk. Update the bookkeeping information accordingly. Must be called after each successful depopulation.

struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)¶: determine chunk containing specified address

Parameters

void *addr: address for which the chunk needs to be determined.

Description

This is an internal function that handles all but static allocations. Static percpu address values should never be passed into the allocator.

Return

The address of the found chunk.

void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, gfp_t gfp)¶: the percpu allocator

Parameters

size_t size: size of area to allocate in bytes
size_t align: alignment of area (max PAGE_SIZE)
bool reserved: allocate from the reserved chunk if available
gfp_t gfp: allocation flags

Description

Allocate percpu area of size bytes aligned at align. If gfp doesn’t contain GFP_KERNEL, the allocation is atomic. If gfp has __GFP_NOWARN then no warning will be triggered on invalid or failed allocation requests.

Return

Percpu pointer to the allocated area on success, NULL on failure.
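
Because GFP_KERNEL is a mask made of several bits, "contains GFP_KERNEL" is read here as "every bit of the mask is set". The sketch below shows that reading with stand-in flag values. The DEMO_* constants and both helper names are hypothetical; the real flag values and the kernel's exact test may differ.

```c
#include <stdbool.h>

/* Stand-in flag values for illustration only; the kernel's differ. */
#define DEMO_RECLAIM	0x1u
#define DEMO_IO		0x2u
#define DEMO_FS		0x4u
#define DEMO_NOWARN	0x8u
#define DEMO_GFP_KERNEL	(DEMO_RECLAIM | DEMO_IO | DEMO_FS)

/* The request is atomic unless every GFP_KERNEL bit is present. */
bool request_is_atomic(unsigned int gfp)
{
	return (gfp & DEMO_GFP_KERNEL) != DEMO_GFP_KERNEL;
}

/* Warnings stay enabled unless the caller passed the no-warn flag. */
bool request_may_warn(unsigned int gfp)
{
	return !(gfp & DEMO_NOWARN);
}
```

Under this reading, a request that carries only part of the mask is still treated as atomic.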

void pcpu_balance_free(bool empty_only)¶: manage the amount of free chunks

Parameters

bool empty_only: free chunks only if there are no populated pages

Description

If empty_only is false, reclaim all fully free chunks regardless of the number of populated pages. Otherwise, only reclaim chunks that have no populated pages.

Context

pcpu_lock (can be dropped temporarily)

void pcpu_balance_populated(void)¶: manage the amount of populated pages

Parameters

void: no arguments

Description

Maintain a certain amount of populated pages to satisfy atomic allocations. It is possible that this is called when physical memory is scarce causing OOM killer to be triggered. We should avoid doing so until an actual allocation causes the failure as it is possible that requests can be serviced from already backed regions.

Context

pcpu_lock (can be dropped temporarily)

void pcpu_reclaim_populated(void)¶: scan over to_depopulate chunks and free empty pages

Parameters

void: no arguments

Description

Scan over chunks in the depopulate list and try to release unused populated pages back to the system. Depopulated chunks are sidelined to prevent repopulating these pages unless required. Fully free chunks are reintegrated and freed accordingly (1 is kept around). If we drop below the empty populated pages threshold, reintegrate the chunk if it has empty free pages. Each chunk is scanned in the reverse order to keep populated pages close to the beginning of the chunk.

Context

pcpu_lock (can be dropped temporarily)

void pcpu_balance_workfn(struct work_struct *work)¶: manage the amount of free chunks and populated pages

Parameters

struct work_struct *work: unused

Description

For each chunk type, manage the number of fully free chunks and the number of populated pages. An important thing to consider is when pages are freed and how they contribute to the global counts.

void free_percpu(void __percpu *ptr)¶: free percpu area

Parameters

void __percpu *ptr: pointer to area to free

Description

Free percpu area ptr.

Context

Can be called from atomic context.

bool is_kernel_percpu_address(unsigned long addr)¶: test whether address is from static percpu area

Parameters

unsigned long addr: address to test

Description

Test whether addr belongs to in-kernel static percpu area. Module static percpu areas are not considered. For those, use is_module_percpu_address().

Return

true if addr is from in-kernel static percpu area, false otherwise.

phys_addr_t per_cpu_ptr_to_phys(void *addr)¶: convert translated percpu address to physical address

Parameters

void *addr: the address to be converted to physical address

Description

Given addr, which is a dereferenceable address obtained via one of the percpu access macros, this function translates it into its physical address. The caller is responsible for ensuring addr stays valid until this function finishes.

The percpu allocator has a special setup for the first chunk, which currently supports either embedding in linear address space or vmalloc mapping. From the second chunk on, the backing allocator (currently either vm or km) provides translation.

The addr can be translated simply without checking if it falls into the first chunk. But the current code reflects better how percpu allocator actually works, and the verification can discover both bugs in percpu allocator itself and per_cpu_ptr_to_phys() callers. So we keep current code.

Return

The physical address for addr.

struct pcpu_alloc_info *pcpu_alloc_alloc_info(int nr_groups, int nr_units)¶: allocate percpu allocation info

Parameters

int nr_groups: the number of groups
int nr_units: the number of units

Description

Allocate ai which is large enough for nr_groups groups containing nr_units units. The returned ai’s groups[0].cpu_map points to the cpu_map array which is long enough for nr_units and filled with NR_CPUS. It’s the caller’s responsibility to initialize cpu_map pointer of other groups.

Return

Pointer to the allocated pcpu_alloc_info on success, NULL on failure.

void pcpu_free_alloc_info(struct pcpu_alloc_info *ai)¶: free percpu allocation info

Parameters

struct pcpu_alloc_info *ai: pcpu_alloc_info to free

Description

Free ai which was allocated by pcpu_alloc_alloc_info().

void pcpu_dump_alloc_info(const char *lvl, const struct pcpu_alloc_info *ai)¶: print out information about pcpu_alloc_info

Parameters

const char *lvl: loglevel
const struct pcpu_alloc_info *ai: allocation info to dump

Description

Print out information about ai using loglevel lvl.

void pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, void *base_addr)¶: initialize the first percpu chunk

Parameters

const struct pcpu_alloc_info *ai: pcpu_alloc_info describing how the percpu area is shaped
void *base_addr: mapped address

Description

Initialize the first percpu chunk which contains the kernel static percpu area. This function is to be called from arch percpu area setup path.

ai contains all information necessary to initialize the first chunk and prime the dynamic percpu allocator.

ai->static_size is the size of static percpu area.

ai->reserved_size, if non-zero, specifies the amount of bytes to reserve after the static area in the first chunk. This reserves the first chunk such that it’s available only through reserved percpu allocation. This is primarily used to serve module percpu static areas on architectures where the addressing model has limited offset range for symbol relocations to guarantee module percpu symbols fall inside the relocatable range.

ai->dyn_size determines the number of bytes available for dynamic allocation in the first chunk. The area between ai->static_size + ai->reserved_size + ai->dyn_size and ai->unit_size is unused.

ai->unit_size specifies unit size and must be aligned to PAGE_SIZE and equal to or larger than ai->static_size + ai->reserved_size + ai->dyn_size.

ai->atom_size is the allocation atom size and used as alignment for vm areas.

ai->alloc_size is the allocation size and always multiple of ai->atom_size. This is larger than ai->atom_size if ai->unit_size is larger than ai->atom_size.

ai->nr_groups and ai->groups describe virtual memory layout of percpu areas. Units which should be colocated are put into the same group. Dynamic VM areas will be allocated according to these groupings. If ai->nr_groups is zero, a single group containing all units is assumed.

The caller should have mapped the first chunk at base_addr and copied static data to each unit.

The first chunk will always contain a static and a dynamic region. However, the static region is not managed by any chunk. If the first chunk also contains a reserved region, it is served by two chunks - one for the reserved region and one for the dynamic region. They share the same vm, but use offset regions in the area allocation map. The chunk serving the dynamic region is circulated in the chunk slots and available for dynamic allocation like any other chunk.
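
The size constraints on ai->unit_size stated above can be checked with a few lines of arithmetic. The sketch below is a userspace illustration only: struct alloc_sizes carries just the four size fields discussed, DEMO_PAGE_SIZE is an assumed 4 KiB page, and both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size */

/* Only the size fields discussed above; not the kernel's pcpu_alloc_info. */
struct alloc_sizes {
	size_t static_size;
	size_t reserved_size;
	size_t dyn_size;
	size_t unit_size;
};

/* Check the unit_size constraints stated in the description. */
bool unit_size_valid(const struct alloc_sizes *ai)
{
	size_t used = ai->static_size + ai->reserved_size + ai->dyn_size;

	return ai->unit_size % DEMO_PAGE_SIZE == 0 && ai->unit_size >= used;
}

/* Bytes at the end of each unit that no region uses. */
size_t unit_unused_bytes(const struct alloc_sizes *ai)
{
	return ai->unit_size - (ai->static_size + ai->reserved_size + ai->dyn_size);
}
```

For example, static, reserved and dynamic sizes of 20000, 8192 and 20480 bytes add up to 48672 bytes, so a 12-page unit of 49152 bytes is valid and leaves 480 bytes unused.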

struct pcpu_alloc_info *pcpu_build_alloc_info(size_t reserved_size, size_t dyn_size, size_t atom_size, pcpu_fc_cpu_distance_fn_t cpu_distance_fn)¶: build alloc_info considering distances between CPUs

Parameters

size_t reserved_size: the size of reserved percpu area in bytes
size_t dyn_size: minimum free size for dynamic allocation in bytes
size_t atom_size: allocation atom size
pcpu_fc_cpu_distance_fn_t cpu_distance_fn: callback to determine distance between cpus, optional

Description

This function determines grouping of units, their mappings to cpus and other parameters considering needed percpu size, allocation atom size and distances between CPUs.

Groups are always multiples of atom size and CPUs which are of LOCAL_DISTANCE both ways are grouped together and share space for units in the same group. The returned configuration is guaranteed to have CPUs on different nodes on different groups and >=75% usage of allocated virtual address space.

Return

On success, pointer to the new allocation_info is returned. On failure, ERR_PTR value is returned.

int pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size, size_t atom_size, pcpu_fc_cpu_distance_fn_t cpu_distance_fn, pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn)¶: embed the first percpu chunk into bootmem

Parameters

size_t reserved_size: the size of reserved percpu area in bytes
size_t dyn_size: minimum free size for dynamic allocation in bytes
size_t atom_size: allocation atom size
pcpu_fc_cpu_distance_fn_t cpu_distance_fn: callback to determine distance between cpus, optional
pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn: callback to convert cpu to its node, optional

Description

This is a helper to ease setting up embedded first percpu chunk and can be called where pcpu_setup_first_chunk() is expected.

If this function is used to set up the first chunk, it is allocated by calling pcpu_fc_alloc and used as-is without being mapped into vmalloc area. Allocations are always whole multiples of atom_size aligned to atom_size.

This enables the first chunk to piggyback on the linear physical mapping, which often uses a larger page size. Please note that this can result in a very sparse cpu->unit mapping on NUMA machines, thus requiring a large vmalloc address space. Don’t use this allocator if vmalloc space is not orders of magnitude larger than the distances between node memory addresses (i.e. 32-bit NUMA machines).

dyn_size specifies the minimum dynamic area size.

If the needed size is smaller than the minimum or specified unit size, the leftover is returned using pcpu_fc_free.

Return

0 on success, -errno on failure.

int pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn)¶: map the first chunk using PAGE_SIZE pages

Parameters

size_t reserved_size: the size of reserved percpu area in bytes
pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn: callback to convert cpu to its node, optional

Description

This is a helper to ease setting up page-remapped first percpu chunk and can be called where pcpu_setup_first_chunk() is expected.

This is the basic allocator. Static percpu area is allocated page-by-page into vmalloc area.

Return

0 on success, -errno on failure.

long copy_from_user_nofault(void *dst, const void __user *src, size_t size)¶: safely attempt to read from a user-space location

Parameters

void *dst: pointer to the buffer that shall take the data
const void __user *src: address to read from. This must be a user address.
size_t size: size of the data chunk

Description

Safely read from user address src to the buffer at dst. If a kernel fault happens, handle that and return -EFAULT.

long copy_to_user_nofault(void __user *dst, const void *src, size_t size)¶: safely attempt to write to a user-space location

Parameters

void __user *dst: address to write to
const void *src: pointer to the data that shall be written
size_t size: size of the data chunk

Description

Safely write to address dst from the buffer at src. If a kernel fault happens, handle that and return -EFAULT.

long strncpy_from_user_nofault(char *dst, const void __user *unsafe_addr, long count)¶

Copy a NUL terminated string from unsafe user address.

Parameters

char *dst: Destination address, in kernel space. This buffer must be at least count bytes long.
const void __user *unsafe_addr: Unsafe user address.
long count: Maximum number of bytes to copy, including the trailing NUL.

Description

Copies a NUL-terminated string from unsafe user address to kernel buffer.

On success, returns the length of the string INCLUDING the trailing NUL.

If access fails, returns -EFAULT (some data may have been copied and the trailing NUL added).

If count is smaller than the length of the string, copies count-1 bytes, sets the last byte of dst buffer to NUL and returns count.
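
The return-value convention can be modelled in userspace as below. This is only a model of the three documented outcomes: model_strncpy() is a hypothetical name and src is an ordinary buffer, so the -EFAULT path cannot occur here. Returning 0 for a non-positive count is an assumption of this sketch.

```c
/*
 * Userspace model of the return-value convention only: src is an ordinary
 * buffer, so the -EFAULT path of the kernel function cannot occur here.
 */
long model_strncpy(char *dst, const char *src, long count)
{
	long i;

	if (count <= 0)
		return 0;	/* assumed behaviour for an invalid count */

	for (i = 0; i < count; i++) {
		dst[i] = src[i];
		if (!src[i])
			return i + 1;	/* length including the trailing NUL */
	}

	/* String did not fit: keep count - 1 bytes and terminate. */
	dst[count - 1] = '\0';
	return count;
}
```

In this model a return value equal to count means either an exact fit or a truncated copy, so a caller that must tell the two apart needs a buffer one byte larger than the longest string it accepts.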

long strnlen_user_nofault(const void __user *unsafe_addr, long count)¶

Get the size of a user string INCLUDING final NUL.

Parameters

const void __user *unsafe_addr: The string to measure.
long count: Maximum count (including NUL)

Description

Get the size of a NUL-terminated string in user space without pagefault.

Returns the size of the string INCLUDING the terminating NUL.

If the string is too long, returns a number larger than count. User has to check the return value against “> count”. On exception (or invalid count), returns 0.

Unlike strnlen_user, this can be used from IRQ handler etc. because it disables pagefaults.
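
A userspace model of the return values described above, useful for seeing why callers compare against count. model_strnlen() is a hypothetical name and the input is a plain buffer, so of the zero-returning cases only the invalid count is modelled. Returning exactly count + 1 for an over-long string is a choice of this sketch; the text only promises a value larger than count.

```c
/*
 * Userspace model of the return-value convention only; a plain buffer
 * stands in for the user string, so the only zero return modelled here
 * is the invalid-count case.
 */
long model_strnlen(const char *s, long count)
{
	long i;

	if (count <= 0)
		return 0;

	for (i = 0; i < count; i++) {
		if (!s[i])
			return i + 1;	/* size including the terminating NUL */
	}
	return count + 1;	/* too long: any value > count signals this */
}
```

A caller would treat 0 as an error, a value greater than count as "string too long", and anything else as the buffer size needed to hold the string with its NUL.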

bool writeback_throttling_sane(struct scan_control *sc)¶: is the usual dirty throttling mechanism available?

Parameters

struct scan_control *sc: scan_control in question

Description

The normal page dirty throttling mechanism in balance_dirty_pages() is completely broken with the legacy memcg and direct stalling in shrink_folio_list() is used for throttling instead, which lacks all the niceties such as fairness, adaptive pausing, bandwidth proportional allocation and configurability.

This function tests whether the vmscan currently in progress can assume that the normal dirty throttling mechanism is operational.

unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)¶: Returns the number of pages on the given LRU list.

Parameters

struct lruvec *lruvec: lru vector
enum lru_list lru: lru to use
int zone_idx: zones to consider (use MAX_NR_ZONES - 1 for the whole LRU list)

long remove_mapping(struct address_space *mapping, struct folio *folio)¶: Attempt to remove a folio from its mapping.

Parameters

struct address_space *mapping: The address space.
struct folio *folio: The folio to remove.

Description

If the folio is dirty, under writeback or if someone else has a ref on it, removal will fail.

Return

The number of pages removed from the mapping. 0 if the folio could not be removed.

Context

The caller should have a single refcount on the folio and hold its lock.

void folio_putback_lru(struct folio *folio)¶: Put previously isolated folio onto appropriate LRU list.

Parameters

struct folio *folio: Folio to be returned to an LRU list.

Description

Add previously isolated folio to appropriate LRU list. The folio may still be unevictable for other reasons.

Context

lru_lock must not be held, interrupts must be enabled.

bool folio_isolate_lru(struct folio *folio)¶: Try to isolate a folio from its LRU list.

Parameters

struct folio *folio: Folio to isolate from its LRU list.

Description

Isolate a folio from an LRU list and adjust the vmstat statistic corresponding to whatever LRU list the folio was on.

The folio will have its LRU flag cleared. If it was found on the active list, it will have the Active flag set. If it was found on the unevictable list, it will have the Unevictable flag set. These flags may need to be cleared by the caller before letting the page go.

Context

Must be called with an elevated refcount on the folio. This is a fundamental difference from isolate_lru_folios() (which is called without a stable reference).
The lru_lock must not be held.
Interrupts must be enabled.

Return

true if the folio was removed from an LRU list. false if the folio was not on an LRU list.

void check_move_unevictable_folios(struct folio_batch *fbatch)¶: Move evictable folios to appropriate zone lru list

Parameters

struct folio_batch *fbatch: Batch of lru folios to check.

Description

Checks folios for evictability. If an evictable folio is on the unevictable lru list, it is moved to the appropriate evictable lru list. This function should only be used for lru folios.

void __remove_pages(unsigned long pfn, unsigned long nr_pages, struct vmem_altmap *altmap)¶: remove sections of pages

Parameters

unsigned long pfn: starting pageframe (must be aligned to start of a section)
unsigned long nr_pages: number of pages to remove (must be multiple of section size)
struct vmem_altmap *altmap: alternative device page map or NULL if default memmap is used

Description

Generic helper function to remove section mappings and sysfs entries for the section of the memory we are removing. Caller needs to make sure that pages are marked reserved and zones are adjusted properly by calling offline_pages().

void try_offline_node(int nid)¶

Parameters

int nid: the node ID

Description

Offline a node if all memory sections and cpus of the node are removed.

NOTE

The caller must call lock_device_hotplug() to serialize hotplug and online/offline operations before this call.

void __remove_memory(u64 start, u64 size)¶: Remove memory if every memory block is offline

Parameters

u64 start: physical address of the region to remove
u64 size: size of the region to remove

NOTE

The caller must call lock_device_hotplug() to serialize hotplug and online/offline operations before this call, as required by try_offline_node().

unsigned long mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)¶: Begin a read side critical section against a VA range

Parameters

struct mmu_interval_notifier *interval_sub: The interval subscription

Description

mmu_interval_read_begin()/mmu_interval_read_retry() implement a collision-retry scheme similar to seqcount for the VA range under subscription. If the mm invokes invalidation during the critical section then mmu_interval_read_retry() will return true.

This is useful to obtain shadow PTEs where teardown or setup of the SPTEs require a blocking context. The critical region formed by this can sleep, and the required ‘user_lock’ can also be a sleeping lock.

The caller is required to provide a ‘user_lock’ to serialize both teardown and setup.

The return value should be passed to mmu_interval_read_retry().
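
The collision-retry idea can be sketched with a plain sequence counter. This is a single-threaded userspace model, not the kernel implementation: struct interval_sub, read_begin(), read_retry() and invalidate() are hypothetical stand-ins, and the locking and waiting that the real functions perform are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the subscription; not the kernel struct. */
struct interval_sub {
	unsigned long invalidate_seq;	/* bumped on every invalidation */
};

/* Snapshot the sequence at the start of the read-side section. */
unsigned long read_begin(const struct interval_sub *sub)
{
	return sub->invalidate_seq;
}

/* True if an invalidation happened since read_begin(); caller must redo. */
bool read_retry(const struct interval_sub *sub, unsigned long seq)
{
	return sub->invalidate_seq != seq;
}

/* What the mm side does, in this model, when the range is invalidated. */
void invalidate(struct interval_sub *sub)
{
	sub->invalidate_seq++;
}
```

A typical reader following this scheme takes a snapshot, does its possibly sleeping setup work, acquires the 'user_lock', and only then checks for a retry. If the check reports a collision, it drops the lock and starts again from a fresh snapshot.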

int mmu_notifier_register(struct mmu_notifier *subscription, struct mm_struct *mm)¶: Register a notifier on a mm

Parameters

struct mmu_notifier *subscription: The notifier to attach
struct mm_struct *mm: The mm to attach the notifier to

Description

Must not hold mmap_lock nor any other VM related lock when calling this registration function. Must also ensure mm_users can’t go down to zero while this runs to avoid races with mmu_notifier_release, so mm has to be current->mm or the mm should be pinned safely such as with get_task_mm(). If the mm is not current->mm, the mm_users pin should be released by calling mmput after mmu_notifier_register returns.

mmu_notifier_unregister() or mmu_notifier_put() must be always called to unregister the notifier.

While the caller has a mmu_notifier get the subscription->mm pointer will remain valid, and can be converted to an active mm pointer via mmget_not_zero().

struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops, struct mm_struct *mm)¶: Return the single struct mmu_notifier for the mm & ops

Parameters

const struct mmu_notifier_ops *ops: The operations struct being subscribed with
struct mm_struct *mm: The mm to attach notifiers to

Description

This function either allocates a new mmu_notifier via ops->alloc_notifier(), or returns an already existing notifier on the list. The value of the ops pointer is used to determine when two notifiers are the same.

Each call to mmu_notifier_get() must be paired with a call to mmu_notifier_put(). The caller must hold the write side of mm->mmap_lock.

While the caller has a mmu_notifier get the mm pointer will remain valid, and can be converted to an active mm pointer via mmget_not_zero().

void mmu_notifier_put(struct mmu_notifier *subscription)¶: Release the reference on the notifier

Parameters

struct mmu_notifier *subscription: The notifier to act on

Description

This function must be paired with each mmu_notifier_get(). It releases the reference obtained by the get. If this is the last reference, the process to free the notifier will be run asynchronously.

Unlike mmu_notifier_unregister() the get/put flow only calls ops->release when the mm_struct is destroyed. Instead free_notifier is always called to release any resources held by the user.

As ops->release is not guaranteed to be called, the user must ensure that all sptes are dropped, and no new sptes can be established before mmu_notifier_put() is called.

This function can be called from the ops->release callback, however the caller must still ensure it is called pairwise with mmu_notifier_get().

Modules calling this function must call mmu_notifier_synchronize() in their __exit functions to ensure the async work is completed.

int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub, struct mm_struct *mm, unsigned long start, unsigned long length, const struct mmu_interval_notifier_ops *ops)¶: Insert an interval notifier

Parameters

struct mmu_interval_notifier *interval_sub: Interval subscription to register
struct mm_struct *mm: mm_struct to attach to
unsigned long start: Starting virtual address to monitor
unsigned long length: Length of the range to monitor
const struct mmu_interval_notifier_ops *ops: Interval notifier operations to be called on matching events

Description

This function subscribes the interval notifier for notifications from the mm. Upon return the ops related to mmu_interval_notifier will be called whenever an event that intersects with the given range occurs.

Upon return the interval_sub may not be present in the interval tree yet. The caller must use the normal interval notifier read flow via mmu_interval_read_begin() to establish SPTEs for this range.

void mmu_interval_notifier_remove(struct mmu_interval_notifier *interval_sub)¶: Remove an interval notifier

Parameters

struct mmu_interval_notifier *interval_sub: Interval subscription to unregister

Description

This function must be paired with mmu_interval_notifier_insert(). It cannot be called from any ops callback.

Once this returns ops callbacks are no longer running on other CPUs and will not be called in future.

void mmu_notifier_synchronize(void)¶: Ensure all mmu_notifiers are freed

Parameters

void: no arguments

Description

This function ensures that all outstanding async SRCU work from mmu_notifier_put() is completed. After it returns any mmu_notifier_ops associated with an unused mmu_notifier will no longer be called.

Before using this function, the caller must ensure that all of its mmu_notifiers have been fully released via mmu_notifier_put().

Modules using the mmu_notifier_put() API should call this in their __exit function to avoid module unloading races.

size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info, struct list_head *pages)¶: inserts a list of pages into the balloon page list.

Parameters

struct balloon_dev_info *b_dev_info: balloon device descriptor where we will insert a new page to
struct list_head *pages: pages to enqueue - allocated using balloon_page_alloc.

Description

Driver must call this function to properly enqueue balloon pages before definitively removing them from the guest system.

Return

number of pages that were enqueued.

size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info, struct list_head *pages, size_t n_req_pages)¶: removes pages from balloon’s page list and returns a list of the pages.

Parameters

struct balloon_dev_info *b_dev_info: balloon device descriptor where we will grab a page from.
struct list_head *pages: pointer to the list of pages that would be returned to the caller.
size_t n_req_pages: number of requested pages.

Description

Driver must call this function to properly de-allocate previously enlisted balloon pages before definitively releasing them back to the guest system. This function tries to remove n_req_pages from the ballooned pages and return them to the caller in the pages list.

Note that this function may fail to dequeue some pages even if the balloon isn’t empty - since the page list can be temporarily empty due to compaction of isolated pages.

Return

number of pages that were added to the pages list.
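
The "may return fewer than requested" contract is sketched below with a simple linked list. This is a userspace illustration under stated assumptions: struct page_node and list_dequeue() are hypothetical, the kernel uses struct list_head and takes a lock, and compaction is not modelled.

```c
#include <stddef.h>

/* Hypothetical singly linked page node; the kernel uses struct list_head. */
struct page_node {
	struct page_node *next;
};

/*
 * Move up to n_req nodes from *balloon to *out and return how many moved.
 * The count can be smaller than n_req when the balloon list runs out.
 */
size_t list_dequeue(struct page_node **balloon, struct page_node **out,
		    size_t n_req)
{
	size_t n = 0;

	while (n < n_req && *balloon) {
		struct page_node *p = *balloon;

		*balloon = p->next;	/* unlink from the balloon list */
		p->next = *out;		/* push onto the caller's list */
		*out = p;
		n++;
	}
	return n;
}
```

A caller should act on the returned count rather than assume that n_req pages arrived.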

vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn, bool write)¶: insert a pmd size pfn

Parameters

struct vm_fault *vmf: Structure describing the fault
unsigned long pfn: pfn to insert
bool write: whether it’s a write fault

Description

Insert a pmd size pfn. See vmf_insert_pfn() for additional info.

Return

vm_fault_t value.

vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn, bool write)¶: insert a pud size pfn

Parameters

struct vm_fault *vmf: Structure describing the fault
unsigned long pfn: pfn to insert
bool write: whether it’s a write fault

Description

Insert a pud size pfn. See vmf_insert_pfn() for additional info.

Return

vm_fault_t value.

vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write)¶: insert a pud size folio mapped by a pud entry

Parameters

struct vm_fault *vmf: Structure describing the fault
struct folio *folio: folio to insert
bool write: whether it’s a write fault

Return

vm_fault_t value.

bool touch_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd, bool write)¶: Mark page table pmd entry as accessed and dirty (for write)

Parameters

struct vm_area_struct *vma: The VMA covering addr
unsigned long addr: The virtual address
pmd_t *pmd: pmd pointer into the page table mapping addr
bool write: Whether it’s a write access

Return

whether the pmd entry is changed

int __split_unmapped_folio(struct folio *folio, int new_order, struct page *split_at, struct xa_state *xas, struct address_space *mapping, enum split_type split_type)¶: splits an unmapped folio to lower order folios in two ways: uniform split or non-uniform split.

Parameters

struct folio *folio: the to-be-split folio
int new_order: the smallest order of the after split folios (since buddy allocator like split generates folios with orders from folio’s order - 1 to new_order).
struct page *split_at: in buddy allocator like split, the folio containing split_at will be split until its order becomes new_order.
struct xa_state *xas: xa_state pointing to folio->mapping->i_pages and locked by caller
struct address_space *mapping: folio->mapping
enum split_type split_type: if the split is uniform or not (buddy allocator like split)

Description

uniform split: the given folio is split into multiple new_order small folios, where all small folios have the same order. This is done when split_type is SPLIT_TYPE_UNIFORM.
buddy allocator like (non-uniform) split: the given folio is split in half and one of the halves (containing the given page) is split in half repeatedly until its order becomes new_order. This is done when split_type is SPLIT_TYPE_NON_UNIFORM.

The high level flow for these two methods is:

uniform split: xas is split with no expectation of failure and a single __split_folio_to_order() is called to split the folio into new_order along with stats update.
non-uniform split: folio_order - new_order calls to __split_folio_to_order() are expected to be made in a for loop to split the folio to one lower order at a time. The folio containing split_at is split in each iteration. xas is split into half in each iteration and can fail. A failed xas split leaves split folios as is without merging them back.
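The difference between the two flows can be sketched with a small user-space model. This is only an illustration and not kernel code: model_split(), struct model_split_stats and enum model_split_type are hypothetical names, and the model ignores xas splitting, locking, and failure handling. It counts how many split steps each method takes and how many folios exist afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
enum model_split_type { MODEL_SPLIT_UNIFORM, MODEL_SPLIT_NON_UNIFORM };

struct model_split_stats {
	unsigned int split_steps;	/* modelled calls to __split_folio_to_order() */
	unsigned long nr_folios;	/* folios present after the split */
};

static struct model_split_stats model_split(unsigned int order,
					    unsigned int new_order,
					    enum model_split_type type)
{
	struct model_split_stats st = { 0, 1 };

	/* The model only covers a real split to a strictly lower order. */
	assert(new_order < order);
	assert(order < 8 * sizeof(unsigned long));

	if (type == MODEL_SPLIT_UNIFORM) {
		/* One step produces 2^(order - new_order) equal folios. */
		st.split_steps = 1;
		st.nr_folios = 1UL << (order - new_order);
		return st;
	}

	/*
	 * Non-uniform: drop one order per iteration. Only the half that
	 * contains split_at keeps splitting, so each step adds one folio.
	 */
	for (unsigned int cur = order; cur > new_order; cur--) {
		st.split_steps++;
		st.nr_folios++;
	}
	return st;
}
```

Under this model, splitting an order-9 folio to order 3 takes one step and yields 64 folios with the uniform method, and takes 6 steps and yields 7 folios with the non-uniform method.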

After splitting, the caller’s folio reference will be transferred to the folio containing split_at. The caller needs to unlock and/or free after-split folios if necessary.

Return

0 - successful, <0 - failed (if -ENOMEM is returned, folio might be split but not to new_order, the caller needs to check)

int folio_check_splittable(struct folio *folio, unsigned int new_order, enum split_type split_type)¶: check if a folio can be split to a given order

Parameters

struct folio *folio: folio to be split
unsigned int new_order: the smallest order of the after-split folios (a buddy-allocator-like split generates folios with orders from folio’s order - 1 down to new_order).
enum split_type split_type: uniform or non-uniform split

Description

folio_check_splittable() checks if folio can be split to new_order using split_type method. The truncated folio check must come first.

Context

folio must be locked.

Return

0 - folio can be split to new_order, otherwise an error number is returned.

int __folio_split(struct folio *folio, unsigned int new_order, struct page *split_at, struct page *lock_at, struct list_head *list, enum split_type split_type)¶: split a folio at split_at to a new_order folio

Parameters

struct folio *folio: folio to split
unsigned int new_order: the order of the new folio
struct page *split_at: a page within the new folio
struct page *lock_at: a page within folio to be left locked to caller
struct list_head *list: after-split folios will be put on it if non NULL
enum split_type split_type: perform uniform split or not (non-uniform split)

Description

It calls __split_unmapped_folio() to perform uniform and non-uniform split. It is in charge of checking whether the split is supported or not and preparing folio for __split_unmapped_folio().

After splitting, the after-split folio containing lock_at remains locked and others are unlocked: 1. for uniform split, lock_at points to one of folio’s subpages; 2. for buddy allocator like (non-uniform) split, lock_at points to folio.

Return

0 - successful, <0 - failed (if -ENOMEM is returned, folio might be split but not to new_order, the caller needs to check)

int folio_split_unmapped(struct folio *folio, unsigned int new_order)¶: split a large anon folio that is already unmapped

Parameters

struct folio *folio: folio to split
unsigned int new_order: the order of folios after split

Description

This function is a helper for splitting folios that have already been unmapped. The use case is that the device or the CPU can refuse to migrate THP pages in the middle of migration, due to allocation issues on either side.

anon_vma_lock is not required to be held, but mmap_read_lock() or mmap_write_lock() should be held. folio is expected to be locked by the caller. Both device-private and non-device-private folios are supported, along with folios that are in the swapcache. folio should also be unmapped and isolated from the LRU (if applicable).

Upon return, the folio is not remapped, split folios are not added to LRU, free_folio_and_swap_cache() is not called, and new folios remain locked.

Return

0 on success, -EAGAIN if the folio cannot be split (e.g., due to insufficient reference count or extra pins).

int folio_split(struct folio *folio, unsigned int new_order, struct page *split_at, struct list_head *list)¶: split a folio at split_at to a new_order folio

Parameters

struct folio *folio: folio to split
unsigned int new_order: the order of the new folio
struct page *split_at: a page within the new folio
struct list_head *list: after-split folios are added to list if not null, otherwise to LRU list

Description

It has the same prerequisites and return values as split_huge_page_to_list_to_order().

Split a folio at split_at to a new_order folio, leaving the remaining subpages of the original folio as large as possible. For example, consider splitting an order-9 folio at its third order-3 subpage to an order-3 folio; there are 2^(9-3)=64 order-3 subpages in the order-9 folio. After the split, there will be a group of folios with different orders, and the new folio containing split_at is marked in braces: [order-4, {order-3}, order-3, order-5, order-6, order-7, order-8].
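The layout in the example above can be reproduced with a short user-space model. This is a sketch for illustration only: model_split_layout() is a hypothetical helper, not a kernel function, and it takes split_at as a page index within the folio rather than a struct page pointer. It repeatedly halves the folio that contains split_at and records the order of the half left behind.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space model of a non-uniform split; not kernel code.
 * Fills orders[] with the after-split folio orders in ascending page
 * order and returns the number of entries written. *at_idx receives the
 * index of the folio that contains page index split_at.
 */
static size_t model_split_layout(unsigned int order, unsigned int new_order,
				 unsigned long split_at,
				 unsigned int *orders, size_t cap,
				 size_t *at_idx)
{
	unsigned long start = 0;	/* first page of the folio still being split */
	unsigned int left[64], right[64];
	size_t nl = 0, nr = 0, n = 0;

	assert(new_order < order);
	assert(order < 64 && order < 8 * sizeof(unsigned long));
	assert(split_at < (1UL << order));
	assert(cap >= (size_t)(order - new_order) + 1);

	for (unsigned int cur = order; cur > new_order; cur--) {
		unsigned long half = 1UL << (cur - 1);

		if (split_at < start + half) {
			/* Target is in the lower half; the upper half is final. */
			right[nr++] = cur - 1;
		} else {
			/* Target is in the upper half; the lower half is final. */
			left[nl++] = cur - 1;
			start += half;
		}
	}

	/* Halves left below the target, lowest address first. */
	for (size_t i = 0; i < nl; i++)
		orders[n++] = left[i];

	/* The folio containing split_at. */
	*at_idx = n;
	orders[n++] = new_order;

	/* Halves left above the target were recorded highest address first. */
	while (nr > 0)
		orders[n++] = right[--nr];

	return n;
}
```

The third order-3 subpage of an order-9 folio starts at page index 16. Calling model_split_layout(9, 3, 16, ...) fills orders[] with 4, 3, 3, 5, 6, 7, 8 and sets *at_idx to 1, which matches the list given in the description.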

After split, folio is left locked for caller.

Return

0 - successful, <0 - failed (if -ENOMEM is returned, folio might be split but not to new_order, the caller needs to check)

unsigned int min_order_for_split(struct folio *folio)¶: get the minimum order folio can be split to

Parameters

struct folio *folio: folio to split

Description

min_order_for_split() tells the minimum order folio can be split to. If a file-backed folio is truncated, 0 will be returned. Any subsequent split attempt should get -EBUSY from split checking code.

Return

folio’s minimum order for split
