Memory Management APIs
User Space Memory Access
-
access_ok
(addr, size)¶ Checks if a user space pointer is valid
Parameters
addr
User space pointer to start of block to check
size
Size of block to check
Context
User context only. This function may sleep if pagefaults are enabled.
Description
Checks if a pointer to a block of memory in user space is valid.
Note that, depending on architecture, this function probably just checks that the pointer is in the user space range - after calling this function, memory access functions may still return -EFAULT.
Return
true (nonzero) if the memory block may be valid, false (zero) if it is definitely invalid.
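The range check described above can be modelled in user space. The following is a minimal sketch, not the kernel's implementation (which is architecture-specific): USER_SPACE_LIMIT and range_in_user_space() are hypothetical names chosen for illustration, and the limit value is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound of the user address range (an assumption for
 * illustration; the real limit depends on the architecture). */
#define USER_SPACE_LIMIT UINT64_C(0x00007fffffffffff)

/* Model of a pure range check: true if [addr, addr + size) lies entirely
 * at or below the limit. It says nothing about whether the memory is
 * actually mapped, which is why a later access may still fault. */
static bool range_in_user_space(uint64_t addr, uint64_t size)
{
    /* Reject oversized blocks first so the subtraction cannot wrap. */
    if (size > USER_SPACE_LIMIT)
        return false;
    return addr <= USER_SPACE_LIMIT - size;
}
```

Writing the comparison as addr <= limit - size rather than addr + size <= limit avoids an integer overflow when addr is close to the top of the address space.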
-
get_user
(x, ptr)¶ Get a simple variable from user space.
Parameters
x
Variable to store result.
ptr
Source address, in user space.
Context
User context only. This function may sleep if pagefaults are enabled.
Description
This macro copies a single simple variable from user space to kernel space. It supports simple types like char and int, but not larger data types like structures or arrays.
ptr must have pointer-to-simple-variable type, and the result of dereferencing ptr must be assignable to x without a cast.
Return
zero on success, or -EFAULT on error. On error, the variable x is set to zero.
-
__get_user
(x, ptr)¶ Get a simple variable from user space, with less checking.
Parameters
x
Variable to store result.
ptr
Source address, in user space.
Context
User context only. This function may sleep if pagefaults are enabled.
Description
This macro copies a single simple variable from user space to kernel space. It supports simple types like char and int, but not larger data types like structures or arrays.
ptr must have pointer-to-simple-variable type, and the result of dereferencing ptr must be assignable to x without a cast.
Caller must check the pointer with access_ok() before calling this function.
Return
zero on success, or -EFAULT on error. On error, the variable x is set to zero.
-
put_user
(x, ptr)¶ Write a simple value into user space.
Parameters
x
Value to copy to user space.
ptr
Destination address, in user space.
Context
User context only. This function may sleep if pagefaults are enabled.
Description
This macro copies a single simple value from kernel space to user space. It supports simple types like char and int, but not larger data types like structures or arrays.
ptr must have pointer-to-simple-variable type, and x must be assignable to the result of dereferencing ptr.
Return
zero on success, or -EFAULT on error.
-
__put_user
(x, ptr)¶ Write a simple value into user space, with less checking.
Parameters
x
Value to copy to user space.
ptr
Destination address, in user space.
Context
User context only. This function may sleep if pagefaults are enabled.
Description
This macro copies a single simple value from kernel space to user space. It supports simple types like char and int, but not larger data types like structures or arrays.
ptr must have pointer-to-simple-variable type, and x must be assignable to the result of dereferencing ptr.
Caller must check the pointer with access_ok() before calling this function.
Return
zero on success, or -EFAULT on error.
-
unsigned long
clear_user
(void __user *to, unsigned long n)¶ Zero a block of memory in user space.
Parameters
void __user *to
Destination address, in user space.
unsigned long n
Number of bytes to zero.
Description
Zero a block of memory in user space.
Return
number of bytes that could not be cleared. On success, this will be zero.
-
unsigned long
__clear_user
(void __user *to, unsigned long n)¶ Zero a block of memory in user space, with less checking.
Parameters
void __user *to
Destination address, in user space.
unsigned long n
Number of bytes to zero.
Description
Zero a block of memory in user space. Caller must check the specified block with access_ok() before calling this function.
Return
number of bytes that could not be cleared. On success, this will be zero.
-
int
get_user_pages_fast
(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages)¶ pin user pages in memory
Parameters
unsigned long start
starting user address
int nr_pages
number of pages from start to pin
unsigned int gup_flags
flags modifying pin behaviour
struct page **pages
array that receives pointers to the pages pinned. Should be at least nr_pages long.
Description
Attempt to pin user pages in memory without taking mm->mmap_lock. If not successful, it will fall back to taking the lock and calling get_user_pages().
Returns number of pages pinned. This may be fewer than the number requested. If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns -errno.
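Callers typically derive nr_pages from the address and length of a user buffer. A minimal user-space sketch of that arithmetic follows; it assumes 4 KiB pages, and SKETCH_PAGE_SHIFT, SKETCH_PAGE_SIZE and pages_spanned() are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry (4 KiB pages); real kernels use PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Number of pages touched by the byte range [start, start + len). */
static uint64_t pages_spanned(uint64_t start, uint64_t len)
{
    /* Offset of the first byte within its page. */
    uint64_t offset = start & (SKETCH_PAGE_SIZE - 1);

    if (len == 0)
        return 0;
    /* Round the end of the range up to the next page boundary. */
    return (offset + len + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

A buffer that is not page-aligned can touch one more page than its length alone suggests, so the pages array should be sized from this count rather than from len / PAGE_SIZE.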
Memory Allocation Controls
-
typedef
gfp_t
Memory allocation flags.
Description
GFP flags are commonly used throughout Linux to indicate how memory should be allocated. The GFP acronym stands for get_free_pages(), the underlying memory allocation function. Not every GFP flag is supported by every function which may allocate memory. Most users will want to use a plain GFP_KERNEL.
-
bool
gfpflags_normal_context
(const gfp_t gfp_flags) is gfp_flags a normal sleepable context?
Parameters
const gfp_t gfp_flags
gfp_flags to test
Description
Test whether gfp_flags indicates that the allocation is from the current context and allowed to sleep.
An allocation being allowed to block doesn’t mean it owns the current context. When the direct reclaim path tries to allocate memory, the allocation context is nested inside whatever current was doing at the time of the original allocation. The nested allocation may be allowed to block, but modifying anything current owns can corrupt the outer context’s expectations.
A true result from this function indicates that the allocation context can sleep and use anything that’s associated with current.
Page mobility and placement hints
These flags provide hints about how mobile the page is. Pages with similar mobility are placed within the same pageblocks to minimise problems due to external fragmentation.
__GFP_MOVABLE
(also a zone modifier) indicates that the page can be
moved by page migration during memory compaction or can be reclaimed.
__GFP_RECLAIMABLE
is used for slab allocations that specify
SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.
__GFP_WRITE
indicates the caller intends to dirty the page. Where possible,
these pages will be spread between local zones to avoid all the dirty
pages being in one zone (fair zone allocation policy).
__GFP_HARDWALL
enforces the cpuset memory allocation policy.
__GFP_THISNODE
forces the allocation to be satisfied from the requested
node with no fallbacks or placement policy enforcements.
__GFP_ACCOUNT
causes the allocation to be accounted to kmemcg.
Watermark modifiers – controls access to emergency reserves
__GFP_HIGH
indicates that the caller is high-priority and that granting
the request is necessary before the system can make forward progress.
For example, creating an IO context to clean pages.
__GFP_ATOMIC
indicates that the caller cannot reclaim or sleep and is
high priority. Users are typically interrupt handlers. This may be
used in conjunction with __GFP_HIGH.
__GFP_MEMALLOC
allows access to all memory. This should only be used when
the caller guarantees the allocation will allow more memory to be freed
very shortly e.g. process exiting or swapping. Users either should
be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
Users of this flag have to be extremely careful to not deplete the reserve
completely and implement a throttling mechanism which controls the
consumption of the reserve based on the amount of freed memory.
Usage of a pre-allocated pool (e.g. mempool) should be always considered
before using this flag.
__GFP_NOMEMALLOC
is used to explicitly forbid access to emergency reserves.
This takes precedence over the __GFP_MEMALLOC
flag if both are set.
Reclaim modifiers
Please note that all the following flags are only applicable to sleepable allocations (e.g. GFP_NOWAIT and GFP_ATOMIC will ignore them).
__GFP_IO
can start physical IO.
__GFP_FS
can call down to the low-level FS. Clearing the flag avoids the
allocator recursing into the filesystem which might already be holding
locks.
__GFP_DIRECT_RECLAIM
indicates that the caller may enter direct reclaim.
This flag can be cleared to avoid unnecessary delays when a fallback
option is available.
__GFP_KSWAPD_RECLAIM
indicates that the caller wants to wake kswapd when
the low watermark is reached and have it reclaim pages until the high
watermark is reached. A caller may wish to clear this flag when fallback
options are available and the reclaim is likely to disrupt the system. The
canonical example is THP allocation where a fallback is cheap but
reclaim/compaction may cause indirect stalls.
__GFP_RECLAIM
is shorthand to allow/forbid both direct and kswapd reclaim.
The default allocator behavior depends on the request size. We have a concept of so-called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER). Non-costly allocations are too essential to fail, so they are implicitly non-failing by default (with some exceptions, e.g. OOM victims might fail, so the caller still has to check for failures), while costly requests try not to be disruptive and back off even without invoking the OOM killer.
The following three modifiers might be used to override some of these implicit rules:
__GFP_NORETRY
: The VM implementation will try only very lightweight
memory direct reclaim to get some memory under memory pressure (thus
it can sleep). It will avoid disruptive actions like OOM killer. The
caller must handle the failure which is quite likely to happen under
heavy memory pressure. The flag is suitable when failure can easily be
handled at small cost, such as reduced throughput.
__GFP_RETRY_MAYFAIL
: The VM implementation will retry memory reclaim
procedures that have previously failed if there is some indication
that progress has been made elsewhere. It can wait for other
tasks to attempt high level approaches to freeing memory such as
compaction (which removes fragmentation) and page-out.
There is still a definite limit to the number of retries, but it is
a larger limit than with __GFP_NORETRY.
Allocations with this flag may fail, but only when there is
genuinely little unused memory. While these allocations do not
directly trigger the OOM killer, their failure indicates that
the system is likely to need to use the OOM killer soon. The
caller must handle failure, but can reasonably do so by failing
a higher-level request, or completing it only in a much less
efficient manner.
If the allocation does fail, and the caller is in a position to
free some non-essential memory, doing so could benefit the system
as a whole.
__GFP_NOFAIL
: The VM implementation _must_ retry infinitely: the caller
cannot handle allocation failures. The allocation could block
indefinitely but will never return with failure. Testing for
failure is pointless.
New users should be evaluated carefully (and the flag should be
used only when there is no reasonable failure policy) but it is
definitely preferable to use the flag rather than open-code an endless
loop around the allocator.
Using this flag for costly allocations is _highly_ discouraged.
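The advice to handle a failed allocation by completing the request in a less efficient manner can be sketched in user space. This is a minimal model under stated assumptions, not kernel code: alloc_best_effort() and capped_alloc() are hypothetical names, and capped_alloc() merely stands in for an allocation that gives up under memory pressure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocator that fails for large requests,
 * imitating an allocation that may fail instead of retrying forever. */
static void *capped_alloc(size_t size)
{
    return size > 1024 ? NULL : malloc(size);
}

/* Try the preferred size first, then halve the request until it succeeds
 * or the smallest useful size has also failed. The size actually obtained
 * is stored in *got. */
static void *alloc_best_effort(size_t preferred, size_t minimum, size_t *got,
                               void *(*try_alloc)(size_t))
{
    size_t size = preferred;

    assert(got != NULL && try_alloc != NULL);
    for (;;) {
        void *p = try_alloc(size);

        if (p) {
            *got = size;
            return p;
        }
        if (size <= minimum)
            return NULL;    /* even the minimum failed: report failure */
        size /= 2;
        if (size < minimum)
            size = minimum;
    }
}
```

The caller still has to cope with a NULL result, which matches the requirement that users of failable allocations handle failure themselves.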
Useful GFP flag combinations
Useful GFP flag combinations that are commonly used. It is recommended
that subsystems start with one of these combinations and then set/clear
__GFP_FOO
flags as necessary.
GFP_ATOMIC
users cannot sleep and need the allocation to succeed. A lower
watermark is applied to allow access to “atomic reserves”.
The current implementation doesn’t support NMI and a few other strict
non-preemptive contexts (e.g. raw_spin_lock). The same applies to GFP_NOWAIT.
GFP_KERNEL
is typical for kernel-internal allocations. The caller requires ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
GFP_KERNEL_ACCOUNT
is the same as GFP_KERNEL, except the allocation is
accounted to kmemcg.
GFP_NOWAIT
is for kernel allocations that should not stall for direct
reclaim, start physical IO or use any filesystem callback.
GFP_NOIO
will use direct reclaim to discard clean pages or slab pages
that do not require the starting of any physical IO.
Please try to avoid using this flag directly and instead use
memalloc_noio_{save,restore} to mark the whole scope which cannot
perform any IO with a short explanation why. All allocation requests
will inherit GFP_NOIO implicitly.
GFP_NOFS
will use direct reclaim but will not use any filesystem interfaces.
Please try to avoid using this flag directly and instead use
memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn’t
recurse into the FS layer with a short explanation why. All allocation
requests will inherit GFP_NOFS implicitly.
GFP_USER
is for userspace allocations that also need to be directly
accessible by the kernel or hardware. It is typically used by hardware
for buffers that are mapped to userspace (e.g. graphics) that hardware
still must DMA to. cpuset limits are enforced for these allocations.
GFP_DMA
exists for historical reasons and should be avoided where possible.
The flag indicates that the caller requires that the lowest zone be used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but it would require careful auditing as some users really require it and others use the flag to avoid lowmem reserves in ZONE_DMA and treat the lowest zone as a type of emergency reserve.
GFP_DMA32
is similar to GFP_DMA except that the caller requires a 32-bit address.
GFP_HIGHUSER
is for userspace allocations that may be mapped to userspace,
do not need to be directly accessible by the kernel but that cannot
move once in use. An example may be a hardware allocation that maps
data directly into userspace but has no addressing limitations.
GFP_HIGHUSER_MOVABLE
is for userspace allocations that the kernel does not
need direct access to but can use kmap() when access is required. They
are expected to be movable via page reclaim or page migration. Typically,
pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
GFP_TRANSHUGE
and GFP_TRANSHUGE_LIGHT
are used for THP allocations. They
are compound allocations that will generally fail quickly if memory is not
available and will not wake kswapd/kcompactd on failure. The _LIGHT
version does not attempt reclaim/compaction at all and is by default used
in page fault path, while the non-light is used by khugepaged.
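The way the combinations above are built from individual modifiers, and then adjusted by setting or clearing single flags, can be illustrated with a small bitmask model. The bit positions below are hypothetical (the real values live in include/linux/gfp.h and change between kernel versions), and the SK_ prefix and may_block() are names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int sketch_gfp_t;

/* Hypothetical bit positions, for illustration only. */
#define SK_GFP_IO              (1u << 0)
#define SK_GFP_FS              (1u << 1)
#define SK_GFP_DIRECT_RECLAIM  (1u << 2)
#define SK_GFP_KSWAPD_RECLAIM  (1u << 3)

/* Shorthand for both kinds of reclaim, as described for __GFP_RECLAIM. */
#define SK_GFP_RECLAIM (SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)

/* Combinations modelled on the descriptions above. */
#define SK_GFP_KERNEL  (SK_GFP_RECLAIM | SK_GFP_IO | SK_GFP_FS)
#define SK_GFP_NOFS    (SK_GFP_RECLAIM | SK_GFP_IO)
#define SK_GFP_NOIO    (SK_GFP_RECLAIM)
#define SK_GFP_NOWAIT  (SK_GFP_KSWAPD_RECLAIM)

/* Clearing __GFP_FS from the GFP_KERNEL model yields the GFP_NOFS model. */
static_assert((SK_GFP_KERNEL & ~SK_GFP_FS) == SK_GFP_NOFS,
              "NOFS is KERNEL without filesystem recursion");

/* An allocation may sleep only if it is allowed to enter direct reclaim. */
static bool may_block(sketch_gfp_t flags)
{
    return (flags & SK_GFP_DIRECT_RECLAIM) != 0;
}
```

In this model, starting from the GFP_KERNEL equivalent and masking out individual bits reproduces the more restrictive combinations, which is the workflow the text recommends for subsystems.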
The Slab Cache
-
void *
kmalloc
(size_t size, gfp_t flags) allocate memory
Parameters
size_t size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate.
Description
kmalloc is the normal method of allocating memory for objects smaller than page size in the kernel.
The allocated object address is aligned to at least ARCH_KMALLOC_MINALIGN bytes. For size of power of two bytes, the alignment is also guaranteed to be at least to the size.
The flags argument may be one of the GFP flags defined at include/linux/gfp.h and described at Memory Management APIs.
The recommended usage of the flags is described at Memory Allocation Guide.
Below is a brief outline of the most useful GFP flags:
GFP_KERNEL
Allocate normal kernel RAM. May sleep.
GFP_NOWAIT
Allocation will not sleep.
GFP_ATOMIC
Allocation will not sleep. May use emergency pools.
GFP_HIGHUSER
Allocate memory from high memory on behalf of user.
Also it is possible to set different flags by OR’ing in one or more of the following additional flags:
__GFP_HIGH
This allocation has high priority and may use emergency pools.
__GFP_NOFAIL
Indicate that this allocation is in no way allowed to fail (think twice before using).
__GFP_NORETRY
If memory is not immediately available, then give up at once.
__GFP_NOWARN
If allocation fails, don’t issue any warnings.
__GFP_RETRY_MAYFAIL
Try really hard to succeed the allocation but fail eventually.
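The alignment guarantee stated in the kmalloc description (at least ARCH_KMALLOC_MINALIGN, and at least the size itself for power-of-two sizes) can be expressed as a small helper. This is only a model of the documented rule: SKETCH_MINALIGN is an assumed value, and guaranteed_alignment() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed minimum alignment; the real ARCH_KMALLOC_MINALIGN is
 * architecture-dependent. */
#define SKETCH_MINALIGN ((size_t)8)

static bool is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Alignment a caller may rely on for an allocation of the given size,
 * according to the documented rule. */
static size_t guaranteed_alignment(size_t size)
{
    if (is_power_of_two(size) && size > SKETCH_MINALIGN)
        return size;
    return SKETCH_MINALIGN;
}
```

For example, under these assumptions a 64-byte request is aligned to 64 bytes, while a 96-byte request is only promised the minimum alignment.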
-
void *
kmalloc_array
(size_t n, size_t size, gfp_t flags) allocate memory for an array.
Parameters
size_t n
number of elements.
size_t size
element size.
gfp_t flags
the type of memory to allocate (see kmalloc).
-
void *
krealloc_array
(void *p, size_t new_n, size_t new_size, gfp_t flags)¶ reallocate memory for an array.
Parameters
void *p
pointer to the memory chunk to reallocate
size_t new_n
new number of elements to alloc
size_t new_size
new size of a single member of the array
gfp_t flags
the type of memory to allocate (see kmalloc)
-
void *
kcalloc
(size_t n, size_t size, gfp_t flags)¶ allocate memory for an array. The memory is set to zero.
Parameters
size_t n
number of elements.
size_t size
element size.
gfp_t flags
the type of memory to allocate (see kmalloc).
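Taking n and size as separate arguments lets an array allocator check the multiplication for overflow before allocating. A minimal user-space sketch of that idea follows; alloc_array_zeroed() is a hypothetical name, and malloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate a zeroed array of n elements of the given size, returning NULL
 * if n * size would overflow size_t. */
static void *alloc_array_zeroed(size_t n, size_t size)
{
    size_t bytes;
    void *p;

    /* n * size overflows exactly when n > SIZE_MAX / size. */
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;
    bytes = n * size;
    p = malloc(bytes);
    if (p)
        memset(p, 0, bytes);
    return p;
}
```

Computing n * size by hand and passing the product to a plain allocator would silently wrap for large n and return a buffer that is too small, which is why the array variants are preferred for element counts that come from untrusted input.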
-
void *
kzalloc
(size_t size, gfp_t flags) allocate memory. The memory is set to zero.
Parameters
size_t size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate (see kmalloc).
-
void *
kzalloc_node
(size_t size, gfp_t flags, int node)¶ allocate zeroed memory from a particular memory node.
Parameters
size_t size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate (see kmalloc).
int node
memory node from which to allocate
-
void *
kmem_cache_alloc
(struct kmem_cache *cachep, gfp_t flags) Allocate an object
Parameters
struct kmem_cache *cachep
The cache to allocate from.
gfp_t flags
See kmalloc().
Description
Allocate an object from this cache. The flags are only relevant if the cache has no available objects.
Return
pointer to the new object or NULL
in case of error
-
void *
kmem_cache_alloc_node
(struct kmem_cache *cachep, gfp_t flags, int nodeid)¶ Allocate an object on the specified node
Parameters
struct kmem_cache *cachep
The cache to allocate from.
gfp_t flags
See kmalloc().
int nodeid
node number of the target node.
Description
Identical to kmem_cache_alloc but it will allocate memory on the given node, which can improve the performance for cpu bound structures.
Fallback to other node is possible if __GFP_THISNODE is not set.
Return
pointer to the new object or NULL
in case of error
-
void
kmem_cache_free
(struct kmem_cache *cachep, void *objp)¶ Deallocate an object
Parameters
struct kmem_cache *cachep
The cache the allocation was from.
void *objp
The previously allocated object.
Description
Free an object which was previously allocated from this cache.
-
void
kfree
(const void *objp)¶ free previously allocated memory
Parameters
const void *objp
pointer returned by kmalloc.
Description
If objp is NULL, no operation is performed.
Don’t free memory not originally allocated by kmalloc()
or you will run into trouble.
-
size_t
__ksize
(const void *objp)¶ Uninstrumented ksize.
Parameters
const void *objp
pointer to the object
Description
Unlike ksize(), __ksize() is uninstrumented, and does not provide the same safety checks as ksize() with KASAN instrumentation enabled.
Return
size of the actual memory used by objp in bytes
-
struct kmem_cache *
kmem_cache_create_usercopy
(const char *name, unsigned int size, unsigned int align, slab_flags_t flags, unsigned int useroffset, unsigned int usersize, void (*ctor)(void *))¶ Create a cache with a region suitable for copying to userspace
Parameters
const char *name
A string which is used in /proc/slabinfo to identify this cache.
unsigned int size
The size of objects to be created in this cache.
unsigned int align
The required alignment for the objects.
slab_flags_t flags
SLAB flags
unsigned int useroffset
Usercopy region offset
unsigned int usersize
Usercopy region size
void (*ctor)(void *)
A constructor for the objects.
Description
Cannot be called within an interrupt, but can be interrupted. The ctor is run when new pages are allocated by the cache.
The flags are
SLAB_POISON
- Poison the slab with a known test pattern (a5a5a5a5)
to catch references to uninitialised memory.
SLAB_RED_ZONE
- Insert Red zones around the allocated memory to check
for buffer overruns.
SLAB_HWCACHE_ALIGN
- Align the objects in this cache to a hardware
cacheline. This can be beneficial if you’re counting cycles as closely
as davem.
Return
a pointer to the cache on success, NULL on failure.
-
struct kmem_cache *
kmem_cache_create
(const char *name, unsigned int size, unsigned int align, slab_flags_t flags, void (*ctor)(void *))¶ Create a cache.
Parameters
const char *name
A string which is used in /proc/slabinfo to identify this cache.
unsigned int size
The size of objects to be created in this cache.
unsigned int align
The required alignment for the objects.
slab_flags_t flags
SLAB flags
void (*ctor)(void *)
A constructor for the objects.
Description
Cannot be called within an interrupt, but can be interrupted. The ctor is run when new pages are allocated by the cache.
The flags are
SLAB_POISON
- Poison the slab with a known test pattern (a5a5a5a5)
to catch references to uninitialised memory.
SLAB_RED_ZONE
- Insert Red zones around the allocated memory to check
for buffer overruns.
SLAB_HWCACHE_ALIGN
- Align the objects in this cache to a hardware
cacheline. This can be beneficial if you’re counting cycles as closely
as davem.
Return
a pointer to the cache on success, NULL on failure.
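The statement that the ctor runs when new pages are allocated by the cache, rather than on every object allocation, can be illustrated with a toy object cache in user space. Everything here is hypothetical (the toy_ names, the fixed slab size, the counting constructor); it models the constructor timing only and is not how the slab allocator is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_SLAB_OBJS 4     /* objects per "slab" (assumption) */
#define TOY_MAX_FREE  64    /* capacity of the free-object stack */

struct toy_cache {
    size_t obj_size;
    void (*ctor)(void *);
    void *free_objs[TOY_MAX_FREE];  /* stack of currently free objects */
    size_t nfree;
};

/* Example constructor that counts how often it is invoked. */
static int toy_ctor_calls;
static void toy_counting_ctor(void *obj)
{
    *(int *)obj = 0;
    toy_ctor_calls++;
}

static void toy_cache_init(struct toy_cache *c, size_t obj_size,
                           void (*ctor)(void *))
{
    c->obj_size = obj_size;
    c->ctor = ctor;
    c->nfree = 0;
}

/* Add one slab of objects; the ctor runs once per object, here only.
 * Slabs are never returned to the system in this toy model. */
static int toy_cache_grow(struct toy_cache *c)
{
    char *slab;
    size_t i;

    if (c->nfree + TOY_SLAB_OBJS > TOY_MAX_FREE)
        return -1;
    slab = malloc(c->obj_size * TOY_SLAB_OBJS);
    if (!slab)
        return -1;
    for (i = 0; i < TOY_SLAB_OBJS; i++) {
        void *obj = slab + i * c->obj_size;

        if (c->ctor)
            c->ctor(obj);
        c->free_objs[c->nfree++] = obj;
    }
    return 0;
}

/* Hand out a free object, growing the cache only when it is empty. */
static void *toy_cache_alloc(struct toy_cache *c)
{
    if (c->nfree == 0 && toy_cache_grow(c) != 0)
        return NULL;
    return c->free_objs[--c->nfree];
}

/* Return an object to the cache; the ctor is not run again. */
static void toy_cache_free(struct toy_cache *c, void *obj)
{
    assert(c->nfree < TOY_MAX_FREE);
    c->free_objs[c->nfree++] = obj;
}
```

One consequence of this timing is that an object handed back to the cache is expected to be in its constructed state again, because a later allocation may return it without the constructor being called.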
-
int
kmem_cache_shrink
(struct kmem_cache *cachep)¶ Shrink a cache.
Parameters
struct kmem_cache *cachep
The cache to shrink.
Description
Releases as many slabs as possible for a cache. To help debugging, a zero exit status indicates all slabs were released.
Return
0
if all slabs were released, non-zero otherwise
-
bool
kmem_valid_obj
(void *object)¶ does the pointer reference a valid slab object?
Parameters
void *object
pointer to query.
Return
true if the pointer is to a not-yet-freed object from kmalloc() or kmem_cache_alloc(), either true or false if the pointer is to an already-freed object, and false otherwise.
-
void
kmem_dump_obj
(void *object)¶ Print available slab provenance information
Parameters
void *object
slab object for which to find provenance information.
Description
This function uses pr_cont(), so the caller is expected to have
printed out whatever preamble is appropriate. The provenance information
depends on the type of object and on how much debugging is enabled.
For a slab-cache object, the fact that it is a slab object is printed,
and, if available, the slab name, return address, and stack trace from
the allocation and last free path of that object.
This function will splat if passed a pointer to a non-slab object. If you are not sure what type of object you have, you should instead use mem_dump_obj().
-
void *
krealloc
(const void *p, size_t new_size, gfp_t flags)¶ reallocate memory. The contents will remain unchanged.
Parameters
const void *p
object to reallocate memory for.
size_t new_size
how many bytes of memory are required.
gfp_t flags
the type of memory to allocate.
Description
The contents of the object pointed to are preserved up to the lesser of the new and old sizes (__GFP_ZERO flag is effectively ignored). If p is NULL, krealloc() behaves exactly like kmalloc(). If new_size is 0 and p is not a NULL pointer, the object pointed to is freed.
Return
pointer to the allocated memory or NULL
in case of error
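The three cases described for krealloc() can be modelled in user space with the C library allocator. This is a sketch of the documented semantics only; resize_buffer() is a hypothetical name, and malloc(), realloc() and free() stand in for the kernel functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Model of the documented cases:
 *  - p == NULL:            behaves like a fresh allocation
 *  - new_size == 0, p set: the object is freed
 *  - otherwise:            resize, keeping contents up to min(old, new) */
static void *resize_buffer(void *p, size_t new_size)
{
    if (p == NULL)
        return malloc(new_size);
    if (new_size == 0) {
        free(p);
        return NULL;
    }
    return realloc(p, new_size);
}
```

As with realloc(), a caller should assign the result to a temporary pointer first, so that the original buffer is not lost if the resize fails.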
-
void
kfree_sensitive
(const void *p)¶ Clear sensitive information in memory before freeing
Parameters
const void *p
object to free memory of
Description
The memory of the object p points to is zeroed before being freed. If p is NULL, kfree_sensitive() does nothing.
Note
this function zeroes the whole allocated buffer, which can be a good deal bigger than the requested buffer size passed to kmalloc(). So be careful when using this function in performance-sensitive code.
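A user-space sketch of the zero-then-free pattern follows. It is a model under stated assumptions: secure_wipe() and wipe_and_free() are hypothetical names, and the caller passes the size explicitly, whereas the kernel function wipes the whole allocated buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Overwrite a buffer with zeroes. The volatile-qualified pointer keeps the
 * compiler from discarding the stores as dead writes before a free. */
static void secure_wipe(void *p, size_t size)
{
    volatile unsigned char *bytes = p;
    size_t i;

    for (i = 0; i < size; i++)
        bytes[i] = 0;
}

/* Wipe a heap buffer and release it; a NULL pointer is a no-op. */
static void wipe_and_free(void *p, size_t size)
{
    if (p == NULL)
        return;
    secure_wipe(p, size);
    free(p);
}
```

A plain memset() immediately before free() may be optimised away, which is the reason a dedicated helper is used for key material and similar data.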
-
size_t
ksize
(const void *objp)¶ get the actual amount of memory allocated for a given object
Parameters
const void *objp
Pointer to the object
Description
kmalloc may internally round up allocations and return more memory than requested. ksize() can be used to determine the actual amount of memory allocated. The caller may use this additional memory, even though a smaller amount of memory was initially specified with the kmalloc call.
The caller must guarantee that objp points to a valid object previously allocated with either kmalloc() or kmem_cache_alloc(). The object must not be freed during the duration of the call.
Return
size of the actual memory used by objp in bytes
-
void
kfree_const
(const void *x)¶ conditionally free memory
Parameters
const void *x
pointer to the memory
Description
Function calls kfree only if x is not in .rodata section.
-
void *
kvmalloc_node
(size_t size, gfp_t flags, int node)¶ attempt to allocate physically contiguous memory, but upon failure, fall back to non-contiguous (vmalloc) allocation.
Parameters
size_t size
size of the request.
gfp_t flags
gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
int node
numa node to allocate from
Description
Uses kmalloc to get the memory but if the allocation fails then falls back to the vmalloc allocator. Use kvfree for freeing the memory.
Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
Please note that for any gfp flags other than GFP_KERNEL, the function takes care not to fall back to vmalloc.
Return
pointer to the allocated memory or NULL in case of failure
-
void
kvfree
(const void *addr)¶ Free memory.
Parameters
const void *addr
Pointer to allocated memory.
Description
kvfree frees memory allocated by any of vmalloc(), kmalloc() or kvmalloc(). It is slightly more efficient to use kfree() or vfree() if you are certain that you know which one to use.
Context
Either preemptible task context or not-NMI interrupt.
Virtually Contiguous Mappings
-
void
vm_unmap_aliases
(void)¶ unmap outstanding lazy aliases in the vmap layer
Parameters
void
no arguments
Description
The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily to amortize TLB flushing overheads. What this means is that any page you have now, may, in a former life, have been mapped into kernel virtual address by the vmap layer and so there might be some CPUs with TLB entries still referencing that page (additional to the regular 1:1 kernel mapping).
vm_unmap_aliases flushes all such lazy mappings. After it returns, we can be sure that none of the pages we have control over will have any aliases from the vmap layer.
-
void
vm_unmap_ram
(const void *mem, unsigned int count)¶ unmap linear kernel address space set up by vm_map_ram
Parameters
const void *mem
the pointer returned by vm_map_ram
unsigned int count
the count passed to that vm_map_ram call (cannot unmap partial)
-
void *
vm_map_ram
(struct page **pages, unsigned int count, int node)¶ map pages linearly into kernel virtual address (vmalloc space)
Parameters
struct page **pages
an array of pointers to the pages to be mapped
unsigned int count
number of pages
int node
prefer to allocate data structures on this node
Description
If you use this function for fewer than VMAP_MAX_ALLOC pages, it could be faster than vmap, so it’s good. But if you mix long-life and short-life objects with vm_map_ram(), it could consume lots of address space through fragmentation (especially on a 32bit machine). You could see failures in the end. Please use this function for short-lived objects.
Return
a pointer to the address that has been mapped, or NULL
on failure
-
void
vfree
(const void *addr) Release memory allocated by vmalloc()
Parameters
const void *addr
Memory base address
Description
Free the virtually contiguous memory area starting at addr, as obtained from one of the vmalloc() family of APIs. This will usually also free the physical memory underlying the virtual allocation, but that memory is reference counted, so it will not be freed until the last user goes away.
If addr is NULL, no operation is performed.
Context
May sleep if not called from interrupt context. Must not be called in NMI context (strictly speaking, it could be if we have CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG, but making the calling conventions for vfree() arch-dependent would be a really bad idea).
-
void
vunmap
(const void *addr) release virtual mapping obtained by vmap()
Parameters
const void *addr
memory base address
Description
Free the virtually contiguous memory area starting at addr, which was created from the page array passed to vmap().
Must not be called in interrupt context.
-
void *
vmap
(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot)¶ map an array of pages into virtually contiguous space
Parameters
struct page **pages
array of page pointers
unsigned int count
number of pages to map
unsigned long flags
vm_area->flags
pgprot_t prot
page protection for the mapping
Description
Maps count pages from pages into contiguous kernel virtual space.
If flags contains VM_MAP_PUT_PAGES, the ownership of the pages array itself (which must be kmalloc or vmalloc memory) and one reference per page in it are transferred from the caller to vmap(), and will be freed / dropped when vfree() is called on the return value.
Return
the address of the area or NULL
on failure
-
void *
vmap_pfn
(unsigned long *pfns, unsigned int count, pgprot_t prot)¶ map an array of PFNs into virtually contiguous space
Parameters
unsigned long *pfns
array of PFNs
unsigned int count
number of pages to map
pgprot_t prot
page protection for the mapping
Description
Maps count PFNs from pfns into contiguous kernel virtual space and returns the start address of the mapping.
-
void *
__vmalloc_node
(unsigned long size, unsigned long align, gfp_t gfp_mask, int node, const void *caller)¶ allocate virtually contiguous memory
Parameters
unsigned long size
allocation size
unsigned long align
desired alignment
gfp_t gfp_mask
flags for the page level allocator
int node
node to use for allocation or NUMA_NO_NODE
const void *caller
caller’s return address
Description
Allocate enough pages to cover size from the page level allocator with gfp_mask flags. Map them into contiguous kernel virtual space.
Reclaim modifiers in gfp_mask - __GFP_NORETRY, __GFP_RETRY_MAYFAIL and __GFP_NOFAIL are not supported.
Any use of gfp flags outside of GFP_KERNEL should be discussed with mm people.
Return
pointer to the allocated memory or NULL
on error
-
void *
vmalloc
(unsigned long size)¶ allocate virtually contiguous memory
Parameters
unsigned long size
allocation size
Description
Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space.
For tight control over page level allocator and protection flags use __vmalloc() instead.
Return
pointer to the allocated memory or NULL
on error
-
void *
vmalloc_no_huge
(unsigned long size)¶ allocate virtually contiguous memory using small pages
Parameters
unsigned long size
allocation size
Description
Allocate enough non-huge pages to cover size from the page level allocator and map them into contiguous kernel virtual space.
Return
pointer to the allocated memory or NULL
on error
-
void *
vzalloc
(unsigned long size)¶ allocate virtually contiguous memory with zero fill
Parameters
unsigned long size
allocation size
Description
Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. The memory allocated is set to zero.
For tight control over page level allocator and protection flags use __vmalloc() instead.
Return
pointer to the allocated memory or NULL
on error
-
void *
vmalloc_user
(unsigned long size)¶ allocate zeroed virtually contiguous memory for userspace
Parameters
unsigned long size
allocation size
Description
The resulting memory area is zeroed so it can be mapped to userspace without leaking data.
Return
pointer to the allocated memory or NULL
on error
-
void *
vmalloc_node
(unsigned long size, int node)¶ allocate memory on a specific node
Parameters
unsigned long size
allocation size
int node
numa node
Description
Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space.
For tight control over page level allocator and protection flags use __vmalloc() instead.
Return
pointer to the allocated memory or NULL
on error
-
void *
vzalloc_node
(unsigned long size, int node)¶ allocate memory on a specific node with zero fill
Parameters
unsigned long size
allocation size
int node
numa node
Description
Allocate enough pages to cover size from the page level allocator and map them into contiguous kernel virtual space. The memory allocated is set to zero.
Return
pointer to the allocated memory or NULL
on error
-
void *
vmalloc_32
(unsigned long size)¶ allocate virtually contiguous memory (32bit addressable)
Parameters
unsigned long size
allocation size
Description
Allocate enough 32bit PA addressable pages to cover size from the page level allocator and map them into contiguous kernel virtual space.
Return
pointer to the allocated memory or NULL on error
-
void *
vmalloc_32_user
(unsigned long size)¶ allocate zeroed virtually contiguous 32bit memory
Parameters
unsigned long size
allocation size
Description
The resulting memory area is 32bit addressable and zeroed so it can be mapped to userspace without leaking data.
Return
pointer to the allocated memory or NULL on error
-
int
remap_vmalloc_range
(struct vm_area_struct *vma, void *addr, unsigned long pgoff)¶ map vmalloc pages to userspace
Parameters
struct vm_area_struct *vma
vma to cover (map full range of vma)
void *addr
vmalloc memory
unsigned long pgoff
number of pages into addr before first page to map
Return
0 for success, -Exxx on failure
Description
This function checks that addr is a valid vmalloc’ed area and that it is big enough to cover the vma. It returns failure if these criteria are not met.
Similar to remap_pfn_range() (see mm/memory.c)
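The "big enough to cover the vma" condition can be pictured with a small userspace model. This is only a sketch under stated assumptions: `MODEL_PAGE_SHIFT` (4 KiB pages), `model_range_covers_vma()` and its parameters are hypothetical names, and the kernel performs further validation of addr that is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * Hypothetical model: starting pgoff pages into an area of area_size
 * bytes, are there still at least vma_size bytes left to map?
 */
bool model_range_covers_vma(uint64_t area_size, uint64_t pgoff,
			    uint64_t vma_size)
{
	uint64_t off;

	/* Reject page offsets whose byte value would overflow 64 bits. */
	if (pgoff > (UINT64_MAX >> MODEL_PAGE_SHIFT))
		return false;
	off = pgoff << MODEL_PAGE_SHIFT;

	/* The starting offset must lie inside the area. */
	if (off >= area_size)
		return false;

	/* The remainder of the area must cover the whole vma. */
	return vma_size <= area_size - off;
}
```

With a four-page area, a three-page vma fits at pgoff 1 but not at pgoff 2.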
File Mapping and Page Cache¶
-
int
read_cache_pages
(struct address_space *mapping, struct list_head *pages, int (*filler)(void *, struct page *), void *data)¶ populate an address space with some pages & start reads against them
Parameters
struct address_space *mapping
the address_space
struct list_head *pages
The address of a list_head which contains the target pages. These pages have their ->index populated and are otherwise uninitialised.
int (*filler)(void *, struct page *)
callback routine for filling a single page.
void *data
private data for the callback routine.
Description
Hides the details of the LRU cache etc from the filesystems.
Return
0 on success, error returned by filler otherwise
-
void
page_cache_ra_unbounded
(struct readahead_control *ractl, unsigned long nr_to_read, unsigned long lookahead_size)¶ Start unchecked readahead.
Parameters
struct readahead_control *ractl
Readahead control.
unsigned long nr_to_read
The number of pages to read.
unsigned long lookahead_size
Where to start the next readahead.
Description
This function is for filesystems to call when they want to start readahead beyond a file’s stated i_size. This is almost certainly not the function you want to call. Use page_cache_async_readahead() or page_cache_sync_readahead() instead.
Context
File is referenced by caller. Mutexes may be held by caller. May sleep, but will not reenter filesystem to reclaim memory.
-
void
readahead_expand
(struct readahead_control *ractl, loff_t new_start, size_t new_len)¶ Expand a readahead request
Parameters
struct readahead_control *ractl
The request to be expanded
loff_t new_start
The revised start
size_t new_len
The revised size of the request
Description
Attempt to expand a readahead request outwards from the current size to the specified size by inserting locked pages before and after the current window to increase the size to the new window. This may involve the insertion of THPs, in which case the window may get expanded even beyond what was requested.
The algorithm will stop if it encounters a conflicting page already in the pagecache and leave a smaller expansion than requested.
The caller must check for this by examining the revised ractl object for a different expansion than was requested.
-
void
delete_from_page_cache
(struct page *page)¶ delete page from page cache
Parameters
struct page *page
the page which the kernel is trying to remove from page cache
Description
This must be called only on pages that have been verified to be in the page cache and locked. It will never put the page into the free list, because the caller has a reference on the page.
-
int
filemap_fdatawrite_wbc
(struct address_space *mapping, struct writeback_control *wbc)¶ start writeback on mapping dirty pages in range
Parameters
struct address_space *mapping
address space structure to write
struct writeback_control *wbc
the writeback_control controlling the writeout
Description
Call writepages on the mapping using the provided wbc to control the writeout.
Return
0 on success, negative error code otherwise.
-
int
filemap_flush
(struct address_space *mapping)¶ mostly a non-blocking flush
Parameters
struct address_space *mapping
target address_space
Description
This is a mostly non-blocking flush. Not suitable for data-integrity purposes - I/O may not be started against all dirty pages.
Return
0 on success, negative error code otherwise.
-
bool
filemap_range_has_page
(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶ check if a page exists in range.
Parameters
struct address_space *mapping
address space within which to check
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)
Description
Find at least one page in the range supplied, usually used to check if direct writing in this range will trigger a writeback.
Return
true if at least one page exists in the specified range, false otherwise.
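Because both byte offsets are inclusive, the page indices covered by a range follow from shifting each offset on its own. The sketch below is a userspace illustration only; `MODEL_PAGE_SHIFT` (4 KiB pages), `struct model_page_range` and `model_bytes_to_pages()` are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Inclusive range of page indices. */
struct model_page_range {
	uint64_t first;
	uint64_t last;
};

/*
 * Convert an inclusive byte range into an inclusive page-index range.
 * Returns false for an empty or inverted range.
 */
bool model_bytes_to_pages(int64_t start_byte, int64_t end_byte,
			  struct model_page_range *out)
{
	if (start_byte < 0 || end_byte < start_byte)
		return false;

	out->first = (uint64_t)start_byte >> MODEL_PAGE_SHIFT;
	out->last = (uint64_t)end_byte >> MODEL_PAGE_SHIFT;
	return true;
}
```

Note that bytes 4095 and 4096 sit in different pages, so a two-byte range can span two page indices.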
-
int
filemap_fdatawait_range
(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶ wait for writeback to complete
Parameters
struct address_space *mapping
address space structure to wait for
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)
Description
Walk the list of under-writeback pages of the given address space in the given range and wait for all of them. Check error status of the address space and return it.
Since the error status of the address space is cleared by this function, callers are responsible for checking the return value and handling and/or reporting the error.
Return
error status of the address space.
-
int
filemap_fdatawait_range_keep_errors
(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶ wait for writeback to complete
Parameters
struct address_space *mapping
address space structure to wait for
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)
Description
Walk the list of under-writeback pages of the given address space in the given range and wait for all of them. Unlike filemap_fdatawait_range(), this function does not clear error status of the address space.
Use this function if callers don’t handle errors themselves. Expected call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), fsfreeze(8)
-
int
file_fdatawait_range
(struct file *file, loff_t start_byte, loff_t end_byte)¶ wait for writeback to complete
Parameters
struct file *file
file pointing to address space structure to wait for
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)
Description
Walk the list of under-writeback pages of the address space that file refers to, in the given range and wait for all of them. Check error status of the address space vs. the file->f_wb_err cursor and return it.
Since the error status of the file is advanced by this function, callers are responsible for checking the return value and handling and/or reporting the error.
Return
error status of the address space vs. the file->f_wb_err cursor.
-
int
filemap_fdatawait_keep_errors
(struct address_space *mapping)¶ wait for writeback without clearing errors
Parameters
struct address_space *mapping
address space structure to wait for
Description
Walk the list of under-writeback pages of the given address space and wait for all of them. Unlike filemap_fdatawait(), this function does not clear error status of the address space.
Use this function if callers don’t handle errors themselves. Expected call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), fsfreeze(8)
Return
error status of the address space.
-
bool
filemap_range_needs_writeback
(struct address_space *mapping, loff_t start_byte, loff_t end_byte)¶ check if range potentially needs writeback
Parameters
struct address_space *mapping
address space within which to check
loff_t start_byte
offset in bytes where the range starts
loff_t end_byte
offset in bytes where the range ends (inclusive)
Description
Find at least one page in the range supplied, usually used to check if direct writing in this range will trigger a writeback. Used by O_DIRECT read/write with IOCB_NOWAIT, to see if the caller needs to do filemap_write_and_wait_range() before proceeding.
Return
true if the caller should do filemap_write_and_wait_range() before doing O_DIRECT to a page in this range, false otherwise.
-
int
filemap_write_and_wait_range
(struct address_space *mapping, loff_t lstart, loff_t lend)¶ write out & wait on a file range
Parameters
struct address_space *mapping
the address_space for the pages
loff_t lstart
offset in bytes where the range starts
loff_t lend
offset in bytes where the range ends (inclusive)
Description
Write out and wait upon file offsets lstart->lend, inclusive.
Note that lend is inclusive (describes the last byte to be written) so that this function can be used to write to the very end-of-file (lend = -1).
Return
error status of the address space.
-
int
file_check_and_advance_wb_err
(struct file *file)¶ report wb error (if any) that was previously unreported and advance wb_err to current one
Parameters
struct file *file
struct file on which the error is being reported
Description
When userland calls fsync (or something like nfsd does the equivalent), we want to report any writeback errors that occurred since the last fsync (or since the file was opened if there haven’t been any).
Grab the wb_err from the mapping. If it matches what we have in the file, then just quickly return 0. The file is all caught up.
If it doesn’t match, then take the mapping value, set the “seen” flag in it and try to swap it into place. If it works, or another task beat us to it with the new value, then update the f_wb_err and return the error portion. The error at this point must be reported via proper channels (a’la fsync, or NFS COMMIT operation, etc.).
While we handle mapping->wb_err with atomic operations, the f_wb_err value is protected by the f_lock since we must ensure that it reflects the latest value swapped in for this file descriptor.
Return
0 on success, negative error code otherwise.
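The check-and-advance scheme described above can be explored with a simplified, single-threaded userspace model. Everything here is an assumption made for illustration: the `model_*` names, the bit layout (error code in the low 12 bits, then a "seen" flag, then a counter), and the omission of the f_lock protection. It is a sketch of the idea, not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: errno in the low bits, a "seen" flag, then a counter. */
#define MODEL_ERR_MASK 0x0fffu
#define MODEL_SEEN     0x1000u
#define MODEL_CTR_INC  0x2000u

struct model_mapping { _Atomic uint32_t wb_err; };
struct model_file    { uint32_t f_wb_err; };

/* Record a writeback error (err is a negative errno) on the mapping. */
void model_set_wb_err(struct model_mapping *m, int err)
{
	uint32_t old = atomic_load(&m->wb_err);
	uint32_t new;

	do {
		new = (old & ~(MODEL_ERR_MASK | MODEL_SEEN)) | (uint32_t)(-err);
		/* Bump the counter only if the old value was already seen. */
		if (old & MODEL_SEEN)
			new += MODEL_CTR_INC;
	} while (!atomic_compare_exchange_weak(&m->wb_err, &old, new));
}

/* Report an error this file has not seen yet and advance its cursor. */
int model_check_and_advance(struct model_mapping *m, struct model_file *f)
{
	uint32_t old = atomic_load(&m->wb_err);
	uint32_t new;

	if (f->f_wb_err == old)
		return 0;	/* the file is all caught up */

	new = old | MODEL_SEEN;
	if (new != old)
		/* Losing this race is fine; another task set the flag. */
		atomic_compare_exchange_strong(&m->wb_err, &old, new);

	f->f_wb_err = new;
	return -(int)(new & MODEL_ERR_MASK);
}
```

In this model each open file reports a given error once: a second check on the same file returns 0, while another file whose cursor predates the error still sees it.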
-
int
file_write_and_wait_range
(struct file *file, loff_t lstart, loff_t lend)¶ write out & wait on a file range
Parameters
struct file *file
file pointing to address_space with pages
loff_t lstart
offset in bytes where the range starts
loff_t lend
offset in bytes where the range ends (inclusive)
Description
Write out and wait upon file offsets lstart->lend, inclusive.
Note that lend is inclusive (describes the last byte to be written) so that this function can be used to write to the very end-of-file (lend = -1).
After writing out and waiting on the data, we check and advance the f_wb_err cursor to the latest value, and return any errors detected there.
Return
0 on success, negative error code otherwise.
-
void
replace_page_cache_page
(struct page *old, struct page *new)¶ replace a pagecache page with a new one
Parameters
struct page *old
page to be replaced
struct page *new
page to replace with
Description
This function replaces a page in the pagecache with a new one. On success it acquires the pagecache reference for the new page and drops it for the old page. Both the old and new pages must be locked. This function does not add the new page to the LRU; the caller must do that.
The remove + add is atomic. This function cannot fail.
-
int
add_to_page_cache_locked
(struct page *page, struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)¶ add a locked page to the pagecache
Parameters
struct page *page
page to add
struct address_space *mapping
the page’s address_space
pgoff_t offset
page index
gfp_t gfp_mask
page allocation mode
Description
This function is used to add a page to the pagecache. The page must be locked. This function does not add the page to the LRU; the caller must do that.
Return
0 on success, negative error code otherwise.
-
void
folio_add_wait_queue
(struct folio *folio, wait_queue_entry_t *waiter)¶ Add an arbitrary waiter to a folio’s wait queue
Parameters
struct folio *folio
Folio defining the wait queue of interest
wait_queue_entry_t *waiter
Waiter to add to the queue
Description
Add an arbitrary waiter to the wait queue for the nominated folio.
Parameters
struct folio *folio
The folio.
Description
Unlocks the folio and wakes up any thread sleeping on the page lock.
Context
May be called from interrupt or process context. May not be called from NMI context.
Parameters
struct folio *folio
The folio.
Description
Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for it. The folio reference held for PG_private_2 being set is released.
This is, for example, used when a netfs folio is being written to a local disk cache, thereby allowing writes to the cache for the same folio to be serialised.
Parameters
struct folio *folio
The folio to wait on.
Description
Wait for PG_private_2 (aka PG_fscache) to be cleared on a folio.
-
int
folio_wait_private_2_killable
(struct folio *folio)¶ Wait for PG_private_2 to be cleared on a folio.
Parameters
struct folio *folio
The folio to wait on.
Description
Wait for PG_private_2 (aka PG_fscache) to be cleared on a folio or until a fatal signal is received by the calling task.
Return
0 if successful.
-EINTR if a fatal signal was encountered.
Parameters
struct folio *folio
The folio.
-
void
__folio_lock
(struct folio *folio)¶ Get a lock on the folio, assuming we need to sleep to get it.
Parameters
struct folio *folio
The folio to lock
-
pgoff_t
page_cache_next_miss
(struct address_space *mapping, pgoff_t index, unsigned long max_scan)¶ Find the next gap in the page cache.
Parameters
struct address_space *mapping
Mapping.
pgoff_t index
Index.
unsigned long max_scan
Maximum range to search.
Description
Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the gap with the lowest index.
This function may be called under the rcu_read_lock. However, this will not atomically search a snapshot of the cache at a single point in time. For example, if a gap is created at index 5, then subsequently a gap is created at index 10, page_cache_next_miss covering both indices may return 10 if called under the rcu_read_lock.
Return
The index of the gap if found, otherwise an index outside the range specified (in which case ‘return - index >= max_scan’ will be true). In the rare case of index wrap-around, 0 will be returned.
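The return-value convention is easy to get wrong, so here is a userspace model of the documented contract. It is a sketch only: the page cache is replaced by a plain array of presence flags, `model_next_miss()` and `model_gap_found()` are hypothetical helpers, indices past the array are treated as gaps, and index wrap-around is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the documented behaviour: return the lowest absent index in
 * [index, index + max_scan - 1], or index + max_scan if there is none.
 */
unsigned long model_next_miss(const bool *present, unsigned long nr_slots,
			      unsigned long index, unsigned long max_scan)
{
	unsigned long i;

	for (i = 0; i < max_scan; i++) {
		unsigned long idx = index + i;

		/* Slots beyond the modeled cache count as gaps. */
		if (idx >= nr_slots || !present[idx])
			return idx;
	}
	return index + max_scan;
}

/* A gap was found only if the result lies inside the scanned range. */
bool model_gap_found(unsigned long ret, unsigned long index,
		     unsigned long max_scan)
{
	return ret - index < max_scan;
}
```

A caller would compare the result against the scanned window rather than test it for zero, since any index, including 0, can be a legitimate gap.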
-
pgoff_t
page_cache_prev_miss
(struct address_space *mapping, pgoff_t index, unsigned long max_scan)¶ Find the previous gap in the page cache.
Parameters
struct address_space *mapping
Mapping.
pgoff_t index
Index.
unsigned long max_scan
Maximum range to search.
Description
Search the range [max(index - max_scan + 1, 0), index] for the gap with the highest index.
This function may be called under the rcu_read_lock. However, this will not atomically search a snapshot of the cache at a single point in time. For example, if a gap is created at index 10, then subsequently a gap is created at index 5, page_cache_prev_miss() covering both indices may return 5 if called under the rcu_read_lock.
Return
The index of the gap if found, otherwise an index outside the range specified (in which case ‘index - return >= max_scan’ will be true). In the rare case of wrap-around, ULONG_MAX will be returned.
-
struct folio *
__filemap_get_folio
(struct address_space *mapping, pgoff_t index, int fgp_flags, gfp_t gfp)¶ Find and get a reference to a folio.
Parameters
struct address_space *mapping
The address_space to search.
pgoff_t index
The page index.
int fgp_flags
FGP flags modify how the folio is returned.
gfp_t gfp
Memory allocation flags to use if FGP_CREAT is specified.
Description
Looks up the page cache entry at mapping & index.
fgp_flags can be zero or more of these flags:
FGP_ACCESSED - The folio will be marked accessed.
FGP_LOCK - The folio is returned locked.
FGP_ENTRY - If there is a shadow / swap / DAX entry, return it instead of allocating a new folio to replace it.
FGP_CREAT - If no page is present then a new page is allocated using gfp and added to the page cache and the VM’s LRU list. The page is returned locked and with an increased refcount.
FGP_FOR_MMAP - The caller wants to do its own locking dance if the page is already in cache. If the page was allocated, unlock it before returning so the caller can do the same dance.
FGP_WRITE - The page will be written to by the caller.
FGP_NOFS - __GFP_FS will get cleared in gfp.
FGP_NOWAIT - Don’t get blocked by page lock.
FGP_STABLE - Wait for the folio to be stable (finished writeback).
If FGP_LOCK or FGP_CREAT are specified then the function may sleep even if the GFP flags specified for FGP_CREAT are atomic.
If there is a page cache page, it is returned with an increased refcount.
Return
The found folio or NULL otherwise.
-
unsigned
find_get_pages_contig
(struct address_space *mapping, pgoff_t index, unsigned int nr_pages, struct page **pages)¶ gang contiguous pagecache lookup
Parameters
struct address_space *mapping
The address_space to search
pgoff_t index
The starting page index
unsigned int nr_pages
The maximum number of pages
struct page **pages
Where the resulting pages are placed
Description
find_get_pages_contig() works exactly like find_get_pages(), except that the returned pages are guaranteed to be contiguous.
Return
the number of pages which were found.
-
unsigned
find_get_pages_range_tag
(struct address_space *mapping, pgoff_t *index, pgoff_t end, xa_mark_t tag, unsigned int nr_pages, struct page **pages)¶ Find and return head pages matching tag.
Parameters
struct address_space *mapping
the address_space to search
pgoff_t *index
the starting page index
pgoff_t end
The final page index (inclusive)
xa_mark_t tag
the tag index
unsigned int nr_pages
the maximum number of pages
struct page **pages
where the resulting pages are placed
Description
Like find_get_pages(), except we only return head pages which are tagged with tag. index is updated to the index immediately after the last page we return, ready for the next iteration.
Return
the number of pages which were found.
-
ssize_t
filemap_read
(struct kiocb *iocb, struct iov_iter *iter, ssize_t already_read)¶ Read data from the page cache.
Parameters
struct kiocb *iocb
The iocb to read.
struct iov_iter *iter
Destination for the data.
ssize_t already_read
Number of bytes already read by the caller.
Description
Copies data from the page cache. If the data is not currently present, uses the readahead and readpage address_space operations to fetch it.
Return
Total number of bytes copied, including those already read by the caller. If an error happens before any bytes are copied, returns a negative error number.
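One way to read this return convention is shown below as a tiny userspace sketch. The helper `model_read_result()` and its parameters are hypothetical, and it assumes that bytes counted in already_read take precedence over a later error.

```c
#include <assert.h>
#include <sys/types.h> /* ssize_t */

/*
 * Hypothetical model: report the running total if anything was read,
 * otherwise pass the (negative) error through.
 */
ssize_t model_read_result(ssize_t already_read, ssize_t copied, int error)
{
	ssize_t total = already_read + copied;

	return total > 0 ? total : error;
}
```

Under this reading, an error that strikes after the caller has already read some data shows up as a short read, and the error itself is only seen when nothing at all was transferred.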
-
ssize_t
generic_file_read_iter
(struct kiocb *iocb, struct iov_iter *iter)¶ generic filesystem read routine
Parameters
struct kiocb *iocb
kernel I/O control block
struct iov_iter *iter
destination for the data read
Description
This is the “read_iter()” routine for all filesystems that can use the page cache directly.
The IOCB_NOWAIT flag in iocb->ki_flags indicates that -EAGAIN shall be returned when no data can be read without waiting for I/O requests to complete; it doesn’t prevent readahead.
The IOCB_NOIO flag in iocb->ki_flags indicates that no new I/O requests shall be made for the read or for readahead. When no data can be read, -EAGAIN shall be returned. When readahead would be triggered, a partial, possibly empty read shall be returned.
Return
number of bytes copied, even for partial reads
negative error code (or 0 if IOCB_NOIO) if nothing was read
-
vm_fault_t
filemap_fault
(struct vm_fault *vmf)¶ read in file data for page fault handling
Parameters
struct vm_fault *vmf
struct vm_fault containing details of the fault
Description
filemap_fault() is invoked via the vma operations vector for a mapped memory region to read in file data during a page fault.
The goto’s are kind of ugly, but this streamlines the normal case of having it in the page cache, and handles the special cases reasonably without having a lot of duplicated code.
vma->vm_mm->mmap_lock must be held on entry.
If our return value has VM_FAULT_RETRY set, it’s because the mmap_lock may be dropped before doing I/O or by lock_page_maybe_drop_mmap().
If our return value does not have VM_FAULT_RETRY set, the mmap_lock has not been released.
We never return with VM_FAULT_RETRY and a bit from VM_FAULT_ERROR set.
Return
bitwise-OR of VM_FAULT_ codes.
-
struct page *
read_cache_page
(struct address_space *mapping, pgoff_t index, int (*filler)(void *, struct page *), void *data)¶ read into page cache, fill it if needed
Parameters
struct address_space *mapping
the page’s address_space
pgoff_t index
the page index
int (*filler)(void *, struct page *)
function to perform the read
void *data
first arg to filler(data, page) function, often left as NULL
Description
Read into the page cache. If a page already exists, and PageUptodate() is not set, try to fill the page and wait for it to become unlocked.
If the page does not get brought uptodate, return -EIO.
The function expects mapping->invalidate_lock to be already held.
Return
up to date page on success, ERR_PTR() on failure.
-
struct page *
read_cache_page_gfp
(struct address_space *mapping, pgoff_t index, gfp_t gfp)¶ read into page cache, using specified page allocation flags.
Parameters
struct address_space *mapping
the page’s address_space
pgoff_t index
the page index
gfp_t gfp
the page allocator flags to use if allocating
Description
This is the same as “read_mapping_page(mapping, index, NULL)”, but with any new page allocations done using the specified allocation flags.
If the page does not get brought uptodate, return -EIO.
The function expects mapping->invalidate_lock to be already held.
Return
up to date page on success, ERR_PTR() on failure.
-
ssize_t
__generic_file_write_iter
(struct kiocb *iocb, struct iov_iter *from)¶ write data to a file
Parameters
struct kiocb *iocb
IO state structure (file, offset, etc.)
struct iov_iter *from
iov_iter with data to write
Description
This function does all the work needed for actually writing data to a file. It does all basic checks, removes SUID from the file, updates modification times and calls proper subroutines depending on whether we do direct IO or a standard buffered write.
It expects i_rwsem to be grabbed unless we work on a block device or similar object which does not need locking at all.
This function does not take care of syncing data in case of O_SYNC write. A caller has to handle it. This is mainly due to the fact that we want to avoid syncing under i_rwsem.
Return
number of bytes written, even for truncated writes
negative error code if no data has been written at all
-
ssize_t
generic_file_write_iter
(struct kiocb *iocb, struct iov_iter *from)¶ write data to a file
Parameters
struct kiocb *iocb
IO state structure
struct iov_iter *from
iov_iter with data to write
Description
This is a wrapper around __generic_file_write_iter() to be used by most filesystems. It takes care of syncing the file in case of O_SYNC file and acquires i_rwsem as needed.
Return
negative error code if no data has been written at all or vfs_fsync_range() failed for a synchronous write
number of bytes written, even for truncated writes
-
int
try_to_release_page
(struct page *page, gfp_t gfp_mask)¶ release old fs-specific metadata on a page
Parameters
struct page *page
the page which the kernel is trying to free
gfp_t gfp_mask
memory allocation flags (and I/O mode)
Description
The address_space is to try to release any data against the page (presumably at page->private).
This may also be called if PG_fscache is set on a page, indicating that the page is known to the local caching routines.
The gfp_mask argument specifies whether I/O may be performed to release this page (__GFP_IO), and whether the call may block (__GFP_RECLAIM & __GFP_FS).
Return
1 if the release was successful, otherwise return zero.
-
void
balance_dirty_pages_ratelimited
(struct address_space *mapping)¶ balance dirty memory state
Parameters
struct address_space *mapping
address_space which was dirtied
Description
Processes which are dirtying memory should call in here once for each page which was newly dirtied. The function will periodically check the system’s dirty state and will initiate writeback if needed.
Once we’re over the dirty memory limit we decrease the ratelimiting by a lot, to prevent individual processes from overshooting the limit by (ratelimit_pages) each.
-
void
tag_pages_for_writeback
(struct address_space *mapping, pgoff_t start, pgoff_t end)¶ tag pages to be written by write_cache_pages
Parameters
struct address_space *mapping
address space structure to write
pgoff_t start
starting page index
pgoff_t end
ending page index (inclusive)
Description
This function scans the page range from start to end (inclusive) and tags all pages that have DIRTY tag set with a special TOWRITE tag. The idea is that write_cache_pages (or whoever calls this function) will then use TOWRITE tag to identify pages eligible for writeback. This mechanism is used to avoid livelocking of writeback by a process steadily creating new dirty pages in the file (thus it is important for this function to be quick so that it can tag pages faster than a dirtying process can create them).
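The livelock-avoidance idea can be demonstrated with a toy userspace model in which each page carries two flags. This is a sketch under loud assumptions: `struct model_page`, `model_tag_for_writeback()` and `model_write_tagged()` are hypothetical, a plain array stands in for the page cache, and no locking or I/O is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: only the two tags that matter for this illustration. */
struct model_page {
	bool dirty;
	bool towrite;
};

/* Tag every dirty page in [start, end] (inclusive) as TOWRITE. */
void model_tag_for_writeback(struct model_page *pages, size_t start,
			     size_t end)
{
	for (size_t i = start; i <= end; i++)
		if (pages[i].dirty)
			pages[i].towrite = true;
}

/* Write back only the tagged pages; returns how many were written. */
size_t model_write_tagged(struct model_page *pages, size_t start, size_t end)
{
	size_t written = 0;

	for (size_t i = start; i <= end; i++) {
		if (!pages[i].towrite)
			continue;
		/* Clearing DIRTY and TOWRITE together mirrors the rule above. */
		pages[i].towrite = false;
		pages[i].dirty = false;
		written++;
	}
	return written;
}
```

A page dirtied after the tagging pass is left for a later writeback round, which is what bounds the amount of work for the current one.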
-
int
write_cache_pages
(struct address_space *mapping, struct writeback_control *wbc, writepage_t writepage, void *data)¶ walk the list of dirty pages of the given address space and write all of them.
Parameters
struct address_space *mapping
address space structure to write
struct writeback_control *wbc
subtract the number of written pages from *wbc->nr_to_write
writepage_t writepage
function called for each page
void *data
data passed to writepage function
Description
If a page is already under I/O, write_cache_pages() skips it, even if it’s dirty. This is desirable behaviour for memory-cleaning writeback, but it is INCORRECT for data-integrity system calls such as fsync(). fsync() and msync() need to guarantee that all the data which was dirty at the time the call was made get new I/O started against them. If wbc->sync_mode is WB_SYNC_ALL then we were called for data integrity and we must wait for existing IO to complete.
To avoid livelocks (when other process dirties new pages), we first tag pages which should be written back with TOWRITE tag and only then start writing them. For data-integrity sync we have to be careful so that we do not miss some pages (e.g., because some other process has cleared TOWRITE tag we set). The rule we follow is that TOWRITE tag can be cleared only by the process clearing the DIRTY tag (and submitting the page for IO).
To avoid deadlocks between range_cyclic writeback and callers that hold pages in PageWriteback to aggregate IO until write_cache_pages() returns, we do not loop back to the start of the file. Doing so causes a page lock/page writeback access order inversion - we should only ever lock multiple pages in ascending page->index order, and looping back to the start of the file violates that rule and causes deadlocks.
Return
0 on success, negative error code otherwise
-
int
generic_writepages
(struct address_space *mapping, struct writeback_control *wbc)¶ walk the list of dirty pages of the given address space and writepage() all of them.
Parameters
struct address_space *mapping
address space structure to write
struct writeback_control *wbc
subtract the number of written pages from *wbc->nr_to_write
Description
This is a library function, which implements the writepages() address_space_operation.
Return
0 on success, negative error code otherwise
Parameters
struct folio *folio
The folio to write.
Description
The folio must be locked by the caller and will be unlocked upon return.
Note that the mapping’s AS_EIO/AS_ENOSPC flags will be cleared when this function returns.
Return
0 on success, negative error code otherwise
-
bool
filemap_dirty_folio
(struct address_space *mapping, struct folio *folio)¶ Mark a folio dirty for filesystems which do not use buffer_heads.
Parameters
struct address_space *mapping
Address space this folio belongs to.
struct folio *folio
Folio to be marked as dirty.
Description
Filesystems which do not use buffer heads should call this function from their set_page_dirty address space operation. It ignores the contents of folio_get_private(), so if the filesystem marks individual blocks as dirty, the filesystem should handle that itself.
This is also sometimes used by filesystems which use buffer_heads when a single buffer is being dirtied: we want to set the folio dirty in that case, but not all the buffers. This is a “bottom-up” dirtying, whereas __set_page_dirty_buffers() is a “top-down” dirtying.
The caller must ensure this doesn’t race with truncation. Most will simply hold the folio lock, but e.g. zap_pte_range() calls with the folio mapped and the pte lock held, which also locks out truncation.
Parameters
struct folio *folio
The folio which is being redirtied.
Description
Most filesystems should call folio_redirty_for_writepage() instead of this function. If your filesystem is doing writeback outside the context of a writeback_control, it can call this when redirtying a folio, to de-account the dirty counters (NR_DIRTIED, WB_DIRTIED, tsk->nr_dirtied), so that they match the written counters (NR_WRITTEN, WB_WRITTEN) in the long term. Mismatches would lead to systematic errors in balanced_dirty_ratelimit and the dirty pages position control.
-
bool
folio_redirty_for_writepage
(struct writeback_control *wbc, struct folio *folio)¶ Decline to write a dirty folio.
Parameters
struct writeback_control *wbc
The writeback control.
struct folio *folio
The folio.
Description
When a writepage implementation decides that it doesn’t want to write folio for some reason, it should call this function, unlock folio and return 0.
Return
True if we redirtied the folio. False if someone else dirtied it first.
Parameters
struct folio *folio
The folio.
Description
For folios with a mapping this should be done under the page lock, for the benefit of asynchronous memory error handling, which prefers a consistent dirty state. This rule can be broken in some special cases, but it is better not to.
Return
True if the folio was newly dirtied, false if it was already dirty.
Parameters
struct folio *folio
The folio to wait for.
Description
If the folio is currently being written back to storage, wait for the I/O to complete.
Context
Sleeps. Must be called in process context and with no spinlocks held. Caller should hold a reference on the folio. If the folio is not locked, writeback may start again after writeback has finished.
Parameters
struct folio *folio
The folio to wait for.
Description
If the folio is currently being written back to storage, wait for the I/O to complete or a fatal signal to arrive.
Context
Sleeps. Must be called in process context and with no spinlocks held. Caller should hold a reference on the folio. If the folio is not locked, writeback may start again after writeback has finished.
Return
0 on success, -EINTR if we get a fatal signal while waiting.
Parameters
struct folio *folio
The folio to wait on.
Description
This function determines if the given folio is related to a backing device that requires folio contents to be held stable during writeback. If so, then it will wait for any pending writeback to complete.
Context
Sleeps. Must be called in process context and with no spinlocks held. Caller should hold a reference on the folio. If the folio is not locked, writeback may start again after writeback has finished.
-
void
truncate_inode_pages_range
(struct address_space *mapping, loff_t lstart, loff_t lend)¶ truncate range of pages specified by start & end byte offsets
Parameters
struct address_space *mapping
mapping to truncate
loff_t lstart
offset from which to truncate
loff_t lend
offset to which to truncate (inclusive)
Description
Truncate the page cache, removing the pages that are between specified offsets (and zeroing out partial pages if lstart or lend + 1 is not page aligned).
Truncate takes two passes - the first pass is nonblocking. It will not block on page locks and it will not block on writeback. The second pass will wait. This is to prevent as much IO as possible in the affected region. The first pass will remove most pages, so the search cost of the second pass is low.
We pass down the cache-hot hint to the page freeing code. Even if the mapping is large, it is probably the case that the final pages are the most recently touched, and freeing happens in ascending file offset order.
Note that since ->invalidatepage() accepts a range to invalidate, truncate_inode_pages_range is able to properly handle cases where lend + 1 is not page aligned.
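The split between pages that are removed whole and pages that are only partially zeroed can be illustrated with a small userspace model. This is a sketch only, not the kernel implementation: the 4 KiB page size, the struct trunc_plan type and the plan_truncate() helper are assumptions introduced for illustration, and the special "truncate to end of file" value of lend is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12                        /* assumption: 4 KiB pages */
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/* Hypothetical result type for this sketch; not a kernel structure. */
struct trunc_plan {
    uint64_t first_whole;  /* index of the first page removed entirely */
    uint64_t end_whole;    /* one past the last page removed entirely */
    bool     zero_head;    /* page containing lstart is partially zeroed */
    bool     zero_tail;    /* page containing lend is partially zeroed */
};

/*
 * Split the inclusive byte range [lstart, lend] into whole pages and
 * partial pages. If first_whole >= end_whole, no page is removed whole.
 */
static struct trunc_plan plan_truncate(uint64_t lstart, uint64_t lend)
{
    struct trunc_plan p;

    assert(lend >= lstart && lend != UINT64_MAX);

    /* Round lstart up and lend + 1 down to page boundaries. */
    p.first_whole = (lstart + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
    p.end_whole   = (lend + 1) >> MODEL_PAGE_SHIFT;
    p.zero_head   = (lstart & (MODEL_PAGE_SIZE - 1)) != 0;
    p.zero_tail   = ((lend + 1) & (MODEL_PAGE_SIZE - 1)) != 0;
    return p;
}
```

For example, with these assumptions a range starting at byte 100 and ending at byte 8191 leaves page 0 partially zeroed and removes only page 1 as a whole page.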
-
void
truncate_inode_pages
(struct address_space *mapping, loff_t lstart)¶ truncate all the pages from an offset
Parameters
struct address_space *mapping
mapping to truncate
loff_t lstart
offset from which to truncate
Description
Called under (and serialised by) inode->i_rwsem and mapping->invalidate_lock.
Note
When this function returns, there can be a page in the process of deletion (inside __delete_from_page_cache()) in the specified range. Thus mapping->nrpages can be non-zero when this function returns even after truncation of the whole mapping.
-
void
truncate_inode_pages_final
(struct address_space *mapping)¶ truncate all pages before inode dies
Parameters
struct address_space *mapping
mapping to truncate
Description
Called under (and serialized by) inode->i_rwsem.
Filesystems have to use this in the .evict_inode path to inform the VM that this is the final truncate and the inode is going away.
-
unsigned long
invalidate_mapping_pages
(struct address_space *mapping, pgoff_t start, pgoff_t end)¶ Invalidate all clean, unlocked cache of one inode
Parameters
struct address_space *mapping
the address_space which holds the cache to invalidate
pgoff_t start
the offset ‘from’ which to invalidate
pgoff_t end
the offset ‘to’ which to invalidate (inclusive)
Description
This function removes pages that are clean, unmapped and unlocked, as well as shadow entries. It will not block on IO activity.
If you want to remove all the pages of one inode, regardless of their use and writeback state, use truncate_inode_pages().
Return
the number of the cache entries that were invalidated
-
int
invalidate_inode_pages2_range
(struct address_space *mapping, pgoff_t start, pgoff_t end)¶ remove range of pages from an address_space
Parameters
struct address_space *mapping
the address_space
pgoff_t start
the page offset ‘from’ which to invalidate
pgoff_t end
the page offset ‘to’ which to invalidate (inclusive)
Description
Any pages which are found to be mapped into pagetables are unmapped prior to invalidation.
Return
-EBUSY if any pages could not be invalidated.
-
int
invalidate_inode_pages2
(struct address_space *mapping)¶ remove all pages from an address_space
Parameters
struct address_space *mapping
the address_space
Description
Any pages which are found to be mapped into pagetables are unmapped prior to invalidation.
Return
-EBUSY if any pages could not be invalidated.
-
void
truncate_pagecache
(struct inode *inode, loff_t newsize)¶ unmap and remove pagecache that has been truncated
Parameters
struct inode *inode
inode
loff_t newsize
new file size
Description
inode’s new i_size must already be written before truncate_pagecache is called.
This function should typically be called before the filesystem releases resources associated with the freed range (eg. deallocates blocks). This way, pagecache will always stay logically coherent with on-disk format, and the filesystem would not have to deal with situations such as writepage being called for a page that has already had its underlying blocks deallocated.
-
void
truncate_setsize
(struct inode *inode, loff_t newsize)¶ update inode and pagecache for a new file size
Parameters
struct inode *inode
inode
loff_t newsize
new file size
Description
truncate_setsize updates i_size and performs pagecache truncation (if necessary) to newsize. It will typically be called from the filesystem’s setattr function when ATTR_SIZE is passed in.
Must be called with a lock serializing truncates and writes (generally i_rwsem but e.g. xfs uses a different lock) and before all filesystem specific block truncation has been performed.
-
void
pagecache_isize_extended
(struct inode *inode, loff_t from, loff_t to)¶ update pagecache after extension of i_size
Parameters
struct inode *inode
inode for which i_size was extended
loff_t from
original inode size
loff_t to
new inode size
Description
Handle extension of inode size either caused by extending truncate or by write starting after current i_size. We mark the page straddling current i_size RO so that page_mkwrite() is called on the nearest write access to the page. This way filesystem can be sure that page_mkwrite() is called on the page before user writes to the page via mmap after the i_size has been changed.
The function must be called after i_size is updated so that page fault coming after we unlock the page will already see the new i_size. The function must be called while we still hold i_rwsem - this not only makes sure i_size is stable but also that userspace cannot observe new i_size value before we are prepared to store mmap writes at new inode size.
-
void
truncate_pagecache_range
(struct inode *inode, loff_t lstart, loff_t lend)¶ unmap and remove pagecache that is hole-punched
Parameters
struct inode *inode
inode
loff_t lstart
offset of beginning of hole
loff_t lend
offset of last byte of hole
Description
This function should typically be called before the filesystem releases resources associated with the freed range (eg. deallocates blocks). This way, pagecache will always stay logically coherent with on-disk format, and the filesystem would not have to deal with situations such as writepage being called for a page that has already had its underlying blocks deallocated.
-
void
mapping_set_error
(struct address_space *mapping, int error)¶ record a writeback error in the address_space
Parameters
struct address_space *mapping
the mapping in which an error should be set
int error
the error to set in the mapping
Description
When writeback fails in some way, we must record that error so that userspace can be informed when fsync and the like are called. We endeavor to report errors on any file that was open at the time of the error. Some internal callers also need to know when writeback errors have occurred.
When a writeback error occurs, most filesystems will want to call mapping_set_error to record the error in the mapping so that it can be reported when the application calls fsync(2).
-
void
mapping_set_large_folios
(struct address_space *mapping)¶ Indicate the file supports large folios.
Parameters
struct address_space *mapping
The file.
Description
The filesystem should call this function in its inode constructor to indicate that the VFS can use large folios to cache the contents of the file.
Context
This should not be called while the inode is active as it is non-atomic.
-
struct address_space *
folio_file_mapping
(struct folio *folio)¶ Find the mapping this folio belongs to.
Parameters
struct folio *folio
The folio.
Description
For folios which are in the page cache, return the mapping that this page belongs to. Folios in the swap cache return the mapping of the swap file or swap device where the data is stored. This is different from the mapping returned by folio_mapping(). The only reason to use it is if, like NFS, you return 0 from ->activate_swapfile.
Do not call this for folios which aren’t in the page cache or swap cache.
Parameters
struct folio *folio
The folio.
Description
For folios which are in the page cache, return the inode that this folio belongs to.
Do not call this for folios which aren’t in the page cache.
Parameters
struct folio *folio
Folio to attach data to.
void *data
Data to attach to folio.
Description
Attaching private data to a folio increments the page’s reference count. The data must be detached before the folio will be freed.
Parameters
struct folio *folio
Folio to change the data on.
void *data
Data to set on the folio.
Description
Change the private data attached to a folio and return the old data. The page must previously have had data attached and the data must be detached before the folio will be freed.
Return
Data that was previously attached to the folio.
Parameters
struct folio *folio
Folio to detach data from.
Description
Removes the data that was previously attached to the folio and decrements the refcount on the page.
Return
Data that was attached to the folio.
-
struct folio *
filemap_get_folio
(struct address_space *mapping, pgoff_t index)¶ Find and get a folio.
Parameters
struct address_space *mapping
The address_space to search.
pgoff_t index
The page index.
Description
Looks up the page cache entry at mapping & index. If a folio is present, it is returned with an increased refcount.
Otherwise, NULL is returned.
-
struct page *
find_get_page
(struct address_space *mapping, pgoff_t offset)¶ find and get a page reference
Parameters
struct address_space *mapping
the address_space to search
pgoff_t offset
the page index
Description
Looks up the page cache slot at mapping & offset. If there is a page cache page, it is returned with an increased refcount.
Otherwise, NULL is returned.
-
struct page *
find_lock_page
(struct address_space *mapping, pgoff_t index)¶ locate, pin and lock a pagecache page
Parameters
struct address_space *mapping
the address_space to search
pgoff_t index
the page index
Description
Looks up the page cache entry at mapping & index. If there is a page cache page, it is returned locked and with an increased refcount.
Context
May sleep.
Return
A struct page or NULL if there is no page in the cache for this index.
-
struct page *
find_or_create_page
(struct address_space *mapping, pgoff_t index, gfp_t gfp_mask)¶ locate or add a pagecache page
Parameters
struct address_space *mapping
the page’s address_space
pgoff_t index
the page’s index into the mapping
gfp_t gfp_mask
page allocation mode
Description
Looks up the page cache slot at mapping & index. If there is a page cache page, it is returned locked and with an increased refcount.
If the page is not present, a new page is allocated using gfp_mask and added to the page cache and the VM’s LRU list. The page is returned locked and with an increased refcount.
On memory exhaustion, NULL is returned.
find_or_create_page() may sleep, even if gfp_mask specifies an atomic allocation!
-
struct page *
grab_cache_page_nowait
(struct address_space *mapping, pgoff_t index)¶ returns locked page at given index in given cache
Parameters
struct address_space *mapping
target address_space
pgoff_t index
the page index
Description
Same as grab_cache_page(), but do not wait if the page is unavailable. This is intended for speculative data generators, where the data can be regenerated if the page couldn’t be grabbed. This routine should be safe to call while holding the lock for another page.
Clear __GFP_FS when allocating the page to avoid recursion into the fs and deadlock against the caller’s locked page.
Parameters
struct folio *folio
The folio.
Description
For a folio which is either in the page cache or the swap cache, return its index within the address_space it belongs to. If you know the page is definitely in the page cache, you can look at the folio’s index directly.
Return
The index (offset in units of pages) of a folio in its file.
Parameters
struct folio *folio
The current folio.
Return
The index of the folio which follows this folio in the file.
Parameters
struct folio *folio
The folio which contains this index.
pgoff_t index
The index we want to look up.
Description
Sometimes after looking up a folio in the page cache, we need to obtain the specific page for an index (eg a page fault).
Return
The page containing the file data for this index.
Parameters
struct folio *folio
The folio.
pgoff_t index
The page index within the file.
Context
The caller should have the page locked in order to prevent (eg) shmem from moving the page between the page cache and swap cache and changing its index in the middle of the operation.
Return
true or false.
Parameters
struct folio *folio
The folio.
Parameters
struct folio *folio
The folio.
Description
This differs from folio_pos() for folios which belong to a swap file.
NFS is the only filesystem today which needs to use folio_file_pos().
-
struct
readahead_control
¶ Describes a readahead request.
Definition
struct readahead_control {
struct file *file;
struct address_space *mapping;
struct file_ra_state *ra;
};
Members
file
The file, used primarily by network filesystems for authentication. May be NULL if invoked internally by the filesystem.
mapping
Readahead this filesystem object.
ra
File readahead state. May be NULL.
Description
A readahead request is for consecutive pages. Filesystems which implement the ->readahead method should call readahead_page() or readahead_page_batch() in a loop and attempt to start I/O against each page in the request.
Most of the fields in this struct are private and should be accessed by the functions below.
-
void
page_cache_sync_readahead
(struct address_space *mapping, struct file_ra_state *ra, struct file *file, pgoff_t index, unsigned long req_count)¶ generic file readahead
Parameters
struct address_space *mapping
address_space which holds the pagecache and I/O vectors
struct file_ra_state *ra
file_ra_state which holds the readahead state
struct file *file
Used by the filesystem for authentication.
pgoff_t index
Index of first page to be read.
unsigned long req_count
Total number of pages being read by the caller.
Description
page_cache_sync_readahead() should be called when a cache miss happened: it will submit the read. The readahead logic may decide to piggyback more pages onto the read request if access patterns suggest it will improve performance.
-
void
page_cache_async_readahead
(struct address_space *mapping, struct file_ra_state *ra, struct file *file, struct page *page, pgoff_t index, unsigned long req_count)¶ file readahead for marked pages
Parameters
struct address_space *mapping
address_space which holds the pagecache and I/O vectors
struct file_ra_state *ra
file_ra_state which holds the readahead state
struct file *file
Used by the filesystem for authentication.
struct page *page
The page at index which triggered the readahead call.
pgoff_t index
Index of first page to be read.
unsigned long req_count
Total number of pages being read by the caller.
Description
page_cache_async_readahead() should be called when a page is used which is marked as PageReadahead; this is a marker to suggest that the application has used up enough of the readahead window that we should start pulling in more pages.
-
struct page *
readahead_page
(struct readahead_control *ractl)¶ Get the next page to read.
Parameters
struct readahead_control *ractl
The current readahead request.
Context
The page is locked and has an elevated refcount. The caller should decrease the refcount once the page has been submitted for I/O and unlock the page once all I/O to that page has completed.
Return
A pointer to the next page, or NULL if we are done.
-
struct folio *
readahead_folio
(struct readahead_control *ractl)¶ Get the next folio to read.
Parameters
struct readahead_control *ractl
The current readahead request.
Context
The folio is locked. The caller should unlock the folio once all I/O to that folio has completed.
Return
A pointer to the next folio, or NULL if we are done.
-
readahead_page_batch
(rac, array)¶ Get a batch of pages to read.
Parameters
rac
The current readahead request.
array
An array of pointers to struct page.
Context
The pages are locked and have an elevated refcount. The caller should decrease the refcount on each page once it has been submitted for I/O and unlock each page once all I/O to it has completed.
Return
The number of pages placed in the array. 0 indicates the request is complete.
-
loff_t
readahead_pos
(struct readahead_control *rac)¶ The byte offset into the file of this readahead request.
Parameters
struct readahead_control *rac
The readahead request.
-
size_t
readahead_length
(struct readahead_control *rac)¶ The number of bytes in this readahead request.
Parameters
struct readahead_control *rac
The readahead request.
-
pgoff_t
readahead_index
(struct readahead_control *rac)¶ The index of the first page in this readahead request.
Parameters
struct readahead_control *rac
The readahead request.
-
unsigned int
readahead_count
(struct readahead_control *rac)¶ The number of pages in this readahead request.
Parameters
struct readahead_control *rac
The readahead request.
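The accessors above are related by simple page arithmetic: the byte position is the first index scaled by the page size, and the byte length is the page count scaled the same way. The following userspace sketch models that relationship. The struct ra_model type, the model_* function names and the 4 KiB page size are assumptions made for illustration; the real fields of struct readahead_control are private.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the private state of a readahead request. */
struct ra_model {
    uint64_t index;       /* index of the first page in the request */
    unsigned int nr_pages; /* number of pages in the request */
};

/* Index of the first page, as readahead_index() would report it. */
static uint64_t model_readahead_index(const struct ra_model *r)
{
    assert(r != NULL);
    return r->index;
}

/* Number of pages, as readahead_count() would report it. */
static unsigned int model_readahead_count(const struct ra_model *r)
{
    assert(r != NULL);
    return r->nr_pages;
}

/* Byte offset into the file: first index times the page size. */
static int64_t model_readahead_pos(const struct ra_model *r)
{
    return (int64_t)model_readahead_index(r) * MODEL_PAGE_SIZE;
}

/* Byte length of the request: page count times the page size. */
static size_t model_readahead_length(const struct ra_model *r)
{
    return (size_t)model_readahead_count(r) * MODEL_PAGE_SIZE;
}
```

Under these assumptions a request for 4 pages starting at index 3 covers bytes 12288 through 28671 of the file.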
-
size_t
readahead_batch_length
(struct readahead_control *rac)¶ The number of bytes in the current batch.
Parameters
struct readahead_control *rac
The readahead request.
-
ssize_t
folio_mkwrite_check_truncate
(struct folio *folio, struct inode *inode)¶ check if folio was truncated
Parameters
struct folio *folio
the folio to check
struct inode *inode
the inode to check the folio against
Return
the number of bytes in the folio up to EOF, or -EFAULT if the folio was truncated.
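The three possible outcomes (folio wholly inside EOF, folio straddling EOF, folio truncated or wholly past EOF) can be sketched in userspace as follows. This is an illustrative model under stated assumptions, not the kernel code: struct folio_model and model_mkwrite_check_truncate() are hypothetical, and the "attached" flag stands in for the folio still belonging to a mapping.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a folio for this sketch. */
struct folio_model {
    int64_t pos;       /* file offset of the folio's first byte */
    int64_t size;      /* folio size in bytes */
    bool    attached;  /* false once the folio has left its mapping */
};

/*
 * Return the number of bytes of the folio that lie below i_size,
 * or -EFAULT if the folio was truncated.
 */
static int64_t model_mkwrite_check_truncate(const struct folio_model *f,
                                            int64_t i_size)
{
    assert(f->size > 0 && f->pos >= 0 && i_size >= 0);

    if (!f->attached)
        return -EFAULT;               /* removed from the mapping */
    if (f->pos + f->size <= i_size)
        return f->size;               /* wholly inside EOF */
    if (f->pos >= i_size)
        return -EFAULT;               /* wholly past EOF */
    return i_size - f->pos;           /* straddles EOF */
}
```

A page_mkwrite-style caller would typically treat a negative result as a reason to fail the fault and otherwise use the byte count to limit how much of the folio it prepares for writing.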
-
int
page_mkwrite_check_truncate
(struct page *page, struct inode *inode)¶ check if page was truncated
Parameters
struct page *page
the page to check
struct inode *inode
the inode to check the page against
Description
Returns the number of bytes in the page up to EOF, or -EFAULT if the page was truncated.
-
unsigned int
i_blocks_per_folio
(struct inode *inode, struct folio *folio)¶ How many blocks fit in this folio.
Parameters
struct inode *inode
The inode which contains the blocks.
struct folio *folio
The folio.
Description
If the block size is larger than the size of this folio, return zero.
Context
The caller should hold a refcount on the folio to prevent it from being split.
Return
The number of filesystem blocks covered by this folio.
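The calculation amounts to dividing the folio size by the block size, which is why a block larger than the folio yields zero. A minimal userspace sketch, assuming the block size is a power of two expressed as a shift (blkbits); model_blocks_per_folio() is a hypothetical name used only here:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Number of blocks of size (1 << blkbits) that fit in a folio of
 * folio_bytes bytes. Returns 0 when the block is larger than the folio.
 */
static unsigned int model_blocks_per_folio(size_t folio_bytes,
                                           unsigned int blkbits)
{
    assert(blkbits < sizeof(size_t) * CHAR_BIT);
    return (unsigned int)(folio_bytes >> blkbits);
}
```

With 512-byte blocks (blkbits of 9) a 4096-byte folio covers 8 blocks; with 64 KiB blocks (blkbits of 16) the same folio covers none.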
Memory pools¶
-
void
mempool_exit
(mempool_t *pool)¶ exit a mempool initialized with mempool_init()
Parameters
mempool_t *pool
pointer to the memory pool which was initialized with mempool_init().
Description
Free all reserved elements in pool. Unlike mempool_destroy(), this does not free pool itself. This function only sleeps if the free_fn() function sleeps.
May be called on a zeroed but uninitialized mempool (i.e. allocated with kzalloc()).
-
void
mempool_destroy
(mempool_t *pool)¶ deallocate a memory pool
Parameters
mempool_t *pool
pointer to the memory pool which was allocated via mempool_create().
Description
Free all reserved elements in pool and pool itself. This function only sleeps if the free_fn() function sleeps.
-
int
mempool_init
(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t *free_fn, void *pool_data)¶ initialize a memory pool
Parameters
mempool_t *pool
pointer to the memory pool that should be initialized
int min_nr
the minimum number of elements guaranteed to be allocated for this pool.
mempool_alloc_t *alloc_fn
user-defined element-allocation function.
mempool_free_t *free_fn
user-defined element-freeing function.
void *pool_data
optional private data available to the user-defined functions.
Description
Like mempool_create(), but initializes the pool in place (i.e. embedded in another structure).
Return
0 on success, negative error code otherwise.
-
mempool_t *
mempool_create
(int min_nr, mempool_alloc_t *alloc_fn, mempool_free_t *free_fn, void *pool_data)¶ create a memory pool
Parameters
int min_nr
the minimum number of elements guaranteed to be allocated for this pool.
mempool_alloc_t *alloc_fn
user-defined element-allocation function.
mempool_free_t *free_fn
user-defined element-freeing function.
void *pool_data
optional private data available to the user-defined functions.
Description
This function creates and allocates a guaranteed size, preallocated memory pool. The pool can be used from the mempool_alloc() and mempool_free() functions. This function might sleep. Both the alloc_fn() and the free_fn() functions might sleep - as long as the mempool_alloc() function is not called from IRQ contexts.
Return
pointer to the created memory pool object or NULL on error.
-
int
mempool_resize
(mempool_t *pool, int new_min_nr)¶ resize an existing memory pool
Parameters
mempool_t *pool
pointer to the memory pool which was allocated via mempool_create().
int new_min_nr
the new minimum number of elements guaranteed to be allocated for this pool.
Description
This function shrinks/grows the pool. In the case of growing, it cannot be guaranteed that the pool will be grown to the new size immediately, but new mempool_free() calls will refill it. This function may sleep.
Note, the caller must guarantee that no mempool_destroy is called while this function is running. mempool_alloc() & mempool_free() might be called (eg. from IRQ contexts) while this function executes.
Return
0 on success, negative error code otherwise.
-
void *
mempool_alloc
(mempool_t *pool, gfp_t gfp_mask)¶ allocate an element from a specific memory pool
Parameters
mempool_t *pool
pointer to the memory pool which was allocated via mempool_create().
gfp_t gfp_mask
the usual allocation bitmask.
Description
This function only sleeps if the alloc_fn() function sleeps or returns NULL. Note that due to preallocation, this function never fails when called from process contexts. (It might fail if called from an IRQ context.)
Note
using __GFP_ZERO is not supported.
Return
pointer to the allocated element or NULL on error.
-
void
mempool_free
(void *element, mempool_t *pool)¶ return an element to the pool.
Parameters
void *element
pool element pointer.
mempool_t *pool
pointer to the memory pool which was allocated via mempool_create().
Description
this function only sleeps if the free_fn() function sleeps.
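The interplay between the preallocated reserve, allocation and freeing can be sketched with a small userspace model. This is not the kernel implementation: struct pool_model, the pool_model_* functions, the fixed reserve cap and the backing_fails test hook are all assumptions made for illustration. The model tries the backing allocator first, falls back to the reserve when that fails, and refills the reserve on free before releasing memory. Where the model returns NULL on an empty reserve, a sleeping caller in the kernel would wait instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_MAX_RESERVE 16   /* arbitrary cap for this sketch */

/* Hypothetical userspace model; not the kernel's mempool_t. */
struct pool_model {
    void  *reserve[MODEL_MAX_RESERVE];
    int    min_nr;         /* elements the pool tries to keep in reserve */
    int    curr_nr;        /* elements currently held in reserve */
    size_t elem_size;
    bool   backing_fails;  /* test hook: simulate allocator failure */
};

static void *backing_alloc(struct pool_model *p)
{
    return p->backing_fails ? NULL : malloc(p->elem_size);
}

/* Release every element still held in the reserve. */
static void pool_model_exit(struct pool_model *p)
{
    while (p->curr_nr > 0)
        free(p->reserve[--p->curr_nr]);
}

/* Preallocate min_nr elements; returns 0 on success, -1 on failure. */
static int pool_model_init(struct pool_model *p, int min_nr, size_t elem_size)
{
    assert(min_nr > 0 && min_nr <= MODEL_MAX_RESERVE && elem_size > 0);
    p->min_nr = min_nr;
    p->curr_nr = 0;
    p->elem_size = elem_size;
    p->backing_fails = false;
    while (p->curr_nr < min_nr) {
        void *e = backing_alloc(p);
        if (!e) {
            pool_model_exit(p);
            return -1;
        }
        p->reserve[p->curr_nr++] = e;
    }
    return 0;
}

/* Try the backing allocator first, then fall back to the reserve. */
static void *pool_model_alloc(struct pool_model *p)
{
    void *e = backing_alloc(p);

    if (e)
        return e;
    if (p->curr_nr > 0)
        return p->reserve[--p->curr_nr];
    return NULL;   /* reserve exhausted and this model cannot wait */
}

/* Refill the reserve first; otherwise hand the element back. */
static void pool_model_free(struct pool_model *p, void *e)
{
    if (!e)
        return;
    if (p->curr_nr < p->min_nr)
        p->reserve[p->curr_nr++] = e;
    else
        free(e);
}
```

The design point the model tries to show is that freed elements top up the reserve before anything is returned to the underlying allocator, which is what lets later allocations succeed under memory pressure.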
DMA pools¶
-
struct dma_pool *
dma_pool_create
(const char *name, struct device *dev, size_t size, size_t align, size_t boundary)¶ Creates a pool of consistent memory blocks, for dma.
Parameters
const char *name
name of pool, for diagnostics
struct device *dev
device that will be doing the DMA
size_t size
size of the blocks in this pool.
size_t align
alignment requirement for blocks; must be a power of two
size_t boundary
returned blocks won’t cross this power of two boundary
Context
not in_interrupt()
Description
Given one of these pools, dma_pool_alloc() may be used to allocate memory. Such memory will all have “consistent” DMA mappings, accessible by the device and its driver without using cache flushing primitives. The actual size of blocks allocated may be larger than requested because of alignment.
If boundary is nonzero, objects returned from dma_pool_alloc() won’t cross that size boundary. This is useful for devices which have addressing restrictions on individual DMA transfers, such as not crossing boundaries of 4KBytes.
Return
a dma allocation pool with the requested characteristics, or NULL if one can’t be created.
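What "won't cross this power of two boundary" means for block placement can be shown with a short userspace sketch. The helpers crosses_boundary() and place_block() are hypothetical names introduced here, offsets are relative to the start of a suitably aligned allocation, and the align parameter is ignored for simplicity; the pool's real placement logic may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool is_pow2(size_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* True if the block [offset, offset + size) spans a multiple of boundary. */
static bool crosses_boundary(size_t offset, size_t size, size_t boundary)
{
    assert(size > 0 && is_pow2(boundary));
    return (offset & ~(boundary - 1)) !=
           ((offset + size - 1) & ~(boundary - 1));
}

/*
 * Lowest offset >= want at which a block of size bytes does not cross
 * a boundary multiple. Requires size <= boundary.
 */
static size_t place_block(size_t want, size_t size, size_t boundary)
{
    assert(size <= boundary);
    if (crosses_boundary(want, size, boundary))
        return (want + boundary - 1) & ~(boundary - 1);  /* round up */
    return want;
}
```

With a 4096-byte boundary, a 200-byte block requested at offset 4000 would straddle offset 4096, so this model moves it up to start at 4096.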
-
void
dma_pool_destroy
(struct dma_pool *pool)¶ destroys a pool of dma memory blocks.
Parameters
struct dma_pool *pool
dma pool that will be destroyed
Context
!in_interrupt()
Description
Caller guarantees that no more memory from the pool is in use, and that nothing will try to use the pool after this call.
-
void *
dma_pool_alloc
(struct dma_pool *pool, gfp_t mem_flags, dma_addr_t *handle)¶ get a block of consistent memory
Parameters
struct dma_pool *pool
dma pool that will produce the block
gfp_t mem_flags
GFP_* bitmask
dma_addr_t *handle
pointer to dma address of block
Return
the kernel virtual address of a currently unused block, and reports its dma address through the handle. If such a memory block can’t be allocated, NULL is returned.
-
void
dma_pool_free
(struct dma_pool *pool, void *vaddr, dma_addr_t dma)¶ put block back into dma pool
Parameters
struct dma_pool *pool
the dma pool holding the block
void *vaddr
virtual address of block
dma_addr_t dma
dma address of block
Description
Caller promises neither device nor driver will again touch this block unless it is first re-allocated.
-
struct dma_pool *
dmam_pool_create
(const char *name, struct device *dev, size_t size, size_t align, size_t allocation)¶ Managed dma_pool_create()
Parameters
const char *name
name of pool, for diagnostics
struct device *dev
device that will be doing the DMA
size_t size
size of the blocks in this pool.
size_t align
alignment requirement for blocks; must be a power of two
size_t allocation
returned blocks won’t cross this boundary (or zero)
Description
Managed dma_pool_create(). A DMA pool created with this function is automatically destroyed on driver detach.
Return
a managed dma allocation pool with the requested characteristics, or NULL if one can’t be created.
-
void
dmam_pool_destroy
(struct dma_pool *pool)¶ Managed dma_pool_destroy()
Parameters
struct dma_pool *pool
dma pool that will be destroyed
Description
Managed dma_pool_destroy().
More Memory Management Functions¶
-
void
zap_vma_ptes
(struct vm_area_struct *vma, unsigned long address, unsigned long size)¶ remove ptes mapping the vma
Parameters
struct vm_area_struct *vma
vm_area_struct holding ptes to be zapped
unsigned long address
starting address of pages to zap
unsigned long size
number of bytes to zap
Description
This function only unmaps ptes assigned to VM_PFNMAP vmas.
The entire address range must be fully contained within the vma.
-
int
vm_insert_pages
(struct vm_area_struct *vma, unsigned long addr, struct page **pages, unsigned long *num)¶ insert multiple pages into user vma, batching the pmd lock.
Parameters
struct vm_area_struct *vma
user vma to map to
unsigned long addr
target start user address of these pages
struct page **pages
source kernel pages
unsigned long *num
in: number of pages to map. out: number of pages that were not mapped. (0 means all pages were successfully mapped).
Description
Preferred over vm_insert_page() when inserting multiple pages.
In case of error, we may have mapped a subset of the provided pages. It is the caller’s responsibility to account for this case.
The same restrictions apply as in vm_insert_page().
-
int
vm_insert_page
(struct vm_area_struct *vma, unsigned long addr, struct page *page)¶ insert single page into user vma
Parameters
struct vm_area_struct *vma
user vma to map to
unsigned long addr
target user address of this page
struct page *page
source kernel page
Description
This allows drivers to insert individual pages they’ve allocated into a user vma.
The page has to be a nice clean _individual_ kernel allocation. If you allocate a compound page, you need to have marked it as such (__GFP_COMP), or manually just split the page up yourself (see split_page()).
NOTE! Traditionally this was done with “remap_pfn_range()” which took an arbitrary page protection parameter. This doesn’t allow that. Your vma protection will have to be set up correctly, which means that if you want a shared writable mapping, you’d better ask for a shared writable mapping!
The page does not need to be reserved.
Usually this function is called from f_op->mmap() handler under mm->mmap_lock write-lock, so it can change vma->vm_flags. Caller must set VM_MIXEDMAP on vma if it wants to call this function from other places, for example from page-fault handler.
Return
0 on success, negative error code otherwise.
-
int
vm_map_pages
(struct vm_area_struct *vma, struct page **pages, unsigned long num)¶ map a range of kernel pages starting at a non-zero offset
Parameters
struct vm_area_struct *vma
user vma to map to
struct page **pages
pointer to array of source kernel pages
unsigned long num
number of pages in page array
Description
Maps an object consisting of num pages, catering for the user’s requested vm_pgoff.
If we fail to insert any page into the vma, the function will return immediately leaving any previously inserted pages present. Callers from the mmap handler may immediately return the error as their caller will destroy the vma, removing any successfully inserted pages. Other callers should make their own arrangements for calling unmap_region().
Context
Process context. Called by mmap handlers.
Return
0 on success and error code otherwise.
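"Catering for the user's requested vm_pgoff" implies a bounds check: the requested offset must fall inside the object, and the mapping must not extend past the end of the page array. The following userspace sketch models such a check. It is an assumption-laden illustration rather than the kernel code: model_map_window() is a hypothetical helper, and the use of -ENXIO as the error value is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Decide which slice of a num-page object backs a mapping that is
 * vma_pages long and starts pgoff pages into the object. On success,
 * returns 0 and stores the first page-array index in *first. Returns
 * -ENXIO (assumed error code) if the request does not fit.
 */
static int model_map_window(unsigned long vma_pages, unsigned long pgoff,
                            unsigned long num, unsigned long *first)
{
    assert(first != NULL);

    if (pgoff >= num)               /* offset beyond the end of the object */
        return -ENXIO;
    if (vma_pages > num - pgoff)    /* mapping longer than what remains */
        return -ENXIO;

    *first = pgoff;
    return 0;
}
```

In this model, mapping 4 pages at offset 2 of an 8-page object uses array entries 2 through 5, while asking for 7 pages at the same offset is rejected.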
-
int
vm_map_pages_zero
(struct vm_area_struct *vma, struct page **pages, unsigned long num)¶ map a range of kernel pages starting at offset zero
Parameters
struct vm_area_struct *vma
user vma to map to
struct page **pages
pointer to array of source kernel pages
unsigned long num
number of pages in page array
Description
Similar to vm_map_pages(), except that it explicitly sets the offset to 0. This function is intended for the drivers that did not consider vm_pgoff.
Context
Process context. Called by mmap handlers.
Return
0 on success and error code otherwise.
-
vm_fault_t
vmf_insert_pfn_prot
(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, pgprot_t pgprot)¶ insert single pfn into user vma with specified pgprot
Parameters
struct vm_area_struct *vma
user vma to map to
unsigned long addr
target user address of this page
unsigned long pfn
source kernel pfn
pgprot_t pgprot
pgprot flags for the inserted page
Description
This is exactly like vmf_insert_pfn(), except that it allows drivers to override pgprot on a per-page basis.
This only makes sense for IO mappings, and it makes no sense for COW mappings. In general, using multiple vmas is preferable; vmf_insert_pfn_prot should only be used if using multiple VMAs is impractical.
See vmf_insert_mixed_prot() for a discussion of the implication of using a value of pgprot different from that of vma->vm_page_prot.
Context
Process context. May allocate using GFP_KERNEL.
Return
vm_fault_t value.
-
vm_fault_t
vmf_insert_pfn
(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)¶ insert single pfn into user vma
Parameters
struct vm_area_struct *vma
user vma to map to
unsigned long addr
target user address of this page
unsigned long pfn
source kernel pfn
Description
Similar to vm_insert_page, this allows drivers to insert individual pages they’ve allocated into a user vma. Same comments apply.
This function should only be called from a vm_ops->fault handler, and in that case the handler should return the result of this function.
vma cannot be a COW mapping.
As this is called only for pages that do not currently exist, we do not need to flush old virtual caches or the TLB.
Context
Process context. May allocate using GFP_KERNEL.
Return
vm_fault_t value.
-
vm_fault_t
vmf_insert_mixed_prot
(struct vm_area_struct *vma, unsigned long addr, pfn_t pfn, pgprot_t pgprot)¶ insert single pfn into user vma with specified pgprot
Parameters
struct vm_area_struct *vma
user vma to map to
unsigned long addr
target user address of this page
pfn_t pfn
source kernel pfn
pgprot_t pgprot
pgprot flags for the inserted page
Description
This is exactly like vmf_insert_mixed(), except that it allows drivers to override pgprot on a per-page basis.
Typically this function should be used by drivers to set caching- and encryption bits different than those of vma->vm_page_prot, because the caching- or encryption mode may not be known at mmap() time. This is ok as long as vma->vm_page_prot is not used by the core vm to set caching and encryption bits for those vmas (except for COW pages). This is ensured by core vm only modifying these page table entries using functions that don’t touch caching- or encryption bits, using pte_modify() if needed. (See for example mprotect()). Also when new page-table entries are created, this is only done using the fault() callback, and never using the value of vma->vm_page_prot, except for page-table entries that point to anonymous pages as the result of COW.
Context
Process context. May allocate using GFP_KERNEL.
Return
vm_fault_t value.
-
int
remap_pfn_range
(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t prot)¶ remap kernel memory to userspace
Parameters
struct vm_area_struct *vma
user vma to map to
unsigned long addr
target page aligned user address to start at
unsigned long pfn
page frame number of kernel physical memory address
unsigned long size
size of mapping area
pgprot_t prot
page protection flags for this mapping
Note
this is only safe if the mm semaphore is held when called.
Return
0 on success, negative error code otherwise.
-
int
vm_iomap_memory
(struct vm_area_struct *vma, phys_addr_t start, unsigned long len)¶ remap memory to userspace
Parameters
struct vm_area_struct *vma
user vma to map to
phys_addr_t start
start of the physical memory to be mapped
unsigned long len
size of area
Description
This is a simplified io_remap_pfn_range() for common driver use. The driver just needs to give us the physical memory range to be mapped, we’ll figure out the rest from the vma information.
NOTE! Some drivers might want to tweak vma->vm_page_prot first to get whatever write-combining details or similar.
Return
0 on success, negative error code otherwise.
-
void
unmap_mapping_pages
(struct address_space *mapping, pgoff_t start, pgoff_t nr, bool even_cows)¶ Unmap pages from processes.
Parameters
struct address_space *mapping
The address space containing pages to be unmapped.
pgoff_t start
Index of first page to be unmapped.
pgoff_t nr
Number of pages to be unmapped. 0 to unmap to end of file.
bool even_cows
Whether to unmap even private COWed pages.
Description
Unmap the pages in this address space from any userspace process which has them mmaped. Generally, you want to remove COWed pages as well when a file is being truncated, but not when invalidating pages from the page cache.
-
void
unmap_mapping_range
(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows)¶ unmap the portion of all mmaps in the specified address_space corresponding to the specified byte range in the underlying file.
Parameters
struct address_space *mapping
the address space containing mmaps to be unmapped.
loff_t const holebegin
byte in first page to unmap, relative to the start of the underlying file. This will be rounded down to a PAGE_SIZE boundary. Note that this is different from truncate_pagecache(), which must keep the partial page. In contrast, we must get rid of partial pages.
loff_t const holelen
size of prospective hole in bytes. This will be rounded up to a PAGE_SIZE boundary. A holelen of zero truncates to the end of the file.
int even_cows
1 when truncating a file, unmap even private COWed pages; but 0 when invalidating pagecache, don’t throw away private data.
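The rounding rules for holebegin and holelen can be sketched in user space. This is only an illustration of the arithmetic described above, not the kernel code: the 4 KiB page size and the names sketch_page_span and sketch_hole_to_pages are assumptions made for the example.

```c
#include <stdint.h>

/* Assumed page geometry for illustration; real values are per-architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Hypothetical result type: a page-granular view of the byte range. */
struct sketch_page_span {
    uint64_t first_index; /* index of the first page to unmap */
    uint64_t nr_pages;    /* page count; 0 stands for "to end of file" */
};

static struct sketch_page_span sketch_hole_to_pages(uint64_t holebegin,
                                                    uint64_t holelen)
{
    struct sketch_page_span span;

    /* holebegin is rounded down to a page boundary. */
    span.first_index = holebegin >> SKETCH_PAGE_SHIFT;
    /* holelen is rounded up to a whole number of pages. */
    span.nr_pages = (holelen + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
    return span;
}
```

Note that the two values are rounded independently, so a holelen of zero stays zero and keeps its "to the end of the file" meaning.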
-
int
follow_pte
(struct mm_struct *mm, unsigned long address, pte_t **ptepp, spinlock_t **ptlp)¶ look up PTE at a user virtual address
Parameters
struct mm_struct *mm
the mm_struct of the target address space
unsigned long address
user virtual address
pte_t **ptepp
location to store found PTE
spinlock_t **ptlp
location to store the lock for the PTE
Description
On a successful return, the pointer to the PTE is stored in ptepp; the corresponding lock is taken and its location is stored in ptlp. The contents of the PTE are only stable until ptlp is released; any further use, if any, must be protected against invalidation with MMU notifiers.
Only IO mappings and raw PFN mappings are allowed. The mmap semaphore should be taken for read.
KVM uses this function. While it is arguably less bad than follow_pfn, it is not a good general-purpose API.
Return
zero on success, -ve otherwise.
-
int
follow_pfn
(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn)¶ look up PFN at a user virtual address
Parameters
struct vm_area_struct *vma
memory mapping
unsigned long address
user virtual address
unsigned long *pfn
location to store found PFN
Description
Only IO mappings and raw PFN mappings are allowed.
This function does not allow the caller to read the permissions of the PTE. Do not use it.
Return
zero and the pfn at pfn on success, -ve otherwise.
-
int
generic_access_phys
(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write)¶ generic implementation for iomem mmap access
Parameters
struct vm_area_struct *vma
the vma to access
unsigned long addr
userspace address, not relative offset within vma
void *buf
buffer to read/write
int len
length of transfer
int write
set to FOLL_WRITE when writing, otherwise reading
Description
This is a generic implementation for vm_operations_struct.access for an iomem mapping. This callback is used by access_process_vm() when the vma is not page based.
-
unsigned long
get_pfnblock_flags_mask
(const struct page *page, unsigned long pfn, unsigned long mask)¶ Return the requested group of flags for the pageblock_nr_pages block of pages
Parameters
const struct page *page
The page within the block of interest
unsigned long pfn
The target page frame number
unsigned long mask
mask of bits that the caller is interested in
Return
pageblock_bits flags
-
void
set_pfnblock_flags_mask
(struct page *page, unsigned long flags, unsigned long pfn, unsigned long mask)¶ Set the requested group of flags for a pageblock_nr_pages block of pages
Parameters
struct page *page
The page within the block of interest
unsigned long flags
The flags to set
unsigned long pfn
The target page frame number
unsigned long mask
mask of bits that the caller is interested in
-
void
__putback_isolated_page
(struct page *page, unsigned int order, int mt)¶ Return a now-isolated page back where we got it
Parameters
struct page *page
Page that was isolated
unsigned int order
Order of the isolated page
int mt
The page’s pageblock’s migratetype
Description
This function is meant to return a page pulled from the free lists via __isolate_free_page back to the free lists they were pulled from.
-
void
__free_pages
(struct page *page, unsigned int order)¶ Free pages allocated with alloc_pages().
Parameters
struct page *page
The page pointer returned from alloc_pages().
unsigned int order
The order of the allocation.
Description
This function can free multi-page allocations that are not compound pages. It does not check that the order passed in matches that of the allocation, so it is easy to leak memory. Freeing more memory than was allocated will probably emit a warning.
If the last reference to this page is speculative, it will be released by put_page() which only frees the first page of a non-compound allocation. To prevent the remaining pages from being leaked, we free the subsequent pages here. If you want to use the page’s reference count to decide when to free the allocation, you should allocate a compound page, and use put_page() instead of __free_pages().
Context
May be called in interrupt context or while holding a normal spinlock, but not in NMI context or while holding a raw spinlock.
-
void *
alloc_pages_exact
(size_t size, gfp_t gfp_mask)¶ allocate an exact number of physically-contiguous pages.
Parameters
size_t size
the number of bytes to allocate
gfp_t gfp_mask
GFP flags for the allocation, must not contain __GFP_COMP
Description
This function is similar to alloc_pages(), except that it allocates the minimum number of pages to satisfy the request. alloc_pages() can only allocate memory in power-of-two pages.
This function is also limited by MAX_ORDER.
Memory allocated by this function must be released by free_pages_exact().
Return
pointer to the allocated area or NULL in case of error.
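The difference between an exact allocation and a power-of-two allocation can be illustrated with a little arithmetic. The sketch below is a user-space model, not kernel code: the 4 KiB page size and the sketch_* helper names are assumptions introduced for the example.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed page size for illustration */

/* Pages needed when the request is only rounded up to whole pages. */
static unsigned long sketch_pages_exact(size_t size)
{
    return (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Smallest order such that (1 << order) pages cover the request. */
static unsigned int sketch_order_for(size_t size)
{
    unsigned long pages = sketch_pages_exact(size);
    unsigned int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}

/* Pages consumed by a power-of-two allocation covering the request. */
static unsigned long sketch_pages_pow2(size_t size)
{
    return 1UL << sketch_order_for(size);
}
```

Under these assumptions a request of five pages' worth of bytes needs five pages when rounded exactly, but eight pages (order 3) when rounded to a power of two.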
-
void *
alloc_pages_exact_nid
(int nid, size_t size, gfp_t gfp_mask)¶ allocate an exact number of physically-contiguous pages on a node.
Parameters
int nid
the preferred node ID where memory should be allocated
size_t size
the number of bytes to allocate
gfp_t gfp_mask
GFP flags for the allocation, must not contain __GFP_COMP
Description
Like alloc_pages_exact(), but try to allocate on node nid first before falling back.
Return
pointer to the allocated area or NULL in case of error.
-
void
free_pages_exact
(void *virt, size_t size)¶ release memory allocated via alloc_pages_exact()
Parameters
void *virt
the value returned by alloc_pages_exact.
size_t size
size of allocation, same value as passed to alloc_pages_exact().
Description
Release the memory allocated by a previous call to alloc_pages_exact.
-
unsigned long
nr_free_zone_pages
(int offset)¶ count number of pages beyond high watermark
Parameters
int offset
The zone index of the highest zone
Description
nr_free_zone_pages() counts the number of pages which are beyond the high watermark within all zones at or below a given zone index. For each zone, the number of pages is calculated as:
nr_free_zone_pages = managed_pages - high_pages
Return
number of pages beyond high watermark.
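As a rough illustration of the per-zone formula, the following user-space sketch sums managed_pages - high_pages over all zones up to a given index. The struct sketch_zone type and the function name are hypothetical, and skipping zones whose managed pages do not exceed the high watermark is an assumption made here to avoid unsigned underflow.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-zone counters. */
struct sketch_zone {
    unsigned long managed_pages;
    unsigned long high_wmark;
};

static unsigned long sketch_nr_free_zone_pages(const struct sketch_zone *zones,
                                               size_t nr_zones,
                                               size_t highest_idx)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < nr_zones && i <= highest_idx; i++) {
        /* Assumption: zones at or below their high watermark add nothing. */
        if (zones[i].managed_pages > zones[i].high_wmark)
            sum += zones[i].managed_pages - zones[i].high_wmark;
    }
    return sum;
}
```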
-
unsigned long
nr_free_buffer_pages
(void)¶ count number of pages beyond high watermark
Parameters
void
no arguments
Description
nr_free_buffer_pages() counts the number of pages which are beyond the high watermark within ZONE_DMA and ZONE_NORMAL.
Return
number of pages beyond high watermark within ZONE_DMA and ZONE_NORMAL.
-
int
find_next_best_node
(int node, nodemask_t *used_node_mask)¶ find the next node that should appear in a given node’s fallback list
Parameters
int node
node whose fallback list we’re appending
nodemask_t *used_node_mask
nodemask_t of already used nodes
Description
We use a number of factors to determine which is the next node that should appear on a given node’s fallback list. The node should not have appeared already in node’s fallback list, and it should be the next closest node according to the distance array (which contains arbitrary distance values from each node to each node in the system), and should also prefer nodes with no CPUs, since presumably they’ll have very little allocation pressure on them otherwise.
Return
node id of the found node or NUMA_NO_NODE if no node is found.
-
void
get_pfn_range_for_nid
(unsigned int nid, unsigned long *start_pfn, unsigned long *end_pfn)¶ Return the start and end page frames for a node
Parameters
unsigned int nid
The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned.
unsigned long *start_pfn
Passed by reference. On return, it will have the node start_pfn.
unsigned long *end_pfn
Passed by reference. On return, it will have the node end_pfn.
Description
It returns the start and end page frame of a node based on information provided by memblock_set_node(). If called for a node with no available memory, a warning is printed and the start and end PFNs will be 0.
-
unsigned long
absent_pages_in_range
(unsigned long start_pfn, unsigned long end_pfn)¶ Return number of page frames in holes within a range
Parameters
unsigned long start_pfn
The start PFN to start searching for holes
unsigned long end_pfn
The end PFN to stop searching for holes
Return
the number of page frames in memory holes within a range.
-
unsigned long
node_map_pfn_alignment
(void)¶ determine the maximum internode alignment
Parameters
void
no arguments
Description
This function should be called after node map is populated and sorted. It calculates the maximum power of two alignment which can distinguish all the nodes.
For example, if all nodes are 1GiB and aligned to 1GiB, the return value would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the nodes are shifted by 256MiB, the return value would indicate 256MiB alignment. Note that if only the last node is shifted, 1GiB is enough and this function will indicate so.
This is used to test whether pfn -> nid mapping of the chosen memory model has fine enough granularity to avoid incorrect mapping for the populated node map.
Return
the determined alignment in pfn’s. 0 if there is no alignment requirement (single node).
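The examples above can be reproduced with a small user-space model. This is a sketch of one way to compute such an alignment, not the kernel implementation: the struct sketch_range type, the sketch_* function names, and the assumption that ranges arrive sorted by start PFN with non-negative node IDs are all introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one memory range: [start_pfn, end_pfn) on nid. */
struct sketch_range {
    uint64_t start_pfn;
    uint64_t end_pfn;
    int nid;
};

/* Index of the lowest set bit; v must be nonzero. */
static unsigned int sketch_lowest_set_bit(uint64_t v)
{
    unsigned int bit = 0;

    while (!(v & 1)) {
        v >>= 1;
        bit++;
    }
    return bit;
}

/* Ranges are assumed to be sorted by start_pfn. */
static uint64_t sketch_node_map_pfn_alignment(const struct sketch_range *r,
                                              size_t n)
{
    uint64_t accumulated = 0, last_end = 0;
    int last_nid = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        uint64_t start = r[i].start_pfn;

        if (start && last_nid >= 0 && last_nid != r[i].nid) {
            /* Start with the finest mask that pins down the start PFN... */
            uint64_t mask = ~((UINT64_C(1) << sketch_lowest_set_bit(start)) - 1);

            /* ...and coarsen it while it still separates the two nodes. */
            while (mask && last_end <= (start & (mask << 1)))
                mask <<= 1;
            accumulated |= mask;
        }
        last_nid = r[i].nid;
        last_end = r[i].end_pfn;
    }
    /* Convert the accumulated mask to a number of pages (0 if single node). */
    return ~accumulated + 1;
}
```

With 4 KiB pages, 1GiB corresponds to 0x40000 PFNs and 256MiB to 0x10000 PFNs, which are the values this model yields for the aligned and shifted layouts respectively.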
-
unsigned long
find_min_pfn_with_active_regions
(void)¶ Find the minimum PFN registered
Parameters
void
no arguments
Return
the minimum PFN based on information provided via memblock_set_node().
-
void
free_area_init
(unsigned long *max_zone_pfn)¶ Initialise all pg_data_t and zone data
Parameters
unsigned long *max_zone_pfn
an array of max PFNs for each zone
Description
This will call free_area_init_node() for each active node in the system.
Using the page ranges provided by memblock_set_node(), the size of each zone in each node and their holes is calculated. If the maximum PFN between two adjacent zones match, it is assumed that the zone is empty. For example, if arch_max_dma_pfn == arch_max_dma32_pfn, it is assumed that arch_max_dma32_pfn has no pages. It is also assumed that a zone starts where the previous one ended. For example, ZONE_DMA32 starts at arch_max_dma_pfn.
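The two assumptions (a zone starts where the previous one ended, and matching maximum PFNs mean an empty zone) can be sketched as follows. This is a simplified user-space model; struct sketch_zone_extent, sketch_zone_extents and the lowest_pfn parameter are hypothetical names chosen for the example.

```c
#include <stddef.h>

/* Hypothetical per-zone result: the PFN range [start_pfn, end_pfn). */
struct sketch_zone_extent {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

static void sketch_zone_extents(const unsigned long *max_zone_pfn,
                                size_t nr_zones, unsigned long lowest_pfn,
                                struct sketch_zone_extent *out)
{
    unsigned long start = lowest_pfn;
    size_t i;

    for (i = 0; i < nr_zones; i++) {
        /* A zone whose maximum does not exceed its start is empty. */
        unsigned long end = max_zone_pfn[i] > start ? max_zone_pfn[i] : start;

        out[i].start_pfn = start;
        out[i].end_pfn = end;
        start = end; /* the next zone starts where this one ended */
    }
}
```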
-
void
set_dma_reserve
(unsigned long new_dma_reserve)¶ set the specified number of pages reserved in the first zone
Parameters
unsigned long new_dma_reserve
The number of pages to mark reserved
Description
The per-cpu batchsize and zone watermarks are determined by managed_pages. In the DMA zone, a significant percentage may be consumed by kernel image and other unfreeable allocations which can skew the watermarks badly. This function may optionally be used to account for unfreeable pages in the first zone (e.g., ZONE_DMA). The effect will be lower watermarks and smaller per-cpu batchsize.
-
void
setup_per_zone_wmarks
(void)¶ called when min_free_kbytes changes or when memory is hot-{added|removed}
Parameters
void
no arguments
Description
Ensures that the watermark[min,low,high] values for each zone are set correctly with respect to min_free_kbytes.
-
int
alloc_contig_range
(unsigned long start, unsigned long end, unsigned migratetype, gfp_t gfp_mask)¶ tries to allocate given range of pages
Parameters
unsigned long start
start PFN to allocate
unsigned long end
one-past-the-last PFN to allocate
unsigned migratetype
migratetype of the underlying pageblocks (either #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks in range must have the same migratetype and it must be either of the two.
gfp_t gfp_mask
GFP mask to use during compaction
Description
The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES aligned. The PFN range must belong to a single zone.
The first thing this routine does is attempt to MIGRATE_ISOLATE all pageblocks in the range. Once isolated, the pageblocks should not be modified by others.
Return
zero on success or negative error code. On success, all pages whose PFN is in [start, end) are allocated for the caller and need to be freed with free_contig_range().
-
struct page *
alloc_contig_pages
(unsigned long nr_pages, gfp_t gfp_mask, int nid, nodemask_t *nodemask)¶ tries to find and allocate contiguous range of pages
Parameters
unsigned long nr_pages
Number of contiguous pages to allocate
gfp_t gfp_mask
GFP mask to limit search and used during compaction
int nid
Target node
nodemask_t *nodemask
Mask for other possible nodes
Description
This routine is a wrapper around alloc_contig_range(). It scans over zones on an applicable zonelist to find a contiguous pfn range which can then be tried for allocation with alloc_contig_range(). This routine is intended for allocation requests which can not be fulfilled with the buddy allocator.
The allocated memory is always aligned to a page boundary. If nr_pages is a power of two then the alignment is guaranteed to be to the given nr_pages (e.g. 1GB request would be aligned to 1GB).
Allocated pages can be freed with free_contig_range() or by manually calling __free_page() on each allocated page.
Return
pointer to contiguous pages on success, or NULL if not successful.
-
int
numa_map_to_online_node
(int node)¶ Find closest online node
Parameters
int node
Node id to start the search
Description
Look up the next closest node by distance if node is not online.
-
struct page *
alloc_pages_vma
(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr, int node, bool hugepage)¶ Allocate a page for a VMA.
Parameters
gfp_t gfp
GFP flags.
int order
Order of the GFP allocation.
struct vm_area_struct *vma
Pointer to VMA or NULL if not available.
unsigned long addr
Virtual address of the allocation. Must be inside vma.
int node
Which node to prefer for allocation (modulo policy).
bool hugepage
For hugepages try only the preferred node if possible.
Description
Allocate a page for a specific address in vma, using the appropriate NUMA policy. When vma is not NULL the caller must hold the mmap_lock of the mm_struct of the VMA to prevent it from going away. Should be used for all allocations for pages that will be mapped into user space.
Return
The page on success or NULL if allocation fails.
Parameters
gfp_t gfp
GFP flags.
unsigned order
Power of two of number of pages to allocate.
Description
Allocate 1 << order contiguous pages. The physical address of the first page is naturally aligned (eg an order-3 allocation will be aligned to a multiple of 8 * PAGE_SIZE bytes). The NUMA policy of the current process is honoured when in process context.
Context
Can be called from any context, providing the appropriate GFP flags are used.
Return
The page on success or NULL if allocation fails.
-
int
mpol_misplaced
(struct page *page, struct vm_area_struct *vma, unsigned long addr)¶ check whether current page node is valid in policy
Parameters
struct page *page
page to be checked
struct vm_area_struct *vma
vm area where page mapped
unsigned long addr
virtual address where page mapped
Description
Lookup current policy node id for vma,addr and “compare to” page’s node id. Policy determination “mimics” alloc_page_vma(). Called from fault path where we know the vma and faulting address.
Return
NUMA_NO_NODE if the page is in a node that is valid for this policy, or a suitable node ID to allocate a replacement page from.
initialize shared policy for inode
Parameters
struct shared_policy *sp
pointer to inode shared policy
struct mempolicy *mpol
struct mempolicy to install
Description
Install non-NULL mpol in inode’s shared policy rb-tree. On entry, the current task has a reference on a non-NULL mpol. This must be released on exit. This is called at get_inode() calls and we can use GFP_KERNEL.
-
int
mpol_parse_str
(char *str, struct mempolicy **mpol)¶ parse string to mempolicy, for tmpfs mpol mount option.
Parameters
char *str
string containing mempolicy to parse
struct mempolicy **mpol
pointer to struct mempolicy pointer, returned on success.
Description
Format of input: <mode>[=<flags>][:<nodelist>]
Returns 0 on success, 1 otherwise.
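To make the input format concrete, here is a user-space sketch that splits such a string into its three parts. It only illustrates the <mode>[=<flags>][:<nodelist>] layout and does not validate modes, flags or node lists; struct sketch_mpol_parts and sketch_split_mpol_str are hypothetical names.

```c
#include <string.h>

/* Hypothetical holder for the three parts; absent parts are NULL. */
struct sketch_mpol_parts {
    char *mode;
    char *flags;
    char *nodelist;
};

/* Splits str in place by overwriting the '=' and ':' separators. */
static void sketch_split_mpol_str(char *str, struct sketch_mpol_parts *out)
{
    char *nodelist = strchr(str, ':');
    char *flags;

    if (nodelist)
        *nodelist++ = '\0'; /* cut off ":<nodelist>" first */
    flags = strchr(str, '=');
    if (flags)
        *flags++ = '\0';    /* then cut off "=<flags>" */

    out->mode = str;
    out->flags = flags;
    out->nodelist = nodelist;
}
```

For example, "interleave=relative:0-3" would split into the mode "interleave", the flag "relative" and the node list "0-3".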
-
void
mpol_to_str
(char *buffer, int maxlen, struct mempolicy *pol)¶ format a mempolicy structure for printing
Parameters
char *buffer
to contain formatted mempolicy string
int maxlen
length of buffer
struct mempolicy *pol
pointer to mempolicy to be formatted
Description
Convert pol into a string. If buffer is too short, truncate the string. Recommend a maxlen of at least 32 for the longest mode, “interleave”, the longest flag, “relative”, and to display at least a few node ids.
-
struct
folio
¶ Represents a contiguous set of bytes.
Definition
struct folio {
unsigned long flags;
struct list_head lru;
struct address_space *mapping;
pgoff_t index;
void *private;
atomic_t _mapcount;
atomic_t _refcount;
#ifdef CONFIG_MEMCG
unsigned long memcg_data;
#endif
};
Members
flags
Identical to the page flags.
lru
Least Recently Used list; tracks how recently this folio was used.
mapping
The file this page belongs to, or refers to the anon_vma for anonymous memory.
index
Offset within the file, in units of pages. For anonymous memory, this is the index from the beginning of the mmap.
private
Filesystem per-folio data (see folio_attach_private()). Used for swp_entry_t if folio_test_swapcache().
_mapcount
Do not access this member directly. Use folio_mapcount() to find out how many times this folio is mapped by userspace.
_refcount
Do not access this member directly. Use folio_ref_count() to find how many references there are to this folio.
memcg_data
Memory Control Group data.
Description
A folio is a physically, virtually and logically contiguous set of bytes. It is a power-of-two in size, and it is aligned to that same power-of-two. It is at least as large as PAGE_SIZE. If it is in the page cache, it is at a file offset which is a multiple of that power-of-two. It may be mapped into userspace at an address which is at an arbitrary page offset, but its kernel virtual address is aligned to its size.
-
typedef
vm_fault_t
¶ Return type for page fault handlers.
Description
Page fault handlers return a bitmask of VM_FAULT values.
-
enum
vm_fault_reason
¶ Page fault handlers return a bitmask of these values to tell the core VM what happened when handling the fault. Used to decide whether a process gets delivered SIGBUS or just gets major/minor fault counters bumped up.
Constants
VM_FAULT_OOM
Out Of Memory
VM_FAULT_SIGBUS
Bad access
VM_FAULT_MAJOR
Page read from storage
VM_FAULT_WRITE
Special case for get_user_pages
VM_FAULT_HWPOISON
Hit poisoned small page
VM_FAULT_HWPOISON_LARGE
Hit poisoned large page. Index encoded in upper bits
VM_FAULT_SIGSEGV
segmentation fault
VM_FAULT_NOPAGE
->fault installed the pte, not return page
VM_FAULT_LOCKED
->fault locked the returned page
VM_FAULT_RETRY
->fault blocked, must retry
VM_FAULT_FALLBACK
huge page fault failed, fall back to small
VM_FAULT_DONE_COW
->fault has fully handled COW
VM_FAULT_NEEDDSYNC
->fault did not modify page tables and needs fsync() to complete (for synchronous page faults in DAX)
VM_FAULT_HINDEX_MASK
mask HINDEX value
Parameters
struct folio *folio
The folio to test.
Description
We would like to get this info without a page flag, but the state needs to survive until the folio is last deleted from the LRU, which could be as far down as __page_cache_release.
Return
An integer (not a boolean!) used to sort a folio onto the right LRU list and to account folios correctly. 1 if folio is a regular filesystem backed page cache folio or a lazily freed anonymous folio (e.g. via MADV_FREE). 0 if folio is a normal anonymous folio, a tmpfs folio or otherwise ram or swap backed folio.
Parameters
struct folio *folio
The folio that was on lru and now has a zero reference.
Parameters
struct folio *folio
The folio to test.
Return
The LRU list a folio should be on, as an index into the array of LRU lists.
-
page_folio
(p)¶ Converts from page to folio.
Parameters
p
The page.
Description
Every page is part of a folio. This function cannot be called on a NULL pointer.
Context
No reference, nor lock is required on page. If the caller does not hold a reference, this call may race with a folio split, so it should re-check the folio still contains this page after gaining a reference on the folio.
Return
The folio which contains this page.
-
folio_page
(folio, n)¶ Return a page from a folio.
Parameters
folio
The folio.
n
The page number to return.
Description
n is relative to the start of the folio. This function does not check that the page number lies within folio; the caller is presumed to have a reference to the page.
Parameters
struct folio *folio
The folio to test.
Return
True if the folio is larger than one page.
-
int
page_has_private
(struct page *page)¶ Determine if page has private stuff
Parameters
struct page *page
The page to be checked
Description
Determine if a page has private stuff, indicating that release routines should be invoked upon it.
-
enum
fault_flag
¶ Fault flag definitions.
Constants
FAULT_FLAG_WRITE
Fault was a write fault.
FAULT_FLAG_MKWRITE
Fault was mkwrite of existing PTE.
FAULT_FLAG_ALLOW_RETRY
Allow to retry the fault if blocked.
FAULT_FLAG_RETRY_NOWAIT
Don’t drop mmap_lock and wait when retrying.
FAULT_FLAG_KILLABLE
The fault task is in SIGKILL killable region.
FAULT_FLAG_TRIED
The fault has been tried once.
FAULT_FLAG_USER
The fault originated in userspace.
FAULT_FLAG_REMOTE
The fault is not for current task/mm.
FAULT_FLAG_INSTRUCTION
The fault was during an instruction fetch.
FAULT_FLAG_INTERRUPTIBLE
The fault can be interrupted by non-fatal signals.
Description
About FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED: we can specify whether we would allow page faults to retry by specifying these two fault flags correctly. Currently there can be three legal combinations:
- (a) ALLOW_RETRY and !TRIED: this means the page fault allows retry, and this is the first try
- (b) ALLOW_RETRY and TRIED: this means the page fault allows retry, and we’ve already tried at least once
- (c) !ALLOW_RETRY and !TRIED: this means the page fault does not allow retry
The unlisted combination (!ALLOW_RETRY && TRIED) is illegal and should never be used. Note that page faults can be allowed to retry multiple times, in which case we’ll have an initial fault with flags (a) then later on continuous faults with flags (b). We should always try to detect pending signals before a retry to make sure the continuous page faults can still be interrupted if necessary.
-
bool
fault_flag_allow_retry_first
(enum fault_flag flags)¶ check ALLOW_RETRY the first time
Parameters
enum fault_flag flags
Fault flags.
Description
This is mostly used for places where we want to try to avoid taking the mmap_lock for too long a time when waiting for another condition to change, in which case we can try to be polite to release the mmap_lock in the first round to avoid potential starvation of other processes that would also want the mmap_lock.
Return
true if the page fault allows retry and this is the first attempt of the fault handling; false otherwise.
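The legal flag combinations and the "first attempt" test can be modelled in a few lines. This is a user-space sketch: the SKETCH_* constants use illustrative bit values rather than the kernel's, and both helper names are hypothetical.

```c
#include <stdbool.h>

/* Illustrative bit values only; the kernel's actual values may differ. */
enum sketch_fault_flag {
    SKETCH_ALLOW_RETRY = 1u << 0,
    SKETCH_TRIED       = 1u << 1,
};

/* Only TRIED without ALLOW_RETRY is an illegal combination. */
static bool sketch_fault_flags_legal(unsigned int flags)
{
    return !((flags & SKETCH_TRIED) && !(flags & SKETCH_ALLOW_RETRY));
}

/* True when retry is allowed and this is the first attempt. */
static bool sketch_allow_retry_first(unsigned int flags)
{
    return (flags & SKETCH_ALLOW_RETRY) && !(flags & SKETCH_TRIED);
}
```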
Parameters
struct folio *folio
The folio.
Description
A folio is composed of 2^order pages. See get_order() for the definition of order.
Return
The order of the folio.
Parameters
struct folio *folio
The folio.
Context
May be called in any context, as long as you know that you have a refcount on the folio. If you do not already have one, folio_try_get() may be the right interface for you to use.
Parameters
struct folio *folio
The folio.
Description
If the folio’s reference count reaches zero, the memory will be released back to the page allocator and may be used by another allocation immediately. Do not access the memory or the struct folio after calling folio_put() unless you can be sure that it wasn’t the last reference.
Context
May be called in process or interrupt context, but not in NMI context. May be called while holding a spinlock.
-
bool
page_maybe_dma_pinned
(struct page *page)¶ Report if a page is pinned for DMA.
Parameters
struct page *page
The page.
Description
This function checks if a page has been pinned via a call to a function in the pin_user_pages() family.
For non-huge pages, the return value is partially fuzzy: false is not fuzzy, because it means “definitely not pinned for DMA”, but true means “probably pinned for DMA, but possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS worth of normal page references”.
False positives are OK, because: a) it’s unlikely for a page to get that many refcounts, and b) all the callers of this routine are expected to be able to deal gracefully with a false positive.
For huge pages, the result will be exactly correct. That’s because we have more tracking data available: the 3rd struct page in the compound page is used to track the pincount (instead of using the GUP_PIN_COUNTING_BIAS scheme).
For more information, please see pin_user_pages() and related calls.
Return
True, if it is likely that the page has been “dma-pinned”. False, if the page is definitely not dma-pinned.
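The fuzziness for non-huge pages follows from folding pins into the ordinary reference count. The sketch below is a user-space model of that idea: struct sketch_page and the sketch_* helpers are hypothetical, and the bias value of 1024 is an assumption used for illustration.

```c
#include <stdbool.h>

/* Assumed bias added to the refcount for each pin (illustrative value). */
#define SKETCH_PIN_BIAS 1024

/* Hypothetical minimal page with a single combined reference count. */
struct sketch_page {
    int refcount;
};

/* An ordinary reference adds one. */
static void sketch_get(struct sketch_page *p)
{
    p->refcount += 1;
}

/* A pin adds the whole bias to the same counter. */
static void sketch_pin(struct sketch_page *p)
{
    p->refcount += SKETCH_PIN_BIAS;
}

/* "Maybe pinned": exact for false, possibly a false positive for true. */
static bool sketch_maybe_pinned(const struct sketch_page *p)
{
    return p->refcount >= SKETCH_PIN_BIAS;
}
```

In this model a page that has accumulated SKETCH_PIN_BIAS ordinary references is indistinguishable from a page with one pin, which is the false positive described above.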
Parameters
struct folio *folio
The folio.
Description
A folio may contain multiple pages. The pages have consecutive Page Frame Numbers.
Return
The Page Frame Number of the first page in the folio.
Parameters
struct folio *folio
The folio.
Return
A positive power of two.
Parameters
struct folio *folio
The folio we’re currently operating on.
Description
If you have physically contiguous memory which may span more than one folio (eg a struct bio_vec), use this function to move from one folio to the next. Do not use it if the memory is only virtually contiguous as the folios are almost certainly not adjacent to each other. This is the folio equivalent to writing page++.
Context
We assume that the folios are refcounted and/or locked at a higher level and do not adjust the reference counts.
Return
The next struct folio.
Parameters
struct folio *folio
The folio.
Description
A folio represents a number of bytes which is a power-of-two in size. This function tells you which power-of-two the folio is. See also folio_size() and folio_order().
Context
The caller should have a reference on the folio to prevent it from being split. It is not necessary for the folio to be locked.
Return
The base-2 logarithm of the size of this folio.
Parameters
struct folio *folio
The folio.
Context
The caller should have a reference on the folio to prevent it from being split. It is not necessary for the folio to be locked.
Return
The number of bytes in this folio.
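The relationship between a folio's order, its shift, its size in bytes and its page count can be written out as plain arithmetic. The helpers below are a user-space sketch that takes the order directly instead of a struct folio; the 4 KiB page size and the sketch_* names are assumptions.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12 /* assumed: 4 KiB pages */

/* Base-2 logarithm of the folio size in bytes. */
static unsigned int sketch_folio_shift(unsigned int order)
{
    return SKETCH_PAGE_SHIFT + order;
}

/* Number of bytes in a folio of the given order. */
static size_t sketch_folio_size(unsigned int order)
{
    return (size_t)1 << sketch_folio_shift(order);
}

/* Number of pages in a folio of the given order (2^order). */
static unsigned long sketch_folio_nr_pages(unsigned int order)
{
    return 1UL << order;
}
```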
-
struct vm_area_struct *
find_vma_intersection
(struct mm_struct *mm, unsigned long start_addr, unsigned long end_addr)¶ Look up the first VMA which intersects the interval
Parameters
struct mm_struct *mm
The process address space.
unsigned long start_addr
The inclusive start user address.
unsigned long end_addr
The exclusive end user address.
Return
The first VMA within the provided range, NULL otherwise. Assumes start_addr < end_addr.
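The inclusive-start, exclusive-end semantics can be illustrated with a user-space model over a sorted array of ranges. This is a sketch, not the kernel's data structure: struct sketch_vma and the sketch_* functions are hypothetical, and a linear scan stands in for the real lookup.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VMA covering [start, end). */
struct sketch_vma {
    unsigned long start;
    unsigned long end;
};

/* First range that ends above addr; ranges are sorted and non-overlapping. */
static const struct sketch_vma *sketch_find_vma(const struct sketch_vma *vmas,
                                                size_t n, unsigned long addr)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (addr < vmas[i].end)
            return &vmas[i];
    return NULL;
}

/* First range intersecting [start_addr, end_addr), or NULL. */
static const struct sketch_vma *
sketch_find_vma_intersection(const struct sketch_vma *vmas, size_t n,
                             unsigned long start_addr, unsigned long end_addr)
{
    const struct sketch_vma *v = sketch_find_vma(vmas, n, start_addr);

    if (v && end_addr <= v->start)
        v = NULL; /* the candidate begins at or after the interval's end */
    return v;
}

/* Range containing addr, or NULL. */
static const struct sketch_vma *sketch_vma_lookup(const struct sketch_vma *vmas,
                                                  size_t n, unsigned long addr)
{
    const struct sketch_vma *v = sketch_find_vma(vmas, n, addr);

    if (v && addr < v->start)
        v = NULL; /* addr falls in a gap before the candidate */
    return v;
}
```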
-
struct vm_area_struct *
vma_lookup
(struct mm_struct *mm, unsigned long addr)¶ Find a VMA at a specific address
Parameters
struct mm_struct *mm
The process address space.
unsigned long addr
The user address.
Return
The vm_area_struct at the given address, NULL otherwise.
-
bool
vma_is_special_huge
(const struct vm_area_struct *vma)¶ Are transhuge page-table entries considered special?
Parameters
const struct vm_area_struct *vma
Pointer to the struct vm_area_struct to consider
Description
Whether transhuge page-table entries are considered “special” following the definition in vm_normal_page().
Return
true if transhuge page-table entries should be considered special, false otherwise.
-
int
seal_check_future_write
(int seals, struct vm_area_struct *vma)¶ Check for F_SEAL_FUTURE_WRITE flag and handle it
Parameters
int seals
the seals to check
struct vm_area_struct *vma
the vma to operate on
Description
Check whether F_SEAL_FUTURE_WRITE is set; if so, do proper check/handling on the vma flags. Return 0 if check pass, or <0 for errors.
Parameters
const struct folio *folio
The folio.
Description
The refcount is usually incremented by calls to folio_get() and decremented by calls to folio_put(). Some typical users of the folio refcount:
- Each reference from a page table
- The page cache
- Filesystem private data
- The LRU list
- Pipes
- Direct IO which references this page in the process address space
Return
The number of references to this folio.
Parameters
struct folio *folio
The folio.
Description
If you do not already have a reference to a folio, you can attempt to get one using this function. It may fail if, for example, the folio has been freed since you found a pointer to it, or it is frozen for the purposes of splitting or migration.
Return
True if the reference count was successfully incremented.
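The "may fail" behaviour amounts to incrementing the reference count only if it is not already zero. The following user-space sketch shows that pattern with C11 atomics; struct sketch_folio and sketch_try_get are hypothetical, and treating a freed or frozen folio as one whose count is zero is an assumption of this model.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal object carrying only a reference count. */
struct sketch_folio {
    atomic_int refcount;
};

/* Take a reference unless the count is zero; returns true on success. */
static bool sketch_try_get(struct sketch_folio *f)
{
    int old = atomic_load(&f->refcount);

    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
            return true;
    }
    return false; /* count was zero: the object must not be revived */
}
```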
Parameters
struct folio *folio
The folio.
Description
This is a version of folio_try_get() optimised for non-SMP kernels. If you are still holding the rcu_read_lock() after looking up the page and know that the page cannot have its refcount decreased to zero in interrupt context, you can use this instead of folio_try_get().
Example users include get_user_pages_fast() (as pages are not unmapped from interrupt context) and the page cache lookups (as pages are not truncated from interrupt context). We also know that pages are not frozen in interrupt context for the purposes of splitting or migration.
You can also use this function if you’re holding a lock that prevents pages being frozen & removed; eg the i_pages lock for the page cache or the mmap_sem or page table lock for page tables. In this case, it will always succeed, and you could have used a plain folio_get(), but it’s sometimes more convenient to have a common function called from both locked and RCU-protected contexts.
Return
True if the reference count was successfully incremented.
-
int
is_highmem
(struct zone *zone)¶ helper function to quickly check if a struct zone is a highmem zone or not. This is an attempt to keep references to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum.
Parameters
struct zone *zone
pointer to struct zone variable
Return
1 for a highmem zone, 0 otherwise
-
for_each_online_pgdat
(pgdat)¶ helper macro to iterate over all online nodes
Parameters
pgdat
pointer to a pg_data_t variable
-
for_each_zone
(zone)¶ helper macro to iterate over all memory zones
Parameters
zone
pointer to struct zone variable
Description
The user only needs to declare the zone variable, for_each_zone fills it in.
-
struct zoneref *
next_zones_zonelist
(struct zoneref *z, enum zone_type highest_zoneidx, nodemask_t *nodes)¶ Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
Parameters
struct zoneref *z
The cursor used as a starting point for the search
enum zone_type highest_zoneidx
The zone index of the highest zone to return
nodemask_t *nodes
An optional nodemask to filter the zonelist with
Description
This function returns the next zone at or below a given zone index that is within the allowed nodemask using a cursor as the starting point for the search. The zoneref returned is a cursor that represents the current zone being examined. It should be advanced by one before calling next_zones_zonelist again.
Return
the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
-
struct zoneref *
first_zones_zonelist
(struct zonelist *zonelist, enum zone_type highest_zoneidx, nodemask_t *nodes)¶ Returns the first zone at or below highest_zoneidx within the allowed nodemask in a zonelist
Parameters
struct zonelist *zonelist
The zonelist to search for a suitable zone
enum zone_type highest_zoneidx
The zone index of the highest zone to return
nodemask_t *nodes
An optional nodemask to filter the zonelist with
Description
This function returns the first zone at or below a given zone index that is within the allowed nodemask. The zoneref returned is a cursor that can be used to iterate the zonelist with next_zones_zonelist by advancing it by one before calling.
When no eligible zone is found, zoneref->zone is NULL (zoneref itself is never NULL). This can happen either because no zone genuinely qualifies, or because the nodemask was updated concurrently by a cpuset modification.
Return
Zoneref pointer for the first suitable zone found
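The cursor protocol shared by first_zones_zonelist and next_zones_zonelist can be sketched with a simplified user-space model. All types and names below (`struct model_zone`, `struct model_zoneref`, the `model_*` functions, and the single-word bitmask used as a nodemask) are hypothetical; the kernel's real structures and helpers are more involved. The sketch assumes the list ends with an entry whose zone pointer is NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified zone; the kernel's struct zone is far larger. */
struct model_zone {
	const char *name;
};

/* Hypothetical cursor entry. A NULL zone terminates the list. */
struct model_zoneref {
	struct model_zone *zone;
	int zone_idx;	/* zone type index of this entry */
	int node;	/* node this zone belongs to */
};

/* Return the first entry at or after z that passes both filters. */
static struct model_zoneref *model_next_zones(struct model_zoneref *z,
					      int highest_zoneidx,
					      const unsigned long *nodes)
{
	assert(z != NULL);

	while (z->zone &&
	       (z->zone_idx > highest_zoneidx ||
		(nodes && !(*nodes & (1UL << z->node)))))
		z++;
	return z;	/* never NULL; z->zone is NULL at the end */
}

/* Start the search at the head of the list. */
static struct model_zoneref *model_first_zones(struct model_zoneref *list,
					       int highest_zoneidx,
					       const unsigned long *nodes)
{
	return model_next_zones(list, highest_zoneidx, nodes);
}

/* Count eligible zones, advancing the cursor by one before each new search. */
static int model_count_zones(struct model_zoneref *list, int highest_zoneidx,
			     const unsigned long *nodes)
{
	struct model_zoneref *z;
	int n = 0;

	for (z = model_first_zones(list, highest_zoneidx, nodes);
	     z->zone;
	     z = model_next_zones(z + 1, highest_zoneidx, nodes))
		n++;
	return n;
}
```

The loop tests `z->zone` rather than `z`, mirroring the rule that the returned cursor is never NULL even when no zone qualifies.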
-
for_each_zone_zonelist_nodemask
(zone, z, zlist, highidx, nodemask)¶ helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
Parameters
zone
The current zone in the iterator
z
The current pointer within zonelist->_zonerefs being iterated
zlist
The zonelist being iterated
highidx
The zone index of the highest zone to return
nodemask
Nodemask allowed by the allocator
Description
This iterator iterates through all zones at or below a given zone index and within a given nodemask.
-
for_each_zone_zonelist
(zone, z, zlist, highidx)¶ helper macro to iterate over valid zones in a zonelist at or below a given zone index
Parameters
zone
The current zone in the iterator
z
The current pointer within zonelist->_zonerefs being iterated
zlist
The zonelist being iterated
highidx
The zone index of the highest zone to return
Description
This iterator iterates through all zones at or below a given zone index.
-
int
pfn_valid
(unsigned long pfn)¶ check if there is a valid memory map entry for a PFN
Parameters
unsigned long pfn
the page frame number to check
Description
Check if there is a valid memory map entry, i.e. a struct page, for the pfn. Note that the availability of a memory map entry does not imply that there is actual usable memory at that pfn. The struct page may represent a hole or an unusable page frame.
Return
1 for PFNs that have memory map entries and 0 otherwise
-
struct address_space *
folio_mapping
(struct folio *folio)¶ Find the mapping where this folio is stored.
Parameters
struct folio *folio
The folio.
Description
For folios which are in the page cache, return the mapping that this page belongs to. Folios in the swap cache return the swap mapping this page is stored in (which is different from the mapping for the swap file or swap device where the data is stored).
You can call this for folios which aren’t in the swap cache or page cache and it will return NULL.