Kexec Handover Subsystem
Overview
Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory regions, which could contain serialized system states, across kexec.
KHO uses a flattened device tree (FDT) to pass information about the preserved state from the pre-kexec kernel to the post-kexec kernel, and scratch memory regions to ensure integrity of the preserved memory.
KHO FDT
Every KHO kexec carries a KHO-specific flattened device tree (FDT) blob that describes the preserved state. The FDT includes properties describing preserved memory regions and nodes that hold subsystem-specific state.
The preserved memory regions contain either serialized subsystem states, or in-memory data that shall not be touched across kexec. After KHO, subsystems can retrieve and restore the preserved state from KHO FDT.
Subsystems participating in KHO can define their own format for state serialization and preservation.
The KHO FDT and the structures defined by the subsystems form an ABI between the
pre-kexec and post-kexec kernels. This ABI is defined by header files in the
include/linux/kho/abi directory.
Scratch Regions
To boot into kexec, we need a physically contiguous memory range that contains no handed-over memory. Kexec then places the target kernel and initrd into that region. The new kernel uses this region exclusively for memory allocations during boot, up to the initialization of the page allocator.
We guarantee that we always have such regions through the scratch regions: On
first boot KHO allocates several physically contiguous memory regions. Since
after kexec these regions will be used by early memory allocations, there is a
scratch region per NUMA node plus a scratch region to satisfy allocation
requests that do not require a particular NUMA node assignment.
By default, the size of the scratch regions is calculated based on the amount of
memory allocated during boot. The kho_scratch kernel command line option may be
used to explicitly define the size of the scratch regions.
The scratch regions are declared as CMA when the page allocator is initialized so
that their memory can be used during system lifetime. CMA gives us the
guarantee that no handover pages land in that region, because handover pages
must be at a static physical memory location and CMA enforces that only
movable pages can be located inside.
After KHO kexec, we ignore the kho_scratch kernel command line option and
instead reuse the exact same region that was originally allocated. This allows
us to recursively execute any number of KHO kexecs. Because we used this region
for boot memory allocations and as target memory for kexec blobs, some parts
of that memory region may be reserved. These reservations are irrelevant for
the next KHO, because kexec can overwrite even the original kernel.
Kexec Handover Radix Tree
This is a radix tree implementation for tracking physical memory pages across kexec transitions. It was developed for the KHO mechanism but is designed for broader use by any subsystem that needs to preserve pages.
The radix tree is a multi-level tree where leaf nodes are bitmaps representing individual pages. To allow pages of different sizes (orders) to be stored efficiently in a single tree, it uses a unique key encoding scheme. Each key is an unsigned long that combines a page’s physical address and its order.
Client code is responsible for allocating the root node of the tree, initializing the mutex lock, and managing its lifecycle. It must use the tree data structures defined in the KHO ABI, include/linux/kho/abi/kexec_handover.h.
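The text above says that a key combines a page's physical address and its order, but the exact bit layout is defined by the ABI header and is not reproduced here. As a rough userspace illustration only, the sketch below shows one way such a key could be packed: the address is shifted right by the page shift plus the order, and a single marker bit above it records the order. The 4 KiB page size, the marker-bit layout, and the names demo_encode_key() and demo_decode_key() are assumptions made for this example and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT  12                      /* assumption: 4 KiB pages */
#define DEMO_ORDER0_BIT  (64 - DEMO_PAGE_SHIFT)  /* marker bit position for order 0 */

/*
 * Pack a physical address and its order into one key (hypothetical layout).
 * phys is expected to be aligned to the page size of the given order.
 */
static uint64_t demo_encode_key(uint64_t phys, unsigned int order)
{
	uint64_t marker = 1ULL << (DEMO_ORDER0_BIT - order);

	return marker | (phys >> (DEMO_PAGE_SHIFT + order));
}

/* Recover the physical address and the order from a key built above. */
static uint64_t demo_decode_key(uint64_t key, unsigned int *order)
{
	unsigned int msb = 0;

	/* The highest set bit of the key is the order marker. */
	while (msb < 63 && (key >> (msb + 1)) != 0)
		msb++;

	*order = DEMO_ORDER0_BIT - msb;
	return (key & ~(1ULL << msb)) << (DEMO_PAGE_SHIFT + *order);
}
```

With this layout, higher orders produce shorter keys, so pages of different orders never collide even when they start at the same physical address.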
Public API
-
int kho_radix_add_page(struct kho_radix_tree *tree, unsigned long pfn, unsigned int order)¶
Marks a page as preserved in the radix tree.
Parameters
struct kho_radix_tree *treeThe KHO radix tree.
unsigned long pfnThe page frame number of the page to preserve.
unsigned int orderThe order of the page.
Description
This function traverses the radix tree based on the key derived from pfn and order. It sets the corresponding bit in the leaf bitmap to mark the page for preservation. If intermediate nodes do not exist along the path, they are allocated and added to the tree.
Return
0 on success, or a negative error code on failure.
-
void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn, unsigned int order)¶
Removes a page’s preservation status from the radix tree.
Parameters
struct kho_radix_tree *treeThe KHO radix tree.
unsigned long pfnThe page frame number of the page to unpreserve.
unsigned int orderThe order of the page.
Description
This function traverses the radix tree and clears the bit corresponding to the page, effectively removing its “preserved” status. It does not free the tree’s intermediate nodes, even if they become empty.
-
int kho_radix_walk_tree(struct kho_radix_tree *tree, kho_radix_tree_walk_callback_t cb)¶
Traverses the radix tree and calls a callback for each preserved page.
Parameters
struct kho_radix_tree *treeA pointer to the KHO radix tree to walk.
kho_radix_tree_walk_callback_t cbA callback function of type kho_radix_tree_walk_callback_t that will be invoked for each preserved page found in the tree. The callback receives the physical address and order of the preserved page.
Description
This function walks the radix tree, searching from the top level down to the lowest level (level 0). For each preserved page found, it invokes the provided callback, passing the page’s physical address and order.
Return
0 if the walk completed the specified tree, or the non-zero return value from the callback that stopped the walk.
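The return convention described above (a non-zero callback result stops the walk and is passed back to the caller) can be modelled in userspace as follows. This is only a sketch: the flat array stands in for the radix tree, and demo_walk(), demo_walk_cb_t and the two callbacks are hypothetical names that are not part of the KHO API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the callback type: address and order in, status out. */
typedef int (*demo_walk_cb_t)(uint64_t phys, unsigned int order);

struct demo_preserved {
	uint64_t phys;
	unsigned int order;
};

/* Visit every entry; stop early and return the first non-zero callback result. */
static int demo_walk(const struct demo_preserved *pages, size_t n,
		     demo_walk_cb_t cb)
{
	for (size_t i = 0; i < n; i++) {
		int err = cb(pages[i].phys, pages[i].order);

		if (err)
			return err;
	}
	return 0;
}

static unsigned long demo_total_pages;

/* Callback that never stops the walk: counts base pages covered. */
static int demo_count_pages(uint64_t phys, unsigned int order)
{
	(void)phys;
	demo_total_pages += 1UL << order;
	return 0;
}

/* Callback that aborts the walk on the first higher-order page. */
static int demo_stop_on_huge(uint64_t phys, unsigned int order)
{
	(void)phys;
	return order > 0 ? -EINVAL : 0;
}
```

A callback that only gathers statistics returns 0 every time, while a validating callback can return a negative errno to abort the walk at the first offending page.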
-
struct folio *kho_restore_folio(phys_addr_t phys)
restore a folio from the preserved memory.
Parameters
phys_addr_t physphysical address of the folio.
Return
pointer to the struct folio on success, NULL on failure.
-
struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages)¶
restore list of contiguous order 0 pages.
Parameters
phys_addr_t physphysical address of the first page.
unsigned long nr_pagesnumber of pages.
Description
Restore a contiguous list of order 0 pages that was preserved with
kho_preserve_pages().
Return
the first page on success, NULL on failure.
-
int kho_add_subtree(const char *name, void *blob, size_t size)¶
record the physical address of a sub blob in KHO root tree.
Parameters
const char *namename of the sub tree.
void *blobthe sub tree blob.
size_t sizesize of the blob in bytes.
Description
Creates a new child node named name in KHO root FDT and records the physical address of blob. The pages of blob must also be preserved by KHO for the new kernel to retrieve it after kexec.
A debugfs blob entry is also created at
/sys/kernel/debug/kho/out/sub_fdts/**name** when the kernel is configured with
CONFIG_KEXEC_HANDOVER_DEBUGFS.
Return
0 on success, error code on failure
-
int kho_preserve_folio(struct folio *folio)
preserve a folio across kexec.
Parameters
struct folio *foliofolio to preserve.
Description
Instructs KHO to preserve the whole folio across kexec. The order will be preserved as well.
Return
0 on success, error code on failure
-
void kho_unpreserve_folio(struct folio *folio)
unpreserve a folio.
Parameters
struct folio *foliofolio to unpreserve.
Description
Instructs KHO to unpreserve a folio that was preserved by
kho_preserve_folio() before. The provided folio (pfn and order)
must exactly match a previously preserved folio.
-
int kho_preserve_pages(struct page *page, unsigned long nr_pages)¶
preserve contiguous pages across kexec
Parameters
struct page *pagefirst page in the list.
unsigned long nr_pagesnumber of pages.
Description
Preserve a contiguous list of order 0 pages. Must be restored using
kho_restore_pages() to ensure the pages are restored properly as order 0.
Return
0 on success, error code on failure
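-
void kho_unpreserve_pages(struct page *page, unsigned long nr_pages)
unpreserve contiguous pages.
Parameters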
struct page *pagefirst page in the list.
unsigned long nr_pagesnumber of pages.
Description
Instructs KHO to unpreserve nr_pages contiguous pages starting from page.
This must be called with the same page and nr_pages as the corresponding
kho_preserve_pages() call. Unpreserving arbitrary sub-ranges of larger
preserved blocks is not supported.
-
int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation)¶
preserve memory allocated with vmalloc() across kexec
Parameters
void *ptrpointer to the area in vmalloc address space
struct kho_vmalloc *preservationplaceholder for preservation metadata
Description
Instructs KHO to preserve the area in vmalloc address space at ptr. The physical pages mapped at ptr will be preserved and on successful return preservation will hold the physical address of a structure that describes the preservation.
NOTE
Memory allocated with the vmalloc_node() variants cannot be reliably
restored on the same node.
Return
0 on success, error code on failure
-
void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation)¶
unpreserve memory allocated with vmalloc()
Parameters
struct kho_vmalloc *preservationpreservation metadata returned by kho_preserve_vmalloc()
Description
Instructs KHO to unpreserve the area in vmalloc address space that was
previously preserved with kho_preserve_vmalloc().
-
void *kho_restore_vmalloc(const struct kho_vmalloc *preservation)¶
recreates and populates an area in vmalloc address space from the preserved memory.
Parameters
const struct kho_vmalloc *preservationpreservation metadata.
Description
Recreates an area in vmalloc address space and populates it with memory that
was preserved using kho_preserve_vmalloc().
Return
pointer to the area in the vmalloc address space, NULL on failure.
-
void *kho_alloc_preserve(size_t size)¶
Allocate, zero, and preserve memory.
Parameters
size_t sizeThe number of bytes to allocate.
Description
Allocates a physically contiguous block of zeroed pages that is large enough to hold size bytes. The allocated memory is then registered with KHO for preservation across a kexec.
Note
The actual allocated size will be rounded up to the nearest power-of-two page boundary.
Return
A virtual pointer to the allocated and preserved memory on success,
or an ERR_PTR() encoded error on failure.
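The rounding mentioned in the note can be made concrete with a little arithmetic. The sketch below assumes 4 KiB pages and uses the hypothetical helpers demo_size_to_order() and demo_allocated_bytes() to compute the smallest power-of-two number of pages that holds a request; it models the rounding only and is not the kernel's allocation path.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

/* Smallest order such that (DEMO_PAGE_SIZE << order) >= size. */
static unsigned int demo_size_to_order(size_t size)
{
	unsigned int order = 0;

	while ((DEMO_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Number of bytes actually backing a request of the given size. */
static size_t demo_allocated_bytes(size_t size)
{
	return DEMO_PAGE_SIZE << demo_size_to_order(size);
}
```

Under these assumptions a request for three pages worth of data is backed by four pages, so callers that care about memory footprint may want to size their structures accordingly.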
-
void kho_unpreserve_free(void *mem)¶
Unpreserve and free memory.
Parameters
void *memPointer to the memory allocated by kho_alloc_preserve().
Description
Unregisters the memory from KHO preservation and frees the underlying
pages back to the system. This function should be called to clean up
memory allocated with kho_alloc_preserve().
-
void kho_restore_free(void *mem)¶
Restore and free memory after kexec.
Parameters
void *memPointer to the memory (in the new kernel’s address space) that was allocated by the old kernel.
Description
This function is intended to be called in the new kernel (post-kexec)
to take ownership of and free a memory region that was preserved by the
old kernel using kho_alloc_preserve().
It first restores the pages from KHO (using their physical address) and then frees the pages back to the new kernel’s page allocator.
-
bool is_kho_boot(void)¶
check if current kernel was booted via KHO-enabled kexec
Parameters
voidno arguments
Description
This function checks if the current kernel was loaded through a kexec operation with KHO enabled, by verifying that a valid KHO FDT was passed.
Note
This function returns reliable results only after
kho_populate() has been called during early boot. Before that,
it may return false even if KHO data is present.
Return
true if booted via KHO-enabled kexec, false otherwise
-
int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size)¶
retrieve a preserved sub blob by its name.
Parameters
const char *namethe name of the sub blob passed to kho_add_subtree().
phys_addr_t *physif found, the physical address of the sub blob is stored in phys.
size_t *sizeif not NULL and found, the size of the sub blob is stored in size.
Description
Retrieve a preserved sub blob named name and store its physical address in phys and optionally its size in size.
Return
0 on success, error code on failure
KHO Serialization Blocks API
KHO provides a mechanism to preserve stateful data across a kexec handover by serializing it into memory blocks, and provides the common infrastructure for managing these blocks.
Each block consists of a header (struct kho_block_header_ser) followed by an
array of serialized entries. Multiple blocks are linked together via a
physical pointer in the header, forming a linked list that can be easily
traversed in both the current and the next kernel.
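The layout described above, a header followed by an array of entries with blocks chained through a physical pointer, can be modelled in userspace as shown below. The header fields (next_pa, count), the fixed entry type, and the array that stands in for physical memory are all assumptions made for this sketch; the real struct kho_block_header_ser is defined by the KHO ABI headers.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PER_BLOCK 4

/* Hypothetical header layout, for illustration only. */
struct demo_block_header {
	uint64_t next_pa;  /* fake physical address of the next block, 0 ends the list */
	uint64_t count;    /* number of valid entries in this block */
};

struct demo_block {
	struct demo_block_header hdr;
	uint64_t entries[DEMO_PER_BLOCK];
};

/* Stand-in for physical memory; slot 0 is unused so that address 0 means "none". */
static struct demo_block demo_mem[4];

static struct demo_block *demo_phys_to_virt(uint64_t pa)
{
	return &demo_mem[pa / sizeof(struct demo_block)];
}

static uint64_t demo_virt_to_phys(const struct demo_block *blk)
{
	return (uint64_t)(blk - demo_mem) * sizeof(struct demo_block);
}

/* Follow the chain from head_pa and sum every valid entry. */
static uint64_t demo_sum_entries(uint64_t head_pa)
{
	uint64_t sum = 0;

	for (uint64_t pa = head_pa; pa != 0; pa = demo_phys_to_virt(pa)->hdr.next_pa) {
		const struct demo_block *blk = demo_phys_to_virt(pa);

		for (uint64_t i = 0; i < blk->hdr.count; i++)
			sum += blk->entries[i];
	}
	return sum;
}

/* Build a two-block chain holding the entries 1, 2 and 40; return its head address. */
static uint64_t demo_build_two_blocks(void)
{
	struct demo_block *a = &demo_mem[1], *b = &demo_mem[2];

	a->hdr.next_pa = demo_virt_to_phys(b);
	a->hdr.count = 2;
	a->entries[0] = 1;
	a->entries[1] = 2;
	b->hdr.next_pa = 0;
	b->hdr.count = 1;
	b->entries[0] = 40;
	return demo_virt_to_phys(a);
}
```

Because the links are physical addresses rather than virtual pointers, the same traversal works in the next kernel once each address is translated back into that kernel's address space.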
-
struct kho_block¶
Internal representation of a serialization block.
Definition:
struct kho_block {
    struct list_head list;
    struct kho_block_header_ser *ser;
};
Members
listList head for linking blocks in memory.
serPointer to the serialized header in preserved memory.
-
struct kho_block_set¶
A set of blocks containing serialized entries of the same type.
Definition:
struct kho_block_set {
    struct list_head blocks;
    long nblocks;
    u64 head_pa;
    size_t entry_size;
    u64 count_per_block;
    bool incoming;
};
Members
blocksThe list of serialization blocks (struct kho_block).
nblocksThe number of allocated serialization blocks.
head_paPhysical address of the first block header.
entry_sizeThe size of each entry in the blocks.
count_per_blockThe maximum number of entries each block can hold.
incomingTrue if this block set was restored from the previous kernel.
Note
Synchronization and locking are the responsibility of the caller. The block set structure itself is not internally synchronized.
-
struct kho_block_set_it¶
Iterator for serializing entries into blocks.
Definition:
struct kho_block_set_it {
    struct kho_block_set *bs;
    struct kho_block *block;
    u64 i;
};
Members
bsThe block set being iterated.
blockThe current block.
iThe current entry index within block.
-
KHO_BLOCK_SET_INIT¶
KHO_BLOCK_SET_INIT (_name, _entry_size)
Initialize a static kho_block_set.
Parameters
_nameName of the kho_block_set variable.
_entry_sizeThe size of each entry in the block set.
-
u64 kho_block_set_head_pa(struct kho_block_set *bs)¶
Get the physical address of the first block header.
Parameters
struct kho_block_set *bsThe block set.
Return
The physical address of the first block header, or 0 if empty.
-
bool kho_block_set_is_empty(struct kho_block_set *bs)¶
Check if the block set has no allocated blocks.
Parameters
struct kho_block_set *bsThe block set.
Return
True if there are no blocks in the set, false otherwise.
-
void kho_block_set_init(struct kho_block_set *bs, size_t entry_size)¶
Initialize a block set.
Parameters
struct kho_block_set *bsThe block set to initialize.
size_t entry_sizeThe size of each entry in the blocks.
-
int kho_block_set_grow(struct kho_block_set *bs, u64 count)¶
Expand the block set to accommodate the target count.
Parameters
struct kho_block_set *bsThe block set.
u64 countThe target number of valid entries to accommodate.
Description
Dynamically preallocates and links preserved memory blocks if the target entry count exceeds the current total capacity of the set, ensuring they are available during serialization/deserialization.
Context
Caller must hold a lock protecting the block set.
Return
0 on success, or a negative errno on failure.
-
void kho_block_set_shrink(struct kho_block_set *bs, u64 count)¶
Shrink the block set to accommodate the target count.
Parameters
struct kho_block_set *bsThe block set.
u64 countThe target number of valid entries to accommodate.
Description
Releases and frees redundant preserved memory blocks. Checks if the last block in the set can be removed because the remaining entry count is fully accommodated by the preceding blocks.
Note
It is the caller’s responsibility to ensure that entries are removed in the reverse order of their insertion. Because shrinking destroys the last block in the set, removing entries in any other order would corrupt active data.
Context
Caller must hold a lock protecting the block set.
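The capacity checks behind growing and shrinking come down to simple arithmetic on the block count and the per-block capacity. The helpers below are hypothetical (they are not part of the KHO API) and only model that arithmetic; how many blocks the kernel allocates or frees per call is not specified here.

```c
#include <assert.h>
#include <stdint.h>

/* Blocks needed to hold count entries when each block holds per_block entries. */
static uint64_t demo_blocks_needed(uint64_t count, uint64_t per_block)
{
	return (count + per_block - 1) / per_block;
}

/* Grow: number of extra blocks required so that count entries fit. */
static uint64_t demo_blocks_to_add(uint64_t nblocks, uint64_t count,
				   uint64_t per_block)
{
	uint64_t needed = demo_blocks_needed(count, per_block);

	return needed > nblocks ? needed - nblocks : 0;
}

/* Shrink: the last block is redundant if the preceding blocks hold count entries. */
static int demo_last_block_redundant(uint64_t nblocks, uint64_t count,
				     uint64_t per_block)
{
	return nblocks > 0 && count <= (nblocks - 1) * per_block;
}
```

For example, with four entries per block, a set of two blocks keeps its last block while five entries remain and may drop it once only four remain.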
-
int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa)¶
Restore a block set from a physical address.
Parameters
struct kho_block_set *bsThe block set to restore.
u64 head_paPhysical address of the first block header.
Description
Restores a serialized block set from a given physical address. The caller is responsible for ensuring that the block set bs has been allocated and initialized prior to calling this function.
Return
0 on success, or a negative errno on failure.
-
void kho_block_set_destroy(struct kho_block_set *bs)¶
Destroy all blocks in a block set.
Parameters
struct kho_block_set *bsThe block set.
-
void kho_block_set_clear(struct kho_block_set *bs)¶
Clear all serialized data in a block set.
Parameters
struct kho_block_set *bsThe block set to clear.
-
void kho_block_set_it_init(struct kho_block_set_it *it, struct kho_block_set *bs)¶
Initialize a block set iterator.
Parameters
struct kho_block_set_it *itThe iterator to initialize.
struct kho_block_set *bsThe block set to iterate over.
-
void *kho_block_set_it_reserve_entry(struct kho_block_set_it *it)¶
Reserve and return the next available slot for writing.
Parameters
struct kho_block_set_it *itThe block iterator.
Description
Reserves a slot in the current block during state serialization to add a new entry, advancing the internal index. If the current block is full, it automatically moves to the next block in the set.
Return
A pointer to the reserved entry slot, or NULL if the block set’s capacity is fully exhausted.
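The reserve behaviour described above (hand out the next slot, roll over to the next block when the current one is full, return NULL once capacity is exhausted) can be sketched in userspace as follows. The array of blocks replaces the linked list, the per-block count update is an assumption about how the header tracks valid entries, and every demo_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PER_BLOCK 2
#define DEMO_NBLOCKS   2

struct demo_block {
	uint64_t count;                    /* valid entries written so far */
	uint64_t entries[DEMO_PER_BLOCK];
};

struct demo_it {
	struct demo_block *blocks;  /* array stands in for the linked block list */
	size_t nblocks;
	size_t block;               /* index of the current block */
	uint64_t i;                 /* next entry index within the current block */
};

/* Return the next free slot, or NULL when every block is full. */
static uint64_t *demo_reserve_entry(struct demo_it *it)
{
	/* Current block is full: move on to the next one. */
	if (it->block < it->nblocks && it->i == DEMO_PER_BLOCK) {
		it->block++;
		it->i = 0;
	}
	if (it->block >= it->nblocks)
		return NULL;

	it->blocks[it->block].count = it->i + 1;
	return &it->blocks[it->block].entries[it->i++];
}
```

A serializer would typically grow the set to the required capacity first and then call the reserve function once per entry, treating a NULL return as a sizing bug.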
-
void *kho_block_set_it_read_entry(struct kho_block_set_it *it)¶
Read the next serialized entry from the block set.
Parameters
struct kho_block_set_it *itThe block iterator.
Description
Iterates through previously written entries during state deserialization, respecting the actual count stored in each block’s header.
Return
A pointer to the next serialized entry, or NULL if all serialized entries have been read.
-
void *kho_block_set_it_prev(struct kho_block_set_it *it)¶
Return the previous entry slot in the block set.
Parameters
struct kho_block_set_it *itThe block iterator.
Description
If the current index is at the start of a block, it automatically moves to the end of the previous block.
Return
A pointer to the previous entry slot, or NULL if at the very beginning of the block set.
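Stepping backwards follows the mirror-image rule: when the index is at the start of a block, the iterator moves to the end of the previous block, and it returns NULL at the very beginning of the set. The userspace sketch below assumes that every block before the current one is full, uses an array in place of the linked list, and relies on hypothetical demo_ names throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PER_BLOCK 2

struct demo_block {
	uint64_t entries[DEMO_PER_BLOCK];
};

struct demo_it {
	struct demo_block *blocks;  /* array stands in for the linked block list */
	size_t block;               /* index of the current block */
	uint64_t i;                 /* next entry index within the current block */
};

/* Step back one slot and return it, or NULL at the start of the set. */
static uint64_t *demo_prev(struct demo_it *it)
{
	if (it->i == 0) {
		if (it->block == 0)
			return NULL;
		/* Assumption: earlier blocks are full, so resume at their end. */
		it->block--;
		it->i = DEMO_PER_BLOCK;
	}
	return &it->blocks[it->block].entries[--it->i];
}
```

This pairs naturally with the reverse-order removal requirement of shrinking: undoing the most recent reservation first keeps the last block the only one that can become empty.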