Boot time memory management

Early system initialization cannot use “normal” memory management simply because it is not set up yet. But there is still a need to allocate memory for various data structures, for instance for the physical page allocator. To address this, a specialized allocator called the Boot Memory Allocator, or bootmem, was introduced. Several years later PowerPC developers added a “Logical Memory Blocks” allocator, which was later adopted by other architectures and renamed to memblock. There is also a compatibility layer called nobootmem that translates bootmem allocation interfaces to memblock calls.

The selection of the early allocator is done using CONFIG_NO_BOOTMEM and CONFIG_HAVE_MEMBLOCK kernel configuration options. These options are enabled or disabled statically by the architectures’ Kconfig files.

  • Architectures that rely only on bootmem select CONFIG_NO_BOOTMEM=n && CONFIG_HAVE_MEMBLOCK=n.
  • The users of memblock with the nobootmem compatibility layer set CONFIG_NO_BOOTMEM=y && CONFIG_HAVE_MEMBLOCK=y.
  • And for those that use both memblock and bootmem the configuration includes CONFIG_NO_BOOTMEM=n && CONFIG_HAVE_MEMBLOCK=y.

Whichever allocator is used, it is the responsibility of the architecture specific initialization to set it up in setup_arch() and tear it down in mem_init() functions.

Once the early memory management is available it offers a variety of functions and macros for memory allocations. The allocation request may be directed to the first (and probably the only) node or to a particular node in a NUMA system. There are API variants that panic when an allocation fails and those that don’t. Memblock, the more recent and more advanced of the two allocators, even allows controlling its own behaviour.

Bootmem

(mostly stolen from Mel Gorman’s “Understanding the Linux Virtual Memory Manager” book)

Bootmem is a boot-time physical memory allocator and configurator.

It is used early in the boot process before the page allocator is set up.

Bootmem is based on the most basic of allocators, a First Fit allocator which uses a bitmap to represent memory. If a bit is 1, the page is allocated and 0 if unallocated. To satisfy allocations of sizes smaller than a page, the allocator records the Page Frame Number (PFN) of the last allocation and the offset the allocation ended at. Subsequent small allocations are merged together and stored on the same page.
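
The first-fit bitmap scheme described above can be illustrated with a small userspace sketch. This is not the kernel's bootmem code: the `toy_` names, the 64-page range and the page-granular interface are assumptions made for illustration, and alignment and sub-page allocations are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 64 /* hypothetical size of the managed range, in pages */

/* One bit per page: 1 = allocated, 0 = free (same convention as the text). */
static unsigned char toy_bitmap[TOY_NPAGES / 8];

static inline bool toy_test(size_t pfn)
{
	return (toy_bitmap[pfn / 8] & (1u << (pfn % 8))) != 0;
}

static inline void toy_set(size_t pfn)
{
	toy_bitmap[pfn / 8] |= (unsigned char)(1u << (pfn % 8));
}

static inline void toy_clear(size_t pfn)
{
	toy_bitmap[pfn / 8] &= (unsigned char)~(1u << (pfn % 8));
}

/*
 * First fit: scan from page 0 and take the first run of npages free pages.
 * Returns the index of the first page of the run, or -1 if nothing fits.
 */
long toy_alloc_pages(size_t npages)
{
	size_t run = 0;

	assert(npages > 0);
	for (size_t pfn = 0; pfn < TOY_NPAGES; pfn++) {
		run = toy_test(pfn) ? 0 : run + 1;
		if (run == npages) {
			size_t first = pfn + 1 - npages;

			for (size_t i = first; i <= pfn; i++)
				toy_set(i);
			return (long)first;
		}
	}
	return -1;
}

/* Mark npages pages starting at page index first as free again. */
void toy_free_pages(size_t first, size_t npages)
{
	assert(first + npages <= TOY_NPAGES);
	for (size_t i = first; i < first + npages; i++)
		toy_clear(i);
}
```

With an empty bitmap, requesting four pages and then two pages yields page indexes 0 and 4. After the first block is freed, a three-page request lands at index 0 again, because the scan always takes the first hole that is large enough.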

The information used by the bootmem allocator is represented by struct bootmem_data. An array to hold up to MAX_NUMNODES such structures is statically allocated and then it is discarded when the system initialization completes. Each entry in this array corresponds to a node with memory. For UMA systems only entry 0 is used.

The bootmem allocator is initialized during early architecture specific setup. Each architecture is required to supply a setup_arch() function which, among other tasks, is responsible for acquiring the necessary parameters to initialise the boot memory allocator. These parameters define limits of usable physical memory:

  • min_low_pfn - the lowest PFN that is available in the system
  • max_low_pfn - the highest PFN that may be addressed by low memory (ZONE_NORMAL)
  • max_pfn - the last PFN available to the system.
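
As a rough illustration of how such limits relate to physical addresses, the userspace sketch below derives the three values from a RAM range. The 4 KiB page size, the `toy_` names, the `lowmem_limit` parameter and the choice to treat both upper limits as exclusive bounds are assumptions of this sketch, not something setup_arch() is required to do.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE (UINT64_C(1) << TOY_PAGE_SHIFT)

struct toy_pfn_limits {
	uint64_t min_low_pfn; /* lowest usable PFN */
	uint64_t max_low_pfn; /* end of directly addressable low memory */
	uint64_t max_pfn;     /* end of all memory available to the system */
};

static inline uint64_t toy_pfn_up(uint64_t addr)
{
	return (addr + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;
}

static inline uint64_t toy_pfn_down(uint64_t addr)
{
	return addr >> TOY_PAGE_SHIFT;
}

/*
 * ram_start/ram_end: hypothetical bounds of usable RAM (end is exclusive).
 * lowmem_limit: hypothetical highest address that can be mapped directly.
 */
struct toy_pfn_limits toy_find_limits(uint64_t ram_start, uint64_t ram_end,
				      uint64_t lowmem_limit)
{
	struct toy_pfn_limits l;

	assert(ram_start < ram_end);
	/* Round inward so that only whole pages are counted. */
	l.min_low_pfn = toy_pfn_up(ram_start);
	l.max_pfn = toy_pfn_down(ram_end);
	/* Low memory ends at the limit or at the end of RAM, whichever is lower. */
	l.max_low_pfn = l.max_pfn;
	if (l.max_low_pfn > toy_pfn_down(lowmem_limit))
		l.max_low_pfn = toy_pfn_down(lowmem_limit);
	return l;
}
```

For RAM spanning 0x100800 to 0x40000000 with a direct-mapping limit of 0x38000000, the sketch yields min_low_pfn 0x101, max_low_pfn 0x38000 and max_pfn 0x40000.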

After those limits are determined, the init_bootmem() or init_bootmem_node() function should be called to initialize the bootmem allocator. The UMA case should use the init_bootmem() function. It initializes the contig_page_data structure that represents the only memory node in the system. In the NUMA case the init_bootmem_node() function should be called to initialize the bootmem allocator for each node.

Once the allocator is set up, it is possible to use either the single node or the NUMA variants of the allocation APIs.

Memblock

Memblock is a method of managing memory regions during the early boot period when the usual kernel memory allocators are not up and running.

Memblock views the system memory as collections of contiguous regions. There are several types of these collections:

  • memory - describes the physical memory available to the kernel; this may differ from the actual physical memory installed in the system, for instance when the memory is restricted with mem= command line parameter
  • reserved - describes the regions that were allocated
  • physmem - describes the actual physical memory regardless of the possible restrictions; the physmem type is only available on some architectures.

Each region is represented by struct memblock_region that defines the region extents, its attributes and NUMA node id on NUMA systems. Every memory type is described by the struct memblock_type which contains an array of memory regions along with the allocator metadata. The memory types are nicely wrapped with struct memblock. This structure is statically initialized at build time. The region arrays for the “memory” and “reserved” types are initially sized to INIT_MEMBLOCK_REGIONS and for the “physmem” type to INIT_PHYSMEM_REGIONS. The memblock_allow_resize() function enables automatic resizing of the region arrays during addition of new regions. This feature should be used with care so that memory allocated for the region array will not overlap with areas that should be reserved, for example initrd.

The early architecture setup should tell memblock what the physical memory layout is by using memblock_add() or memblock_add_node() functions. The first function does not assign the region to a NUMA node and it is appropriate for UMA systems. Yet, it is possible to use it on NUMA systems as well and assign the region to a NUMA node later in the setup process using memblock_set_node(). The memblock_add_node() performs such an assignment directly.
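
To make the idea of registering regions more concrete, here is a userspace toy model of a region array that stays sorted and merges neighbouring regions. It is only a sketch: the `toy_` names, the fixed capacity and the `TOY_NO_NODE` marker are hypothetical, and, unlike the real allocator, it does not split regions or resolve overlaps between regions on different nodes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_REGIONS 16 /* hypothetical fixed capacity, no resizing */
#define TOY_NO_NODE (-1)   /* hypothetical "no node assigned" marker */

struct toy_region {
	uint64_t base;
	uint64_t size;
	int nid;
};

struct toy_type {
	unsigned long cnt;
	struct toy_region regions[TOY_MAX_REGIONS];
};

/* Merge neighbours that touch or overlap and share a node id. */
static void toy_merge(struct toy_type *t)
{
	unsigned long i = 0;

	while (i + 1 < t->cnt) {
		struct toy_region *a = &t->regions[i];
		struct toy_region *b = &t->regions[i + 1];
		uint64_t a_end = a->base + a->size;
		uint64_t b_end = b->base + b->size;

		if (a->nid != b->nid || a_end < b->base) {
			i++;
			continue;
		}
		a->size = (b_end > a_end ? b_end : a_end) - a->base;
		/* Close the gap left by the absorbed region. */
		memmove(b, b + 1, (t->cnt - (i + 2)) * sizeof(*b));
		t->cnt--;
	}
}

/* Insert [base, base + size) keeping the array sorted by base address. */
int toy_add_node(struct toy_type *t, uint64_t base, uint64_t size, int nid)
{
	unsigned long i;

	assert(size > 0);
	if (t->cnt == TOY_MAX_REGIONS)
		return -1;
	for (i = 0; i < t->cnt && t->regions[i].base < base; i++)
		;
	memmove(&t->regions[i + 1], &t->regions[i],
		(t->cnt - i) * sizeof(t->regions[0]));
	t->regions[i] = (struct toy_region){ .base = base, .size = size, .nid = nid };
	t->cnt++;
	toy_merge(t);
	return 0;
}

/* Variant that leaves the node unassigned, as on a UMA system. */
int toy_add(struct toy_type *t, uint64_t base, uint64_t size)
{
	return toy_add_node(t, base, size, TOY_NO_NODE);
}
```

Adding 0x1000..0x2000 and 0x4000..0x5000 gives two regions; adding 0x2000..0x4000 afterwards closes the gap and the three collapse into one region of size 0x4000. An adjacent region registered for node 1 stays separate because its node id differs.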

Once memblock is set up the memory can be allocated using either memblock or bootmem APIs.

As the system boot progresses, the architecture specific mem_init() function frees all the memory to the buddy page allocator.

If an architecture enables CONFIG_ARCH_DISCARD_MEMBLOCK, the memblock data structures will be discarded after the system initialization completes.

Functions and structures

Common API

The functions that are described in this section are available regardless of what early memory manager is enabled.

void free_bootmem_late(unsigned long addr, unsigned long size)

free bootmem pages directly to page allocator

Parameters

unsigned long addr
starting address of the range
unsigned long size
size of the range in bytes

Description

This is only useful when the bootmem allocator has already been torn down, but we are still initializing the system. Pages are given directly to the page allocator, no bootmem metadata is updated because it is gone.

unsigned long free_all_bootmem(void)

release free pages to the buddy allocator

Parameters

void
no arguments

Return

the number of pages actually released.

void free_bootmem_node(pg_data_t * pgdat, unsigned long physaddr, unsigned long size)

mark a page range as usable

Parameters

pg_data_t * pgdat
node the range resides on
unsigned long physaddr
starting physical address of the range
unsigned long size
size of the range in bytes

Description

Partial pages will be considered reserved and left as they are.

The range must reside completely on the specified node.

void free_bootmem(unsigned long addr, unsigned long size)

mark a page range as usable

Parameters

unsigned long addr
starting physical address of the range
unsigned long size
size of the range in bytes

Description

Partial pages will be considered reserved and left as they are.

The range must be contiguous but may span node boundaries.

void * __alloc_bootmem_nopanic(unsigned long size, unsigned long align, unsigned long goal)

allocate boot memory without panicking

Parameters

unsigned long size
size of the request in bytes
unsigned long align
alignment of the region
unsigned long goal
preferred starting address of the region

Description

The goal is dropped if it can not be satisfied and the allocation will fall back to memory below goal.

Allocation may happen on any node in the system.

Return

address of the allocated region or NULL on failure.
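
The goal fallback and the non-panicking return can be sketched in userspace as follows. The arena, the tiny page size and the `toy_` names are hypothetical, the goal is expressed as a byte offset into the arena, and the align parameter is left out to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES 16     /* hypothetical arena size in pages */
#define TOY_PAGE_SIZE 64 /* hypothetical, deliberately tiny page size */

static unsigned char toy_arena[TOY_PAGES * TOY_PAGE_SIZE];
static bool toy_used[TOY_PAGES];

/* First fit within pages [lo, hi): first page of a free run, or -1. */
static long toy_scan(size_t lo, size_t hi, size_t npages)
{
	size_t run = 0;

	for (size_t p = lo; p < hi; p++) {
		run = toy_used[p] ? 0 : run + 1;
		if (run == npages)
			return (long)(p + 1 - npages);
	}
	return -1;
}

/*
 * Try at or above the goal first. If that fails, drop the goal and retry
 * from the start of the arena. Returns NULL on failure instead of panicking.
 */
void *toy_alloc_nopanic(size_t size, size_t goal)
{
	size_t npages = (size + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
	size_t goal_page = goal / TOY_PAGE_SIZE;
	long first = -1;

	assert(size > 0);
	if (goal_page < TOY_PAGES)
		first = toy_scan(goal_page, TOY_PAGES, npages);
	if (first < 0)
		first = toy_scan(0, TOY_PAGES, npages);
	if (first < 0)
		return NULL;
	for (size_t p = (size_t)first; p < (size_t)first + npages; p++)
		toy_used[p] = true;
	return &toy_arena[(size_t)first * TOY_PAGE_SIZE];
}
```

With a goal at page 12, a one-page request and then a three-page request fill pages 12 to 15. A further two-page request with the same goal no longer fits above it, so it is served from page 0. A request for the whole arena then fails and returns NULL.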

void * __alloc_bootmem(unsigned long size, unsigned long align, unsigned long goal)

allocate boot memory

Parameters

unsigned long size
size of the request in bytes
unsigned long align
alignment of the region
unsigned long goal
preferred starting address of the region

Description

The goal is dropped if it can not be satisfied and the allocation will fall back to memory below goal.

Allocation may happen on any node in the system.

The function panics if the request can not be satisfied.

Return

address of the allocated region.

void * __alloc_bootmem_node(pg_data_t * pgdat, unsigned long size, unsigned long align, unsigned long goal)

allocate boot memory from a specific node

Parameters

pg_data_t * pgdat
node to allocate from
unsigned long size
size of the request in bytes
unsigned long align
alignment of the region
unsigned long goal
preferred starting address of the region

Description

The goal is dropped if it can not be satisfied and the allocation will fall back to memory below goal.

Allocation may fall back to any node in the system if the specified node can not hold the requested memory.

The function panics if the request can not be satisfied.

Return

address of the allocated region.

void * __alloc_bootmem_low(unsigned long size, unsigned long align, unsigned long goal)

allocate low boot memory

Parameters

unsigned long size
size of the request in bytes
unsigned long align
alignment of the region
unsigned long goal
preferred starting address of the region

Description

The goal is dropped if it can not be satisfied and the allocation will fall back to memory below goal.

Allocation may happen on any node in the system.

The function panics if the request can not be satisfied.

Return

address of the allocated region.

void * __alloc_bootmem_low_node(pg_data_t * pgdat, unsigned long size, unsigned long align, unsigned long goal)

allocate low boot memory from a specific node

Parameters

pg_data_t * pgdat
node to allocate from
unsigned long size
size of the request in bytes
unsigned long align
alignment of the region
unsigned long goal
preferred starting address of the region

Description

The goal is dropped if it can not be satisfied and the allocation will fall back to memory below goal.

Allocation may fall back to any node in the system if the specified node can not hold the requested memory.

The function panics if the request can not be satisfied.

Return

address of the allocated region.

Bootmem specific API

These interfaces are available only with bootmem, i.e. when CONFIG_NO_BOOTMEM=n.

struct bootmem_data

per-node information used by the bootmem allocator

Definition

struct bootmem_data {
  unsigned long node_min_pfn;
  unsigned long node_low_pfn;
  void *node_bootmem_map;
  unsigned long last_end_off;
  unsigned long hint_idx;
  struct list_head list;
};

Members

node_min_pfn
the PFN of the start of the node’s memory
node_low_pfn
the PFN of the end of the directly addressable memory
node_bootmem_map
is a bitmap pointer - the bits represent all physical memory pages (including holes) on the node.
last_end_off
the offset within the page of the end of the last allocation; if 0, the page used is full
hint_idx
the PFN of the page used with the last allocation; together with the last_end_off field, it allows a test to be made to see whether a new allocation can be merged into the page used for the last allocation rather than using up a full new page.
list
list entry in the linked list ordered by the memory addresses
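
The interplay of hint_idx and last_end_off can be sketched in userspace as below. The structure is a hypothetical subset of the real one, `next_free_pfn` is an invented stand-in for the bitmap search, the page size is assumed to be 4 KiB, and alignment is ignored.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Hypothetical subset of the per-node state described above. */
struct toy_bootmem_state {
	unsigned long next_free_pfn; /* invented: next page never handed out */
	unsigned long hint_idx;      /* page used by the last allocation */
	unsigned long last_end_off;  /* end offset in that page, 0 = page full */
};

/*
 * Allocate size bytes (at most one page) and return the byte offset of the
 * allocation from the start of the node. A request that fits into the tail
 * of the last used page is placed there instead of taking a new page.
 */
unsigned long toy_alloc_small(struct toy_bootmem_state *s, unsigned long size)
{
	unsigned long start;

	assert(size > 0 && size <= TOY_PAGE_SIZE);
	if (s->last_end_off && s->last_end_off + size <= TOY_PAGE_SIZE) {
		/* Merge: continue in the partially used page. */
		start = s->hint_idx * TOY_PAGE_SIZE + s->last_end_off;
	} else {
		/* No room left: start a fresh page. */
		s->hint_idx = s->next_free_pfn++;
		start = s->hint_idx * TOY_PAGE_SIZE;
	}
	/* An allocation that ends exactly on a page boundary leaves 0 here. */
	s->last_end_off = (start + size) % TOY_PAGE_SIZE;
	return start;
}
```

Starting from a zeroed state, requests of 100 and 200 bytes share page 0 at offsets 0 and 100. A 4000-byte request does not fit in the remainder and starts page 1 at offset 4096. A following 96-byte request fills that page exactly, so last_end_off becomes 0 and the next request opens page 2.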

Memblock specific API

Here is the description of memblock data structures, functions and macros. Some of them are actually internal, but since they are documented it would be silly to omit them. Besides, reading the descriptions for the internal functions can help to understand what really happens under the hood.

enum memblock_flags

definition of memory region attributes

Constants

MEMBLOCK_NONE
no special request
MEMBLOCK_HOTPLUG
hotpluggable region
MEMBLOCK_MIRROR
mirrored region
MEMBLOCK_NOMAP
don’t add to kernel direct mapping

struct memblock_region

represents a memory region

Definition

struct memblock_region {
  phys_addr_t base;
  phys_addr_t size;
  enum memblock_flags flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
  int nid;
#endif
};

Members

base
physical address of the region
size
size of the region
flags
memory region attributes
nid
NUMA node id

struct memblock_type

collection of memory regions of certain type

Definition

struct memblock_type {
  unsigned long cnt;
  unsigned long max;
  phys_addr_t total_size;
  struct memblock_region *regions;
  char *name;
};

Members

cnt
number of regions
max
size of the allocated array
total_size
size of all regions
regions
array of regions
name
the memory type symbolic name

struct memblock

memblock allocator metadata

Definition

struct memblock {
  bool bottom_up;
  phys_addr_t current_limit;
  struct memblock_type memory;
  struct memblock_type reserved;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
  struct memblock_type physmem;
#endif
};

Members

bottom_up
whether memory is allocated in the bottom-up direction
current_limit
physical address of the current allocation limit
memory
usable memory regions
reserved
reserved memory regions
physmem
all physical memory

for_each_mem_range(i, type_a, type_b, nid, flags, p_start, p_end, p_nid)

iterate through memblock areas from type_a and not included in type_b. Or just type_a if type_b is NULL.

Parameters

i
u64 used as loop variable
type_a
ptr to memblock_type to iterate
type_b
ptr to memblock_type which excludes from the iteration
nid
node selector, NUMA_NO_NODE for all nodes
flags
pick from blocks based on memory attributes
p_start
ptr to phys_addr_t for start address of the range, can be NULL
p_end
ptr to phys_addr_t for end address of the range, can be NULL
p_nid
ptr to int for nid of the range, can be NULL

for_each_mem_range_rev(i, type_a, type_b, nid, flags, p_start, p_end, p_nid)

reverse iterate through memblock areas from type_a and not included in type_b. Or just type_a if type_b is NULL.

Parameters

i
u64 used as loop variable
type_a
ptr to memblock_type to iterate
type_b
ptr to memblock_type which excludes from the iteration
nid
node selector, NUMA_NO_NODE for all nodes
flags
pick from blocks based on memory attributes
p_start
ptr to phys_addr_t for start address of the range, can be NULL
p_end
ptr to phys_addr_t for end address of the range, can be NULL
p_nid
ptr to int for nid of the range, can be NULL

for_each_reserved_mem_region(i, p_start, p_end)

iterate over all reserved memblock areas

Parameters

i
u64 used as loop variable
p_start
ptr to phys_addr_t for start address of the range, can be NULL
p_end
ptr to phys_addr_t for end address of the range, can be NULL

Description

Walks over reserved areas of memblock. Available as soon as memblock is initialized.

for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)

early memory pfn range iterator

Parameters

i
an integer used as loop variable
nid
node selector, MAX_NUMNODES for all nodes
p_start
ptr to ulong for start pfn of the range, can be NULL
p_end
ptr to ulong for end pfn of the range, can be NULL
p_nid
ptr to int for nid of the range, can be NULL

Description

Walks over configured memory ranges.

for_each_free_mem_range(i, nid, flags, p_start, p_end, p_nid)

iterate through free memblock areas

Parameters

i
u64 used as loop variable
nid
node selector, NUMA_NO_NODE for all nodes
flags
pick from blocks based on memory attributes
p_start
ptr to phys_addr_t for start address of the range, can be NULL
p_end
ptr to phys_addr_t for end address of the range, can be NULL
p_nid
ptr to int for nid of the range, can be NULL

Description

Walks over free (memory && !reserved) areas of memblock. Available as soon as memblock is initialized.
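
What "memory && !reserved" means for two sorted region lists can be shown with a small userspace sketch. It is not the iterator macro itself: the `toy_` names and the output-array interface are hypothetical, and node ids and flags are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_range {
	uint64_t start;
	uint64_t end; /* exclusive */
};

/*
 * Store the parts of mem[] that are not covered by rsv[] in out[] and return
 * the number of ranges stored (never more than out_max). Both inputs must be
 * sorted by start address and must not contain overlapping entries.
 */
size_t toy_free_ranges(const struct toy_range *mem, size_t nmem,
		       const struct toy_range *rsv, size_t nrsv,
		       struct toy_range *out, size_t out_max)
{
	size_t n = 0, r = 0;

	for (size_t m = 0; m < nmem; m++) {
		uint64_t cur = mem[m].start;

		assert(mem[m].start < mem[m].end);
		/* Skip reserved ranges that end before this memory range. */
		while (r < nrsv && rsv[r].end <= cur)
			r++;
		/* Emit the gaps between reserved ranges inside mem[m]. */
		for (size_t k = r; k < nrsv && rsv[k].start < mem[m].end; k++) {
			if (rsv[k].start > cur && n < out_max)
				out[n++] = (struct toy_range){ cur, rsv[k].start };
			if (rsv[k].end > cur)
				cur = rsv[k].end;
		}
		/* Emit whatever is left after the last reserved range. */
		if (cur < mem[m].end && n < out_max)
			out[n++] = (struct toy_range){ cur, mem[m].end };
	}
	return n;
}
```

For memory ranges 0x1000..0x9000 and 0x10000..0x20000 with reserved ranges 0x0..0x2000, 0x4000..0x5000, 0x8000..0x12000 and 0x30000..0x31000, the free ranges are 0x2000..0x4000, 0x5000..0x8000 and 0x12000..0x20000.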

for_each_free_mem_range_reverse(i, nid, flags, p_start, p_end, p_nid)

rev-iterate through free memblock areas

Parameters

i
u64 used as loop variable
nid
node selector, NUMA_NO_NODE for all nodes
flags
pick from blocks based on memory attributes
p_start
ptr to phys_addr_t for start address of the range, can be NULL
p_end
ptr to phys_addr_t for end address of the range, can be NULL
p_nid
ptr to int for nid of the range, can be NULL

Description

Walks over free (memory && !reserved) areas of memblock in reverse order. Available as soon as memblock is initialized.

for_each_resv_unavail_range(i, p_start, p_end)

iterate through reserved and unavailable memory

Parameters

i
u64 used as loop variable
p_start
ptr to phys_addr_t for start address of the range, can be NULL
p_end
ptr to phys_addr_t for end address of the range, can be NULL

Description

Walks over unavailable but reserved (reserved && !memory) areas of memblock. Available as soon as memblock is initialized.

Note

because this memory does not belong to any physical node, the flags and nid arguments do not make sense and thus are not exported as arguments.

void memblock_set_current_limit(phys_addr_t limit)

Set the current allocation limit to allow limiting allocations to what is currently accessible during boot

Parameters

phys_addr_t limit
New limit value (physical address)

unsigned long memblock_region_memory_base_pfn(const struct memblock_region * reg)

get the lowest pfn of the memory region

Parameters

const struct memblock_region * reg
memblock_region structure

Return

the lowest pfn intersecting with the memory region

unsigned long memblock_region_memory_end_pfn(const struct memblock_region * reg)

get the end pfn of the memory region

Parameters

const struct memblock_region * reg
memblock_region structure

Return

the end_pfn of the memory region

unsigned long memblock_region_reserved_base_pfn(const struct memblock_region * reg)

get the lowest pfn of the reserved region

Parameters

const struct memblock_region * reg
memblock_region structure

Return

the lowest pfn intersecting with the reserved region

unsigned long memblock_region_reserved_end_pfn(const struct memblock_region * reg)

get the end pfn of the reserved region

Parameters

const struct memblock_region * reg
memblock_region structure

Return

the end_pfn of the reserved region
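
The difference between the memory and reserved variants of these helpers lies in the rounding direction. The userspace sketch below assumes that memory regions are rounded inward (only whole pages count) while reserved regions are rounded outward (every touched page counts), with 4 KiB pages. The `toy_` names are hypothetical stand-ins, not the kernel helpers themselves.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE (UINT64_C(1) << TOY_PAGE_SHIFT)

struct toy_region {
	uint64_t base;
	uint64_t size;
};

static inline uint64_t toy_pfn_up(uint64_t addr)
{
	return (addr + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;
}

static inline uint64_t toy_pfn_down(uint64_t addr)
{
	return addr >> TOY_PAGE_SHIFT;
}

static inline uint64_t toy_region_end(const struct toy_region *reg)
{
	assert(reg->base + reg->size >= reg->base); /* no wrap-around */
	return reg->base + reg->size;
}

/* Memory regions: count only pages that lie completely inside the region. */
uint64_t toy_memory_base_pfn(const struct toy_region *reg)
{
	return toy_pfn_up(reg->base);
}

uint64_t toy_memory_end_pfn(const struct toy_region *reg)
{
	return toy_pfn_down(toy_region_end(reg));
}

/* Reserved regions: cover every page that the region touches. */
uint64_t toy_reserved_base_pfn(const struct toy_region *reg)
{
	return toy_pfn_down(reg->base);
}

uint64_t toy_reserved_end_pfn(const struct toy_region *reg)
{
	return toy_pfn_up(toy_region_end(reg));
}
```

For a region with base 0x1800 and size 0x2000, the memory variants give the PFN range 2..3, because only page 2 lies fully inside, while the reserved variants give 1..4, because pages 1, 2 and 3 are all touched.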