Linux Init (Early Boot)¶
Linux configuration is split into two major steps: Early-Boot and everything else.
During early boot, Linux sets up immutable resources (such as NUMA nodes), while later operations include things like driver probe and memory hotplug. Linux may read EFI and ACPI information throughout this process to configure logical representations of the devices.
During the Linux early boot stage (functions in the kernel annotated with __init), the system takes the resources created by EFI/BIOS (ACPI tables) and turns them into resources that the kernel can consume.
BIOS, Build and Boot Options¶
There are four pre-boot options that need to be considered during the kernel build, which dictate how memory will be managed by Linux during early boot.
EFI_MEMORY_SP
BIOS/EFI Option that dictates whether memory is SystemRAM or Specific Purpose. Specific Purpose memory will be deferred to drivers to manage - and not immediately exposed as system RAM.
CONFIG_EFI_SOFT_RESERVE
Linux Build config option that dictates whether the kernel supports Specific Purpose memory.
CONFIG_MHP_DEFAULT_ONLINE_TYPE
Linux Build config that dictates whether and how Specific Purpose memory converted to a dax device should be managed (left as DAX or onlined as SystemRAM in ZONE_NORMAL or ZONE_MOVABLE).
nosoftreserve
Linux kernel boot option that dictates whether Soft Reserve should be supported. Similar to CONFIG_EFI_SOFT_RESERVE.
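As a sketch of how these options combine, a kernel built to defer CXL Specific Purpose memory to the driver and online it into ZONE_MOVABLE might carry the following build configuration (the MHP_DEFAULT_ONLINE_TYPE variant names are abbreviated in the text above; the full Kconfig symbol for the movable-zone variant is shown here as an assumption, and values are illustrative):

```
# Build-time: support EFI Specific Purpose (Soft Reserved) memory
CONFIG_EFI_SOFT_RESERVE=y
# Build-time: when hotplugged memory (e.g. a dax device converted to
# SystemRAM) is onlined by default, place it in ZONE_MOVABLE
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE=y
```

With CONFIG_EFI_SOFT_RESERVE=n instead, or with soft-reserve disabled on the kernel command line, the same regions would be handed to the page allocator as SystemRAM during early boot.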
Memory Map Creation¶
While the kernel parses the EFI memory map, if Specific Purpose memory is supported and detected, it will set this region aside as SOFT_RESERVED.

If EFI_MEMORY_SP=0, CONFIG_EFI_SOFT_RESERVE=n, or nosoftreserve=y, Linux will default a CXL device memory region to SystemRAM. This will expose the memory to the kernel page allocator in ZONE_NORMAL, making it available for most allocations (including struct page and page tables).

If Specific Purpose is set and supported, CONFIG_MHP_DEFAULT_ONLINE_TYPE_* dictates whether the memory is onlined by default (_OFFLINE or _ONLINE_*) and, if onlined, which zone it is onlined into by default (_NORMAL or _MOVABLE).

If placed in ZONE_MOVABLE, the memory will not be available for most kernel allocations (such as struct page or page tables). This may significantly impact performance depending on the memory capacity of the system.
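When soft-reserve takes effect, the deferred region typically appears as "Soft Reserved" in the physical memory map exposed via /proc/iomem. A minimal sketch of checking for it, run here against a made-up excerpt rather than a live system (the addresses and window name below are illustrative):

```shell
# Illustrative excerpt of /proc/iomem on a system where a CXL window
# was set aside as Soft Reserved; addresses are invented for the example.
cat <<'EOF' > iomem.sample
100000000-87fffffff : System RAM
1080000000-187fffffff : CXL Window 0
  1080000000-187fffffff : Soft Reserved
EOF

# Count the soft-reserved ranges in the sample map
grep -c 'Soft Reserved' iomem.sample
```

On a real machine, grep /proc/iomem directly; zero matches despite EFI_MEMORY_SP being set suggests soft-reserve support was compiled out or disabled on the command line.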
NUMA Node Reservation¶
Linux refers to the proximity domains (PXM) defined in the SRAT to create NUMA nodes in acpi_numa_init. Typically, there is a 1:1 relation between PXM and NUMA node IDs.
The SRAT is the only ACPI-defined mechanism for describing proximity domains. Linux chooses to, at most, map those 1:1 with NUMA nodes. The CEDT adds a description of SPA ranges which Linux may map to one or more NUMA nodes.
If there are CXL ranges in the CFMWS but not in the SRAT, then a fake PXM is created (as of v6.15). In the future, Linux may reject CFMWS entries not described by the SRAT due to the ambiguity of proximity domain association.
It is important to note that NUMA node creation cannot be done at runtime. All possible NUMA nodes are identified at __init time, more specifically during mm_init. The CEDT and SRAT must contain sufficient PXM data for Linux to identify NUMA nodes and their associated memory regions.

The relevant code lives in linux/drivers/acpi/numa/srat.c.
See Example Platform Configurations for more info.
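Because the set of possible nodes is fixed at __init time, a node reserved for a CXL range is visible in sysfs before any of its memory comes online. A sketch of the distinction, using sample file contents rather than a live system (the mask files do exist under /sys/devices/system/node/ on real machines; the values below are made up):

```shell
# Sample contents of the node mask files; on a real system, read
# /sys/devices/system/node/possible and .../has_memory instead.
printf '0-1\n' > possible.sample    # nodes 0 and 1 were reserved at boot
printf '0\n'   > has_memory.sample  # only node 0 has memory online so far

# Node 1 exists (it was carved out during __init) ...
grep -qx '0-1' possible.sample && echo "node1 reserved"
# ... but its memory is not yet online, e.g. pending driver onlining
grep -qx '0' has_memory.sample && echo "node1 memory not yet online"
```

If a CXL region's PXM was missing from both SRAT and CFMWS at boot, no node is reserved for it, and it cannot be added later.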
Memory Tiers Creation¶
Memory tiers are a collection of NUMA nodes grouped by performance characteristics.
During __init, Linux initializes the system with a default memory tier that contains all nodes marked N_MEMORY.
memory_tier_init is called at boot for all nodes with memory online by default. memory_tier_late_init is called during late-init for nodes set up during driver configuration. Nodes are only marked N_MEMORY if they have online memory.
Tier membership can be inspected in sysfs. For example, a tier containing nodes 0 and 1 reports:

$ cat /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
0-1
If nodes with clearly different performance characteristics end up grouped in the same tier, check the HMAT and CDAT information for the CXL nodes. All nodes default to the DRAM tier unless HMAT/CDAT information is reported to the memory_tier component via access_coordinates.
For more, see CXL access coordinates documentation.
Contiguous Memory Allocation¶
The contiguous memory allocator (CMA) enables reservation of contiguous memory regions on NUMA nodes during early boot. However, CMA cannot reserve memory on NUMA nodes that are not online during early boot.
void __init hugetlb_cma_reserve(int order)
{
	/* ... */
	if (!node_online(nid))
		/* do not allow reservations on offline nodes */
	/* ... */
}
This means that if users intend to defer management of CXL memory to the driver, CMA cannot be used to guarantee huge page allocations. If CXL memory is instead enabled as SystemRAM in ZONE_NORMAL during early boot, per-node CMA reservations can be made with the cma_pernuma or numa_cma kernel command line parameters.
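As an illustrative command-line fragment (sizes are made up for the example; numa_cma= takes node:size pairs, while cma_pernuma= applies one size to each online node):

```
# Reserve 512M of CMA on node 0 and 4G on node 1
numa_cma=0:512M,1:4G

# Or reserve 512M of CMA on every online node
cma_pernuma=512M
```

Either way, the reservation only succeeds for nodes that are online during early boot, per the node_online() check above.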