Linux Init (Early Boot)

Linux configuration is split into two major steps: Early-Boot and everything else.

During early boot, Linux sets up immutable resources (such as NUMA nodes), while later operations include things like driver probe and memory hotplug. Linux may read EFI and ACPI information throughout this process to configure logical representations of the devices.

During the Linux early boot stage (kernel functions marked with the __init annotation), the system takes the resources created by EFI/BIOS (ACPI tables) and turns them into resources that the kernel can consume.
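As an illustration, code marked __init runs once during boot and its backing memory is reclaimed afterwards. The sketch below is a hypothetical example (example_early_setup is an invented name, not a real kernel function):

#include <linux/init.h>
#include <linux/printk.h>

static int __init example_early_setup(void)
{
  /* runs once during early boot; .init.text is freed after boot */
  pr_info("example early-boot setup\n");
  return 0;
}
early_initcall(example_early_setup);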

BIOS, Build and Boot Options

There are four pre-boot options that need to be considered during kernel build which dictate how memory will be managed by Linux during early boot (see the example configuration after this list).

  • EFI_MEMORY_SP

    • BIOS/EFI option that dictates whether memory is SystemRAM or Specific Purpose. Specific Purpose memory will be deferred to drivers to manage, rather than immediately exposed as SystemRAM.

  • CONFIG_EFI_SOFT_RESERVE

    • Linux Build config option that dictates whether the kernel supports Specific Purpose memory.

  • CONFIG_MHP_DEFAULT_ONLINE_TYPE

    • Linux build config option that dictates whether and how Specific Purpose memory that has been converted to a dax device should be managed (left as DAX, or onlined as SystemRAM in ZONE_NORMAL or ZONE_MOVABLE).

  • nosoftreserve

    • Linux kernel boot option (passed as efi=nosoftreserve) that dictates whether Soft Reserve should be supported. It is the boot-time counterpart of CONFIG_EFI_SOFT_RESERVE.
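For example, a build and boot configuration intended to defer CXL memory to drivers might look like the fragment below (illustrative only; exact symbols depend on kernel version):

CONFIG_EFI_SOFT_RESERVE=y
CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE=y

combined with firmware that sets EFI_MEMORY_SP on the CXL ranges and a kernel command line that does not include efi=nosoftreserve.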

Memory Map Creation

While the kernel parses the EFI memory map, if Specific Purpose memory is supported and detected, it will set those regions aside as SOFT_RESERVED.

If EFI_MEMORY_SP=0, CONFIG_EFI_SOFT_RESERVE=n, or efi=nosoftreserve is set, Linux will default a CXL device memory region to SystemRAM. This will expose the memory to the kernel page allocator in ZONE_NORMAL, making it available for most allocations (including struct page and page tables).

If Specific Purpose is set and supported, CONFIG_MHP_DEFAULT_ONLINE_TYPE_* dictates whether the memory is onlined by default (_OFFLINE or _ONLINE_*), and if onlined, which zone it is placed in by default (_ONLINE_KERNEL for ZONE_NORMAL or _ONLINE_MOVABLE for ZONE_MOVABLE).
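The resulting default online type is also exposed at runtime, and can be overridden, via sysfs (example output shown):

/sys/devices/system/memory/auto_online_blocks
online_movable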

If placed in ZONE_MOVABLE, the memory will not be available for most kernel allocations (such as struct page or page tables). This may significantly impact performance depending on the memory capacity of the system.
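To check which path was taken on a live system, /proc/iomem can be scanned for Soft Reserved ranges. A minimal userspace sketch (not part of the kernel sources; run as root to see real addresses):

#include <stdio.h>
#include <string.h>

int main(void)
{
  char line[256];
  FILE *f = fopen("/proc/iomem", "r");

  if (!f) {
    perror("/proc/iomem");
    return 1;
  }
  /* a "Soft Reserved" entry means the range was deferred to drivers */
  while (fgets(line, sizeof(line), f))
    if (strstr(line, "Soft Reserved"))
      fputs(line, stdout);
  fclose(f);
  return 0;
}

No output means the ranges were either onlined as SystemRAM or not enumerated at all.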

NUMA Node Reservation

Linux refers to the proximity domains (PXM) defined in the SRAT to create NUMA nodes in acpi_numa_init. Typically, there is a 1:1 relation between PXM and NUMA node IDs.

The SRAT is the only ACPI-defined mechanism for describing proximity domains, and Linux maps those, at most, 1:1 with NUMA nodes. The CEDT adds a description of SPA ranges which Linux may map to one or more NUMA nodes.

If there are CXL ranges in the CFMWS but not in the SRAT, then a fake PXM is created (as of v6.15). In the future, Linux may reject CFMWS entries not described by the SRAT due to the ambiguity of proximity domain association.

It is important to note that NUMA node creation cannot be done at runtime. All possible NUMA nodes are identified at __init time, more specifically during mm_init. The CEDT and SRAT must therefore contain sufficient PXM data for Linux to identify NUMA nodes and their associated memory regions.
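Because the possible node mask is fixed at boot, it can be compared against the currently online nodes. A small userspace sketch using standard sysfs files:

#include <stdio.h>

/* Print a node mask file such as /sys/devices/system/node/possible. */
static void print_mask(const char *path)
{
  char buf[64];
  FILE *f = fopen(path, "r");

  if (f && fgets(buf, sizeof(buf), f))
    printf("%s: %s", path, buf);
  if (f)
    fclose(f);
}

int main(void)
{
  /* possible is fixed at __init time; online can grow at runtime
   * (e.g. when a dax device is onlined), but only within possible */
  print_mask("/sys/devices/system/node/possible");
  print_mask("/sys/devices/system/node/online");
  return 0;
}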

The relevant code exists in: linux/drivers/acpi/numa/srat.c.

See Example Platform Configurations for more info.

Memory Tiers Creation

Memory tiers are a collection of NUMA nodes grouped by performance characteristics. During __init, Linux initializes the system with a default memory tier that contains all nodes marked N_MEMORY.

memory_tier_init is called at boot for all nodes whose memory is online by default. memory_tier_late_init is called during late init for nodes set up during driver configuration.

Nodes are only marked N_MEMORY if they have online memory.
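The N_MEMORY node mask is visible in sysfs (example output shown):

/sys/devices/system/node/has_memory
0-1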

Tier membership can be inspected in sysfs:

/sys/devices/virtual/memory_tiering/memory_tierN/nodelist
0-1
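To enumerate every tier and its members at once, the memory_tiering directory can be walked. A small userspace sketch over the sysfs layout shown above:

#include <glob.h>
#include <stdio.h>

int main(void)
{
  glob_t g;
  size_t i;
  char buf[128];

  /* each memory_tierN directory is one tier; nodelist holds members */
  if (glob("/sys/devices/virtual/memory_tiering/memory_tier*/nodelist",
           GLOB_NOSORT, NULL, &g))
    return 1;
  for (i = 0; i < g.gl_pathc; i++) {
    FILE *f = fopen(g.gl_pathv[i], "r");

    if (f && fgets(buf, sizeof(buf), f))
      printf("%s: %s", g.gl_pathv[i], buf);
    if (f)
      fclose(f);
  }
  globfree(&g);
  return 0;
}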

If nodes with clearly different performance characteristics end up grouped in the same tier, check the HMAT and CDAT information for the CXL nodes. All nodes default to the DRAM tier unless HMAT/CDAT information is reported to the memory_tier component via access_coordinates.

For more, see CXL access coordinates documentation.

Contiguous Memory Allocation

The contiguous memory allocator (CMA) enables reservation of contiguous memory regions on NUMA nodes during early boot. However, CMA cannot reserve memory on NUMA nodes that are not online during early boot.

void __init hugetlb_cma_reserve(int order)
{
  /* abridged excerpt */
  if (!node_online(nid))
    /* do not allow reservations on offline nodes */
    return;
  /* ... */
}

This means that if users intend to defer management of CXL memory to the driver, CMA cannot be used to guarantee huge page allocations on that memory. If CXL memory is enabled as SystemRAM in ZONE_NORMAL during early boot, per-node CMA reservations can be made with the cma_pernuma or numa_cma kernel command line parameters.
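For example (sizes are illustrative placeholders; see the kernel-parameters documentation for the exact syntax on your kernel version):

cma_pernuma=512M
numa_cma=0:4G,1:1G

cma_pernuma reserves the same amount on every node, while numa_cma allows per-node sizes.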