Previous

Next

Chapter�5��Boot Memory Allocator

It is impractical to statically initialise all the core kernel memory structures at compile time as there are simply far too many permutations of hardware configurations. Yet to set up even the basic structures requires memory as even the physical page allocator, discussed in the next chapter, needs to allocate memory to initialise itself. But how can the physical page allocator allocate memory to initialise itself?

To address this, a specialised allocator called the Boot Memory Allocator is used. It is based on the most basic of allocators, a First Fit allocator which uses a bitmap to represent memory�[Tan01] instead of linked lists of free blocks. If a bit is 1, the page is allocated and 0 if unallocated. To satisfy allocations of sizes smaller than a page, the allocator records the Page Frame Number (PFN) of the last allocation and the offset the allocation ended at. Subsequent small allocations are “merged” together and stored on the same page.

The reader may ask why this allocator is not used for the running system. One compelling reason is that although the first fit allocator does not suffer badly from fragmentation�[JW98], memory frequently has to linearly searched to satisfy an allocation. As this is examining bitmaps, it gets very expensive, especially as the first fit algorithm tends to leave many small free blocks at the beginning of physical memory which still get scanned for large allocations, thus making the process very wasteful�[WJNB95].

unsigned long init_bootmem(unsigned long start, unsigned long page)

�This initialises the memory between 0 and the PFN page. The beginning of usable memory is at the PFN start

void reserve_bootmem(unsigned long addr, unsigned long size)

�Mark the pages between the address addr and addr+size reserved. Requests to partially reserve a page will result in the full page being reserved

void free_bootmem(unsigned long addr, unsigned long size)

�Mark the pages between the address addr and addr+size free

void * alloc_bootmem(unsigned long size)

�Allocate size number of bytes from ZONE_NORMAL. The allocation will be aligned to the L1 hardware cache to get the maximum benefit from the hardware cache

void * alloc_bootmem_low(unsigned long size)

�Allocate size number of bytes from ZONE_DMA. The allocation will be aligned to the L1 hardware cache

void * alloc_bootmem_pages(unsigned long size)

�Allocate size number of bytes from ZONE_NORMAL aligned on a page size so that full pages will be returned to the caller

void * alloc_bootmem_low_pages(unsigned long size)

�Allocate size number of bytes from ZONE_NORMAL aligned on a page size so that full pages will be returned to the caller

unsigned long bootmem_bootmap_pages(unsigned long pages)

�Calculate the number of pages required to store a bitmap representing the allocation state of pages number of pages

unsigned long free_all_bootmem()

�Used at the boot allocator end of life. It cycles through all pages in the bitmap. For each one that is free, the flags are cleared and the page is freed to the physical page allocator (See next chapter) so the runtime allocator can set up its free lists

Table 5.1: Boot Memory Allocator API for UMA Architectures

There are two very similar but distinct APIs for the allocator. One is for UMA architectures, listed in Table ?? and the other is for NUMA, listed in Table ??. The principle difference is that the NUMA API must be supplied with the node affected by the operation but as the callers of these APIs exist in the architecture dependant layer, it is not a significant problem.

This chapter will begin with a description of the structure the allocator uses to describe the physical memory available for each node. We will then illustrate how the limits of physical memory and the sizes of each zone are discovered before talking about how the information is used to initialised the boot memory allocator structures. The allocation and free routines will then be discussed before finally talking about how the boot memory allocator is retired.

unsigned long init_bootmem_node(pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn)

�For use with NUMA architectures. It initialise the memory between PFNs startpfn and endpfn with the first usable PFN at freepfn. Once initialised, the pgdat node is inserted into the pgdat_list

void reserve_bootmem_node(pg_data_t *pgdat, unsigned long physaddr, unsigned long size)

�Mark the pages between the address addr and addr+size on the specified node pgdat reserved. Requests to partially reserve a page will result in the full page being reserved

void free_bootmem_node(pg_data_t *pgdat, unsigned long physaddr, unsigned long size)

�Mark the pages between the address addr and addr+size on the specified node pgdat free

void * alloc_bootmem_node(pg_data_t *pgdat, unsigned long size)

�Allocate size number of bytes from ZONE_NORMAL on the specified node pgdat. The allocation will be aligned to the L1 hardware cache to get the maximum benefit from the hardware cache

void * alloc_bootmem_pages_node(pg_data_t *pgdat, unsigned long size)

�Allocate size number of bytes from ZONE_NORMAL on the specified node pgdat aligned on a page size so that full pages will be returned to the caller

void * alloc_bootmem_low_pages_node(pg_data_t *pgdat, unsigned long size)

�Allocate size number of bytes from ZONE_NORMAL on the specified node pgdat aligned on a page size so that full pages will be returned to the caller

unsigned long free_all_bootmem_node(pg_data_t *pgdat)

�Used at the boot allocator end of life. It cycles through all pages in the bitmap for the specified node. For each one that is free, the page flags are cleared and the page is freed to the physical page allocator (See next chapter) so the runtime allocator can set up its free lists

Table 5.2: Boot Memory Allocator API for NUMA Architectures

5.1��Representing the Boot Map

A bootmem_data struct exists for each node of memory in the system. It contains the information needed for the boot memory allocator to allocate memory for a node such as the bitmap representing allocated pages and where the memory is located. It is declared as follows in <linux/bootmem.h>:

 25 typedef struct bootmem_data {
 26     unsigned long node_boot_start;
 27     unsigned long node_low_pfn;
 28     void *node_bootmem_map;
 29     unsigned long last_offset;
 30     unsigned long last_pos;
 31 } bootmem_data_t;

The fields of this struct are as follows:

: node_boot_start This is the starting physical address of the represented block;
: node_low_pfn This is the end physical address, in other words, the end of the ZONE_NORMAL this node represents;
: node_bootmem_map This is the location of the bitmap representing allocated or free pages with each bit;
: last_offset This is the offset within the the page of the end of the last allocation. If 0, the page used is full;
: last_pos This is the the PFN of the page used with the last allocation. Using this with the last_offset field, a test can be made to see if allocations can be merged with the page used for the last allocation rather than using up a full new page.

5.2��Initialising the Boot Memory Allocator

Each architecture is required to supply a setup_arch() function which, among other tasks, is responsible for acquiring the necessary parameters to initialise the boot memory allocator.

Each architecture has its own function to get the necessary parameters. On the x86, it is called setup_memory(), as discussed in Section 2.2.2, but on other architectures such as MIPS or Sparc, it is called bootmem_init() or the case of the PPC, do_init_bootmem(). Regardless of the architecture, the tasks are essentially the same. The parameters it calculates are:

: min_low_pfn This is the lowest PFN that is available in the system;
: max_low_pfn This is the highest PFN that may be addressed by low memory (ZONE_NORMAL);
: highstart_pfn This is the PFN of the beginning of high memory (ZONE_HIGHMEM);
: highend_pfn This is the last PFN in high memory;
: max_pfn Finally, this is the last PFN available to the system.

5.2.1��Initialising `bootmem_data`

Once the limits of usable physical memory are discovered by setup_memory(), one of two boot memory initialisation functions is selected and provided with the start and end PFN for the node to be initialised. init_bootmem(), which initialises contig_page_data, is used by UMA architectures, while init_bootmem_node() is for NUMA to initialise a specified node. Both function are trivial and rely on init_bootmem_core() to do the real work.

The first task of the core function is to insert this pgdat_data_t into the pgdat_list as at the end of this function, the node is ready for use. It then records the starting and end address for this node in its associated bootmem_data_t and allocates the bitmap representing page allocations. The size in bytes, hence the division by 8, of the bitmap required is calculated as:

mapsize = ((end_pfn - start_pfn) + 7) / 8

The bitmap in stored at the physical address pointed to by

bootmem_data_t→node_boot_start and the virtual address to the map is placed in bootmem_data_t→node_bootmem_map. As there is no architecture independent way to detect “holes” in memory, the entire bitmap is initialised to 1, effectively marking all pages allocated. It is up to the architecture dependent code to set the bits of usable pages to 0 although, in reality, the Sparc architecture is the only one which uses this bitmap. In the case of the x86, the function register_bootmem_low_pages() reads through the e820 map and calls free_bootmem() for each usable page to set the bit to 0 before using reserve_bootmem() to reserve the pages needed by the actual bitmap.

5.3��Allocating Memory

The reserve_bootmem() function may be used to reserve pages for use by the caller but is very cumbersome to use for general allocations. There are four functions provided for easy allocations on UMA architectures called alloc_bootmem(), alloc_bootmem_low(), alloc_bootmem_pages() and alloc_bootmem_low_pages() which are fully described in Table ??. All of these macros call __alloc_bootmem() with different parameters. The call graph for these functions is shown in in Figure 5.1.

Figure 5.1: Call Graph: alloc_bootmem()

Similar functions exist for NUMA which take the node as an additional parameter, as listed in Table ??. They are called alloc_bootmem_node(), alloc_bootmem_pages_node() and alloc_bootmem_low_pages_node(). All of these macros call __alloc_bootmem_node() with different parameters.

The parameters to either __alloc_bootmem() and __alloc_bootmem_node() are essentially the same. They are

: pgdat This is the node to allocate from. It is omitted in the UMA case as it is assumed to be contig_page_data;
: size This is the size in bytes of the requested allocation;
: align This is the number of bytes that the request should be aligned to. For small allocations, they are aligned to SMP_CACHE_BYTES, which on the x86 will align to the L1 hardware cache;
: goal This is the preferred starting address to begin allocating from. The “low” functions will start from physical address 0 where as the others will begin from MAX_DMA_ADDRESS which is the maximum address DMA transfers may be made from on this architecture.

The core function for all the allocation APIs is __alloc_bootmem_core(). It is a large function but with simple steps that can be broken down. The function linearly scans memory starting from the goal address for a block of memory large enough to satisfy the allocation. With the API, this address will either be 0 for DMA-friendly allocations or MAX_DMA_ADDRESS otherwise.

The clever part, and the main bulk of the function, deals with deciding if this new allocation can be merged with the previous one. It may be merged if the following conditions hold:

The page used for the previous allocation (bootmem_data→pos) is adjacent to the page found for this allocation;
The previous page has some free space in it (bootmem_data→offset != 0);
The alignment is less than PAGE_SIZE.

Regardless of whether the allocations may be merged or not, the pos and offset fields will be updated to show the last page used for allocating and how much of the last page was used. If the last page was fully used, the offset is 0.

5.4��Freeing Memory

In contrast to the allocation functions, only two free function are provided which are free_bootmem() for UMA and free_bootmem_node() for NUMA. They both call free_bootmem_core() with the only difference being that a pgdat is supplied with NUMA.

The core function is relatively simple in comparison to the rest of the allocator. For each full page affected by the free, the corresponding bit in the bitmap is set to 0. If it already was 0, BUG() is called to show a double-free occured. BUG() is used when an unrecoverable error due to a kernel bug occurs. It terminates the running process and causes a kernel oops which shows a stack trace and debugging information that a developer can use to fix the bug.

An important restriction with the free functions is that only full pages may be freed. It is never recorded when a page is partially allocated so if only partially freed, the full page remains reserved. This is not as major a problem as it appears as the allocations always persist for the lifetime of the system; However, it is still an important restriction for developers during boot time.

5.5��Retiring the Boot Memory Allocator

Late in the bootstrapping process, the function start_kernel() is called which knows it is safe to remove the boot allocator and all its associated data structures. Each architecture is required to provide a function mem_init() that is responsible for destroying the boot memory allocator and its associated structures.

Figure 5.2: Call Graph: mem_init()

The purpose of the function is quite simple. It is responsible for calculating the dimensions of low and high memory and printing out an informational message to the user as well as performing final initialisations of the hardware if necessary. On the x86, the principal function of concern for the VM is the free_pages_init().

This function first tells the boot memory allocator to retire itself by calling free_all_bootmem() for UMA architectures or free_all_bootmem_node() for NUMA. Both call the core function free_all_bootmem_core() with different parameters. The core function is simple in principle and performs the following tasks:

For all unallocated pages known to the allocator for this node;
- Clear the PG_reserved flag in its struct page;
- Set the count to 1;
- Call __free_pages() so that the buddy allocator (discussed next chapter) can build its free lists.
Free all pages used for the bitmap and give them to the buddy allocator.

At this stage, the buddy allocator now has control of all the pages in low memory which leaves only the high memory pages. After free_all_bootmem() returns, it first counts the number of reserved pages for accounting purposes. The remainder of the free_pages_init() function is responsible for the high memory pages. However, at this point, it should be clear how the global mem_map array is allocated, initialised and the pages given to the main allocator. The basic flow used to initialise pages in low memory in a single node system is shown in Figure 5.3.

Figure 5.3: Initialising mem_map and the Main Physical Page Allocator

Once free_all_bootmem() returns, all the pages in ZONE_NORMAL have been given to the buddy allocator. To initialise the high memory pages, free_pages_init() calls one_highpage_init() for every page between highstart_pfn and highend_pfn. one_highpage_init() simple clears the PG_reserved flag, sets the PG_highmem flag, sets the count to 1 and calls __free_pages() to release it to the buddy allocator in the same manner free_all_bootmem_core() did.

At this point, the boot memory allocator is no longer required and the buddy allocator is the main physical page allocator for the system. An interesting feature to note is that not only is the data for the boot allocator removed but also all code that was used to bootstrap the system. All initilisation function that are required only during system start-up are marked __init such as the following;

321 unsigned long __init free_all_bootmem (void)

All of these functions are placed together in the .init section by the linker. On the x86, the function free_initmem() walks through all pages from __init_begin to __init_end and frees up the pages to the buddy allocator. With this method, Linux can free up a considerable amount of memory that is used by bootstrapping code that is no longer required. For example, 27 pages were freed while booting the kernel running on the machine this document is composed on.

5.6��What's New in 2.6

The boot memory allocator has not changed significantly since 2.4 and is mainly concerned with optimisations and some minor NUMA related modifications. The first optimisation is the addition of a last_success field to the bootmem_data_t struct. As the name suggests, it keeps track of the location of the last successful allocation to reduce search times. If an address is freed before last_success, it will be changed to the freed location.

The second optimisation is also related to the linear search. When searching for a free page, 2.4 test every bit which is expensive. 2.6 instead tests if a block of BITS_PER_LONG is all ones. If it's not, it will test each of the bits individually in that block. To help the linear search, nodes are ordered in order of their physical addresses by init_bootmem().

The last change is related to NUMA and contiguous architectures. Contiguous architectures now define their own init_bootmem() function and any architecture can optionally define their own reserve_bootmem() function.

Previous

Next