Chapter 13 Out Of Memory Management

The last aspect of the VM we are going to discuss is the Out Of Memory (OOM) manager. This is intentionally a very short chapter as the OOM manager has one simple task: check if there is enough available memory to satisfy a request, verify that the system is truly out of memory and, if so, select a process to kill. This is a controversial part of the VM and it has been suggested on many occasions that it be removed. Regardless of whether it exists in the latest kernel, it is still a useful system to examine as it touches on a number of other subsystems.

13.1 Checking Available Memory

For certain operations, such as expanding the heap with brk() or remapping an address space with mremap(), the system will check if there is enough available memory to satisfy a request. Note that this is separate from the out_of_memory() path that is covered in the next section. This path is used to avoid the system being in a state of OOM if at all possible.

When checking available memory, the number of required pages is passed as a parameter to vm_enough_memory(). Unless the system administrator has specified that the system should overcommit memory, the amount of available memory will be checked. To determine how many pages are potentially available, Linux sums up the following bits of data:

: Total page cache as page cache is easily reclaimed
: Total free pages because they are already available
: Total free swap pages as userspace pages may be paged out
: Total pages managed by swapper_space although this double-counts the free swap pages. This is balanced by the fact that slots are sometimes reserved but not used
: Total pages used by the dentry cache as they are easily reclaimed
: Total pages used by the inode cache as they are easily reclaimed

If the total number of pages added here is sufficient for the request, vm_enough_memory() returns true to the caller. If false is returned, the caller knows that the memory is not available and usually decides to return -ENOMEM to userspace.
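The summation above can be illustrated with a small userspace model. This is only a sketch: the struct, its field names and the function enough_memory_sketch() are hypothetical stand-ins for the kernel's own counters and for vm_enough_memory(), and the comparison used for "sufficient" is an assumption of the sketch.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the counters listed in the text. The field
 * names are illustrative and are not the kernel's own variables. */
struct mem_counters {
    unsigned long page_cache_pages;     /* easily reclaimed */
    unsigned long free_pages;           /* already available */
    unsigned long free_swap_pages;      /* userspace pages may be paged out */
    unsigned long swapper_space_pages;  /* double-counts some free swap */
    unsigned long dentry_cache_pages;   /* easily reclaimed */
    unsigned long inode_cache_pages;    /* easily reclaimed */
};

/* Returns true if the request could potentially be satisfied. */
bool enough_memory_sketch(const struct mem_counters *c,
                          unsigned long pages_requested,
                          bool overcommit)
{
    unsigned long available;

    /* The administrator asked for overcommit, so no check is made. */
    if (overcommit)
        return true;

    available  = c->page_cache_pages;
    available += c->free_pages;
    available += c->free_swap_pages;
    available += c->swapper_space_pages;
    available += c->dentry_cache_pages;
    available += c->inode_cache_pages;

    /* Assumption: the total must exceed the size of the request. */
    return available > pages_requested;
}
```

A caller in the position of brk() would treat a false result as the cue to return -ENOMEM to userspace.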

13.2 Determining OOM Status

When the machine is low on memory, old page frames will be reclaimed (see Chapter 10) but, despite reclaiming pages, the system may find that it was unable to free enough pages to satisfy a request even when scanning at highest priority. If it does fail to free page frames, out_of_memory() is called to see if the system is out of memory and needs to kill a process.

Figure 13.1: Call Graph: out_of_memory()

Unfortunately, it is possible that the system is not out of memory and simply needs to wait for IO to complete or for pages to be swapped to backing storage. This is unfortunate, not because the system has memory, but because the function is being called unnecessarily, opening the possibility of processes being unnecessarily killed. Before deciding to kill a process, it goes through the following checklist.

Is there enough swap space left (nr_swap_pages > 0) ? If yes, not OOM
Has it been more than 5 seconds since the last failure? If yes, not OOM
Have failures been occurring for at least one second? If no, not OOM
If there have not been at least 10 failures in the last 5 seconds, we're not OOM
Has a process been killed within the last 5 seconds? If yes, not OOM

It is only if the above tests are passed that oom_kill() is called to select a process to kill.
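The checklist can be modelled in userspace as a function that is called on every allocation failure and remembers a little state between calls. This is a sketch under stated assumptions: struct oom_state, should_oom_kill() and TICKS_PER_SEC are hypothetical names, time is passed in explicitly rather than read from a clock, and the one-second check is interpreted as requiring that the current run of failures has lasted at least a second.

```c
#include <stdbool.h>

#define TICKS_PER_SEC 100UL  /* hypothetical stand-in for the kernel's HZ */

/* State remembered between failures (hypothetical layout). */
struct oom_state {
    unsigned long first;     /* start of the current run of failures */
    unsigned long last;      /* time of the most recent failure */
    unsigned long count;     /* failures counted in the current run */
    unsigned long lastkill;  /* time of the most recent kill */
};

/* Returns true only when every check in the list says "really OOM". */
bool should_oom_kill(struct oom_state *s, unsigned long now,
                     long free_swap_pages)
{
    unsigned long since;
    bool kill = false;

    /* Swap space left: not OOM. */
    if (free_swap_pages > 0)
        return false;

    since = now - s->last;
    s->last = now;

    /* More than 5 seconds since the last failure starts a new run. */
    if (since <= 5 * TICKS_PER_SEC) {
        /* The run of failures is under a second old: not OOM yet. */
        if (now - s->first < TICKS_PER_SEC)
            return false;

        /* Fewer than 10 failures counted in this run: not OOM yet. */
        if (++s->count < 10)
            return false;

        /* A process was killed recently: give it time to exit. */
        if (now - s->lastkill < 5 * TICKS_PER_SEC)
            return false;

        s->lastkill = now;
        kill = true;
    }

    /* Begin a new run after a long gap or after deciding to kill. */
    s->first = now;
    s->count = 0;
    return kill;
}
```

In this model a true result corresponds to the point at which oom_kill() would be called.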

13.3 Selecting a Process

The function select_bad_process() is responsible for choosing a process to kill. It decides by stepping through each running task and calculating how suitable it is for killing with the function badness(). The badness is calculated as follows; note that the square roots are integer approximations calculated with int_sqrt():

badness_for_task = total_vm_for_task / (sqrt(cpu_time_in_seconds) *
sqrt(sqrt(cpu_time_in_minutes)))

This has been chosen to select a process that is using a large amount of memory but is not that long lived. Processes which have been running a long time are unlikely to be the cause of memory shortage so this calculation is likely to select a process that uses a lot of memory but has not been running long. If the process is a root process or has CAP_SYS_ADMIN capabilities, the points are divided by four as it is assumed that root privilege processes are well behaved. Similarly, if it has CAP_SYS_RAWIO capabilities (access to raw devices), the points are further divided by four as it is undesirable to kill a process that has direct access to hardware.
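The formula and the two privilege adjustments can be tried out with a short userspace sketch. The names isqrt_min1() and badness_sketch() are hypothetical, the two time values are taken as plain parameters following the formula above, and the square root helper is an exact floor clamped to a minimum of 1 rather than the kernel's own int_sqrt() approximation.

```c
#include <stdbool.h>

/* Floor of the square root, clamped to at least 1 so that it is always
 * safe to divide by. Hypothetical helper, not the kernel's int_sqrt(). */
unsigned long isqrt_min1(unsigned long x)
{
    unsigned long r = 0;

    /* (r + 1)^2 <= x, written as a division to avoid overflow. */
    while (r + 1 <= x / (r + 1))
        r++;
    return r ? r : 1;
}

/* Larger results mean a better candidate for killing. */
unsigned long badness_sketch(unsigned long total_vm,
                             unsigned long cpu_time_in_seconds,
                             unsigned long cpu_time_in_minutes,
                             bool root_or_sys_admin,
                             bool has_rawio)
{
    unsigned long points = total_vm;

    /* Heavy memory use raises the score, a long life lowers it. */
    points /= isqrt_min1(cpu_time_in_seconds);
    points /= isqrt_min1(isqrt_min1(cpu_time_in_minutes));

    /* Root or CAP_SYS_ADMIN processes are assumed to be well behaved. */
    if (root_or_sys_admin)
        points /= 4;

    /* Processes with direct hardware access are undesirable victims. */
    if (has_rawio)
        points /= 4;

    return points;
}
```

With these definitions, a task using 120000 pages after an hour of CPU time scores 1000, while a task of the same size that has only just started scores 120000, which is the bias towards young, large processes described above.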

13.4 Killing the Selected Process

Once a task is selected, the list is walked again and each process that shares the same mm_struct as the selected process (i.e. they are threads) is sent a signal. If the process has CAP_SYS_RAWIO capabilities, a SIGTERM is sent to give the process a chance of exiting cleanly, otherwise a SIGKILL is sent.
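The second walk over the task list can be sketched as follows. The struct task_sketch and the function oom_kill_sketch() are hypothetical; the model records which signal would be sent rather than sending one, and it assumes the capability test is applied to each task that shares the address space.

```c
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced view of a task. */
struct task_sketch {
    int pid;
    const void *mm;   /* stands in for the task's mm_struct pointer */
    bool has_rawio;   /* models CAP_SYS_RAWIO */
    int signal_sent;  /* 0 means no signal was sent */
};

/* Signal every task sharing the victim's address space and return
 * how many tasks were signalled. */
size_t oom_kill_sketch(struct task_sketch *tasks, size_t n,
                       const struct task_sketch *victim)
{
    size_t signalled = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (tasks[i].mm != victim->mm)
            continue;

        /* Tasks with raw IO access get a chance to exit cleanly. */
        tasks[i].signal_sent = tasks[i].has_rawio ? SIGTERM : SIGKILL;
        signalled++;
    }
    return signalled;
}
```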

13.5 Is That It?

Yes, that's it. Out of memory management touches a lot of subsystems but, otherwise, there is not much to it.

13.6 What's New in 2.6

The majority of OOM management remains essentially the same for 2.6 except for the introduction of VM accounted objects. These are VMAs that are flagged with the VM_ACCOUNT flag, first mentioned in Section 4.8. Additional checks will be made to ensure there is memory available when performing operations on VMAs with this flag set. The principal incentive for this complexity is to avoid the need for an OOM killer.

Some regions which always have the VM_ACCOUNT flag set are the process stack, the process heap, regions mmap()ed with MAP_SHARED, private regions that are writable and regions set up with shmget(). In other words, most userspace mappings have the VM_ACCOUNT flag set.

Linux accounts for the amount of memory that is committed to these VMAs with vm_acct_memory() which increments a variable called committed_space. When the VMA is freed, the committed space is decremented with vm_unacct_memory(). This is a fairly simple mechanism, but it allows Linux to remember how much memory it has already committed to userspace when deciding if it should commit more.
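The bookkeeping amounts to a single running total. A minimal userspace sketch, in which all three function names and the variable are hypothetical and no attempt is made to protect the counter against concurrent callers, might look like this:

```c
/* Running total of pages committed to accounted VMAs (hypothetical
 * stand-in for the committed_space variable described in the text). */
static long committed_pages;

/* Called when an accounted region is created or grown. */
void acct_memory_sketch(long pages)
{
    committed_pages += pages;
}

/* Called when an accounted region is freed or shrunk. */
void unacct_memory_sketch(long pages)
{
    committed_pages -= pages;
}

/* Read back the total when deciding whether to commit more. */
long committed_pages_sketch(void)
{
    return committed_pages;
}
```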

The checks are performed by calling security_vm_enough_memory() which introduces us to another new feature. 2.6 has a feature available which allows security related kernel modules to override certain kernel functions. The full list of hooks available is stored in a struct security_operations called security_ops. There are a number of dummy, or default, functions that may be used which are all listed in security/dummy.c but the majority do nothing except return. If there are no security modules loaded, the security_operations struct used is called dummy_security_ops which uses all the default functions.
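The hook mechanism is a table of function pointers with a default implementation behind each entry. The following userspace sketch shows the pattern with a single hook. Every name in it is hypothetical, the default hook is a placeholder that always allows the request, and the convention that 1 means "proceed" follows the description of the 2.6 check in this section.

```c
#include <stddef.h>

/* Hypothetical one-entry version of a hook table. */
struct security_ops_sketch {
    int (*vm_enough_memory)(long pages);
};

/* Placeholder default: always allow the request. */
int default_vm_enough_memory_sketch(long pages)
{
    (void)pages;
    return 1;
}

/* Example of a hook a loaded module might supply: refuse everything. */
int deny_all_vm_enough_memory_sketch(long pages)
{
    (void)pages;
    return 0;
}

struct security_ops_sketch default_ops_sketch = {
    default_vm_enough_memory_sketch
};
struct security_ops_sketch deny_ops_sketch = {
    deny_all_vm_enough_memory_sketch
};

/* The table in use. With no module loaded it is the default table. */
struct security_ops_sketch *active_ops_sketch = &default_ops_sketch;

/* Install a module's table, or fall back to the defaults on NULL. */
void register_ops_sketch(struct security_ops_sketch *ops)
{
    active_ops_sketch = ops ? ops : &default_ops_sketch;
}

/* Callers go through the table rather than naming a function. */
int security_check_sketch(long pages)
{
    return active_ops_sketch->vm_enough_memory(pages);
}
```

The design choice worth noting is that callers never test whether a module is loaded; they always call through the table, and the default table absorbs the "no module" case.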

By default, security_vm_enough_memory() calls dummy_vm_enough_memory() which is declared in security/dummy.c and is very similar to 2.4's vm_enough_memory() function. The new version adds the following pieces of information together to determine available memory:

: Total page cache as page cache is easily reclaimed
: Total free pages because they are already available
: Total free swap pages as userspace pages may be paged out
: Slab pages with SLAB_RECLAIM_ACCOUNT set as they are easily reclaimed

These pages, minus a 3% reserve for root processes, are the total amount of memory that is available for the request. If the memory is available, it makes a check to ensure the total amount of committed memory does not exceed the allowed threshold. The allowed threshold is TotalRam * (OverCommitRatio/100) + TotalSwapPage, where OverCommitRatio is set by the system administrator. If the total amount of committed space is not too high, 1 will be returned so that the allocation can proceed.
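The two-stage check can be worked through with a userspace sketch. The struct, its field names and enough_memory_26_sketch() are hypothetical, the committed total is assumed to already include the pages being requested, and both stages are applied in sequence exactly as the paragraph above describes them.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the values used by the 2.6 style check. */
struct vm_counters_26 {
    unsigned long page_cache_pages;
    unsigned long free_pages;
    unsigned long free_swap_pages;
    unsigned long reclaimable_slab_pages;  /* SLAB_RECLAIM_ACCOUNT slabs */
    unsigned long total_ram_pages;
    unsigned long total_swap_pages;
    unsigned long overcommit_ratio;        /* percentage set by the admin */
};

/* Returns 1 if the allocation may proceed and 0 otherwise. */
int enough_memory_26_sketch(const struct vm_counters_26 *c,
                            unsigned long pages_requested,
                            bool is_root,
                            unsigned long committed_pages)
{
    unsigned long available;
    unsigned long allowed;

    available  = c->page_cache_pages;
    available += c->free_pages;
    available += c->free_swap_pages;
    available += c->reclaimable_slab_pages;

    /* Hold back 3% that only root processes may use. */
    if (!is_root)
        available -= available * 3 / 100;

    /* Assumption: the total must exceed the size of the request. */
    if (available <= pages_requested)
        return 0;

    /* TotalRam * (OverCommitRatio/100) + TotalSwapPage */
    allowed = c->total_ram_pages * c->overcommit_ratio / 100
              + c->total_swap_pages;

    return committed_pages <= allowed ? 1 : 0;
}
```

For example, with 4000 pages of RAM, 1000 pages of swap and a ratio of 50, the threshold is 3000 pages, so a request that would take the committed total to 3500 pages is refused even if enough reclaimable memory exists.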