Appendix M Out of Memory Management

M.1 Determining Available Memory

M.1.1 Function: vm_enough_memory

Source: mm/mmap.c

 53 int vm_enough_memory(long pages)
 54 {
 65     unsigned long free;
 66     
 67     /* Sometimes we want to use more memory than we have. */
 68     if (sysctl_overcommit_memory)
 69         return 1;
 70 
 71     /* The page cache contains buffer pages these days.. */
 72     free = atomic_read(&page_cache_size);
 73     free += nr_free_pages();
 74     free += nr_swap_pages;
 75 
 76     /*
 77      * This double-counts: the nrpages are both in the page-cache
 78      * and in the swapper space. At the same time, this compensates
 79      * for the swap-space over-allocation (ie "nr_swap_pages" being
 80      * too small.
 81      */
 82     free += swapper_space.nrpages;
 83 
 84     /*
 85      * The code below doesn't account for free space in the inode
 86      * and dentry slab cache, slab cache fragmentation, inodes and
 87      * dentries which will become freeable under VM load, etc.
 88      * Lets just hope all these (complex) factors balance out...
 89      */
 90     free += (dentry_stat.nr_unused * sizeof(struct dentry)) >> PAGE_SHIFT;
 91     free += (inodes_stat.nr_unused * sizeof(struct inode)) >> PAGE_SHIFT;
 92 
 93     return free > pages;
 94 }

: 68-69 If the system administrator has specified through the proc interface that overcommit is allowed, return immediately, reporting that the memory is available
: 72 Start the free page count with the size of the page cache because these pages can be reclaimed easily
: 73 Add the total number of free pages in the system
: 74 Add the total number of available swap slots
: 82 Add the number of pages managed by swapper_space. This double-counts pages that are in both the page cache and the swapper space, but it is balanced by the fact that some swap slots are reserved for pages and are not currently in use
: 90 Add an estimate of the number of pages occupied by unused entries in the dentry cache
: 91 Add an estimate of the number of pages occupied by unused entries in the inode cache
: 93 Return true if more pages are estimated to be free than were requested
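
The estimate above can be reproduced outside the kernel. The following is a minimal user-space sketch, not the kernel code: struct vm_counters, its field names, and enough_memory_sketch() are hypothetical stand-ins for the kernel globals, and a page shift of 12 (4KiB pages) is an assumption.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4KiB pages */

/* Hypothetical snapshot of the counters read by vm_enough_memory() */
struct vm_counters {
    int overcommit;                /* stands in for sysctl_overcommit_memory */
    unsigned long page_cache;      /* page_cache_size */
    unsigned long free_pages;      /* nr_free_pages() */
    unsigned long swap_pages;      /* nr_swap_pages */
    unsigned long swapper_pages;   /* swapper_space.nrpages */
    unsigned long unused_dentries; /* dentry_stat.nr_unused */
    unsigned long unused_inodes;   /* inodes_stat.nr_unused */
    size_t dentry_size;            /* sizeof(struct dentry) */
    size_t inode_size;             /* sizeof(struct inode) */
};

/* Returns 1 if the request for 'pages' pages looks satisfiable, else 0 */
int enough_memory_sketch(const struct vm_counters *c, long pages)
{
    unsigned long free;

    /* Overcommit allowed: always report success */
    if (c->overcommit)
        return 1;

    /* Easily reclaimed pages, free pages and free swap slots */
    free = c->page_cache + c->free_pages + c->swap_pages;
    free += c->swapper_pages;

    /* Convert unused cache objects from bytes to whole pages */
    free += (c->unused_dentries * c->dentry_size) >> SKETCH_PAGE_SHIFT;
    free += (c->unused_inodes * c->inode_size) >> SKETCH_PAGE_SHIFT;

    return free > (unsigned long)pages;
}
```

With 100 page cache pages, 50 free pages, 25 swap slots, 10 swapper pages, and unused dentries and inodes worth 2 pages each, the estimate is 189 pages, so a request for 188 pages succeeds and a request for 189 fails because the comparison is strict.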

M.2 Detecting and Recovering from OOM

M.2.1 Function: out_of_memory

Source: mm/oom_kill.c

202 void out_of_memory(void)
203 {
204         static unsigned long first, last, count, lastkill;
205         unsigned long now, since;
206 
210         if (nr_swap_pages > 0)
211                 return;
212 
213         now = jiffies;
214         since = now - last;
215         last = now;
216 
221         last = now;
222         if (since > 5*HZ)
223                 goto reset;
224 
229         since = now - first;
230         if (since < HZ)
231                 return;
232 
237         if (++count < 10)
238                 return;
239 
245         since = now - lastkill;
246         if (since < HZ*5)
247                 return;
248 
252         lastkill = now;
253         oom_kill();
254 
255 reset:
256         first = now;
257         count = 0;
258 }

: 210-211 If there are available swap slots, the system is not OOM
: 213-215 Record the current time in jiffies and determine how long it has been since this function was last called
: 222-223 If more than 5 seconds have passed since this function was last called, reset the timer and exit the function
: 229-231 If less than a second has passed since the first failure in the current series (recorded in first), exit the function. It is possible that IO is in progress which will complete soon
: 237-238 If the function has not been called 10 times within the last short interval, the system is not yet OOM
: 245-247 If a process has been killed within the last 5 seconds, exit the function because the dying process is likely to free memory
: 253 The system really is OOM, so call oom_kill() (See Section M.2.2) to select a process to kill
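
The sequence of checks above acts as a rate limiter. The following user-space sketch mirrors it under stated assumptions: the static variables become a hypothetical struct oom_state, the clock is passed in as now instead of being read from jiffies, SKETCH_HZ is assumed to be 100 ticks per second, and the function returns 1 where the kernel would call oom_kill(). The helper first_kill_time() is also hypothetical and only drives the simulation.

```c
#define SKETCH_HZ 100UL  /* assumption: 100 clock ticks per second */

/* Hypothetical replacement for the static variables in out_of_memory() */
struct oom_state {
    unsigned long first, last, count, lastkill;
};

/* Returns 1 where the kernel would call oom_kill(), otherwise 0 */
int oom_should_kill(struct oom_state *s, unsigned long now, long swap_pages)
{
    unsigned long since;
    int do_kill = 0;

    /* Swap slots are still available: not OOM */
    if (swap_pages > 0)
        return 0;

    since = now - s->last;
    s->last = now;

    /* Long gap since the last failure: start a new series */
    if (since > 5 * SKETCH_HZ)
        goto reset;

    /* The series has lasted less than a second: not yet OOM */
    since = now - s->first;
    if (since < SKETCH_HZ)
        return 0;

    /* Fewer than 10 failures counted in this series: not yet OOM */
    if (++s->count < 10)
        return 0;

    /* A kill happened less than 5 seconds ago: give it time */
    since = now - s->lastkill;
    if (since < 5 * SKETCH_HZ)
        return 0;

    s->lastkill = now;
    do_kill = 1;

reset:
    s->first = now;
    s->count = 0;
    return do_kill;
}

/* Call every 'step' ticks from 'start'; return the tick of the first
 * kill decision, or 0 if none occurs up to 'limit' */
unsigned long first_kill_time(unsigned long start, unsigned long step,
                              unsigned long limit)
{
    struct oom_state s = {0, 0, 0, 0};
    unsigned long now;

    for (now = start; now <= limit; now += step)
        if (oom_should_kill(&s, now, 0))
            return now;
    return 0;
}
```

With calls every 20 ticks starting at tick 1000, first_kill_time(1000, 20, 5000) returns 1280: the first call resets the series, the calls during the following second return early, and the tenth counted call arrives at tick 1280. With calls 600 ticks apart, every call resets the series and no kill is ever requested.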

M.2.2 Function: oom_kill

Source: mm/oom_kill.c

This function first calls select_bad_process() to find a suitable process to kill. Once found, the task list is traversed and oom_kill_task() is called for the selected process and all its threads.

172 static void oom_kill(void)
173 {
174         struct task_struct *p, *q;
175 
176         read_lock(&tasklist_lock);
177         p = select_bad_process();
178 
179         /* Found nothing?!?! Either we hang forever, or we panic. */
180         if (p == NULL)
181                 panic("Out of memory and no killable processes...\n");
182 
183         /* kill all processes that share the ->mm (i.e. all threads) */
184         for_each_task(q) {
185                 if (q->mm == p->mm)
186                         oom_kill_task(q);
187         }
188         read_unlock(&tasklist_lock);
189 
190         /*
191          * Make kswapd go out of the way, so "p" has a good chance of
192          * killing itself before someone else gets the chance to ask
193          * for more memory.
194          */
195         yield();
196         return;
197 }

: 176 Acquire the read lock on the task list
: 177 Call select_bad_process() (See Section M.2.3) to find a suitable process to kill
: 180-181 If one could not be found, panic the system because otherwise the system will deadlock. In this case, it is better to panic and have a developer solve the bug than have a mysterious hang
: 184-187 Cycle through the task list and call oom_kill_task() (See Section M.2.5) for the selected process and all its threads. Remember that threads will all share the same mm_struct
: 188 Release the lock
: 195 Call yield() to allow the signals to be delivered and the processes to die. The comments indicate that kswapd will be the one yielding, but it is possible that a process in the direct-reclaim path will be executing this function too
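
The loop that kills every task sharing the victim's address space can be illustrated with a small user-space sketch. Everything here is hypothetical: struct sketch_task is a simplified stand-in for task_struct, mm_id replaces the mm_struct pointer, and setting a flag replaces the call to oom_kill_task().

```c
#include <stddef.h>

/* Hypothetical, simplified task record */
struct sketch_task {
    int pid;
    int mm_id;   /* stand-in for the mm_struct pointer */
    int killed;  /* set where the kernel would call oom_kill_task() */
};

/* Mark every task that shares the victim's mm; return how many were marked */
size_t kill_sharing_mm(struct sketch_task *tasks, size_t n,
                       const struct sketch_task *victim)
{
    size_t i, count = 0;
    int mm = victim->mm_id;  /* copied because victim may point into tasks */

    for (i = 0; i < n; i++) {
        if (tasks[i].mm_id == mm) {
            tasks[i].killed = 1;
            count++;
        }
    }
    return count;
}
```

Because the comparison is on the address space rather than on the pid, the victim itself is caught by the same test as its threads and needs no special case.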

M.2.3 Function: select_bad_process

Source: mm/oom_kill.c

This function is responsible for cycling through the entire task list and returning the process that scored highest with the badness() function.

121 static struct task_struct * select_bad_process(void)
122 {
123     int maxpoints = 0;
124     struct task_struct *p = NULL;
125     struct task_struct *chosen = NULL;
126 
127     for_each_task(p) {
128         if (p->pid) {
129             int points = badness(p);
130             if (points > maxpoints) {
131                 chosen = p;
132                 maxpoints = points;
133             }
134         }
135     }
136     return chosen;
137 }

: 127 Cycle through all tasks in the task list
: 128 If the process is the system idle task, then skip over it
: 129 Call badness() (See Section M.2.4) to score the process
: 130-133 If this is the highest score so far, record it
: 136 Return the task_struct which scored highest with badness()
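
The selection is a plain maximum search. As a minimal sketch, assuming the scores have already been computed, the hypothetical select_highest() below works on an array of hypothetical struct scored_task records instead of the kernel task list and returns an index instead of a task_struct pointer.

```c
#include <stddef.h>

/* Hypothetical record: a pid and a precomputed badness() score */
struct scored_task {
    int pid;
    int points;
};

/* Return the index of the highest-scoring task with a non-zero pid,
 * or -1 if no task scores above zero */
int select_highest(const struct scored_task *tasks, size_t n)
{
    int maxpoints = 0;
    int chosen = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        /* pid 0 is the idle task and is never a candidate */
        if (tasks[i].pid && tasks[i].points > maxpoints) {
            chosen = (int)i;
            maxpoints = tasks[i].points;
        }
    }
    return chosen;
}
```

Two consequences of the strict comparison against an initial maximum of 0 are visible in the sketch: a task that scores 0 can never be chosen, and when two tasks tie, the one that appears first in the list wins.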

M.2.4 Function: badness

Source: mm/oom_kill.c

This calculates a score that determines how suitable the process is for killing. The scoring mechanism is explained in detail in Chapter 13.

 58 static int badness(struct task_struct *p)
 59 {
 60         int points, cpu_time, run_time;
 61 
 62         if (!p->mm)
 63             return 0;
 64
 65         if (p->flags & PF_MEMDIE)
 66             return 0;
 67
 71         points = p->mm->total_vm;
 72 
 79         cpu_time = (p->times.tms_utime + p->times.tms_stime) 
                                                        >> (SHIFT_HZ + 3);
 80         run_time = (jiffies - p->start_time) >> (SHIFT_HZ + 10);
 81 
 82         points /= int_sqrt(cpu_time);
 83         points /= int_sqrt(int_sqrt(run_time));
 84 
 89         if (p->nice > 0)
 90                 points *= 2;
 91 
 96         if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN) ||
 97                                 p->uid == 0 || p->euid == 0)
 98                 points /= 4;
 99 
106         if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
107                 points /= 4;
108 #ifdef DEBUG
109         printk(KERN_DEBUG "OOMkill: task %d (%s) got %d points\n",
110         p->pid, p->comm, points);
111 #endif
112         return points;
113 }

: 62-63 If there is no mm, return 0 as this is a kernel thread
: 65-66 If the process has already been marked by the OOM killer as exiting, return 0 as there is no point trying to kill it multiple times
: 71 The total VM used by the process is the base starting point
: 79-80 cpu_time is the total CPU time the process has consumed in user and system mode, and run_time is the time elapsed since the process started. Both are scaled down by right shifts, run_time much more coarsely than cpu_time; the source comments describe the units as seconds and minutes respectively. Comments indicate that there is no basis for this other than it works well in practice
: 82 Divide the points by the integer square root of cpu_time
: 83 Divide the points by the fourth root of run_time, computed as the integer square root applied twice
: 89-90 If the process has been niced to be of lower priority, double its points as it is likely to be an unimportant process
: 96-98 On the other hand, if the process has superuser privileges or has the CAP_SYS_ADMIN capability, it is likely to be a system process so divide the points by 4
: 106-107 If the process has direct access to hardware, divide the points by 4. Killing these processes forcibly could leave hardware in an inconsistent state. For example, forcibly killing X is never a good idea
: 112 Return the score
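
The arithmetic is easier to follow with concrete numbers. The sketch below is a user-space approximation under stated assumptions: struct score_input and badness_sketch() are hypothetical, cpu_time and run_time are supplied already scaled, the privilege checks are reduced to flags, and isqrt_min1() is an exact floor square root clamped to a minimum of 1 to avoid division by zero, which may differ from the kernel's own int_sqrt() helper.

```c
/* Floor of the square root, clamped to at least 1 (sketch assumption) */
static unsigned int isqrt_min1(unsigned int x)
{
    unsigned int r = 0;

    /* (r + 1)^2 <= x, written as a division to avoid overflow */
    while ((r + 1) <= x / (r + 1))
        r++;
    return r ? r : 1;
}

/* Hypothetical inputs, one per factor used by badness() */
struct score_input {
    int total_vm;           /* p->mm->total_vm */
    unsigned int cpu_time;  /* already scaled CPU time */
    unsigned int run_time;  /* already scaled elapsed time */
    int niced;              /* non-zero if p->nice > 0 */
    int privileged;         /* non-zero for root or CAP_SYS_ADMIN */
    int rawio;              /* non-zero for CAP_SYS_RAWIO */
};

int badness_sketch(const struct score_input *in)
{
    int points = in->total_vm;

    /* Long-running, CPU-hungry processes are penalised less */
    points /= (int)isqrt_min1(in->cpu_time);
    points /= (int)isqrt_min1(isqrt_min1(in->run_time));

    if (in->niced)
        points *= 2;   /* likely unimportant */
    if (in->privileged)
        points /= 4;   /* likely a system process */
    if (in->rawio)
        points /= 4;   /* direct hardware access */

    return points;
}
```

For a process with total_vm of 10000, cpu_time of 16 and run_time of 81, the score is 10000 / 4 = 2500, then 2500 / 3 = 833 because the fourth root of 81 is 3. If the same process is niced, privileged and has raw IO access, the score becomes 833 * 2 = 1666, then 1666 / 4 = 416, then 416 / 4 = 104.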

M.2.5 Function: oom_kill_task

Source: mm/oom_kill.c

This function is responsible for sending the appropriate kill signals to the selected task.

144 void oom_kill_task(struct task_struct *p)
145 {
146         printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", 
                                                         p->pid, p->comm);
147 
148         /*
149          * We give our sacrificial lamb high priority and access to
150          * all the memory it needs. That way it should be able to
151          * exit() and clear out its resources quickly...
152          */
153         p->counter = 5 * HZ;
154         p->flags |= PF_MEMALLOC | PF_MEMDIE;
155 
156         /* This process has hardware access, be more careful. */
157         if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO)) {
158                 force_sig(SIGTERM, p);
159         } else {
160                 force_sig(SIGKILL, p);
161         }
162 }

: 146 Print an informational message on the process being killed
: 153 This gives the dying process lots of time on the CPU so it can kill itself off quickly
: 154 These flags tell the allocator to give favourable treatment to the process if it requires more pages before cleaning itself up
: 157-158 If the process can directly access hardware, send it the SIGTERM signal to give it a chance to exit cleanly
: 160 Otherwise send it the SIGKILL signal to force the process to be killed