
Appendix M  Out of Memory Management

M.1  Determining Available Memory

M.1.1  Function: vm_enough_memory

Source: mm/mmap.c

 53 int vm_enough_memory(long pages)
 54 {
 65     unsigned long free;
 66     
 67     /* Sometimes we want to use more memory than we have. */
 68     if (sysctl_overcommit_memory)
 69         return 1;
 70 
 71     /* The page cache contains buffer pages these days.. */
 72     free = atomic_read(&page_cache_size);
 73     free += nr_free_pages();
 74     free += nr_swap_pages;
 75 
 76     /*
 77      * This double-counts: the nrpages are both in the page-cache
 78      * and in the swapper space. At the same time, this compensates
 79      * for the swap-space over-allocation (ie "nr_swap_pages" being
 80      * too small.
 81      */
 82     free += swapper_space.nrpages;
 83 
 84     /*
 85      * The code below doesn't account for free space in the inode
 86      * and dentry slab cache, slab cache fragmentation, inodes and
 87      * dentries which will become freeable under VM load, etc.
 88      * Lets just hope all these (complex) factors balance out...
 89      */
 90     free += (dentry_stat.nr_unused * sizeof(struct dentry)) >> PAGE_SHIFT;
 91     free += (inodes_stat.nr_unused * sizeof(struct inode)) >> PAGE_SHIFT;
 92 
 93     return free > pages;
 94 }
68-69  If the system administrator has specified via the proc interface that overcommit is allowed, return immediately saying that the memory is available
72  Start the free page count with the size of the page cache, as these pages may be easily reclaimed
73  Add the total number of free pages in the system
74  Add the total number of available swap slots
82  Add the number of pages managed by swapper_space. This double-counts pages that are in both the page cache and the swapper space, but is balanced by the fact that some swap slots are reserved for pages but are not currently in use
90  Add the number of pages' worth of memory held by unused entries in the dentry cache
91  Add the number of pages' worth of memory held by unused entries in the inode cache
93  Return true if there are more free pages available than were requested
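
The estimate above is simple arithmetic over a handful of counters, which can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel code: struct mem_snapshot, its field names and enough_memory() are hypothetical stand-ins for the kernel globals read by vm_enough_memory().

```c
#include <stddef.h>

/* Hypothetical snapshot of the counters that vm_enough_memory() reads. */
struct mem_snapshot {
    int overcommit;                /* stands in for sysctl_overcommit_memory */
    unsigned long page_cache;      /* pages in the page cache */
    unsigned long free_pages;      /* pages on the free lists */
    unsigned long swap_free;       /* free swap slots */
    unsigned long swap_cache;      /* pages managed by swapper_space */
    unsigned long unused_dentries; /* count of unused dentries */
    unsigned long unused_inodes;   /* count of unused inodes */
    size_t dentry_size;            /* bytes per dentry (assumed value) */
    size_t inode_size;             /* bytes per inode (assumed value) */
    unsigned int page_shift;       /* log2 of the page size */
};

/* Returns nonzero if a request for `pages` pages looks satisfiable. */
int enough_memory(const struct mem_snapshot *s, unsigned long pages)
{
    unsigned long free;

    /* Overcommit enabled: always claim the memory is available. */
    if (s->overcommit)
        return 1;

    free  = s->page_cache;
    free += s->free_pages;
    free += s->swap_free;
    free += s->swap_cache;

    /* Convert the slab objects from bytes to whole pages. */
    free += (s->unused_dentries * s->dentry_size) >> s->page_shift;
    free += (s->unused_inodes * s->inode_size) >> s->page_shift;

    return free > pages;
}
```

Note that the comparison is strict, so a request for exactly the estimated number of pages is refused.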

M.2  Detecting and Recovering from OOM

M.2.1  Function: out_of_memory

Source: mm/oom_kill.c

202 void out_of_memory(void)
203 {
204         static unsigned long first, last, count, lastkill;
205         unsigned long now, since;
206 
210         if (nr_swap_pages > 0)
211                 return;
212 
213         now = jiffies;
214         since = now - last;
215         last = now;
216 
221         last = now;
222         if (since > 5*HZ)
223                 goto reset;
224 
229         since = now - first;
230         if (since < HZ)
231                 return;
232 
237         if (++count < 10)
238                 return;
239 
245         since = now - lastkill;
246         if (since < HZ*5)
247                 return;
248 
252         lastkill = now;
253         oom_kill();
254 
255 reset:
256         first = now;
257         count = 0;
258 }
210-211  If there are available swap slots, the system is not OOM
213-215  Record the current time in jiffies and determine how long it has been since this function was last called
222-223  If it has been more than 5 seconds since this function was last called, then reset the timer and exit the function
229-231  If less than a second has passed since the timer was last reset (recorded in first), then exit the function. It is possible that IO is in progress which will complete soon
237-238  If fewer than 10 calls have reached this point since the timer was last reset, then the system is not yet OOM
245-247  If a process has been killed within the last 5 seconds, then exit the function as the dying process is likely to free memory
252-253  Ok, the system really is OOM, so record the time of the kill and call oom_kill() (See Section M.2.2) to select a process to kill
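
The timing heuristic is easier to follow when the clock is passed in explicitly. The sketch below models it in user space under stated assumptions: struct oom_state, oom_check() and TICKS_PER_SEC are hypothetical names, with TICKS_PER_SEC standing in for HZ, and the function returns 1 at the point where the kernel would call oom_kill().

```c
/* Hypothetical stand-in for HZ (timer ticks per second). */
#define TICKS_PER_SEC 100UL

/* The four static variables of out_of_memory(), gathered in a struct. */
struct oom_state {
    unsigned long first, last, count, lastkill;
};

/* Returns 1 if a kill would be triggered at time `now`, otherwise 0. */
int oom_check(struct oom_state *st, unsigned long now, long swap_free)
{
    unsigned long since;
    int kill = 0;

    /* Free swap slots remain, so this is not OOM. */
    if (swap_free > 0)
        return 0;

    since = now - st->last;
    st->last = now;

    /* A long gap since the last failure: start a new window. */
    if (since > 5 * TICKS_PER_SEC)
        goto reset;

    /* The current window is less than a second old. */
    since = now - st->first;
    if (since < TICKS_PER_SEC)
        return 0;

    /* Too few failures counted in this window. */
    if (++st->count < 10)
        return 0;

    /* A kill happened recently; give the victim time to exit. */
    since = now - st->lastkill;
    if (since < 5 * TICKS_PER_SEC)
        return 0;

    st->lastkill = now;
    kill = 1;

reset:
    st->first = now;
    st->count = 0;
    return kill;
}
```

With a zeroed state, a first call at tick 1000 only opens a window. If calls then arrive every 10 ticks, the ones before tick 1100 return early because the window is under a second old, and the tenth counted call, at tick 1190, is the one that reports a kill.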

M.2.2  Function: oom_kill

Source: mm/oom_kill.c

This function first calls select_bad_process() to find a suitable process to kill. Once found, the task list is traversed and oom_kill_task() is called for the selected process and all its threads.

172 static void oom_kill(void)
173 {
174         struct task_struct *p, *q;
175 
176         read_lock(&tasklist_lock);
177         p = select_bad_process();
178 
179         /* Found nothing?!?! Either we hang forever, or we panic. */
180         if (p == NULL)
181                 panic("Out of memory and no killable processes...\n");
182 
183         /* kill all processes that share the ->mm (i.e. all threads) */
184         for_each_task(q) {
185                 if (q->mm == p->mm)
186                         oom_kill_task(q);
187         }
188         read_unlock(&tasklist_lock);
189 
190         /*
191          * Make kswapd go out of the way, so "p" has a good chance of
192          * killing itself before someone else gets the chance to ask
193          * for more memory.
194          */
195         yield();
196         return;
197 }
176  Acquire the task list lock for reading
177  Call select_bad_process() (See Section M.2.3) to find a suitable process to kill
180-181  If one could not be found, panic the system because otherwise the system will deadlock. In this case, it is better to panic and have a developer solve the bug than have a mysterious hang
184-187  Cycle through the task list and call oom_kill_task() (See Section M.2.5) for the selected process and all its threads. Remember that threads will all share the same mm_struct
188  Release the lock
195  Call yield() to allow the signals to be delivered and the processes to die. The comments indicate that kswapd will be the sleeper, but it is possible that a process in the direct-reclaim path will be executing this function too
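
The key point of the loop at lines 184-187 is that threads are identified by comparing mm pointers rather than PIDs. A minimal sketch of that idea follows, using a plain array in place of the task list. struct task, its killed flag and kill_sharing_mm() are hypothetical names for illustration and do not exist in the kernel.

```c
#include <stddef.h>

/* Hypothetical, much reduced stand-in for task_struct. */
struct task {
    int pid;
    const void *mm;   /* address space; shared between threads */
    int killed;       /* set where the kernel would call oom_kill_task() */
};

/* Marks every task sharing the victim's mm and returns how many were marked. */
int kill_sharing_mm(struct task *tasks, size_t n, const struct task *victim)
{
    size_t i;
    int marked = 0;

    for (i = 0; i < n; i++) {
        /* Same mm pointer means same address space, i.e. a thread of the victim. */
        if (tasks[i].mm == victim->mm) {
            tasks[i].killed = 1;
            marked++;
        }
    }
    return marked;
}
```

The victim itself is marked by the same comparison, so no special case is needed for it.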

M.2.3  Function: select_bad_process

Source: mm/oom_kill.c

This function is responsible for cycling through the entire task list and returning the process that scored highest with the badness() function.

121 static struct task_struct * select_bad_process(void)
122 {
123     int maxpoints = 0;
124     struct task_struct *p = NULL;
125     struct task_struct *chosen = NULL;
126 
127     for_each_task(p) {
128         if (p->pid) {
129             int points = badness(p);
130             if (points > maxpoints) {
131                 chosen = p;
132                 maxpoints = points;
133             }
134         }
135     }
136     return chosen;
137 }
127  Cycle through all tasks in the task list
128  Only score tasks with a non-zero PID. The system idle task has PID 0 and is skipped
129  Call badness() (See Section M.2.4) to score the process
130-133  If this is the highest score so far, record it
136  Return the task_struct which scored highest with badness(), or NULL if no task scored above 0
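
Two consequences of the loop are worth making explicit: a task scoring 0 can never be chosen because maxpoints starts at 0, and ties go to the task seen first because the comparison is strict. The following sketch shows the same scan over an array of precomputed scores. struct scored_task and select_worst() are hypothetical names used only for this illustration.

```c
#include <stddef.h>

/* Hypothetical pairing of a PID with a precomputed badness score. */
struct scored_task {
    int pid;
    int points;
};

/* Returns the highest-scoring task with a non-zero PID, or NULL if none scores above 0. */
const struct scored_task *select_worst(const struct scored_task *tasks, size_t n)
{
    const struct scored_task *chosen = NULL;
    int maxpoints = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* PID 0 is the idle task and is never a candidate. */
        if (tasks[i].pid == 0)
            continue;

        /* Strict comparison: the earliest task wins a tie. */
        if (tasks[i].points > maxpoints) {
            chosen = &tasks[i];
            maxpoints = tasks[i].points;
        }
    }
    return chosen;
}
```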

M.2.4  Function: badness

Source: mm/oom_kill.c

This calculates a score that determines how suitable the process is for killing. The scoring mechanism is explained in detail in Chapter 13.

 58 static int badness(struct task_struct *p)
 59 {
 60         int points, cpu_time, run_time;
 61 
 62         if (!p->mm)
 63             return 0;
 64
 65         if (p->flags & PF_MEMDIE)
 66             return 0;
 67
 71         points = p->mm->total_vm;
 72 
 79         cpu_time = (p->times.tms_utime + p->times.tms_stime) 
                                                        >> (SHIFT_HZ + 3);
 80         run_time = (jiffies - p->start_time) >> (SHIFT_HZ + 10);
 81 
 82         points /= int_sqrt(cpu_time);
 83         points /= int_sqrt(int_sqrt(run_time));
 84 
 89         if (p->nice > 0)
 90                 points *= 2;
 91 
 96         if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_ADMIN) ||
 97                                 p->uid == 0 || p->euid == 0)
 98                 points /= 4;
 99 
106         if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
107                 points /= 4;
108 #ifdef DEBUG
109         printk(KERN_DEBUG "OOMkill: task %d (%s) got %d points\n",
110         p->pid, p->comm, points);
111 #endif
112         return points;
113 }
62-63  If there is no mm, return 0 as this is a kernel thread
65-66  If the process has already been marked by the OOM killer as exiting, return 0 as there is no point trying to kill it multiple times
71  The total VM used by the process is the base starting point
79-80  cpu_time is the total CPU time (user plus system) consumed by the process, in approximate seconds. run_time is the time since the process started, in approximate minutes. The units are approximate because the conversion is done with a shift. Comments indicate that there is no basis for these units other than that they work well in practice
82  Divide the points by the integer square root of cpu_time
83  Divide the points by the integer fourth root of run_time (the square root is taken twice)
89-90  If the process has been niced to a lower priority, double its points as it is likely to be an unimportant process
96-98  On the other hand, if the process has superuser privileges or has the CAP_SYS_ADMIN capability, it is likely to be a system process so divide the points by 4
106-107  If the process has direct access to hardware, then divide the points by 4. Killing these processes forcibly could potentially leave hardware in an inconsistent state. For example, forcibly killing X is never a good idea
112  Return the score
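
The scoring arithmetic can be tried out in user space. The sketch below is an illustration under assumptions rather than the kernel implementation: struct proc_info and score() are hypothetical, the time fields are taken as already scaled, and isqrt_min1() is a simple exact integer square root written for this example, not the kernel's int_sqrt(). It returns at least 1 so that it is always safe to use as a divisor.

```c
/* Exact integer square root by linear search; returns at least 1. */
static unsigned int isqrt_min1(unsigned int x)
{
    unsigned int r = 0;

    /* 64-bit arithmetic so that (r + 1)^2 cannot overflow. */
    while ((unsigned long long)(r + 1) * (r + 1) <= x)
        r++;
    return r ? r : 1;
}

/* Hypothetical summary of the task fields that feed the score. */
struct proc_info {
    unsigned int total_vm;  /* pages of virtual memory */
    unsigned int cpu_time;  /* CPU time used, already scaled */
    unsigned int run_time;  /* time since start, already scaled */
    int nice;               /* nice value */
    int is_admin;           /* superuser or CAP_SYS_ADMIN */
    int raw_io;             /* CAP_SYS_RAWIO */
};

int score(const struct proc_info *p)
{
    int points = (int)p->total_vm;

    /* Long-running, CPU-heavy processes are penalised less. */
    points /= (int)isqrt_min1(p->cpu_time);
    points /= (int)isqrt_min1(isqrt_min1(p->run_time)); /* fourth root */

    if (p->nice > 0)
        points *= 2;   /* niced: likely unimportant */
    if (p->is_admin)
        points /= 4;   /* likely a system process */
    if (p->raw_io)
        points /= 4;   /* has direct hardware access */

    return points;
}
```

For example, a process with 16000 pages of VM, a cpu_time of 16 and a run_time of 81 scores 16000 / 4 / 3 = 1333. Making it a niced process doubles that to 2666.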

M.2.5  Function: oom_kill_task

Source: mm/oom_kill.c

This function is responsible for sending the appropriate kill signals to the selected task.

144 void oom_kill_task(struct task_struct *p)
145 {
146         printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", 
                                                         p->pid, p->comm);
147 
148         /*
149          * We give our sacrificial lamb high priority and access to
150          * all the memory it needs. That way it should be able to
151          * exit() and clear out its resources quickly...
152          */
153         p->counter = 5 * HZ;
154         p->flags |= PF_MEMALLOC | PF_MEMDIE;
155 
156         /* This process has hardware access, be more careful. */
157         if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO)) {
158                 force_sig(SIGTERM, p);
159         } else {
160                 force_sig(SIGKILL, p);
161         }
162 }
146  Print an informational message about the process being killed
153  This gives the dying process lots of time on the CPU so it can kill itself off quickly
154  These flags tell the allocator to give favourable treatment to the process if it requires more pages before cleaning itself up
157-158  If the process can directly access hardware, send it the SIGTERM signal to give it a chance to exit cleanly
160  Otherwise, send it the SIGKILL signal to force the process to be killed

