
Appendix J  Page Frame Reclamation

J.1  Page Cache Operations

This section addresses how pages are added and removed from the page cache and LRU lists, both of which are heavily intertwined.

J.1.1  Adding Pages to the Page Cache

J.1.1.1  Function: add_to_page_cache

Source: mm/filemap.c

Acquire the lock protecting the page cache before calling __add_to_page_cache(), which will add the page to the page hash table and inode queue so that pages belonging to files can be found quickly.

667 void add_to_page_cache(struct page * page, 
                     struct address_space * mapping,
                      unsigned long offset)
668 {
669       spin_lock(&pagecache_lock);
670       __add_to_page_cache(page, mapping, 
                        offset, page_hash(mapping, offset));
671       spin_unlock(&pagecache_lock);
672       lru_cache_add(page);
673 }
669Acquire the lock protecting the page hash and inode queues
670Call __add_to_page_cache() which performs the “real” work. page_hash() hashes into the page hash table based on the mapping and the offset within the file. If a page is already hashed to that slot, there was a collision and the colliding pages are chained with the page→next_hash and page→pprev_hash fields
671Release the lock protecting the hash and inode queue
672Add the page to the LRU lists with lru_cache_add() (See Section J.2.1.1)

J.1.1.2  Function: add_to_page_cache_unique

Source: mm/filemap.c

In many respects, this function is very similar to add_to_page_cache(). The principal difference is that this function will check the page cache, with the pagecache_lock spinlock held, before adding the page to the cache. It is for callers that may race with another process inserting a page into the cache, such as add_to_swap_cache() (See Section K.2.1.1). A usage sketch follows the commentary below.

675 int add_to_page_cache_unique(struct page * page,
676         struct address_space *mapping, unsigned long offset,
677         struct page **hash)
678 {
679     int err;
680     struct page *alias;
681 
682     spin_lock(&pagecache_lock);
683     alias = __find_page_nolock(mapping, offset, *hash);
684 
685     err = 1;
686     if (!alias) {
687         __add_to_page_cache(page,mapping,offset,hash);
688         err = 0;
689     }
690 
691     spin_unlock(&pagecache_lock);
692     if (!err)
693         lru_cache_add(page);
694     return err;
695 }
682Acquire the pagecache_lock for examining the cache
683Check if the page already exists in the cache with __find_page_nolock() (See Section J.1.4.3)
686-689If the page does not exist in the cache, add it with __add_to_page_cache() (See Section J.1.1.3)
691Release the pagecache_lock
692-693If the page did not already exist in the page cache, add it to the LRU lists with lru_cache_add()(See Section J.2.1.1)
694Return 0 if this call entered the page into the page cache and 1 if it already existed
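
As a usage sketch, a caller looks up the hash slot with page_hash() and treats a non-zero return as having lost the race. This is modelled loosely on how add_to_swap_cache() in mm/swap_state.c uses the function; cache_page_once() below is a hypothetical helper for illustration, not kernel code.

/* Sketch only: illustrates the calling convention of
 * add_to_page_cache_unique(), not an actual kernel function. */
static int cache_page_once(struct page *page,
                           struct address_space *mapping,
                           unsigned long offset)
{
        struct page **hash = page_hash(mapping, offset);

        /* 0 means we inserted the page; 1 means another process won the race */
        if (add_to_page_cache_unique(page, mapping, offset, hash) != 0)
                return -EEXIST;

        return 0;       /* page is now in the page cache and locked */
}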

J.1.1.3  Function: __add_to_page_cache

Source: mm/filemap.c

Clear all page flags, lock the page, take a reference to it and add it to the inode and hash queues.

653 static inline void __add_to_page_cache(struct page * page,
654       struct address_space *mapping, unsigned long offset,
655       struct page **hash)
656 {
657       unsigned long flags;
658 
659       flags = page->flags & ~(1 << PG_uptodate | 
                            1 << PG_error | 1 << PG_dirty | 
                            1 << PG_referenced | 1 << PG_arch_1 | 
                            1 << PG_checked);
660       page->flags = flags | (1 << PG_locked);
661       page_cache_get(page);
662       page->index = offset;
663       add_page_to_inode_queue(mapping, page);
664       add_page_to_hash_queue(page, hash);
665 }
659Clear all page flags
660Lock the page
661Take a reference to the page in case it gets freed prematurely
662Update the index so it is known what file offset this page represents
663Add the page to the inode queue with add_page_to_inode_queue() (See Section J.1.1.4). This links the page via page→list to the clean_pages list in the address_space and points page→mapping to the same address_space
664Add it to the page hash with add_page_to_hash_queue() (See Section J.1.1.5). The hash bucket was returned by page_hash() in the parent function. The page hash allows page cache pages to be found quickly without having to linearly search the inode queue

J.1.1.4  Function: add_page_to_inode_queue

Source: mm/filemap.c

 85 static inline void add_page_to_inode_queue(
                       struct address_space *mapping, struct page * page)
 86 {
 87     struct list_head *head = &mapping->clean_pages;
 88 
 89     mapping->nrpages++;
 90     list_add(&page->list, head);
 91     page->mapping = mapping;
 92 }
87When this function is called, the page is clean, so mapping→clean_pages is the list of interest
89Increment the number of pages that belong to this mapping
90Add the page to the clean list
91Set the page→mapping field

J.1.1.5  Function: add_page_to_hash_queue

Source: mm/filemap.c

This adds the page to the top of the hash bucket headed by p. Bear in mind that p points to an element of the array page_hash_table.

 71 static void add_page_to_hash_queue(struct page * page, 
                                       struct page **p)
 72 {
 73     struct page *next = *p;
 74 
 75     *p = page;
 76     page->next_hash = next;
 77     page->pprev_hash = p;
 78     if (next)
 79         next->pprev_hash = &page->next_hash;
 80     if (page->buffers)
 81         PAGE_BUG(page);
 82     atomic_inc(&page_cache_size);
 83 }
73Record the current head of the hash bucket in next
75Update the head of the hash bucket to be page
76Point page→next_hash to the old head of the hash bucket
77Point page→pprev_hash at the array element in page_hash_table
78-79If there was an old head, point its pprev_hash field back at the new page's next_hash field, completing the insertion of the page into the linked list
80-81Check that the page entered has no associated buffers
82Increment page_cache_size which is the size of the page cache

J.1.2  Deleting Pages from the Page Cache

J.1.2.1  Function: remove_inode_page

Source: mm/filemap.c

130 void remove_inode_page(struct page *page)
131 {
132     if (!PageLocked(page))
133         PAGE_BUG(page);
134 
135     spin_lock(&pagecache_lock);
136     __remove_inode_page(page);
137     spin_unlock(&pagecache_lock);
138 }
132-133If the page is not locked, it is a bug
135Acquire the lock protecting the page cache
136__remove_inode_page() (See Section J.1.2.2) is the top-level function for when the pagecache lock is held
137Release the pagecache lock

J.1.2.2  Function: __remove_inode_page

Source: mm/filemap.c

This is the top-level function for removing a page from the page cache for callers with the pagecache_lock spinlock held. Callers that do not have this lock acquired should call remove_inode_page().

124 void __remove_inode_page(struct page *page)
125 {
126         remove_page_from_inode_queue(page);
127         remove_page_from_hash_queue(page);
128 }
126remove_page_from_inode_queue() (See Section J.1.2.3) removes the page from its address_space at page→mapping
127remove_page_from_hash_queue() removes the page from the hash table in page_hash_table

J.1.2.3  Function: remove_page_from_inode_queue

Source: mm/filemap.c

 94 static inline void remove_page_from_inode_queue(struct page * page)
 95 {
 96     struct address_space * mapping = page->mapping;
 97 
 98     if (mapping->a_ops->removepage)
 99         mapping->a_ops->removepage(page);
100     list_del(&page->list);
101     page->mapping = NULL;
102     wmb();
103     mapping->nrpages--;
104 }
96Get the associated address_space for this page
98-99Call the filesystem specific removepage() function if one is available
100Delete the page from whatever list it belongs to in the mapping such as the clean_pages list in most cases or the dirty_pages in rarer cases
101Set the pagemapping to NULL as it is no longer backed by any address_space
103Decrement the number of pages in the mapping

J.1.2.4  Function: remove_page_from_hash_queue

Source: mm/filemap.c

107 static inline void remove_page_from_hash_queue(struct page * page)
108 {
109     struct page *next = page->next_hash;
110     struct page **pprev = page->pprev_hash;
111 
112     if (next)
113         next->pprev_hash = pprev;
114     *pprev = next;
115     page->pprev_hash = NULL;
116     atomic_dec(&page_cache_size);
117 }
109Get the next page after the page being removed
110Get pprev, the pointer that points back at the page being removed (either the previous page's next_hash field or the slot in page_hash_table). When the function completes, *pprev will point to next
112If this is not the end of the list, update next→pprev_hash to point to pprev
114Similarly, point pprev forward to next. page is now unlinked
116Decrement the size of the page cache

J.1.3  Acquiring/Releasing Page Cache Pages

J.1.3.1  Function: page_cache_get

Source: include/linux/pagemap.h

 31 #define page_cache_get(x)       get_page(x)
31Simply call get_page(), which uses atomic_inc() to increment the page reference count

J.1.3.2  Function: page_cache_release

Source: include/linux/pagemap.h

 32 #define page_cache_release(x)   __free_page(x)
32Call __free_page() which decrements the page count. If the count reaches 0, the page will be freed
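
Both wrappers reduce to simple operations on the page reference count. The following is an illustrative paraphrase of the 2.4-era definitions in include/linux/mm.h, not a verbatim listing.

/* Illustrative paraphrase: page_cache_get() increments page->count and
 * page_cache_release() decrements it, freeing the page when the count
 * reaches zero. */
#define get_page(p)         atomic_inc(&(p)->count)
#define __free_page(page)   __free_pages((page), 0)    /* order-0 free */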

J.1.4  Searching the Page Cache

J.1.4.1  Function: find_get_page

Source: include/linux/pagemap.h

Top level macro for finding a page in the page cache. It simply looks up the page hash

 75 #define find_get_page(mapping, index) \
 76     __find_get_page(mapping, index, page_hash(mapping, index))
76page_hash() locates an entry in the page_hash_table based on the address_space and offset
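
For reference, the hash folds the mapping pointer and the offset together and masks the result down to an index into page_hash_table. The following sketch is modelled on 2.4's include/linux/pagemap.h and is illustrative; the exact constants may differ between versions.

/* Sketch of the page cache hash: fold the mapping pointer and offset
 * together, spread the high bits down and mask to a table index. */
static inline unsigned long _page_hashfn(struct address_space *mapping,
                                         unsigned long index)
{
#define i (((unsigned long) mapping) / \
           (sizeof(struct inode) & ~(sizeof(struct inode) - 1)))
#define s(x) ((x) + ((x) >> PAGE_HASH_BITS))
        return s(i + index) & (PAGE_HASH_SIZE - 1);
#undef i
#undef s
}

#define page_hash(mapping,index) \
        (page_hash_table + _page_hashfn(mapping, index))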

J.1.4.2  Function: __find_get_page

Source: mm/filemap.c

This function is responsible for finding a struct page given an entry in page_hash_table as a starting point.

931 struct page * __find_get_page(struct address_space *mapping,
932                 unsigned long offset, struct page **hash)
933 {
934     struct page *page;
935 
936     /*
937      * We scan the hash list read-only. Addition to and removal from
938      * the hash-list needs a held write-lock.
939      */
940     spin_lock(&pagecache_lock);
941     page = __find_page_nolock(mapping, offset, *hash);
942     if (page)
943         page_cache_get(page);
944     spin_unlock(&pagecache_lock);
945     return page;
946 }
940Acquire the read-only page cache lock
941Call the page cache traversal function which presumes a lock is held
942-943If the page was found, obtain a reference to it with page_cache_get() (See Section J.1.3.1) so it is not freed prematurely
944Release the page cache lock
945Return the page or NULL if not found

J.1.4.3  Function: __find_page_nolock

Source: mm/filemap.c

This function traverses the hash collision list looking for the page specified by the address_space and offset.

443 static inline struct page * __find_page_nolock(
                    struct address_space *mapping, 
                    unsigned long offset, 
                    struct page *page)
444 {
445     goto inside;
446 
447     for (;;) {
448         page = page->next_hash;
449 inside:
450         if (!page)
451             goto not_found;
452         if (page->mapping != mapping)
453             continue;
454         if (page->index == offset)
455             break;
456     }
457 
458 not_found:
459     return page;
460 }
445Begin by examining the first page in the list
450-451If the page is NULL, the right one could not be found so return NULL
452If the address_space does not match, move to the next page on the collision list
454If the offset matches, break out of the loop to return the page, else move on
448Move to the next page on the hash list
459Return the found page or NULL if not

J.1.4.4  Function: find_lock_page

Source: include/linux/pagemap.h

This is the top level function for searching the page cache for a page and having it returned in a locked state.

 84 #define find_lock_page(mapping, index) \
 85     __find_lock_page(mapping, index, page_hash(mapping, index))
85Call the core function __find_lock_page() after looking up what hash bucket this page is using with page_hash()

J.1.4.5  Function: __find_lock_page

Source: mm/filemap.c

This function acquires the pagecache_lock spinlock before calling the core function __find_lock_page_helper() to locate the page and lock it.

1005 struct page * __find_lock_page (struct address_space *mapping,
1006                     unsigned long offset, struct page **hash)
1007 {
1008    struct page *page;
1009 
1010    spin_lock(&pagecache_lock);
1011    page = __find_lock_page_helper(mapping, offset, *hash);
1012    spin_unlock(&pagecache_lock);
1013    return page;
1014 }
1010Acquire the pagecache_lock spinlock
1011Call __find_lock_page_helper() which will search the page cache and lock the page if it is found
1012Release the pagecache_lock spinlock
1013If the page was found, return it in a locked state, otherwise return NULL

J.1.4.6  Function: __find_lock_page_helper

Source: mm/filemap.c

This function uses __find_page_nolock() to locate a page within the page cache. If it is found, the page will be locked for returning to the caller.

972 static struct page * __find_lock_page_helper(
                               struct address_space *mapping,
973                            unsigned long offset, struct page *hash)
974 {
975     struct page *page;
976 
977     /*
978      * We scan the hash list read-only. Addition to and removal from
979      * the hash-list needs a held write-lock.
980      */
981 repeat:
982     page = __find_page_nolock(mapping, offset, hash);
983     if (page) {
984         page_cache_get(page);
985         if (TryLockPage(page)) {
986             spin_unlock(&pagecache_lock);
987             lock_page(page);
988             spin_lock(&pagecache_lock);
989 
990             /* Has the page been re-allocated while we slept?  */
991             if (page->mapping != mapping || page->index != offset) {
992                 UnlockPage(page);
993                 page_cache_release(page);
994                 goto repeat;
995             }
996         }
997     }
998     return page;
999 }
982Use __find_page_nolock()(See Section J.1.4.3) to locate the page in the page cache
983-984If the page was found, take a reference to it
985Try to lock the page with TryLockPage(). This macro is just a wrapper around test_and_set_bit() which attempts to set the PG_locked bit in page→flags (a sketch of the locking macros follows this function's commentary)
986-988If the lock failed, release the pagecache_lock spinlock and call lock_page() (See Section B.2.1.1) to lock the page. It is likely this function will sleep until the page lock is acquired. When the page is locked, acquire the pagecache_lock spinlock again
991If the mapping and index no longer match, it means that this page was reclaimed while we were asleep. The page is unlocked and the reference dropped before searching the page cache again
998Return the page in a locked state, or NULL if it was not in the page cache
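
The locking macros used at line 985 are thin wrappers around atomic bit operations on page→flags. The lines below are an illustrative paraphrase of the 2.4-era definitions in include/linux/mm.h.

/* Illustrative paraphrase: TryLockPage() attempts to take the page lock
 * without sleeping by atomically setting PG_locked. It returns non-zero
 * if the bit was already set, i.e. the page was already locked. */
#define TryLockPage(page)   test_and_set_bit(PG_locked, &(page)->flags)
#define PageLocked(page)    test_bit(PG_locked, &(page)->flags)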

J.2  LRU List Operations

J.2.1  Adding Pages to the LRU Lists

J.2.1.1  Function: lru_cache_add

Source: mm/swap.c

Adds a page to the LRU inactive_list.

 58 void lru_cache_add(struct page * page)
 59 {
 60       if (!PageLRU(page)) {
 61             spin_lock(&pagemap_lru_lock);
 62             if (!TestSetPageLRU(page))
 63                   add_page_to_inactive_list(page);
 64             spin_unlock(&pagemap_lru_lock);
 65       }
 66 }
60If the page is not already part of the LRU lists, add it
61Acquire the LRU lock
62-63Test and set the LRU bit. If it was clear, call add_page_to_inactive_list()
64Release the LRU lock

J.2.1.2  Function: add_page_to_active_list

Source: include/linux/swap.h

Adds the page to the active_list

178 #define add_page_to_active_list(page)         \
179 do {                                          \
180       DEBUG_LRU_PAGE(page);                   \
181       SetPageActive(page);                    \
182       list_add(&(page)->lru, &active_list);   \
183       nr_active_pages++;                      \
184 } while (0)
180The DEBUG_LRU_PAGE() macro will call BUG() if the page is already on the LRU list or is marked as being active
181Update the flags of the page to show it is active
182Add the page to the active_list
183Update the count of the number of pages in the active_list

J.2.1.3  Function: add_page_to_inactive_list

Source: include/linux/swap.h

Adds the page to the inactive_list

186 #define add_page_to_inactive_list(page)       \
187 do {                                          \
188       DEBUG_LRU_PAGE(page);                   \
189       list_add(&(page)->lru, &inactive_list); \
190       nr_inactive_pages++;                    \
191 } while (0)
188The DEBUG_LRU_PAGE() macro will call BUG() if the page is already on the LRU list or is marked as being active
189Add the page to the inactive_list
190Update the count of the number of inactive pages on the list

J.2.2  Deleting Pages from the LRU Lists

J.2.2.1  Function: lru_cache_del

Source: mm/swap.c

Acquire the lock protecting the LRU lists before calling __lru_cache_del().

 90 void lru_cache_del(struct page * page)
 91 {
 92       spin_lock(&pagemap_lru_lock);
 93       __lru_cache_del(page);
 94       spin_unlock(&pagemap_lru_lock);
 95 }
92Acquire the LRU lock
93__lru_cache_del() does the “real” work of removing the page from the LRU lists
94Release the LRU lock

J.2.2.2  Function: __lru_cache_del

Source: mm/swap.c

Select which function is needed to remove the page from the LRU list.

 75 void __lru_cache_del(struct page * page)
 76 {
 77       if (TestClearPageLRU(page)) {
 78             if (PageActive(page)) {
 79                   del_page_from_active_list(page);
 80             } else {
 81                   del_page_from_inactive_list(page);
 82             }
 83       }
 84 }
77Test and clear the flag indicating the page is in the LRU
78-82If the page is on the LRU, select the appropriate removal function
78-79If the page is active, then call del_page_from_active_list() else delete from the inactive list with del_page_from_inactive_list()

J.2.2.3  Function: del_page_from_active_list

Source: include/linux/swap.h

Remove the page from the active_list

193 #define del_page_from_active_list(page)   \
194 do {                                      \
195       list_del(&(page)->lru);             \
196       ClearPageActive(page);              \
197       nr_active_pages--;                  \
198 } while (0)
195Delete the page from the list
196Clear the flag indicating it is part of active_list. The flag indicating it is part of the LRU list has already been cleared by __lru_cache_del()
197Update the count of the number of pages in the active_list

J.2.2.4  Function: del_page_from_inactive_list

Source: include/linux/swap.h

200 #define del_page_from_inactive_list(page) \
201 do {                                      \
202       list_del(&(page)->lru);             \
203       nr_inactive_pages--;                \
204 } while (0)
202Remove the page from the LRU list
203Update the count of the number of pages in the inactive_list

J.2.3  Activating Pages

J.2.3.1  Function: mark_page_accessed

Source: mm/filemap.c

This marks that a page has been referenced. If the page is already on the active_list or the referenced flag is clear, the referenced flag will be simply set. If it is in the inactive_list and the referenced flag has been set, activate_page() will be called to move the page to the top of the active_list.

1332 void mark_page_accessed(struct page *page)
1333 {
1334       if (!PageActive(page) && PageReferenced(page)) {
1335             activate_page(page);
1336             ClearPageReferenced(page);
1337       } else
1338             SetPageReferenced(page);
1339 }
1334-1337If the page is on the inactive_list (!PageActive()) and has been referenced recently (PageReferenced()), activate_page() is called to move it to the active_list
1338Otherwise, mark the page as referenced

J.2.3.2  Function: activate_page

Source: mm/swap.c

Acquire the LRU lock before calling activate_page_nolock() which moves the page from the inactive_list to the active_list.

 47 void activate_page(struct page * page)
 48 {
 49       spin_lock(&pagemap_lru_lock);
 50       activate_page_nolock(page);
 51       spin_unlock(&pagemap_lru_lock);
 52 }
49Acquire the LRU lock
50Call the main work function
51Release the LRU lock

J.2.3.3  Function: activate_page_nolock

Source: mm/swap.c

Move the page from the inactive_list to the active_list

 39 static inline void activate_page_nolock(struct page * page)
 40 {
 41       if (PageLRU(page) && !PageActive(page)) {
 42             del_page_from_inactive_list(page);
 43             add_page_to_active_list(page);
 44       }
 45 }
41Make sure the page is on the LRU and not already on the active_list
42-43Delete the page from the inactive_list and add to the active_list

J.3  Refilling inactive_list

This section covers how pages are moved from the active lists to the inactive lists.

J.3.1  Function: refill_inactive

Source: mm/vmscan.c

Move nr_pages pages from the active_list to the inactive_list. The parameter nr_pages is calculated by shrink_caches() and is the number of pages to move so that the active_list stays at roughly two-thirds the size of the page cache.

533 static void refill_inactive(int nr_pages)
534 {
535       struct list_head * entry;
536 
537       spin_lock(&pagemap_lru_lock);
538       entry = active_list.prev;
539       while (nr_pages && entry != &active_list) {
540             struct page * page;
541 
542             page = list_entry(entry, struct page, lru);
543             entry = entry->prev;
544             if (PageTestandClearReferenced(page)) {
545                   list_del(&page->lru);
546                   list_add(&page->lru, &active_list);
547                   continue;
548             }
549 
550             nr_pages--;
551 
552             del_page_from_active_list(page);
553             add_page_to_inactive_list(page);
554             SetPageReferenced(page);
555       }
556       spin_unlock(&pagemap_lru_lock);
557 }
537Acquire the lock protecting the LRU list
538Take the last entry in the active_list
539-555Keep moving pages until nr_pages pages have been moved or the active_list is empty
542Get the struct page for this entry
544-548Test and clear the referenced flag. If it has been referenced, then it is moved back to the top of the active_list
550-553Move one page from the active_list to the inactive_list
554Mark it referenced so that if it is referenced again soon, it will be promoted back to the active_list without requiring a second reference
556Release the lock protecting the LRU list

J.4  Reclaiming Pages from the LRU Lists

This section covers how a page is reclaimed once it has been selected for pageout.

J.4.1  Function: shrink_cache

Source: mm/vmscan.c

338 static int shrink_cache(int nr_pages, zone_t * classzone, 
                            unsigned int gfp_mask, int priority)
339 {
340     struct list_head * entry;
341     int max_scan = nr_inactive_pages / priority;
342     int max_mapped = min((nr_pages << (10 - priority)), 
                             max_scan / 10);
343 
344     spin_lock(&pagemap_lru_lock);
345     while (--max_scan >= 0 && 
               (entry = inactive_list.prev) != &inactive_list) {
338The parameters are as follows:
nr_pages is the number of pages to swap out
classzone is the zone we are interested in swapping pages out for. Pages not belonging to this zone are skipped
gfp_mask is the GFP mask determining what actions may be taken, such as whether filesystem operations may be performed
priority is the priority of the function. It starts at DEF_PRIORITY (6) and decreases to the highest priority of 1
341The maximum number of pages to scan is the number of pages in the inactive_list divided by the priority. At lowest priority, 1/6th of the list may be scanned. At highest priority, the full list may be scanned
342The maximum number of process-mapped pages allowed is either one tenth of the max_scan value or nr_pages shifted left by (10 - priority) bits (i.e. nr_pages * 2^(10-priority)), whichever is smaller. For example, at DEF_PRIORITY with nr_pages = 32 and 3,000 pages on the inactive_list, max_scan is 500 and max_mapped is min(32 << 4, 50) = 50. If this number of mapped pages is found, whole processes will be swapped out
344Lock the LRU list
345Keep scanning until max_scan pages have been scanned or the inactive_list is empty
346         struct page * page;
347 
348         if (unlikely(current->need_resched)) {
349             spin_unlock(&pagemap_lru_lock);
350             __set_current_state(TASK_RUNNING);
351             schedule();
352             spin_lock(&pagemap_lru_lock);
353             continue;
354         }
355 
348-354Reschedule if the quanta has been used up
349Free the LRU lock as we are about to sleep
350Show we are still running
351Call schedule() so another process can be context switched in
352Re-acquire the LRU lock
353Reiterate through the loop and take an entry from the inactive_list again. As we slept, another process could have changed what entries are on the list, which is why another entry has to be taken with the spinlock held
356         page = list_entry(entry, struct page, lru);
357 
358         BUG_ON(!PageLRU(page));
359         BUG_ON(PageActive(page));
360 
361         list_del(entry);
362         list_add(entry, &inactive_list);
363 
364         /*
365          * Zero page counts can happen because we unlink the pages
366          * _after_ decrementing the usage count..
367          */
368         if (unlikely(!page_count(page)))
369             continue;
370 
371         if (!memclass(page_zone(page), classzone))
372             continue;
373 
374         /* Racy check to avoid trylocking when not worthwhile */
375         if (!page->buffers && (page_count(page) != 1 || !page->mapping))
376             goto page_mapped;
356Get the struct page for this entry in the LRU
358-359It is a bug if the page either belongs to the active_list or is currently marked as active
361-362Move the page to the top of the inactive_list so that if the page is not freed, we can just continue knowing that it will be simply examined later
368-369If the page count has already reached 0, skip over it. In __free_pages(), the page count is dropped with put_page_testzero() before __free_pages_ok() is called to free it. This leaves a window where a page with a zero count is left on the LRU before it is freed. There is a special case to trap this at the beginning of __free_pages_ok()
371-372Skip over this page if it belongs to a zone we are not currently interested in
375-376If the page is mapped by a process, then goto page_mapped where the max_mapped is decremented and next page examined. If max_mapped reaches 0, process pages will be swapped out
382         if (unlikely(TryLockPage(page))) {
383             if (PageLaunder(page) && (gfp_mask & __GFP_FS)) {
384                 page_cache_get(page);
385                 spin_unlock(&pagemap_lru_lock);
386                 wait_on_page(page);
387                 page_cache_release(page);
388                 spin_lock(&pagemap_lru_lock);
389             }
390             continue;
391         }

Page is locked and the launder bit is set. In this case, it is the second time this page has been found dirty. The first time it was scheduled for IO and placed back on the list. This time we wait until the IO is complete and then try to free the page.

382-383If we could not lock the page, the PG_launder bit is set and the GFP flags allow the caller to perform FS operations, then...
384Take a reference to the page so it does not disappear while we sleep
385Free the LRU lock
386Wait until the IO is complete
387Release the reference to the page. If it reaches 0, the page will be freed
388Re-acquire the LRU lock
390Move to the next page
392 
393         if (PageDirty(page) && 
                is_page_cache_freeable(page) && 
                page->mapping) {
394             /*
395              * It is not critical here to write it only if
396              * the page is unmapped beause any direct writer
397              * like O_DIRECT would set the PG_dirty bitflag
398              * on the phisical page after having successfully
399              * pinned it and after the I/O to the page is finished,
400              * so the direct writes to the page cannot get lost.
401              */
402             int (*writepage)(struct page *);
403 
404             writepage = page->mapping->a_ops->writepage;
405             if ((gfp_mask & __GFP_FS) && writepage) {
406                 ClearPageDirty(page);
407                 SetPageLaunder(page);
408                 page_cache_get(page);
409                 spin_unlock(&pagemap_lru_lock);
410 
411                 writepage(page);
412                 page_cache_release(page);
413 
414                 spin_lock(&pagemap_lru_lock);
415                 continue;
416             }
417         }

This handles the case where a page is dirty, is not mapped by any process, has no buffers and is backed by a file or device mapping. The page is cleaned and will be reclaimed by the previous block of code when the IO is complete.

393PageDirty() checks the PG_dirty bit, is_page_cache_freeable() will return true if the page is not mapped by any process and has no buffers (a sketch of this helper follows this block)
404Get a pointer to the necessary writepage() function for this mapping or device
405-416This block of code can only be executed if a writepage() function is available and the GFP flags allow file operations
406-407Clear the dirty bit and mark that the page is being laundered
408Take a reference to the page so it will not be freed unexpectedly
409Unlock the LRU list
411Call the filesystem-specific writepage() function which is taken from the address_space_operations belonging to page→mapping
412Release the reference to the page
414-415Re-acquire the LRU list lock and move to the next page
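
The is_page_cache_freeable() test used at line 393 is not listed in this appendix. As an illustrative sketch based on mm/vmscan.c in 2.4, it treats a page as freeable when the only remaining reference, apart from any buffers, is the page cache itself:

/* Illustrative sketch: the page cache holds one reference and
 * page->buffers, if present, effectively accounts for another, so the
 * page is freeable when nothing else holds a reference to it. */
static inline int is_page_cache_freeable(struct page *page)
{
        return page_count(page) - !!page->buffers == 1;
}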
424         if (page->buffers) {
425             spin_unlock(&pagemap_lru_lock);
426 
427             /* avoid to free a locked page */
428             page_cache_get(page);
429 
430             if (try_to_release_page(page, gfp_mask)) {
431                 if (!page->mapping) {
438                     spin_lock(&pagemap_lru_lock);
439                     UnlockPage(page);
440                     __lru_cache_del(page);
441 
442                     /* effectively free the page here */
443                     page_cache_release(page);
444 
445                     if (--nr_pages)
446                         continue;
447                     break;
448                 } else {
454                     page_cache_release(page);
455 
456                     spin_lock(&pagemap_lru_lock);
457                 }
458             } else {
459                 /* failed to drop the buffers so stop here */
460                 UnlockPage(page);
461                 page_cache_release(page);
462 
463                 spin_lock(&pagemap_lru_lock);
464                 continue;
465             }
466         }

Page has buffers associated with it that must be freed.

425Release the LRU lock as we may sleep
428Take a reference to the page
430Call try_to_release_page() which will attempt to release the buffers associated with the page. Returns 1 if it succeeds
431-447This is a case where an anonymous page that was in the swap cache has now had its buffers cleared and removed. As it was in the swap cache, it was placed on the LRU by add_to_swap_cache(), so remove it now from the LRU and drop the reference to the page. In swap_writepage(), it calls remove_exclusive_swap_page() which will delete the page from the swap cache when there are no more processes mapping the page. This block will free the page after the buffers have been written out if it was backed by a swap file
438-443Take the LRU list lock, unlock the page, delete it from the LRU lists and effectively free it with page_cache_release()
445-446Update nr_pages to show a page has been freed and move to the next page
447If nr_pages drops to 0, then exit the loop as the work is completed
449-456If the page does have an associated mapping then simply drop the reference to the page and re-acquire the LRU lock. More work will be performed later to remove the page from the page cache at line 499
459-464If the buffers could not be freed, then unlock the page, drop the reference to it, re-acquire the LRU lock and move to the next page
468         spin_lock(&pagecache_lock);
469 
470         /*
471          * this is the non-racy check for busy page.
472          */
473         if (!page->mapping || !is_page_cache_freeable(page)) {
474             spin_unlock(&pagecache_lock);
475             UnlockPage(page);
476 page_mapped:
477             if (--max_mapped >= 0)
478                 continue;
479 
484             spin_unlock(&pagemap_lru_lock);
485             swap_out(priority, gfp_mask, classzone);
486             return nr_pages;
487         }
468From this point on, pages in the swap cache are likely to be examined. The swap cache is protected by the pagecache_lock, which must now be held
473-487An anonymous page with no buffers is mapped by a process
474-475Release the page cache lock and the page
477-478Decrement max_mapped. If it has not reached 0, move to the next page
484-485Too many mapped pages have been found in the page cache. The LRU lock is released and swap_out() is called to begin swapping out whole processes
493         if (PageDirty(page)) {
494             spin_unlock(&pagecache_lock);
495             UnlockPage(page);
496             continue;
497         }
493-497The page has no references but could have been dirtied by the last process to free it if the dirty bit was set in the PTE. It is left in the page cache and will get laundered later. Once it has been cleaned, it can be safely deleted
498 
499         /* point of no return */
500         if (likely(!PageSwapCache(page))) {
501             __remove_inode_page(page);
502             spin_unlock(&pagecache_lock);
503         } else {
504             swp_entry_t swap;
505             swap.val = page->index;
506             __delete_from_swap_cache(page);
507             spin_unlock(&pagecache_lock);
508             swap_free(swap);
509         }
510 
511         __lru_cache_del(page);
512         UnlockPage(page);
513 
514         /* effectively free the page here */
515         page_cache_release(page);
516 
517         if (--nr_pages)
518             continue;
519         break;
520     }
500-503If the page does not belong to the swap cache, it is part of the inode queue so it is removed
504-508Remove it from the swap cache as there are no more references to it
511Delete it from the LRU lists with __lru_cache_del()
512Unlock the page
515Free the page
517-518Decrement nr_pages and move to the next page if it is not 0
519If it reaches 0, the work of the function is complete
521     spin_unlock(&pagemap_lru_lock);
522 
523     return nr_pages;
524 }
521-524Function exit. Free the LRU lock and return the number of pages left to free

J.5  Shrinking all caches

J.5.1  Function: shrink_caches

Source: mm/vmscan.c

The call graph for this function is shown in Figure 10.4.

560 static int shrink_caches(zone_t * classzone, int priority, 
                 unsigned int gfp_mask, int nr_pages)
561 {
562     int chunk_size = nr_pages;
563     unsigned long ratio;
564 
565     nr_pages -= kmem_cache_reap(gfp_mask);
566     if (nr_pages <= 0)
567         return 0;
568 
569     nr_pages = chunk_size;
570     /* try to keep the active list 2/3 of the size of the cache */
571     ratio = (unsigned long) nr_pages * 
            nr_active_pages / ((nr_inactive_pages + 1) * 2);
572     refill_inactive(ratio);
573 
574     nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
575     if (nr_pages <= 0)
576         return 0;
577 
578     shrink_dcache_memory(priority, gfp_mask);
579     shrink_icache_memory(priority, gfp_mask);
580 #ifdef CONFIG_QUOTA
581     shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
582 #endif
583 
584     return nr_pages;
585 }
560The parameters are as follows;
classzone is the zone that pages should be freed from
priority determines how much work will be done to free pages
gfp_mask determines what sort of actions may be taken
nr_pages is the number of pages remaining to be freed
565-567Ask the slab allocator to free up some pages with kmem_cache_reap() (See Section H.1.5.1). If enough are freed, the function returns otherwise nr_pages will be freed from other caches
571-572Move pages from the active_list to the inactive_list by calling refill_inactive() (See Section J.3.1). The number of pages moved depends on how many pages need to be freed and on keeping the active_list about two-thirds the size of the page cache. For example, with nr_pages = 32, 2,000 active pages and 999 inactive pages, ratio = 32 * 2000 / ((999 + 1) * 2) = 32
574-575Shrink the page cache, if enough pages are freed, return
578-582Shrink the dcache, icache and dqcache. These are small objects in themselves but the cascading effect frees up a lot of disk buffers
584Return the number of pages remaining to be freed

J.5.2  Function: try_to_free_pages

Source: mm/vmscan.c

This function cycles through all pgdats and tries to balance the preferred allocation zone (usually ZONE_NORMAL) for each of them. This function is only called from one place, buffer.c:free_more_memory() when the buffer manager fails to create new buffers or grow existing ones. It calls try_to_free_pages() with GFP_NOIO as the gfp_mask.

This results in the first zone in pg_data_t→node_zonelists having pages freed so that buffers can grow. This array is the preferred order of zones to allocate from and usually will begin with ZONE_NORMAL, which is required by the buffer manager. On NUMA architectures, some nodes may have ZONE_DMA as the preferred zone if the memory bank is dedicated to IO devices, and UML also uses only this zone. As the buffer manager is restricted in the zones it uses, there is no point balancing other zones.

607 int try_to_free_pages(unsigned int gfp_mask)
608 {
609     pg_data_t *pgdat;
610     zonelist_t *zonelist;
611     unsigned long pf_free_pages;
612     int error = 0;
613 
614     pf_free_pages = current->flags & PF_FREE_PAGES;
615     current->flags &= ~PF_FREE_PAGES;
616 
617     for_each_pgdat(pgdat) {
618         zonelist = pgdat->node_zonelists + 
                 (gfp_mask & GFP_ZONEMASK);
619         error |= try_to_free_pages_zone(
                    zonelist->zones[0], gfp_mask);
620     }
621 
622     current->flags |= pf_free_pages;
623     return error;
624 }
614-615This clears the PF_FREE_PAGES flag if it is set so that pages freed by the process will be returned to the global pool rather than reserved for the process itself
617-620Cycle through all nodes and call try_to_free_pages_zone() for the preferred zone in each node
618This function is only called with GFP_NOIO as a parameter. When ANDed with GFP_ZONEMASK, it will always result in 0
622-623Restore the process flags and return the result

J.5.3  Function: try_to_free_pages_zone

Source: mm/vmscan.c

Try to free SWAP_CLUSTER_MAX pages from the requested zone. As well as being used by kswapd, this function is also the entry point for the buddy allocator's direct-reclaim path.

587 int try_to_free_pages_zone(zone_t *classzone, 
                               unsigned int gfp_mask)
588 {
589     int priority = DEF_PRIORITY;
590     int nr_pages = SWAP_CLUSTER_MAX;
591 
592     gfp_mask = pf_gfp_mask(gfp_mask);
593     do {
594         nr_pages = shrink_caches(classzone, priority, 
                         gfp_mask, nr_pages);
595         if (nr_pages <= 0)
596             return 1;
597     } while (--priority);
598 
599     /*
600      * Hmm.. Cache shrink failed - time to kill something?
601      * Mhwahahhaha! This is the part I really like. Giggle.
602      */
603     out_of_memory();
604     return 0;
605 }
589Start with the lowest priority. Statically defined to be 6
590Try and free SWAP_CLUSTER_MAX pages. Statically defined to be 32
592pf_gfp_mask() checks the PF_NOIO flag in the current process flags. If no IO can be performed, it ensures there are no incompatible flags in the GFP mask (a sketch of this helper follows this list)
593-597Starting with the lowest priority and increasing with each pass, call shrink_caches() until nr_pages has been freed
595-596If enough pages were freed, return indicating that the work is complete
603If enough pages could not be freed even at highest priority (where at worst the full inactive_list is scanned) then check to see if we are out of memory. If we are, then a process will be selected to be killed
604Return indicating that we failed to free enough pages
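
As an illustrative sketch of pf_gfp_mask(), paraphrased from 2.4's include/linux/mm.h, the IO-related GFP flags are stripped when the current task may not block on IO:

/* Illustrative paraphrase: if the task is forbidden from blocking on
 * IO (PF_NOIO), remove the IO-capable flags from the GFP mask. */
static inline unsigned int pf_gfp_mask(unsigned int gfp_mask)
{
        if (current->flags & PF_NOIO)
                gfp_mask &= ~(__GFP_IO | __GFP_HIGHIO | __GFP_FS);

        return gfp_mask;
}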

J.6  Swapping Out Process Pages

This section covers the path where too many process mapped pages have been found in the LRU lists. This path will start scanning whole processes and reclaiming the mapped pages.

J.6.1  Function: swap_out

Source: mm/vmscan.c

The call graph for this function is shown in Figure 10.5. This function linearly searches through every process's page tables trying to swap out SWAP_CLUSTER_MAX pages. The process it starts with is swap_mm and the starting address is mm→swap_address

296 static int swap_out(unsigned int priority, unsigned int gfp_mask, 
            zone_t * classzone)
297 {
298     int counter, nr_pages = SWAP_CLUSTER_MAX;
299     struct mm_struct *mm;
300 
301     counter = mmlist_nr;
302     do {
303         if (unlikely(current->need_resched)) {
304             __set_current_state(TASK_RUNNING);
305             schedule();
306         }
307 
308         spin_lock(&mmlist_lock);
309         mm = swap_mm;
310         while (mm->swap_address == TASK_SIZE || mm == &init_mm) {
311             mm->swap_address = 0;
312             mm = list_entry(mm->mmlist.next, 
                        struct mm_struct, mmlist);
313             if (mm == swap_mm)
314                 goto empty;
315             swap_mm = mm;
316         }
317 
318         /* Make sure the mm doesn't disappear 
             when we drop the lock.. */
319         atomic_inc(&mm->mm_users);
320         spin_unlock(&mmlist_lock);
321 
322         nr_pages = swap_out_mm(mm, nr_pages, &counter, classzone);
323 
324         mmput(mm);
325 
326         if (!nr_pages)
327             return 1;
328     } while (--counter >= 0);
329 
330     return 0;
331 
332 empty:
333     spin_unlock(&mmlist_lock);
334     return 0;
335 }
301Set the counter so the process list is only scanned once
303-306Reschedule if the quanta has been used up to prevent CPU hogging
308Acquire the lock protecting the mm list
309Start with the swap_mm. It is interesting that this is never checked to make sure it is valid. It is possible, albeit unlikely, that the process with the mm has exited since the last scan and the slab holding the mm_struct has been reclaimed during a cache shrink, making the pointer totally invalid. The lack of bug reports might be because the slab rarely gets reclaimed and would be difficult to trigger in reality
310-316Move to the next process if the swap_address has reached the TASK_SIZE or if the mm is the init_mm
311Start at the beginning of the process space
312Get the mm for this process
313-314If it is the same, there is no running processes that can be examined
315Record the swap_mm for the next pass
319Increase the reference count so that the mm does not get freed while we are scanning
320Release the mm lock
322Begin scanning the mm with swap_out_mm()(See Section J.6.2)
324Drop the reference to the mm
326-327If the required number of pages has been freed, return success
328If the requisite number of pages was not freed, decrement counter and scan another mm until the process list has been traversed once
330Return failure

J.6.2  Function: swap_out_mm

Source: mm/vmscan.c

Walk through each VMA and call swap_out_vma() for each one.

256 static inline int swap_out_mm(struct mm_struct * mm, int count, 
                  int * mmcounter, zone_t * classzone)
257 {
258     unsigned long address;
259     struct vm_area_struct* vma;
260 
265     spin_lock(&mm->page_table_lock);
266     address = mm->swap_address;
267     if (address == TASK_SIZE || swap_mm != mm) {
268         /* We raced: don't count this mm but try again */
269         ++*mmcounter;
270         goto out_unlock;
271     }
272     vma = find_vma(mm, address);
273     if (vma) {
274         if (address < vma->vm_start)
275             address = vma->vm_start;
276 
277         for (;;) {
278             count = swap_out_vma(mm, vma, address, 
                         count, classzone);
279             vma = vma->vm_next;
280             if (!vma)
281                 break;
282             if (!count)
283                 goto out_unlock;
284             address = vma->vm_start;
285         }
286     }
287     /* Indicate that we reached the end of address space */
288     mm->swap_address = TASK_SIZE;
289 
290 out_unlock:
291     spin_unlock(&mm->page_table_lock);
292     return count;
293 }
265Acquire the page table lock for this mm
266Start with the address contained in swap_address
267-271If the address is TASK_SIZE, it means that a thread raced and scanned this process already. Increase mmcounter so that the caller, swap_out(), knows to try another process without counting this one against the scan limit
272Find the VMA for this address
273Presuming a VMA was found then ....
274-275Start at the beginning of the VMA
277-285Scan through this and each subsequent VMA calling swap_out_vma() (See Section J.6.3) for each one. If the requisite number of pages (count) is freed, then finish scanning and return
288Once the last VMA has been scanned, set swap_address to TASK_SIZE so that this process will be skipped over by swap_out_mm() next time

J.6.3  Function: swap_out_vma

Source: mm/vmscan.c

Walk through this VMA and for each PGD in it, call swap_out_pgd().

227 static inline int swap_out_vma(struct mm_struct * mm, 
                   struct vm_area_struct * vma, 
                   unsigned long address, int count, 
                   zone_t * classzone)
228 {
229     pgd_t *pgdir;
230     unsigned long end;
231 
232     /* Don't swap out areas which are reserved */
233     if (vma->vm_flags & VM_RESERVED)
234         return count;
235 
236     pgdir = pgd_offset(mm, address);
237 
238     end = vma->vm_end;
239     BUG_ON(address >= end);
240     do {
241         count = swap_out_pgd(mm, vma, pgdir, 
                     address, end, count, classzone);
242         if (!count)
243             break;
244         address = (address + PGDIR_SIZE) & PGDIR_MASK;
245         pgdir++;
246     } while (address && (address < end));
247     return count;
248 }
233-234Skip over this VMA if the VM_RESERVED flag is set. This is used by some device drivers such as the SCSI generic driver
236Get the starting PGD for the address
238Mark where the end is and BUG() it if the starting address is somehow past the end
240Cycle through PGDs until the end address is reached
241Call swap_out_pgd()(See Section J.6.4) keeping count of how many more pages need to be freed
242-243If enough pages have been freed, break and return
244-245Move to the next PGD and move the address to the next PGD aligned address
247Return the remaining number of pages to be freed

J.6.4  Function: swap_out_pgd

Source: mm/vmscan.c

Step through all PMD's in the supplied PGD and call swap_out_pmd()

197 static inline int swap_out_pgd(struct mm_struct * mm, 
                   struct vm_area_struct * vma, pgd_t *dir, 
                   unsigned long address, unsigned long end, 
                   int count, zone_t * classzone)
198 {
199     pmd_t * pmd;
200     unsigned long pgd_end;
201 
202     if (pgd_none(*dir))
203         return count;
204     if (pgd_bad(*dir)) {
205         pgd_ERROR(*dir);
206         pgd_clear(dir);
207         return count;
208     }
209 
210     pmd = pmd_offset(dir, address);
211 
212     pgd_end = (address + PGDIR_SIZE) & PGDIR_MASK;  
213     if (pgd_end && (end > pgd_end))
214         end = pgd_end;
215     
216     do {
217         count = swap_out_pmd(mm, vma, pmd, 
                                 address, end, count, classzone);
218         if (!count)
219             break;
220         address = (address + PMD_SIZE) & PMD_MASK;
221         pmd++;
222     } while (address && (address < end));
223     return count;
224 }
202-203If there is no PGD, return
204-208If the PGD is bad, flag it as such and return
210Get the starting PMD
212-214Calculate the end to be the end of this PGD or the end of the VMA being scanned, whichever is closer
216-222For each PMD in this PGD, call swap_out_pmd() (See Section J.6.5). If enough pages get freed, break and return
223Return the number of pages remaining to be freed

J.6.5  Function: swap_out_pmd

Source: mm/vmscan.c

For each PTE in this PMD, call try_to_swap_out(). On completion, mm→swap_address is updated to show where the scan finished so that the same page will not be examined again soon after this scan.

158 static inline int swap_out_pmd(struct mm_struct * mm, 
                   struct vm_area_struct * vma, pmd_t *dir, 
                   unsigned long address, unsigned long end, 
                   int count, zone_t * classzone)
159 {
160     pte_t * pte;
161     unsigned long pmd_end;
162 
163     if (pmd_none(*dir))
164         return count;
165     if (pmd_bad(*dir)) {
166         pmd_ERROR(*dir);
167         pmd_clear(dir);
168         return count;
169     }
170     
171     pte = pte_offset(dir, address);
172     
173     pmd_end = (address + PMD_SIZE) & PMD_MASK;
174     if (end > pmd_end)
175         end = pmd_end;
176 
177     do {
178         if (pte_present(*pte)) {
179             struct page *page = pte_page(*pte);
180 
181             if (VALID_PAGE(page) && !PageReserved(page)) {
182                 count -= try_to_swap_out(mm, vma, 
                                 address, pte, 
                                 page, classzone);
183                 if (!count) {
184                     address += PAGE_SIZE;
185                     break;
186                 }
187             }
188         }
189         address += PAGE_SIZE;
190         pte++;
191     } while (address && (address < end));
192     mm->swap_address = address;
193     return count;
194 }
163-164Return if there is no PMD
165-169If the PMD is bad, flag it as such and return
171Get the starting PTE
173-175Calculate the end to be the end of the PMD or the end of the VMA, whichever is closer
177-191Cycle through each PTE
178Make sure the PTE is marked present
179Get the struct page for this PTE
181If it is a valid page and it is not reserved then ...
182Call try_to_swap_out()
183-186If enough pages have been swapped out, move the address to the next page and break to return
189-190Move to the next page and PTE
192Update the swap_address to show where we last finished off
193Return the number of pages remaining to be freed

J.6.6  Function: try_to_swap_out

Source: mm/vmscan.c

This function tries to swap out a page from a process. It is quite a large function, so it will be dealt with in parts. Broadly speaking, they are: the preamble which ensures the page should be swapped out, the removal of the page and PTE from the page tables, the handling of pages already in the swap cache or backed by a file, and the allocation of a swap entry for dirty anonymous pages.

 47 static inline int try_to_swap_out(struct mm_struct * mm, 
                    struct vm_area_struct* vma, 
                    unsigned long address, 
                    pte_t * page_table, 
                    struct page *page, 
                    zone_t * classzone)
 48 {
 49     pte_t pte;
 50     swp_entry_t entry;
 51 
 52     /* Don't look at this pte if it's been accessed recently. */
 53     if ((vma->vm_flags & VM_LOCKED) ||
        ptep_test_and_clear_young(page_table)) {
 54         mark_page_accessed(page);
 55         return 0;
 56     }
 57 
 58     /* Don't bother unmapping pages that are active */
 59     if (PageActive(page))
 60         return 0;
 61 
 62     /* Don't bother replenishing zones not under pressure.. */
 63     if (!memclass(page_zone(page), classzone))
 64         return 0;
 65 
 66     if (TryLockPage(page))
 67         return 0;
53-56If the VMA is locked in memory (VM_LOCKED) or the PTE shows the page has been accessed recently, then the referenced bit in the PTE is cleared by ptep_test_and_clear_young() and mark_page_accessed() (See Section J.2.3.1) is called to make the struct page reflect the age. Return 0 to show it was not swapped out
59-60If the page is on the active_list, do not swap it out
63-64If the page belongs to a zone we are not interested in, do not swap it out
66-67If the page is already locked for IO, skip it
 74     flush_cache_page(vma, address);
 75     pte = ptep_get_and_clear(page_table);
 76     flush_tlb_page(vma, address);
 77 
 78     if (pte_dirty(pte))
 79         set_page_dirty(page);
 80 
74Call the architecture hook to flush this page from all CPUs
75Get the PTE from the page tables and clear it
76Call the architecture hook to flush the TLB
78-79If the PTE was marked dirty, mark the struct page dirty so it will be laundered correctly
 86     if (PageSwapCache(page)) {
 87         entry.val = page->index;
 88         swap_duplicate(entry);
 89 set_swap_pte:
 90         set_pte(page_table, swp_entry_to_pte(entry));
 91 drop_pte:
 92         mm->rss--;
 93         UnlockPage(page);
 94         {
 95             int freeable = 
                 page_count(page) - !!page->buffers <= 2;
 96             page_cache_release(page);
 97             return freeable;
 98         }
 99     }

Handle the case where the page is already in the swap cache

86Enter this block only if the page is already in the swap cache. Note that it can also be entered via a goto to the set_swap_pte and drop_pte labels
87-88Fill in the index value for the swap entry. swap_duplicate() verifies the swap identifier is valid and increases the counter in the swap_map if it is
90Fill the PTE with information needed to get the page from swap
92Update RSS to show there is one less page being mapped by the process
93Unlock the page
95The page is free-able if the count is currently 2 or less and has no buffers. If the count is higher, it is either being mapped by other processes or is a file-backed page and the “user” is the page cache
96Decrement the reference count and free the page if it reaches 0. Note that if this is a file-backed page, it will not reach 0 even if there are no processes mapping it. The page will be later reclaimed from the page cache by shrink_cache() (See Section J.4.1)
97Return if the page was freed or not
115     if (page->mapping)
116         goto drop_pte;
117     if (!PageDirty(page))
118         goto drop_pte;
124     if (page->buffers)
125         goto preserve;
115-116If the page has an associated mapping, simply drop it from the page tables. When no processes are mapping it, it will be reclaimed from the page cache by shrink_cache()
117-118If the page is clean, it is safe to simply drop it
124-125If it has associated buffers due to a truncate followed by a page fault, then re-attach the page and PTE to the page tables as it cannot be handled yet
126 
127     /*
128      * This is a dirty, swappable page.  First of all,
129      * get a suitable swap entry for it, and make sure
130      * we have the swap cache set up to associate the
131      * page with that swap entry.
132      */
133     for (;;) {
134         entry = get_swap_page();
135         if (!entry.val)
136             break;
137         /* Add it to the swap cache and mark it dirty
138          * (adding to the page cache will clear the dirty
139          * and uptodate bits, so we need to do it again)
140          */
141         if (add_to_swap_cache(page, entry) == 0) {
142             SetPageUptodate(page);
143             set_page_dirty(page);
144             goto set_swap_pte;
145         }
146         /* Raced with "speculative" read_swap_cache_async */
147         swap_free(entry);
148     }
149 
150     /* No swap space left */
151 preserve:
152     set_pte(page_table, pte);
153     UnlockPage(page);
154     return 0;
155 }
134Allocate a swap entry for this page
135-136If one could not be allocated, break out where the PTE and page will be re-attached to the process page tables
141Add the page to the swap cache
142Mark the page as up to date in memory
143Mark the page dirty so that it will be written out to swap soon
144Goto set_swap_pte which will update the PTE with information needed to get the page from swap later
147If the add to swap cache failed, it means that the page was placed in the swap cache already by a readahead so drop the work done here
152Reattach the PTE to the page tables
153Unlock the page
154Return that no page was freed

J.7  Page Swap Daemon

This section details the main loops used by the kswapd daemon, which is woken up when memory is low. The main functions covered are the ones that determine if kswapd can sleep and how it determines which nodes need balancing.

J.7.1  Initialising kswapd

J.7.1.1  Function: kswapd_init

Source: mm/vmscan.c

Start the kswapd kernel thread

767 static int __init kswapd_init(void)
768 {
769     printk("Starting kswapd\n");
770     swap_setup();
771     kernel_thread(kswapd, NULL, CLONE_FS 
                                  | CLONE_FILES 
                                  | CLONE_SIGNAL);
772     return 0;
773 }
770swap_setup()(See Section K.4.2) sets up how many pages will be prefetched when reading from backing storage based on the amount of physical memory
771Start the kswapd kernel thread

J.7.2  kswapd Daemon

J.7.2.1  Function: kswapd

Source: mm/vmscan.c

The main function of the kswapd kernel thread.

720 int kswapd(void *unused)
721 {
722     struct task_struct *tsk = current;
723     DECLARE_WAITQUEUE(wait, tsk);
724 
725     daemonize();
726     strcpy(tsk->comm, "kswapd");
727     sigfillset(&tsk->blocked);
728     
741     tsk->flags |= PF_MEMALLOC;
742 
746     for (;;) {
747         __set_current_state(TASK_INTERRUPTIBLE);
748         add_wait_queue(&kswapd_wait, &wait);
749 
750         mb();
751         if (kswapd_can_sleep())
752             schedule();
753 
754         __set_current_state(TASK_RUNNING);
755         remove_wait_queue(&kswapd_wait, &wait);
756 
762         kswapd_balance();
763         run_task_queue(&tq_disk);
764     }
765 }
725Call daemonize() which will make this a kernel thread, remove the mm context, close all files and re-parent the process
726Set the name of the process
727Ignore all signals
741By setting this flag, the physical page allocator will always try to satisfy requests for pages. As this process will always be trying to free pages, it is worth satisfying requests
746-764Endlessly loop
747-748This adds kswapd to the wait queue in preparation to sleep
750The memory barrier mb() ensures that all reads and writes that occurred before this line will be visible to all CPUs
751kswapd_can_sleep()(See Section J.7.2.2) cycles through all nodes and zones checking the need_balance field. If any of them are set to 1, kswapd can not sleep
752By calling schedule(), kswapd will now sleep until woken again by the physical page allocator in __alloc_pages() (See Section F.1.3)
754-755Once woken up, kswapd is removed from the wait queue as it is now running
762kswapd_balance()(See Section J.7.2.4) cycles through all zones and calls try_to_free_pages_zone()(See Section J.5.3) for each zone that requires balance
763Run the IO task queue to start writing data out to disk

J.7.2.2  Function: kswapd_can_sleep

Source: mm/vmscan.c

Simple function to cycle through all pgdats to call kswapd_can_sleep_pgdat() on each.

695 static int kswapd_can_sleep(void)
696 {
697     pg_data_t * pgdat;
698 
699     for_each_pgdat(pgdat) {
700         if (!kswapd_can_sleep_pgdat(pgdat))
701             return 0;
702     }
703 
704     return 1;
705 }
699-702for_each_pgdat() does exactly as the name implies. It cycles through all available pgdats and in this case calls kswapd_can_sleep_pgdat() (See Section J.7.2.3) for each; a sketch of the macro follows. On the x86, there will only be one pgdat
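
A sketch of the macro, modelled on 2.4's include/linux/mmzone.h and shown here for illustration, simply follows the node_next links from the global pgdat_list:

/* Illustrative sketch: walk every pg_data_t in the system by following
 * the node_next pointers starting at pgdat_list. */
#define for_each_pgdat(pgdat) \
        for (pgdat = pgdat_list; pgdat; pgdat = pgdat->node_next)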

J.7.2.3  Function: kswapd_can_sleep_pgdat

Source: mm/vmscan.c

Cycles through all zones to make sure none of them need balance. The zone→need_balance flag is set by __alloc_pages() when the number of free pages in the zone reaches the pages_low watermark.

680 static int kswapd_can_sleep_pgdat(pg_data_t * pgdat)
681 {
682     zone_t * zone;
683     int i;
684 
685     for (i = pgdat->nr_zones-1; i >= 0; i--) {
686         zone = pgdat->node_zones + i;
687         if (!zone->need_balance)
688             continue;
689         return 0;
690     }
691 
692     return 1;
693 }
685-689Simple for loop to cycle through all zones
686The node_zones field is an array of all available zones so adding i gives the index
687-688If the zone does not need balance, continue
689Return 0 if any zone needs balance, indicating kswapd cannot sleep
692Return indicating kswapd can sleep if the for loop completes

J.7.2.4  Function: kswapd_balance

Source: mm/vmscan.c

Continuously cycle through each pgdat until none require balancing

667 static void kswapd_balance(void)
668 {
669     int need_more_balance;
670     pg_data_t * pgdat;
671 
672     do {
673         need_more_balance = 0;
674 
675         for_each_pgdat(pgdat)
676             need_more_balance |= kswapd_balance_pgdat(pgdat);
677     } while (need_more_balance);
678 }
672-677Cycle through all pgdats until none of them report that they need balancing
675For each pgdat, call kswapd_balance_pgdat() to check if the node requires balancing. If any node required balancing, need_more_balance will be set to 1

J.7.2.5  Function: kswapd_balance_pgdat

Source: mm/vmscan.c

This function will check if a node requires balance by examining each of the zones in it. If any zone requires balancing, try_to_free_pages_zone() will be called.

641 static int kswapd_balance_pgdat(pg_data_t * pgdat)
642 {
643     int need_more_balance = 0, i;
644     zone_t * zone;
645 
646     for (i = pgdat->nr_zones-1; i >= 0; i--) {
647         zone = pgdat->node_zones + i;
648         if (unlikely(current->need_resched))
649             schedule();
650         if (!zone->need_balance)
651             continue;
652         if (!try_to_free_pages_zone(zone, GFP_KSWAPD)) {
653             zone->need_balance = 0;
654             __set_current_state(TASK_INTERRUPTIBLE);
655             schedule_timeout(HZ);
656             continue;
657         }
658         if (check_classzone_need_balance(zone))
659             need_more_balance = 1;
660         else
661             zone->need_balance = 0;
662     }
663 
664     return need_more_balance;
665 }
646-662Cycle through each zone and call try_to_free_pages_zone() (See Section J.5.3) if it needs re-balancing
647node_zones is an array and i is an index within it
648-649Call schedule() if the quanta is expired to prevent kswapd hogging the CPU
650-651If the zone does not require balance, move to the next one
652-657If the function returns 0, it means the out_of_memory() function was called because a sufficient number of pages could not be freed. kswapd sleeps for 1 second to give the system a chance to reclaim the killed process's pages and perform IO. The zone is marked as balanced so kswapd will ignore this zone until the allocator function __alloc_pages() complains again
658-661If it was successful, check_classzone_need_balance() is called to see if the zone requires further balancing or not
664Return 1 if any zone requires further balancing

