
Appendix K  Swap Management

K.1  Scanning for Free Entries

K.1.1  Function: get_swap_page

Source: mm/swapfile.c

The call graph for this function is shown in Figure 11.2. This is the high-level API function for searching the swap areas for a free swap slot and returning the resulting swp_entry_t.

 99 swp_entry_t get_swap_page(void)
100 {
101     struct swap_info_struct * p;
102     unsigned long offset;
103     swp_entry_t entry;
104     int type, wrapped = 0;
105 
106     entry.val = 0;  /* Out of memory */
107     swap_list_lock();
108     type = swap_list.next;
109     if (type < 0)
110         goto out;
111     if (nr_swap_pages <= 0)
112         goto out;
113 
114     while (1) {
115         p = &swap_info[type];
116         if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) {
117             swap_device_lock(p);
118             offset = scan_swap_map(p);
119             swap_device_unlock(p);
120             if (offset) {
121                 entry = SWP_ENTRY(type,offset);
122                 type = swap_info[type].next;
123                 if (type < 0 ||
124                     p->prio != swap_info[type].prio) {
125                       swap_list.next = swap_list.head;
126                 } else {
127                     swap_list.next = type;
128                 }
129                 goto out;
130             }
131         }
132         type = p->next;
133         if (!wrapped) {
134             if (type < 0 || p->prio != swap_info[type].prio) {
135                 type = swap_list.head;
136                 wrapped = 1;
137             }
138         } else
139             if (type < 0)
140                 goto out;     /* out of swap space */
141     }
142 out:
143     swap_list_unlock();
144     return entry;
145 }
107 Lock the list of swap areas
108 Get the next swap area to allocate from. This list is ordered by the priority of the swap areas
109-110 If there are no swap areas, return NULL
111-112 If the accounting says there are no available swap slots, return NULL
114-141 Cycle through all swap areas
115 Get the current swap_info_struct from the swap_info array
116 If this swap area is available for writing to and is active...
117 Lock the swap area
118 Call scan_swap_map() (See Section K.1.2), which searches the requested swap map for a free slot
119 Unlock the swap device
120-130 If a slot was free...
121 Encode an identifier for the entry with SWP_ENTRY()
122 Record the next swap area to use
123-126 If the next area is the end of the list, or the priority of the next swap area does not match the current one, move back to the head
126-128 Otherwise move to the next area
129 Goto out
132 Move to the next swap area
133-138 Check for wraparound. Set wrapped to 1 if the end of the list of swap areas has been reached
139-140 If there are no available swap areas, goto out
142 The exit path of this function
143 Unlock the swap area list
144 Return the entry if one was found and NULL otherwise
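The SWP_ENTRY() identifier built at line 121 packs the swap area index and the slot offset into a single unsigned long. The standalone sketch below illustrates one plausible packing, loosely modeled on the i386 layout; the shift and mask values here are assumptions for illustration, not the authoritative architecture definitions.

```c
#include <assert.h>

/* Toy swp_entry_t packing. SWP_TYPE_SHIFT/MASK and SWP_OFFSET_SHIFT
 * are assumed values loosely modeled on i386; real values are
 * architecture-specific. */
typedef struct { unsigned long val; } swp_entry_t;

#define SWP_TYPE_SHIFT   1
#define SWP_TYPE_MASK    0x3fUL
#define SWP_OFFSET_SHIFT 8

/* Pack an area index (type) and a swap_map offset into one word */
static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
{
    swp_entry_t e;
    e.val = (type << SWP_TYPE_SHIFT) | (offset << SWP_OFFSET_SHIFT);
    return e;
}

/* Recover the swap area index */
static unsigned long swp_type(swp_entry_t e)
{
    return (e.val >> SWP_TYPE_SHIFT) & SWP_TYPE_MASK;
}

/* Recover the offset within that area's swap_map */
static unsigned long swp_offset(swp_entry_t e)
{
    return e.val >> SWP_OFFSET_SHIFT;
}
```

Because the two bit fields do not overlap, swp_type() and swp_offset() recover both halves losslessly, which is what lets a single word stand in for a (area, slot) pair throughout the swap code.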

K.1.2  Function: scan_swap_map

Source: mm/swapfile.c

This function tries to allocate SWAPFILE_CLUSTER pages sequentially in swap. When it has allocated that many, it searches for another block of free slots of size SWAPFILE_CLUSTER. If it fails to find one, it resorts to allocating the first free slot. This clustering makes it likely that slots are allocated and freed in SWAPFILE_CLUSTER-sized chunks.

 36 static inline int scan_swap_map(struct swap_info_struct *si)
 37 {
 38     unsigned long offset;
 47     if (si->cluster_nr) {
 48         while (si->cluster_next <= si->highest_bit) {
 49             offset = si->cluster_next++;
 50             if (si->swap_map[offset])
 51                 continue;
 52             si->cluster_nr--;
 53             goto got_page;
 54         }
 55     }

Allocate SWAPFILE_CLUSTER pages sequentially. cluster_nr is initialised to SWAPFILE_CLUSTER and decrements with each allocation

47 If cluster_nr is still positive, allocate the next available sequential slot
48 While the current offset to use (cluster_next) does not exceed the highest known free slot (highest_bit)...
49 Record the offset and update cluster_next to the next slot to try
50-51 If the slot is not actually free, move to the next one
52 A slot has been found, so decrement the cluster_nr field
53 Goto the out path
 56     si->cluster_nr = SWAPFILE_CLUSTER;
 57 
 58     /* try to find an empty (even not aligned) cluster. */
 59     offset = si->lowest_bit;
 60  check_next_cluster:
 61     if (offset+SWAPFILE_CLUSTER-1 <= si->highest_bit)
 62     {
 63         int nr;
 64         for (nr = offset; nr < offset+SWAPFILE_CLUSTER; nr++)
 65             if (si->swap_map[nr])
 66             {
 67                 offset = nr+1;
 68                 goto check_next_cluster;
 69             }
 70         /* We found a completly empty cluster, so start
 71          * using it.
 72          */
 73         goto got_page;
 74     }

At this stage, SWAPFILE_CLUSTER pages have been allocated sequentially so find the next free block of SWAPFILE_CLUSTER pages.

56 Re-initialise the count of sequential pages to allocate to SWAPFILE_CLUSTER
59 Start searching at the lowest known free slot
61 If a full cluster starting at offset would end at or before the highest known free slot, examine all the slots in it to see if this is a large free block
64 Scan from offset to offset + SWAPFILE_CLUSTER
65-69 If this slot is in use, restart the search for a free cluster beginning after this known allocated one
73 A large cluster was found, so use it
 75     /* No luck, so now go finegrined as usual. -Andrea */
 76     for (offset = si->lowest_bit; offset <= si->highest_bit ;
                                offset++) {
 77         if (si->swap_map[offset])
 78             continue;
 79         si->lowest_bit = offset+1;

This excerpt of the unusual for loop starts scanning for a free slot from lowest_bit

77-78 If the slot is in use, move to the next one
79 Update lowest_bit, the lowest known probable free slot, to the succeeding one
 80     got_page:
 81         if (offset == si->lowest_bit)
 82             si->lowest_bit++;
 83         if (offset == si->highest_bit)
 84             si->highest_bit--;
 85         if (si->lowest_bit > si->highest_bit) {
 86             si->lowest_bit = si->max;
 87             si->highest_bit = 0;
 88         }
 89         si->swap_map[offset] = 1;
 90         nr_swap_pages--;
 91         si->cluster_next = offset+1;
 92         return offset;
 93     }
 94     si->lowest_bit = si->max;
 95     si->highest_bit = 0;
 96     return 0;
 97 }

A slot has been found, do some housekeeping and return it

81-82 If this offset is the known lowest free slot (lowest_bit), increment it
83-84 If this offset is the highest known likely free slot, decrement it
85-88 If the low and high marks meet, the swap area is not worth searching any more because these marks represent the lowest and highest known free slots. Set the low mark to the highest possible slot and the high mark to 0 to cut down on search time later. This will be fixed up the next time a slot is freed
89 Set the reference count for the slot
90 Update the accounting for the number of available swap pages (nr_swap_pages)
91 Set cluster_next to the adjacent slot so the next search will start there
92 Return the free slot
94-96 No free slot is available, so mark the area unsearchable and return 0
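The three phases of scan_swap_map() can be mirrored with a small user-space model. This is a toy sketch, not kernel code: the map is a plain array, the cluster size and all names are invented, and locking and the lowest_bit/highest_bit window are omitted for brevity.

```c
#include <assert.h>

/* Toy model of the scan_swap_map() clustering policy. MAP_SIZE and
 * CLUSTER are invented stand-ins; 0 in the map means a free slot. */
#define MAP_SIZE 64
#define CLUSTER  4   /* stand-in for SWAPFILE_CLUSTER */

struct toy_area {
    unsigned char map[MAP_SIZE];
    int cluster_next;   /* next sequential slot to try */
    int cluster_nr;     /* sequential allocations left in this cluster */
};

static struct toy_area area;   /* zero-initialised: all slots free */

/* Returns a free offset, or -1 if the area is full */
static int toy_scan(struct toy_area *a)
{
    int off, nr;

    /* Phase 1: keep allocating sequentially inside the current cluster */
    if (a->cluster_nr) {
        while (a->cluster_next < MAP_SIZE) {
            off = a->cluster_next++;
            if (a->map[off])
                continue;
            a->cluster_nr--;
            goto got;
        }
    }
    a->cluster_nr = CLUSTER;

    /* Phase 2: look for an entirely free cluster-sized run */
    for (off = 0; off + CLUSTER - 1 < MAP_SIZE; off++) {
        for (nr = off; nr < off + CLUSTER; nr++)
            if (a->map[nr])
                break;
        if (nr == off + CLUSTER)
            goto got;
    }

    /* Phase 3: fall back to the first free slot anywhere */
    for (off = 0; off < MAP_SIZE; off++)
        if (!a->map[off])
            goto got;
    return -1;
got:
    a->map[off] = 1;
    a->cluster_next = off + 1;
    return off;
}
```

On a fresh area, successive calls hand out offsets 0, 1, 2, ... because phase 2 seeds a cluster at the start and phase 1 then fills it sequentially, which is exactly the locality the kernel's clustering aims for.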

K.2  Swap Cache

K.2.1  Adding Pages to the Swap Cache

K.2.1.1  Function: add_to_swap_cache

Source: mm/swap_state.c

The call graph for this function is shown in Figure 11.3. This function wraps around the normal page cache handler. It first verifies the swap entry is still in use with swap_duplicate() and then adds the page to the page cache with add_to_page_cache_unique().

 70 int add_to_swap_cache(struct page *page, swp_entry_t entry)
 71 {
 72     if (page->mapping)
 73         BUG();
 74     if (!swap_duplicate(entry)) {
 75         INC_CACHE_INFO(noent_race);
 76         return -ENOENT;
 77     }
 78     if (add_to_page_cache_unique(page, &swapper_space, entry.val,
 79             page_hash(&swapper_space, entry.val)) != 0) {
 80         swap_free(entry);
 81         INC_CACHE_INFO(exist_race);
 82         return -EEXIST;
 83     }
 84     if (!PageLocked(page))
 85         BUG();
 86     if (!PageSwapCache(page))
 87         BUG();
 88     INC_CACHE_INFO(add_total);
 89     return 0;
 90 }
72-73 A check is made with PageSwapCache() before this function is called to make sure the page is not already in the swap cache. The check here ensures the page has no other existing mapping in case the caller was careless and did not make the check
74-77 Use swap_duplicate() (See Section K.2.1.2) to try and increment the count for this entry. If the entry has been freed since the caller observed it, increment the statistic recording the number of such races (noent_race) and return -ENOENT
78 Try to add the page to the page cache with add_to_page_cache_unique() (See Section J.1.1.2). This function is similar to add_to_page_cache() (See Section J.1.1.1) except it searches the page cache for a duplicate entry with __find_page_nolock(). The managing address space is swapper_space. The "offset within the file" in this case is the offset within swap_map, hence entry.val, and finally the page is hashed based on address_space and offset within swap_map
80-83 If it already existed in the page cache, we raced, so increment the statistic recording the number of races to insert an existing page into the swap cache and return -EEXIST
84-85 If the page is not locked for IO, it is a bug
86-87 If it is not now in the swap cache, something went seriously wrong
88 Increment the statistic recording the total number of pages added to the swap cache
89 Return success
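The ordering in add_to_swap_cache() follows a common pattern: take a reference first with swap_duplicate(), attempt the insertion, and drop the reference again if the insertion loses a race. A toy model of that pattern, in which the cache, the reference counts and the error constants are all invented stand-ins:

```c
#include <assert.h>

/* Toy "reference first, roll back on failure" model of
 * add_to_swap_cache(). TOY_ENOENT/TOY_EEXIST are stand-in errnos. */
enum { TOY_ENOENT = -2, TOY_EEXIST = -17 };

#define SLOTS 8
static int ref[SLOTS]    = { 0, 1 };  /* slot 0 freed, slot 1 valid */
static int cached[SLOTS];             /* 1 if a page is cached for the slot */

/* Model of swap_duplicate(): fails on a freed slot */
static int toy_dup(int slot)
{
    if (!ref[slot])
        return 0;
    ref[slot]++;
    return 1;
}

/* Model of swap_free(): drop one reference */
static void toy_free(int slot)
{
    ref[slot]--;
}

static int toy_add_to_cache(int slot)
{
    if (!toy_dup(slot))      /* slot freed since the caller looked */
        return TOY_ENOENT;
    if (cached[slot]) {      /* lost a race with another inserter */
        toy_free(slot);      /* roll back the reference we just took */
        return TOY_EEXIST;
    }
    cached[slot] = 1;
    return 0;
}
```

The point of the rollback is that on either failure path the reference count is exactly what it was before the call, so a racing caller never leaks a swap_map reference.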

K.2.1.2  Function: swap_duplicate

Source: mm/swapfile.c

This function verifies a swap entry is valid and if so, increments its swap map count.

1161 int swap_duplicate(swp_entry_t entry)
1162 {
1163     struct swap_info_struct * p;
1164     unsigned long offset, type;
1165     int result = 0;
1166 
1167     type = SWP_TYPE(entry);
1168     if (type >= nr_swapfiles)
1169         goto bad_file;
1170     p = type + swap_info;
1171     offset = SWP_OFFSET(entry);
1172 
1173     swap_device_lock(p);
1174     if (offset < p->max && p->swap_map[offset]) {
1175         if (p->swap_map[offset] < SWAP_MAP_MAX - 1) {
1176             p->swap_map[offset]++;
1177             result = 1;
1178         } else if (p->swap_map[offset] <= SWAP_MAP_MAX) {
1179             if (swap_overflow++ < 5)
1180                 printk(KERN_WARNING "swap_dup: swap entry
                              overflow\n");
1181             p->swap_map[offset] = SWAP_MAP_MAX;
1182             result = 1;
1183         }
1184     }
1185     swap_device_unlock(p);
1186 out:
1187     return result;
1188 
1189 bad_file:
1190     printk(KERN_ERR "swap_dup: %s%08lx\n", Bad_file, entry.val);
1191     goto out;
1192 }
1161 The parameter is the swap entry to increase the swap_map count for
1167-1169 Get the index within swap_info of the swap_info_struct containing this entry. If it is beyond the number of swap areas, goto bad_file
1170-1171 Get the relevant swap_info_struct and the offset within its swap_map
1173 Lock the swap device
1174 Make a quick sanity check to ensure the offset is within the swap_map and that the slot indicated has a positive count. A 0 count means the slot is free, so a swp_entry_t referring to it is bogus
1175-1177 If incrementing the count would not reach SWAP_MAP_MAX, simply increment it and return 1 for success
1178-1183 Otherwise the count would overflow, so set it to SWAP_MAP_MAX, reserving the slot permanently. In reality this condition is virtually impossible
1185-1187 Unlock the swap device and return
1190-1191 If a bad device was used, print out the error message and return failure
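The count handling in swap_duplicate() reduces to a small pure function: increments proceed normally until the count would reach SWAP_MAP_MAX, at which point it saturates and the slot is pinned. A sketch, with an arbitrary stand-in value for SWAP_MAP_MAX:

```c
#include <assert.h>

#define TOY_MAP_MAX 0x7fff   /* arbitrary stand-in for SWAP_MAP_MAX */

/* Returns the new count, or -1 for an unused (bogus) slot.
 * Mirrors the branch structure of swap_duplicate(). */
static int toy_dup_count(int count)
{
    if (count == 0)
        return -1;                 /* free slot: bogus swp_entry_t */
    if (count < TOY_MAP_MAX - 1)
        return count + 1;          /* normal case */
    if (count <= TOY_MAP_MAX)
        return TOY_MAP_MAX;        /* saturate: slot pinned forever */
    return -1;
}
```

Once the count saturates it never decreases below SWAP_MAP_MAX again in this scheme, which is why the kernel treats the condition as a (rarely seen) permanent reservation rather than an error.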

K.2.2  Deleting Pages from the Swap Cache

K.2.2.1  Function: swap_free

Source: mm/swapfile.c

Decrements the corresponding swap_map entry for the swp_entry_t

214 void swap_free(swp_entry_t entry)
215 {
216     struct swap_info_struct * p;
217 
218     p = swap_info_get(entry);
219     if (p) {
220         swap_entry_free(p, SWP_OFFSET(entry));
221         swap_info_put(p);
222     }
223 }
218 swap_info_get() (See Section K.2.3.1) fetches the correct swap_info_struct and performs a number of debugging checks to ensure it is a valid area and a valid swap_map entry. If all is sane, it will lock the swap device
219-222 If it is valid, the corresponding swap_map entry is decremented with swap_entry_free() (See Section K.2.2.2) and swap_info_put() (See Section K.2.3.2) is called to unlock the device and the swap list

K.2.2.2  Function: swap_entry_free

Source: mm/swapfile.c

192 static int swap_entry_free(struct swap_info_struct *p, 
                 unsigned long offset)
193 {
194     int count = p->swap_map[offset];
195 
196     if (count < SWAP_MAP_MAX) {
197         count--;
198         p->swap_map[offset] = count;
199         if (!count) {
200             if (offset < p->lowest_bit)
201                 p->lowest_bit = offset;
202             if (offset > p->highest_bit)
203                 p->highest_bit = offset;
204             nr_swap_pages++;
205         }
206     }
207     return count;
208 }
194 Get the current count
196 If the count indicates the slot is not permanently reserved...
197-198 Decrement the count and store it in the swap_map
199 If the count reaches 0, the slot is free, so update some information
200-201 If this freed slot is below lowest_bit, update lowest_bit, which indicates the lowest known free slot
202-203 Similarly, update highest_bit if this newly freed slot is above it
204 Increment the count of the number of free swap slots (nr_swap_pages)
207 Return the current count
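The interaction between a freed slot and the lowest_bit/highest_bit search window can be demonstrated with a toy version of swap_entry_free(). The structure below is an invented stand-in for swap_info_struct, with permanently reserved (SWAP_MAP_MAX) slots ignored for brevity:

```c
#include <assert.h>

/* Toy stand-in for swap_info_struct */
struct toy_si {
    unsigned char map[16];       /* reference count per slot */
    int lowest_bit, highest_bit; /* known-free search window */
    int nr_free;                 /* stand-in for nr_swap_pages */
};

/* Slots 2 and 7 are in use; the search window is currently [3, 6] */
static struct toy_si si = {
    .map = { 0, 0, 1, 0, 0, 0, 0, 1 },
    .lowest_bit = 3, .highest_bit = 6, .nr_free = 0,
};

/* Decrement a slot and, if it becomes free, widen the window to
 * include it (mirrors swap_entry_free(), minus SWAP_MAP_MAX handling) */
static int toy_entry_free(struct toy_si *p, int offset)
{
    int count = p->map[offset];

    if (count) {
        count--;
        p->map[offset] = count;
        if (!count) {
            if (offset < p->lowest_bit)
                p->lowest_bit = offset;
            if (offset > p->highest_bit)
                p->highest_bit = offset;
            p->nr_free++;
        }
    }
    return count;
}
```

Freeing slot 2 widens the window downwards and freeing slot 7 widens it upwards, so later scans will consider the newly freed slots.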

K.2.3  Acquiring/Releasing Swap Cache Pages

K.2.3.1  Function: swap_info_get

Source: mm/swapfile.c

This function finds the swap_info_struct for the given entry, performs some basic checking and then locks the device.

147 static struct swap_info_struct * swap_info_get(swp_entry_t entry)
148 {
149     struct swap_info_struct * p;
150     unsigned long offset, type;
151 
152     if (!entry.val)
153         goto out;
154     type = SWP_TYPE(entry);
155     if (type >= nr_swapfiles)
156         goto bad_nofile;
157     p = & swap_info[type];
158     if (!(p->flags & SWP_USED))
159         goto bad_device;
160     offset = SWP_OFFSET(entry);
161     if (offset >= p->max)
162         goto bad_offset;
163     if (!p->swap_map[offset])
164         goto bad_free;
165     swap_list_lock();
166     if (p->prio > swap_info[swap_list.next].prio)
167         swap_list.next = type;
168     swap_device_lock(p);
169     return p;
170 
171 bad_free:
172     printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, 
                                                entry.val);
173     goto out;
174 bad_offset:
175     printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, 
                                                entry.val);
176     goto out;
177 bad_device:
178     printk(KERN_ERR "swap_free: %s%08lx\n", Unused_file, 
                                                entry.val);
179     goto out;
180 bad_nofile:
181     printk(KERN_ERR "swap_free: %s%08lx\n", Bad_file, 
                                                entry.val);
182 out:
183     return NULL;
184 } 
152-153 If the supplied entry is NULL, return
154 Get the index within the swap_info array
155-156 Ensure it is a valid area
157 Get the address of the area
158-159 If the area is not active yet, print a bad device error and return
160 Get the offset within the swap_map
161-162 Make sure the offset is not after the end of the map
163-164 Make sure the slot is currently in use
165 Lock the swap area list
166-167 If this area is of higher priority than the area that would be next, ensure the current area is used
168-169 Lock the swap device and return the swap area descriptor
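The sequence of sanity checks in swap_info_get() forms a validation chain where each test rejects one class of bogus swp_entry_t. A toy sketch of that chain, using invented constants and a distinct return code for each failure:

```c
#include <assert.h>

/* Invented stand-ins for nr_swapfiles and an area's max offset */
#define TOY_NR_AREAS 2
#define TOY_MAX      8

/* Area 0: slots 1 and 2 in use. Area 1 exists but is not swapped on. */
static unsigned char toy_map[TOY_NR_AREAS][TOY_MAX] = {
    { 0, 1, 1, 0 },
};
static int toy_active[TOY_NR_AREAS] = { 1, 0 };

/* 0 = valid; each negative code identifies which check failed,
 * in the same order swap_info_get() applies them */
static int toy_validate(int type, int offset)
{
    if (type >= TOY_NR_AREAS)
        return -1;               /* "Bad swap file entry" */
    if (!toy_active[type])
        return -2;               /* area exists but is unused */
    if (offset >= TOY_MAX)
        return -3;               /* past the end of the swap_map */
    if (!toy_map[type][offset])
        return -4;               /* slot free: stale entry */
    return 0;
}
```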

K.2.3.2  Function: swap_info_put

Source: mm/swapfile.c

This function simply unlocks the area and list

186 static void swap_info_put(struct swap_info_struct * p)
187 {
188     swap_device_unlock(p);
189     swap_list_unlock();
190 }
188 Unlock the device
189 Unlock the swap area list

K.2.4  Searching the Swap Cache

K.2.4.1  Function: lookup_swap_cache

Source: mm/swap_state.c

Top level function for finding a page in the swap cache

161 struct page * lookup_swap_cache(swp_entry_t entry)
162 {
163     struct page *found;
164 
165     found = find_get_page(&swapper_space, entry.val);
166     /*
167      * Unsafe to assert PageSwapCache and mapping on page found:
168      * if SMP nothing prevents swapoff from deleting this page from
169      * the swap cache at this moment.  find_lock_page would prevent
170      * that, but no need to change: we _have_ got the right page.
171      */
172     INC_CACHE_INFO(find_total);
173     if (found)
174         INC_CACHE_INFO(find_success);
175     return found;
176 }
165 find_get_page() (See Section J.1.4.1) is the principal function for returning the struct page. It uses the normal page hashing and cache functions to find it quickly
172 Increase the statistic recording the number of times a page was searched for in the cache
173-174 If one was found, increment the successful find count
175 Return the struct page, or NULL if it did not exist

K.3  Swap Area IO

K.3.1  Reading Backing Storage

K.3.1.1  Function: read_swap_cache_async

Source: mm/swap_state.c

This function returns the requested page from the swap cache if it already exists there. Otherwise, a page is allocated, placed in the swap cache and the data is scheduled to be read from disk with rw_swap_page().

184 struct page * read_swap_cache_async(swp_entry_t entry)
185 {
186     struct page *found_page, *new_page = NULL;
187     int err;
188 
189     do {
196         found_page = find_get_page(&swapper_space, entry.val);
197         if (found_page)
198             break;
199 
200         /*
201          * Get a new page to read into from swap.
202          */
203         if (!new_page) {
204             new_page = alloc_page(GFP_HIGHUSER);
205             if (!new_page)
206                 break;          /* Out of memory */
207         }
208 
209         /*
210          * Associate the page with swap entry in the swap cache.
211          * May fail (-ENOENT) if swap entry has been freed since
212          * our caller observed it.  May fail (-EEXIST) if there
213          * is already a page associated with this entry in the
214          * swap cache: added by a racing read_swap_cache_async,
215          * or by try_to_swap_out (or shmem_writepage) re-using
216          * the just freed swap entry for an existing page.
217          */
218         err = add_to_swap_cache(new_page, entry);
219         if (!err) {
220             /*
221              * Initiate read into locked page and return.
222              */
223             rw_swap_page(READ, new_page);
224             return new_page;
225         }
226     } while (err != -ENOENT);
227 
228     if (new_page)
229         page_cache_release(new_page);
230     return found_page;
231 }
189 Loop in case add_to_swap_cache() fails to add a page to the swap cache
196 First search the swap cache with find_get_page() (See Section J.1.4.1) to see if the page is already available. Ordinarily, lookup_swap_cache() (See Section K.2.4.1) would be called, but it updates statistics (such as the number of cache searches), so find_get_page() is called directly
203-207 If the page is not in the swap cache and a page has not been allocated yet, allocate one with alloc_page()
218 Add the newly allocated page to the swap cache with add_to_swap_cache() (See Section K.2.1.1)
223 Schedule the data to be read with rw_swap_page() (See Section K.3.3.1). The page will be returned locked and will be unlocked when IO completes
224 Return the new page
226 Loop until add_to_swap_cache() succeeds or another process successfully inserts the page into the swap cache
228-229 This is either the error path or another process added the page to the swap cache for us. If a new page was allocated, free it with page_cache_release() (See Section J.1.3.2)
230 Return either the page found in the swap cache or NULL on failure
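The structure of the retry loop is worth noting: look up first, allocate at most once, and on an -EEXIST race return to the lookup, which should then find the page inserted by the racing process. A toy user-space model of the loop, with all names invented and the IO, allocation and locking steps omitted:

```c
#include <assert.h>

/* Invented stand-ins for the errnos and cache state */
enum { T_ENOENT = -2, T_EEXIST = -17 };

#define T_SLOTS 4
static int t_ref[T_SLOTS]   = { 0, 1, 1 };  /* slot 0 freed, 1-2 valid */
static int t_cache[T_SLOTS] = { 0, 0, 1 };  /* slot 2 already cached */

static int t_lookup(int slot) { return t_cache[slot]; }

/* Model of add_to_swap_cache()'s two failure modes */
static int t_insert(int slot)
{
    if (!t_ref[slot])
        return T_ENOENT;        /* entry freed under us */
    if (t_cache[slot])
        return T_EEXIST;        /* lost a race with another inserter */
    t_cache[slot] = 1;
    return 0;
}

/* Returns 1 if a page (found or newly claimed) is available, 0 on
 * failure. Mirrors the do/while shape of read_swap_cache_async(). */
static int t_read_async(int slot)
{
    int err;
    do {
        if (t_lookup(slot))
            return 1;           /* someone else brought it in */
        err = t_insert(slot);   /* claim the slot ourselves */
        if (!err)
            return 1;           /* we own it; IO would be scheduled here */
    } while (err != T_ENOENT);
    return 0;                   /* the entry was freed under us */
}
```

The loop can only repeat on T_EEXIST, and in that case the next t_lookup() finds the racing inserter's page, so the loop terminates either way.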

K.3.2  Writing Backing Storage

K.3.2.1  Function: swap_writepage

Source: mm/swap_state.c

This is the function registered in swap_aops for writing out pages. Its operation is pretty simple. First it calls remove_exclusive_swap_page() to try and free the page. If the page was freed, it is unlocked here before returning as there is no IO pending on it. Otherwise rw_swap_page() is called to sync the page with backing storage.

 24 static int swap_writepage(struct page *page)
 25 {
 26     if (remove_exclusive_swap_page(page)) {
 27         UnlockPage(page);
 28         return 0;
 29     }
 30     rw_swap_page(WRITE, page);
 31     return 0;
 32 }
26-29 remove_exclusive_swap_page() (See Section K.3.2.2) will reclaim the page from the swap cache if possible. If the page is reclaimed, unlock it before returning
30 Otherwise the page is still in the swap cache, so synchronise it with backing storage by calling rw_swap_page() (See Section K.3.3.1)

K.3.2.2  Function: remove_exclusive_swap_page

Source: mm/swapfile.c

This function tries to work out whether other processes are sharing this page. If possible, the page is removed from the swap cache and freed. Once it is removed from the swap cache, swap_free() is called to decrement the swap_map count, indicating that the swap cache is no longer using the slot. The count will instead reflect the number of PTEs that contain a swp_entry_t for this slot.

287 int remove_exclusive_swap_page(struct page *page)
288 {
289     int retval;
290     struct swap_info_struct * p;
291     swp_entry_t entry;
292 
293     if (!PageLocked(page))
294         BUG();
295     if (!PageSwapCache(page))
296         return 0;
297     if (page_count(page) - !!page->buffers != 2) /* 2: us + cache */
298         return 0;
299 
300     entry.val = page->index;
301     p = swap_info_get(entry);
302     if (!p)
303         return 0;
304 
305     /* Is the only swap cache user the cache itself? */
306     retval = 0;
307     if (p->swap_map[SWP_OFFSET(entry)] == 1) {
308         /* Recheck the page count with the pagecache lock held.. */
309         spin_lock(&pagecache_lock);
310         if (page_count(page) - !!page->buffers == 2) {
311             __delete_from_swap_cache(page);
312             SetPageDirty(page);
313             retval = 1;
314         }
315         spin_unlock(&pagecache_lock);
316     }
317     swap_info_put(p);
318 
319     if (retval) {
320         block_flushpage(page, 0);
321         swap_free(entry);
322         page_cache_release(page);
323     }
324 
325     return retval;
326 }
293-294 This operation should only be made with the page locked
295-296 If the page is not in the swap cache, there is nothing to do
297-298 If there are other users of the page, it cannot be reclaimed, so return
300 The swp_entry_t for the page is stored in page->index as explained in Section 2.4
301 Get the swap_info_struct with swap_info_get() (See Section K.2.3.1)
307 If the only user of the swap slot is the swap cache itself (i.e. no process is mapping it), delete this page from the swap cache to free the slot. Later the swap slot usage count will be decremented as the swap cache is no longer using it
310 If the current user is the only user of this page, it is safe to remove it from the swap cache. If another process is sharing it, it must remain there
311 Delete the page from the swap cache
313 Set retval to 1 so that the caller knows the page was freed and so that swap_free() (See Section K.2.2.1) will be called to decrement the usage count in the swap_map
317 Drop the reference to the swap slot that was taken with swap_info_get() (See Section K.2.3.1)
320 The slot is being freed, so call block_flushpage() so that all IO completes and any buffers associated with the page are freed
321 Free the swap slot with swap_free()
322 Drop the reference to the page
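The reference count test at lines 297 and 310 relies on the C `!!` idiom: double negation collapses any non-NULL buffers pointer to exactly 1, so a page with attached buffers is allowed one extra reference. A minimal illustration (the function name is invented):

```c
#include <assert.h>

static int t_buf;   /* stands in for a page's buffer list */

/* "Exclusive" here means: one reference from the caller plus one from
 * the swap cache, with a buffers pointer contributing at most one more */
static int is_exclusive(int count, void *buffers)
{
    /* !!buffers is 0 for NULL and exactly 1 for any non-NULL pointer */
    return count - !!buffers == 2;
}
```

Without the normalisation, subtracting a raw pointer value would be meaningless; `!!` turns "has buffers at all" into a clean 0-or-1 adjustment.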

K.3.2.3  Function: free_swap_and_cache

Source: mm/swapfile.c

This function frees an entry from the swap cache and tries to reclaim the page. Note that this function only applies to the swap cache.

 
332 void free_swap_and_cache(swp_entry_t entry)
333 {
334     struct swap_info_struct * p;
335     struct page *page = NULL;
336 
337     p = swap_info_get(entry);
338     if (p) {
339         if (swap_entry_free(p, SWP_OFFSET(entry)) == 1)
340             page = find_trylock_page(&swapper_space, entry.val);
341         swap_info_put(p);
342     }
343     if (page) {
344         page_cache_get(page);
345         /* Only cache user (+us), or swap space full? Free it! */
346         if (page_count(page) - !!page->buffers == 2 || vm_swap_full()) {
347             delete_from_swap_cache(page);
348             SetPageDirty(page);
349         }
350         UnlockPage(page);
351         page_cache_release(page);
352     }
353 }
337 Get the swap_info_struct for the requested entry
338-342 Presuming the swap area information struct exists, call swap_entry_free() to free the swap entry. The page for the entry is then located in the swap cache using find_trylock_page(). Note that the page is returned locked
341 Drop the reference taken to the swap info struct at line 337
343-352 If the page was located, try to reclaim it
344 Take a reference to the page so it will not be freed prematurely
346-349 The page is deleted from the swap cache if there are no other processes mapping the page or if the swap area is more than 50% full (checked by vm_swap_full())
350 Unlock the page again
351 Drop the local reference to the page taken at line 344

K.3.3  Block IO

K.3.3.1  Function: rw_swap_page

Source: mm/page_io.c

This is the main function used for reading data from backing storage into a page or writing data from a page to backing storage. Which operation it performs depends on the first parameter rw. It is basically a wrapper function around the core function rw_swap_page_base(). It simply enforces that the operations are only performed on pages in the swap cache.

 85 void rw_swap_page(int rw, struct page *page)
 86 {
 87     swp_entry_t entry;
 88 
 89     entry.val = page->index;
 90 
 91     if (!PageLocked(page))
 92         PAGE_BUG(page);
 93     if (!PageSwapCache(page))
 94         PAGE_BUG(page);
 95     if (!rw_swap_page_base(rw, entry, page))
 96         UnlockPage(page);
 97 }
85 rw indicates whether a read or write is taking place
89 Get the swp_entry_t from the index field
91-92 If the page is not locked for IO, it is a bug
93-94 If the page is not in the swap cache, it is a bug
95 Call the core function rw_swap_page_base(). If it returns failure, the page is unlocked with UnlockPage() so it can be freed

K.3.3.2  Function: rw_swap_page_base

Source: mm/page_io.c

This is the core function for reading or writing data to the backing storage. Whether it is writing to a partition or a file, the block layer brw_page() function is used to perform the actual IO. This function sets up the necessary buffer information for the block layer to do its job. brw_page() performs asynchronous IO, so it is likely to return with the page still locked; the page will be unlocked when the IO completes.

 36 static int rw_swap_page_base(int rw, swp_entry_t entry, 
                                 struct page *page)
 37 {
 38     unsigned long offset;
 39     int zones[PAGE_SIZE/512];
 40     int zones_used;
 41     kdev_t dev = 0;
 42     int block_size;
 43     struct inode *swapf = 0;
 44 
 45     if (rw == READ) {
 46         ClearPageUptodate(page);
 47         kstat.pswpin++;
 48     } else
 49         kstat.pswpout++;
 50 
36 The parameters are:
rw indicates whether the operation is a read or a write
entry is the swap entry for locating the data in backing storage
page is the page that is being read or written to
39 zones is a parameter required by the block layer for brw_page(). It is expected to contain an array of block numbers that are to be written to. This is primarily important when the backing storage is a file rather than a partition
45-47 If the page is to be read from disk, clear the Uptodate flag as the page is obviously not up to date if the information is being read from disk. Increment the pages-swapped-in (pswpin) statistic
49 Otherwise just update the pages-swapped-out (pswpout) statistic
 51     get_swaphandle_info(entry, &offset, &dev, &swapf);
 52     if (dev) {
 53         zones[0] = offset;
 54         zones_used = 1;
 55         block_size = PAGE_SIZE;
 56     } else if (swapf) {
 57         int i, j;
 58         unsigned int block = 
 59              offset << (PAGE_SHIFT - swapf->i_sb->s_blocksize_bits);
 60 
 61         block_size = swapf->i_sb->s_blocksize;
 62         for (i=0, j=0; j< PAGE_SIZE ; i++, j += block_size)
 63             if (!(zones[i] = bmap(swapf,block++))) {
 64                 printk("rw_swap_page: bad swap file\n");
 65                 return 0;
 66             }
 67         zones_used = i;
 68         dev = swapf->i_dev;
 69     } else {
 70         return 0;
 71     }
 72 
 73     /* block_size == PAGE_SIZE/zones_used */
 74     brw_page(rw, page, dev, zones, block_size);
 75     return 1;
 76 }
51 get_swaphandle_info() (See Section K.3.3.3) returns either the kdev_t or struct inode that represents the swap area, whichever is appropriate
52-55 If the storage area is a partition, there is only one "block" to be written, which is the size of a page. Hence, zones only has one entry, which is the offset within the partition to be written, and block_size is PAGE_SIZE
56 Otherwise it is a swap file, so each of the blocks in the file that make up the page has to be mapped with bmap() before calling brw_page()
58-59 Calculate what the starting block is
61 The size of an individual block is stored in the superblock information for the filesystem the file resides on
62-66 Call bmap() for every block that makes up the full page. Each block is stored in the zones array for passing to brw_page(). If any block fails to be mapped, 0 is returned
67 Record how many blocks make up the page in zones_used
68 Record which device is being written to
74 Call brw_page() from the block layer to schedule the IO. This function returns immediately as the IO is asynchronous. When the IO is completed, a callback function (end_buffer_io_async()) is called which unlocks the page. Any process waiting on the page will be woken up at that point
75 Return success
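The arithmetic for the swap-file case can be checked in isolation: the first filesystem block of the slot is the offset shifted left by the difference between the page shift and the block-size shift, and PAGE_SIZE / block_size blocks make up one page. A sketch assuming 4KiB pages (the page size is an assumption; function names are invented):

```c
#include <assert.h>

#define T_PAGE_SHIFT 12               /* assumes 4KiB pages */
#define T_PAGE_SIZE  (1UL << T_PAGE_SHIFT)

/* First filesystem block of the page-sized slot at `offset`,
 * mirroring: offset << (PAGE_SHIFT - s_blocksize_bits) */
static unsigned long first_block(unsigned long offset, int blocksize_bits)
{
    return offset << (T_PAGE_SHIFT - blocksize_bits);
}

/* How many filesystem blocks make up one page */
static unsigned long blocks_per_page(int blocksize_bits)
{
    return T_PAGE_SIZE >> blocksize_bits;
}
```

With 1KiB blocks (blocksize_bits of 10), slot 3 starts at block 12 and four consecutive block numbers must be mapped with bmap(); with 4KiB blocks the mapping is one-to-one.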

K.3.3.3  Function: get_swaphandle_info

Source: mm/swapfile.c

This function is responsible for returning either the kdev_t or struct inode that is managing the swap area that entry belongs to.

1197 void get_swaphandle_info(swp_entry_t entry, unsigned long *offset, 
1198                         kdev_t *dev, struct inode **swapf)
1199 {
1200     unsigned long type;
1201     struct swap_info_struct *p;
1202 
1203     type = SWP_TYPE(entry);
1204     if (type >= nr_swapfiles) {
1205         printk(KERN_ERR "rw_swap_page: %s%08lx\n", Bad_file, 
                                                        entry.val);
1206         return;
1207     }
1208 
1209     p = &swap_info[type];
1210     *offset = SWP_OFFSET(entry);
1211     if (*offset >= p->max && *offset != 0) {
1212         printk(KERN_ERR "rw_swap_page: %s%08lx\n", Bad_offset, 
                                                        entry.val);
1213         return;
1214     }
1215     if (p->swap_map && !p->swap_map[*offset]) {
1216         printk(KERN_ERR "rw_swap_page: %s%08lx\n", Unused_offset, 
                                                        entry.val);
1217         return;
1218     }
1219     if (!(p->flags & SWP_USED)) {
1220         printk(KERN_ERR "rw_swap_page: %s%08lx\n", Unused_file, 
                                                        entry.val);
1221         return;
1222     }
1223 
1224     if (p->swap_device) {
1225         *dev = p->swap_device;
1226     } else if (p->swap_file) {
1227         *swapf = p->swap_file->d_inode;
1228     } else {
1229         printk(KERN_ERR "rw_swap_page: no swap file or device\n");
1230     }
1231     return;
1232 }
1203Extract which area within swap_info this entry belongs to
1204-1206If the index is for an area that does not exist, print an error message and return. Bad_file is a static array declared near the top of mm/swapfile.c that says “Bad swap file entry”
1209Get the swap_info_struct from swap_info
1210Extract the offset within the swap area for this entry
1211-1214Make sure the offset is not after the end of the file. Print out the message in Bad_offset if it is
1215-1218If the offset is currently not being used, it means that entry is a stale entry so print out the error message in Unused_offset
1219-1222If the swap area is currently not active, print out the error message in Unused_file
1224If the swap area is a device, return the kdev_t stored in swap_info_struct→swap_device
1226-1227If it is a swap file, return the struct inode which is available via swap_info_struct→swap_file→d_inode
1229Else there is no swap file or device for this entry so print out the error message and return
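The swp_entry_t handled above packs the area type and the slot offset into a single unsigned long. A minimal user-space sketch of the i386-style encoding (the exact bit layout is architecture-specific; this mirrors include/asm-i386) illustrates why SWP_TYPE() and SWP_OFFSET() can recover both fields:

```c
#include <assert.h>

/* Sketch of the i386-style swap entry encoding in 2.4: bit 0 is left
 * clear (the hardware "present" bit), bits 1-6 hold the swap area type
 * and bits 8 and up hold the slot offset within that area. */
typedef struct { unsigned long val; } swp_entry_t;

static swp_entry_t SWP_ENTRY(unsigned long type, unsigned long offset)
{
    swp_entry_t e;
    e.val = (type << 1) | (offset << 8);
    return e;
}

static unsigned long SWP_TYPE(swp_entry_t e)   { return (e.val >> 1) & 0x3f; }
static unsigned long SWP_OFFSET(swp_entry_t e) { return e.val >> 8; }
```

Because bit 0 stays clear, a swap entry stored in a PTE can never be mistaken for a present page by the hardware.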

K.4  Activating a Swap Area

K.4.1  Function: sys_swapon

Source: mm/swapfile.c

This quite large function is responsible for activating a swap area. Broadly speaking, it finds a free swap_info_struct, validates and reads the swap header from the file or device, builds the swap_map and inserts the new area into the swap list.

855 asmlinkage long sys_swapon(const char * specialfile, 
                               int swap_flags)
856 {
857       struct swap_info_struct * p;
858       struct nameidata nd;
859       struct inode * swap_inode;
860       unsigned int type;
861       int i, j, prev;
862       int error;
863       static int least_priority = 0;
864       union swap_header *swap_header = 0;
865       int swap_header_version;
866       int nr_good_pages = 0;
867       unsigned long maxpages = 1;
868       int swapfilesize;
869       struct block_device *bdev = NULL;
870       unsigned short *swap_map;
871       
872       if (!capable(CAP_SYS_ADMIN))
873         return -EPERM;
874       lock_kernel();
875       swap_list_lock();
876       p = swap_info;
855The two parameters are the path to the swap area and the flags for activation
872-873The activating process must have the CAP_SYS_ADMIN capability or be the superuser to activate a swap area
874Acquire the Big Kernel Lock
875Lock the list of swap areas
876Get the first swap area in the swap_info array
877       for (type = 0 ; type < nr_swapfiles ; type++,p++)
878         if (!(p->flags & SWP_USED))
879           break;
880       error = -EPERM;
881       if (type >= MAX_SWAPFILES) {
882         swap_list_unlock();
883         goto out;
884       }
885       if (type >= nr_swapfiles)
886         nr_swapfiles = type+1;
887       p->flags = SWP_USED;
888       p->swap_file = NULL;
889       p->swap_vfsmnt = NULL;
890       p->swap_device = 0;
891       p->swap_map = NULL;
892       p->lowest_bit = 0;
893       p->highest_bit = 0;
894       p->cluster_nr = 0;
895       p->sdev_lock = SPIN_LOCK_UNLOCKED;
896       p->next = -1;
897       if (swap_flags & SWAP_FLAG_PREFER) {
898         p->prio =
899           (swap_flags & SWAP_FLAG_PRIO_MASK)>>SWAP_FLAG_PRIO_SHIFT;
900       } else {
901         p->prio = --least_priority;
902       }
903       swap_list_unlock();

Find a free swap_info_struct and initialise it with default values

877-879Cycle through the swap_info until a struct is found that is not in use
880By default the error returned is Permission Denied which indicates the caller did not have the proper permissions or too many swap areas are already in use
881If no struct was free, MAX_SWAPFILES areas have already been activated, so unlock the swap list and return
885-886If the selected swap area is after the last known active area (nr_swapfiles), then update nr_swapfiles
887Set the flag indicating the area is in use
888-896Initialise fields to default values
897-902If the caller has specified a priority, use it; otherwise set it to least_priority and decrement it. This way, the swap areas will be prioritised in order of activation
903Release the swap list lock
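The priority logic at lines 897-902 can be sketched in isolation. The SWAP_FLAG_* values below match those in include/linux/swap.h; choose_prio() is a hypothetical helper for illustration:

```c
#include <assert.h>

#define SWAP_FLAG_PREFER     0x8000   /* set if the caller specified a priority */
#define SWAP_FLAG_PRIO_MASK  0x7fff
#define SWAP_FLAG_PRIO_SHIFT 0

/* Mimics the priority selection at lines 897-902: a caller-supplied
 * priority is extracted from the flags; otherwise each new area gets a
 * progressively more negative default priority. */
static int least_priority = 0;

static int choose_prio(int swap_flags)
{
    if (swap_flags & SWAP_FLAG_PREFER)
        return (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT;
    return --least_priority;
}
```

Each default-priority area therefore sorts behind all earlier-activated ones, while explicitly prioritised areas always sort at non-negative positions.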
904       error = user_path_walk(specialfile, &nd);
905       if (error)
906         goto bad_swap_2;
907 
908       p->swap_file = nd.dentry;
909       p->swap_vfsmnt = nd.mnt;
910       swap_inode = nd.dentry->d_inode;
911       error = -EINVAL;
912 

Traverse the VFS and get some information about the special file

904user_path_walk() traverses the directory structure to obtain a nameidata structure describing the specialfile
905-906If it failed, return failure
908Fill in the swap_file field with the returned dentry
909Similarly, fill in the swap_vfsmnt
910Record the inode of the special file
911Now the default error is -EINVAL indicating that the special file was found but it was not a block device or a regular file
913       if (S_ISBLK(swap_inode->i_mode)) {
914         kdev_t dev = swap_inode->i_rdev;
915         struct block_device_operations *bdops;
916         devfs_handle_t de;
917 
918         p->swap_device = dev;
919         set_blocksize(dev, PAGE_SIZE);
920         
921         bd_acquire(swap_inode);
922         bdev = swap_inode->i_bdev;
923         de = devfs_get_handle_from_inode(swap_inode);
924         bdops = devfs_get_ops(de);
925         if (bdops) bdev->bd_op = bdops;
926 
927         error = blkdev_get(bdev, FMODE_READ|FMODE_WRITE, 0,
                   BDEV_SWAP);
928         devfs_put_ops(de);/* Decrement module use count 
                               * now we're safe*/
929         if (error)
930           goto bad_swap_2;
931         set_blocksize(dev, PAGE_SIZE);
932         error = -ENODEV;
933         if (!dev || (blk_size[MAJOR(dev)] &&
934          !blk_size[MAJOR(dev)][MINOR(dev)]))
935           goto bad_swap;
936         swapfilesize = 0;
937         if (blk_size[MAJOR(dev)])
938           swapfilesize = blk_size[MAJOR(dev)][MINOR(dev)]
939             >> (PAGE_SHIFT - 10);
940       } else if (S_ISREG(swap_inode->i_mode))
941         swapfilesize = swap_inode->i_size >> PAGE_SHIFT;
942       else
943         goto bad_swap;

If a partition, configure the block device before calculating the size of the area, else obtain it from the inode for the file.

913Check if the special file is a block device
914-939This code segment handles the case where the swap area is a partition
914Get the device number (kdev_t) of the block device from the inode
918Store a pointer to the device structure describing the special file which will be needed for block IO operations
919Set the block size on the device to PAGE_SIZE as swap works in page-sized chunks
921The bd_acquire() function increments the usage count for this block device
922Get a pointer to the block_device structure which is a descriptor for the device file which is needed to open it
923Get a devfs handle if it is enabled. devfs is beyond the scope of this book
924-925Increment the usage count of this device entry
927Open the block device in read/write mode and set the BDEV_SWAP flag which is an enumerated type but is ignored when do_open() is called
928Decrement the use count of the devfs entry
929-930If an error occurred on open, return failure
931Set the block size again
932After this point, the default error is to indicate no device could be found
933-935Ensure the returned device is ok
937-939Calculate the size of the swap file as the number of page sized chunks that exist in the block device as indicated by blk_size. The size of the swap area is calculated to make sure the information in the swap area is sane
941If the swap area is a regular file, obtain the size directly from the inode and calculate how many page sized chunks exist
943If the file is not a block device or regular file, return error
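The two size calculations reduce to simple shifts. This sketch (with hypothetical helper names, assuming i386's 4KiB pages) shows the KiB-to-pages conversion used for block devices and the bytes-to-pages conversion used for regular files:

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* 4KiB pages, as on i386 */

/* blk_size[][] records device sizes in KiB, so shifting right by
 * (PAGE_SHIFT - 10) converts kibibytes to PAGE_SIZE-sized chunks,
 * just as lines 937-939 do. */
static unsigned long device_pages(unsigned long size_kib)
{
    return size_kib >> (PAGE_SHIFT - 10);
}

/* A regular file's i_size is in bytes, so it is shifted by the full
 * PAGE_SHIFT instead (line 941). */
static unsigned long file_pages(unsigned long long i_size_bytes)
{
    return (unsigned long)(i_size_bytes >> PAGE_SHIFT);
}
```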
945       error = -EBUSY;
946       for (i = 0 ; i < nr_swapfiles ; i++) {
947         struct swap_info_struct *q = &swap_info[i];
948         if (i == type || !q->swap_file)
949           continue;
950         if (swap_inode->i_mapping ==
             q->swap_file->d_inode->i_mapping)
951           goto bad_swap;
952       }
953 
954       swap_header = (void *) __get_free_page(GFP_USER);
955       if (!swap_header) {
956         printk("Unable to start swapping: out of memory :-)\n");
957         error = -ENOMEM;
958         goto bad_swap;
959       }
960 
961       lock_page(virt_to_page(swap_header));
962       rw_swap_page_nolock(READ, SWP_ENTRY(type,0), 
            (char *) swap_header);
963 
964       if (!memcmp("SWAP-SPACE",swap_header->magic.magic,10))
965         swap_header_version = 1;
966       else if (!memcmp("SWAPSPACE2",swap_header->magic.magic,10))
967         swap_header_version = 2;
968       else {
969         printk("Unable to find swap-space signature\n");
970         error = -EINVAL;
971         goto bad_swap;
972       }
945The next check makes sure the area is not already active. If it is, the error -EBUSY will be returned
946-952Read through the whole swap_info array and ensure the area to be activated is not already active
954-959Allocate a page for reading the swap area information from disk
961The function lock_page() locks a page and makes sure it is synced with disk if it is file backed. In this case, it'll just mark the page as locked which is required for the rw_swap_page_nolock() function
962Read the first page slot in the swap area into swap_header
964-972Check the swap area signature and set the swap_header_version variable accordingly. If the swap area could not be identified, return -EINVAL
974       switch (swap_header_version) {
975       case 1:
976         memset(((char *) swap_header)+PAGE_SIZE-10,0,10);
977         j = 0;
978         p->lowest_bit = 0;
979         p->highest_bit = 0;
980         for (i = 1 ; i < 8*PAGE_SIZE ; i++) {
981           if (test_bit(i,(char *) swap_header)) {
982             if (!p->lowest_bit)
983                   p->lowest_bit = i;
984             p->highest_bit = i;
985             maxpages = i+1;
986             j++;
987           }
988         }
989         nr_good_pages = j;
990         p->swap_map = vmalloc(maxpages * sizeof(short));
991         if (!p->swap_map) {
992           error = -ENOMEM;        
993           goto bad_swap;
994         }
995         for (i = 1 ; i < maxpages ; i++) {
996           if (test_bit(i,(char *) swap_header))
997             p->swap_map[i] = 0;
998           else
999             p->swap_map[i] = SWAP_MAP_BAD;
1000         }
1001         break;
1002 

Read in the information needed to populate the swap_map when the swap area is version 1.

976Zero out the magic string identifying the version of the swap area
978-979Initialise fields in swap_info_struct to 0
980-988A bitmap with 8*PAGE_SIZE entries is stored in the swap area. The full page, minus 10 bits for the magic string, is used to describe the swap map limiting swap areas to just under 128MiB in size. If the bit is set to 1, there is a slot on disk available. This pass will calculate how many slots are available so a swap_map may be allocated
981Test if the bit for this slot is set
982-983If the lowest_bit field is not yet set, set it to this slot. In most cases, lowest_bit will be initialised to 1
984As long as new slots are found, keep updating the highest_bit
985Update maxpages to cover the highest slot found so far
986j is the count of good pages in the area
990Allocate memory for the swap_map with vmalloc()
991-994If memory could not be allocated, return -ENOMEM
995-1000For each slot, check if the slot is “good”. If yes, initialise the slot count to 0, else set it to SWAP_MAP_BAD so it will not be used
1001Exit the switch statement
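The version-1 scan above is a single pass over a plain bitmap. A user-space sketch (hypothetical names, slot 0 reserved for the header as in the kernel) shows how lowest_bit, highest_bit, maxpages and the good-page count all fall out of one loop:

```c
#include <assert.h>

/* One pass over a version-1 style bitmap: slot i is usable if bit i
 * is set.  Mirrors lines 980-988. */
struct scan_result {
    unsigned long lowest_bit, highest_bit, maxpages, good;
};

static int bit_set(const unsigned char *map, unsigned long i)
{
    return (map[i / 8] >> (i % 8)) & 1;
}

static struct scan_result scan_bitmap(const unsigned char *map,
                                      unsigned long nbits)
{
    struct scan_result r = { 0, 0, 1, 0 };
    unsigned long i;

    for (i = 1; i < nbits; i++) {     /* slot 0 holds the header */
        if (bit_set(map, i)) {
            if (!r.lowest_bit)
                r.lowest_bit = i;     /* first usable slot */
            r.highest_bit = i;        /* keeps tracking the last one */
            r.maxpages = i + 1;
            r.good++;
        }
    }
    return r;
}
```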
1003       case 2:
1006         if (swap_header->info.version != 1) {
1007           printk(KERN_WARNING
1008            "Unable to handle swap header version %d\n",
1009            swap_header->info.version);
1010           error = -EINVAL;
1011           goto bad_swap;
1012         }
1013 
1014         p->lowest_bit  = 1;
1015         maxpages = SWP_OFFSET(SWP_ENTRY(0,~0UL)) - 1;
1016         if (maxpages > swap_header->info.last_page)
1017           maxpages = swap_header->info.last_page;
1018         p->highest_bit = maxpages - 1;
1019 
1020         error = -EINVAL;
1021         if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES)
1022           goto bad_swap;
1023         
1025         if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {
1026           error = -ENOMEM;
1027           goto bad_swap;
1028         }
1029 
1030         error = 0;
1031         memset(p->swap_map, 0, maxpages * sizeof(short));
1032         for (i=0; i<swap_header->info.nr_badpages; i++) {
1033           int page = swap_header->info.badpages[i];
1034           if (page <= 0 || 
             page >= swap_header->info.last_page)
1035             error = -EINVAL;
1036           else
1037             p->swap_map[page] = SWAP_MAP_BAD;
1038         }
1039         nr_good_pages = swap_header->info.last_page -
1040             swap_header->info.nr_badpages -
1041             1 /* header page */;
1042         if (error) 
1043           goto bad_swap;
1044       }

Read the header information when the file format is version 2

1006-1012Make absolutely sure we can handle this swap file format and return -EINVAL if we cannot. Remember that with this version, the swap_header struct is placed nicely on disk
1014Initialise lowest_bit to the known lowest available slot
1015-1017Calculate maxpages initially as the maximum possible size of a swap_map and then set it to the size indicated by the information on disk. This ensures the swap_map array is not accidentally overrun
1018Initialise highest_bit
1020-1022Make sure the number of bad pages that exist does not exceed MAX_SWAP_BADPAGES
1025-1028Allocate memory for the swap_map with vmalloc()
1031Initialise the full swap_map to 0 indicating all slots are available
1032-1038Using the information loaded from disk, set each slot that is unusable to SWAP_MAP_BAD
1039-1041Calculate the number of available good pages
1042-1043Return if an error occurred
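The clamp at lines 1015-1017 simply takes the smaller of the largest encodable offset and the size recorded on disk. A sketch assuming the i386-style entry encoding (offset in bits 8 and up; helper names are hypothetical):

```c
#include <assert.h>

static unsigned long swp_offset(unsigned long val) { return val >> 8; }

static unsigned long swp_entry(unsigned long type, unsigned long offset)
{
    return (type << 1) | (offset << 8);
}

/* Mirrors lines 1015-1017: start from the largest offset a
 * swp_entry_t can encode, then defer to the on-disk last_page if
 * it is smaller. */
static unsigned long clamp_maxpages(unsigned long last_page)
{
    unsigned long maxpages = swp_offset(swp_entry(0, ~0UL)) - 1;
    if (maxpages > last_page)
        maxpages = last_page;
    return maxpages;
}
```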
1045       
1046       if (swapfilesize && maxpages > swapfilesize) {
1047         printk(KERN_WARNING
1048          "Swap area shorter than signature indicates\n");
1049         error = -EINVAL;
1050         goto bad_swap;
1051       }
1052       if (!nr_good_pages) {
1053         printk(KERN_WARNING "Empty swap-file\n");
1054         error = -EINVAL;
1055         goto bad_swap;
1056       }
1057       p->swap_map[0] = SWAP_MAP_BAD;
1058       swap_list_lock();
1059       swap_device_lock(p);
1060       p->max = maxpages;
1061       p->flags = SWP_WRITEOK;
1062       p->pages = nr_good_pages;
1063       nr_swap_pages += nr_good_pages;
1064       total_swap_pages += nr_good_pages;
1065       printk(KERN_INFO "Adding Swap: 
                             %dk swap-space (priority %d)\n",
1066        nr_good_pages<<(PAGE_SHIFT-10), p->prio);
1046-1051Ensure the information loaded from disk matches the actual dimensions of the swap area. If they do not match, print a warning and return an error
1052-1056If no good pages were available, return an error
1057Make sure the first page in the map containing the swap header information is not used. If it was, the header information would be overwritten the first time this area was used
1058-1059Lock the swap list and the swap device
1060-1062Fill in the remaining fields in the swap_info_struct
1063-1064Update global statistics for the number of available swap pages (nr_swap_pages) and the total number of swap pages (total_swap_pages)
1065-1066Print an informational message about the swap activation
1068       /* insert swap space into swap_list: */
1069       prev = -1;
1070       for (i = swap_list.head; i >= 0; i = swap_info[i].next) {
1071         if (p->prio >= swap_info[i].prio) {
1072           break;
1073         }
1074         prev = i;
1075       }
1076       p->next = i;
1077       if (prev < 0) {
1078         swap_list.head = swap_list.next = p - swap_info;
1079       } else {
1080         swap_info[prev].next = p - swap_info;
1081       }
1082       swap_device_unlock(p);
1083       swap_list_unlock();
1084       error = 0;
1085       goto out;
1070-1080Insert the new swap area into the correct slot in the swap list based on priority
1082Unlock the swap device
1083Unlock the swap list
1084-1085Return success
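The swap list is a singly linked list threaded through array indices rather than pointers, with -1 as the terminator. This sketch of the insertion loop (hypothetical fixed-size array standing in for swap_info) keeps the list sorted by descending priority, as lines 1069-1081 do:

```c
#include <assert.h>

#define MAX_AREAS 8

struct area { int prio; int next; };

static struct area areas[MAX_AREAS];
static int head = -1;

/* Walk the list until an area of equal or lower priority is found,
 * then splice the new index in before it.  prev < 0 means the new
 * area becomes the list head. */
static void insert_by_prio(int idx)
{
    int i, prev = -1;

    for (i = head; i >= 0; i = areas[i].next) {
        if (areas[idx].prio >= areas[i].prio)
            break;
        prev = i;
    }
    areas[idx].next = i;
    if (prev < 0)
        head = idx;
    else
        areas[prev].next = idx;
}
```

Because equal priorities break on `>=`, a newly added area is placed before existing areas of the same priority, which is all get_swap_page() needs to rotate fairly among them.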
1086 bad_swap:
1087       if (bdev)
1088         blkdev_put(bdev, BDEV_SWAP);
1089 bad_swap_2:
1090       swap_list_lock();
1091       swap_map = p->swap_map;
1092       nd.mnt = p->swap_vfsmnt;
1093       nd.dentry = p->swap_file;
1094       p->swap_device = 0;
1095       p->swap_file = NULL;
1096       p->swap_vfsmnt = NULL;
1097       p->swap_map = NULL;
1098       p->flags = 0;
1099       if (!(swap_flags & SWAP_FLAG_PREFER))
1100         ++least_priority;
1101       swap_list_unlock();
1102       if (swap_map)
1103         vfree(swap_map);
1104       path_release(&nd);
1105 out:
1106       if (swap_header)
1107         free_page((long) swap_header);
1108       unlock_kernel();
1109       return error;
1110 }
1087-1088Drop the reference to the block device
1090-1104This is the error path where the swap list needs to be locked while the slot in swap_info is reset to being unused; the memory allocated for swap_map is then freed if it was assigned
1104Drop the reference to the special file
1106-1107Release the page containing the swap header information as it is no longer needed
1108Drop the Big Kernel Lock
1109Return the error or success value

K.4.2  Function: swap_setup

Source: mm/swap.c

This function is called during the initialisation of kswapd to set the size of page_cluster. This variable determines how many pages are read ahead from files and from backing storage when paging in data.

100 void __init swap_setup(void)
101 {
102     unsigned long megs = num_physpages >> (20 - PAGE_SHIFT);
103 
104     /* Use a smaller cluster for small-memory machines */
105     if (megs < 16)
106         page_cluster = 2;
107     else
108         page_cluster = 3;
109     /*
110      * Right now other parts of the system means that we
111      * _really_ don't want to cluster much more
112      */
113 }
102Calculate how much memory the system has in megabytes
105In low-memory systems, set page_cluster to 2 which means that, at most, 4 pages will be paged in from disk during readahead
108Else read ahead 8 pages
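Note that page_cluster is an exponent, not a page count: the readahead code operates on 1 << page_cluster pages at a time. A sketch of the sizing logic with hypothetical helper names:

```c
#include <assert.h>

/* Mirrors swap_setup(): small-memory machines get a smaller cluster. */
static int pick_page_cluster(unsigned long megs)
{
    return megs < 16 ? 2 : 3;
}

/* page_cluster is used as a shift, so the actual number of pages
 * read ahead is a power of two. */
static unsigned long readahead_pages(int page_cluster)
{
    return 1UL << page_cluster;
}
```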

K.5  Deactivating a Swap Area

K.5.1  Function: sys_swapoff

Source: mm/swapfile.c

This function is principally concerned with updating the swap_info_struct and the swap lists. The main task of paging in all pages in the area is the responsibility of try_to_unuse().

720 asmlinkage long sys_swapoff(const char * specialfile)
721 {
722     struct swap_info_struct * p = NULL;
723     unsigned short *swap_map;
724     struct nameidata nd;
725     int i, type, prev;
726     int err;
727     
728     if (!capable(CAP_SYS_ADMIN))
729         return -EPERM;
730 
731     err = user_path_walk(specialfile, &nd);
732     if (err)
733         goto out;
734 
728-729Only the superuser or a process with CAP_SYS_ADMIN capabilities may deactivate an area
731-732Acquire information about the special file representing the swap area with user_path_walk(). Goto out if an error occurred
735     lock_kernel();
736     prev = -1;
737     swap_list_lock();
738     for (type = swap_list.head; type >= 0; 
         type = swap_info[type].next) {
739         p = swap_info + type;
740         if ((p->flags & SWP_WRITEOK) == SWP_WRITEOK) {
741             if (p->swap_file == nd.dentry)
742               break;
743         }
744         prev = type;
745     }
746     err = -EINVAL;
747     if (type < 0) {
748         swap_list_unlock();
749         goto out_dput;
750     }
751 
752     if (prev < 0) {
753         swap_list.head = p->next;
754     } else {
755         swap_info[prev].next = p->next;
756     }
757     if (type == swap_list.next) {
758         /* just pick something that's safe... */
759         swap_list.next = swap_list.head;
760     }
761     nr_swap_pages -= p->pages;
762     total_swap_pages -= p->pages;
763     p->flags = SWP_USED;

Acquire the BKL, find the swap_info_struct for the area to be deactivated and remove it from the swap list.

735Acquire the BKL
737Lock the swap list
738-745Traverse the swap list and find the swap_info_struct for the requested area. Use the dentry to identify the area
747-750If the struct could not be found, unlock the swap list and return
752-760Remove the area from the swap list, updating swap_list.head and swap_list.next if necessary
761Update the total number of free swap slots
762Update the total number of existing swap slots
763Mark the area as in use (SWP_USED) but clear SWP_WRITEOK so no further slots may be allocated from it
764     swap_list_unlock();
765     unlock_kernel();
766     err = try_to_unuse(type);
764Unlock the swap list
765Release the BKL
766Page in all pages from this swap area
767     lock_kernel();
768     if (err) {
769         /* re-insert swap space back into swap_list */
770         swap_list_lock();
771         for (prev = -1, i = swap_list.head; 
                 i >= 0; 
                 prev = i, i = swap_info[i].next)
772             if (p->prio >= swap_info[i].prio)
773                 break;
774         p->next = i;
775         if (prev < 0)
776             swap_list.head = swap_list.next = p - swap_info;
777         else
778             swap_info[prev].next = p - swap_info;
779         nr_swap_pages += p->pages;
780         total_swap_pages += p->pages;
781         p->flags = SWP_WRITEOK;
782         swap_list_unlock();
783         goto out_dput;
784     }

Acquire the BKL. If we failed to page in all pages, then reinsert the area into the swap list

767Acquire the BKL
770Lock the swap list
771-778Reinsert the area into the swap list. The position it is inserted at depends on the swap area priority
779-780Update the global statistics
781Mark the area as safe to write to again
782-783Unlock the swap list and return
785     if (p->swap_device)
786         blkdev_put(p->swap_file->d_inode->i_bdev, BDEV_SWAP);
787     path_release(&nd);
788 
789     swap_list_lock();
790     swap_device_lock(p);
791     nd.mnt = p->swap_vfsmnt;
792     nd.dentry = p->swap_file;
793     p->swap_vfsmnt = NULL;
794     p->swap_file = NULL;
795     p->swap_device = 0;
796     p->max = 0;
797     swap_map = p->swap_map;
798     p->swap_map = NULL;
799     p->flags = 0;
800     swap_device_unlock(p);
801     swap_list_unlock();
802     vfree(swap_map);
803     err = 0;
804 
805 out_dput:
806     unlock_kernel();
807     path_release(&nd);
808 out:
809     return err;
810 }

Else the swap area was successfully deactivated, so close the block device and mark the swap_info_struct free

785-786Close the block device
787Release the path information
789-790Acquire the swap list and swap device lock
791-799Reset the fields in swap_info_struct to default values
800-801Release the swap device and swap list locks
802Free the memory used for the swap_map
806Release the BKL
807Release the path information in the event we reached here via the error path
809Return success or failure

K.5.2  Function: try_to_unuse

Source: mm/swapfile.c

This function is heavily commented in the source code, although parts of the commentary are speculative or slightly inaccurate. The comments are omitted here for brevity.

513 static int try_to_unuse(unsigned int type)
514 {
515     struct swap_info_struct * si = &swap_info[type];
516     struct mm_struct *start_mm;
517     unsigned short *swap_map;
518     unsigned short swcount;
519     struct page *page;
520     swp_entry_t entry;
521     int i = 0;
522     int retval = 0;
523     int reset_overflow = 0;
525 
540     start_mm = &init_mm;
541     atomic_inc(&init_mm.mm_users);
542 
540-541The starting mm_struct to page in pages for is init_mm. The count is incremented even though this particular struct will not disappear to prevent having to write special cases in the remainder of the function
556     while ((i = find_next_to_unuse(si, i))) {
557         /* 
558          * Get a page for the entry, using the existing swap
559          * cache page if there is one.  Otherwise, get a clean
560          * page and read the swap into it. 
561          */
562         swap_map = &si->swap_map[i];
563         entry = SWP_ENTRY(type, i);
564         page = read_swap_cache_async(entry);
565         if (!page) {
572             if (!*swap_map)
573                 continue;
574             retval = -ENOMEM;
575             break;
576         }
577 
578         /*
579          * Don't hold on to start_mm if it looks like exiting.
580          */
581         if (atomic_read(&start_mm->mm_users) == 1) {
582             mmput(start_mm);
583             start_mm = &init_mm;
584             atomic_inc(&init_mm.mm_users);
585         }
556This is the beginning of the major loop in this function. Starting from the beginning of the swap_map, it searches for the next entry to be freed with find_next_to_unuse() until all swap map entries have been paged in
562-564Get the swp_entry_t and call read_swap_cache_async() (See Section K.3.1.1) to find the page in the swap cache or have a new page allocated for reading in from the disk
565-576If we failed to get the page, it means the slot has already been freed independently by another process or thread (process could be exiting elsewhere) or we are out of memory. If independently freed, we continue to the next map, else we return -ENOMEM
581Check to make sure this mm is not exiting. If it is, decrement its count and go back to init_mm
587         /*
588          * Wait for and lock page.  When do_swap_page races with
589          * try_to_unuse, do_swap_page can handle the fault much
590          * faster than try_to_unuse can locate the entry.  This
591          * apparently redundant "wait_on_page" lets try_to_unuse
592          * defer to do_swap_page in such a case - in some tests,
593          * do_swap_page and try_to_unuse repeatedly compete.
594          */
595         wait_on_page(page);
596         lock_page(page);
597 
598         /*
599          * Remove all references to entry, without blocking.
600          * Whenever we reach init_mm, there's no address space
601          * to search, but use it as a reminder to search shmem.
602          */
603         shmem = 0;
604         swcount = *swap_map;
605         if (swcount > 1) {
606             flush_page_to_ram(page);
607             if (start_mm == &init_mm)
608                 shmem = shmem_unuse(entry, page);
609             else
610                 unuse_process(start_mm, entry, page);
611         }
595Wait on the page to complete IO. Once it returns, we know for a fact the page exists in memory with the same information as that on disk
596Lock the page
604Get the swap map reference count
605If the count is positive then...
606As the page is about to be inserted into process page tables, it must be flushed from the D-Cache or the process may not “see” changes made to the page by the kernel
607-608If we are using the init_mm, call shmem_unuse() (See Section L.6.2) which will free the page from any shared memory regions that are in use
610Else update the PTE in the current mm which references this page
612         if (*swap_map > 1) {
613             int set_start_mm = (*swap_map >= swcount);
614             struct list_head *p = &start_mm->mmlist;
615             struct mm_struct *new_start_mm = start_mm;
616             struct mm_struct *mm;
617 
618             spin_lock(&mmlist_lock);
619             while (*swap_map > 1 &&
620                 (p = p->next) != &start_mm->mmlist) {
621                 mm = list_entry(p, struct mm_struct,
                            mmlist);
622                 swcount = *swap_map;
623                 if (mm == &init_mm) {
624                     set_start_mm = 1;
625                     spin_unlock(&mmlist_lock);
626                     shmem = shmem_unuse(entry, page);
627                     spin_lock(&mmlist_lock);
628                 } else
629                     unuse_process(mm, entry, page);
630                 if (set_start_mm && *swap_map < swcount) {
631                     new_start_mm = mm;
632                     set_start_mm = 0;
633                 }
634             }
635             atomic_inc(&new_start_mm->mm_users);
636             spin_unlock(&mmlist_lock);
637             mmput(start_mm);
638             start_mm = new_start_mm;
639         }
612-637If an entry still exists, begin traversing through all mm_structs finding references to this page and update the respective PTE
618Lock the mm list
619-632Keep searching until all mm_structs have been found. Do not traverse the full list more than once
621Get the mm_struct for this list entry
623-627Call shmem_unuse()(See Section L.6.2) if the mm is init_mm as that indicates that is a page from the virtual filesystem. Else call unuse_process() (See Section K.5.3) to traverse the current process's page tables searching for the swap entry. If found, the entry will be freed and the page reinstantiated in the PTE
630-633Record if we need to start searching mm_structs starting from init_mm again
654         if (*swap_map == SWAP_MAP_MAX) {
655             swap_list_lock();
656             swap_device_lock(si);
657             nr_swap_pages++;
658             *swap_map = 1;
659             swap_device_unlock(si);
660             swap_list_unlock();
661             reset_overflow = 1;
662         }
654If the swap map entry is permanently mapped, we have to hope that all processes have their PTEs updated to point to the page and that, in reality, the swap map entry is free. In reality, it is highly unlikely a slot would be permanently reserved in the first place
655-661Lock the swap list and the swap device, set the swap map entry to 1, unlock them again and record that a reset overflow occurred
683         if ((*swap_map > 1) && PageDirty(page) &&
                PageSwapCache(page)) {
684             rw_swap_page(WRITE, page);
685             lock_page(page);
686         }
687         if (PageSwapCache(page)) {
688             if (shmem)
689                 swap_duplicate(entry);
690             else
691                 delete_from_swap_cache(page);
692         }
683-686In the very rare event a reference still exists to the page, write the page back to disk so at least if another process really has a reference to it, it'll copy the page back in from disk correctly
687-689If the page is in the swap cache and belongs to the shared memory filesystem, a new reference is taken to it with swap_duplicate() so we can try to remove it again later with shmem_unuse()
691Else, for normal pages, just delete them from the swap cache
699         SetPageDirty(page);
700         UnlockPage(page);
701         page_cache_release(page);
699Mark the page dirty so that the swap out code will preserve the page and if it needs to remove it again, it'll write it correctly to a new swap area
700Unlock the page
701Release our reference to it in the page cache
708         if (current->need_resched)
714             schedule();
715     }
716 
717     mmput(start_mm);
718     if (reset_overflow) {
714         printk(KERN_WARNING "swapoff: cleared swap entry
                     overflow\n");
715         swap_overflow = 0;
716     }
717     return retval;
718 }
708-714Call schedule() if necessary so the deactivation of swap does not hog the entire CPU
717Drop our reference to the mm
718-721If a permanently mapped page had to be removed, print out a warning so that in the very unlikely event an error occurs later, there will be a hint to what might have happened
717Return success or failure

K.5.3  Function: unuse_process

Source: mm/swapfile.c

This function begins the page table walk required to remove the requested page and entry from the process page tables managed by mm. This is only required when a swap area is being deactivated so, while expensive, it is a very rare operation. This set of functions should be instantly recognisable as a standard page-table walk.

454 static void unuse_process(struct mm_struct * mm,
455                         swp_entry_t entry, struct page* page)
456 {
457     struct vm_area_struct* vma;
458 
459     /*
460      * Go through process' page directory.
461      */
462     spin_lock(&mm->page_table_lock);
463     for (vma = mm->mmap; vma; vma = vma->vm_next) {
464         pgd_t * pgd = pgd_offset(mm, vma->vm_start);
465         unuse_vma(vma, pgd, entry, page);
466     }
467     spin_unlock(&mm->page_table_lock);
468     return;
469 }
462Lock the process page tables
463Move through every VMA managed by this mm. Remember that one page frame could be mapped in multiple locations
464Get the PGD managing the beginning of this VMA
465Call unuse_vma() (See Section K.5.4) to search the VMA for the page
467-468The full mm has been searched so unlock the process page tables and return

K.5.4  Function: unuse_vma

Source: mm/swapfile.c

This function searches the requested VMA for page table entries mapping the page and using the given swap entry. It calls unuse_pgd() for every PGD this VMA maps.

440 static void unuse_vma(struct vm_area_struct * vma, pgd_t *pgdir,
441                         swp_entry_t entry, struct page* page)
442 {
443     unsigned long start = vma->vm_start, end = vma->vm_end;
444 
445     if (start >= end)
446         BUG();
447     do {
448         unuse_pgd(vma, pgdir, start, end - start, entry, page);
449         start = (start + PGDIR_SIZE) & PGDIR_MASK;
450         pgdir++;
451     } while (start && (start < end));
452 }
443Get the virtual addresses for the start and end of the VMA
445-446Check that the start is not after the end. There would need to be serious brain damage in the kernel for this to occur
447-451Walk through the VMA in PGDIR_SIZE-sized strides until the end of the VMA is reached. This effectively walks through every PGD that maps portions of this VMA
448Call unuse_pgd() (See Section K.5.5) to walk through just this PGD to unmap page
449Move the virtual address start to the beginning of the next PGD
450Move pgdir to the next PGD in the VMA

K.5.5  Function: unuse_pgd

Source: mm/swapfile.c

This function searches the requested PGD for page table entries mapping the page and using the given swap entry. It calls unuse_pmd() for every PMD this PGD maps.

409 static inline void unuse_pgd(struct vm_area_struct * vma, pgd_t *dir,
410         unsigned long address, unsigned long size,
411         swp_entry_t entry, struct page* page)
412 {
413     pmd_t * pmd;
414     unsigned long offset, end;
415 
416     if (pgd_none(*dir))
417         return;
418     if (pgd_bad(*dir)) {
419         pgd_ERROR(*dir);
420         pgd_clear(dir);
421         return;
422     }
423     pmd = pmd_offset(dir, address);
424     offset = address & PGDIR_MASK;
425     address &= ~PGDIR_MASK;
426     end = address + size;
427     if (end > PGDIR_SIZE)
428         end = PGDIR_SIZE;
429     if (address >= end)
430         BUG();
431     do {
432         unuse_pmd(vma, pmd, address, end - address, offset, entry,
433                   page);
434         address = (address + PMD_SIZE) & PMD_MASK;
435         pmd++;
436     } while (address && (address < end));
437 }
416-417If there is no PGD here, return
418-422If the PGD is bad, then set the appropriate error, clear the PGD and return. There are very few architectures where this condition can occur
423Get the address of the first PMD in this PGD
424Set offset to the PGD-aligned portion of address (address & PGDIR_MASK), the virtual address at which this PGD's coverage begins. It is used later to reconstruct full virtual addresses. Remember that the first time this function is called, it might be searching a partial PGD
425Mask address so that it becomes the offset within this PGD
426Calculate the end address of the search
427-428If the end is beyond this PGD, set the end just to the end of this PGD
429-430If the starting address is after the end address, something is very seriously wrong
431-436Step through the PGD in PMD_SIZE-sized strides and call unuse_pmd() (See Section K.5.6) for every PMD in this PGD

K.5.6  Function: unuse_pmd

Source: mm/swapfile.c

This function searches the requested PMD for page table entries mapping the page and using the given swap entry. It calls unuse_pte() for every PTE this PMD maps.

381 static inline void unuse_pmd(struct vm_area_struct * vma, pmd_t *dir,
382      unsigned long address, unsigned long size, unsigned long offset,
383      swp_entry_t entry, struct page* page)
384 {
385     pte_t * pte;
386     unsigned long end;
387 
388     if (pmd_none(*dir))
389         return;
390     if (pmd_bad(*dir)) {
391         pmd_ERROR(*dir);
392         pmd_clear(dir);
393         return;
394     }
395     pte = pte_offset(dir, address);
396     offset += address & PMD_MASK;
397     address &= ~PMD_MASK;
398     end = address + size;
399     if (end > PMD_SIZE)
400         end = PMD_SIZE;
401     do {
402         unuse_pte(vma, offset+address-vma->vm_start, pte, entry, page);
403         address += PAGE_SIZE;
404         pte++;
405     } while (address && (address < end));
406 }
388-389Return if no PMD exists
390-394Set the appropriate error and clear the PMD if it is bad. There are very few architectures where this condition can occur
395Calculate the starting PTE for this address
396Add the PMD-aligned portion of address to offset so that offset is the virtual address at which this PMD's coverage begins. Combined with address, it lets unuse_pte() be given the offset of each page within the VMA
397Mask address so that it becomes the offset within this PMD
398-400Calculate the end address. If it is beyond the end of this PMD, set it to the end of this PMD
401-405Step through this PMD in PAGE_SIZE-sized chunks and call unuse_pte() (See Section K.5.7) for each PTE

K.5.7  Function: unuse_pte

Source: mm/swapfile.c

This function checks whether the PTE at dir matches the entry we are searching for. If it does, the swap entry is freed, a reference is taken to the page and the PTE is updated to map it.

365 static inline void unuse_pte(struct vm_area_struct * vma, 
            unsigned long address,
366         pte_t *dir, swp_entry_t entry, struct page* page)
367 {
368     pte_t pte = *dir;
369 
370     if (likely(pte_to_swp_entry(pte).val != entry.val))
371         return;
372     if (unlikely(pte_none(pte) || pte_present(pte)))
373         return;
374     get_page(page);
375     set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
376     swap_free(entry);
377     ++vma->vm_mm->rss;
378 }
370-371If the entry does not match the PTE, return
372-373If there is no PTE or it is already present (meaning there is no way this entry is mapped here), then return
374Otherwise we have found the entry we are looking for so take a reference to the page as a new PTE is about to map it
375Update the PTE to map page
376Free the swap entry
377Increment the RSS count for this process

