aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
-rw-r--r--Documentation/vm/page_migration129
-rw-r--r--include/linux/rmap.h4
-rw-r--r--include/linux/swap.h2
-rw-r--r--mm/rmap.c21
-rw-r--r--mm/vmscan.c226
5 files changed, 360 insertions, 22 deletions
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
new file mode 100644
index 00000000000000..c52820fcf500a8
--- /dev/null
+++ b/Documentation/vm/page_migration
@@ -0,0 +1,129 @@
+Page migration
+--------------
+
+Page migration allows the moving of the physical location of pages between
+nodes in a numa system while the process is running. This means that the
+virtual addresses that the process sees do not change. However, the
+system rearranges the physical location of those pages.
+
+The main intend of page migration is to reduce the latency of memory access
+by moving pages near to the processor where the process accessing that memory
+is running.
+
+Page migration allows a process to manually relocate the node on which its
+pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
+a new memory policy. The pages of process can also be relocated
+from another process using the sys_migrate_pages() function call. The
+migrate_pages function call takes two sets of nodes and moves pages of a
+process that are located on the from nodes to the destination nodes.
+
+Manual migration is very useful if for example the scheduler has relocated
+a process to a processor on a distant node. A batch scheduler or an
+administrator may detect the situation and move the pages of the process
+nearer to the new processor. At some point in the future we may have
+some mechanism in the scheduler that will automatically move the pages.
+
+Larger installations usually partition the system using cpusets into
+sections of nodes. Paul Jackson has equipped cpusets with the ability to
+move pages when a task is moved to another cpuset. This allows automatic
+control over locality of a process. If a task is moved to a new cpuset
+then also all its pages are moved with it so that the performance of the
+process does not sink dramatically (as is the case today).
+
+Page migration allows the preservation of the relative location of pages
+within a group of nodes for all migration techniques which will preserve a
+particular memory allocation pattern generated even after migrating a
+process. This is necessary in order to preserve the memory latencies.
+Processes will run with similar performance after migration.
+
+Page migration occurs in several steps. First a high level
+description for those trying to use migrate_pages() and then
+a low level description of how the low level details work.
+
+A. Use of migrate_pages()
+-------------------------
+
+1. Remove pages from the LRU.
+
+ Lists of pages to be migrated are generated by scanning over
+ pages and moving them into lists. This is done by
+ calling isolate_lru_page() or __isolate_lru_page().
+ Calling isolate_lru_page increases the references to the page
+ so that it cannot vanish under us.
+
+2. Generate a list of newly allocates page to move the contents
+ of the first list to.
+
+3. The migrate_pages() function is called which attempts
+ to do the migration. It returns the moved pages in the
+ list specified as the third parameter and the failed
+ migrations in the fourth parameter. The first parameter
+ will contain the pages that could still be retried.
+
+4. The leftover pages of various types are returned
+ to the LRU using putback_to_lru_pages() or otherwise
+ disposed of. The pages will still have the refcount as
+ increased by isolate_lru_pages()!
+
+B. Operation of migrate_pages()
+--------------------------------
+
+migrate_pages does several passes over its list of pages. A page is moved
+if all references to a page are removable at the time.
+
+Steps:
+
+1. Lock the page to be migrated
+
+2. Insure that writeback is complete.
+
+3. Make sure that the page has assigned swap cache entry if
+ it is an anonyous page. The swap cache reference is necessary
+ to preserve the information contain in the page table maps.
+
+4. Prep the new page that we want to move to. It is locked
+ and set to not being uptodate so that all accesses to the new
+ page immediately lock while we are moving references.
+
+5. All the page table references to the page are either dropped (file backed)
+ or converted to swap references (anonymous pages). This should decrease the
+ reference count.
+
+6. The radix tree lock is taken
+
+7. The refcount of the page is examined and we back out if references remain
+ otherwise we know that we are the only one referencing this page.
+
+8. The radix tree is checked and if it does not contain the pointer to this
+ page then we back out.
+
+9. The mapping is checked. If the mapping is gone then a truncate action may
+ be in progress and we back out.
+
+10. The new page is prepped with some settings from the old page so that accesses
+ to the new page will be discovered to have the correct settings.
+
+11. The radix tree is changed to point to the new page.
+
+12. The reference count of the old page is dropped because the reference has now
+ been removed.
+
+13. The radix tree lock is dropped.
+
+14. The page contents are copied to the new page.
+
+15. The remaining page flags are copied to the new page.
+
+16. The old page flags are cleared to indicate that the page does
+ not use any information anymore.
+
+17. Queued up writeback on the new page is triggered.
+
+18. If swap pte's were generated for the page then remove them again.
+
+19. The locks are dropped from the old and new page.
+
+20. The new page is moved to the LRU.
+
+Christoph Lameter, December 19, 2005.
+
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9d6fbeef210472..0f1ea2d6ed8635 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -91,7 +91,7 @@ static inline void page_dup_rmap(struct page *page)
* Called from mm/vmscan.c to handle paging out
*/
int page_referenced(struct page *, int is_locked);
-int try_to_unmap(struct page *);
+int try_to_unmap(struct page *, int ignore_refs);
/*
* Called from mm/filemap_xip.c to unmap empty zero page
@@ -111,7 +111,7 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
#define anon_vma_link(vma) do {} while (0)
#define page_referenced(page,l) TestClearPageReferenced(page)
-#define try_to_unmap(page) SWAP_FAIL
+#define try_to_unmap(page, refs) SWAP_FAIL
#endif /* CONFIG_MMU */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e53fef7051e6f3..d359fc022433fe 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -191,6 +191,8 @@ static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
#ifdef CONFIG_MIGRATION
extern int isolate_lru_page(struct page *p);
extern int putback_lru_pages(struct list_head *l);
+extern int migrate_page(struct page *, struct page *);
+extern void migrate_page_copy(struct page *, struct page *);
extern int migrate_pages(struct list_head *l, struct list_head *t,
struct list_head *moved, struct list_head *failed);
#else
diff --git a/mm/rmap.c b/mm/rmap.c
index d85a99d28c0387..13fad5fcdf7996 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -52,6 +52,7 @@
#include <linux/init.h>
#include <linux/rmap.h>
#include <linux/rcupdate.h>
+#include <linux/module.h>
#include <asm/tlbflush.h>
@@ -541,7 +542,8 @@ void page_remove_rmap(struct page *page)
* Subfunctions of try_to_unmap: try_to_unmap_one called
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
*/
-static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma)
+static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
+ int ignore_refs)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -564,7 +566,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma)
* skipped over this mm) then we should reactivate it.
*/
if ((vma->vm_flags & VM_LOCKED) ||
- ptep_clear_flush_young(vma, address, pte)) {
+ (ptep_clear_flush_young(vma, address, pte)
+ && !ignore_refs)) {
ret = SWAP_FAIL;
goto out_unmap;
}
@@ -698,7 +701,7 @@ static void try_to_unmap_cluster(unsigned long cursor,
pte_unmap_unlock(pte - 1, ptl);
}
-static int try_to_unmap_anon(struct page *page)
+static int try_to_unmap_anon(struct page *page, int ignore_refs)
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
@@ -709,7 +712,7 @@ static int try_to_unmap_anon(struct page *page)
return ret;
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- ret = try_to_unmap_one(page, vma);
+ ret = try_to_unmap_one(page, vma, ignore_refs);
if (ret == SWAP_FAIL || !page_mapped(page))
break;
}
@@ -726,7 +729,7 @@ static int try_to_unmap_anon(struct page *page)
*
* This function is only called from try_to_unmap for object-based pages.
*/
-static int try_to_unmap_file(struct page *page)
+static int try_to_unmap_file(struct page *page, int ignore_refs)
{
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -740,7 +743,7 @@ static int try_to_unmap_file(struct page *page)
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
- ret = try_to_unmap_one(page, vma);
+ ret = try_to_unmap_one(page, vma, ignore_refs);
if (ret == SWAP_FAIL || !page_mapped(page))
goto out;
}
@@ -825,16 +828,16 @@ out:
* SWAP_AGAIN - we missed a mapping, try again later
* SWAP_FAIL - the page is unswappable
*/
-int try_to_unmap(struct page *page)
+int try_to_unmap(struct page *page, int ignore_refs)
{
int ret;
BUG_ON(!PageLocked(page));
if (PageAnon(page))
- ret = try_to_unmap_anon(page);
+ ret = try_to_unmap_anon(page, ignore_refs);
else
- ret = try_to_unmap_file(page);
+ ret = try_to_unmap_file(page, ignore_refs);
if (!page_mapped(page))
ret = SWAP_SUCCESS;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aa4b80dbe3ad60..8f326ce2b690bf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -483,7 +483,7 @@ static int shrink_list(struct list_head *page_list, struct scan_control *sc)
if (!sc->may_swap)
goto keep_locked;
- switch (try_to_unmap(page)) {
+ switch (try_to_unmap(page, 0)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -623,7 +623,7 @@ static int swap_page(struct page *page)
struct address_space *mapping = page_mapping(page);
if (page_mapped(page) && mapping)
- if (try_to_unmap(page) != SWAP_SUCCESS)
+ if (try_to_unmap(page, 0) != SWAP_SUCCESS)
goto unlock_retry;
if (PageDirty(page)) {
@@ -659,6 +659,154 @@ unlock_retry:
retry:
return -EAGAIN;
}
+
+/*
+ * Page migration was first developed in the context of the memory hotplug
+ * project. The main authors of the migration code are:
+ *
+ * IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
+ * Hirokazu Takahashi <taka@valinux.co.jp>
+ * Dave Hansen <haveblue@us.ibm.com>
+ * Christoph Lameter <clameter@sgi.com>
+ */
+
+/*
+ * Remove references for a page and establish the new page with the correct
+ * basic settings to be able to stop accesses to the page.
+ */
+static int migrate_page_remove_references(struct page *newpage,
+ struct page *page, int nr_refs)
+{
+ struct address_space *mapping = page_mapping(page);
+ struct page **radix_pointer;
+
+ /*
+ * Avoid doing any of the following work if the page count
+ * indicates that the page is in use or truncate has removed
+ * the page.
+ */
+ if (!mapping || page_mapcount(page) + nr_refs != page_count(page))
+ return 1;
+
+ /*
+ * Establish swap ptes for anonymous pages or destroy pte
+ * maps for files.
+ *
+ * In order to reestablish file backed mappings the fault handlers
+ * will take the radix tree_lock which may then be used to stop
+ * processses from accessing this page until the new page is ready.
+ *
+ * A process accessing via a swap pte (an anonymous page) will take a
+ * page_lock on the old page which will block the process until the
+ * migration attempt is complete. At that time the PageSwapCache bit
+ * will be examined. If the page was migrated then the PageSwapCache
+ * bit will be clear and the operation to retrieve the page will be
+ * retried which will find the new page in the radix tree. Then a new
+ * direct mapping may be generated based on the radix tree contents.
+ *
+ * If the page was not migrated then the PageSwapCache bit
+ * is still set and the operation may continue.
+ */
+ try_to_unmap(page, 1);
+
+ /*
+ * Give up if we were unable to remove all mappings.
+ */
+ if (page_mapcount(page))
+ return 1;
+
+ write_lock_irq(&mapping->tree_lock);
+
+ radix_pointer = (struct page **)radix_tree_lookup_slot(
+ &mapping->page_tree,
+ page_index(page));
+
+ if (!page_mapping(page) || page_count(page) != nr_refs ||
+ *radix_pointer != page) {
+ write_unlock_irq(&mapping->tree_lock);
+ return 1;
+ }
+
+ /*
+ * Now we know that no one else is looking at the page.
+ *
+ * Certain minimal information about a page must be available
+ * in order for other subsystems to properly handle the page if they
+ * find it through the radix tree update before we are finished
+ * copying the page.
+ */
+ get_page(newpage);
+ newpage->index = page->index;
+ newpage->mapping = page->mapping;
+ if (PageSwapCache(page)) {
+ SetPageSwapCache(newpage);
+ set_page_private(newpage, page_private(page));
+ }
+
+ *radix_pointer = newpage;
+ __put_page(page);
+ write_unlock_irq(&mapping->tree_lock);
+
+ return 0;
+}
+
+/*
+ * Copy the page to its new location
+ */
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+ copy_highpage(newpage, page);
+
+ if (PageError(page))
+ SetPageError(newpage);
+ if (PageReferenced(page))
+ SetPageReferenced(newpage);
+ if (PageUptodate(page))
+ SetPageUptodate(newpage);
+ if (PageActive(page))
+ SetPageActive(newpage);
+ if (PageChecked(page))
+ SetPageChecked(newpage);
+ if (PageMappedToDisk(page))
+ SetPageMappedToDisk(newpage);
+
+ if (PageDirty(page)) {
+ clear_page_dirty_for_io(page);
+ set_page_dirty(newpage);
+ }
+
+ ClearPageSwapCache(page);
+ ClearPageActive(page);
+ ClearPagePrivate(page);
+ set_page_private(page, 0);
+ page->mapping = NULL;
+
+ /*
+ * If any waiters have accumulated on the new page then
+ * wake them up.
+ */
+ if (PageWriteback(newpage))
+ end_page_writeback(newpage);
+}
+
+/*
+ * Common logic to directly migrate a single page suitable for
+ * pages that do not use PagePrivate.
+ *
+ * Pages are locked upon entry and exit.
+ */
+int migrate_page(struct page *newpage, struct page *page)
+{
+ BUG_ON(PageWriteback(page)); /* Writeback must be complete */
+
+ if (migrate_page_remove_references(newpage, page, 2))
+ return -EAGAIN;
+
+ migrate_page_copy(newpage, page);
+
+ return 0;
+}
+
/*
* migrate_pages
*
@@ -672,11 +820,6 @@ retry:
* are movable anymore because t has become empty
* or no retryable pages exist anymore.
*
- * SIMPLIFIED VERSION: This implementation of migrate_pages
- * is only swapping out pages and never touches the second
- * list. The direct migration patchset
- * extends this function to avoid the use of swap.
- *
* Return: Number of pages not migrated when "to" ran empty.
*/
int migrate_pages(struct list_head *from, struct list_head *to,
@@ -697,6 +840,9 @@ redo:
retry = 0;
list_for_each_entry_safe(page, page2, from, lru) {
+ struct page *newpage = NULL;
+ struct address_space *mapping;
+
cond_resched();
rc = 0;
@@ -704,6 +850,9 @@ redo:
/* page was freed from under us. So we are done. */
goto next;
+ if (to && list_empty(to))
+ break;
+
/*
* Skip locked pages during the first two passes to give the
* functions holding the lock time to release the page. Later we
@@ -740,12 +889,64 @@ redo:
}
}
+ if (!to) {
+ rc = swap_page(page);
+ goto next;
+ }
+
+ newpage = lru_to_page(to);
+ lock_page(newpage);
+
/*
- * Page is properly locked and writeback is complete.
+ * Pages are properly locked and writeback is complete.
* Try to migrate the page.
*/
- rc = swap_page(page);
- goto next;
+ mapping = page_mapping(page);
+ if (!mapping)
+ goto unlock_both;
+
+ /*
+ * Trigger writeout if page is dirty
+ */
+ if (PageDirty(page)) {
+ switch (pageout(page, mapping)) {
+ case PAGE_KEEP:
+ case PAGE_ACTIVATE:
+ goto unlock_both;
+
+ case PAGE_SUCCESS:
+ unlock_page(newpage);
+ goto next;
+
+ case PAGE_CLEAN:
+ ; /* try to migrate the page below */
+ }
+ }
+ /*
+ * If we have no buffer or can release the buffer
+ * then do a simple migration.
+ */
+ if (!page_has_buffers(page) ||
+ try_to_release_page(page, GFP_KERNEL)) {
+ rc = migrate_page(newpage, page);
+ goto unlock_both;
+ }
+
+ /*
+ * On early passes with mapped pages simply
+ * retry. There may be a lock held for some
+ * buffers that may go away. Later
+ * swap them out.
+ */
+ if (pass > 4) {
+ unlock_page(newpage);
+ newpage = NULL;
+ rc = swap_page(page);
+ goto next;
+ }
+
+unlock_both:
+ unlock_page(newpage);
unlock_page:
unlock_page(page);
@@ -758,7 +959,10 @@ next:
list_move(&page->lru, failed);
nr_failed++;
} else {
- /* Success */
+ if (newpage) {
+ /* Successful migration. Return page to LRU */
+ move_to_lru(newpage);
+ }
list_move(&page->lru, moved);
}
}