The page fault handler for anonymous pages can generate significant overhead beyond its essential function, which is to clear and set up a new page table entry for a never-accessed memory location. This overhead increases significantly in an SMP environment. In the page table scalability patches we addressed the issue by changing the locking scheme so that multiple fault handlers can run concurrently on multiple cpus.

This patch attempts to aggregate multiple page faults into a single one. It does so by noting anonymous page faults generated in sequence by an application. If a fault occurred for page x and is then followed by a fault for page x+1, it is reasonable to expect another page fault at x+2 in the future. If the page table entries for x+1 and x+2 are prepared while handling the fault for page x+1, then the overhead of taking a fault for x+2 is avoided. However, page x+2 may never be used and thus we may have increased the rss of the application unnecessarily. The swapper will take care of removing that page if memory should get tight.

The following patch makes the anonymous fault handler anticipate future faults. For each fault a prediction is made of where the next fault will occur (assuming linear access by the application). If the prediction turns out to be right (the next fault is where expected), then a number of pages is preallocated in order to avoid a series of future faults. The size of the preallocation doubles with each successful prediction in sequence (a small stand-alone model of this heuristic follows the first two result tables below): the first successful prediction leads to one additional page being allocated, the second to 2 additional pages, the third to 4 pages and so on. The maximum order is 3 by default, so in a large continuous allocation the number of faults is reduced by a factor of 8.

The patch may be combined with the page fault scalability patch (another edition of the patch is needed, which will be forthcoming after the page fault scalability patch has been included). The combined patches triple the possible page fault rate from ~1 million faults/sec to 3 million faults/sec.

Standard kernel on a 512 cpu machine allocating 32GB with an increasing number of threads (and thus increasing parallelism of page faults):

 Gb Rep Threads   User      System      Wall    flt/cpu/s  fault/wsec
 32   3    1     1.416s    138.165s  139.050s  45073.831   45097.498
 32   3    2     1.397s    148.523s   78.044s  41965.149   80201.646
 32   3    4     1.390s    152.618s   44.044s  40851.258  141545.239
 32   3    8     1.500s    374.008s   53.001s  16754.519  118671.950
 32   3   16     1.415s   1051.759s   73.094s   5973.803   85087.358
 32   3   32     1.867s   3400.417s  117.003s   1849.186   53754.928
 32   3   64     5.361s  11633.040s  197.034s    540.577   31881.112
 32   3  128    23.387s  39386.390s  332.055s    159.642   18918.599
 32   3  256    15.409s  20031.450s  168.095s    313.837   37237.918
 32   3  512    18.720s  10338.511s   86.047s    607.446   72752.686

Patched kernel:

 Gb Rep Threads   User      System      Wall    flt/cpu/s  fault/wsec
 32   3    1     1.098s    138.544s  139.063s  45053.657   45057.920
 32   3    2     1.022s    127.770s   67.086s  48849.350   92707.085
 32   3    4     0.995s    119.666s   37.045s  52141.503  167955.292
 32   3    8     0.928s     87.400s   18.034s  71227.407  342934.242
 32   3   16     1.067s     72.943s   11.035s  85007.293  553989.377
 32   3   32     1.248s    133.753s   10.038s  46602.680  606062.151
 32   3   64     5.557s    438.634s   13.093s  14163.802  451418.617
 32   3  128    17.860s   1496.797s   19.048s   4153.714  322808.509
 32   3  256    13.382s    766.063s   10.016s   8071.695  618816.838
 32   3  512    17.067s    369.106s    5.041s  16291.764 1161285.521

These numbers are roughly equal to what can be accomplished with the page fault scalability patches alone.
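To restate the heuristic by itself, here is a small stand-alone user space model (not part of the patch) of how the two per-task fields drive the preallocation. It assumes 4K pages for the illustration and leaves out the clamping that the real code below applies against PAGEVEC_SIZE (which enforces the default maximum order), the end of the vma and the current pmd:

#include <stdio.h>

#define PAGE_SHIFT	12			/* assumed 4K pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Stand-ins for the two fields the patch adds to struct task_struct */
struct fault_state {
	unsigned long anon_fault_next_addr;
	int anon_fault_order;
};

/*
 * Prediction step: returns how many pages the fault at addr sets up
 * (1 on a mispredict). The clamping done by the real code is omitted.
 */
static unsigned long predict(struct fault_state *s, unsigned long addr)
{
	if (s->anon_fault_next_addr != addr) {
		/* Miss: expect the next sequential page, reset the order */
		s->anon_fault_order = 0;
		s->anon_fault_next_addr = addr + PAGE_SIZE;
		return 1;
	}
	/* Hit: double the preallocation and predict the fault after it */
	s->anon_fault_order++;
	s->anon_fault_next_addr = addr + (PAGE_SIZE << s->anon_fault_order);
	return 1UL << s->anon_fault_order;
}

int main(void)
{
	struct fault_state s = { 0, 0 };
	unsigned long addr = 0x10000000UL;
	int i;

	/* Strictly sequential faults set up 1, 2, 4, 8, ... pages (unclamped here) */
	for (i = 0; i < 6; i++) {
		unsigned long pages = predict(&s, addr);

		printf("fault at 0x%lx -> %lu page(s)\n", addr, pages);
		addr = s.anon_fault_next_addr;
	}
	return 0;
}

Resetting the order on a mispredict means random access patterns keep paying exactly one page per fault; only clearly sequential access grows the rss ahead of use, and the swapper can remove the extra pages if memory gets tight.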
Kernel with both the page fault scalability patches and prefaulting:

 Gb Rep Threads   User      System      Wall    flt/cpu/s  fault/wsec
 32  10    1     4.103s    456.384s  460.046s  45541.992   45544.369
 32  10    2     4.005s    415.119s  221.095s  50036.407   94484.174
 32  10    4     3.855s    371.317s  111.076s  55898.259  187635.724
 32  10    8     3.902s    308.673s   67.094s  67092.476  308634.397
 32  10   16     4.011s    224.213s   37.016s  91889.781  564241.062
 32  10   32     5.483s    209.391s   27.046s  97598.647  763495.417
 32  10   64    19.166s    219.925s   26.030s  87713.212  797286.395
 32  10  128    53.482s    342.342s   27.024s  52981.744  769687.791
 32  10  256    67.334s    180.321s   15.036s  84679.911 1364614.334
 32  10  512    66.516s     93.098s    9.015s 131387.893 2291548.865

The fault rate doubles when both patches are applied.

And on the high end (512 processors allocating 256GB; no numbers for the regular kernel and none for low thread counts because those runs are extremely slow).

With prefaulting:

 Gb Rep Threads   User      System      Wall    flt/cpu/s  fault/wsec
256   3    4     8.241s   1414.348s  449.016s  35380.301  112056.239
256   3    8     8.306s   1300.982s  247.025s  38441.977  203559.271
256   3   16     8.368s   1223.853s  154.089s  40846.272  324940.924
256   3   32     8.536s   1284.041s  110.097s  38938.970  453556.624
256   3   64    13.025s   3587.203s  110.010s  13980.123  457131.492
256   3  128    25.025s  11460.700s  145.071s   4382.104  345404.909
256   3  256    26.150s   6061.649s   75.086s   8267.625  663414.482
256   3  512    20.637s   3037.097s   38.062s  16460.435 1302993.019

Page fault scalability patch and prefaulting, with the maximum prefault order increased to 5 (maximum preallocation of 32 pages):

 Gb Rep Threads   User      System      Wall    flt/cpu/s  fault/wsec
256  10    8    33.571s   4516.293s  863.021s  36874.099  194356.930
256  10   16    33.103s   3737.688s  461.028s  44492.553  363704.484
256  10   32    35.094s   3436.561s  321.080s  48326.262  521352.840
256  10   64    46.675s   2899.997s  245.020s  56936.124  684214.256
256  10  128    85.493s   2890.198s  203.008s  56380.890  826122.524
256  10  256    74.299s   1374.973s   99.088s 115762.963 1679630.272
256  10  512    62.760s    706.559s   53.027s 218078.311 3149273.714

With both patches we approach linear scalability on the high end and reach a fault rate of more than 3 million faults per second.

One thing that still takes a lot of time is the zeroing of pages in the page fault handler. There is another set of patches I am working on that prezeroes pages and leads to a further increase in performance by a factor of 2-4 (if prezeroed pages are available, which may not always be the case). Maybe we can reach 10 million faults/sec that way.

Patch against 2.6.10-rc3-bk3:

Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h	2004-12-01 10:37:31.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h	2004-12-01 10:38:15.000000000 -0800
@@ -537,6 +537,8 @@
 #endif
 
 	struct list_head tasks;
+	unsigned long anon_fault_next_addr;	/* Predicted sequential fault address */
+	int anon_fault_order;			/* Last order of allocation on fault */
 	/*
 	 * ptrace_list/ptrace_children forms the list of my children
 	 * that were stolen by a ptracer.
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c	2004-12-01 10:38:11.000000000 -0800
+++ linux-2.6.9/mm/memory.c	2004-12-01 10:45:01.000000000 -0800
@@ -55,6 +55,7 @@
 
 #include <linux/swapops.h>
 #include <linux/elf.h>
+#include <linux/pagevec.h>
 
 #ifndef CONFIG_DISCONTIGMEM
 /* use the per-pgdat data instead for discontigmem - mbligh */
@@ -1432,8 +1433,106 @@
 		unsigned long addr)
 {
 	pte_t entry;
-	struct page * page = ZERO_PAGE(addr);
+	struct page * page;
+
+	addr &= PAGE_MASK;
+
+	if (current->anon_fault_next_addr == addr) {
+		unsigned long end_addr;
+		int order = current->anon_fault_order;
+
+		/* Sequence of page faults detected. Perform preallocation of pages */
+		/* The order of preallocations increases with each successful prediction */
+		order++;
+
+		if ((1 << order) < PAGEVEC_SIZE)
+			end_addr = addr + (1 << (order + PAGE_SHIFT));
+		else
+			end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
+
+		if (end_addr > vma->vm_end)
+			end_addr = vma->vm_end;
+		if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
+			end_addr &= PMD_MASK;
+
+		current->anon_fault_next_addr = end_addr;
+		current->anon_fault_order = order;
+
+		if (write_access) {
+
+			struct pagevec pv;
+			unsigned long a;
+			struct page **p;
+
+			pte_unmap(page_table);
+			spin_unlock(&mm->page_table_lock);
+
+			pagevec_init(&pv, 0);
+
+			if (unlikely(anon_vma_prepare(vma)))
+				return VM_FAULT_OOM;
+
+			/* Allocate the necessary pages */
+			for(a = addr;a < end_addr ; a += PAGE_SIZE) {
+				struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
+
+				if (p) {
+					clear_user_highpage(p, a);
+					pagevec_add(&pv,p);
+				} else
+					break;
+			}
+			end_addr = a;
+
+			spin_lock(&mm->page_table_lock);
+
+			for(p = pv.pages; addr < end_addr; addr += PAGE_SIZE, p++) {
+
+				page_table = pte_offset_map(pmd, addr);
+				if (!pte_none(*page_table)) {
+					/* Someone else got there first */
+					page_cache_release(*p);
+					pte_unmap(page_table);
+					continue;
+				}
+
+				entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
+							vma->vm_page_prot)),
+						vma);
+
+				mm->rss++;
+				lru_cache_add_active(*p);
+				mark_page_accessed(*p);
+				page_add_anon_rmap(*p, vma, addr);
+
+				set_pte(page_table, entry);
+				pte_unmap(page_table);
+
+				/* No need to invalidate - it was non-present before */
+				update_mmu_cache(vma, addr, entry);
+			}
+		} else {
+			/* Read */
+			for(;addr < end_addr; addr += PAGE_SIZE) {
+				page_table = pte_offset_map(pmd, addr);
+				entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+				set_pte(page_table, entry);
+				pte_unmap(page_table);
+
+				/* No need to invalidate - it was non-present before */
+				update_mmu_cache(vma, addr, entry);
+
+			};
+		}
+		spin_unlock(&mm->page_table_lock);
+		return VM_FAULT_MINOR;
+	}
+
+	current->anon_fault_next_addr = addr + PAGE_SIZE;
+	current->anon_fault_order = 0;
+
+	page = ZERO_PAGE(addr);
 
 	/* Read-only mapping of ZERO_PAGE. */
 	entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));