aboutsummaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2006-01-06[PATCH] mm: add a new function (needed for swap suspend)Rafael J. Wysocki1-0/+20
This adds the function get_swap_page_of_type() allowing us to specify an index in swap_info[] and select a swap_info_struct structure to be used for allocating a swap page. This function (or another one of similar functionality) will be necessary for implementing the image-writing part of swsusp in the user space.  It can also be used for simplifying the current in-kernel implementation of the image-writing part of swsusp. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] allow flatmem to be disabled when only sparsemem is implementedAnton Blanchard1-1/+1
On architectures that implement sparsemem but not discontigmem we want to be able to hide the flatmem option in some cases. On ppc64 for example, when we select NUMA we must not select flatmem. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] NOMMU: Make SYSV IPC SHM use ramfs facilities on NOMMUDavid Howells3-2/+36
The attached patch makes the SYSV IPC shared memory facilities use the new ramfs facilities on a no-MMU kernel. The following changes are made: (1) There are now shmem_mmap() and shmem_get_unmapped_area() functions to allow the IPC SHM facilities to commune with the tiny-shmem and shmem code. (2) ramfs files now need resizing using do_truncate() rather than by modifying the inode size directly (see shmem_file_setup()). This causes ramfs to attempt to bind a block of pages of sufficient size to the inode. (3) CONFIG_SYSVIPC is no longer contingent on CONFIG_MMU. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: page_state optNick Piggin3-54/+72
Optimise page_state manipulations by introducing interrupt unsafe accessors to page_state fields. Callers must provide their own locking (either disable interrupts or not update from interrupt context). Switch over the hot callsites that can easily be moved under interrupts off sections. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] build_zonelists_node(): rename argsChristoph Lameter1-9/+9
Give j and r meaningful names. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Fix zone policy determinationChristoph Lameter1-3/+7
The use k in the inner loop means that the highest zone nr is always used if any zone of a node is populated. This means that the policy zone is not correctly determined on arches that do no use HIGHMEM like ia64. Change the loop to decrement k which also simplifies the BUG_ON. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: move determination of policy_zone into page allocatorChristoph Lameter2-12/+5
Currently the function to build a zonelist for a BIND policy has the side effect to set the policy_zone. This seems to be a bit strange. policy zone seems to not be initialized elsewhere and therefore 0. Do we police ZONE_DMA if no bind policy has been used yet? This patch moves the determination of the zone to apply policies to into the page allocator. We determine the zone while building the zonelist for nodes. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: simplify build_zonelists_node by removing the case statement.Christoph Lameter1-23/+11
Simplify build_zonelists_node by removing the case statement. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: add populated_zone() helperCon Kolivas2-8/+8
There are numerous places we check whether a zone is populated or not. Provide a helper function to check for populated zones and convert all checks for zone->present_pages. Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] consolidate lru_add_drain() and lru_drain_cache()Andrew Morton1-16/+11
Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Rajesh Shah <rajesh.shah@intel.com> Cc: Li Shaohua <shaohua.li@intel.com> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] vmscan: balancing fixAndrew Morton1-8/+0
Revert a patch which went into 2.6.8-rc1. The changelog for that patch was: The shrink_zone() logic can, under some circumstances, cause far too many pages to be reclaimed. Say, we're scanning at high priority and suddenly hit a large number of reclaimable pages on the LRU. Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed. Problem is, this change caused significant imbalance in inter-zone scan balancing by truncating scans of larger zones. Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone balancing algorithm would require that if we're scanning 100 pages of ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are reclaimed. Thus effectively causing smaller zones to be scanned relatively harder than large ones. Now I need to remember what the workload was which caused me to write this patch originally, then fix it up in a different way... Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: pfault optimisationNick Piggin1-1/+0
This atomic operation is superfluous: the pte will be added with the referenced bit set, and the page will be referenced through this mapping after the page fault handler returns anyway. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: rmap optimisationNick Piggin2-14/+41
Optimise rmap functions by minimising atomic operations when we know there will be no concurrent modifications. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: bad_page optimisationNick Piggin1-24/+20
Cut down size slightly by not passing bad_page the function name (it should be able to be determined by dump_stack()). And cut down the number of printks in bad_page. Also, cut down some branching in the destroy_compound_page path. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: dma32 zone statisticsNick Piggin1-3/+11
Add dma32 to zone statistics. Also attempt to arrange struct page_state a bit better (visually). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] kill last zone_reclaim() bitsAndrew Morton1-80/+0
Remove the last bits of Martin's ill-fated sys_set_zone_reclaim(). Cc: Martin Hicks <mort@wildopensource.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] find_lock_page(): call __lock_page() directly.Nikita Danilov1-2/+3
As find_lock_page() already checks with TestSetPageLocked() that page is locked, there is no need to call lock_page() that will try-lock page again (chances of page being unlocked in between are small). Call __lock_page() directly, this saves one atomic operation. Also, mark truncate-while-slept path as unlikely while we are here. (akpm: ug. But this is actually a common path for normal old read()s against a page which is under readahead I/O so ho-hum.) Signed-off-by: Nikita Danilov <danilov@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] FRV: Clean up bootmem allocator's page freeing algorithmDavid Howells3-17/+41
The attached patch cleans up the way the bootmem allocator frees pages. A new function, __free_pages_bootmem(), is provided in mm/page_alloc.c that is called from mm/bootmem.c to turn pages over to the main allocator. All the bits of code to initialise pages (clearing PG_reserved and setting the page count) are moved to here. The checks on page validity are removed, on the assumption that the struct page arrays will have been prepared correctly. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Cleanup bootmem allocator and fix alloc_bootmem_lowRavikiran G Thirumalai1-7/+31
Patch cleans up the alloc_bootmem fix for swiotlb. Patch removes alloc_bootmem_*_limit api and fixes alloc_boot_*low api to do the right thing -- allocate from low32 memory. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: page_alloc cleanupsNick Piggin1-10/+6
Small cleanups that does not change generated code with the gcc's I've tested with. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: page_state fixesNick Piggin1-3/+2
read_page_state and __get_page_state only traverse online CPUs, which will cause results to fluctuate when CPUs are plugged in or out. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: remove pcp lowNick Piggin1-7/+2
struct per_cpu_pages.low is useless. Remove it. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: remove bad_rangeNick Piggin1-7/+19
bad_range is supposed to be a temporary check. It would be a pity to throw it out. Make it depend on CONFIG_DEBUG_VM instead. CONFIG_HOLES_IN_ZONE systems were relying on this to check pfn_valid in the page allocator. Add that to page_is_buddy instead. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: microopt conditionsNick Piggin1-8/+8
Micro optimise some conditionals where we don't need lazy evaluation. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: set_page_refs optNick Piggin2-19/+17
Inline set_page_refs. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: pagealloc optNick Piggin1-7/+11
Slightly optimise some page allocation and freeing functions by taking advantage of knowing whether or not interrupts are disabled. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: free_pages_and_swap_cache optHugh Dickins1-2/+2
Minor optimization (though it doesn't help in the PREEMPT case, severely constrained by small ZAP_BLOCK_SIZE). free_pages_and_swap_cache works in chunks of 16, calling release_pages which works in chunks of PAGEVEC_SIZE. But PAGEVEC_SIZE was dropped from 16 to 14 in 2.6.10, so we're now doing more spin_lock_irq'ing than necessary: use PAGEVEC_SIZE throughout. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODESMike Kravetz1-2/+0
The NODES_SPAN_OTHER_NODES config option was created so that DISCONTIGMEM could handle pSeries numa layouts. However, support for DISCONTIGMEM has been replaced by SPARSEMEM on powerpc. As a result, this config option and supporting code is no longer needed. I have already sent a patch to Paul that removes the option from powerpc specific code. This removes the arch independent piece. Doesn't really matter which is applied first. Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] hugepages: fold find_or_alloc_pages into huge_no_page()Christoph Lameter1-42/+24
The number of parameters for find_or_alloc_page increases significantly after policy support is added to huge pages. Simplify the code by folding find_or_alloc_huge_page() into hugetlb_no_page(). Adam Litke objected to this piece in an earlier patch but I think this is a good simplification. Diffstat shows that we can get rid of almost half of the lines of find_or_alloc_page(). If we can find no consensus then lets simply drop this patch. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Remove old node based policy interface from mempolicy.cChristoph Lameter1-48/+0
mempolicy.c contains provisional interface for huge page allocation based on node numbers. This is in use in SLES9 but was never used (AFAIK) in upstream versions of Linux. Huge page allocations now use zonelists to figure out where to allocate pages. The use of zonelists allows us to find the closest hugepage which was the consideration of the NUMA distance for huge page allocations. Remove the obsolete functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Add NUMA policy support for huge pages.Christoph Lameter2-19/+44
The huge_zonelist() function in the memory policy layer provides an list of zones ordered by NUMA distance. The hugetlb layer will walk that list looking for a zone that has available huge pages but is also in the nodeset of the current cpuset. This patch does not contain the folding of find_or_alloc_huge_page() that was controversial in the earlier discussion. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: dequeue a huge page near to this nodeChristoph Lameter1-6/+8
This was discussed at http://marc.theaimsgroup.com/?l=linux-kernel&m=113166526217117&w=2 This patch changes the dequeueing to select a huge page near the node executing instead of always beginning to check for free nodes from node 0. This will result in a placement of the huge pages near the executing processor improving performance. The existing implementation can place the huge pages far away from the executing processor causing significant degradation of performance. The search starting from zero also means that the lower zones quickly run out of memory. Selecting a huge page near the process distributed the huge pages better. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Hugetlb: Copy on Write supportDavid Gibson1-19/+108
Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be supported. This helps us to safely use hugetlb pages in many more applications. The patch makes the following changes. If needed, I also have it broken out according to the following paragraphs. 1. Add a pair of functions to set/clear write access on huge ptes. The writable check in make_huge_pte is moved out to the caller for use by COW later. 2. Hugetlb copy-on-write requires special case handling in the following situations: - copy_hugetlb_page_range() - Copied pages must be write protected so a COW fault will be triggered (if necessary) if those pages are written to. - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page cache. MAP_PRIVATE pages still need to be locked however. 3. Provide hugetlb_cow() and calls from hugetlb_fault() and hugetlb_no_page() which handles the COW fault by making the actual copy. 4. Remove the check in hugetlbfs_file_map() so that MAP_PRIVATE mmaps will be allowed. Make MAP_HUGETLB exempt from the depricated VM_RESERVED mapping check. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Hugetlb: Reorganize hugetlb_fault to prepare for COWAdam Litke1-9/+25
This patch splits the "no_page()" type activity into its own function, hugetlb_no_page(). hugetlb_fault() becomes the entry point for hugetlb faults and delegates to the appropriate handler depending on the type of fault. Right now we still have only hugetlb_no_page() but a later patch introduces a COW fault. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Hugetlb: Rename find_lock_page to find_or_alloc_huge_pageAdam Litke1-3/+3
find_lock_huge_page() isn't a great name, since it does extra things not analagous to find_lock_page(). Rename it find_or_alloc_huge_page() which is closer to the mark. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Hugetlb: Remove duplicate i_size checkAdam Litke1-7/+0
cleanup Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing storeBadari Pulavarty3-9/+83
Here is the patch to implement madvise(MADV_REMOVE) - which frees up a given range of pages & its associated backing store. Current implementation supports only shmfs/tmpfs and other filesystems return -ENOSYS. "Some app allocates large tmpfs files, then when some task quits and some client disconnect, some memory can be released. However the only way to release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli Databases want to use this feature to drop a section of their bufferpool (shared memory segments) - without writing back to disk/swap space. This feature is also useful for supporting hot-plug memory on UML. Concerns raised by Andrew Morton: - "We have no plan for holepunching! If we _do_ have such a plan (or might in the future) then what would the API look like? I think sys_holepunch(fd, start, len), so we should start out with that." - Using madvise is very weird, because people will ask "why do I need to mmap my file before I can stick a hole in it?" - None of the other madvise operations call into the filesystem in this manner. A broad question is: is this capability an MM operation or a filesytem operation? truncate, for example, is a filesystem operation which sometimes has MM side-effects. madvise is an mm operation and with this patch, it gains FS side-effects, only they're really, really significant ones." Comments: - Andrea suggested the fs operation too but then it's more efficient to have it as a mm operation with fs side effects, because they don't immediatly know fd and physical offset of the range. It's possible to fixup in userland and to use the fs operation but it's more expensive, the vmas are already in the kernel and we can use them. Short term plan & Future Direction: - We seem to need this interface only for shmfs/tmpfs files in the short term. We have to add hooks into the filesystem for correctness and completeness. This is what this patch does. - In the future, plan is to support both fs and mmap apis also. This also involves (other) filesystem specific functions to be implemented. - Current patch doesn't support VM_NONLINEAR - which can be addressed in the future. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrea Arcangeli <andrea@suse.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] reiser4: vfs: add truncate_inode_pages_range()Hans Reiser1-7/+37
This patch makes truncate_inode_pages_range from truncate_inode_pages. truncate_inode_pages became a one-liner call to truncate_inode_pages_range. Reiser4 needs truncate_inode_pages_ranges because it tries to keep correspondence between existences of metadata pointing to data pages and pages to which those metadata point to. So, when metadata of certain part of file is removed from filesystem tree, only pages of corresponding range are to be truncated. (Needed by the madvise(MADV_REMOVE) patch) Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] memhotplug: __add_section remove unused pgdat definitionAndy Whitcroft1-1/+0
__add_section defines an unused pointer to the zones pgdat. Remove this definition. This fixes a compile warning. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: fix __alloc_pages cpuset ALLOC_* flagsPaul Jackson1-3/+2
Two changes to the setting of the ALLOC_CPUSET flag in mm/page_alloc.c:__alloc_pages() - A bug fix - the "ignoring mins" case should not be honoring ALLOC_CPUSET. This case of all cases, since it is handling a request that will free up more memory than is asked for (exiting tasks, e.g.) should be allowed to escape cpuset constraints when memory is tight. - A logic change to make it simpler. Honor cpusets even on GFP_ATOMIC (!wait) requests. With this, cpuset confinement applies to all requests except ALLOC_NO_WATERMARKS, so that in a subsequent cleanup patch, I can remove the ALLOC_CPUSET flag entirely. Since I don't know any real reason this logic has to be either way, I am choosing the path of the simplest code. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-03[PATCH] add AOP_TRUNCATED_PAGE, prepend AOP_ to WRITEPAGE_ACTIVATEZach Brown4-31/+61
readpage(), prepare_write(), and commit_write() callers are updated to understand the special return code AOP_TRUNCATED_PAGE in the style of writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that the callee has unlocked the page and that the operation should be tried again with a new page. OCFS2 uses this to detect and work around a lock inversion in its aop methods. There should be no change in behaviour for methods that don't return AOP_TRUNCATED_PAGE. WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are made enums so that kerneldoc can be used to document their semantics. Signed-off-by: Zach Brown <zach.brown@oracle.com>
2006-01-02[PATCH] Make sure interleave masks have at least one node setAndi Kleen1-0/+4
Otherwise a bad mem policy system call can confuse the interleaving code into referencing undefined nodes. Originally reported by Doug Chapman I was told it's CVE-2005-3358 (one has to love these security people - they make everything sound important) Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-16Make sure we copy pages inserted with "vm_insert_page()" on forkLinus Torvalds3-3/+4
The logic that decides that a fork() might be able to avoid copying a VM area when it can be re-created by page faults didn't know about the new vm_insert_page() case. Also make some things a bit more anal wrt VM_PFNMAP. Pointed out by Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-15[PATCH] missing prototype (mm/page_alloc.c)Al Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-13[PATCH] Fix calculation of grow_pgdat_span() in mm/memory_hotplug.cYasunori Goto1-1/+1
The calculation for node_spanned_pages at grow_pgdat_span() is clearly wrong. This is patch for it. (Please see grow_zone_span() to compare. It is correct.) Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12get_user_pages: don't try to follow PFNMAP pagesLinus Torvalds1-1/+1
Nick Piggin points out that a few drivers play games with VM_IO (why? who knows..) and thus a pfn-remapped area may not have that bit set even if remap_pfn_range() set it originally. So make it explicit in get_user_pages() that we don't follow VM_PFNMAP pages, since pretty much by definition they do not have a "struct page" associated with them. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12[PATCH] fix in __alloc_bootmem_core() when there is no free page in first ↵Haren Myneni1-0/+2
node's memory Hitting BUG_ON() in __alloc_bootmem_core() when there is no free page available in the first node's memory. For the case of kdump on PPC64 (Power 4 machine), the captured kernel is used two memory regions - memory for TCE tables (tce-base and tce-size at top of RAM and reserved) and captured kernel memory region (crashk_base and crashk_size). Since we reserve the memory for the first node, we should be returning from __alloc_bootmem_core() to search for the next node (pg_dat). Currently, find_next_zero_bit() is returning the n^th bit (eidx) when there is no free page. Then, test_bit() is failed since we set 0xff only for the actual size initially (init_bootmem_core()) even though rounded up to one page for bdata->node_bootmem_map. We are hitting the BUG_ON after failing to enter second "for" loop. Signed-off-by: Haren Myneni <haren@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-11Allow arbitrary read-only shared pfn-remapping tooLinus Torvalds1-3/+8
The VM layer (for historical reasons) turns a read-only shared mmap into a private-like mapping with the VM_MAYWRITE bit clear. Thus checking just VM_SHARED isn't actually sufficient. So use a trivial helper function for the cases where we wanted to inquire if a mapping was COW-like or not. Moo! Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-11Remove (at least temporarily) the "incomplete PFN mapping" supportLinus Torvalds1-45/+1
With the previous commit, we can handle arbitrary shared re-mappings even without this complexity, and since the only known private mappings are for strange users of /dev/mem (which never create an incomplete one), there seems to be no reason to support it. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-11Allow arbitrary shared PFNMAP'sLinus Torvalds1-4/+12
A shared mapping doesn't cause COW-pages, so we don't need to worry about the whole vm_pgoff logic to decide if a PFN-remapped page has gone through COW or not. This makes it possible to entirely avoid the special "partial remapping" logic for the common case. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-03Make vm_insert_page() available to NVidia moduleLinus Torvalds1-1/+1
It used to use remap_pfn_range(), which wasn't GPL-only either, and the new interface is actually simpler and does more checking, so we shouldn't unnecessarily discourage people from switching over. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-03[PATCH] Fix up per-cpu page batch sizesNick Piggin1-8/+8
The code to clamp batch sizes to 2^n - 1 went missing and an extra check got added, which must have been a hunk of the "higer order pcp batch refills" work sneaking in. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-30VM: add "vm_insert_page()" functionLinus Torvalds1-2/+34
This is what a lot of drivers will actually want to use to insert individual pages into a user VMA. It doesn't have the old PageReserved restrictions of remap_pfn_range(), and it doesn't complain about partial remappings. The page you insert needs to be a nice clean kernel allocation, so you can't insert arbitrary page mappings with this, but that's not what people want. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29[PATCH] VM: Fix typos in get_locked_pteTrond Myklebust1-2/+2
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29[PATCH] pfnmap: do_no_page BUG_ON againHugh Dickins1-1/+3
Use copy_user_highpage directly instead of cow_user_page in do_no_page: in the immediately following page_cache_release, and elsewhere, it is assuming that new_page is normal. If any VM_PFNMAP driver can get to do_no_page, it's just a BUG (but not in the case of do_anonymous_page). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29[PATCH] pfnmap: remove src_page from do_wp_pageHugh Dickins1-4/+3
Clean away do_wp_page's "src_page": cow_user_page makes it unnecessary. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29cow_user_page: fix page alignmentLinus Torvalds1-2/+9
High Dickins points out that the user virtual address passed to the page fault handler isn't necessarily page-aligned. Also, add a comment on why the copy could fail for the user address case. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29VM: add common helper function to create the page tablesLinus Torvalds2-34/+16
This logic was duplicated four times, for no good reason. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29Support strange discontiguous PFN remappingsLinus Torvalds1-0/+92
These get created by some drivers that don't generally even want a pfn remapping at all, but would really mostly prefer to just map pages they've allocated individually instead. For now, create a helper function that turns such an incomplete PFN remapping call into a loop that does that explicit mapping. In the long run we almost certainly want to export a totally different interface for that, though. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29[PATCH] Fix missing pfn variables caused by vm changesBen Collins2-3/+3
I image this showed up because of "unused var..." when the changes occured, because flush_cache_page() is a noop in most places. This showed up for me on parisc however, where flush_cache_page() is a real function. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29[PATCH] Fix vma argument in get_usr_pages() for gate areasNick Piggin1-1/+1
The system call gate area handling called vm_normal_page() with the wrong vma (which was always NULL, and caused an oops). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] shrinker->nr = LONG_MAX means deadlock for icacheAndrea Arcangeli1-3/+15
With Andrew Morton <akpm@osdl.org> The slab scanning code tries to balance the scanning rate of slabs versus the scanning rate of LRU pages. To do this, it retains state concerning how many slabs have been scanned - if a particular slab shrinker didn't scan enough objects, we remember that for next time, and scan more objects on the next pass. The problem with this is that with (say) a huge number of GFP_NOIO direct-reclaim attempts, the number of objects which are to be scanned when we finally get a GFP_KERNEL request can be huge. Because some shrinker handlers just bail out if !__GFP_FS. So the patch clamps the number of objects-to-be-scanned to 2* the total number of objects in the slab cache. Signed-off-by: Andrea Arcangeli <andrea@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] temporarily disable swap token on memory pressureRik van Riel3-21/+26
Some users (hi Zwane) have seen a problem when running a workload that eats nearly all of physical memory - th system does an OOM kill, even when there is still a lot of swap free. The problem appears to be a very big task that is holding the swap token, and the VM has a very hard time finding any other page in the system that is swappable. Instead of ignoring the swap token when sc->priority reaches 0, we could simply take the swap token away from the memory hog and make sure we don't give it back to the memory hog for a few seconds. This patch resolves the problem Zwane ran into. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] mm: __alloc_pages cleanup fixNick Piggin1-7/+17
I believe this patch is required to fix breakage in the asynch reclaim watermark logic introduced by this patch: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7fb1d9fca5c6e3b06773b69165a73f3fb786b8ee Just some background of the watermark logic in case it isn't clear... Basically what we have is this: --- pages_high | | (a) | --- pages_low | | (b) | --- pages_min | | (c) | --- 0 Now when pages_low is reached, we want to kick asynch reclaim, which gives us an interval of "b" before we must start synch reclaim, and gives kswapd an interval of "a" before it need go back to sleep. When pages_min is reached, normal allocators must enter synch reclaim, but PF_MEMALLOC, ALLOC_HARDER, and ALLOC_HIGH (ie. atomic allocations, recursive allocations, etc.) get access to varying amounts of the reserve "c". Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] Workaround for gcc 2.96 (undefined references)Alan Stern1-0/+12
LD .tmp_vmlinux1 mm/built-in.o(.text+0x100d6): In function `copy_page_range': : undefined reference to `__pud_alloc' mm/built-in.o(.text+0x1010b): In function `copy_page_range': : undefined reference to `__pmd_alloc' mm/built-in.o(.text+0x11ef4): In function `__handle_mm_fault': : undefined reference to `__pud_alloc' fs/built-in.o(.text+0xc930): In function `install_arg_page': : undefined reference to `__pud_alloc' make: *** [.tmp_vmlinux1] Error 1 Those missing references in mm/memory.c arise from this code in include/linux/mm.h, combined with the fact that __PGTABLE_PMD_FOLDED and __PGTABLE_PUD_FOLDED are both set and __ARCH_HAS_4LEVEL_HACK is not: /* * The following ifdef needed to get the 4level-fixup.h header to work. * Remove it when 4level-fixup.h has been removed. */ #if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK) static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address) { return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address))? NULL: pud_offset(pgd, address); } static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) { return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))? NULL: pmd_offset(pud, address); } #endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */ With my configuration the pgd_none and pud_none routines are inlines returning a constant 0. Apparently the old compiler avoids generating calls to __pud_alloc and __pmd_alloc but still lists them as undefined references in the module's symbol table. I don't know which change caused this problem. I think it was added somewhere between 2.6.14 and 2.6.15-rc1, because I remember building several 2.6.14-rc kernels without difficulty. However I can't point to an individual culprit. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28mm: re-architect the VM_UNPAGED logicLinus Torvalds7-135/+118
This replaces the (in my opinion horrible) VM_UNMAPPED logic with very explicit support for a "remapped page range" aka VM_PFNMAP. It allows a VM area to contain an arbitrary range of page table entries that the VM never touches, and never considers to be normal pages. Any user of "remap_pfn_range()" automatically gets this new functionality, and doesn't even have to mark the pages reserved or indeed mark them any other way. It just works. As a side effect, doing mmap() on /dev/mem works for arbitrary ranges. Sparc update from David in the next commit. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23[PATCH] 32bit integer overflow in invalidate_inode_pages2()Oleg Drokin1-3/+3
Fix a 32 bit integer overflow in invalidate_inode_pages2_range. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23[PATCH] mm: update split ptlock KconfigHugh Dickins1-4/+2
Closer attention to the arithmetic shows that neither ppc64 nor sparc really uses one page for multiple page tables: how on earth could they, while pte_alloc_one returns just a struct page pointer, with no offset? Well, arm26 manages it by returning a pte_t pointer cast to a struct page pointer, harumph, then compensating in its pmd_populate. But arm26 is never SMP, so it's not a problem for split ptlock either. And the PA-RISC situation has been recently improved: CONFIG_PA20 works without the 16-byte alignment which inflated its spinlock_t. But the current union of spinlock_t with private does make the 7xxx struct page significantly larger, even without debug, so disable its split ptlock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] hugetlb: fix race in set_max_huge_pages for multiple updaters of ↵Eric Paris1-0/+6
nr_huge_pages If there are multiple updaters to /proc/sys/vm/nr_hugepages simultaneously it is possible for the nr_huge_pages variable to become incorrect. There is no locking in the set_max_huge_pages function around alloc_fresh_huge_page which is able to update nr_huge_pages. Two callers to alloc_fresh_huge_page could race against each other as could a call to alloc_fresh_huge_page and a call to update_and_free_page. This patch just expands the area covered by the hugetlb_lock to cover the call into alloc_fresh_huge_page. I'm not sure how we could say that a sysctl section is performance critical where more specific locking would be needed. My reproducer was to run a couple copies of the following script simultaneously while [ true ]; do echo 1000 > /proc/sys/vm/nr_hugepages echo 500 > /proc/sys/vm/nr_hugepages echo 750 > /proc/sys/vm/nr_hugepages echo 100 > /proc/sys/vm/nr_hugepages echo 0 > /proc/sys/vm/nr_hugepages done and then watch /proc/meminfo and eventually you will see things like HugePages_Total: 100 HugePages_Free: 109 After applying the patch all seemed well. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: PG_reserved bad_pageHugh Dickins1-12/+34
It used to be the case that PG_reserved pages were silently never freed, but in 2.6.15-rc1 they may be freed with a "Bad page state" message. We should work through such cases as they appear, fixing the code; but for now it's safer to issue the message without freeing the page, leaving PG_reserved set. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: ZERO_PAGE in VM_UNPAGEDHugh Dickins1-2/+12
It's strange enough to be looking out for anonymous pages in VM_UNPAGED areas, let's not insert the ZERO_PAGE there - though whether it would matter will depend on what we decide about ZERO_PAGE refcounting. But whereas do_anonymous_page may (exceptionally) be called on a VM_UNPAGED area, do_no_page should never be: just BUG_ON. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: anon in VM_UNPAGEDHugh Dickins2-24/+46
copy_one_pte needs to copy the anonymous COWed pages in a VM_UNPAGED area, zap_pte_range needs to free them, do_wp_page needs to COW them: just like ordinary pages, not like the unpaged. But recognizing them is a little subtle: because PageReserved is no longer a condition for remap_pfn_range, we can now mmap all of /dev/mem (whether the distro permits, and whether it's advisable on this or that architecture, is another matter). So if we can see a PageAnon, it may not be ours to mess with (or may be ours from elsewhere in the address space). I suspect there's an entertaining insoluble self-referential problem here, but the page_is_anon function does a good practical job, and MAP_PRIVATE PROT_WRITE VM_UNPAGED will always be an odd choice. In updating the comment on page_address_in_vma, noticed a potential NULL dereference, in a path we don't actually take, but fixed it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: COW on VM_UNPAGEDHugh Dickins1-12/+25
Remove the BUG_ON(vma->vm_flags & VM_UNPAGED) from do_wp_page, and let it do Copy-On-Write without touching the VM_UNPAGED's page counts - but this is incomplete, because the anonymous page it inserts will itself need to be handled, here and in other functions - next patch. We still don't copy the page if the pfn is invalid, because the copy_user_highpage interface does not allow it. But that's not been a problem in the past: can be added in later if the need arises. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: VM_NONLINEAR VM_RESERVEDHugh Dickins2-14/+7
There's one peculiar use of VM_RESERVED which the previous patch left behind: because VM_NONLINEAR's try_to_unmap_cluster uses vm_private_data as a swapout cursor, but should never meet VM_RESERVED vmas, it was a way of extending VM_NONLINEAR to VM_RESERVED vmas using vm_private_data for some other purpose. But that's an empty set - they don't have the populate function required. So just throw away those VM_RESERVED tests. But one more interesting in rmap.c has to go too: try_to_unmap_one will want to swap out an anonymous page from VM_RESERVED or VM_UNPAGED area. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: VM_UNPAGEDHugh Dickins5-18/+24
Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few drivers set VM_RESERVED on areas which are then populated by nopage. The PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free pages in zap_pte_range, without changing those drivers not to set it: so their pages just leak away. Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core, to flag the special areas where the ptes may have no struct page, or if they have then it's not to be touched. Replace most instances of VM_RESERVED in core mm by VM_UNPAGED. Force it on in remap_pfn_range, and the sparc and sparc64 io_remap_pfn_range. Revert addition of VM_RESERVED to powerpc vdso, it's not needed there. Is it needed anywhere? It still governs the mm->reserved_vm statistic, and special vmas not to be merged, and areas not to be core dumped; but could probably be eliminated later (the drivers are probably specifying it because in 2.4 it kept swapout off the vma, but in 2.6 we work from the LRU, which these pages don't get on). Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no purpose whatsoever, and should be removed from drivers when we clean up. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: unifdefed PageCompoundHugh Dickins2-8/+0
It looks like snd_xxx is not the only nopage to be using PageReserved as a way of holding a high-order page together: which no longer works, but is masked by our failure to free from VM_RESERVED areas. We cannot fix that bug without first substituting another way to hold the high-order page together, while farming out the 0-order pages from within it. That's just what PageCompound is designed for, but it's been kept under CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out- of-line put_page), doesn't slow down what most needs to be fast (already using hugetlb), and unifies the way we handle high-order pages. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: private write VM_RESERVEDHugh Dickins2-19/+0
The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message when you tried to mmap or mprotect MAP_PRIVATE PROT_WRITE a VM_RESERVED, and failed with -EACCES: because do_wp_page lacks the refinement to COW pages in those areas, nor do we expect to find anonymous pages in them; and it seemed just bloat to add code for handling such a peculiar case. But immediately it caused vbetool and ddcprobe (using lrmi) to fail. So revert the "deprecated" messages, letting mmap and mprotect succeed. But leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place until we've added the code to do it right: so this particular patch is only good if the app doesn't really need to write to that private area. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: get_user_pages VM_RESERVEDHugh Dickins1-1/+1
The PageReserved removal in 2.6.15-rc1 prohibited get_user_pages on the areas flagged VM_RESERVED in place of PageReserved. That is correct in theory - we ought not to interfere with struct pages in such a reserved area; but in practice it broke BTTV for one. So revert to prohibiting only on VM_IO: if someone gets into trouble with get_user_pages on VM_RESERVED, it'll just be a "don't do that". You can argue that videobuf_mmap_mapper shouldn't set VM_RESERVED in the first place, but now's not the time for breaking drivers without notice. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-18Merge branch 'master'Kyle McMartin1-0/+1
2005-11-18[PARISC] Fix compile warning caused by conflicting types of expand_upwards()Matthew Wilcox1-1/+1
Fix compile warning caused by conflicting types of expand_upwards. IA64 requires it to not be static inline, as it's used outside mm/mmap.c Signed-off-by: Matthew Wilcox <willy@parisc-linux.org> Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
2005-11-18[PATCH] re-export clear_page_dirty_for_io()Hans Reiser1-0/+1
2.6.14 has this exported, and reiser4 (at least) uses it. Put things back the way they were. Signed-off-by: Vladimir V. Saveliev <vs@namesys.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-17[PATCH] VM: fix zone list restart in page allocatateJens Axboe1-3/+4
We must reassign z before looping through the zones kicking kswapd, since it will be NULL if we hit an OOM condition and jump back to the beginning again. 'z' is initially assigned before the restart: label. So move the restart label up a little. Signed-off-by: Jens Axboe <axboe@suse.de>
2005-11-14Merge x86-64 update from AndiLinus Torvalds2-7/+15
2005-11-14[PATCH] x86_64: Remove obsolete ARCH_HAS_ATOMIC_UNSIGNED and page_flags_tAndi Kleen2-2/+2
Has been introduced for x86-64 at some point to save memory in struct page, but has been obsolete for some time. Just remove it. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-14[PATCH] x86_64: When cpu_up fails clean up page allocator properlyAndi Kleen1-2/+1
Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-14[PATCH] x86_64: Add 4GB DMA32 zoneAndi Kleen1-3/+12
Add a new 4GB GFP_DMA32 zone between the GFP_DMA and GFP_NORMAL zones. As a bit of historical background: when the x86-64 port was originally designed we had some discussion if we should use a 16MB DMA zone like i386 or a 4GB DMA zone like IA64 or both. Both was ruled out at this point because it was in early 2.4 when VM is still quite shakey and had bad troubles even dealing with one DMA zone. We settled on the 16MB DMA zone mainly because we worried about older soundcards and the floppy. But this has always caused problems since then because device drivers had trouble getting enough DMA able memory. These days the VM works much better and the wide use of NUMA has proven it can deal with many zones successfully. So this patch adds both zones. This helps drivers who need a lot of memory below 4GB because their hardware is not accessing more (graphic drivers - proprietary and free ones, video frame buffer drivers, sound drivers etc.). Previously they could only use IOMMU+16MB GFP_DMA, which was not enough memory. Another common problem is that hardware who has full memory addressing for >4GB misses it for some control structures in memory (like transmit rings or other metadata). They tended to allocate memory in the 16MB GFP_DMA or the IOMMU/swiotlb then using pci_alloc_consistent, but that can tie up a lot of precious 16MB GFPDMA/IOMMU/swiotlb memory (even on AMD systems the IOMMU tends to be quite small) especially if you have many devices. With the new zone pci_alloc_consistent can just put this stuff into memory below 4GB which works better. One argument was still if the zone should be 4GB or 2GB. The main motivation for 2GB would be an unnamed not so unpopular hardware raid controller (mostly found in older machines from a particular four letter company) who has a strange 2GB restriction in firmware. But that one works ok with swiotlb/IOMMU anyways, so it doesn't really need GFP_DMA32. I chose 4GB to be compatible with IA64 and because it seems to be the most common restriction. The new zone is so far added only for x86-64. For other architectures who don't set up this new zone nothing changes. Architectures can set a compatibility define in Kconfig CONFIG_DMA_IS_DMA32 that will define GFP_DMA32 as GFP_DMA. Otherwise it's a nop because on 32bit architectures it's normally not needed because GFP_NORMAL (=0) is DMA able enough. One problem is still that GFP_DMA means different things on different architectures. e.g. some drivers used to have #ifdef ia64 use GFP_DMA (trusting it to be 4GB) #elif __x86_64__ (use other hacks like the swiotlb because 16MB is not enough) ... . This was quite ugly and is now obsolete. These should be now converted to use GFP_DMA32 unconditionally. I haven't done this yet. Or best only use pci_alloc_consistent/dma_alloc_coherent which will use GFP_DMA32 transparently. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] slab: remove alloc_pages() callsChristoph Lameter1-5/+1
The slab allocator never uses alloc_pages since kmem_getpages() is always called with a valid nodeid. Remove the branch and the code from kmem_getpages() Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] slab: convert cache to page mapping macrosPekka Enberg1-17/+32
This patch converts object cache <-> page mapping macros to static inline functions to make the more explicit and readable. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] mm: highmem watermarksNick Piggin1-13/+14
The pages_high - pages_low and pages_low - pages_min deltas are the asynch reclaim watermarks. As such, the should be in the same ratios as any other zone for highmem zones. It is the pages_min - 0 delta which is the PF_MEMALLOC reserve, and this is the region that isn't very useful for highmem. This patch ensures highmem systems have similar characteristics as non highmem ones with the same amount of memory, and also that highmem zones get similar reclaim pressures to other zones. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] mm: __alloc_pages cleanupRohit Seth2-113/+88
Clean up of __alloc_pages. Restoration of previous behaviour, plus further cleanups by introducing an 'alloc_flags', removing the last of should_reclaim_zone. Signed-off-by: Rohit Seth <rohit.seth@intel.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] mm: ZAP_BLOCK causes redundant workRobin Holt1-34/+55
The address based work estimate for unmapping (for lockbreak) is and always was horribly inefficient for sparse mappings. The problem is most simply explained with an example: If we find a pgd is clear, we still have to call into unmap_page_range PGDIR_SIZE / ZAP_BLOCK_SIZE times, each time checking the clear pgd, in order to progress the working address to the next pgd. The fundamental way to solve the problem is to keep track of the end address we've processed and pass it back to the higher layers. From: Nick Piggin <npiggin@suse.de> Modification to completely get away from address based work estimate and instead use an abstract count, with a very small cost for empty entries as opposed to present pages. On 2.6.14-git2, ppc64, and CONFIG_PREEMPT=y, mapping and unmapping 1TB of virtual address space takes 1.69s; with the following patch applied, this operation can be done 1000 times in less than 0.01s From: Andrew Morton <akpm@osdl.org> With CONFIG_HUTETLB_PAGE=n: mm/memory.c: In function `unmap_vmas': mm/memory.c:779: warning: division by zero Due to zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE); So make the dummy HPAGE_SIZE non-zero Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] mm: __GFP_NOFAIL fixKirill Korotaev1-0/+5
In __alloc_pages(): if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) { /* go through the zonelist yet again, ignoring mins */ for (i = 0; zones[i] != NULL; i++) { struct zone *z = zones[i]; page = buffered_rmqueue(z, order, gfp_mask); if (page) { zone_statistics(zonelist, z); goto got_pg; } } goto nopage; <<<< HERE!!! FAIL... } kswapd (which has PF_MEMALLOC flag) can fail to allocate memory even when it allocates it with __GFP_NOFAIL flag. Signed-Off-By: Pavel Emelianov <xemul@sw.ru> Signed-Off-By: Denis Lunev <den@sw.ru> Signed-Off-By: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds1-1/+1
2005-11-10[PATCH] Don't print per-cpu vm stats for offline cpus.Dave Jones1-1/+1
I just hit a page allocation error on a kernel configured to support 64 CPUs. It spewed 60 completely useless unnecessary lines of info. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-08mm/slab.c: fix a comment typoAdrian Bunk1-1/+1
2005-11-07[PATCH] mm/swap_state.c: unexport swapper_spaceAdrian Bunk1-1/+0
I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] mm/swapfile.c: unexport total_swap_pagesAdrian Bunk1-2/+0
I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] mm/swap.c: unexport vm_acct_memoryAdrian Bunk1-1/+0
I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] unexport nr_swap_pagesAdrian Bunk1-1/+0
I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] unexport clear_page_dirty_for_ioAdrian Bunk1-1/+0
I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] unexport hugetlb_total_pagesAdrian Bunk1-1/+0
I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] mm/{mmap,nommu}.c: several unexportsAdrian Bunk2-8/+0
I didn't find any possible modular usage in the kernel. This patch was already ACK'ed by Christoph Hellwig. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] kernel-doc: fix warnings in vmalloc.cRandy Dunlap1-2/+2
Fix new kernel-doc errors in vmalloc.c. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] more kernel-doc cleanups, additionsRandy Dunlap1-0/+1
Various core kernel-doc cleanups: - add missing function parameters in ipc, irq/manage, kernel/sys, kernel/sysctl, and mm/slab; - move description to just above function for kernel_restart() Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] readahead commentaryAndrew Morton1-9/+22
Add a few comments surrounding the generic readahead API. Also convert some ulongs into pgoff_t: the identifier for PAGE_CACHE_SIZE offsets into pagecache. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] slab: Use same schedule timeout for all cpus in cache_reapManfred Spraul1-2/+2
Chen noticed that cache_reap uses REAPTIMEOUT_CPUC+smp_processor_id() as the timeout for rescheduling. The "+smp_processor_id()" part is wrong, the timeout should be identical for all cpus: start_cpu_timer already adds a cpu dependant offset to avoid any clustering. The attached patch removes smp_processor_id(). Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] mm: rename kmem_cache_s to kmem_cachePekka J Enberg1-1/+1
This patch renames struct kmem_cache_s to kmem_cache so we can start using it instead of kmem_cache_t typedef. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] slab: don't BUG on duplicated cacheAndrew Morton1-33/+34
slab presently goes BUG if someone tries to register an already-registered cache. But this can happen if the user accidentally loads a module which is already statically linked into the kernel. Nuking the kernel is rather a harsh reaction. Change it into a warning, and just fail the kmem_cache_alloc() attempt. If the module is well-behaved, the modprobe will fail and all is well. Notes: - Swaps the ranking of cache_chain_sem and lock_cpu_hotplug(). Doesn't seem important. Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] Suppress split ptlock on arches which may use one page for multiple ↵Hugh Dickins1-0/+2
page tables Suppress split ptlock on arches which may use one page for multiple page tables. Reconsider what better to do (particularly on ppc64) later on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-06[PATCH] ppc64: support 64k pagesBenjamin Herrenschmidt1-0/+3
Adds a new CONFIG_PPC_64K_PAGES which, when enabled, changes the kernel base page size to 64K. The resulting kernel still boots on any hardware. On current machines with 4K pages support only, the kernel will maintain 16 "subpages" for each 64K page transparently. Note that while real 64K capable HW has been tested, the current patch will not enable it yet as such hardware is not released yet, and I'm still verifying with the firmware architects the proper to get the information from the newer hypervisors. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-01Export __pagevec_release and pagevec_lookup_tagSteve French1-0/+3
These are needed to implement cifs_writepages Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2005-10-30[PATCH] Error checks omitted in init_tmpfs() in mm/tiny-shmem.cMatt Mackall1-1/+4
From: Hareesh Nagarajan <hnagar2@gmail.com> Signed-off-by: Hareesh Nagarajan <hnagar2@gmail.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] RCU torture-testing kernel modulePaul E. McKenney1-1/+1
This patch is a rewrite of the one submitted on October 1st, using modules (http://marc.theaimsgroup.com/?l=linux-kernel&m=112819093522998&w=2). This rewrite adds a tristate CONFIG_RCU_TORTURE_TEST, which enables an intense torture test of the RCU infratructure. This is needed due to the continued changes to the RCU infrastructure to accommodate dynamic ticks, CPU hotplug, realtime, and so on. Most of the code is in a separate file that is compiled only if the CONFIG variable is set. Documentation on how to run the test and interpret the output is also included. This code has been tested on i386 and ppc64, and an earlier version of the code has received extensive testing on a number of architectures as part of the PREEMPT_RT patchset. Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] vm: remove redundant assignment from __pagevec_release_nonlru()Tejun Heo1-1/+0
This patch removes redundant assignment from __pagevec_release_nonlru(). pages_to_free.cold is set to pvec->cold by pagevec_init() call right above the assignment. Signed-off-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] fs: error case fix in __generic_file_aio_readTejun Heo1-2/+2
When __generic_file_aio_read() hits an error during reading, it reports the error iff nothing has successfully been read yet. This is condition - when an error occurs, if nothing has been read/written, report the error code; otherwise, report the amount of bytes successfully transferred upto that point. This corner case can be exposed by performing readv(2) with the following iov. iov[0] = len0 @ ptr0 iov[1] = len1 @ NULL (or any other invalid pointer) iov[2] = len2 @ ptr2 When file size is enough, performing above readv(2) results in len0 bytes from file_pos @ ptr0 len2 bytes from file_pos + len0 @ ptr2 And the return value is len0 + len2. Test program is attached to this mail. This patch makes __generic_file_aio_read()'s error handling identical to other functions. #include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <sys/uio.h> #include <errno.h> #include <string.h> int main(int argc, char **argv) { const char *path; struct stat stbuf; size_t len0, len1; void *buf0, *buf1; struct iovec iov[3]; int fd, i; ssize_t ret; if (argc < 2) { fprintf(stderr, "Usage: testreadv path (better be a " "small text file)\n"); return 1; } path = argv[1]; if (stat(path, &stbuf) < 0) { perror("stat"); return 1; } len0 = stbuf.st_size / 2; len1 = stbuf.st_size - len0; if (!len0 || !len1) { fprintf(stderr, "Dude, file is too small\n"); return 1; } if ((fd = open(path, O_RDONLY)) < 0) { perror("open"); return 1; } if (!(buf0 = malloc(len0)) || !(buf1 = malloc(len1))) { perror("malloc"); return 1; } memset(buf0, 0, len0); memset(buf1, 0, len1); iov[0].iov_base = buf0; iov[0].iov_len = len0; iov[1].iov_base = NULL; iov[1].iov_len = len1; iov[2].iov_base = buf1; iov[2].iov_len = len1; printf("vector "); for (i = 0; i < 3; i++) printf("%p:%zu ", iov[i].iov_base, iov[i].iov_len); printf("\n"); ret = readv(fd, iov, 3); if (ret < 0) perror("readv"); printf("readv returned %zd\nbuf0 = [%s]\nbuf1 = [%s]\n", ret, (char *)buf0, (char *)buf1); return 0; } Signed-off-by: Tejun Heo <htejun@gmail.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] cpusets: automatic numa mempolicy rebindingPaul Jackson1-0/+64
This patch automatically updates a tasks NUMA mempolicy when its cpuset memory placement changes. It does so within the context of the task, without any need to support low level external mempolicy manipulation. If a system is not using cpusets, or if running on a system with just the root (all-encompassing) cpuset, then this remap is a no-op. Only when a task is moved between cpusets, or a cpusets memory placement is changed does the following apply. Otherwise, the main routine below, rebind_policy() is not even called. When mixing cpusets, scheduler affinity, and NUMA mempolicies, the essential role of cpusets is to place jobs (several related tasks) on a set of CPUs and Memory Nodes, the essential role of sched_setaffinity is to manage a jobs processor placement within its allowed cpuset, and the essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs memory placement within its allowed cpuset. However, CPU affinity and NUMA memory placement are managed within the kernel using absolute system wide numbering, not cpuset relative numbering. This is ok until a job is migrated to a different cpuset, or what's the same, a jobs cpuset is moved to different CPUs and Memory Nodes. Then the CPU affinity and NUMA memory placement of the tasks in the job need to be updated, to preserve their cpuset-relative position. This can be done for CPU affinity using sched_setaffinity() from user code, as one task can modify anothers CPU affinity. This cannot be done from an external task for NUMA memory placement, as that can only be modified in the context of the task using it. However, it easy enough to remap a tasks NUMA mempolicy automatically when a task is migrated, using the existing cpuset mechanism to trigger a refresh of a tasks memory placement after its cpuset has changed. All that is needed is the old and new nodemask, and notice to the task that it needs to rebind its mempolicy. The tasks mems_allowed has the old mask, the tasks cpuset has the new mask, and the existing cpuset_update_current_mems_allowed() mechanism provides the notice. The bitmap/cpumask/nodemask remap operators provide the cpuset relative calculations. This patch leaves open a couple of issues: 1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies: These mempolicies may reference nodes outside of those allowed to the current task by its cpuset. Tasks are migrated as part of jobs, which reside on what might be several cpusets in a subtree. When such a job is migrated, all NUMA memory policy references to nodes within that cpuset subtree should be translated, and references to any nodes outside that subtree should be left untouched. A future patch will provide the cpuset mechanism needed to mark such subtrees. With that patch, we will be able to correctly migrate these other memory policies across a job migration. 2) Updating cpuset, affinity and memory policies in user space: This is harder. Any placement state stored in user space using system-wide numbering will be invalidated across a migration. More work will be required to provide user code with a migration-safe means to manage its cpuset relative placement, while preserving the current API's that pass system wide numbers, not cpuset relative numbers across the kernel-user boundary. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] cpusets: confine pdflush to its cpusetPaul Jackson1-0/+13
This patch keeps pdflush daemons on the same cpuset as their parent, the kthread daemon. Some large NUMA configurations put as much as they can of kernel threads and other classic Unix load in what's called a bootcpuset, keeping the rest of the system free for dedicated jobs. This effort is thwarted by pdflush, which dynamically destroys and recreates pdflush daemons depending on load. It's easy enough to force the originally created pdflush deamons into the bootcpuset, at system boottime. But the pdflush threads created later were allowed to run freely across the system, due to the necessary line in their startup kthread(): set_cpus_allowed(current, CPU_MASK_ALL); By simply coding pdflush to start its threads with the cpus_allowed restrictions of its cpuset (inherited from kthread, its parent) we can ensure that dynamically created pdflush threads are also kept in the bootcpuset. On systems w/o cpusets, or w/o a bootcpuset implementation, the following will have no affect, leaving pdflush to run on any CPU, as before. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] ext3: Fix unmapped buffers in transaction's listsJan Kara1-10/+1
Fix the problem (BUG 4964) with unmapped buffers in transaction's t_sync_data list. The problem is we need to call filesystem's own invalidatepage() from block_write_full_page(). block_write_full_page() must call filesystem's invalidatepage(). Otherwise following nasty race can happen: proc 1 proc 2 ------ ------ - write some new data to 'offset' => bh gets to the transactions data list - starts truncate => i_size set to new size - mpage_writepages() - ext3_ordered_writepage() to 'offset' - block_write_full_page() - page->index > end_index+1 - block_invalidatepage() - discard_buffer() - clear_buffer_mapped() - commit triggers and finds unmapped buffer - BOOM! Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm/filemap.c:filemap_populate(): move export.Nikita Danilov1-1/+1
move EXPORT_SYMBOL(filemap_populate) to the proper place: just after function itself: it's easy to miss that function is exported otherwise. Signed-off-by: Nikita Danilov <nikita@clusterfs.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: wider use of for_each_*cpu()John Hawkes1-4/+1
In 'mm' change the explicit use of a for-loop using NR_CPUS into the general for_each_cpu() constructs. This widens the scope of potential future optimizations of the general constructs, as well as takes advantage of the existing optimizations of first_cpu() and next_cpu(), which is advantageous when the true CPU count is much smaller than NR_CPUS. Signed-off-by: John Hawkes <hawkes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] Remove policy contextualization from mbindChristoph Lameter1-1/+1
Policy contextualization is only useful for task based policies and not for vma based policies. It may be useful to define allowed nodes that are not accessible from this thread because other threads may have access to these nodes. Without this patch strange memory policy situations may cause an application to fail with out of memory. Example: Let's say we have two threads A and B that share the same address space and a huge array computational array X. Thread A is restricted by its cpuset to nodes 0 and 1 and thread B is restricted by its cpuset to nodes 2 and 3. Thread A now wants to restrict allocations to the first node and thus applies a BIND policy on X to node 0 and 2. The cpuset limits this to node 0. Thus pages for X must be allocated on node 0 now. Thread B now touches a page that has never been used in X and faults in a page. According to the BIND policy of the vma for X the page must be allocated on page 0. However, the cpuset of B does not allow allocation on 0 and 1. Now the application fails in alloc_pages with out of memory. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] Implement sys_* do_* layering in the memory policy layer.Christoph Lameter1-114/+162
- Do a separation between do_xxx and sys_xxx functions. sys_xxx functions take variable sized bitmaps from user space as arguments. do_xxx functions take fixed sized nodemask_t as arguments and may be used from inside the kernel. Doing so simplifies the initialization code. There is no fs = kernel_ds assumption anymore. - Split up get_nodes into get_nodes (which gets the node list) and contextualize_policy which restricts the nodes to those accessible to the task and updates cpusets. - Add comments explaining limitations of bind policy Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] memory hotplug: call setup_per_zone_pages_min after hotplugDave Hansen1-0/+2
From: IWAMOTO Toshihiro <iwamoto@valinux.co.jp> > I found the tests does not work well with Dave's patchset. > I've found the followings: > > - setup_per_zone_pages_min() calls should be added in > capture_page_range() and online_pages() > - lru_add_drain() should be called before try_to_migrate_pages() The following patch deals with the first item. Signed-off-by: IWAMOTO Toshihiro <iwamoto@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] memory hotplug: move section_mem_map alloc to sparse.cDave Hansen2-50/+72
This basically keeps up from having to extern __kmalloc_section_memmap(). The vaddr_in_vmalloc_area() helper could go in a vmalloc header, but that header gets hard to work with, because it needs some arch-specific macros. Just stick it in here for now, instead of creating another header. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Lion Vollnhals <webmaster@schiggl.de> Signed-off-by: Jiri Slaby <xslaby@fi.muni.cz> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] memory hotplug: sysfs and add/remove functionsDave Hansen4-3/+189
This adds generic memory add/remove and supporting functions for memory hotplug into a new file as well as a memory hotplug kernel config option. Individual architecture patches will follow. For now, disable memory hotplug when swsusp is enabled. There's a lot of churn there right now. We'll fix it up properly once it calms down. Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] memory hotplug locking: zone span seqlockDave Hansen1-5/+14
See the "fixup bad_range()" patch for more information, but this actually creates a the lock to protect things making assumptions about a zone's size staying constant at runtime. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] memory hotplug locking: node_size_lockDave Hansen1-0/+1
pgdat->node_size_lock is basically only neeeded in one place in the normal code: show_mem(), which is the arch-specific sysrq-m printing function. Strictly speaking, the architectures not doing memory hotplug do no need this locking in show_mem(). However, they are all included for completeness. This should also make any future consolidation of all of the implementations a little more straightforward. This lock is also held in the sparsemem code during a memory removal, as sections are invalidated. This is the place there pfn_valid() is made false for a memory area that's being removed. The lock is only required when doing pfn_valid() operations on memory which the user does not already have a reference on the page, such as in show_mem(). Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] memory hotplug prep: fixup bad_range()Dave Hansen1-5/+21
When doing memory hotplug operations, the size of existing zones can obviously change. This means that zone->zone_{start_pfn,spanned_pages} can change. There are currently no locks that protect these structure members. However, they are rarely accessed at runtime. Outside of swsusp, the only place that I can find is bad_range(). So, split bad_range() up into two pieces: one that needs to be locked and anther that doesn't. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] memory hotplug prep: __section_nr helperDave Hansen1-0/+25
A little helper that we use in the hotplug code. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] memory hotplug prep: break out zone initializationDave Hansen1-40/+58
If a zone is empty at boot-time and then hot-added to later, it needs to run the same init code that would have been run on it at boot. This patch breaks out zone table and per-cpu-pages functions for use by the hotplug code. You can almost see all of the free_area_init_core() function on one page now. :) Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] .text page fault SMP scalability optimizationAndrea Arcangeli1-4/+16
We had a problem on ppc64 where with more than 4 threads a large system wouldn't scale well while faulting in the .text (most of the time was spent in the kernel despite it was an userland compute intensive app). The reason is the useless overwrite of the same pte from all cpu. I fixed it this way (verified on an older kernel but the forward port is almost identical). This will benefit all archs not just ppc64. Signed-off-by: Andrea Arcangeli <andrea@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] hugetlb: demand fault handlerAdam Litke1-85/+95
Below is a patch to implement demand faulting for huge pages. The main motivation for changing from prefaulting to demand faulting is so that huge page memory areas can be allocated according to NUMA policy. Thanks to consolidated hugetlb code, switching the behavior requires changing only one fault handler. The bulk of the patch just moves the logic from hugelb_prefault() to hugetlb_pte_fault() and find_get_huge_page(). Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: update comments to pte lockHugh Dickins3-10/+9
Updated several references to page_table_lock in common code comments. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: fix rss and mmlist lockingHugh Dickins2-2/+5
A couple of oddities were guarded by page_table_lock, no longer properly guarded when that is split. The mm_counters of file_rss and anon_rss: make those an atomic_t, or an atomic64_t if the architecture supports it, in such a case. Definitions by courtesy of Christoph Lameter: who spent considerable effort on more scalable ways of counting, but found insufficient benefit in practice. And adding an mm with swap to the mmlist for swapoff: the list is well- guarded by its own lock, but the list_empty check now has to be repeated inside it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: split page table lockHugh Dickins12-48/+74
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: follow_page with inner ptlockHugh Dickins2-82/+73
Final step in pushing down common core's page_table_lock. follow_page no longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself; and so no page_table_lock is taken in get_user_pages itself. But get_user_pages (and get_futex_key) do then need follow_page to pin the page for them: take Daniel's suggestion of bitflags to follow_page. Need one for WRITE, another for TOUCH (it was the accessed flag before: vanished along with check_user_page_readable, but surely get_numa_maps is wrong to mark every page it finds as accessed), another for GET. And another, ANON to dispose of untouched_anonymous_page: it seems silly for that to descend a second time, let follow_page observe if there was no page table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED - make_pages_present ought to make readonly anonymous present. Give get_numa_maps a cond_resched while we're there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: kill check_user_page_readableHugh Dickins1-25/+4
check_user_page_readable is a problematic variant of follow_page. It's used only by oprofile's i386 and arm backtrace code, at interrupt time, to establish whether a userspace stackframe is currently readable. This is problematic, because we want to push the page_table_lock down inside follow_page, and later split it; whereas oprofile is doing a spin_trylock on it (in the i386 case, forgotten in the arm case), and needs that to pin perhaps two pages spanned by the stackframe (which might be covered by different locks when we split). I think oprofile is going about this in the wrong way: it doesn't need to know the area is readable (neither i386 nor arm uses read protection of user pages), it doesn't need to pin the memory, it should simply __copy_from_user_inatomic, and see if that succeeds or not. Sorry, but I've not got around to devising the sparse __user annotations for this. Then we can eliminate check_user_page_readable, and return to a single follow_page without the __follow_page variants. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: rmap with inner ptlockHugh Dickins2-63/+58
rmap's page_check_address descend without page_table_lock. First just pte_offset_map in case there's no pte present worth locking for, then take page_table_lock for the full check, and pass ptl back to caller in the same style as pte_offset_map_lock. __xip_unmap, page_referenced_one and try_to_unmap_one use pte_unmap_unlock. try_to_unmap_cluster also. page_check_address reformatted to avoid progressive indentation. No use is made of its one error code, return NULL when it fails. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: xip_unmap ZERO_PAGE fixHugh Dickins1-1/+2
Small fix to the PageReserved patch: the mips ZERO_PAGE(address) depends on address, so __xip_unmap is wrong to initialize page with that before address is initialized; and in fact must re-evaluate it each iteration. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: unmap_vmas with inner ptlockHugh Dickins3-44/+17
Remove the page_table_lock from around the calls to unmap_vmas, and replace the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are now safe to descend without page_table_lock. Don't attempt fancy locking for hugepages, just take page_table_lock in unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor does unmap_vmas have much use for its mm arg now. The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without page_table_lock: if they're implemented at all, they typically come down to flush_cache_range (usually done outside page_table_lock) and flush_tlb_range (which we already audited for the mprotect case). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: unlink vma before pagetablesHugh Dickins2-19/+16
In most places the descent from pgd to pud to pmd to pte holds mmap_sem (exclusively or not), which ensures that free_pgtables cannot be freeing page tables from any level at the same time. But truncation and reverse mapping descend without mmap_sem. No problem: just make sure that a vma is unlinked from its prio_tree (or nonlinear list) and from its anon_vma list, after zapping the vma, but before freeing its page tables. Then neither vmtruncate nor rmap can reach that vma whose page tables are now volatile (nor do they need to reach it, since all its page entries have been zapped by this stage). The i_mmap_lock and anon_vma->lock already serialize this correctly; but the locking hierarchy is such that we cannot take them while holding page_table_lock. Well, we're trying to push that down anyway. So in this patch, move anon_vma_unlink and unlink_file_vma into free_pgtables, at the same time as moving page_table_lock around calls to unmap_vmas. tlb_gather_mmu and tlb_finish_mmu then fall outside the page_table_lock, but we made them preempt_disable and preempt_enable earlier; and a long source audit of all the architectures has shown no problem with removing page_table_lock from them. free_pgtables doesn't need page_table_lock for itself, nor for what it calls; tlb->mm->nr_ptes is usually protected by page_table_lock, but partly by non-exclusive mmap_sem - here it's decremented with exclusive mmap_sem, or mm_users 0. update_hiwater_rss and vm_unacct_memory don't need page_table_lock either. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: pte_offset_map_lock loopsHugh Dickins4-34/+21
Convert those common loops using page_table_lock on the outside and pte_offset_map within to use just pte_offset_map_lock within instead. These all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them. But whereas pte_alloc loops tested with the "atomic" pmd_present, these loops are testing with pmd_none, which on i386 PAE tests both lower and upper halves. That's now unsafe, so add a cast into pmd_none to test only the vital lower half: we lose a little sensitivity to a corrupt middle directory, but not enough to worry about. It appears that i386 and UML were the only architectures vulnerable in this way, and pgd and pud no problem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: page fault handler lockingHugh Dickins1-60/+90
On the page fault path, the patch before last pushed acquiring the page_table_lock down to the head of handle_pte_fault (though it's also taken and dropped earlier when a new page table has to be allocated). Now delete that line, read "entry = *pte" without it, and go off to this or that page fault handler on the basis of this unlocked peek. Usually the handler can proceed without the lock, relying on the subsequent locked pte_same or pte_none test to back out when necessary; though do_wp_page needs the lock immediately, and do_file_page doesn't check (if there's a race, install_page just zaps the entry and reinstalls it). But on those architectures (notably i386 with PAE) whose pte is too big to be read atomically, if SMP or preemption is enabled, do_swap_page and do_file_page might cause irretrievable damage if passed a Frankenstein entry stitched together from unrelated parts. In those configs, "pte_unmap_same" has to take page_table_lock, validate orig_pte still the same, and drop page_table_lock before unmapping, before proceeding. Use pte_offset_map_lock and pte_unmap_unlock throughout the handlers; but lock avoidance leaves more lone maps and unmaps than elsewhere. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: ptd_alloc take ptlockHugh Dickins4-124/+67
Second step in pushing down the page_table_lock. Remove the temporary bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not to hold page_table_lock, whether it's on init_mm or a user mm; take page_table_lock internally to check if a racing task already allocated. Convert their callers from common code. But avoid coming back to change them again later: instead of moving the spin_lock(&mm->page_table_lock) down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which encapsulate the mapping+locking and unlocking+unmapping together, and in the end may use alternatives to the mm page_table_lock itself. These callers all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them; and pte_alloc uses the "atomic" pmd_present to test whether it needs to allocate. It appears that on all arches we can safely descend without page_table_lock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: ptd_alloc inline and outHugh Dickins2-62/+40
It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only calling out-of-line __pud_alloc __pmd_alloc if allocation needed, pte_alloc_map and pte_alloc_kernel are entirely out-of-line. Though it does add a little to kernel size, change them to macros testing inline, calling __pte_alloc or __pte_alloc_kernel to allocate out-of-line. Mark none of them as fastcalls, leave that to CONFIG_REGPARM or not. It also seems more natural for the out-of-line functions to leave the offset calculation and map to the inline, which has to do it anyway for the common case. At least mremap move wants __pte_alloc without _map. Macros rather than inline functions, certainly to avoid the header file issues which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any architectures I haven't built would have other such problems. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: init_mm without ptlockHugh Dickins2-36/+28
First step in pushing down the page_table_lock. init_mm.page_table_lock has been used throughout the architectures (usually for ioremap): not to serialize kernel address space allocation (that's usually vmlist_lock), but because pud_alloc,pmd_alloc,pte_alloc_kernel expect caller holds it. Reverse that: don't lock or unlock init_mm.page_table_lock in any of the architectures; instead rely on pud_alloc,pmd_alloc,pte_alloc_kernel to take and drop it when allocating a new one, to check lest a racing task already did. Similarly no page_table_lock in vmalloc's map_vm_area. Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle user mms, which are converted only by a later patch, for now they have to lock differently according to whether or not it's init_mm. If sources get muddled, there's a danger that an arch source taking init_mm.page_table_lock will be mixed with common source also taking it (or neither take it). So break the rules and make another change, which should break the build for such a mismatch: remove the redundant mm arg from pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13). Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64 used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64 map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free took page_table_lock for no good reason. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: ia64 use expand_upwardsHugh Dickins1-3/+14
ia64 has expand_backing_store function for growing its Register Backing Store vma upwards. But more complete code for this purpose is found in the CONFIG_STACK_GROWSUP part of mm/mmap.c. Uglify its #ifdefs further to provide expand_upwards for ia64 as well as expand_stack for parisc. The Register Backing Store vma should be marked VM_ACCOUNT. Implement the intention of growing it only a page at a time, instead of passing an address outside of the vma to handle_mm_fault, with unknown consequences. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: update_hiwaters just in timeHugh Dickins7-32/+29
update_mem_hiwater has attracted various criticisms, in particular from those concerned with mm scalability. Originally it was called whenever rss or total_vm got raised. Then many of those callsites were replaced by a timer tick call from account_system_time. Now Frank van Maarseveen reports that to be found inadequate. How about this? Works for Frank. Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually by 1): those are hot paths. Do the opposite, update only when about to lower rss (usually by many), or just before final accounting in do_exit. Handle mm->hiwater_vm in the same way, though it's much less of an issue. Demand that whoever collects these hiwater statistics do the work of taking the maximum with rss or total_vm. And there has been no collector of these hiwater statistics in the tree. The new convention needs an example, so match Frank's usage by adding a VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS (High-Water-Mark or High-Water-Memory). There was a particular anomaly during mremap move, that hiwater_vm might be captured too high. A fleeting such anomaly remains, but it's quickly corrected now, whereas before it would stick. What locking? None: if the app is racy then these statistics will be racy, it's not worth any overhead to make them exact. But whenever it suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under page_table_lock (for now) or with preemption disabled (later on): without going to any trouble, minimize the time between reading current values and updating, to minimize those occasions when a racing thread bumps a count up and back down in between. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: zap_pte out of lineHugh Dickins1-10/+9
There used to be just one call to zap_pte, but it shouldn't be inline now there are two. Check for the common case pte_none before calling, and move its rss accounting up into install_page or install_file_pte - which helps the next patch. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: do_mremap current mmHugh Dickins1-9/+9
Cleanup: relieve do_mremap from its surfeit of current->mms. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: do_swap_page race majorHugh Dickins1-3/+1
Small adjustment: do_swap_page should report its !pte_same race as a major fault if it had to read into swap cache, because whatever raced with it will have found page already in cache and reported minor fault. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: zap_pte_range dec rssHugh Dickins1-3/+3
Small adjustment: zap_pte_range decrement its rss counts from 0 then finally add, avoiding negations - we don't have or need a sub_mm_rss. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: copy_one_pte inc rssHugh Dickins1-10/+5
Small adjustment, following Nick's suggestion: it's more straightforward for copy_pte_range to let copy_one_pte do the rss incrementation, than use an index it passed back. Saves a #define, and 16 bytes of .text. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] core remove PageReservedNick Piggin13-104/+165
Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and is be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from kernel/power/swsusp.c, and all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Last real user of PageReserved is swsusp, which uses PageReserved to determine whether a struct page points to valid memory or not. This still needs to be addressed (a generic page_is_ram() should work). A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. Signed-off-by: Nick Piggin <npiggin@suse.de> Refcount bug fix for filemap_xip.c Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: batch updating mm_countersHugh Dickins1-15/+32
tlb_finish_mmu used to batch zap_pte_range's update of mm rss, which may be worthwhile if the mm is contended, and would reduce atomic operations if the counts were atomic. Let zap_pte_range now batch its updates to file_rss and anon_rss, per page-table in case we drop the lock outside; and copy_pte_range batch them too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: rss = file_rss + anon_rssHugh Dickins6-26/+27
I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: tlb_finish_mmu forget rssHugh Dickins1-1/+1
zap_pte_range has been counting the pages it frees in tlb->freed, then tlb_finish_mmu has used that to update the mm's rss. That got stranger when I added anon_rss, yet updated it by a different route; and stranger when rss and anon_rss became mm_counters with special access macros. And it would no longer be viable if we're relying on page_table_lock to stabilize the mm_counter, but calling tlb_finish_mmu outside that lock. Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own business, just decrement the rss mm_counter in zap_pte_range (yes, there was some point to batching the update, and a subsequent patch restores that). And forget the anal paranoia of first reading the counter to avoid going negative - if rss does go negative, just fix that bug. Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use was being made of them. But arm26 alone was actually using the freed, in the way some others use need_flush: give it a need_flush. arm26 seems to prefer spaces to tabs here: respect that. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: tlb_is_full_mm was obscureHugh Dickins1-2/+2
tlb_is_full_mm? What does that mean? The TLB is full? No, it means that the mm's last user has gone and the whole mm is being torn down. And it's an inline function because sparc64 uses a different (slightly better) "tlb_frozen" name for the flag others call "fullmm". And now the ptep_get_and_clear_full macro used in zap_pte_range refers directly to tlb->fullmm, which would be wrong for sparc64. Rather than correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change sparc64 to just use the same poor name as everyone else - is that okay? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: move_page_tables by extentsHugh Dickins1-96/+72
Speeding up mremap's moving of ptes has never been a priority, but the locking will get more complicated shortly, and is already too baroque. Scrap the current one-by-one moving, do an extent at a time: curtailed by end of src and dst pmds (have to use PMD_SIZE: the way pmd_addr_end gets elided doesn't match this usage), and by latency considerations. One nice property of the old method is lost: it never allocated a page table unless absolutely necessary, so you could free empty page tables by mremapping to and fro. Whereas this way, it allocates a dst table wherever there was a src table. I keep diving in to reinstate the old behaviour, then come out preferring not to clutter how it now is. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: page fault handlers tidyupHugh Dickins3-125/+99
Impose a little more consistency on the page fault handlers do_wp_page, do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass their arguments in the same order, called the same names? break_cow is all very well, but what it did was inlined elsewhere: easier to compare if it's brought back into do_wp_page. do_file_page's fallback to do_no_page dates from a time when we were testing pte_file by using it wherever possible: currently it's peculiar to nonlinear vmas, so just check that. BUG_ON if not? Better not, it's probably page table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's use that for do_wp_page's invalid pfn too. Hah! Someone in the ppc64 world noticed pte_ERROR was unused so removed it: restored (and say "pud" not "pmd" in its pud_ERROR). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: exit_mmap need not resetHugh Dickins1-6/+0
exit_mmap resets various mm_struct fields, but the mm is well on its way out, and none of those fields matter by this point. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: unlink_file_vma, remove_vmaHugh Dickins1-14/+27
Divide remove_vm_struct into two parts: first anon_vma_unlink plus unlink_file_vma, to unlink the vma from the list and tree by which rmap or vmtruncate might find it; then remove_vma to close, fput and free. The intention here is to do the anon_vma_unlink and unlink_file_vma earlier, in free_pgtables before freeing any page tables: so we can be sure that any page tables traversed by rmap and vmtruncate are stable (and other, ordinary cases are stabilized by holding mmap_sem). This will be crucial to traversing pgd,pud,pmd without page_table_lock. But testing the split-out patch showed that lifting the page_table_lock is symbiotically necessary to make this change - the lock ordering is wrong to move those unlinks into free_pgtables while it's under ptlock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: remove_vma_list consolidationHugh Dickins1-24/+12
unmap_vma doesn't amount to much, let's put it inside unmap_vma_list. Except it doesn't unmap anything, unmap_region just did the unmapping: rename it to remove_vma_list. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: vm_stat_account unshackledHugh Dickins3-14/+14
The original vm_stat_account has fallen into disuse, with only one user, and only one user of vm_stat_unaccount. It's easier to keep track if we convert them all to __vm_stat_account, then free it from its __shackles. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: anon is already wrprotectedHugh Dickins2-7/+7
do_anonymous_page's pte_wrprotect causes some confusion: in such a case, vm_page_prot must already be forcing COW, so must omit write permission, and so the pte_wrprotect is redundant. Replace it by a comment to that effect, and reword the comment on unuse_pte which also caused confusion. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: zap_pte_range dont dirty anonHugh Dickins1-4/+6
zap_pte_range already avoids wasting time to mark_page_accessed on anon pages: it can also skip anon set_page_dirty - the page only needs to be marked dirty if shared with another mm, but that will say pte_dirty too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: msync_pte_range progressHugh Dickins1-24/+14
Use latency breaking in msync_pte_range like that in copy_pte_range, instead of the ugly CONFIG_PREEMPT filemap_msync alternatives. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: copy_pte_range progress fixHugh Dickins1-6/+8
My latency breaking in copy_pte_range didn't work as intended: instead of checking at regularish intervals, after the first interval it checked every time around the loop, too impatient to be preempted. Fix that. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] slab: add additional debugging to detect slabs from the wrong nodeChristoph Lameter1-1/+4
This patch adds some stack dumps if the slab logic is processing slab blocks from the wrong node. This is necessary in order to detect situations as encountered by Petr. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] shrink_list(): skip anon pages if not may_swapLee Schermerhorn1-1/+3
Martin Hicks' page cache reclaim patch added the 'may_swap' flag to the scan_control struct; and modified shrink_list() not to add anon pages to the swap cache if may_swap is not asserted. Ref: http://marc.theaimsgroup.com/?l=linux-mm&m=111461480725322&w=4 However, further down, if the page is mapped, shrink_list() calls try_to_unmap() which will call try_to_unmap_one() via try_to_unmap_anon (). try_to_unmap_one() will BUG_ON() an anon page that is NOT in the swap cache. Martin says he never encountered this path in his testing, but agrees that it might happen. This patch modifies shrink_list() to skip anon pages that are not already in the swap cache when !may_swap, rather than just not adding them to the cache. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm/msync.c cleanupOGAWA Hirofumi1-14/+14
This is not problem actually, but sync_page_range() is using for exported function to filesystems. The msync_xxx is more readable at least to me. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] Remove near all BUGs in mm/mempolicy.cAndi Kleen1-7/+2
Most of them can never be triggered and were only for development. Signed-off-by: "Andi Kleen" <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] Convert mempolicies to nodemask_tAndi Kleen1-67/+53
The NUMA policy code predated nodemask_t so it used open coded bitmaps. Convert everything to nodemask_t. Big patch, but shouldn't have any actual behaviour changes (except I removed one unnecessary check against node_online_map and one unnecessary BUG_ON) Signed-off-by: "Andi Kleen" <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: set per-cpu-pages lower threshold to zeroSeth, Rohit1-2/+2
Set the low water mark for hot pages in pcp to zero. (akpm: for the life of me I cannot remember why we created pcp->low. Neither can Martin and the changelog is silent. Maybe it was just a brainfart, but I have this feeling that there was a reason. If not, we should remove the fields completely. We'll see.) Signed-off-by: Rohit Seth <rohit.seth@intel.com> Cc: <linux-mm@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: page_alloc: increase size of per-cpu-pagesSeth, Rohit1-12/+12
Increase the page allocator's per-cpu magazines from 1/4MB to 1/2MB. Over 100+ runs for a workload, the difference in mean is about 2%. The best results for both are almost same. Though the max variation in results with 1/2MB is only 2.2%, whereas with 1/4MB it is 12%. Signed-off-by: Rohit Seth <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] swaptoken tuningRik Van Riel2-2/+6
It turns out that the original swap token implementation, by Song Jiang, only enforced the swap token while the task holding the token is handling a page fault. This patch approximates that, without adding an additional flag to the mm_struct, by checking whether the mm->mmap_sem is held for reading, like the page fault code does. This patch has the effect of automatically, and gradually, disabling the enforcement of the swap token when there is little or no paging going on, and "turning up" the intensity of the swap token code the more the task holding the token is thrashing. Thanks to Song Jiang for pointing out this aspect of the token based thrashing control concept. The new code shows a slight degradation over the old swap token code, but still a big win over running without the swap token. 2.6.12+ swap token disabled $ for i in `seq 10` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done 101.74user 23.13system 8:26.91elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (38597major+430315minor)pagefaults 0swaps 101.98user 24.91system 8:03.06elapsed 26%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (33939major+430457minor)pagefaults 0swaps 101.93user 22.12system 7:34.90elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (33166major+421267minor)pagefaults 0swaps 101.82user 22.38system 8:31.40elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (39338major+433262minor)pagefaults 0swaps 2.6.12+ swap token enabled, timeout 300 seconds $ for i in `seq 4` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done 102.58user 16.08system 3:41.44elapsed 53%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (19707major+285786minor)pagefaults 0swaps 102.07user 19.56system 4:00.64elapsed 50%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (19012major+299259minor)pagefaults 0swaps 102.64user 18.25system 4:07.31elapsed 48%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (21990major+304831minor)pagefaults 0swaps 101.39user 19.41system 5:15.81elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (24850major+323321minor)pagefaults 0swaps 2.6.12+ with new swap token code, timeout 300 seconds $ for i in `seq 4` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done 101.87user 24.66system 5:53.20elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (26848major+363497minor)pagefaults 0swaps 102.83user 19.95system 4:17.25elapsed 47%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (19946major+305722minor)pagefaults 0swaps 102.09user 19.46system 5:12.57elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (25461major+334994minor)pagefaults 0swaps 101.67user 20.61system 4:52.97elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (22190major+329508minor)pagefaults 0swaps Signed-off-by: Rik Van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] vmalloc_nodeChristoph Lameter1-16/+57
This patch adds vmalloc_node(size, node) -> Allocate necessary memory on the specified node and get_vm_area_node(size, flags, node) and the other functions that it depends on. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28[PATCH] gfp_t: the restAl Viro2-19/+24
zone handling, mapping->flags handling Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28[PATCH] gfp_t: mm/* (easy parts)Al Viro5-15/+15
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28[PATCH] gfp_t: infrastructureAl Viro2-5/+5
Beginning of gfp_t annotations: - -Wbitwise added to CHECKFLAGS - old __bitwise renamed to __bitwise__ - __bitwise defined to either __bitwise__ or nothing, depending on __CHECK_ENDIAN__ being defined - gfp_t switched from __nocast to __bitwise__ - force cast to gfp_t added to __GFP_... constants - new helper - gfp_zone(); extracts zone bits out of gfp_t value and casts the result to int Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26[PATCH] NUMA: broken per cpu pageset countersMagnus Damm1-0/+2
The NUMA counters in struct per_cpu_pageset (linux/mmzone.h) are never cleared today. This works ok for CPU 0 on NUMA machines because boot_pageset[] is already zero, but for other CPU:s this results in uninitialized counters. Signed-off-by: Magnus Damm <magnus@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-20[PATCH] Fix handling spurious page fault for hugetlb regionHugh Dickins2-12/+24
This reverts commit 3359b54c8c07338f3a863d1109b42eebccdcf379 and replaces it with a cleaner version that is purely based on page table operations, so that the synchronization between inode size and hugetlb mappings becomes moot. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-19[PATCH] swiotlb: make sure initial DMA allocations really are in DMA memoryYasunori Goto1-10/+21
This introduces a limit parameter to the core bootmem allocator; The new parameter indicates that physical memory allocated by the bootmem allocator should be within the requested limit. We also introduce alloc_bootmem_low_pages_limit, alloc_bootmem_node_limit, alloc_bootmem_low_pages_node_limit apis, but alloc_bootmem_low_pages_limit is the only api used for swiotlb. The existing alloc_bootmem_low_pages() api could instead have been changed and made to pass right limit to the core allocator. But that would make the patch more intrusive for 2.6.14, as other arches use alloc_bootmem_low_pages(). We may be done that post 2.6.14 as a cleanup. With this, swiotlb gets memory within 4G for both x86_64 and ia64 arches. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Ravikiran G Thirumalai <kiran@scalex86.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-19[PATCH] mm: hugetlb truncation fixesHugh Dickins1-14/+21
hugetlbfs allows truncation of its files (should it?), but hugetlb.c often forgets that: crashes and misaccounting ensue. copy_hugetlb_page_range better grab the src page_table_lock since we don't want to guess what happens if concurrently truncated. unmap_hugepage_range rss accounting must not assume the full range was mapped. follow_hugetlb_page must guard with page_table_lock and be prepared to exit early. Restyle copy_hugetlb_page_range with a for loop like the others there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-19[PATCH] Handle spurious page fault for hugetlb regionSeth, Rohit1-2/+12
The hugetlb pages are currently pre-faulted. At the time of mmap of hugepages, we populate the new PTEs. It is possible that HW has already cached some of the unused PTEs internally. These stale entries never get a chance to be purged in existing control flow. This patch extends the check in page fault code for hugepages. Check if a faulted address falls with in size for the hugetlb file backing it. We return VM_FAULT_MINOR for these cases (assuming that the arch specific page-faulting code purges the stale entry for the archs that need it). Signed-off-by: Rohit Seth <rohit.seth@intel.com> [ This is apparently arguably an ia64 port bug. But the code won't hurt, and for now it fixes a real problem on some ia64 machines ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-16Fix memory ordering bug in page reclaimLinus Torvalds1-4/+9
As noticed by Nick Piggin, we need to make sure that we check the page count before we check for PageDirty, since the dirty check is only valid if the count implies that we're the only possible ones holding the page. We always did do this, but the code needs a read-memory-barrier to make sure that the orderign is also honored by the CPU. (The writer side is ordered due to the atomic decrement and test on the page count, see the discussion on linux-kernel) Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-11[PATCH] Don't map the same page too muchHugh Dickins1-0/+3
Refuse to install a page into a mapping if the mapping count is already ridiculously large. You probably cannot trigger this on 32-bit architectures, but on a 64-bit setup we should protect against it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-11[PATCH] madvise: Avoid returning error code -EBADF for anonymous mappingsSuzuki1-7/+4
Revert this recent correctness change: Douglas Crosher <dcrosher@scieneer.com> reported that it broke an existing application, and that madvise() works without error on anonymous mappings on Solaris. This means that madvise() will remain non-standards-compliant: we should return -EBADF for all requests against non-file-backed vma's, but Linux only does this for MADV_WILLNEED requests. Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08[PATCH] gfp flags annotations - part 1Al Viro11-41/+37
- added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-30Revert "x86-64: Reverse order of bootmem lists"Linus Torvalds1-11/+3
As requested by Thomas Gleixner <tglx@linutronix.de>: "5d3d0f7704ed0bc7eaca0501eeae3e5da1ea6c87 breaks a couple of ARM boards, which depend on the historical bootmem allocation order. There is a cleaner solution around to remove the pgdat list completely, but this is a topic for post 2.6.14 Andi signalled ACK already." Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28[PATCH] kmalloc_node IRQ safety fixAlok N Kataria1-7/+18
In kmalloc_node we are checking if the allocation is for the same node when interrupts are "on". This may lead to an allocation on another node than intended. This patch just shifts the check for the current node in __cache_alloc_node when interrupts are disabled. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28[PATCH] mm: move_pte to remap ZERO_PAGENick Piggin1-3/+3
Move the ZERO_PAGE remapping complexity to the move_pte macro in asm-generic, have it conditionally depend on __HAVE_ARCH_MULTIPLE_ZERO_PAGE, which gets defined for MIPS. For architectures without __HAVE_ARCH_MULTIPLE_ZERO_PAGE, move_pte becomes a noop. From: Hugh Dickins <hugh@veritas.com> Fix nasty little bug we've missed in Nick's mremap move ZERO_PAGE patch. The "pte" at that point may be a swap entry or a pte_file entry: we must check pte_present before perhaps corrupting such an entry. Patch below against 2.6.14-rc2-mm1, but the same bug is in 2.6.14-rc2's mm/mremap.c, and more dangerous there since it's affecting all arches: I think the safest course is to send Nick's patch and Yoichi's build fix and this fix (build tested) on to Linus - so only MIPS can be affected. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-23[PATCH] revert oversized kmalloc checkAndrew Morton1-1/+2
As davem points out, this wasn't such a great idea. There may be some code which does: size = 1024*1024; while (kmalloc(size, ...) == 0) size /= 2; which will now explode. Cc: "David S. Miller" <davem@davemloft.net> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22[PATCH] Fix bd_claim() error code.Rob Landley1-0/+1
Problem: In some circumstances, bd_claim() is returning the wrong error code. If we try to swapon an unused block device that isn't swap formatted, we get -EINVAL. But if that same block device is already mounted, we instead get -EBUSY, even though it still isn't a valid swap device. This issue came up on the busybox list trying to get the error message from "swapon -a" right. If a swap device is already enabled, we get -EBUSY, and we shouldn't report this as an error. But we can't distinguish the two -EBUSY conditions, which are very different errors. In the code, bd_claim() returns either 0 or -EBUSY, but in this case busy means "somebody other than sys_swapon has already claimed this", and _that_ means this block device can't be a valid swap device. So return -EINVAL there. Signed-off-by: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22[PATCH] __kmalloc: Generate BUG if size requested is too large.Christoph Lameter1-2/+1
I had an issue on ia64 where I got a bug in kernel/workqueue because kzalloc returned a NULL pointer due to the task structure getting too big for the slab allocator. Usually these cases are caught by the kmalloc macro in include/linux/slab.h. Compilation will fail if a too big value is passed to kmalloc. However, kzalloc uses __kmalloc which has no check for that. This patch makes __kmalloc bug if a too large entity is requested. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22[PATCH] slab: fix handling of pages from foreign NUMA nodesChristoph Lameter1-19/+19
The numa slab allocator may allocate pages from foreign nodes onto the lists for a particular node if a node runs out of memory. Inspecting the slab->nodeid field will not reflect that the page is now in use for the slabs of another node. This patch fixes that issue by adding a node field to free_block so that the caller can indicate which node currently uses a slab. Also removes the check for the current node from kmalloc_cache_node since the process may shift later to another node which may lead to an allocation on another node than intended. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22[PATCH] slab: alpha inlining fixIvan Kokshaysky1-3/+4
It is essential that index_of() be inlined. But alpha undoes the gcc inlining hackery and index_of() ends up out-of-line. So fiddle with things to make that function inline again. Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-21[PATCH] mm: add a note about partially hardcoded VM_* flagsPaolo 'Blaisorblade' Giarrusso1-1/+2
Hugh made me note this line for permission checking in mprotect(): if ((newflags & ~(newflags >> 4)) & 0xf) { after figuring out what's that about, I decided it's nasty enough. Btw Hugh itself didn't like the 0xf. We can safely change it to VM_READ|VM_WRITE|VM_EXEC because we never change VM_SHARED, so no need to check that. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-21[PATCH] fix locking comment in unmap_region()Paolo 'Blaisorblade' Giarrusso1-1/+1
That comment is plain wrong (we even take the pagetable lock inside unmap_region()). Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-17[PATCH] fix mm/Kconfig spellingDave Hansen1-2/+2
Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>