path: root/mm
Age    Commit message    Author    Files    Lines
2004-08-22[PATCH] Concurrent O_SYNC write supportAndrew Morton1-20/+74
In databases it is common to have multiple threads or processes performing O_SYNC writes against different parts of the same file. Our performance at this is poor, because each writer blocks access to the file by waiting on I/O completion while holding i_sem: everything is serialised. The patch improves things by moving the writing and waiting outside i_sem. So other threads can get in and submit their I/O and permit the disk scheduler to optimise the IO patterns better. Also, the O_SYNC writer only writes and waits on the pages which it wrote, rather than writing and waiting on all dirty pages in the file. The reason we haven't been able to do this before is that the required walk of the address_space page lists is easily livelockable without the i_sem serialisation. But in this patch we perform the waiting via a radix-tree walk of the affected pages. This cannot be livelocked. The sync of the inode's metadata is still performed inside i_sem. This is because it is list-based and is hence still livelockable. However it is usually the case that databases are overwriting existing file blocks and there will be no dirty buffers attached to the address_space anyway. The code is careful to ensure that the IO for the pages and the IO for the metadata are nonblockingly scheduled at the same time. This is an improvement over the current code, which will issue two separate write-and-wait cycles: one for metadata, one for pages. Note from Suparna: Reworked to use the tagged radix-tree based writeback infrastructure. Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
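To make the ordering concrete, here is a rough C sketch of the flow described above; the _sketch-suffixed helpers (the buffered write and the range flush/wait) are stand-ins for the patch's actual functions, not their real names.

    /* Rough shape of the O_SYNC write path described above; the
     * _sketch helpers are illustrative stand-ins. */
    static ssize_t osync_write_sketch(struct file *file, const char __user *buf,
                                      size_t count, loff_t *ppos)
    {
        struct address_space *mapping = file->f_mapping;
        struct inode *inode = mapping->host;
        loff_t pos = *ppos;
        ssize_t written;

        down(&inode->i_sem);
        written = buffered_write_sketch(file, buf, count, ppos);
        if (written > 0)
            write_inode_now(inode, 0);  /* schedule metadata I/O, don't wait */
        up(&inode->i_sem);

        if (written > 0) {
            loff_t end = pos + written - 1;
            /* outside i_sem: flush and wait only on the pages we dirtied,
             * found via the tagged radix-tree walk (cannot livelock) */
            filemap_fdatawrite_range_sketch(mapping, pos, end);
            wait_on_page_writeback_range_sketch(mapping,
                    pos >> PAGE_CACHE_SHIFT, end >> PAGE_CACHE_SHIFT);
        }
        return written;
    }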
2004-08-22[PATCH] filemap_fdatawrite range interfaceSuparna Bhattacharya1-2/+21
Range based equivalent of filemap_fdatawrite for O_SYNC writers (to go with writepages range support added to mpage_writepages). If both <start> and <end> are zero, then it defaults to writing back all of the mapping's dirty pages. Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
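A small usage sketch of the idea; the exact name and signature of the range call are assumed from the description above rather than copied from the patch.

    /* assumed: int filemap_fdatawrite_range(mapping, loff_t start, loff_t end) */
    static int flush_osync_range_sketch(struct address_space *mapping,
                                        loff_t pos, ssize_t written)
    {
        if (written <= 0)
            /* start == end == 0: write back all of the mapping's dirty pages */
            return filemap_fdatawrite_range(mapping, 0, 0);
        /* otherwise push only the byte range this writer dirtied */
        return filemap_fdatawrite_range(mapping, pos, pos + written - 1);
    }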
2004-08-22[PATCH] Fix writeback page range to use exact limitsSuparna Bhattacharya1-1/+6
wait_on_page_writeback_range shouldn't wait for pages beyond the specified range. Ideally, the radix-tree-lookup could accept an end_index parameter so that it doesn't return the extra pages in the first place, but for now we just add a few extra checks to skip such pages. Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22[PATCH] token based thrashing controlRik van Riel5-1/+104
The following experimental patch implements token based thrashing protection, using the algorithm described in: http://www.cs.wm.edu/~sjiang/token.htm When there are pageins going on, a task can grab a token that protects the task from pageout (except by itself) until it is no longer doing heavy pageins, or until the maximum hold time of the token is over. If the maximum hold time is exceeded, the task isn't eligible to hold the token for a while more, since it wasn't doing it much good anyway. I have run a very unscientific benchmark on my system to test the effectiveness of the patch, timing how long a 230MB two-process qsbench run takes, with and without the token thrashing protection present. normal 2.6.8-rc6: 6m45s 2.6.8-rc6 + token: 4m24s This is a quick hack, implemented without having talked to the inventor of the algorithm. He's copied on the mail and I suspect we'll be able to do better than my quick implementation ... Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
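A conceptual sketch of the token idea, assuming a single global token and an arbitrary hold-time limit; the names and the timeout value are illustrative, not Rik's implementation.

    #define SWAP_TOKEN_TIMEOUT_SKETCH (300 * HZ)   /* arbitrary for the sketch */

    static struct mm_struct *swap_token_mm_sketch;   /* current holder */
    static unsigned long swap_token_expires_sketch;  /* jiffies */

    /* called from the pagein path */
    static void grab_swap_token_sketch(struct mm_struct *mm)
    {
        if (swap_token_mm_sketch == NULL ||
            time_after(jiffies, swap_token_expires_sketch)) {
            swap_token_mm_sketch = mm;
            swap_token_expires_sketch = jiffies + SWAP_TOKEN_TIMEOUT_SKETCH;
        }
    }

    /* pageout path: leave the holder's pages alone */
    static int has_swap_token_sketch(struct mm_struct *mm)
    {
        return mm == swap_token_mm_sketch;
    }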
2004-08-22[PATCH] prio_tree: iterator + vma_prio_tree_next cleanupRajesh Venkatasubramanian3-45/+31
Currently we have: while ((vma = vma_prio_tree_next(vma, root, &iter, begin, end)) != NULL) do_something_with(vma); Then iter, root, begin, end are all transferred unchanged to various functions. This patch hides them in struct iter instead. It slightly reduces source size, code size, and stack usage. The patch compiles and has been tested lightly. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Rajesh Venkatasubramanian <vrajesh@umich.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
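Roughly the shape after the cleanup (the iterator-initialisation name here is approximate): the root and the range live inside the iterator, so they no longer ride along as separate parameters.

    struct vm_area_struct *vma = NULL;
    struct prio_tree_iter iter;

    prio_tree_iter_init(&iter, root, begin, end);   /* name approximate */
    while ((vma = vma_prio_tree_next(vma, &iter)) != NULL)
        do_something_with(vma);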
2004-08-22[PATCH] prio_tree: kill vma_prio_tree_init()Rajesh Venkatasubramanian3-11/+8
vma_prio_tree_insert() relies on the fact that the vma was vma_prio_tree_init()'ed. The content of vma->shared should be considered undefined until this vma is inserted into i_mmap/i_mmap_nonlinear. It's better to do proper initialization in vma_prio_tree_add/insert. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Rajesh Venkatasubramanian <vrajesh@umich.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22[PATCH] break out zone free list initializationDave Hansen1-37/+47
The following patch removes the individual free area initialization from free_area_init_core(), and puts it in a new function zone_init_free_lists(). It also creates pages_to_bitmap_size(), which is then used in zone_init_free_lists() as well as several times in my free area bitmap resizing patch. First of all, I think it looks nicer this way, but it's also necessary to have this if you want to initialize a zone after system boot, like if a NUMA node was hot-added. In any case, it should be functionally equivalent to the old code. Compiles and boots on x86. I've been running with this for a few weeks, and haven't seen any problems with it yet. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22[PATCH] make shrinker_sem an rwsemNick Piggin1-16/+23
Use an rwsem to protect the shrinker list instead of a regular semaphore. Modifications to the list are now done under the write lock, shrink_slab takes the read lock, and access to shrinker->nr becomes racy (which is no different to how the page lru scanner is implemented). The shrinker functions become concurrent. Previously, having the slab scanner get preempted or scheduling while holding the semaphore would cause other tasks to skip putting pressure on the slab. Also, make shrink_icache_memory return -1 if it can't do anything in order to hold pressure on this cache and prevent useless looping in shrink_slab. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
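A sketch of the locking scheme described above, with illustrative names: list changes take the rwsem for writing, while shrink_slab() only needs it for reading.

    struct shrinker_sketch {
        struct list_head list;
        long nr;                 /* deferred scan count, updated racily */
    };

    static DECLARE_RWSEM(shrinker_rwsem_sketch);
    static LIST_HEAD(shrinker_list_sketch);

    static void register_shrinker_sketch(struct shrinker_sketch *s)
    {
        down_write(&shrinker_rwsem_sketch);     /* list mutation: exclusive */
        list_add_tail(&s->list, &shrinker_list_sketch);
        up_write(&shrinker_rwsem_sketch);
    }

    static void shrink_slab_sketch(void)
    {
        struct shrinker_sketch *s;

        down_read(&shrinker_rwsem_sketch);      /* scanners run concurrently */
        list_for_each_entry(s, &shrinker_list_sketch, list) {
            /* apply pressure here; races on s->nr are tolerated, just
             * like the LRU scanner's counters */
        }
        up_read(&shrinker_rwsem_sketch);
    }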
2004-08-22[PATCH] slab: locking optimization for cache_reapDimitri Sivanich1-31/+6
Here is another cache_reap optimization that reduces latency when applied after the 'Move cache_reap out of timer context' patch I submitted on 7/14 (for inclusion in -mm next week). Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22[PATCH] Move cache_reap out of timer contextDimitri Sivanich1-50/+25
I'm submitting two patches associated with moving cache_reap functionality out of timer context. Note that these patches do not make any further optimizations to cache_reap at this time. The first patch adds a function similar to schedule_delayed_work to allow work to be scheduled on another cpu. The second patch makes use of schedule_delayed_work_on to schedule cache_reap to run from keventd. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
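A minimal sketch of the two pieces, assuming illustrative names and an arbitrary re-arm interval: a per-cpu work item reaps this cpu's caches from keventd and then reschedules itself with schedule_delayed_work_on().

    #define REAP_INTERVAL_SKETCH (2 * HZ)           /* arbitrary period */

    static DEFINE_PER_CPU(struct work_struct, reap_work_sketch);

    static void cache_reap_sketch(void *unused)
    {
        /* ... drain this cpu's slab caches; we may sleep now that we run
         * from keventd instead of timer context ... */

        schedule_delayed_work_on(smp_processor_id(),
                                 &__get_cpu_var(reap_work_sketch),
                                 REAP_INTERVAL_SKETCH);
    }

    static void start_cpu_reap_sketch(int cpu)
    {
        INIT_WORK(&per_cpu(reap_work_sketch, cpu), cache_reap_sketch, NULL);
        schedule_delayed_work_on(cpu, &per_cpu(reap_work_sketch, cpu),
                                 REAP_INTERVAL_SKETCH);
    }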
2004-08-22[PATCH] Make i/dhash_entries cmdline work as it used to.Jose R. Santos1-25/+24
I was looking at the recent patch for >MAX_ORDER hash tables but it seems that the patch limits the number of entries to what it thinks are good values and the i/dhash_entries cmdline options cannot exceed this. This seems to limit the usability of the patch on systems where larger allocations than the ones the kernel calculates are desired. - Make the ihash_entries and dhash_entries cmdline options behave like they used to. - Remove MAX_SYS_HASH_TABLE_ORDER. Limit the max size to 1/16 the total number of pages. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22[PATCH] split generic_file_aio_write into buffered and direct I/O partsChristoph Hellwig1-97/+118
If the generic code falls back to buffered I/O on a hole XFS needs to relock, so we need to have separate functions to call unless we want to duplicate everything. The XFS patch still needs some cleaning up, but I'll try to get it in before 2.6.8. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-07Make sysctl pass the pos pointer around properly.Linus Torvalds3-8/+8
Nobody ever fixed the big FIXME in sysctl - but we really need to pass around the proper "loff_t *" to all the sysctl functions if we want them to be well-behaved wrt the file pointer position. This is all preparation for making direct f_pos accesses go away.
2004-08-01[PATCH] Canonically reference files in Documentation/ code comments partAdrian Bunk1-1/+1
Below is a patch by Hans Ulrich Niedermann <linux-kernel@n-dimensional.de> to change all references in comments to files in Documentation/ to start with Documentation/ Signed-off-by: Adrian Bunk <bunk@fs.tum.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-01[PATCH] quieten down per-zone memory statsJesse Barnes1-2/+2
On a system with a lot of nodes, 4 lines of output per node is a lot to have to sit through as the system comes up, especially if you're on the other end of a slow serial link. The information is valuable though, so keep it around for the system logger. This patch makes the printks for the memory stats use KERN_DEBUG instead of the default loglevel. Signed-off-by: Jesse Barnes <jbarnes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-01[PATCH] oom-killer: call show_free_areasAndrew Morton2-2/+5
Change the oom-killer so that it spits a sysrq-m output into the logs, and shows the gfp_mask of the failing allocation attempt. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-31[PATCH] slab memory shrinking balancing fixAndrew Morton2-20/+23
The logic in shrink_slab tries to balance the proportion of slab which it scans against the proportion of pagecache which the caller scanned. Problem is that with a large number of highmem LRU pages and a small number of lowmem LRU pages, the amount of pagecache scanning appears to be very small, so we don't push slab hard enough. The patch changes things so that for, say, a GFP_KERNEL allocation attempt we only consider ZONE_NORMAL and ZONE_DMA when calculating "what proportion of the LRU did the caller just scan". This will have the effect of shrinking slab harder in response to GFP_KERNEL allocations than for GFP_HIGHMEM allocations. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28[PATCH] page_cache_readahead unused variableShane Shrybman1-2/+0
Removal of unused variable in mm/readahead.c. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28[PATCH] fix readahead breakage for sequential after random readsMiklos Szeredi1-1/+6
Current readahead logic is broken when a random read pattern is followed by a long sequential read. The cause is that on a window miss ra->next_size is set to ra->average, but ra->average is only updated at the end of a sequence, so window size will remain 1 until the end of the sequential read. This patch fixes this by taking the current sequence length into account (code taken from towards end of page_cache_readahead()), and also setting ra->average to a decent value in handle_ra_miss() when sequential access is detected. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28[PATCH] Remove dead comment in mm/filemap.cHimanshu Raj1-4/+0
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28[PATCH] sign fix in swapfile.cMika Kukkonen1-1/+1
CC mm/swapfile.o mm/swapfile.c: In function `scan_swap_map': mm/swapfile.c:114: warning: comparison between signed and unsigned Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28[PATCH] Make get_user_pages() work again for ia64 gate areaDavid Mosberger1-1/+1
Changeset roland@redhat.com[torvalds]|ChangeSet|20040624165002|30880 inadvertently broke ia64 because the patch assumed that pgd_offset_k() is just an optimization of pgd_offset(), which it is not. This patch fixes the problem by introducing pgd_offset_gate(). On architectures on which the gate area lives in the user's address-space, this should be aliased to pgd_offset() and on architectures on which the gate area lives in the kernel-mapped segment, this should be aliased to pgd_offset_k(). This bug was found and tracked down by Peter Chubb. Signed-off-by: <davidm@hpl.hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28[PATCH] swapoff mmap_sem deadlockHugh Dickins2-1/+13
Updating the mm lock ordering documentation drew attention to the fact that we were wrong to blithely add down_read(&mm->mmap_sem) to swapoff's unuse_process, while it holds swapcache page lock: not very likely, but it could deadlock against, say, mlock faulting a page back in from swap. But it looks like these days it's safe to drop and reacquire page lock if down_read_trylock fails: the page lock is held to stop try_to_unmap unmapping the page's ptes as fast as try_to_unuse is mapping them back in; but the recent fix for get_user_pages works to prevent that too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
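The lock-ordering trick reads roughly like this (a sketch, not the exact hunk): if the mmap_sem cannot be taken without sleeping, drop the swapcache page lock first and re-take it afterwards.

    static void unuse_mm_sketch(struct mm_struct *mm, struct page *page)
    {
        if (!down_read_trylock(&mm->mmap_sem)) {
            /* don't sleep on mmap_sem while holding the page lock:
             * that could deadlock against e.g. mlock faulting the
             * page back in */
            unlock_page(page);
            down_read(&mm->mmap_sem);
            lock_page(page);
        }
        /* ... walk the vmas and map the swap page back in ... */
        up_read(&mm->mmap_sem);
    }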
2004-07-28[PATCH] install_page vs. vmtruncateHugh Dickins1-2/+6
BK is still missing one piece for Oleg's install_page/vmtruncate races. Oleg didn't explicitly ACK this, but I think he did implicitly: Oleg? The previous patch to install_page, returning an error if !page_mapping once page_table_lock is held, is not enough to guard against vmtruncate. When unmap_mapping_range already did this vma, but truncate_inode_pages has not yet done this page, page->mapping will still be set, but we must now refrain from inserting the page into the page table. Could check truncate_count, but that would need caller to read and pass it down. Instead, recheck page->index against i_size, which is updated before unmap_mapping_range. Better check page->mapping too: not really necessary, but it's accidental that index is left when mapping is reset. Also, callers are expecting -EINVAL for beyond end of file, not -EAGAIN. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28[PATCH] BIO page refcounting fixJens Axboe1-6/+3
Hopefully fixes the free-of-a-freed-page BUG caused during CDRW writing. This also fixes a problem in the bouncing for io errors (it needs to free the pages and clear the BIO_UPTODATE flag, not set it. it's already set. passing -EIO to bio_endio() takes care of that). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-25[PATCH] populate nonlinear mappings unconditionallyOleg Nesterov2-18/+6
filemap_populate and shmem_populate must install even a linear file_pte, in case there was a nonlinear page or file_pte already installed there: could only happen if already VM_NONLINEAR, but no need to check that. Acked by Ingo and Hugh. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-22Make "install_page()" able to handle truncated pages.Linus Torvalds1-6/+8
This makes it much easier on the callers, no need to worry about races with vmtruncate() and friends, since "install_page()" will just cleanly handle that case and tell the caller about it.
2004-07-22[PATCH] is_highmem() and WANT_PAGE_VIRTUALAndy Whitcroft1-1/+1
Add is_highmem_idx() and is_normal_idx() to determine whether a zone index is a highmem or normal zone. Use this for memmap_init_zone(). Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-17[PATCH] NX: clean up legacy binary supportIngo Molnar2-6/+14
This cleans up legacy x86 binary support by introducing a new personality bit: READ_IMPLIES_EXEC, and implements Linus' suggestion to add the PROT_EXEC bit on the two affected syscall entry places, sys_mprotect() and sys_mmap(). If this bit is set then PROT_READ will also add the PROT_EXEC bit - as expected by legacy x86 binaries. The ELF loader will automatically set this bit when it encounters a legacy binary. This approach avoids the problems the previous ->def_flags solution caused. In particular this patch fixes the PROT_NONE problem in a cleaner way (http://lkml.org/lkml/2004/7/12/227), and it should fix the ia64 PROT_EXEC problem reported by David Mosberger. Also, mprotect(PROT_READ) done by legacy binaries will do the right thing as well. the details: - the personality bit is added to the personality mask upon exec(), within the ELF loader, but is not cleared (see the exceptions below). This means that if an environment that already has the bit exec()s a new-style binary it will still get the old behavior. - one exception are setuid/setgid binaries: these will reset the bit - thus local attackers cannot manually set the bit and circumvent NX protection. Legacy setuid binaries will still get the bit through the ELF loader. This gives us maximum flexibility in shaping compatibility environments. - selinux also clears the bit when switching SIDs via exec(). - x86 is the only arch making use of READ_IMPLIES_EXEC currently. Other arches will have the pre-NX-patch protection setup they always had. I have booted an old distro [RH 7.2] and two new PT_GNU_STACK distros [SuSE 9.2 and FC2] on an NX-capable CPU - they work just fine and all the mapping details are right. I've checked the PROT_NONE test-utility as well and it works as expected. I have checked various setuid scenarios as well involving legacy and new-style binaries. an improved setarch utility can be used to set the personality bit manually: http://redhat.com/~mingo/nx-patches/setarch-1.4-3.tar.gz the new '-X' flag does it, e.g.: ./setarch -X linux /bin/cat /proc/self/maps will trigger the old protection layout even on a new distro. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-15[PATCH] mmap PROT_NONE fix for NX patchDaniel McNeil1-0/+6
This works around the current PROT_NONE problem from ELF binaries that do not have the PT_GNU_STACK header, so that they do not have execute permission. The problem was that setting "def_flags" to include the VM_EXEC bit for compatibility reasons would also make PROT_NONE pages executable, which is obviously not correct. Signed-off-by: Daniel McNeil <daniel@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-14[PATCH] tmpfs preempt count panicHugh Dickins1-2/+1
Just unearthed another of my warcrimes: reading a 17-page sparse file, I mean holey file, hits the in_interrupt panic in do_exit on a current highmem kernel (but 2.6.7 is okay). Fix mismatched preempt count from shmem_swp_alloc's swapindex hole case by mapping an empty_zero_page. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-13[PATCH] sparse: read_descriptor_t annotationAlexander Viro2-11/+11
We have a fun situation with read_descriptor_t - all its instances end up passed to some actor; these actors use desc->buf as their private data; there are 5 of them and they expect, respectively: struct lo_read_data *, struct svc_rqst *, struct file *, struct rpc_xprt *, and char __user *. IOW, there is no type safety whatsoever; the field is essentially untyped, we rely on the fact that the actor is chosen by the same code that sets ->buf and expect it to put something of the right type there. Right now desc->buf is declared as char __user *. Moreover, the last argument of ->sendfile() (what should be stored in ->buf) is void __user *, even though it's actually _never_ a userland pointer. If nothing else, ->sendfile() should take void * instead; that alone removes a bunch of bogus warnings. I went further and replaced desc->buf with a union of void * and char __user *.
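The resulting structure looks roughly like this (field layout from memory, so treat it as a sketch): the old char __user *buf survives for the read(2) actor, and in-kernel actors use the plain void pointer.

    typedef struct {
        size_t written;
        size_t count;
        union {
            char __user *buf;   /* userland buffer for the read(2) actor */
            void *data;         /* private data for in-kernel actors */
        } arg;
        int error;
    } read_descriptor_t;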
2004-07-10[PATCH] tmpfs: scheduling-while-atomic fixHugh Dickins1-23/+19
Nick has tracked scheduling-while-atomic errors to shmem's fragile kmap avoidance: the root error appears to lie deeper, but rework that fragility. Plus I've been indicted for war crimes at the end of shmem_swp_entry: my apologia scorned, so now hide the evidence. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-10[PATCH] slab: fix get_user inside spinlockAndrew Morton1-13/+0
This little debugging __get_user is in fact happening inside a spinlock. It was never very useful, and has caused problems for some architectures in the past. Let's just remove it. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-10[PATCH] pagefault readaround fixAndrew Morton1-5/+4
Mika Kukkonen <mika@osdl.org> says: CC mm/filemap.o mm/filemap.c: In function `filemap_nopage': mm/filemap.c:1161: warning: comparison of unsigned expression < 0 is always false The pagefault readaround code is currently doing purely readahead. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-10[PATCH] Remove always false check in mm/slab.cMika Kukkonen1-2/+1
CC mm/slab.o mm/slab.c: In function `kmem_cache_create': mm/slab.c:1129: warning: comparison of unsigned expression < 0 is always false This comes from the fact that 'align' is size_t and so unsigned. Just to be sure, I did $ grep __kernel_size_t include/*/posix_types.h and yes, every arch defines that to be unsigned. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-10[PATCH] convert uses of ZONE_HIGHMEM to is_highmemAndy Whitcroft1-1/+1
As the comments in mmzone.h indicate is_highmem() is designed to reduce the proliferation of the constant ZONE_HIGHMEM. This patch updates references to ZONE_HIGHMEM to use is_highmem(). None appear to be on critical paths. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-04[PATCH] spurious remap_file_pages() -EINVALWilliam Lee Irwin III1-1/+2
As ->vm_private_data is used as a cursor for swapout of VM_NONLINEAR vmas, the check for NULL ->vm_private_data or VM_RESERVED is too strict, and should allow VM_NONLINEAR vmas with non-NULL ->vm_private_data. This fixes an issue on 2.6.7-mm5 where system calls to remap_file_pages() spuriously failed while under memory pressure. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-04[PATCH] force O_LARGEFILE in sys_swapon() and sys_swapoff()William Lee Irwin III1-2/+2
For 32-bit, one quickly discovers that swapon() is not given an fd already opened with O_LARGEFILE to act upon and the forcing of O_LARGEFILE for 64-bit is irrelevant, as the system call's argument is a path. So this patch manually forces it for swapon() and swapoff(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-03[PATCH] swap_unplug_io_fn() nommu updatePaul Mundt1-1/+1
include/linux/swap.h changed the definition for swap_unplug_io_fn() awhile back, but mm/nommu.c was never updated to reflect the new definition. As such, mm/nommu.c presently fails to compile. This fixes it. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-02[PATCH] sparse: remaining integer zero / NULL fixes in allmodconfig & vmlinuxMika Kukkonen1-1/+1
This fixes the remaining 0-to-NULL things that were found with 'make allmodconfig' and 'make C=1 vmlinux'.
2004-06-30[PATCH] sparse: NULL vs 0 - the rest of itMika Kukkonen2-2/+2
2004-06-29[PATCH] kill mm_struct.used_hugetlbOleg Nesterov1-1/+0
mm_struct.used_hugetlb was used to eliminate a costly find_vma() from follow_page(). Now it is used only in the ia64 version of follow_huge_addr(). I know nothing about ia64, but this REGION_NUMBER() looks simple enough to kill used_hugetlb. There is a debug version (commented out) of follow_huge_addr() in i386 which looks at used_hugetlb, but it can work without this check. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-29[PATCH] fix page->count discrepancy for zero pageDave Hansen1-1/+2
While writing some analysis tools for memory hot-remove, we came across a single page which had a ->count that always increased, without bound. It ended up always being the zero page, and it was caused by a leaked reference in some do_wp_page() code that ends up avoiding PG_reserved pages. Basically what happens is that page_cache_release()/put_page() ignore PG_reserved pages, while page_cache_get()/get_page() go ahead and take the reference. So, each time there's a COW fault on the zero-page, you get a leaked page->count increment. It's pretty rare to have a COW fault on anything that's PG_reserved, in fact, I can't think of anything else that this applies to other than the zero page. In any case, the bug doesn't cause any real problems, but it is a bit of an annoyance and is obviously incorrect. We've been running with this patch for about 3 months now, and haven't run into any problems with it. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-29[PATCH] dma_get_required_mask()James Bottomley1-0/+5
This patch implements dma_get_required_mask() which may be used by drivers to probe the optimal DMA descriptor type they should be implementing on the platform. I've also tested it this time with the sym_2 driver...making it choose the correct descriptors for the platform. (although I don't have a 64 bit platform with >4GB memory, so I only confirmed it selects the 32 bit descriptors all the time...) Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-29[PATCH] Combined patch for remaining trivial sparse warnings in allnoconfig buildMika Kukkonen2-3/+3
Well, one of these (fs/block_dev.c) is a little non-trivial, but I felt throwing that away would be a shame (and I did add comments ;-). Also, almost all of these have been submitted earlier through other channels, but have not been picked up (the only controversial one is again the fs/block_dev.c patch, where Linus felt a better job would be done with __ffs(), but I could not convince myself that it does the same thing as the original code). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-29sparse: fix pointer/integer confusionLinus Torvalds3-4/+4
I don't think we're in K&R any more, Toto. If you want a NULL pointer, use NULL. Don't use an integer. Most of the users really didn't seem to know the proper type.
2004-06-27[PATCH] fix GFP zone modifier iteratorsAndy Whitcroft1-4/+4
For each node there are a defined list of MAX_NR_ZONES zones. These are selected as a result of the __GFP_DMA and __GFP_HIGHMEM zone modifier flags being passed to the memory allocator as part of the GFP mask. Each node has a set of zone lists, node_zonelists, which defines the list and order of zones to scan for each flag combination. When initialising these lists we iterate over modifier combinations 0 .. MAX_NR_ZONES. However, this is only correct when there are at most ZONES_SHIFT flags. If another flag is introduced zonelists for it would not be initialised. This patch introduces GFP_ZONETYPES (based on GFP_ZONEMASK) as a bound for the number of modifier combinations. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-26[PATCH] Don't hold i_sem on swapfilesHugh Dickins2-6/+25
We permanently hold the i_sem of swapfiles so that nobody can accidentally ftruncate them, causing subsequent filesystem destruction. Problem is, it's fairly easy for things like backup applications to get stuck on the swapfile, sleeping until someone does a swapoff. So take all that out again and add a new S_SWAPFILE inode flag. Test that in the truncate path and refuse to truncate an in-use swapfile. Synchronisation between swapon and truncate is via i_sem. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
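A sketch of the two halves of the scheme; IS_SWAPFILE() is assumed to test the new flag, and the error value here is illustrative.

    /* sys_swapon(), under i_sem: mark the inode as an active swapfile */
    static void mark_swapfile_sketch(struct inode *inode)
    {
        inode->i_flags |= S_SWAPFILE;
    }

    /* truncate path, also under i_sem, so the check is race-free */
    static int may_truncate_sketch(struct inode *inode)
    {
        if (IS_SWAPFILE(inode))
            return -ETXTBSY;    /* refuse to truncate an in-use swapfile */
        return 0;
    }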
2004-06-26[PATCH] lock ordering comment updateAndrew Morton1-0/+8
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-26[PATCH] anon_vma list locking bugHugh Dickins1-7/+8
Vladimir Saveliev reported anon_vma_unlink list_del BUG (LKML 24 June). His testing is still in progress, but we believe it comes from a nasty locking deficiency I introduced in 2.6.7's anon_vma_prepare. Andrea's original anon_vma_prepare was fine, it needed no anon_vma lock because it was always linking a freshly allocated structure; but my find_mergeable enhancement let it adopt a neighbouring anon_vma, which of course needs locking against a racing linkage from another mm - which the earlier adjust_vma fix seems to have made more likely. Does anon_vma->lock nest inside or outside page_table_lock? Inside, but that's not obvious without a lock ordering list: instead of listing the order here, update the list in filemap.c; but a separate patch because that's less urgent and more likely to get wrong or provoke controversy. (Could do it with anon_vma lock after dropping page_table_lock, but a long comment explaining why some code is safe suggests it's not.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-26[PATCH] fix NUMA boundary between ZONE_NORMAL and HIGHMEMMartin J. Bligh1-0/+9
From: Andy Whitcroft <apw@shadowen.org> This patch eliminates the false hole which can form between ZONE_NORMAL and ZONE_HIGHMEM. This is most easily seen when 4g/4g split is enabled, but it's always broken, and we just happen not to hit it most of the time. Basically, the patch changes the allocation of the numa remaps regions (the source of the holes) such that they officially fall within VMALLOC space, where they belong. Tested in -mjb for a couple of months, and again against 2.6.7-mm1. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Martin J. Bligh <mbligh@aracnet.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-26[PATCH] per node huge page stats in sysfsKenneth W. Chen1-9/+27
It adds per node huge page stats in sysfs. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-26[PATCH] make __free_pages_bulk more comprehensibleMartin J. Bligh1-14/+11
I find __free_pages_bulk very hard to understand ... (I was trying to mod it for the non-MAX_ORDER-aligned zones, and cleaned it up first). This should make it much more comprehensible to mortal man ... I benchmarked the changes on the big 16x and it's no slower (actually it's about 0.5% faster, but that's within experimental error). I moved the creation of mask into __free_pages_bulk from the caller - it seems to really belong inside there. Then instead of doing weird limbo dances with mask, I made it use order instead where it's more intuitive. Personally I find this makes the whole thing a damned sight easier to understand. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-26[PATCH] NX (No eXecute) support for x86Ingo Molnar1-0/+22
we'd like to announce the availability of the following kernel patch: http://redhat.com/~mingo/nx-patches/nx-2.6.7-rc2-bk2-AE which makes use of the 'NX' x86 feature pioneered in AMD64 CPUs and for which support has also been announced by Intel. (other x86 CPU vendors, Transmeta and VIA announced support as well. Windows support for NX has also been announced by Microsoft, for their next service pack.) The NX feature is also being marketed as 'Enhanced Virus Protection'. This patch makes sure Linux has full support for this hardware feature on x86 too. What does this patch do? The pagetable format of current x86 CPUs does not have an 'execute' bit. This means that even if an application maps a memory area without PROT_EXEC, the CPU will still allow code to be executed in this memory. This property is often abused by exploits when they manage to inject hostile code into this memory, for example via a buffer overflow. The NX feature changes this and adds a 'dont execute' bit to the PAE pagetable format. But since the flag defaults to zero (for compatibility reasons), all pages are executable by default and the kernel has to be taught to make use of this bit. If the NX feature is supported by the CPU then the patched kernel turns on NX and it will enforce userspace executability constraints such as a no-exec stack and no-exec mmap and data areas. This means less chance for stack overflows and buffer-overflows to cause exploits. furthermore, the patch also implements 'NX protection' for kernelspace code: only the kernel code and modules are executable - so even kernel-space overflows are harder (in some cases, impossible) to exploit. Here is how kernel code that tries to execute off the stack is stopped: kernel tried to access NX-protected page - exploit attempt? (uid: 500) Unable to handle kernel paging request at virtual address f78d0f40 printing eip: ... The patch is based on a prototype NX patch written for 2.4 by Intel - special thanks go to Suresh Siddha and Jun Nakajima @ Intel. The existing NX support in the 64-bit x86_64 kernels has been written by Andi Kleen and this patch is modeled after his code. Arjan van de Ven has also provided lots of feedback and he has integrated the patch into the Fedora Core 2 kernel. Test rpms are available for download at: http://redhat.com/~arjanv/2.6/RPMS.kernel/ the kernel-2.6.6-1.411 rpms have the NX patch applied. here's a quickstart to recompile the vanilla kernel from source with the NX patch: http://redhat.com/~mingo/nx-patches/QuickStart-NX.txt update: - make the heap non-executable on PT_GNU_STACK binaries. - make all data mmap()s (and the heap) executable on !PT_GNU_STACK (legacy) binaries. This has no effect on non-NX CPUs, but should be much more compatible on NX CPUs. The only effect it has it has on non-NX CPUs is the extra 'x' bit displayed in /proc/PID/maps. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-26[PATCH] __alloc_bootmem_node should not panic when it failsAnton Blanchard1-6/+1
__alloc_bootmem_node currently panics if it can't satisfy an allocation for a particular node. That's rather antisocial; we should at the very least return NULL and allow the caller to proceed (e.g. try another node). A quick look at alloc_bootmem_node usage suggests we should fall back to allocating from other nodes if it fails (as arch/alpha/kernel/pci_iommu.c and arch/x86_64/kernel/setup64.c do). The following patch does that. We fall back to the regular __alloc_bootmem when __alloc_bootmem_node fails, which means all other nodes are checked for available memory. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
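The fallback boils down to something like this sketch (the function name is invented; the real change lives inside __alloc_bootmem_node itself):

    static void *bootmem_alloc_prefer_node_sketch(pg_data_t *pgdat,
                    unsigned long size, unsigned long align, unsigned long goal)
    {
        void *ptr = __alloc_bootmem_node(pgdat, size, align, goal);

        if (!ptr)               /* node-local pool exhausted: try any node */
            ptr = __alloc_bootmem(size, align, goal);
        return ptr;
    }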
2004-06-23[PATCH] symlink 7/9: shmfsAlexander Viro1-26/+21
shm switched (it almost belongs to SL3, but it does some extra stuff after the link traversal).
2004-06-23[PATCH] fix x86-64 ptrace access to 32-bit vsyscall pageRoland McGrath1-3/+8
When I made get_user_pages support looking up a pte for the "gate" area, I assumed it would be part of the kernel's fixed mappings. On x86-64 running a 32-bit task, the 32-bit vsyscall DSO page still has no vma but has its pte allocated in the user mm in the normal fashion. This patch makes it use the generic page-table lookup calls rather than the shortcuts. With this, ptrace on x86-64 can access a 32-bit process's vsyscall page. The behavior on x86 is unchanged. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] kswapd warning fixAndrew Morton1-1/+2
mm/vmscan.c: In function `kswapd': mm/vmscan.c:1139: warning: no return statement in function returning non-void Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] oom killer: ignore free swapspaceAndrew Morton1-6/+0
From: William Lee Irwin III <wli@holomorphy.com> During stress testing at Oracle to determine the maximum number of clients 2.6 can service, it was discovered that the failure mode of excessive numbers of clients was kernel deadlock. The following patch removes the check if (nr_swap_pages > 0) from out_of_memory() as this heuristic fails to detect memory exhaustion due to pinned allocations, directly causing the aforementioned deadlock. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] hugetlb.c - fix try_to_free_low()Andrew Morton1-7/+6
From: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Turn on CONFIG_HIGHMEM and CONFIG_HUGETLBFS. Try to config the hugetlb pool: [root@quokka]# echo 100 > /proc/sys/vm/nr_hugepages [root@quokka]# grep HugePage /proc/meminfo HugePages_Total: 100 HugePages_Free: 100 [root@quokka]# echo 20 > /proc/sys/vm/nr_hugepages [root@quokka]# grep HugePage /proc/meminfo HugePages_Total: 0 HugePages_Free: 0 [root@quokka]# echo 100 > /proc/sys/vm/nr_hugepages [root@quokka]# grep HugePage /proc/meminfo HugePages_Total: 100 HugePages_Free: 100 [root@quokka]# echo 0 > /proc/sys/vm/nr_hugepages [root@quokka]# grep HugePage /proc/meminfo HugePages_Total: 31 HugePages_Free: 31 The argument "count" passed to try_to_free_low() is the config parameter for desired hugetlb page pool size. But the implementation took that input argument as number of pages to free. It also decrement the config parameter as well. All give random behavior depend on how many hugetlb pages are in normal/highmem zone. A two line fix in try_to_free_low() would be: - if (!--count) - return 0; + if (count >= nr_huge_pages) + return count; But more appropriately, that function shouldn't return anything. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] hugetlb.c: use safe iteratorAndrew Morton1-2/+2
From: "Chen, Kenneth W" <kenneth.w.chen@intel.com> With list poisoning on by default from linux-2.6.7, it's easier than ever to trigger the bug in try_to_free_low(). It ought to use the safe version of list iterater. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] zap_pte_range speedupAndrew Morton1-1/+1
From: Hugh Dickins <hugh@veritas.com> zap_pte_range is wasting time marking anon pages accessed: its original !PageSwapCache test should have been reinstated when page_mapping was changed to return swapper_space; or more simply, just check !PageAnon. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] make total_swap_pages a longAndrew Morton1-1/+1
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] Make nr_swap_pages a longAndrew Morton1-1/+1
From: Anton Blanchard <anton@samba.org> ../include/linux/swap.h:extern int nr_swap_pages; /* XXX: shouldn't this be ulong? --hch */ Sounds like it should be to me. Some of the code checks for nr_swap_pages < 0 so I made it a long instead. I had to fix up the ppc64 show_mem() (I'm guessing there will be other trivial changes required in other 64-bit archs, I can find and fix those if you want). I also noticed that the ppc64 show_mem() used ints to store page counts. We can overflow that, so make them unsigned long. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] reduce function inlining in slab.cAndrew Morton1-47/+37
From: Manfred Spraul <manfred@colorfullife.com> slab.c contains too many inline functions: - some functions that are not performance critical were inlined. Waste of text size. - The debug code relies on __builtin_return_address(0) to keep track of the callers. According to rmk, gcc didn't inline some functions as expected and that resulted in useless debug output. This was probably caused by the large debug-only inline functions. The attached patch removes most inline functions: - the empty on release/huge on debug inline functions were replaced with empty macros on release/normal functions on debug. - spurious inline statements were removed. The code is down to 6 inline functions: three one-liners for struct abstractions, one for a might_sleep_if test and two for the performance critical __cache_alloc / __cache_free functions. Note: If an embedded arch wants to save a few bytes by uninlining __cache_{free,alloc}: The right way to do that is to fold the functions into kmem_cache_xy and then replace kmalloc with kmem_cache_alloc(kmem_find_general_cachep(),). Signed-Off: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] hwcache align kmalloc cachesAndrew Morton1-4/+10
From: Manfred Spraul <manfred@colorfullife.com> Reversing the patches that made all caches hw cacheline aligned had an unintended side effect on the kmalloc caches: Before they had the SLAB_HWCACHE_ALIGN flag set, now it's clear. This breaks one sgi driver - it expects aligned caches. Additionally I think it's the right thing to do: It costs virtually nothing (the caches are power-of-two sized) and could reduce false sharing. Additionally, the patch adds back the documentation for the SLAB_HWCACHE_ALIGN flag. Signed-Off: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] tweak the buddy allocator for better I/O mergingAndrew Morton1-5/+17
From: William Lee Irwin III <wli@holomorphy.com> Based on Arjan van de Ven's idea, with guidance and testing from James Bottomley. The physical ordering of pages delivered to the IO subsystem is strongly related to the order in which fragments are subdivided from larger blocks of memory tracked by the page allocator. Consider a single MAX_ORDER block of memory in isolation acted on by a sequence of order 0 allocations in an otherwise empty buddy system. Subdividing the block beginning at the highest addresses will yield all the pages of the block in reverse, and subdividing the block beginning at the lowest addresses will yield all the pages of the block in physical address order. Empirical tests demonstrate this ordering is preserved, and that changing the order of subdivision so that the lowest page is split off first resolves the sglist merging difficulties encountered by driver authors at Adaptec and others in James Bottomley's testing. James found that before this patch, there were 40 merges out of about 32K segments. Afterward, there were 24007 merges out of 19513 segments, for a merge rate of about 55%. Merges of 128 segments, the maximum allowed, were observed afterward, where beforehand they never occurred. It also improves dbench on my workstation and works fine there. Signed-off-by: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] vmscan.c: dont reclaim too many pagesAndrew Morton1-0/+8
The shrink_zone() logic can, under some circumstances, cause far too many pages to be reclaimed. Say, we're scanning at high priority and suddenly hit a large number of reclaimable pages on the LRU. Change things so we bail out when SWAP_CLUSTER_MAX pages have been reclaimed. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] vmscan.c scan rate fixesAndrew Morton2-41/+33
We've been futzing with the scan rates of the inactive and active lists far too much, and it's still not right (Anton reports interrupt-off times of over a second). - We have this logic in there from 2.4.early (at least) which tries to keep the inactive list 1/3rd the size of the active list. Or something. I really cannot see any logic behind this, so toss it out and change the arithmetic in there so that all pages on both lists have equal scan rates. - Chunk the work up so we never hold interrupts off for more than 32 pages worth of scanning. - Make the per-zone scan-count accumulators unsigned long rather than atomic_t. Mainly because atomic_t's could conceivably overflow, but also because access to these counters is racy-by-design anyway. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] vmscan.c: shuffle things aroundAndrew Morton1-47/+45
Move all the data structure declarations, macros and variable definitions to less surprising places. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23[PATCH] cpumask: bitmap cleanup preparation for cpumask overhaulAndrew Morton1-8/+6
From: Paul Jackson <pj@sgi.com> Document the bitmap bit model and handling of unused bits. Tighten up bitmap so it does not generate nonzero bits in the unused tail if it is not given any on input. Add intersects, subset, xor and andnot operators. Change bitmap_complement to take two operands. Add a couple of missing 'const' qualifiers on bitops test_bit and bitmap_equal args. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-21sparse: clean up warning in swapfile.cLinus Torvalds1-1/+1
2004-06-20[PATCH] NUMA API updatesAndi Kleen1-10/+13
This patch fixes three issues in the NUMA API - When 1 was passed to set_mempolicy or mbind as the maxnodes argument, get_nodes could corrupt the stack and cause a crash. Fix that. - Remove the restriction to do interleaving only for order 0. Together with the patch that went in previously to use interleaving policy at boot time this should give back the original behaviour of distributing the big hash tables. - Fix some bad white space in comments Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-20[PATCH] mprotect propagate anon_vmaHugh Dickins1-1/+14
When mprotect shifts the boundary between vmas (merging the reprotected area into the vma before or the vma after), make sure that the expanding vma has anon_vma if the shrinking vma had, to cover anon pages imported. Thanks to Andrea for alerting us to this oversight. Cc: <andrea@suse.de> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-20[PATCH] swapoff: activate pagesAndrew Morton1-0/+7
People like to use swapoff/swapon as a way of restoring their VM to a predictable "preconditional" state. Problem is, swapoff leaves mapped anon/pagecache pages on the inactive list, so they immediately get swapped out again when swapspace becomes available. Let's move these pages onto the active list so the VM has to again decide whether to swap them out. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-20[PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER sizeDavid Howells1-0/+73
Here's a patch to allocate memory for big system hash tables with the bootmem allocator rather than with the main page allocator. It is needed for three reasons: (1) So that the size can be bigger than MAX_ORDER. IBM have done some testing on their big PPC64 systems (64GB of RAM) with linux-2.4 and found that they get better performance if the sizes of the inode cache hash, dentry cache hash, buffer head hash and page cache hash are increased beyond MAX_ORDER (order 11). Now the main allocator can't allocate anything larger than MAX_ORDER, but the bootmem allocator can. In 2.6 it appears that only the inode and dentry hashes remain of those four, but there are other hash tables that could use this service. (2) Changing MAX_ORDER appears to have a number of effects beyond just limiting the maximum size that can be allocated in one go. (3) Should someone want a hash table in which each bucket isn't a power of two in size, memory will be wasted as the chunk of memory allocated will be a power of two in size (to hold a power of two number of buckets). On the other hand, using the bootmem allocator means the allocation will only take up sufficient pages to hold it, rather than the next power of two up. Admittedly, this point doesn't apply to the dentry and inode hashes, but it might to another hash table that might want to use this service. I've coalesced the meat of the inode and dentry allocation routines into one such routine in mm/page_alloc.c that the respective initialisation functions now call before mem_init() is called. This routine gets its approximation of memory size by counting up the ZONE_NORMAL and ZONE_DMA pages (and ZONE_HIGHMEM if requested) in all the nodes passed to the main allocator by paging_init() (or wherever the arch does it). It does not use max_low_pfn as that doesn't seem to be available on all archs, and it doesn't use num_physpages since that includes highmem pages not available to the kernel for allocating data structures upon - which may not be appropriate when calculating hash table size. On the off chance that the size of each hash bucket may not be exactly a power of two, the routine will only allocate as many pages as is necessary to ensure that the number of buckets is exactly a power of two, rather than allocating the smallest power-of-two sized chunk of memory that will hold the same array of buckets. The maximum size of any single hash table is given by MAX_SYS_HASH_TABLE_ORDER, as is now defined in linux/mmzone.h. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
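An illustrative sizing sketch only, not the routine added by the patch: pick a power-of-two bucket count from a memory estimate (the real code walks ZONE_DMA/ZONE_NORMAL per node; num_physpages and the /256 scale below are stand-ins), then take the pages from bootmem so the table may exceed MAX_ORDER.

    static void *alloc_boot_hash_sketch(unsigned long bucketsize,
                                        unsigned int *hash_shift)
    {
        unsigned long goal = (num_physpages << PAGE_SHIFT) / 256;
        unsigned long buckets = 1;
        unsigned int shift = 0;

        /* smallest power-of-two bucket count whose table reaches the goal */
        while (buckets * bucketsize < goal) {
            buckets <<= 1;
            shift++;
        }
        *hash_shift = shift;

        /* bootmem is not bound by MAX_ORDER, unlike __get_free_pages() */
        return alloc_bootmem(buckets * bucketsize);
    }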
2004-06-19[PATCH] fix amd64 boot breakageArjan van de Ven1-1/+1
This fixes a bug that prevented my amd64 box from booting; numa_default_policy was __init; however, it's called like this in init/main.c: free_initmem(); unlock_kernel(); system_state = SYSTEM_RUNNING; numa_default_policy(); i.e. after free_initmem(). This resulted in it being reused/freed and that gives a nasty oops.
2004-06-17[PATCH] handle partial DIO writeDaniel McNeil1-1/+1
The fsx-linux hole fill failure problem was caused by generic_file_aio_write_nolock() not handling the partial DIO write correctly. Here's a patch that lets DIO do the partial write, and the fallback to buffered is done (correctly) for what is left. This fixes the hole filling without retrying the entire i/o. This patch also applies to 2.6.7-rc3 with some offset. I tested this (on ext3) with fsx-linux -l 500000 -r 4096 -t 4096 -w 4096 -Z -N 10000 junk -R -W Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17[PATCH] remap_file_pages() speedupAndrea Arcangeli1-5/+12
Avoid taking down_write(mmap_sem) unless we really need it. Seems that the only reason we're taking it for writing is to protect vma->vm_flags. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17[PATCH] Use numa policy API for boot time policyAndi Kleen2-44/+15
Suggested by Manfred Spraul. __get_free_pages had a hack to do node interleaving allocation at boot time. This patch sets an interleave process policy using the NUMA API for init and the idle threads instead. Before entering the user space init the policy is reset to default again. Result is the same. Advantage is less code and removing of a check from a fast path. Removes more code than it adds. I verified that the memory distribution after boot is roughly the same. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17[PATCH] dm-io: device-mapper i/o library for kcopydAlasdair G. Kergon1-6/+0
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17[PATCH] Fix read() vs truncate raceNick Piggin1-26/+50
do_generic_mapping_read() { isize1 = i_size_read(); ... readpage copy_to_user up to isize1; } readpage() { isize2 = i_size_read(); ... read blocks ... zero-fill all blocks past isize2 } If a second thread runs truncate and shrinks i_size, so isize1 and isize2 are different, the read can return up to a page of zero-fill that shouldn't really exist. The trick is to read isize1 after doing the readpage. I realised this is the right way to do it without having to change the readpage API. The patch should not cost any cycles when reading from pagecache. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17[PATCH] invalidate_inodes2(): mark pages not uptodateAndrew Morton1-2/+8
Andrea Arcangeli <andrea@suse.de> points out that invalidate_inode_pages2() is supposed to mark mapped-into-pagetable pages as not uptodate so that next time someone faults the page in we will go get a new version from backing store. The callers are the direct-io code and the NFS "something changed on the server" code. In both these cases we do need to go and re-read the page. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17[PATCH] Clean up asm/pgalloc.h includeRussell King9-9/+0
This patch cleans up needless includes of asm/pgalloc.h from the fs/ kernel/ and mm/ subtrees. Compile tested on multiple ARM platforms, and x86, this patch appears safe. This patch is part of a larger patch aiming towards getting the include of asm/pgtable.h out of linux/mm.h, so that asm/pgtable.h can sanely get at things like mm_struct and friends. I suggest testing in -mm for a while to ensure there aren't any hidden arch issues. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17[PATCH] mm: flush TLB when clearing youngHugh Dickins1-3/+3
Traditionally we've not flushed TLB after clearing the young/referenced bit, it has seemed just a waste of time. Russell King points out that on some architectures, with the move from 2.4 mm sweeping to 2.6 rmap, this may be a serious omission: very frequently referenced pages never re-marked young, and the worst choices made for unmapping. So, replace ptep_test_and_clear_young by ptep_clear_flush_young throughout rmap.c. Originally I'd imagined making some kind of TLB gather optimization, but don't see what now: whether worth it rather depends on how common cross-cpu flushes are, and whether global or not. ppc and ppc64 have already found this issue, and worked around it by arranging TLB flush from their ptep_test_and_clear_young: with the aid of pgtable rmap pointers. I'm hoping ptep_clear_flush_young will allow ppc and ppc64 to remove that special code, but won't change them myself. It's worth noting that it is Andrea's anon_vma rmap which makes the vma available for ptep_clear_flush_young in page_referenced_one: anonmm and pte_chains would both need an additional find_vma for that. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
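The substitution in rmap.c is essentially this one-liner, shown here as an illustrative before/after pair (the vma comes from the anon_vma/prio_tree rmap walk):

    /* before: stale TLB entries could keep a cleared young bit effectively set */
    if (ptep_test_and_clear_young(pte))
        referenced++;

    /* after: clearing young also flushes this mapping's TLB entry */
    if (ptep_clear_flush_young(vma, address, pte))
        referenced++;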
2004-06-13[PATCH] Sparse fix to mm/vmscan.cRandy Dunlap1-1/+1
Nick changed shrink_cache() to void, but one call was missed. From: Mika Kukkonen <mika@osdl.org> Signed-off-by: Randy Dunlap <rddunlap@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-12[PATCH] page-writeback.c: use read_page_state()Andrew Morton1-26/+42
Use the new read_page_state() in page-writeback.c to avoid large on-stack structures. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-12[PATCH] vmscan.c: use read_page_state()Andrew Morton1-4/+5
Use the new read_page_state() in vmscan.c to avoid large on-stack structures. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-12[PATCH] Implement read_page_stateAndrew Morton1-0/+17
struct page_state is large (148 bytes) and we put them on the stack in awkward code paths (page reclaim...) So implement a simple read_page_state() which can be used to pluck out a single member from the all-cpus page_state accumulators. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
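A hedged sketch of the idea: sum a single field of the per-cpu accumulators, selected by its byte offset, instead of assembling the whole struct page_state on the stack (the per-cpu variable name and the macro below are illustrative).

    unsigned long __read_page_state(unsigned long offset)
    {
            unsigned long ret = 0;
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++) {
                    unsigned long addr;

                    if (!cpu_online(cpu))
                            continue;
                    addr = (unsigned long)&per_cpu(page_states, cpu) + offset;
                    ret += *(unsigned long *)addr;
            }
            return ret;
    }

    #define read_page_state(member) \
            __read_page_state(offsetof(struct page_state, member))

    /* callers pluck out one member, e.g.: */
    nr_dirty = read_page_state(nr_dirty);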
2004-06-12[PATCH] vmscan.c: struct scan_controlAndrew Morton1-101/+102
From: Nick Piggin <nickpiggin@yahoo.com.au> Replace lots of parameters to functions in mm/vmscan.c with a structure struct scan_control. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
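Roughly, the consolidation looks like the sketch below; the field names indicate the kind of state being threaded through vmscan.c rather than copying the patch verbatim.

    struct scan_control {
            unsigned long nr_to_scan;       /* how many pages to examine */
            unsigned long nr_scanned;       /* incremented for each page looked at */
            unsigned long nr_reclaimed;     /* incremented for each page freed */
            unsigned int gfp_mask;          /* allocation context of the caller */
            int may_writepage;              /* allowed to start filesystem I/O? */
            int priority;                   /* current scan priority */
    };

    /* instead of: shrink_list(list, gfp_mask, &nr_scanned, &nr_reclaimed, ...) */
    static int shrink_list(struct list_head *page_list, struct scan_control *sc);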
2004-06-12[PATCH] vmscan.c: move ->writepage invocation into its own functionAndrew Morton1-46/+81
From: Nick Piggin <nickpiggin@yahoo.com.au> Move the invocation of ->writepage for to-be-reclaimed pages into its own function "pageout". From: Nikita Danilov <nikita@namesys.com> with small changes from Nick Piggin Signed-off-by: Nick Piggin <nickpiggin@cyberone.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
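A simplified, hedged sketch of such a helper; the enum and the writeback details are illustrative rather than a verbatim copy of the patch.

    typedef enum {
            PAGE_KEEP,      /* could not write; keep the page on the list */
            PAGE_ACTIVATE,  /* move the page back to the active list */
            PAGE_SUCCESS,   /* a write was started */
            PAGE_CLEAN,     /* page was already clean; reclaim it directly */
    } pageout_t;

    static pageout_t pageout(struct page *page, struct address_space *mapping)
    {
            struct writeback_control wbc = {
                    .sync_mode   = WB_SYNC_NONE,
                    .nr_to_write = SWAP_CLUSTER_MAX,
            };

            if (!mapping || !mapping->a_ops->writepage)
                    return PAGE_ACTIVATE;           /* nowhere to write it to */
            if (!clear_page_dirty_for_io(page))
                    return PAGE_CLEAN;              /* someone else cleaned it */
            if (mapping->a_ops->writepage(page, &wbc) < 0)
                    return PAGE_ACTIVATE;           /* writepage refused or failed */
            return PAGE_SUCCESS;
    }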
2004-06-12[PATCH] vmscan: try harder for GFP_NOFS allocatorsAndrew Morton1-10/+4
Page reclaim bales out very early if reclaim isn't working out for !__GFP_FS allocation attempts. It was a fairly arbitrary thing in the first place and chances are the caller will simply retry the allocation or will do something which is disruptive to userspace. So remove that code and do much more scanning. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-12[PATCH] vmscan: handle synchronous writepage()Andrew Morton1-2/+12
Teach page reclaim to understand synchronous ->writepage implementations. If ->writepage completed I/O prior to returning we can proceed to reclaim the page without giving it another trip around the LRU. This is beneficial for ramdisk-backed S_ISREG files: we can reclaim the file's pages as fast as the ramdisk driver needs to allocate them and this prevents I/O errors due to OOM in rd_blkdev_pagecache_IO(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-12[PATCH] numaq mempolicy.c build fixAndrew Morton1-0/+1
From: William Lee Irwin III <wli@holomorphy.com> mm/mempolicy.c: In function `verify_pages': mm/mempolicy.c:246: warning: implicit declaration of function `kmap_atomic' mm/mempolicy.c:249: warning: implicit declaration of function `kunmap_atomic' pte_offset_map() invokes kmap_atomic() via macro, without including the required header. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-04[PATCH] mm: kill missed pte warningHugh Dickins1-22/+7
I've seen no warnings, nor heard any reports of warnings, that anon_vma ever misses ptes (nor anonmm before it). That WARN_ON (with its useless stack dump) was okay to goad developers into making reports, but would mainly be an irritation if it ever appears on user systems: kill it now. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-04[PATCH] mm: get_user_pages vs. try_to_unmapHugh Dickins1-0/+17
Andrea Arcangeli's fix to an ironic weakness with get_user_pages. try_to_unmap_one must check page_count against page->mapcount before unmapping a swapcache page: because the raised pagecount by which get_user_pages ensures the page cannot be freed, will cause any write fault to see that page as not exclusively owned, and therefore a copy page will be substituted for it - the reverse of what's intended. rmap.c was entirely free of such page_count heuristics before, I tried hard to avoid putting this in. But Andrea's fix rarely gives a false positive; and although it might be nicer to change exclusive_swap_page etc. to rely on page->mapcount instead, it seems likely that we'll want to get rid of page->mapcount later, so better not to entrench its use. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-04[PATCH] mm: vma_adjust insert file earlierHugh Dickins1-7/+17
For those arches (arm and parisc) which use the i_mmap tree to implement flush_dcache_page, during split_vma there's a small window in vma_adjust when flush_dcache_mmap_lock is dropped, and pages in the split-off part of the vma might for an instant be invisible to __flush_dcache_page. Though we're more solid there than ever before, I guess it's a bad idea to leave that window: so (with regret, it was structurally nicer before) take __vma_link_file (and vma_prio_tree_init) out of __vma_link. vma_prio_tree_init (which NULLs a few fields) is actually only needed when copying a vma, not when a new one has just been memset to 0. __insert_vm_struct is used by nothing but vma_adjust's split_vma case: comment it accordingly, remove its mark_mm_hugetlb (it can never create a new kind of vma) and its validate_mm (another follows immediately). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-04[PATCH] mm: vma_adjust adjust_next wrapHugh Dickins1-10/+17
Fix vma_adjust adjust_next wrapping: Rajesh V. pointed out that if end were 2GB or more beyond next->vm_start (on 32-bit), then next->vm_pgoff would have been negatively adjusted. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Rajesh Venkatasubramanian <vrajesh@umich.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-04[PATCH] mm: follow_page invalid pte_pageHugh Dickins1-7/+3
The follow_page write-access case is relying on pte_page before checking pfn_valid: rearrange that - and we don't need three struct page *pages. (I notice mempolicy.c's verify_pages is also relying on pte_page, but I'll leave that to Andi: maybe it ought to be failing on, or skipping over, VM_IO or VM_RESERVED vmas?) Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-04[PATCH] mm: swapper_space.i_mmap_nonlinearHugh Dickins1-1/+3
Initialize swapper_space.i_mmap_nonlinear, so mapping_mapped reports false on it (as it used to do). Update comment on swapper_space, now more fields are used than those initialized explicitly. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-03[PATCH] sparse: hugetlb sysctl annotationAlexander Viro1-1/+2
2004-06-02[PATCH] Export swapper_spaceRussell King1-1/+2
swapper_space is needed by at least loop/st/sg these days.
2004-06-02[PATCH] hugetlbpage: reinitialise compound page destructorAndrew Morton1-0/+1
From: David Gibson <david@gibson.dropbear.id.au> Currently the hugepage code stores the hugepage destructor in the mapping field of the second of the compound pages. However, this field is never cleared again, which causes tracebacks from free_pages_check() if the hugepage is later destroyed by reducing the number in /proc/sys/vm/nr_hugepages. This patch fixes the bug by clearing the mapping field when the hugepage is freed. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-02[PATCH] hugetlbpage msync() fixAndrew Morton1-1/+9
From: David Gibson <david@gibson.dropbear.id.au> Currently, calling msync() on a hugepage area will cause the kernel to blow up with a bad_page() (at least on ppc64, but I think the problem will exist on other archs too). The msync path attempts to walk pagetables which may not be there, or may have an unusual layout for hugepages. Luckily we shouldn't need to do anything for an msync on hugetlbfs beyond flushing the cache, so this patch should be sufficient to fix the problem. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-02[PATCH] mm/oom_kill.c trivial cleanupAndrew Morton1-1/+0
From: "Luiz Fernando N. Capitulino" <lcapitulino@prefeitura.sp.gov.br> Remove duplicated assignment. Signed-off by: Luiz Capitulino <lcapitulino@prefeitura.sp.gov.br> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-02[PATCH] shrink_all_memory() fixesAndrew Morton1-3/+6
- Off-by-one in balance_pgdat means that we're not scanning the zones all the way down to priority=0. - Always set zone->temp_priority in shrink_caches(). I'm not sure why I had the `if (zone->free_pages < zone->pages_high)' test in there, but it's preventing us from setting ->prev_priority correctly on the try_to_free_pages() path. - Set zone->prev_priority to the current priority if it's currently a "lower" priority. This allows us to build up the pressure on mapped pages on the first scanning pass rather than only on successive passes. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-31[PATCH] Mark cache_names __initdataAndrew Morton1-2/+4
From: Brian Gerst <bgerst@didntduck.org> We don't need to keep the pointer array around after the caches are initialized. This doesn't affect the actual strings. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-28[PATCH] sparse: partial mm/* __user annotationAlexander Viro2-13/+14
2004-05-28[PATCH] use SLAB_PANIC for general cachesAndrew Morton1-10/+6
From: Brian Gerst <bgerst@didntduck.org> Initialize the general caches using SLAB_PANIC instead of BUG(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-26[PATCH] Print backtrace for bad vfree()Andrew Morton1-0/+2
From: Andi Kleen <ak@suse.de> The printk alone is not too useful; print the backtrace too. Signed-off-by: Andrew Morton <akpm@osdl.org>
2004-05-25Split ptep_establish into "establish" and "update_access_flags"Linus Torvalds1-3/+3
ptep_establish() is used to establish a new mapping at COW time, and it always replaces a non-writable page mapping with a totally new page mapping that is dirty (and likely writable, although ptrace may cause a non-writable new mapping). Because it was nonwritable, we don't have to worry about losing concurrent dirty page bit updates. ptep_update_access_flags() leaves the same page mapping, but updates the accessed/dirty/writable bits (it only ever sets them, and never removes any permissions). Often easier, but it may race with a dirty bit update on another CPU. Booted on x86 and ppc64. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-24Pass in a "dirty" argument to ptep_establish in Linus Torvalds1-3/+3
preparation for pte update race fix. This does not actually use the information yet, but the next few patches will start to put it to some good use.
2004-05-24[PATCH] Fix nodemask clearing bug in NUMA APIAndi Kleen1-6/+5
Fix overly long nodemask clearing in get_mem_policy() by using the right size for the node mask.
2004-05-24[PATCH] rmap build fixAndrew Morton1-5/+1
From: William Lee Irwin III <wli@holomorphy.com> PMD_SIZE is not a compile-time constant on sparc. Use min() in there so that the cluster size will be evaluated at runtime if the architecture insists on doing that.
2004-05-24[PATCH] remap_file_pages: implement MAP_POPULATE for all protectionsAndrew Morton1-1/+1
Signed-off-by: Hugh Dickins <hugh@veritas.com> It seems eccentric to implement MAP_POPULATE only on PROT_NONE mappings: do_mmap_pgoff is passing down prot, then sys_remap_file_pages verifies it's not set. I guess that's an oversight from when we realized that the prot arg to sys_remap_file_pages was misdesigned. There's another oddity whose heritage is harder for me to understand, so please let me leave it to you: sys_remap_file_pages is declared as asmlinkage in mm/fremap.c, but is the one syscall declared without asmlinkage in include/linux/syscalls.h.
2004-05-24[PATCH] don't export vma_prio_tree_nextAndrew Morton1-1/+0
From: Christoph Hellwig <hch@lst.de> There's no user in modules, the function isn't in mainline and I don't see why modules should use it.
2004-05-22[PATCH] partial prefetch for vma_prio_tree_nextAndrew Morton1-9/+18
From: Rajesh Venkatasubramanian <vrajesh@umich.edu> This patch adds prefetches for walking a vm_set.list. Adding prefetches for prio tree traversals is tricky and may lead to cache thrashing. So this patch just adds prefetches only when walking a vm_set.list. I haven't done any benchmarks to show that this patch improves performance. However, this patch should help to improve performance when vm_set.lists are long, e.g., libc. Since we only prefetch vmas that are guaranteed to be used in the near future, this patch should not result in cache thrashing, theoretically. I didn't add any NULL checks before prefetching because prefetch.h clearly says prefetch(0) is okay.
2004-05-22[PATCH] rmap 40 better anon_vma sharingAndrew Morton2-7/+77
From: Hugh Dickins <hugh@veritas.com> anon_vma rmap will always necessarily be more restrictive about vma merging than before: according to the history of the vmas in an mm, they are liable to be allocated different anon_vma heads, and from that point on be unmergeable. Most of the time this doesn't matter at all; but in two cases it may matter. One case is that mremap refuses (-EFAULT) to span more than a single vma: so it is conceivable that some app has relied on vma merging prior to mremap in the past, and will now fail with anon_vma. Conceivable but unlikely, let's cross that bridge if we come to it: and the right answer would be to extend mremap, which should not be exporting the kernel's implementation detail of vma to user interface. The other case that matters is when a reasonable repetitive sequence of syscalls and faults ends up with a large number of separate unmergeable vmas, instead of the single merged vma it could have. Andrea's mprotect-vma-merging patch fixed some such instances, but left other plausible cases unmerged. There is no perfect solution, and the harder you try to allow vmas to be merged, the less efficient anon_vma becomes, in the extreme there being one to span the whole address space, from which hangs every private vma; but anonmm rmap is clearly superior to that extreme. Andrea's principle was that neighbouring vmas which could be mprotected into mergeable vmas should be allowed to share anon_vma: good insight. His implementation was to arrange this sharing when trying vma merge, but that seems to be too early. This patch sticks to the principle, but implements it in anon_vma_prepare, when handling the first write fault on a private vma: with better results. The drawback is that this first write fault needs an extra find_vma_prev (whereas prev was already to hand when implementing anon_vma sharing at try-to-merge time).
2004-05-22[PATCH] rmap 39 add anon_vma rmapAndrew Morton4-85/+351
From: Hugh Dickins <hugh@veritas.com> Andrea Arcangeli's anon_vma object-based reverse mapping scheme for anonymous pages. Instead of tracking anonymous pages by pte_chains or by mm, this tracks them by vma. But because vmas are frequently split and merged (particularly by mprotect), a page cannot point directly to its vma(s), but instead to an anon_vma list of those vmas likely to contain the page - a list on which vmas can easily be linked and unlinked as they come and go. The vmas on one list are all related, either by forking or by splitting. This has three particular advantages over anonmm: that it can cope effortlessly with mremap moves; and no longer needs page_table_lock to protect an mm's vma tree, since try_to_unmap finds vmas via page -> anon_vma -> vma instead of using find_vma; and should use less cpu for swapout since it can locate its anonymous vmas more quickly. It does have disadvantages too: a lot more change in mmap.c to deal with anon_vmas, though small straightforward additions now that the vma merging has been refactored there; more lowmem needed for each anon_vma and vma structure; an additional restriction on the merging of vmas (cannot be merged if already assigned different anon_vmas, since then their pages will be pointing to different heads). (There would be no need to enlarge the vma structure if anonymous pages belonged only to anonymous vmas; but private file mappings accumulate anonymous pages by copy-on-write, so need to be listed in both anon_vma and prio_tree at the same time. A different implementation could avoid that by using anon_vmas only for purely anonymous vmas, and use the existing prio_tree to locate cow pages - but that would involve a long search for each single private copy, probably not a good idea.) Where before the vm_pgoff of a purely anonymous (not file-backed) vma was meaningless, now it represents the virtual start address at which that vma is mapped - which the standard file pgoff manipulations treat linearly as vmas are split and merged. But if mremap moves the vma, then it generally carries its original vm_pgoff to the new location, so pages shared with the old location can still be found. Magic. Hugh has massaged it somewhat: building on the earlier rmap patches, this patch is a fifth of the size of Andrea's original anon_vma patch. Please note that this posting will be his first sight of this patch, which he may or may not approve.
2004-05-22[PATCH] rmap 38 remove anonmm rmapAndrew Morton4-295/+17
From: Hugh Dickins <hugh@veritas.com> Before moving on to anon_vma rmap, remove now what's peculiar to anonmm rmap: the anonmm handling and the mremap move cows. Temporarily reduce page_referenced_anon and try_to_unmap_anon to stubs, so a kernel built with this patch will not swap anonymous at all.
2004-05-22[PATCH] rmap 37 page_add_anon_rmap vmaAndrew Morton3-8/+8
From: Hugh Dickins <hugh@veritas.com> Silly final patch for anonmm rmap: change page_add_anon_rmap's mm arg to vma arg like anon_vma rmap, to smooth the transition between them.
2004-05-22[PATCH] rmap 36 mprotect use vma_mergeAndrew Morton2-129/+153
From: Hugh Dickins <hugh@veritas.com> Earlier on, in 2.6.6, we took the vma merging code out of mremap.c and let it rely on vma_merge instead (via copy_vma). Now take the vma merging code out of mprotect.c and let it rely on vma_merge too: so vma_merge becomes the sole vma merging engine. The fruit of this consolidation is that mprotect now merges file-backed vmas naturally. Make this change now because anon_vma will complicate the vma merging rules, let's keep them all in one place. vma_merge remains where the decisions are made, whether to merge with prev and/or next; but now [addr,end) may be the latter part of prev, or first part or whole of next, whereas before it was always a new area. vma_adjust carries out vma_merge's decision, but when sliding the boundary between vma and next, must temporarily remove next from the prio_tree too. And it turned out (by oops) to have a surer idea of whether next needs to be removed than vma_merge, so the fput and freeing moves into vma_adjust. Too much decipherment of what's going on at the start of vma_adjust? Yes, and there's a delicate assumption that you may use vma_adjust in sliding a boundary, or splitting in two, or growing a vma (mremap uses it in that way), but not for simply shrinking a vma. Which is so, and must be so (how could pages mapped in the part to go, be zapped without first splitting?), but would feel better with some protection. __vma_unlink can then be moved from mm.h to mmap.c, and mm.h's more misleading than helpful can_vma_merge is deleted.
2004-05-22[PATCH] rmap 35 mmap.c cleanupsAndrew Morton1-44/+46
From: Hugh Dickins <hugh@veritas.com> Before some real vma_merge work in mmap.c in the next patch, a patch of miscellaneous cleanups to cut down the noise: - remove rb_parent arg from vma_merge: mm->mmap can do that case - scatter pgoff_t around to ingratiate myself with the boss - reorder is_mergeable_vma tests, vm_ops->close is least likely - can_vma_merge_before take combined pgoff+pglen arg (from Andrea) - rearrange do_mmap_pgoff's ever-confusing anonymous flags switch - comment do_mmap_pgoff's mysterious (vm_flags & VM_SHARED) test - fix ISO C90 warning on browse_rb if building with DEBUG_MM_RB - stop that long MNT_NOEXEC line wrapping Yes, buried in amidst these is indeed one pgoff replaced by "next->vm_pgoff - pglen" (reverting a mod of mine which took pgoff supplied by user too seriously in the anon case), and another pgoff replaced by 0 (reverting anon_vma mod which crept in with NUMA API): neither of them really matters, except perhaps in /proc/pid/maps.
2004-05-22[PATCH] rmap 34 vm_flags page_table_lockAndrew Morton3-7/+13
From: Hugh Dickins <hugh@veritas.com> First of a batch of seven rmap patches, based on 2.6.6-mm3. Probably the final batch: remaining issues outstanding can have isolated patches. The first half of the batch is good for anonmm or anon_vma, the second half of the batch replaces my anonmm rmap by Andrea's anon_vma rmap. Judge for yourselves which you prefer. I do think I was wrong to call anon_vma more complex than anonmm (its lists are easier to understand than my refcounting), and I'm happy with its vma merging after the last patch. It just comes down to whether we can spare the extra 24 bytes (maximum, on 32-bit) per vma for its advantages in swapout and mremap. rmap 34 vm_flags page_table_lock Why do we guard vm_flags mods with page_table_lock when it's already down_write guarded by mmap_sem? There's probably a historical reason, but no sign of any need for it now. Andrea added a comment and removed the instance from mprotect.c, Hugh plagiarized his comment and removed the instances from madvise.c and mlock.c. Huge leap in scalability... not expected; but this should stop people asking why those spinlocks.
2004-05-22[PATCH] rmap 32 zap_pmd_range wrapAndrew Morton1-1/+1
From: Hugh Dickins <hugh@veritas.com> From: Andrea Arcangeli <andrea@suse.de> zap_pmd_range, alone of all those page_range loops, lacks the check for whether address wrapped. Hugh is in doubt as to whether this makes any difference to any config on any arch, but eager to fix the odd one out.
2004-05-22[PATCH] rmap 31 unlikely bad memoryAndrew Morton1-16/+16
From: Hugh Dickins <hugh@veritas.com> From: Andrea Arcangeli <andrea@suse.de> Sprinkle unlikelys throughout mm/memory.c, wherever we see a pgd_bad or a pmd_bad; likely or unlikely on pte_same or !pte_same. Put the jump in the error return from do_no_page, not in the fast path.
2004-05-22[PATCH] rmap 30 fix bad mapcountAndrew Morton1-2/+3
From: Hugh Dickins <hugh@veritas.com> From: Andrea Arcangeli <andrea@suse.de> page_alloc.c's bad_page routine should reset a bad mapcount; and it's more revealing to show the bad mapcount than just the boolean mapped.
2004-05-22[PATCH] rmap 28 remove_vm_structAndrew Morton1-20/+11
From: Hugh Dickins <hugh@veritas.com> The callers of remove_shared_vm_struct then proceed to do several more identical things: gather them together in remove_vm_struct.
2004-05-22[PATCH] rmap 27 memset 0 vmaAndrew Morton1-12/+5
From: Hugh Dickins <hugh@veritas.com> We're NULLifying more and more fields when initializing a vma (mpol_set_vma_default does that too, if configured to do anything). Now use memset to avoid specifying fields, and save a little code too. (Yes, I realize anon_vma will want to set vm_pgoff non-0, but I think that will be better handled at the core, since anon vm_pgoff is negotiable up until an anon_vma is actually assigned.)
2004-05-22[PATCH] rmap 24 no rmap fastcallsAndrew Morton1-8/+8
From: Hugh Dickins <hugh@veritas.com> I like CONFIG_REGPARM, even when it's forced on: because it's easy to force off for debugging - easier than editing out scattered fastcalls. Plus I've never understood why we make function foo a fastcall, but function bar not. Remove fastcall directives from rmap. And fix comment about mremap_moved race: it only applies to anon pages.
2004-05-22[PATCH] rmap 22 flush_dcache_mmap_lockAndrew Morton3-2/+14
From: Hugh Dickins <hugh@veritas.com> arm and parisc __flush_dcache_page have been scanning the i_mmap(_shared) list without locking or disabling preemption. That may be even more unsafe now it's a prio tree instead of a list. It looks like we cannot use i_shared_lock for this protection: most uses of flush_dcache_page are okay, and only one would need lock ordering fixed (get_user_pages holds page_table_lock across flush_dcache_page); but there's a few (e.g. in net and ntfs) which look as if they're using it in I/O completion - and it would be restrictive to disallow it there. So, on arm and parisc only, define flush_dcache_mmap_lock(mapping) as spin_lock_irq(&(mapping)->tree_lock); on i386 (and other arches left to the next patch) define it away to nothing; and use where needed. While updating locking hierarchy in filemap.c, remove two layers of the fossil record from add_to_page_cache comment: no longer used for swap. I believe all the #includes will work out, but have only built i386. I can see several things about this patch which might cause revulsion: the name flush_dcache_mmap_lock? the reuse of the page radix_tree's tree_lock for this different purpose? spin_lock_irqsave instead? can't we somehow get i_shared_lock to handle the problem?
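Concretely, the macro pair being introduced amounts to something like this (shown as a sketch; arm and parisc reuse the mapping's radix-tree lock, everyone else defines it away):

    /* include/asm-arm (and asm-parisc) cacheflush.h: */
    #define flush_dcache_mmap_lock(mapping) \
            spin_lock_irq(&(mapping)->tree_lock)
    #define flush_dcache_mmap_unlock(mapping) \
            spin_unlock_irq(&(mapping)->tree_lock)

    /* include/asm-i386 (and, in the next patch, the remaining arches): no-ops */
    #define flush_dcache_mmap_lock(mapping)         do { } while (0)
    #define flush_dcache_mmap_unlock(mapping)       do { } while (0)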
2004-05-22[PATCH] rmap 21 try_to_unmap_one mapcountAndrew Morton1-16/+10
From: Hugh Dickins <hugh@veritas.com> Why should try_to_unmap_anon and try_to_unmap_file take a copy of page->mapcount and pass it down for try_to_unmap_one to decrement? why not just check page->mapcount itself? asks akpm. Perhaps there used to be a good reason, but not any more: remove the mapcount arg.
2004-05-22[PATCH] rmap 20 i_mmap_shared into i_mmapAndrew Morton5-50/+16
From: Hugh Dickins <hugh@veritas.com> Why should struct address_space have separate i_mmap and i_mmap_shared prio_trees (separating !VM_SHARED and VM_SHARED vmas)? No good reason, the same processing is usually needed on both. Merge i_mmap_shared into i_mmap, but keep i_mmap_writable count of VM_SHARED vmas (those capable of dirtying the underlying file) for the mapping_writably_mapped test. The VM_MAYSHARE test in the arm and parisc loops is not necessarily what they will want to use in the end: it's provided as a harmless example of what might be appropriate, but maintainers are likely to revise it later (that parisc loop is currently being changed in the parisc tree anyway). On the way, remove the now out-of-date comments on vm_area_struct size.
2004-05-22[PATCH] rmap.c comment/style fixupsAndrew Morton1-21/+15
From: Christoph Hellwig <hch@lst.de>
2004-05-22[PATCH] unmap_mapping_range: add commentAndrew Morton1-0/+6
2004-05-22[PATCH] rmap 18: i_mmap_nonlinearAndrew Morton5-48/+46
From: Hugh Dickins <hugh@veritas.com> The prio_tree is of no use to nonlinear vmas: currently we're having to search the tree in the most inefficient way to find all its nonlinears. At the very least we need an indication of the unlikely case when there are some nonlinears; but really, we'd do best to take them out of the prio_tree altogether, into a list of their own - i_mmap_nonlinear.
2004-05-22[PATCH] rmap 17: real prio_treeAndrew Morton3-27/+657
From: Hugh Dickins <hugh@veritas.com> Rajesh Venkatasubramanian's implementation of a radix priority search tree of vmas, to handle object-based reverse mapping corner cases well. Amongst the objections to object-based rmap were test cases by akpm and by mingo, in which large numbers of vmas mapping disjoint or overlapping parts of a file showed strikingly poor performance of the i_mmap lists. Perhaps those tests are irrelevant in the real world? We cannot be too sure: the prio_tree is well-suited to solving precisely that problem, so unless it turns out to bring too much overhead, let's include it. Why is this prio_tree.c placed in mm rather than lib? See GET_INDEX: this implementation is geared throughout to use with vmas, though the first half of the file appears more general than the second half. Each node of the prio_tree is itself (contained within) a vma: might save memory by allocating distinct nodes from which to hang vmas, but wouldn't save much, and would complicate the usage with preallocations. Off each node of the prio_tree itself hangs a list of like vmas, if any. The connection from node to list is a little awkward, but probably the best compromise: it would be more straightforward to list likes directly from the tree node, but that would use more memory per vma, for the list_head and to identify that head. Instead, node's shared.vm_set.head points to next vma (whose shared.vm_set.head points back to node vma), and that next contains the list_head from which the rest hang - reusing fields already used in the prio_tree node itself. Currently lacks prefetch: Rajesh hopes to add some soon.
2004-05-22[PATCH] rmap 16: pretend prio_treeAndrew Morton3-64/+114
From: Hugh Dickins <hugh@veritas.com> Pave the way for prio_tree by switching over to its interfaces, but actually still implement them with the same old lists as before. Most of the vma_prio_tree interfaces are straightforward. The interesting one is vma_prio_tree_next, used to search the tree for all vmas which overlap the given range: unlike the list_for_each_entry it replaces, it does not find every vma, just those that match. But this does leave handling of nonlinear vmas in a very unsatisfactory state: for now we have to search again over the maximum range to find all the nonlinear vmas which might contain a page, which of course takes away the point of the tree. Fixed in later patch of this batch. There is no need to initialize vma linkage all over, just do it before inserting the vma in list or tree. /proc/pid/statm had an odd test for its shared count: simplified to an equivalent test on vm_file.
2004-05-22[PATCH] rmap 15: vma_adjustAndrew Morton2-83/+85
From: Hugh Dickins <hugh@veritas.com> If file-based vmas are to be kept in a tree, according to the file offsets they map, then adjusting the vma's start pgoff or its end involves repositioning in the tree, while holding i_shared_lock (and page_table_lock). We used to avoid that if possible, e.g. when just moving end; but if we're heading that way, let's now tidy up vma_merge and split_vma, and do all the locking and adjustment in a new helper vma_adjust. And please, let's call the next vma in vma_merge "next" rather than "prev". Since these patches are diffed over 2.6.6-rc2-mm2, they include the NUMA mpolicy mods which you'll have to remove to go earlier in the series, sorry for that nuisance. I have intentionally changed the one vma_mpol_equal to mpol_equal, to make the merge cases more alike.
2004-05-22[PATCH] numa api: fix end of memory handling in mbindAndrew Morton1-2/+2
From: Andi Kleen <ak@suse.de> This fixes a user-triggerable crash in mbind() in NUMA API. It would oops when running into the end of memory. Actually not really oops, because an oops with the mm sem held for writing always deadlocks.
2004-05-22[PATCH] numa api: Add policy support to anonymous memoryAndrew Morton3-10/+40
From: Andi Kleen <ak@suse.de> Change to core VM to use alloc_page_vma() instead of alloc_page(). Change the swap readahead to follow the policy of the VMA.
2004-05-22[PATCH] numa api: Add statisticsAndrew Morton1-4/+38
From: Andi Kleen <ak@suse.de> Add NUMA hit/miss statistics to page allocation and display them in sysfs. This is not 100% required for NUMA API, but without this it is very hard to verify that the NUMA policies are actually taking effect. The overhead is quite low because all counters are per CPU and the accounting only happens when CONFIG_NUMA is defined.
2004-05-22[PATCH] small numa api fixupsAndrew Morton4-0/+4
From: Christoph Hellwig <hch@lst.de> - don't include mempolicy.h in sched.h and mm.h when a forward declaration is enough. Andi argued against that in the past, but I'd really hate to add another header to two of the includes used in basically every driver when we can include it in the six files actually needing it instead (that number is for my ppc32 system, maybe other arches need more includes in their directories) - make numa api fields in task_struct conditional on CONFIG_NUMA, this gives us a few ugly ifdefs but avoids wasting memory on non-NUMA systems.
2004-05-22[PATCH] numa api: Add shared memory supportAndrew Morton1-4/+101
From: Andi Kleen <ak@suse.de> Add support to tmpfs and hugetlbfs to support NUMA API. Shared memory is a bit of a special case for NUMA policy. Normally policy is associated to VMAs or to processes, but for a shared memory segment you really want to share the policy. The core NUMA API has code for that, this patch adds the necessary changes to tmpfs and hugetlbfs. First it changes the custom swapping code in tmpfs to follow the policy set via VMAs. It is also useful to have a "backing store" of policy that saves the policy even when nobody has the shared memory segment mapped. This allows command line tools to pre configure policy, which is then later used by programs. Note that hugetlbfs needs more changes - it is also required to switch it to lazy allocation, otherwise the prefault prevents mbind() from working.
2004-05-22[PATCH] numa api: Add VMA hooks for policyAndrew Morton2-5/+31
From: Andi Kleen <ak@suse.de> NUMA API adds a policy to each VMA. During VMA creation, merging and splitting these policies must be handled properly. This patch adds the calls to this. It is a nop when CONFIG_NUMA is not defined.
2004-05-22[PATCH] numa api core: use SLAB_PANICAndrew Morton1-5/+2
2004-05-22[PATCH] mpol in copy_vmaAndrew Morton1-0/+7
From: Hugh Dickins <hugh@veritas.com> I think Andi missed the copy_vma I recently added for mremap, and it'll need something like below.... (Doesn't look like it'll optimize away when it's not needed - rather bloaty.)
2004-05-22[PATCH] numa api: Core NUMA API codeAndrew Morton2-0/+1018
From: Andi Kleen <ak@suse.de> The following patches add support for configurable NUMA memory policy for user processes. It is based on the proposal from last kernel summit with feedback from various people. This NUMA API does not attempt to implement page migration or anything else complicated: all it does is police the allocation when a page is first allocated or when a page is reallocated after swapping. Currently only support for shared memory and anonymous memory is there; policy for file based mappings is not implemented yet (although they get implicitly policied by the default process policy). It adds three new system calls: mbind to change the policy of a VMA, set_mempolicy to change the policy of a process, get_mempolicy to retrieve memory policy. User tools (numactl, libnuma, test programs, manpages) can be found in ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.6.tar.gz For details on the system calls see the manpages http://www.firstfloor.org/~andi/mbind.html http://www.firstfloor.org/~andi/set_mempolicy.html http://www.firstfloor.org/~andi/get_mempolicy.html Most user programs should actually not use the system calls directly, but use the higher level functions in libnuma (http://www.firstfloor.org/~andi/numa.html) or the command line tools (http://www.firstfloor.org/~andi/numactl.html). The system calls allow user programs and administrators to set various NUMA memory policies for putting memory on specific nodes. Here is a short description of the policies copied from the kernel patch: * NUMA policy allows the user to give hints in which node(s) memory should * be allocated. * * Support four policies per VMA and per process: * * The VMA policy has priority over the process policy for a page fault. * * interleave Allocate memory interleaved over a set of nodes, * with normal fallback if it fails. * For VMA based allocations this interleaves based on the * offset into the backing object or offset into the mapping * for anonymous memory. For process policy a process counter * is used. * bind Only allocate memory on a specific set of nodes, * no fallback. * preferred Try a specific node first before normal fallback. * As a special case node -1 here means do the allocation * on the local CPU. This is normally identical to default, * but useful to set in a VMA when you have a non default * process policy. * default Allocate on the local node first, or when on a VMA * use the process policy. This is what Linux always did * in a NUMA aware kernel and still does by, ahem, default. * * The process policy is applied for most non interrupt memory allocations * in that process' context. Interrupts ignore the policies and always * try to allocate on the local CPU. The VMA policy is only applied for memory * allocations for a VMA in the VM. * * Currently there are a few corner cases in swapping where the policy * is not applied, but the majority should be handled. When process policy * is used it is not remembered over swap outs/swap ins. * * Only the highest zone in the zone hierarchy gets policied. Allocations * requesting a lower zone just use default policy. This implies that * on systems with highmem kernel lowmem allocations don't get policied. * Same with GFP_DMA allocations. * * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between * all users and remembered even when nobody has memory mapped. This patch: This is the core NUMA API code. This includes NUMA policy aware wrappers for get_free_pages and alloc_page_vma(). On non NUMA kernels these are defined away.
The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html), get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are implemented here. Adds a vm_policy field to the VMA and to the process. The process also has field for interleaving. VMA interleaving uses the offset into the VMA, but that's not possible for process allocations. From: Andi Kleen <ak@muc.de> > Andi, how come policy_vma() calls ->set_policy under i_shared_sem? I think this can be actually dropped now. In an earlier version I did walk the vma shared list to change the policies of other mappings to the same shared memory region. This turned out too complicated with all the corner cases, so I eventually gave in and added ->get_policy to the fast path. Also there is still the mmap_sem which prevents races in the same MM. Patch to remove it attached. Also adds documentation and removes the bogus __alloc_page_vma() prototype noticed by hch. From: Andi Kleen <ak@suse.de> A few incremental fixes for NUMA API. - Fix a few comments - Add a compat_ function for get_mem_policy I considered changing the ABI to avoid this, but that would have made the API too ugly. I put it directly into the file because a mm/compat.c didn't seem worth it just for this. - Fix the algorithm for VMA interleave. From: Matthew Dobson <colpatch@us.ibm.com> 1) Move the extern of alloc_pages_current() into #ifdef CONFIG_NUMA. The only references to the function are in NUMA code in mempolicy.c 2) Remove the definitions of __alloc_page_vma(). They aren't used. 3) Move forward declaration of struct vm_area_struct to top of file.
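For flavour, a hypothetical user-space call to the new mbind() system call might look like the sketch below; the prototype follows the manpages linked above, and real programs are expected to go through libnuma rather than the raw syscall. The header and helper name here are illustrative, not part of the patch.

    #include <numaif.h>             /* mbind(), MPOL_* - shipped with numactl/libnuma */
    #include <stdio.h>

    /* Ask the kernel to satisfy the pages of [addr, addr+len) from node 0 only. */
    static void bind_to_node0(void *addr, unsigned long len)
    {
            unsigned long nodemask = 1UL << 0;      /* one bit per node: node 0 */

            if (mbind(addr, len, MPOL_BIND, &nodemask,
                      sizeof(nodemask) * 8, 0) != 0)
                    perror("mbind");
    }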
2004-05-22[PATCH] rmap 14: i_shared_lock fixesAndrew Morton2-9/+32
From: Hugh Dickins <hugh@veritas.com> First of batch of six patches which introduce Rajesh Venkatasubramanian's implementation of a radix priority search tree of vmas, to handle object-based reverse mapping corner cases well. rmap 14 i_shared_lock fixes Start the sequence with a couple of outstanding i_shared_lock fixes. Since i_shared_sem became i_shared_lock, we've had to shift and then temporarily remove mremap move's protection of concurrent truncation - if mremap moves ptes while unmap_mapping_range_list is making its way through the vmas, there's a danger we'd move a pte from an area yet to be cleaned back into an area already cleared. Now site the i_shared_lock with the page_table_lock in move_one_page. Replace page_table_present by get_one_pte_map, so we know when it's necessary to allocate a new page table: in which case have to drop i_shared_lock, trylock and perhaps reorder locks on the way back. Yet another fix: must check for NULL dst before pte_unmap(dst). And over in rmap.c, try_to_unmap_file's cond_resched amidst its lengthy nonlinear swapping was now causing might_sleep warnings: moved to a rather unsatisfactory and less frequent cond_resched_lock on i_shared_lock when we reach the end of the list; and one before starting on the nonlinears too: the "cursor" may become out-of-date if we do schedule, but I doubt it's worth bothering about.
2004-05-22[PATCH] Convert i_shared_sem back to a spinlockAndrew Morton6-52/+38
Having a semaphore in there causes modest performance regressions on heavily mmap-intensive workloads on some hardware. Specifically, up to 30% in SDET on NUMAQ and big PPC64. So switch it back to being a spinlock. This does mean that unmap_vmas() needs to be told whether or not it is allowed to schedule away; that's simple to do via the zap_details structure. This change means that there will be high scheduling latencies when someone truncates a large file which is currently mmapped, but nobody does that anyway. The scheduling points in unmap_vmas() are mainly for munmap() and exit(), and they still will work OK for that. From: Hugh Dickins <hugh@veritas.com> Sorry, my premature optimizations (trying to pass down NULL zap_details except when needed) have caught you out doubly: unmap_mapping_range_list was NULLing the details even though atomic was set; and if it hadn't, then zap_pte_range would have missed free_swap_and_cache and pte_clear when pte not present. Moved the optimization into zap_pte_range itself. Plus massive documentation update. From: Hugh Dickins <hugh@veritas.com> Here's a second patch to add to the first: mremap's cows can't come home without releasing the i_mmap_lock, better move the whole "Subtle point" locking from move_vma into move_page_tables. And it's possible for the file that was behind an anonymous page to be truncated while we drop that lock, don't want to abort mremap because of VM_FAULT_SIGBUS. (Eek, should we be checking do_swap_page of a vm_file area against the truncate_count sequence? Technically yes, but I doubt we need bother.) - We cannot hold i_mmap_lock across move_one_page() because move_one_page() needs to perform __GFP_WAIT allocations of pagetable pages. - Move the cond_resched() out so we test it once per page rather than only when move_one_page() returns -EAGAIN.
2004-05-22[PATCH] rmap 12 pgtable remove rmapAndrew Morton1-4/+2
From: Hugh Dickins <hugh@veritas.com> Remove the support for pte_chain rmap from page table initialization, just continue to maintain nr_page_table_pages (but only for user page tables - it also counted vmalloc page tables before, little need, and I'm unsure if per-cpu stats are safe early enough on all arches). mm/memory.c is the only core file affected. But ppc and ppc64 have found the old rmap page table initialization useful to support their ptep_test_and_clear_young: so transfer rmap's initialization to them (even on kernel page tables? well, okay).
2004-05-22[PATCH] rmap 11 mremap movesAndrew Morton4-33/+111
From: Hugh Dickins <hugh@veritas.com> A weakness of the anonmm scheme is its difficulty in tracking pages shared between two or more mms (one being an ancestor of the other), when mremap has been used to move a range of pages in one of those mms. mremap move is not very common anyway, and it's more often used on a page range exclusive to the mm; but uncommon though it may be, we must not allow unlocked pages to become unswappable. This patch follows Linus' suggestion, simply to take a private copy of the page in such a case: early C-O-W. My previous implementation was daft with respect to pages currently on swap: it insisted on swapping them in to copy them. No need for that: just take the copy when a page is brought in from swap, and its intended address is found to clash with what rmap has already noted. If do_swap_page has to make this copy in the mremap moved case (simply a call to do_wp_page), might as well do so also in the case when it's a write access but the page not exclusive, it's always seemed a little odd that swapin needed a second fault for that. A bug even: get_user_pages force imagines that a single call to handle_mm_fault must break C-O-W. Another bugfix: swapoff's unuse_process didn't check is_vm_hugetlb_page. Andrea's anon_vma has no such problem with mremap moved pages, handling them with elegant use of vm_pgoff - though at some cost to vma merging. How important is it to handle them efficiently? For now there's a msg printk(KERN_WARNING "%s: mremap moved %d cows\n", current->comm, cows);
2004-05-22[PATCH] rmap 10 add anonmm rmapAndrew Morton1-3/+236
From: Hugh Dickins <hugh@veritas.com> Hugh's anonmm object-based reverse mapping scheme for anonymous pages. We have not yet decided whether to adopt this scheme, or Andrea's more advanced anon_vma scheme. anonmm is easier for me to merge quickly, to replace the pte_chain rmap taken out in the previous patch; a patch to install Andrea's anon_vma will follow in due course. Why build up and tear down chains of pte pointers for anonymous pages, when a page can only appear at one particular address, in a restricted group of mms that might share it? (Except: see next patch on mremap.) Introduce struct anonmm per mm to track anonymous pages, all forks from one exec sharing the same bundle of linked anonmms. Anonymous pages originate in one mm, but may be forked into another mm of the bundle later on. Callouts from fork.c to allocate, dup and exit the anonmm structure private to rmap.c. From: Hugh Dickins <hugh@veritas.com> Two concurrent exits (of the last two mms sharing the anonhd). First exit_rmap brings anonhd->count down to 2, gets preempted (at the spin_unlock) by second, which brings anonhd->count down to 1, sees it's 1 and frees the anonhd (without making any change to anonhd->count itself), cpu goes on to do something new which reallocates the old anonhd as a new struct anonmm (probably not a head, in which case count will start at 1), first resumes after the spin_unlock and sees anonhd->count 1, frees "anonhd" again, it's used for something else, a later exit_rmap list_del finds list corrupt.
2004-05-22[PATCH] rmap 9 remove pte_chainsAndrew Morton6-554/+86
From: Hugh Dickins <hugh@veritas.com> Lots of deletions: the next patch will put in the new anon rmap, which should look clearer if first we remove all of the old pte-pointer-based rmap from the core in this patch - which therefore leaves anonymous rmap totally disabled, anon pages locked in memory until process frees them. Leave arch files (and page table rmap) untouched for now, clean them up in a later batch. A few constructive changes amidst all the deletions: Choose names (e.g. page_add_anon_rmap) and args (e.g. no more pteps) now so we need not revisit so many files in the next patch. Inline function page_dup_rmap for fork's copy_page_range, simply bumps mapcount under lock. cond_resched_lock in copy_page_range. Struct page rearranged: no pte union, just mapcount moved next to atomic count, so two ints can occupy one long on 64-bit; i386 struct page now 32 bytes even with PAE. Never pass PageReserved to page_remove_rmap, only do_wp_page did so. From: Hugh Dickins <hugh@veritas.com> Move page_add_anon_rmap's BUG_ON(page_mapping(page)) inside the rmap_lock (well, might as well just check mapping if !mapcount then): if this page is being mapped or unmapped on another cpu at the same time, page_mapping's PageAnon(page) and page->mapping are volatile. But page_mapping(page) is used more widely: I've a nasty feeling that clear_page_anon, page_add_anon_rmap and/or page_mapping need barriers added (also in 2.6.6 itself),
2004-05-22[PATCH] slab: consolidate panic codeAndrew Morton3-10/+9
Many places do: if (kmem_cache_create(...) == NULL) panic(...); We can consolidate all that by passing another flag to kmem_cache_create() which says "panic if it doesn't work".
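In other words, a hedged before/after sketch (the cache name and structure are made up for illustration):

    /* before: every caller open-codes the failure check */
    foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
                                   SLAB_HWCACHE_ALIGN, NULL, NULL);
    if (!foo_cachep)
            panic("Cannot create foo_cache");

    /* after: pass SLAB_PANIC and let the slab core do it */
    foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
                                   SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL, NULL);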
2004-05-22[PATCH] rmap 8 unmap nonlinearAndrew Morton3-5/+182
From: Hugh Dickins <hugh@veritas.com> The previous patch let the ptes of file pages be located via page->mapping->i_mmap and i_mmap_shared lists of vmas; which works well unless the vma is VM_NONLINEAR - one in which sys_remap_file_pages has been used to place pages in unexpected places, to avoid an explosion of distinct unmergeable vmas. Such pages were effectively locked in memory. page_referenced_file is already skipping nonlinear vmas, they'd just waste its time, and age unfairly any pages in their proper positions. Now extend try_to_unmap_file, to persuade it to swap from nonlinears. Ignoring the page requested, try to unmap a cluster of 32 neighbouring ptes (in worst case all empty slots) in a nonlinear vma, then move on to the next vma; stopping when we've unmapped at least as many maps as the requested page had (vague guide of how hard to try), or have reached the end. With large sparse nonlinear vmas, this could take a long time: inserted a cond_resched while no locks are held, unusual at this level but I think okay, shrink_list does so. Use vm_private_data a little like the old mm->swap_address, as a cursor recording how far we got, so we don't attack the same ptes next time around (earlier tried inserting an empty marker vma in the list, but that got messy). How well this will work on real-life nonlinear vmas remains to be seen, but should work better than locking them all in memory, or swapping everything out all the time. Existing users of vm_private_data have either VM_RESERVED or VM_DONTEXPAND set, both of which are in the VM_SPECIAL category where we never try to merge vmas: so removed the vm_private_data test from is_mergeable_vma, so we can still merge VM_NONLINEARs. Of course, we could instead add another field to vm_area_struct.
2004-05-22[PATCH] rmap 7 object-based rmapAndrew Morton4-44/+319
From: Hugh Dickins <hugh@veritas.com> Dave McCracken's object-based reverse mapping scheme for file pages: why build up and tear down chains of pte pointers for file pages, when page->mapping has i_mmap and i_mmap_shared lists of all the vmas which might contain that page, and it appears at one deterministic position within the vma (unless vma is nonlinear - see next patch)? Has some drawbacks: more work to locate the ptes from page_referenced and try_to_unmap, especially if the i_mmap lists contain a lot of vmas covering different ranges; has to down_trylock the i_shared_sem, and hope that doesn't fail too often. But attractive in that it uses less lowmem, and shifts the rmap burden away from the hot paths, to swapout. Hybrid scheme for the moment: carry on with pte_chains for anonymous pages, that's unchanged; but file pages keep mapcount in the pte union of struct page, where anonymous pages keep chain pointer or direct pte address: so page_mapped(page) works on both. Hugh massaged it a little: distinct page_add_file_rmap entry point; list searches check rss so as not to waste time on mms fully swapped out; check mapcount to terminate once all ptes have been found; and a WARN_ON if page_referenced should have but couldn't find all the ptes.
2004-05-22[PATCH] __set_page_dirty_nobuffers race fixAndrew Morton1-6/+11
Running __mark_inode_dirty() against a swapcache page is illegal and will oops. I see a race in set_page_dirty() wherein it can be called with a PageSwapCache page, but if the page is removed from swapcache after __set_page_dirty_nobuffers() drops tree_lock(), we have the situation where PageSwapCache() is false, but local variable `mapping' points at swapcache. Handle that by checking for non-null mapping->host. We don't care about the page state at this point - we're only interested in the inode. There is a converse case: what if a page is added to swapcache as we are running set_page_dirty() against it? In this case the page gets its PG_dirty flag set but it is not tagged as dirty in the swapper_space radix tree. The swap writeout code will handle this OK and test_clear_page_dirty()'s call to radix_tree_tag_clear(PAGECACHE_TAG_DIRTY) will silently have no effect. The only downside is that future radix-tree-based writearound won't notice that such pages are dirty and swap IO scheduling will be a teensy bit worse. The patch also fixes the (silly) testing of local variable `mapping' to see if the page was truncated. We should test page_mapping() for that.
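A hedged sketch of the two checks being described, simplified from __set_page_dirty_nobuffers (the radix-tree tagging shown is the pre-existing logic; the page_mapping() and mapping->host tests are the fixes):

    mapping = page_mapping(page);
    if (mapping) {
            spin_lock_irq(&mapping->tree_lock);
            if (page_mapping(page))         /* re-check: not truncated or removed */
                    radix_tree_tag_set(&mapping->page_tree, page_index(page),
                                       PAGECACHE_TAG_DIRTY);
            spin_unlock_irq(&mapping->tree_lock);
            if (mapping->host)              /* swapper_space has no host inode */
                    __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
    }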
2004-05-22[PATCH] Make sync_page use swapper_space againAndrew Morton2-6/+6
Revert recent changes to sync_page(). Now that page_mapping() returns &swapper_space for swapcache pages we don't need to test for PageSwapCache in sync_page().
2004-05-22[PATCH] vmscan: revert may_enter_fs changesAndrew Morton1-8/+5
Fix up the "may we call writepage" logic for the swapcache changes.
2004-05-22[PATCH] revert recent swapcache handling changesAndrew Morton2-15/+25
Go back to the 2.6.5 concepts, with rmap additions. In particular: - Implement Andrea's flavour of page_mapping(). This function opaquely does the right thing for pagecache pages, anon pages and for swapcache pages. The critical thing here is that page_mapping() returns &swapper_space for swapcache pages without actually requiring the storage at page->mapping. This frees page->mapping for the anonmm/anonvma metadata. - Andrea and Hugh placed the pagecache index of swapcache pages into page->private rather than page->index. So add new page_index() function which hides this. - Make swapper_space.set_page_dirty() again point at __set_page_dirty_buffers(). If we don't do that, a bare set_page_dirty() will fall through to __set_page_dirty_buffers(), which is silly. This way, __set_page_dirty_buffers() can continue to use page->mapping. It should never go near anon or swapcache pages. - Give swapper_space a ->set_page_dirty address_space_operation method, so that set_page_dirty() will not fall through to __set_page_dirty_buffers() for swapcache pages. That function is not set up to handle them. The main effect of these changes is that swapcache pages are treated more similarly to pagecache pages. And we are again tagging swapcache pages as dirty in their radix tree, which is a requirement if we later wish to implement swapcache writearound based on tagged radix-tree walks.
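The central helper can be pictured like this (a simplified sketch, not the exact implementation):

    static inline struct address_space *page_mapping(struct page *page)
    {
            if (PageSwapCache(page))
                    return &swapper_space;  /* no storage needed in page->mapping */
            if (PageAnon(page))
                    return NULL;            /* page->mapping carries anon rmap data */
            return page->mapping;
    }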
2004-05-22[PATCH] __add_to_swap_cache and add_to_pagecache() simplificationAndrew Morton2-6/+3
Simplify the logic in there a bit.
2004-05-22[PATCH] Make swapper_space tree_lock irq-safeAndrew Morton2-14/+14
->tree_lock is supposed to be IRQ-safe. Hugh worked out that with his changes, we never actually take it from interrupt context, so spin_lock() is sufficient. Apart from kinda freaking me out, the analysis which led to this decision becomes untrue with later patches. So make it irq-safe.
2004-05-21[PATCH] Sanitise handling of unneeded syscall stubsAndrew Morton1-0/+5
From: David Mosberger <davidm@napali.hpl.hp.com> Below is a patch that tries to sanitize the dropping of unneeded system-call stubs in generic code. In some instances, it would be possible to move the optional system-call stubs into a library routine which would avoid the need for #ifdefs, but in many cases, doing so would require making several functions global (and possibly exporting additional data-structures in header-files). Furthermore, it would inhibit (automatic) inlining in the cases where the stubs are needed. For these reasons, the patch keeps the #ifdef-approach. This has been tested on ia64 and there were no objections from the arch-maintainers (and one positive response). The patch should be safe but arch-maintainers may want to take a second look to see if some __ARCH_WANT_foo macros should be removed for their architecture (I'm quite sure that's the case, but I wanted to play it safe and only preserved the status-quo in that regard).
2004-05-20[PATCH] trivial: fix counter in build_zonelists()Andrew Morton1-1/+1
From: Rusty Russell <rusty@rustcorp.com.au> From: Stephen Leonard <stephen@phynp6.phy-astr.gsu.edu> This fixes a counter that is unnecessarily incremented in build_zonelists().
2004-05-19[PATCH] ucLinux: return 0 on success from do_munmap() for nommu versionGreg Ungerer1-1/+4
Added a nommu version of sysctl_max_map_count. Fix the return value from do_munmap(): it should return 0 on success, not EINVAL.
2004-05-19[PATCH] do_generic_mapping_read() cleanupAndrew Morton1-3/+0
We just tested the page's uptodateness; there is no point in doing it again.
2004-05-19[PATCH] slab: add kmem_cache_alloc_nodeAndrew Morton1-50/+149
From: Manfred Spraul <manfred@colorfullife.com> The attached patch adds a simple kmem_cache_alloc_node function: allocate memory on a given node. The function is intended for cpu-bound structures. It's used for alloc_percpu and for the slab-internal per-cpu structures. Jack Steiner reported a ~3% performance increase for AIM7 on a 64-way Itanium 2. Port maintainers: The patch could cause problems if CPU_UP_PREPARE is called for a cpu on a node before the corresponding memory is attached and/or if alloc_pages_node doesn't fall back to memory from another node if there is no memory in the requested node. I think no one does that, but I'm not sure.
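A minimal usage sketch of the new interface, assuming the now-familiar (cachep, flags, node) form of the call (the cache and structure names here are illustrative):

    /* carve a per-cpu control structure out of memory local to that cpu's node */
    int node = cpu_to_node(cpu);
    struct my_pcpu *p = kmem_cache_alloc_node(my_cachep, GFP_KERNEL, node);

    if (!p)
            return -ENOMEM;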
2004-05-19[PATCH] slab: allow arch override for kmem_bufctl_tAndrew Morton1-4/+3
From: Manfred Spraul <manfred@colorfullife.com> The slab allocator keeps track of the free objects in a slab with a linked list of integers (typedef'ed to kmem_bufctl_t). Right now unsigned int is used for kmem_bufctl_t, i.e. 4 bytes of per-object overhead. The attached patch implements a per-arch definition for this type: Theoretically, unsigned short is sufficient for kmem_bufctl_t and this would reduce the per-object overhead to 2 bytes. But some archs cannot operate on 16-bit values efficiently, thus it's not possible to switch everyone to ushort. The chosen types are a result of discussions with the various arch maintainers.
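The override itself can be as simple as a guarded typedef; the macro spelling below is an assumption for illustration, not the exact mechanism merged:

    /* generic default */
    #ifndef ARCH_HAS_KMEM_BUFCTL_T
    typedef unsigned int kmem_bufctl_t;
    #endif

    /* an arch that handles 16-bit values efficiently could instead provide: */
    #define ARCH_HAS_KMEM_BUFCTL_T
    typedef unsigned short kmem_bufctl_t;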
2004-05-19[PATCH] Fix arithmetic in shrink_zone()Andrew Morton1-11/+22
From: Nick Piggin <nickpiggin@yahoo.com.au> If the zone has a very small number of inactive pages, local variable `ratio' can be huge and we do way too much scanning. So much so that Ingo hit an NMI watchdog expiry, although that was because the zone would have had a single refcount-zero page in it, and that logic recently got fixed up via get_page_testone(). Nick's patch simply puts a sane-looking upper bound on the number of pages which we'll scan in this round. It fixes another failure case: if the inactive list becomes very small compared to the size of the active list, active list scanning (and therefore inactive list refilling) also becomes small. This patch causes inactive list scanning to be keyed off the size of the active+inactive lists. It has the plus of hiding the active/inactive balancing implementation from the higher-level scanning code. It will slightly change other aspects of scanning behaviour, but probably not significantly.
2004-05-19[PATCH] Fix madvise length checkingAndrew Morton1-2/+8
Fix http://bugme.osdl.org/show_bug.cgi?id=2710. When the user passes madvise() a length of -1 through -4095, madvise blindly rounds this up to 0 and then "succeeds".
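The check needed is essentially a wraparound test after page-rounding the length; a sketch, with the variable names assumed:

    unsigned long end;

    len = (len_in + ~PAGE_MASK) & PAGE_MASK;    /* round up to a page multiple */
    if (len < len_in)                           /* rounding wrapped past zero: bogus length */
            return -EINVAL;

    end = start + len;
    if (end < start)                            /* start + len wrapped: also bogus */
            return -EINVAL;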
2004-05-19[PATCH] speed up readahead for seeky loadsAndrew Morton1-23/+23
From: Ram Pai <linuxram@us.ibm.com> Currently the readahead code tends to read one more page than it should with seeky database-style loads. This was to prevent bogus readahead triggering when we step into the last page of the current window. The patch removes that workaround and fixes up the suboptimal logic instead.

wrt the "rounding errors" mentioned in this patch, Ram provided the following description: Say the I/O size is 20 pages. Our algorithm starts with an initial average I/O size of ra_pages/2, which is usually, say, 16. Now every time we take an average, the average progresses as follows:

    (16+20)/2 = 18
    (18+20)/2 = 19
    (19+20)/2 = 19
    (19+20)/2 = 19 ...

and the rounding error makes it never reach 20.

Benchmarking sitrep: IOZONE run on an NFS-mounted filesystem:

    client machine: 2 processors, 733MHz, 2GB memory
    server machine: 8 processors, 700MHz, 8GB memory

    ./iozone -c -t1 -s 4096m -r 128k
2004-05-14[PATCH] bootmem.c cleanupAndrew Morton1-14/+9
From: Michael Buesch <mbuesch@freenet.de>

- BUG_ON() conversion
- Remove redundant dump_stack() (BUG already does that)
2004-05-14[PATCH] Export `laptop_mode' for XFSAndrew Morton1-0/+2
From: <bart@samwel.tk> XFS needs `laptop_mode'.
2004-05-14[PATCH] swap speedups and fixAndrew Morton1-49/+25
From: Andrea Arcangeli <andrea@suse.de>

I don't think we need install_swap_bdev/remove_swap_bdev anymore; we should use swap_info->bdev, not the swap_bdevs[]. swap_info already has a ->bdev field, and the only point of remove_swap_bdev/install_swap_bdev was to unplug all devices as efficiently as possible; we don't need that anymore with the page parameter. Plus, the semaphore should be a rwsem to allow parallel unplug from multiple pages.

After that I don't need to take the semaphore during swapon anymore: no swapcache with swp_type() pointing to such a bdev will be allowed until swapon is complete (SWP_ACTIVE is set a lot later, after setting p->bdev). In swapoff I only need a dummy serialization with the readers, after try_to_unuse is complete:

    err = try_to_unuse(type);
    current->flags &= ~PF_SWAPOFF;

    /* wait for any unplug function to finish */
    down_write(&swap_unplug_sem);
    up_write(&swap_unplug_sem);

That's all: no other locking and no install_swap_bdev/remove_swap_bdev. (And the swap_bdevs[] compression code was busted.)
2004-05-14[PATCH] blk_run_page(): fixup for swap_unplug_io_fn()Andrew Morton2-2/+2
2004-05-14[PATCH] Add blk_run_page()Andrew Morton2-2/+2
From: Andrea Arcangeli <andrea@suse.de> From: Jens Axboe Add blk_run_page() API. This is so that we can pass the target page all the way down to (for example) the swap unplug function. So swap can work out which blockdevs back this particular page.
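The API being described is essentially a page-aware unplug hook. Its shape might look roughly like the sketch below, under the assumption (suggested by the swap_unplug_io_fn() fixup above) that the backing_dev_info unplug callback now receives the page; this is illustrative, not the exact merged code:

    void blk_run_page(struct page *page)
    {
            struct backing_dev_info *bdi = page->mapping->backing_dev_info;

            if (bdi->unplug_io_fn)
                    bdi->unplug_io_fn(bdi, page);   /* swap can map the page to its bdev here */
    }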
2004-05-14[PATCH] rmap-5-swap_unplug-page-revertAndrew Morton4-18/+16
Revert the pre-2.6.6 per-address-space unplugging changes. This removes a swapper_space exceptionality, syncs things with Andrea and provides for simplification of the swap unplug function.
2004-05-14[PATCH] rename rmap_lock to page_map_lockAndrew Morton2-14/+14
Sync this up with Andrea's patches.
2004-05-14[PATCH] filtered wakeups: apply to pagecache functionsAndrew Morton1-13/+51
From: William Lee Irwin III <wli@holomorphy.com> This patch implements wake-one semantics for page wakeups in a single step. Discrimination between distinct pages is achieved by passing the page to the wakeup function, which compares it to a pointer in its own on-stack structure containing the waitqueue element and the page. Bit discrimination is achieved by storing the bit number in that same structure and testing the bit in the wakeup function. Wake-one semantics are achieved by using WQ_FLAG_EXCLUSIVE in the codepaths waiting to acquire the bit for mutual exclusion.
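A simplified sketch of the mechanism described above; the structure and function names here are illustrative, not the exact ones merged:

    struct page_bit_wait {
            wait_queue_t    wait;
            struct page     *page;
            int             bit_nr;
    };

    static int page_bit_wake(wait_queue_t *wait, unsigned mode, int sync, void *arg)
    {
            struct page_bit_wait *w = container_of(wait, struct page_bit_wait, wait);
            struct page *woken = arg;       /* the waker passes the page in */

            if (woken != w->page)
                    return 0;               /* wakeup is for some other page on this queue */
            if (test_bit(w->bit_nr, &w->page->flags))
                    return 0;               /* bit still set: keep waiting */
            return autoremove_wake_function(wait, mode, sync, arg);
    }

Waiters that want to acquire the bit (rather than merely wait for it to clear) queue themselves with WQ_FLAG_EXCLUSIVE set, which is what gives the wake-one behaviour.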
2004-05-14[PATCH] VM accounting fixAndrew Morton1-2/+1
From: Hugh Dickins <hugh@veritas.com> Stas Sergeev <stsp@aknet.ru> wrote: mprotect() fails to merge VMAs because one VMA can end up with the VM_ACCOUNT flag set, and another without that flag. That makes several apps of mine malfuncate. Great find! Someone has got their test the wrong way round. Since that VM_MAYACCT macro is being used in one place only, and just hiding what it's actually about, fold it into its callsite.
2004-05-14[PATCH] Fix page double-freeing raceAndrew Morton1-7/+14
This has been there for nearly two years. See bugzilla #1403

vmscan.c does, in two places:

    spin_lock(zone->lru_lock)
    page = lru_to_page(&zone->inactive_list);
    if (page_count(page) == 0) {
            /* erk, it's being freed by __page_cache_release() or
             * release_pages()
             */
            put_it_back_on_the_lru();
    } else {

    --> window 1 <--

            page_cache_get(page);
            put_in_on_private_list();
    }
    spin_unlock(zone->lru_lock)

    use_the_private_list();

    page_cache_release(page);

whereas __page_cache_release() and release_pages() do:

    if (put_page_testzero(page)) {

    --> window 2 <--

            spin_lock(lru->lock);
            if (page_count(page) == 0) {
                    remove_it_from_the_lru();
                    really_free_the_page()
            }
            spin_unlock(zone->lru_lock)
    }

The race occurs if the vmscan.c path sees page_count()==1 and then the page_cache_release() path happens in that few-instruction "window 1" before vmscan's page_cache_get().

The page_cache_release() path does put_page_testzero(), which returns true. Then this CPU takes an interrupt...

The vmscan.c path then does page_cache_get(), taking the refcount to one. Then it uses the page and does page_cache_release(), taking the refcount to zero and the page is really freed.

Now, the CPU running page_cache_release() returns from the interrupt, takes the LRU lock, sees the page still has a refcount of zero and frees it again. Boom.

The patch fixes this by closing "window 1". We provide a "get_page_testone()" which grabs a ref on the page and returns true if the refcount was previously zero. If that happens the vmscan.c code simply drops the page's refcount again and leaves the page on the LRU.

All this happens under the zone->lru_lock, which is also taken by __page_cache_release() and release_pages(), so the vmscan code knows that the page has not been returned to the page allocator yet.

In terms of implementation, the page counts are now offset by one: a free page has page->_count of -1. This is so that we can use atomic_add_negative() and atomic_inc_and_test() to provide put_page_testzero() and get_page_testone(). The macros hide all of this so the public interpretation of page_count() and set_page_count() remains unaltered. The compiler can usually constant-fold the offsetting of page->count. This patch increases an x86 SMP kernel's text by 32 bytes.

The patch renames page->count to page->_count to break callers who aren't using the macros.

This patch requires that the architecture implement atomic_add_negative(). It is currently present on

    arm arm26 i386 ia64 mips ppc s390 v850 x86_64

ppc implements this as

    #define atomic_add_negative(a, v)    (atomic_add_return((a), (v)) < 0)

and atomic_add_return() is implemented on

    alpha cris h8300 ia64 m68knommu mips parisc ppc ppc64 s390 sh sparc v850

so we're looking pretty good.
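The offset-by-one counting boils down to a handful of macros along these lines (a sketch; the real patch hides the same arithmetic behind the existing names):

    /* a free page has ->_count == -1, so page_count() == 0 */
    #define page_count(p)           (atomic_read(&(p)->_count) + 1)
    #define set_page_count(p, v)    atomic_set(&(p)->_count, (v) - 1)

    /* drop a ref; true if the page just became free */
    #define put_page_testzero(p)    atomic_add_negative(-1, &(p)->_count)

    /* grab a ref; true if the refcount was previously zero */
    #define get_page_testone(p)     atomic_inc_and_test(&(p)->_count)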
2004-05-10[PATCH] readahead: keep file->f_ra saneAndrew Morton1-3/+6
When two threads are simultaneously pread()ing from the same fd (which is a legitimate thing to do), the readahead code thinks that a huge amount of seeking is happening and shrinks the window, damaging performance a lot. I don't see a sane way to avoid this within the readahead code, so take a private copy of the readahead state and restore it prior to returning from the read.
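The fix amounts to working on a stack copy of the readahead state for the duration of the read; a sketch of the idea, where do_generic_read_with_ra() is a stand-in name for whatever lower-level helper actually consumes the state:

    static ssize_t example_read(struct file *filp, char __user *buf,
                                size_t count, loff_t *ppos)
    {
            struct file_ra_state ra = filp->f_ra;   /* private copy: concurrent readers
                                                     * no longer trample each other */
            ssize_t ret;

            ret = do_generic_read_with_ra(filp, buf, count, ppos, &ra);

            filp->f_ra = ra;        /* publish updated state once, on the way out */
            return ret;
    }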
2004-05-09[PATCH] shrink_slab: improved handling of GFP_NOFS allocationsAndrew Morton1-14/+17
Currently, shrink_slab() will decide that it needs to scan a certain number of dentries, will call shrink_dcache_memory() requesting that this be done, and shrink_dcache_memory() will simply bale out without doing anything because the caller did not have __GFP_FS. This has the potential to disrupt our lovely pagecache-vs-slab balancing act. So change things so that shrinker callouts can return -1, indicating that they baled out. This way, shrink_slab can remember that this slab was owed a certain number of scannings and these will be correctly performed next time a __GFP_FS caller comes by.
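On the callee side the convention described above looks roughly like this (a sketch, not the exact merged code):

    static int shrink_dcache_memory(int nr, unsigned int gfp_mask)
    {
            if (nr) {
                    if (!(gfp_mask & __GFP_FS))
                            return -1;      /* can't prune now; shrink_slab keeps the debt */
                    prune_dcache(nr);
            }
            return dentry_stat.nr_unused;   /* how much is left to scan */
    }

shrink_slab() then treats a -1 return as "nothing was done" and carries the outstanding scan count over to the next invocation instead of forgetting it.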
2004-05-04[PATCH] page_mapping race fixAndrew Morton1-1/+1
From: Hugh Dickins <hugh@veritas.com> Remove this development-only debug code - Hugh thinks that its BUG_ON() can trigger by accident.
2004-05-03[PATCH] add_to_page_cache commentsHugh Dickins1-6/+0
Remove two layers of the fossil record from comments on add_to_page_cache: 2.6.6 moves swapcache handling away, and we long ago stopped masking flags.
2004-05-03[PATCH] mremap pte_unmap NULLHugh Dickins1-1/+2
Old bug no one seems to have hit, but mremap's pte_unmap dst might be NULL: would get preempt count wrong even when not DEBUG_HIGHMEM.
2004-04-30Fix fixed fadvice length handlingLinus Torvalds1-7/+25
- Correctly handle wraparound on offset+len
- Fix FADV_WILLNEED handling of non-page-aligned (offset+len)

Let's hope we don't need to fix the fixed fix.
2004-04-29[PATCH] fadvise length handling fixAndrew Morton1-0/+3
POSIX sez: "If len is zero, all data following offset is specified."
2004-04-29[PATCH] mremap offset typeHugh Dickins1-1/+1
Just found I never changed the type of move_page_tables when I changed it to return offset: enormous mremap moves would fail on 64-bit.
2004-04-28[PATCH] Fix might_sleep in /proc/swaps codeAlexander Viro1-4/+10
This fixes a locking problem noted by Tim Hockin:

  * /proc/swaps uses seq_file code, calling seq_path() with swaplock held
  * seq_path() calls d_path()
  * d_path() calls mntput() which might_sleep()

We add a new semaphore protecting insertions/removals in the set of swap components
  + switch of ->start()/->stop() to the same semaphore [fixes deadlocks]
  + trivial cleanup of ->next()
2004-04-26[PATCH] hugepage fixesAndrew Morton1-9/+1
From: William Lee Irwin III <wli@holomorphy.com> mm/hugetlb.c is putting the destructor in head->lru.prev not head[1].mapping; fix below along with nuking huge_page_release(), which simply duplicates put_page().
2004-04-26[PATCH] simplify put_page()Andrew Morton1-5/+4
By requiring that compound pages implement destructors we can drop some code from put_page().
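With the destructor stored at head[1].mapping (as fixed above), put_page() can treat compound pages uniformly; roughly like the following illustrative sketch:

    void put_page(struct page *page)
    {
            if (unlikely(PageCompound(page))) {
                    page = (struct page *)page->private;    /* ->private points at the head */
                    if (put_page_testzero(page)) {
                            void (*dtor)(struct page *);

                            dtor = (void (*)(struct page *))page[1].mapping;
                            (*dtor)(page);
                    }
                    return;
            }
            if (put_page_testzero(page))
                    __page_cache_release(page);
    }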
2004-04-26[PATCH] slab: use order 0 for vfs cachesAndrew Morton1-30/+44
We have interesting deadlocks when slab decides to use order-1 allocations for ext3_inode_cache. This is because ext3_alloc_inode() needs to perform a GFP_NOFS 1-order allocation. Sometimes the 1-order allocation needs to free a huge number of pages (tens of megabytes) before a 1-order grouping becomes available. But the GFP_NOFS allocator cannot free dcache (and hence icache) due to the deadlock problems identified in shrink_dcache_memory(). So change slab so that it will force 0-order allocations for shrinkable VFS objects. We can handle those OK.
2004-04-26[PATCH] slab alignment fixesAndrew Morton1-13/+12
From: Manfred Spraul <manfred@colorfullife.com>

Below is a patch that redefines the kmem_cache_create `align' argument:

- align not zero: use the specified alignment. I think values smaller than sizeof(void*) will work, even on archs with strict alignment requirements (or at least: slab shouldn't crash. Obviously the user must handle the alignment properly).

- align zero:
  * debug on: align to sizeof(void*)
  * debug off, SLAB_HWCACHE_ALIGN clear: align to sizeof(void*)
  * debug off, SLAB_HWCACHE_ALIGN set: align to the smaller of
    - cache_line_size()
    - the object size, rounded up to the next power of two.

Slab never honored cache align for tiny objects: otherwise the 32-byte kmalloc objects would use 128-byte objects.

There is one additional point: right now slab uses ints for the bufctls. Using short would save two bytes for each object. Initially I had used short, but davem objected. IIRC because some archs do not handle short efficiently. Should I allow arch overrides for the bufctls? On i386, saving two bytes might allow a few additional anon_vma objects in each page.
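The selection logic for the align-zero case can be sketched as follows (illustrative only; debug handling omitted):

    if (align == 0) {
            if (flags & SLAB_HWCACHE_ALIGN) {
                    align = cache_line_size();
                    /* don't waste most of a cacheline on tiny objects */
                    while (size <= align / 2)
                            align /= 2;
            } else {
                    align = sizeof(void *);
            }
    }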
2004-04-22[PATCH] writeback livelock fixAndrew Morton1-0/+12
If a filesystem's ->writepage implementation repeatedly refuses to write the page (it keeps on redirtying it instead) (reiserfs seems to do this) then the writeback logic can get stuck repeatedly trying to write the same page. Fix that up by correctly setting wbc->pages_skipped, to tell the writeback logic that things aren't working out.
2004-04-19[PATCH] fix madvise(MADV_DONTNEED) for nonlinear vmasAndrew Morton2-11/+10
From: Hugh Dickins <hugh@veritas.com> Jamie points out that madvise(MADV_DONTNEED) should unmap pages from a nonlinear area in such a way that the nonlinear offsets are preserved if the pages do turn out to be needed later after all, instead of reverting them to linearity: needs to pass down a zap_details block. (But this still leaves mincore unaware of nonlinear vmas: bigger job.)
2004-04-18[PATCH] From: David Gibson <david@gibson.dropbear.id.au>Andrew Morton1-4/+4
hugepage_vma() is both misleadingly named and unnecessary. On most archs it always returns NULL, and on IA64 the vma it returns is never used. The function's real purpose is to determine whether the address it is passed is a special hugepage address which must be looked up in hugepage pagetables, rather than being looked up in the normal pagetables (which might have specially marked hugepage PMDs or PTEs). This patch kills off hugepage_vma() and folds the logic it really needs into follow_huge_addr(). That now returns a (page *) if called on a special hugepage address, and an error encoded with ERR_PTR otherwise. This also requires tweaking the IA64 code to check that the hugepage PTE is present in follow_huge_addr() - previously this was guaranteed, since it was only called if the address was in an existing hugepage VMA, and hugepages are always prefaulted.
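Callers then distinguish the two cases with IS_ERR(), along the lines of this sketch of the calling convention (not the exact call site):

    struct page *page;

    page = follow_huge_addr(mm, address, write);
    if (!IS_ERR(page))
            return page;            /* it really was a special hugepage address */
    /* otherwise fall back to the normal pagetable walk */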
2004-04-18[PATCH] Rename PF_IOTHREAD to PF_NOFREEZEAndrew Morton2-2/+2
From: Nigel Cunningham <ncunningham@users.sourceforge.net> A few weeks ago, Pavel and I agreed that PF_IOTHREAD should be renamed to PF_NOFREEZE. This reflects the fact that some threads so marked aren't actually used for IO while suspending, but simply shouldn't be frozen. This patch, against 2.6.5 vanilla, applies that change. In the refrigerator calls, the actual value doesn't matter (so long as it's non-zero) and it makes more sense to use PF_FREEZE so I've used that.