aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2006-01-08[PATCH] Make vm86 support optionalMatt Mackall1-0/+2
This adds an option to remove vm86 support under CONFIG_EMBEDDED. Saves about 5k. This version eliminates most of the #ifdefs of the previous version and instead uses function stubs in vm86.h. Also, release_vm86_irqs is moved from asm-i386/irq.h to a more appropriate home in vm86.h so that the stubs can live together. $ size vmlinux-baseline vmlinux-novm86 text data bss dec hex filename 2920821 523232 190652 3634705 377611 vmlinux-baseline 2916268 523100 190492 3629860 376324 vmlinux-novm86 Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] tiny: Make *[ug]id16 support optionalMatt Mackall1-0/+19
Configurable 16-bit UID and friends support This allows turning off the legacy 16 bit UID interfaces on embedded platforms. text data bss dec hex filename 3330172 529036 190556 4049764 3dcb64 vmlinux-baseline 3328268 529040 190556 4047864 3dc3f8 vmlinux From: Adrian Bunk <bunk@stusta.de> UID16 was accidentially disabled for !EMBEDDED. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] simplify k_getrusage()Oleg Nesterov1-18/+11
Factor out common code for different RUSAGE_xxx cases. Don't take ->sighand->siglock in RUSAGE_SELF case, suggested by Ravikiran G Thirumalai <kiran@scalex86.org>. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] fix workqueue oops during cpu offlineNathan Lynch1-6/+10
Use first_cpu(cpu_possible_map) for the single-thread workqueue case. We used to hardcode 0, but that broke on systems where !cpu_possible(0) when workqueue_struct->cpu_workqueue_struct was changed from a static array to alloc_percpu. Commit id bce61dd49d6ba7799be2de17c772e4c701558f14 ("Fix hardcoded cpu=0 in workqueue for per_cpu_ptr() calls") fixed that for Ben's funky sparc64 system, but it regressed my Power5. Offlining cpu 0 oopses upon the next call to queue_work for a single-thread workqueue, because now we try to manipulate per_cpu_ptr(wq->cpu_wq, 1), which is uninitialized. So we need to establish an unchanging "slot" for single-thread workqueues which will have a valid percpu allocation. Since alloc_percpu keys off of cpu_possible_map, which must not change after initialization, make this slot == first_cpu(cpu_possible_map). Signed-off-by: Nathan Lynch <ntl@pobox.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] kernel/module.c: remove redundant spinlock in resolve_symbol()Ashutosh Naik1-2/+0
Remove the redundant spinlock in the function resolve_symbol() as we are not altering the module list, and we already hold the semaphore. Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] modules: mark TAINT_FORCED_RMMOD correctlyAkinobu Mita1-5/+5
Currently TAINT_FORCED_RMMOD is totally unused. Because it is marked as TAINT_FORCED_MODULE instead when user forced a module unload. This patch marks it correctly Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] modules: prevent overriding of symbolsAshutosh Naik1-0/+39
Ensure that an exported symbol does not already exist in the kernel or in some other module's exported symbol table. This is done by checking the symbol tables for the exported symbol at the time of loading the module. Currently this is done after the relocation of the symbol. Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com> Signed-off-by: Anand Krishnan <anandhkrishnan@yahoo.co.in> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] copy_process: error path cleanupOleg Nesterov1-6/+2
This patch moves 'fork_out:' under 'bad_fork_free:', and removes now unneeded 'if (retval)' check. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] setpgid: should not accept ptraced childsOleg Nesterov1-1/+1
sys_setpgid() allows to change ->pgrp of ptraced childs. 'man setpgid' does not tell anything about that, so I consider this behaviour is a bug. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Oren Laadan <orenl@cs.columbia.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] setpgid: should work for sub-threadsOren Laadan2-10/+8
setsid() does not work unless the calling process is a thread_group_leader(). 'man setpgid' does not tell anything about that, so I consider this behaviour is a bug. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] setpgid: should work for sub-threadsOleg Nesterov1-5/+6
setpgid(0, pgid) or setpgid(forked_child_pid, pgid) does not work unless the calling process is a thread_group_leader(). 'man setpgid' does not tell anything about that, so I consider this behaviour is a bug. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Oren Laadan <orenl@cs.columbia.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] fork: fix race in setting child's pgrp and ttyOren Laadan1-6/+3
In fork, child should recopy parent's pgrp/tty after it has tasklist_lock. Otherwise following a setpgid() on the parent, *after* copy_signal(), the child will own a stale pgrp (which may be reused); (eg. if copy_mm() sleeps a long while due to memory pressure). Similar issue for the tty. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Don't attempt to power off if power off is not implementedEric W. Biederman1-0/+6
The problem. It is expected that /sbin/halt -p works exactly like /sbin/halt, when the kernel does not implement power off functionality. The kernel can do a lot of work in the reboot notifiers and in device_shutdown before we even get to machine_power_off. Some of that shutdown is not safe if you are leaving the power on, and it definitely gets in the way of using sysrq or pressing ctrl-alt-del. Since the shutdown happens in generic code there is no way to fix this in architecture specific code :( Some machines are kernel oopsing today because of this. The simple solution is to turn LINUX_REBOOT_CMD_POWER_OFF into LINUX_REBOOT_CMD_HALT if power_off functionality is not implemented. This has the unfortunate side effect of disabling the power off functionality on architectures that leave pm_power_off to null and still implement something in machine_power_off. And it will break the build on some architectures that don't have a pm_power_off variable at all. On both counts I say tough. For architectures like alpha that don't implement the pm_power_off variable pm_power_off is declared in linux/pm.h and it is a generic part of our power management code, and all architectures should implement it. For architectures like parisc that have a default power off method in machine_power_off if pm_power_off is not implemented or fails. It is easy enough to set the pm_power_off variable. And nothing bad happens there, the machines just stop powering off. The current semantics are impossible without a flag at the top level so we can avoid the problem code if a power off is not implemented. pm_power_off is as good a flag as any with the bonus that it works without modification on at least x86, x86_64, powerpc, and ppc today. Andrew can you pick this up and put this in the mm tree. Kernels that don't compile or don't power off seem saner than kernels that oops or panic. Until we get the arch specific patches for the problem architectures this probably isn't smart to push into the stable kernel. Unfortunately I don't have the time at the moment to walk through every architecture and make them work. And even if I did I couldn't test it :( From: Hirokazu Takata <takata@linux-m32r.org> Add pm_power_off() for build fix of arch/m32r/kernel/process.c. From: Miklos Szeredi <miklos@szeredi.hu> UML build fix Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Hayato Fujiwara <fujiwara@linux-m32r.org> Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Extend RCU torture module to test tickless idle CPUSrivatsa Vaddagiri1-5/+91
This patch forces RCU torture threads off various CPUs in the system allowing them to become idle and go tickless. Meant to test support for such tickless idle CPU in RCU. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Add tainting for proprietary helper modulesDave Jones1-0/+5
Kernels that have had Windows drivers loaded into them are undebuggable. I've wasted a number of hours chasing bugs filed in Fedora bugzilla only to find out much later that the user had used such 'helpers', and their problems were unreproducable without them loaded. Acked-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] shrink dentry structEric Dumazet1-2/+2
Some long time ago, dentry struct was carefully tuned so that on 32 bits UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple of memory cache lines. Then RCU was added and dentry struct enlarged by two pointers, with nice results for SMP, but not so good on UP, because breaking the above tuning (128 + 8 = 136 bytes) This patch reverts this unwanted side effect, by using an union (d_u), where d_rcu and d_child are placed so that these two fields can share their memory needs. At the time d_free() is called (and d_rcu is really used), d_child is known to be empty and not touched by the dentry freeing. Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so the previous content of d_child is not needed if said dentry was unhashed but still accessed by a CPU because of RCU constraints) As dentry cache easily contains millions of entries, a size reduction is worth the extra complexity of the ugly C union. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Maneesh Soni <maneesh@in.ibm.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Ian Kent <raven@themaw.net> Cc: Paul Jackson <pj@sgi.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Neil Brown <neilb@cse.unsw.edu.au> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@epoch.ncsc.mil> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] remove unneeded sig->curr_target recalculationOleg Nesterov1-2/+0
This patch removes unneeded sig->curr_target recalculation under 'if (atomic_dec_and_test(&sig->count))' in __exit_signal(). When sig->count == 0 the signal can't be sent to this task and next_thread(tsk) == tsk anyway. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] little do_group_exit() cleanupOleg Nesterov1-1/+0
zap_other_threads() sets SIGNAL_GROUP_EXIT at the very start, do_group_exit() doesn't need to do it. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] kill_proc_info_as_uid: don't use hardcoded constantsOleg Nesterov1-2/+1
Use symbolic names instead of hardcoded constants. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Harald Welte <laforge@gnumonks.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Unchecked alloc_percpu() return in __create_workqueue()Ben Collins1-0/+5
__create_workqueue() not checking return of alloc_percpu() NULL dereference was possible. Signed-off-by: Ben Collins <bcollins@ubuntu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] sigaction should clear all signals on SIG_IGN, not just < 32George Anzinger1-2/+32
While rooting aroung in the signal code trying to understand how to fix the SIG_IGN ploy (set sig handler to SIG_IGN and flood system with high speed repeating timers) I came across what, I think, is a problem in sigaction() in that when processing a SIG_IGN request it flushes signals from 1 to SIGRTMIN and leaves the rest. Attempt to fix this. Signed-off-by: George Anzinger <george@mvista.com> Cc: Roland McGrath <roland@redhat.com> Cc: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] printk return value: fix itGuillaume Chazarain1-3/+3
What's the true meaning of the printk return value? Should it include the priority prefix length of 3? and what about the timing information? In both cases it was broken: strace -e write echo 1 > /dev/kmsg => write(1, "1\n", 2) = 5 strace -e write echo "<1>1" > /dev/kmsg => write(1, "<1>1\n", 5) = 8 The returned length was "length of input string + 3", I made it "length of string output to the log buffer". Note that I couldn't find any printk caller in the kernel interested by its return value besides kmsg_write. Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Acked-By: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] use ptrace_get_task_struct in various placesChristoph Hellwig1-31/+46
The ptrace_get_task_struct() helper that I added as part of the ptrace consolidation is useful in variety of places that currently opencode it. Switch them to the common helpers. Add a ptrace_traceme() helper that needs to be explicitly called, and simplify the ptrace_get_task_struct() interface. We don't need the request argument now, and we return the task_struct directly, using ERR_PTR() for error returns. It's a bit more code in the callers, but we have two sane routines that do one thing well now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] rcu file: use atomic primitivesNick Piggin2-15/+0
Use atomic_inc_not_zero for rcu files instead of special case rcuref. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] kernel/: small cleanupsAdrian Bunk4-2/+5
This patch contains the following cleanups: - make needlessly global functions static - every file should include the headers containing the prototypes for it's global functions Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: skip rcu check if task is in root cpusetPaul Jackson1-4/+9
For systems that aren't using cpusets, but have them CONFIG_CPUSET enabled in their kernel (eventually this may be most distribution kernels), this patch removes even the minimal rcu_read_lock() from the memory page allocation path. Actually, it removes that rcu call for any task that is in the root cpuset (top_cpuset), which on systems not actively using cpusets, is all tasks. We don't need the rcu check for tasks in the top_cpuset, because the top_cpuset is statically allocated, so at no risk of being freed out from underneath us. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: mark number_of_cpusets read_mostlyPaul Jackson1-1/+1
Mark cpuset global 'number_of_cpusets' as __read_mostly. This global is accessed everytime a zone is considered in the zonelist loops beneath __alloc_pages, looking for a free memory page. If number_of_cpusets is just one, then we can short circuit the mems_allowed check. Since this global is read alot on a hot path, and written rarely, it is an excellent candidate for __read_mostly. Thanks to Christoph Lameter for the suggestion. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: use rcu directly optimizationPaul Jackson1-10/+30
Optimize the cpuset impact on page allocation, the most performance critical cpuset hook in the kernel. On each page allocation, the cpuset hook needs to check for a possible change in the current tasks cpuset. It can now handle the common case, of no change, without taking any spinlock or semaphore, thanks to RCU. Convert a spinlock on the current task to an rcu_read_lock(), saving approximately a memory barrier and an atomic op, depending on architecture. This is done by adding rcu_assign_pointer() and synchronize_rcu() calls to the write side of the task->cpuset pointer, in cpuset.c:attach_task(), to delay freeing up a detached cpuset until after any critical sections referencing that pointer. Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: remove test for null cpuset from alloc code pathPaul Jackson1-6/+16
Remove a couple of more lines of code from the cpuset hooks in the page allocation code path. There was a check for a NULL cpuset pointer in the routine cpuset_update_task_memory_state() that was only needed during system boot, after the memory subsystem was initialized, before the cpuset subsystem was initialized, to catch a NULL task->cpuset pointer. Add a cpuset_init_early() routine, just before the mem_init() call in init/main.c, that sets up just enough of the init tasks cpuset structure to render cpuset_update_task_memory_state() calls harmless. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: migrate all tasks in cpuset at oncePaul Jackson1-13/+16
Given the mechanism in the previous patch to handle rebinding the per-vma mempolicies of all tasks in a cpuset that changes its memory placement, it is now easier to handle the page migration requirements of such tasks at the same time. The previous code didn't actually attempt to migrate the pages of the tasks in a cpuset whose memory placement changed until the next time each such task tried to allocate memory. This was undesirable, as users invoking memory page migration exected to happen when the placement changed, not some unspecified time later when the task needed more memory. It is now trivial to handle the page migration at the same time as the per-vma rebinding is done. The routine cpuset.c:update_nodemask(), which handles changing a cpusets memory placement ('mems') now checks for the special case of being asked to write a placement that is the same as before. It was harmless enough before to just recompute everything again, even though nothing had changed. But page migration is a heavy weight operation - moving pages about. So now it is worth avoiding that if asked to move a cpuset to its current location. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: rebind vma mempolicies fixPaul Jackson1-0/+90
Fix more of longstanding bug in cpuset/mempolicy interaction. NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset to just the Memory Nodes allowed by that cpuset. The kernel maintains internal state for each mempolicy, tracking what nodes are used for the MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies. When a tasks cpuset memory placement changes, whether because the cpuset changed, or because the task was attached to a different cpuset, then the tasks mempolicies have to be rebound to the new cpuset placement, so as to preserve the cpuset-relative numbering of the nodes in that policy. An earlier fix handled such mempolicy rebinding for mempolicies attached to a task. This fix rebinds mempolicies attached to vma's (address ranges in a tasks address space.) Due to the need to hold the task->mm->mmap_sem semaphore while updating vma's, the rebinding of vma mempolicies has to be done when the cpuset memory placement is changed, at which time mmap_sem can be safely acquired. The tasks mempolicy is rebound later, when the task next attempts to allocate memory and notices that its task->cpuset_mems_generation is out-of-date with its cpusets mems_generation. Because walking the tasklist to find all tasks attached to a changing cpuset requires holding tasklist_lock, a spinlock, one cannot update the vma's of the affected tasks while doing the tasklist scan. In general, one cannot acquire a semaphore (which can sleep) while already holding a spinlock (such as tasklist_lock). So a list of mm references has to be built up during the tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem acquired, and the vma's in that mm rebound. Once the tasklist lock is dropped, affected tasks may fork new tasks, before their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to point to the cpuset being rebound (there can only be one; cpuset modifications are done under a global 'manage_sem' semaphore), and the mpol_copy code that is used to copy a tasks mempolicies during fork catches such forking tasks, and ensures their children are also rebound. When a task is moved to a different cpuset, it is easier, as there is only one task involved. It's mm->vma's are scanned, using the same mpol_rebind_policy() as used above. It may happen that both the mpol_copy hook and the update done via the tasklist scan update the same mm twice. This is ok, as the mempolicies of each vma in an mm keep track of what mems_allowed they are relative to, and safely no-op a second request to rebind to the same nodes. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: number_of_cpusets optimizationPaul Jackson1-1/+11
Easy little optimization hack to avoid actually having to call cpuset_zone_allowed() and check mems_allowed, in the main page allocation routine, __alloc_pages(). This saves several CPU cycles per page allocation on systems not using cpusets. A counter is updated each time a cpuset is created or removed, and whenever there is only one cpuset in the system, it must be the root cpuset, which contains all CPUs and all Memory Nodes. In that case, when the counter is one, all allocations are allowed. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: numa_policy_rebind cleanupPaul Jackson1-1/+1
Cleanup, reorganize and make more robust the mempolicy.c code to rebind mempolicies relative to the containing cpuset after a tasks memory placement changes. The real motivator for this cleanup patch is to lay more groundwork for the upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's after the containing cpuset memory placement changes. NUMA mempolicies are constrained by the cpuset their task is a member of. When either (1) a task is moved to a different cpuset, or (2) the 'mems' mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to be recalculated, relative to their new cpuset placement. The old code used an unreliable method of determining what was the old mems_allowed constraining the mempolicy. It just looked at the tasks mems_allowed value. This sort of worked with the present code, that just rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken, referring to the old nodes. But in an upcoming patch, the vma mempolicies will be rebound as well. Then the order in which the various task and vma mempolicies are updated will no longer be deterministic, and one can no longer count on the task->mems_allowed holding the old value for as long as needed. It's not even clear if the current code was guaranteed to work reliably for task mempolicies. So I added a mems_allowed field to each mempolicy, stating exactly what mems_allowed the policy is relative to, and updated synchronously and reliably anytime that the mempolicy is rebound. Also removed a useless wrapper routine, numa_policy_rebind(), and had its caller, cpuset_update_task_memory_state(), call directly to the rewritten policy_rebind() routine, and made that rebind routine extern instead of static, and added a "mpol_" prefix to its name, making it mpol_rebind_policy(). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: implement cpuset_mems_allowedPaul Jackson1-3/+26
Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code needed, to obtain the mems_allowed vector of a cpuset, and replaced the workaround in sys_migrate_pages() to call this new method. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: combine refresh_mems and update_memsPaul Jackson1-54/+41
The important code paths through alloc_pages_current() and alloc_page_vma(), by which most kernel page allocations go, both called cpuset_update_current_mems_allowed(), which in turn called refresh_mems(). -Both- of these latter two routines did a tasklock, got the tasks cpuset pointer, and checked for out of date cpuset->mems_generation. That was a silly duplication of code and waste of CPU cycles on an important code path. Consolidated those two routines into a single routine, called cpuset_update_task_memory_state(), since it updates more than just mems_allowed. Changed all callers of either routine to call the new consolidated routine. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: fork hook fixPaul Jackson2-5/+5
Fix obscure, never seen in real life, cpuset fork race. The cpuset_fork() call in fork.c was setting up the correct task->cpuset pointer after the tasklist_lock was dropped, which briefly exposed the newly forked process with an unsafe (copied from parent without locks or usage counter increment) cpuset pointer. In theory, that exposed cpuset pointer could have been pointing at a cpuset that was already freed and removed, and in theory another task that had been sitting on the tasklist_lock waiting to scan the task list could have raced down the entire tasklist, found our new child at the far end, and dereferenced that bogus cpuset pointer. To fix, setup up the correct cpuset pointer in the new child by calling cpuset_fork() before the new task is linked into the tasklist, and with that, add a fork failure case, to dereference that cpuset, if the fork fails along the way, after cpuset_fork() was called. Had to remove a BUG_ON() from cpuset_exit(), because it was no longer valid - the call to cpuset_exit() from a failed fork would not have PF_EXITING set. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: update_nodemask code reformatPaul Jackson1-10/+15
Restructure code layout of the kernel/cpuset.c update_nodemask() routine, removing embedded returns and nested if's in favor of goto completion labels. This is being done in anticipation of adding more logic to this routine, which will favor the goto style structure. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: minor spacing initializer fixesPaul Jackson1-6/+3
Four trivial cpuset fixes: remove extra spaces, remove useless initializers, mark one __read_mostly. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: memory pressure meterPaul Jackson1-2/+191
Provide a simple per-cpuset metric of memory pressure, tracking the -rate- that the tasks in a cpuset call try_to_free_pages(), the synchronous (direct) memory reclaim code. This enables batch managers monitoring jobs running in dedicated cpusets to efficiently detect what level of memory pressure that job is causing. This is useful both on tightly managed systems running a wide mix of submitted jobs, which may choose to terminate or reprioritize jobs that are trying to use more memory than allowed on the nodes assigned them, and with tightly coupled, long running, massively parallel scientific computing jobs that will dramatically fail to meet required performance goals if they start to use more memory than allowed to them. This patch just provides a very economical way for the batch manager to monitor a cpuset for signs of memory pressure. It's up to the batch manager or other user code to decide what to do about it and take action. ==> Unless this feature is enabled by writing "1" to the special file /dev/cpuset/memory_pressure_enabled, the hook in the rebalance code of __alloc_pages() for this metric reduces to simply noticing that the cpuset_memory_pressure_enabled flag is zero. So only systems that enable this feature will compute the metric. Why a per-cpuset, running average: Because this meter is per-cpuset, rather than per-task or mm, the system load imposed by a batch scheduler monitoring this metric is sharply reduced on large systems, because a scan of the tasklist can be avoided on each set of queries. Because this meter is a running average, instead of an accumulating counter, a batch scheduler can detect memory pressure with a single read, instead of having to read and accumulate results for a period of time. Because this meter is per-cpuset rather than per-task or mm, the batch scheduler can obtain the key information, memory pressure in a cpuset, with a single read, rather than having to query and accumulate results over all the (dynamically changing) set of tasks in the cpuset. A per-cpuset simple digital filter (requires a spinlock and 3 words of data per-cpuset) is kept, and updated by any task attached to that cpuset, if it enters the synchronous (direct) page reclaim code. A per-cpuset file provides an integer number representing the recent (half-life of 10 seconds) rate of direct page reclaims caused by the tasks in the cpuset, in units of reclaims attempted per second, times 1000. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: mempolicy one more nodemask conversionPaul Jackson1-10/+0
Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous conversion had left one routine using bitmaps, since it involved a corresponding change to kernel/cpuset.c Fix that interface by replacing with a simple macro that calls nodes_subset(), or if !CONFIG_CPUSET, returns (1). Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <christoph@lameter.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Simpler signal-exit concurrency handlingPaul E. McKenney1-5/+6
Some simplification in checking signal delivery against concurrent exit. Instead of using get_task_struct_rcu(), which increments the task_struct reference count, check the reference count after acquiring sighand lock. Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] RCU signal handlingIngo Molnar6-27/+111
RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked instead of tasklist_lock read-locked. This is a scalability improvement on SMP and a preemption-latency improvement under PREEMPT_RCU. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: William Irwin <wli@holomorphy.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp ↵Ravikiran G Thirumalai1-2/+2
macros ____cacheline_maxaligned_in_smp is currently used to align critical structures and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find L1_CACHE_SHIFT_MAX useless. However, we have been using ____cacheline_maxaligned_in_smp to align structures on the internode cacheline size. As per Andi's suggestion, following patch kills ____cacheline_maxaligned_in_smp and introduces INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches. Arches needing L3/Internode cacheline alignment can define INTERNODE_CACHE_SHIFT in the arch asm/cache.h. Patch replaces ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp With this patch, L1_CACHE_SHIFT_MAX can be killed Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpusets: swap migration interfacePaul Jackson1-2/+36
Add a boolean "memory_migrate" to each cpuset, represented by a file containing "0" or "1" in each directory below /dev/cpuset. It defaults to false (file contains "0"). It can be set true by writing "1" to the file. If true, then anytime that a task is attached to the cpuset so marked, the pages of that task will be moved to that cpuset, preserving, to the extent practical, the cpuset-relative placement of the pages. Also anytime that a cpuset so marked has its memory placement changed (by writing to its "mems" file), the tasks in that cpuset will have their pages moved to the cpusets new nodes, preserving, to the extent practical, the cpuset-relative placement of the moved pages. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Swap Migration V5: sys_migrate_pages interfaceChristoph Lameter1-0/+1
sys_migrate_pages implementation using swap based page migration This is the original API proposed by Ray Bryant in his posts during the first half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org. The intent of sys_migrate is to migrate memory of a process. A process may have migrated to another node. Memory was allocated optimally for the prior context. sys_migrate_pages allows to shift the memory to the new node. sys_migrate_pages is also useful if the processes available memory nodes have changed through cpuset operations to manually move the processes memory. Paul Jackson is working on an automated mechanism that will allow an automatic migration if the cpuset of a process is changed. However, a user may decide to manually control the migration. This implementation is put into the policy layer since it uses concepts and functions that are also needed for mbind and friends. The patch also provides a do_migrate_pages function that may be useful for cpusets to automatically move memory. sys_migrate_pages does not modify policies in contrast to Ray's implementation. The current code here is based on the swap based page migration capability and thus is not able to preserve the physical layout relative to it containing nodeset (which may be a cpuset). When direct page migration becomes available then the implementation needs to be changed to do a isomorphic move of pages between different nodesets. The current implementation simply evicts all pages in source nodeset that are not in the target nodeset. Patch supports ia64, i386 and x86_64. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] add schedule_on_each_cpu()Christoph Lameter1-0/+19
swap migration's isolate_lru_page() currently uses an IPI to notify other processors that the lru caches need to be drained if the page cannot be found on the LRU. The IPI interrupt may interrupt a processor that is just processing lru requests and cause a race condition. This patch introduces a new function run_on_each_cpu() that uses the keventd() to run the LRU draining on each processor. Processors disable preemption when dealing the LRU caches (these are per processor) and thus executing LRU draining from another process is safe. Thanks to Lee Schermerhorn <lee.schermerhorn@hp.com> for finding this race condition. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Make high and batch sizes of per_cpu_pagelists configurableRohit Seth1-0/+12
As recently there has been lot of traffic on the right values for batch and high water marks for per_cpu_pagelists. This patch makes these two variables configurable through /proc interface. A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry controls the fraction of pages at most in each zone that are allocated for each per cpu page list. The min value for this is 8. It means that we don't allow more than 1/8th of pages in each zone to be allocated in any single per_cpu_pagelist. The batch value of each per cpu pagelist is also updated as a result. It is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) Signed-off-by: Rohit Seth <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] drop-pagecacheAndrew Morton1-0/+10
Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to discard as much pagecache and/or reclaimable slab objects as it can. THis operation requires root permissions. It won't drop dirty data, so the user should run `sync' first. Caveats: a) Holds inode_lock for exorbitant amounts of time. b) Needs to be taught about NUMA nodes: propagate these all the way through so the discarding can be controlled on a per-node basis. This is a debugging feature: useful for getting consistent results between filesystem benchmarks. We could possibly put it under a config option, but it's less than 300 bytes. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Fix posix-cpu-timers sched_time accumulationDavid S. Miller1-12/+1
I've spent the past 3 days digging into a glibc testsuite failure in current CVS, specifically libc/rt/tst-cputimer1.c The thr1 and thr2 timers fire too early in the second pass of this test. The second pass is noteworthy because it makes use of intervals, whereas the first pass does not. All throughout the posix-cpu-timers.c code, the calculation of the process sched_time sum is implemented roughly as: unsigned long long sum; sum = tsk->signal->sched_time; t = tsk; do { sum += t->sched_time; t = next_thread(t); } while (t != tsk); In fact this is the exact scheme used by check_process_timers(). In the case of check_process_timers(), current->sched_time has just been updated (via scheduler_tick(), which is invoked by update_process_times(), which subsequently invokes run_posix_cpu_timers()) So there is no special processing necessary wrt. that. In other contexts, we have to allot for the fact that tsk->sched_time might be a bit out of date if we are current. And the posix-cpu-timers.c code uses current_sched_time() to deal with that. Unfortunately it does so in an erroneous and inconsistent manner in one spot which is what results in the early timer firing. In cpu_clock_sample_group_locked(), it does this: cpu->sched = p->signal->sched_time; /* Add in each other live thread. */ while ((t = next_thread(t)) != p) { cpu->sched += t->sched_time; } if (p->tgid == current->tgid) { /* * We're sampling ourselves, so include the * cycles not yet banked. We still omit * other threads running on other CPUs, * so the total can always be behind as * much as max(nthreads-1,ncpus) * (NSEC_PER_SEC/HZ). */ cpu->sched += current_sched_time(current); } else { cpu->sched += p->sched_time; } The problem is the "p->tgid == current->tgid" test. If "p" is not current, and the tgids are the same, we will add the process t->sched_time twice into cpu->sched and omit "p"'s sched_time which is very very very wrong. posix-cpu-timers.c has a helper function, sched_ns(p) which takes care of this, so my fix is to use that here instead of this special tgid test. The fact that current can be one of the sub-threads of "p" points out that we could make things a little bit more accurate, perhaps by using sched_ns() on every thread we process in these loops. It also points out that we don't use the most accurate value for threads in the group actively running other cpus (and this is mentioned in the comment). But that is a future enhancement, and this fix here definitely makes sense. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] kernel/module.c: removed dead codeJayachandran C1-2/+1
This patch fixes an issue reported by Coverity in kernel/module.c Error reported: Cannot reach this line of code "else return ptr;" Patch description: This is the error path, so 'err' will be negative, the else case is not required, this patch removes it. Signed-off-by: Jayachandran C. <c.jayachandran@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] s390: cleanup KconfigMartin Schwidefsky2-5/+5
Sanitize some s390 Kconfig options. We have ARCH_S390, ARCH_S390X, ARCH_S390_31, 64BIT, S390_SUPPORT and COMPAT. Replace these 6 options by S390, 64BIT and COMPAT. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] s390: cputime_t fixesMartin Schwidefsky1-7/+9
There are some more places where the use of cputime_t instead of an integer type and the associated macros is necessary for the virtual cputime accounting on s390. Affected are the s390 specific appldata code and BSD process accounting. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] swsusp: save image header firstRafael J. Wysocki2-126/+65
This makes the swsusp_info structure become the header of the image in the literal sense (ie. it is saved to the swap and read before any other image data with the help of the swsusp's swap map structure, so generally it is treated in the same way as the rest of the image). The main thing it does is to make swsusp_header contain the offset of the swap map used to track the image data pages rather than the offset of swsusp_info.  Simultaneously, swsusp_info becomes the first image page written to the swap. The other changes are generally consequences of the above with a few exceptions (there's some consolidation in the image reading part as a few functions turn into trivial wrappers around something else). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] swsusp: improve handling of swap partitionsRafael J. Wysocki1-92/+36
This changes the handling of swap partitions by swsusp to avoid locking of the swap devices that are not used for suspend and, consequently, simplifies the code. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] swsusp: make image size limit tunableRafael J. Wysocki3-6/+31
Make the suspend image size limit tunable via /sys/power/image_size. It is necessary for systems on which there is a limited amount of swap available for suspend. It can also be useful for optimizing performance of swsusp on systems with 1 GB of RAM or more. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] swsusp: limit image sizeRafael J. Wysocki2-14/+11
Limit the size of the suspend image to approx. 500 MB, which should improve the overall performance of swsusp on systems with more than 1 GB of RAM. It introduces the constant IMAGE_SIZE that can be set to the preferred size of the image (in MB) and modifies the memory-shrinking part of swsusp to take this constant into account (500 is the default value of IMAGE_SIZE). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] swsusp: Drop duplicate prototypesPavel Machek1-3/+0
These two prototypes are already present in sched.h, remove duplicate version. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] swsusp: fix enough_free_memRafael J. Wysocki1-2/+8
This patch fixes a problem with the function enough_free_mem() used by swsusp to verify if there is a sufficient number of memory pages available to it to create and save the suspend image. Namely, enough_free_mem() uses nr_free_pages() to obtain the number of free memory pages, which is incorrect, because this function returns the total number of free pages, including free highmem pages, and the highmem pages cannot be used by swsusp for storing the image data. The patch makes enough_free_mem() avoid counting the free highmem pages as available to swsusp. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] swsusp: improve freeing of memoryRafael J. Wysocki4-36/+125
This patch makes swsusp free only as much memory as needed to complete the suspend and not as much as possible.  In the most of cases this should speed up the suspend and make the system much more responsive after resume, especially if a GUI (eg. X Windows) is used. If needed, the old behavior (ie to free as much memory as possible during suspend) can be restored by unsetting FAST_FREE in power.h Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] swsusp: introduce the swap map structureRafael J. Wysocki4-176/+417
This patch introduces the swap map structure that can be used by swsusp for keeping tracks of data pages written to the swap.  The structure itself is described in a comment within the patch. The overall idea is to reduce the amount of metadata written to the swap and to write and read the image pages sequentially, in a file-alike way. This makes the swap-handling part of swsusp fairly independent of its snapshot-handling part and will hopefully allow us to completely separate these two parts in the future. This patch is needed to remove the suspend image size limit imposed by the limited size of the swsusp_info structure, which is essential for x86-64 systems with more than 512 MB of RAM. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] swsusp: remove encryptionRafael J. Wysocki1-159/+4
This patch removes the image encryption that is only used by swsusp instead of zeroing the image after resume in order to prevent someone from reading some confidential data from it in the future and it does not protect the image from being read by an unauthorized person before resume. The functionality it provides should really belong to the user space and will possibly be reimplemented after the swap-handling functionality of swsusp is moved to the user space. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] Alpha: convert to generic irq framework (generic part)Ivan Kokshaysky2-1/+5
Thanks to Christoph for doing most of the work. This allows automatic SMP IRQ affinity assignment other than default "all interrupts on all CPUs" which is rather expensive. This might be useful if the hardware can be programmed to distribute interrupts among different CPUs, like Alpha does. Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Christoph Hellwig <hch@lst.de> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] FRV: Make futex code compilable on nommu [try #2]David Howells1-0/+7
Make the futex code compilable and usable on NOMMU by making the attempt to handle page faults conditional on CONFIG_MMU. If this is not enabled, then we can assume that EFAULT returned from futex_atomic_op_inuser() is not recoverable, and that the address lies outside of valid memory. handle_mm_fault() is made to BUG if called on NOMMU without attempting to invoke the actual handler (__handle_mm_fault). Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] swsusp: resume_store() retval fixAndrew Morton1-18/+16
- This function returns -EINVAL all the time. Fix. - Decruftify it a bit too. - Writing to it doesn't seem to do what it's suppoed to do. Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6Linus Torvalds2-9/+29
Trivial manual merge fixup for usb_find_interface clashes.
2006-01-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuildLinus Torvalds1-0/+5
2006-01-04[PATCH] kobject_uevent CONFIG_NET=n fixakpm@osdl.org1-0/+3
lib/lib.a(kobject_uevent.o)(.text+0x25f): In function `kobject_uevent': : undefined reference to `__alloc_skb' lib/lib.a(kobject_uevent.o)(.text+0x2a1): In function `kobject_uevent': : undefined reference to `skb_over_panic' lib/lib.a(kobject_uevent.o)(.text+0x31d): In function `kobject_uevent': : undefined reference to `skb_over_panic' lib/lib.a(kobject_uevent.o)(.text+0x356): In function `kobject_uevent': : undefined reference to `netlink_broadcast' lib/lib.a(kobject_uevent.o)(.init.text+0x9): In function `kobject_uevent_init': : undefined reference to `netlink_kernel_create' make: *** [.tmp_vmlinux1] Error 1 Netlink is unconditionally enabled if CONFIG_NET, so that's OK. kobject_uevent.o is compiled even if !CONFIG_HOTPLUG, which is lazy. Let's compound the sin. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04[PATCH] driver core: replace "hotplug" by "uevent"Kay Sievers2-9/+9
Leave the overloaded "hotplug" word to susbsystems which are handling real devices. The driver core does not "plug" anything, it just exports the state to userspace and generates events. Signed-off-by: Kay Sievers <kay.sievers@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04[PATCH] add uevent_helper control in /sys/kernel/Kay Sievers1-3/+22
This deprecates the /proc/sys/kernel/hotplug file, as all this stuff should be in /sys some day, right? :) In /sys/kernel/ we have now uevent_seqnum and uevent_helper. The seqnum is no longer used by udev, as the version for this kernel depends on netlink which events will never get out-of-order. Recent udev versions disable the /sbin/hotplug helper with an init script, cause it leads to OOM on big boxes by running hundreds of shells in parallel. It should be done now by: echo "" > /sys/kernel/uevent_helper (Note that "-n" does not work, cause neighter proc nor sysfs support truncate().) Signed-off-by: Kay Sievers <kay.sievers@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04[PATCH] remove CONFIG_KOBJECT_UEVENT optionKay Sievers1-3/+1
It makes zero sense to have hotplug, but not the netlink events enabled today. Remove this option and merge the kobject_uevent.h header into the kobject.h header file. Signed-off-by: Kay Sievers <kay.sievers@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-03update the email address of Randy DunlapAdrian Bunk1-1/+1
This patch removes all references to the bouncing address rddunlap@osdl.org and one dead web page from the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Randy Dunlap <rdunlap@xenotime.net>
2006-01-03gitignore: ignore more generated files1-0/+5
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2005-12-31sysctl: make sure to terminate strings with a NULLinus Torvalds1-10/+15
This is a slightly more complete fix for the previous minimal sysctl string fix. It always terminates the returned string with a NUL, even if the full result wouldn't fit in the user-supplied buffer. The returned length is the full untruncated length, so that you can tell when truncation has occurred. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-30[PATCH] Fix false old value return of sysctlYi Yang1-1/+1
For the sysctl syscall, if the user wants to get the old value of a sysctl entry and set a new value for it in the same syscall, the old value is always overwritten by the new value if the sysctl entry is of string type and if the user sets its strategy to sysctl_string. This issue lies in the strategy being run twice if the strategy is set to sysctl_string, the general strategy sysctl_string always returns 0 if success. Such strategy routines as sysctl_jiffies and sysctl_jiffies_ms return 1 because they do read and write for the sysctl entry. The strategy routine sysctl_string return 0 although it actually read and write the sysctl entry. According to my analysis, if a strategy routine do read and write, it should return 1, if it just does some necessary check but not read and write, it should return 0, for example sysctl_intvec. Signed-off-by: Yi Yang <yang.y.yi@gmail.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-30sysctl: don't overflow the user-supplied buffer with '\0'Linus Torvalds1-3/+1
If the string was too long to fit in the user-supplied buffer, the sysctl layer would zero-terminate it by writing past the end of the buffer. Don't do that. Noticed by Yi Yang <yang.y.yi@gmail.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-24[PATCH] Fix memory ordering problem in wake_futex()Andrew Morton1-0/+6
Fix a memory ordering problem that occurs on IA64. The "store" to q->lock_ptr in wake_futex() can become visible before wake_up_all() clears the lock in the futex_q. Signed-off-by: Jack Steiner <steiner@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-20[PATCH] kernel/params.c: fix sysfs access with CONFIG_MODULES=nJason Wessel1-1/+1
All the work was done to setup the file and maintain the file handles but the access functions were zeroed out due to the #ifdef. Removing the #ifdef allows full access to all the parameters when CONFIG_MODULES=n. akpm: put it back again, but use CONFIG_SYSFS instead. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12[PATCH] kprobes: increment kprobe missed count for multiprobesKeshavamurthy Anil S1-0/+13
When multiple probes are registered at the same address and if due to some recursion (probe getting triggered within a probe handler), we skip calling pre_handlers and just increment nmissed field. The below patch make sure it walks the list for multiple probes case. Without the below patch we get incorrect results of nmissed count for multiple probe case. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12[PATCH] kprobes: no probes on critical pathKeshavamurthy Anil S1-1/+2
For Kprobes critical path is the path from debug break exception handler till the control reaches kprobes exception code. No probes can be supported in this path as we will end up in recursion. This patch prevents this by moving the below function to safe __kprobes section onto which no probes can be inserted. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12[PATCH] Add try_to_freeze to kauditdPierre Ossman1-1/+3
kauditd was causing suspends to fail because it refused to freeze. Adding a try_to_freeze() to its sleep loop solves the issue. Signed-off-by: Pierre Ossman <drzeus@drzeus.cx> Acked-by: Pavel Machek <pavel@suse.cz> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12[PATCH] kprobes: fix race in aggregate kprobe registrationKeshavamurthy Anil S1-4/+1
When registering multiple kprobes at the same address, we leave a small window where the kprobe hlist will not contain a reference to the registered kprobe, leading to potentially, a system crash if the breakpoint is hit on another processor. Patch below now automically relpace the old kprobe with the new kprobe from the hash list. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12[PATCH] Add getnstimestamp functionMatt Helsley1-0/+22
There are several functions that might seem appropriate for a timestamp: get_cycles() current_kernel_time() do_gettimeofday() <read jiffies/jiffies_64> Each has problems with combinations of SMP-safety, low resolution, and monotonicity. This patch adds a new function that returns a monotonic SMP-safe timestamp with nanosecond resolution where available. Changes: Split timestamp into separate patch Moved to kernel/time.c Renamed to getnstimestamp Fixed unintended-pointer-arithmetic bug Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12[PATCH] Fix RCU race in access of nohz_cpu_maskSrivatsa Vaddagiri1-5/+13
Accessing nohz_cpu_mask before incrementing rcp->cur is racy. It can cause tickless idle CPUs to be included in rsp->cpumask, which will extend graceperiods unnecessarily. Fix this race. It has been tested using extensions to RCU torture module that forces various CPUs to become idle. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12[PATCH] Fix bug in RCU torture testSrivatsa Vaddagiri1-2/+1
While doing some test of RCU torture module, I hit a OOPS in rcu_do_batch, which was trying to processes callback of a module that was just removed. This is because we weren't waiting long enough for all callbacks to fire. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12[PATCH] add rcu_barrier() synchronization pointDipankar Sarma1-0/+41
This introduces a new interface - rcu_barrier() which waits until all the RCUs queued until this call have been completed. Reiser4 needs this, because we do more than just freeing memory object in our RCU callback: we also remove it from the list hanging off super-block. This means, that before freeing reiser4-specific portion of super-block (during umount) we have to wait until all pending RCU callbacks are executed. The only change of reiser4 made to the original patch, is exporting of rcu_barrier(). Cc: Hans Reiser <reiser@namesys.com> Cc: Vladimir V. Saveliev <vs@namesys.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12[PATCH] Kprobes: Reference count the modules when probed on itMao, Bibo1-2/+16
When a Kprobes are inserted/removed on a modules, the modules must be ref counted so as not to allow to unload while probes are registered on that module. Without this patch, the probed module is free to unload, and when the probing module unregister the probe, the kpobes code while trying to replace the original instruction might crash. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Mao Bibo <bibo.mao@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29[PATCH] fix swsusp on machines not supporting S4Pavel Machek1-5/+16
Fix swsusp on machines not supporting S4. With recent changes, it is not possible to trigger it using /sys filesystem. Swsusp does not really need any support from low-level code, it is possible to reboot or halt at the end of suspend. Signed-off-by: Pavel Machek <pavel@suse.cz> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29[PATCH] Fix crash when ptrace poking hugepage areasDavid Gibson1-1/+2
set_page_dirty() will not cope with being handed a page * which is part of a compound page, but not the master page in that compound page. This case can occur via access_process_vm() if you attemp to write to another process's hugepage memory area using ptrace() (causing an oops or hang). This patch fixes the bug by only calling set_page_dirty() from access_process_vm() if the page is not a compound page. We already use a similar fix in bio_set_pages_dirty() for the case of direct io to hugepages. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] cpuset fork locking fixPaul Jackson1-2/+1
Move the cpuset_fork() call below the write_unlock_irq call in kernel/fork.c copy_process(). Since the cpuset-dual-semaphore-locking-overhaul.patch, the cpuset_fork() routine acquires task_lock(), so cannot be called while holding the tasklist_lock for write. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] Fix hardcoded cpu=0 in workqueue for per_cpu_ptr() callsBen Collins1-6/+6
Tracked this down on an Ultra Enterprise 3000. It's a 6-way machine. Odd thing about this machine (and it's good for finding bugs like this) is that the CPU id's are not 0 based. For instance, on my machine the CPU's are 6/7/10/11/14/15. This caused some NULL pointer dereference in kernel/workqueue.c because for single_threaded workqueue's, it hardcoded the cpu to 0. I changed the 0's to any_online_cpu(cpu_online_mask), which cpumask.h claims is "First cpu in mask". So this fits the same usage. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] fix 32bit overflow in timespec_to_sample()Oleg Nesterov1-1/+1
fix 32bit overflow in timespec_to_sample() Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] fork.c: proc_fork_connector() called under write_lock()Andrew Morton1-1/+1
Don't do that - it does GFP_KERNEL allocations, for a start. (Reported by Guillaume Thouvenin <guillaume.thouvenin@bull.net>) Acked-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28[PATCH] clean up lock_cpu_hotplug() in cpufreqAshok Raj1-34/+49
There are some callers in cpufreq hotplug notify path that the lowest function calls lock_cpu_hotplug(). The lock is already held during cpu_up() and cpu_down() calls when the notify calls are broadcast to registered clients. Ideally if possible, we could disable_preempt() at the highest caller and make sure we dont sleep in the path down in cpufreq->driver_target() calls but the calls are so intertwined and cumbersome to cleanup. Hence we consistently use lock_cpu_hotplug() and unlock_cpu_hotplug() in all places. - Removed export of cpucontrol semaphore and made it static. - removed explicit uses of up/down with lock_cpu_hotplug() so we can keep track of the the callers in same thread context and just keep refcounts without calling a down() that causes a deadlock. - Removed current_in_hotplug() uses - Removed PF_HOTPLUG_CPU in sched.h introduced for the current_in_hotplug() temporary workaround. Tested with insmod of cpufreq_stat.ko, and logical online/offline to make sure we dont have any hang situations. Signed-off-by: Ashok Raj <ashok.raj@intel.com> Cc: Zwane Mwaikambo <zwane@linuxpower.ca> Cc: Shaohua Li <shaohua.li@intel.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23[PATCH] Fix crash in unregister_console()Benjamin Herrenschmidt1-1/+1
If unregister_console() is inadvertently called while no consoles are registered, it will crash trying to dereference NULL pointer. It is necessary to fix that because register_console() provides no indication that it actually registered the console passed in. In fact, it may well decide not to register it based on various things... (akpm: It'd be better to make register_console() return something and fix the callers. All 106 of them...) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23[PATCH] mm: unbloat get_futex_keyHugh Dickins1-15/+0
The follow_page changes in get_futex_key have left it with two almost identical blocks, when handling the rare case of a futex in a nonlinear vma. get_user_pages will itself do that follow_page, and its additional find_extend_vma is hardly any overhead since the vma is already cached. Let's just delete the follow_page block and let get_user_pages do it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23[PATCH] Check the irq number is within boundsMatthew Wilcox1-0/+15
Most of the functions already check. Do the ones that didn't. Signed-off-by: Matthew Wilcox <matthew@wil.cx> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22[PATCH] unpaged: copy_page_range vmaHugh Dickins1-1/+1
For copy_one_pte's print_bad_pte to show the task correctly (instead of "???"), dup_mmap must pass down parent vma rather than child vma. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-18[PATCH] add success/failure indication to RCU torture testPaul E. McKenney1-6/+24
One issue with the RCU torture test is that the current error flagging can be lost in dmesg. This patch adds a "SUCCESS"/"FAILURE" string to the line that flags the end of the test, where it can easily be seen with "dmesg | tail" at the end of the test. Also adds tests of architecture-specific memory barriers -- or, more likely, of the RCU torture test itself. Cc: <vatsa@in.ibm.com> Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] DocBook: include printk documentationMartin Waitz1-2/+14
Add printk documentation to kernel-api. Signed-off-by: Martin Waitz <tali@admingilde.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] timespec: normalize off by one errorsGeorge Anzinger1-7/+3
It would appear that the timespec normalize code has an off by one error. Found in three places. Thanks to Ben for spotting. Signed-off-by: George Anzinger<george@mvista.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] aio: remove kioctx from mm_structZach Brown1-1/+0
Sync iocbs have a life cycle that don't need a kioctx. Their retrying, if any, is done in the context of their owner who has allocated them on the stack. The sole user of a sync iocb's ctx reference was aio_complete() checking for an elevated iocb ref count that could never happen. No path which grabs an iocb ref has access to sync iocbs. If we were to implement sync iocb cancelation it would be done by the owner of the iocb using its on-stack reference. Removing this chunk from aio_complete allows us to remove the entire kioctx instance from mm_struct, reducing its size by a third. On a i386 testing box the slab size went from 768 to 504 bytes and from 5 to 8 per page. Signed-off-by: Zach Brown <zach.brown@oracle.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] stop_machine() vs. synchronous IPI send deadlockKirill Korotaev1-3/+3
This fixes deadlock of stop_machine() vs. synchronous IPI send. The problem is that stop_machine() disables interrupts before disabling preemption on other CPUs. So if another CPU is preempted and then calls something like flush_tlb_all() it will deadlock with CPU doing stop_machine() and which can't process IPI due to disabled IRQs. I changed stop_machine() to do the same things exactly as it does on other CPUs, i.e. it should disable preemption first on _all_ CPUs including itself and only after that disable IRQs. Signed-off-by: Kirill Korotaev <dev@sw.ru> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Andrey Savochkin" <saw@sawoct.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] rcutorture: renice to low priorityIngo Molnar1-0/+4
Make the box usable for interactive work when running the RCU torture test, by renicing the RCU torture-test threads to +19 by default. Kthreads run at nice -5 by default. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] signal handling: revert sigkill priority fixHeiko Carstens1-10/+1
This patch reverts commit c33880aaddbbab1ccf36f4457ed1090621f2e39a since it's not needed anymore. As pointed out by Roland McGrath the real fix is to deliver all signals before returning to user space. See http://www.ussg.iu.edu/hypermail/linux/kernel/0509.2/0683.html A fix for s390 has been merged. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] m68k: introduce setup_thread_stack() and end_of_stack()Al Viro2-4/+3
encapsulates the rest of arch-dependent operations with thread_info access. Two new helpers - setup_thread_stack() and end_of_stack(). For normal case the former consists of copying thread_info of parent to new thread_info and the latter returns pointer immediately past the end of thread_info. Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk> Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] m68k: introduce task_thread_infoAl Viro3-6/+6
new helper - task_thread_info(task). On platforms that have thread_info allocated separately (i.e. in default case) it simply returns task->thread_info. m68k wants (and for good reasons) to embed its thread_info into task_struct. So it will (in later patch) have task_thread_info() of its own. For now we just add a macro for generic case and convert existing instances of its body in core kernel to uses of new macro. Obviously safe - all normal architectures get the same preprocessor output they used to get. Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk> Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] cpuset: fix return without releasing semaphoreBob Picco1-2/+3
It is wrong to acquire the semaphore and then return from cpuset_zone_allowed without releasing it. Signed-off-by: Bob Picco <bob.picco@hp.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] fix task_struct leak in ptraceChristoph Hellwig1-1/+1
When ptrace_attach fails we need to drop the task_struct reference. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13[PATCH] move pm_register/etc. to CONFIG_PM_LEGACY, pm_legacy.hJeff Garzik3-1/+12
Since few people need the support anymore, this moves the legacy pm_xxx functions to CONFIG_PM_LEGACY, and include/linux/pm_legacy.h. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-10[SPARC64]: Re-export uts_sem for solaris compat module.David S. Miller1-0/+2
Revert: b26b9bc58263acda274f82a9dde8b6d96559878a Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10[PATCH] Don't auto-reap traced childrenOleg Nesterov1-1/+1
If a task is being traced we never auto-reap it even if it might look like its parent doesn't care. The tracer obviously _does_ care. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] optimize activate_task()Chen, Kenneth W1-1/+2
recalc_task_prio() is called from activate_task() to calculate dynamic priority and interactive credit for the activating task. For real-time scheduling process, all that dynamic calculation is thrown away at the end because rt priority is fixed. Patch to optimize recalc_task_prio() away for rt processes. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <piggin@cyberone.com.au> Cc: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09Fix ptrace self-attach ruleLinus Torvalds1-1/+1
Before we did CLONE_THREAD, the way to check whether we were attaching to ourselves was to just check "current == task", but with CLONE_THREAD we should check that the thread group ID matches instead. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] sched: resched and cpu_idle reworkNick Piggin1-7/+14
Make some changes to the NEED_RESCHED and POLLING_NRFLAG to reduce confusion, and make their semantics rigid. Improves efficiency of resched_task and some cpu_idle routines. * In resched_task: - TIF_NEED_RESCHED is only cleared with the task's runqueue lock held, and as we hold it during resched_task, then there is no need for an atomic test and set there. The only other time this should be set is when the task's quantum expires, in the timer interrupt - this is protected against because the rq lock is irq-safe. - If TIF_NEED_RESCHED is set, then we don't need to do anything. It won't get unset until the task get's schedule()d off. - If we are running on the same CPU as the task we resched, then set TIF_NEED_RESCHED and no further action is required. - If we are running on another CPU, and TIF_POLLING_NRFLAG is *not* set after TIF_NEED_RESCHED has been set, then we need to send an IPI. Using these rules, we are able to remove the test and set operation in resched_task, and make clear the previously vague semantics of POLLING_NRFLAG. * In idle routines: - Enter cpu_idle with preempt disabled. When the need_resched() condition becomes true, explicitly call schedule(). This makes things a bit clearer (IMO), but haven't updated all architectures yet. - Many do a test and clear of TIF_NEED_RESCHED for some reason. According to the resched_task rules, this isn't needed (and actually breaks the assumption that TIF_NEED_RESCHED is only cleared with the runqueue lock held). So remove that. Generally one less locked memory op when switching to the idle thread. - Many idle routines clear TIF_POLLING_NRFLAG, and only set it in the inner most polling idle loops. The above resched_task semantics allow it to be set until before the last time need_resched() is checked before going into a halt requiring interrupt wakeup. Many idle routines simply never enter such a halt, and so POLLING_NRFLAG can be always left set, completely eliminating resched IPIs when rescheduling the idle task. POLLING_NRFLAG width can be increased, to reduce the chance of resched IPIs. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] sched: consider migration thread with smp niceCon Kolivas1-9/+26
The intermittent scheduling of the migration thread at ultra high priority makes the smp nice handling see that runqueue as being heavily loaded. The migration thread itself actually handles the balancing so its influence on priority balancing should be ignored. Signed-off-by: Con Kolivas <kernel@kolivas.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] sched: correct smp_nice_biasCon Kolivas1-6/+8
The priority biasing was off by mutliplying the total load by the total priority bias and this ruins the ratio of loads between runqueues. This patch should correct the ratios of loads between runqueues to be proportional to overall load. -2nd attempt. From: Dave Kleikamp <shaggy@austin.ibm.com> This patch fixes a divide-by-zero error that I hit on a two-way i386 machine. rq->nr_running is tested to be non-zero, but may change by the time it is used in the division. Saving the value to a local variable ensures that the same value that is checked is used in the division. Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] sched: smp nice bias busy queues on idle rebalanceCon Kolivas1-18/+23
To intensify the 'nice' support across physical cpus on SMP we can bias the loads on idle rebalancing. To prevent idle rebalance from trying to pull tasks from queues that appear heavily loaded we only bias the load if there is more than one task running. Add some minor micro-optimisations and have only one return from __source_load and __target_load functions. Fix the fact that target_load was not biased by priority when type == 0. Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] sched: account rt tasks in prio_bias()Con Kolivas1-8/+14
Real time tasks' effect on prio_bias should be based on their real time priority level instead of their static_prio which is based on nice. Signed-off-by: Con Kolivas <kernel@kolivas.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] sched: change prio bias only if queuedCon Kolivas1-5/+4
prio_bias should only be adjusted in set_user_nice if p is actually currently queued. Signed-off-by: Con Kolivas <kernel@kolivas.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] sched: implement nice support across physical cpus on SMPCon Kolivas1-18/+83
This patch implements 'nice' support across physical cpus on SMP. It introduces an extra runqueue variable prio_bias which is the sum of the (inverted) static priorities of all the tasks on the runqueue. This is then used to bias busy rebalancing between runqueues to obtain good distribution of tasks of different nice values. By biasing the balancing only during busy rebalancing we can avoid having any significant loss of throughput by not affecting the carefully tuned idle balancing already in place. If all tasks are running at the same nice level this code should also have minimal effect. The code is optimised out in the !CONFIG_SMP case. Signed-off-by: Con Kolivas <kernel@kolivas.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] swsusp: rework swsusp_suspendRafael J. Wysocki3-49/+47
This patch makes only the functions in swsusp.c call functions in snapshot.c and not both ways.  It also moves the check for available swap out of swsusp_suspend() which is necessary for separating the swap-handling functions in swsusp from the core code. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] swsusp: simplify pagedir relocationRafael J. Wysocki3-53/+28
This patch simplifies the relocation of the page backup list (aka pagedir) during resume. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] swsusp: reduce code duplicationRafael J. Wysocki3-70/+52
The changes made by this patch are necessary for the pagedir relocation simplification in the next patch.  Additionally, these changes allow us to drop check_pagedir() and make get_safe_page() be a one-line wrapper around alloc_image_page() (get_safe_page() goes to snapshot.c, because alloc_image_page() is static and it does not make sense to export it). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] sleep: Fix oops in enter_statePavel Machek1-1/+1
If ACPI sleep is not configured, but someone still wants to run swsusp, he'd get oops in enter_state. This is regression since 2.6.14 and this fixes it. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] quieten softlockup at bootAnton Blanchard1-3/+0
On a large SMP box we get a lot of softlockup thread XX started lines. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] cpu hotplug: fix locking in cpufreq driversAshok Raj1-0/+33
When calling target drivers to set frequency, we take cpucontrol lock. When we modified the code to accomodate CPU hotplug, there was an attempt to take a double lock of cpucontrol leading to a deadlock. Since the current thread context is already holding the cpucontrol lock, we dont need to make another attempt to acquire it. Now we leave a trace in current->flags indicating current thread already is under cpucontrol lock held, so we dont attempt to do this another time. Thanks to Andrew Morton for the beating:-) From: Brice Goglin <Brice.Goglin@ens-lyon.org> Build fix (akpm: this patch is still unpleasant. Ashok continues to look for a cleaner solution, doesn't he? ;)) Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-08[PATCH] Fix sysctl unregistration oops (CVE-2005-2709)Al Viro1-29/+107
You could open the /proc/sys/net/ipv4/conf/<if>/<whatever> file, then wait for interface to go away, try to grab as much memory as possible in hope to hit the (kfreed) ctl_table. Then fill it with pointers to your function. Then do read from file you've opened and if you are lucky, you'll get it called as ->proc_handler() in kernel mode. So this is at least an Oops and possibly more. It does depend on an interface going away though, so less of a security risk than it would otherwise be. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6Linus Torvalds1-7/+0
2005-11-07[PATCH] saner handling of auto_acct_off() and DQUOT_OFF() in umountAl Viro1-30/+62
The way we currently deal with quota and process accounting that might keep vfsmount busy at umount time is inherently broken; we try to turn them off just in case (not quite correctly, at that) and a) pray umount doesn't fail (otherwise they'll stay turned off) b) pray nobody doesn anything funny just as we turn quota off Moreover, LSM provides hooks for doing the same sort of broken logics. The proper way to deal with that is to introduce the second kind of reference to vfsmount. Semantics: - when the last normal reference is dropped, all special ones are converted to normal ones and if there had been any, cleanup is done. - normal reference can be cloned into a special one - special reference can be converted to normal one; that's a no-op if we'd already passed the point of no return (i.e. mntput() had converted special references to normal and started cleanup). The way it works: e.g. starting process accounting converts the vfsmount reference pinned by the opened file into special one and turns it back to normal when it gets shut down; acct_auto_close() is done when no normal references are left. That way it does *not* obstruct umount(2) and it silently gets turned off when the last normal reference to vfsmount is gone. Which is exactly what we want... The same should be done by LSM module that holds some internal references to vfsmount and wants to shut them down on umount - it should make them special and security_sb_umount_close() will be called exactly when the last normal reference to vfsmount is gone. quota handling is even simpler - we don't use normal file IO anymore, so there's no need to hold vfsmounts at all. DQUOT_OFF() is done from deactivate_super(), where it really belongs. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[SPARC64] mm: context switch ptlockHugh Dickins1-7/+0
sparc64 is unique among architectures in taking the page_table_lock in its context switch (well, cris does too, but erroneously, and it's not yet SMP anyway). This seems to be a private affair between switch_mm and activate_mm, using page_table_lock as a per-mm lock, without any relation to its uses elsewhere. That's fine, but comment it as such; and unlock sooner in switch_mm, more like in activate_mm (preemption is disabled here). There is a block of "if (0)"ed code in smp_flush_tlb_pending which would have liked to rely on the page_table_lock, in switch_mm and elsewhere; but its comment explains how dup_mmap's flush_tlb_mm defeated it. And though that could have been changed at any time over the past few years, now the chance vanishes as we push the page_table_lock downwards, and perhaps split it per page table page. Just delete that block of code. Which leaves the mysterious spin_unlock_wait(&oldmm->page_table_lock) in kernel/fork.c copy_mm. Textual analysis (supported by Nick Piggin) suggests that the comment was written by DaveM, and that it relates to the defeated approach in the sparc64 smp_flush_tlb_pending. Just delete this block too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-07[PATCH] unexport uts_semAdrian Bunk1-2/+0
I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] unexport idle_cpuAdrian Bunk1-2/+0
I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] unexport console_unblankAdrian Bunk1-1/+0
I didn't find any possible modular usage of console_unblank in the kernel. This patch was already ACK'ed by Alan Cox. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] kernel-docs: fix kernel-doc format problemsRandy Dunlap1-1/+1
Convert to proper kernel-doc format. Some have extra blank lines (not allowed immed. after the function name) or need blank lines (after all parameters). Function summary must be only one line. Colon (":") in a function description does weird things (causes kernel-doc to think that it's a new section head sadly). Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] more kernel-doc cleanups, additionsRandy Dunlap3-6/+11
Various core kernel-doc cleanups: - add missing function parameters in ipc, irq/manage, kernel/sys, kernel/sysctl, and mm/slab; - move description to just above function for kernel_restart() Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] Kprobes: preempt_disable/enable() simplificationAnanth N Mavinakayanahalli1-1/+1
Reorganize the preempt_disable/enable calls to eliminate the extra preempt depth. Changes based on Paul McKenney's review suggestions for the kprobes RCU changeset. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] Kprobes: Use RCU for (un)register synchronization - base changesAnanth N Mavinakayanahalli1-61/+42
Changes to the base kprobes infrastructure to use RCU for synchronization during kprobe registration and unregistration. These changes coupled with the arch kprobe changes (next in series): a. serialize registration and unregistration of kprobes. b. enable lockless execution of handlers. Handlers can now run in parallel. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] Kprobes: Track kprobe on a per_cpu basis - base changesAnanth N Mavinakayanahalli1-15/+28
Changes to the base kprobe infrastructure to track kprobe execution on a per-cpu basis. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] consolidate sys_ptrace()Christoph Hellwig1-0/+82
The sys_ptrace boilerplate code (everything outside the big switch statement for the arch-specific requests) is shared by most architectures. This patch moves it to kernel/ptrace.c and leaves the arch-specific code as arch_ptrace. Some architectures have a too different ptrace so we have to exclude them. They continue to keep their implementations. For sh64 I had to add a sh64_ptrace wrapper because it does some initialization on the first call. For um I removed an ifdefed SUBARCH_PTRACE_SPECIAL block, but SUBARCH_PTRACE_SPECIAL isn't defined anywhere in the tree. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Paul Mackerras <paulus@samba.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-By: David Howells <dhowells@redhat.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] fix remaining missing includesTim Schmielau1-0/+1
Fix more include file problems that surfaced since I submitted the previous fix-missing-includes.patch. This should now allow not to include sched.h from module.h, which is done by a followup patch. Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] FUTEX_WAKE_OP: enhanced error handlingDavid Gibson1-0/+5
The code for FUTEX_WAKE_OP calls an arch callback, futex_atomic_op_inuser(). That callback can return an error code, but currently the caller assumes any error is EFAULT, and will try various things to resolve the fault before eventually giving up with EFAULT (regardless of the original error code). This is not a theoretical case - arch callbacks currently return -ENOSYS if the opcode they are given is bogus. This patch alters the code to detect non-EFAULT errors and return them directly to the user. Of course, whether -ENOSYS is the correct return value for the bogus opcode case, or whether EINVAL would be more appropriate is another question. Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jamie Lokier <jamie@shareable.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] aio: remove aio_max_nr accounting raceZach Brown1-2/+2
AIO was adding a new context's max requests to the global total before testing if that resulting total was over the global limit. This let innocent tasks get their new limit tested along with a racing guilty task that was crossing the limit. This serializes the _nr accounting with a spinlock It also switches to using unsigned long for the global totals. Individual contexts are still limited to an unsigned int's worth of requests by the syscall interface. The problem and fix were verified with a simple program that spun creating and destroying a context while holding on to another long lived context. Before the patch a task creating a tiny context could get a spurious EAGAIN if it raced with a task creating a very large context that overran the limit. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] Process Events ConnectorMatt Helsley3-0/+13
This patch adds a connector that reports fork, exec, id change, and exit events for all processes to userspace. It replaces the fork_advisor patch that ELSA is currently using. Applications that may find these events useful include accounting/auditing (e.g. ELSA), system activity monitoring (e.g. top), security, and resource management (e.g. CKRM). Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] swsusp: remove unused variablePavel Machek1-8/+1
Remove unused variable, and make code less evil that way. Fix whitespace around for-loop-like macro. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] swsusp cleanupsPavel Machek2-28/+27
This cleans spaces between * and pointer up, and adds "int" in "unsigned int". Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] cpu hoptlug: avoid usage of smp_processor_id() in preemptible codeHeiko Carstens4-4/+7
Replace smp_processor_id() with any_online_cpu(cpu_online_map) in order to avoid lots of "BUG: using smp_processor_id() in preemptible [00000001] code:..." messages in case taking a cpu online fails. All the traces start at the last notifier_call_chain(...) in kernel/cpu.c. Since we hold the cpu_control semaphore it shouldn't be any problem to access cpu_online_map. The reason why cpu_up failed is simply that the cpu that was supposed to be taken online wasn't even there. That is because on s390 we never know when a new cpu comes and therefore cpu_possible_map consists of only ones and doesn't reflect reality. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] posix-timers `unlikely' rejigAndrew Morton1-3/+3
!unlikely(expr) hurts my brain. likely(!expr) is more straightforward. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-04[PATCH] improve scheduler fairness a bitOleg Nesterov1-1/+1
Do not transfer remaining time slice to another cpu on process exit. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-31Merge ../linux-2.6 by handPaul Mackerras32-1143/+1661
2005-10-30[PATCH] fix missing includesTim Schmielau3-0/+3
I recently picked up my older work to remove unnecessary #includes of sched.h, starting from a patch by Dave Jones to not include sched.h from module.h. This reduces the number of indirect includes of sched.h by ~300. Another ~400 pointless direct includes can be removed after this disentangling (patch to follow later). However, quite a few indirect includes need to be fixed up for this. In order to feed the patches through -mm with as little disturbance as possible, I've split out the fixes I accumulated up to now (complete for i386 and x86_64, more archs to follow later) and post them before the real patch. This way this large part of the patch is kept simple with only adding #includes, and all hunks are independent of each other. So if any hunk rejects or gets in the way of other patches, just drop it. My scripts will pick it up again in the next round. Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] Remove duplicate code in signal.cPaul E. McKenney1-11/+5
Combine a bit of redundant code between force_sig_info() and force_sig_specific(). Signed-off-by: paulmck@us.ibm.com Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] remove unneeded SI_TIMER checksOleg Nesterov1-18/+0
This patch removes checks for ->si_code == SI_TIMER from send_signal, specific_send_sig_info, __group_send_sig_info. I think posix-timers.c used these functions some time ago, now it sends signals via send_{,group_}sigqueue, so these hooks are unneeded. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] cleanup the usage of SEND_SIG_xxx constantsOleg Nesterov1-11/+7
This patch simplifies some checks for magic siginfo values. It should not change the behaviour in any way. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] remove hardcoded SEND_SIG_xxx constantsOleg Nesterov2-20/+25
This patch replaces hardcoded SEND_SIG_xxx constants with their symbolic names. No changes in affected .o files. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] sched: hardcode non-smp set_cpus_allowedPaul Jackson1-1/+0
Simplify the UP (1 CPU) implementatin of set_cpus_allowed. The one CPU is hardcoded to be cpu 0 - so just test for that bit, and avoid having to pick up the cpu_online_map. Also, unexport cpu_online_map: it was only needed for set_cpus_allowed(). Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] posix-cpu-timers: fix overrun reportingRoland McGrath1-6/+10
This change corrects an omission in posix_cpu_timer_schedule, so that it correctly propagates the overrun calculation to where it will get reported to the user. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] RCU torture-testing kernel modulePaul E. McKenney3-0/+503
This patch is a rewrite of the one submitted on October 1st, using modules (http://marc.theaimsgroup.com/?l=linux-kernel&m=112819093522998&w=2). This rewrite adds a tristate CONFIG_RCU_TORTURE_TEST, which enables an intense torture test of the RCU infratructure. This is needed due to the continued changes to the RCU infrastructure to accommodate dynamic ticks, CPU hotplug, realtime, and so on. Most of the code is in a separate file that is compiled only if the CONFIG variable is set. Documentation on how to run the test and interpret the output is also included. This code has been tested on i386 and ppc64, and an earlier version of the code has received extensive testing on a number of architectures as part of the PREEMPT_RT patchset. Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] jiffies_64 cleanupThomas Gleixner1-0/+4
Define jiffies_64 in kernel/timer.c rather than having 24 duplicated defines in each architecture. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] wait4 PTRACE_ATTACH race fixRoland McGrath1-0/+9
Back about a year ago when I last fiddled heavily with the do_wait code, I was thinking too hard about the wrong thing and I now think I introduced a bug whose inverse thought I was fixing. Apparently noone was looking too hard over much shoulder, so as to cite my bogus reasoning at the time. In the race condition when PTRACE_ATTACH is about to steal a child and then the child hits a tracing event (what my_ptrace_child checks for), the real parent does need to set its flag noting it has some eligible live children. Otherwise a spurious ECHILD error is possible, since the child in question is not yet on the ptrace_children list. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] PF_DEAD cleanupCoywolf Qi Hunt1-5/+5
The PF_DEAD setting doesn't belong to exit_notify(), move it to a proper place. Signed-off-by: Coywolf Qi Hunt <qiyong@fc-cn.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] cleanup for kernel/printk.cJesper Juhl1-36/+42
- Removes some trailing whitespace - Breaks long lines and make other small changes to conform to CodingStyle - Add explicit printk loglevels in two places. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] Keys: Get rid of warning in kmod.c if keys disabledDavid Howells1-3/+3
The attached patch gets rid of a "statement without effect" warning when CONFIG_KEYS is disabled by making use of the return value of key_get(). The compiler will optimise all of this away when keys are disabled. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] ptrace/coredump/exit_group deadlockAndrea Arcangeli2-5/+8
I could seldom reproduce a deadlock with a task not killable in T state (TASK_STOPPED, not TASK_TRACED) by attaching a NPTL threaded program to gdb, by segfaulting the task and triggering a core dump while some other task is executing exit_group and while one task is in ptrace_attached TASK_STOPPED state (not TASK_TRACED yet). This originated from a gdb bugreport (the fact gdb was segfaulting the task wasn't a kernel bug), but I just incidentally noticed the gdb bug triggered a real kernel bug as well. Most threads hangs in exit_mm because the core_dumping is still going, the core dumping hangs because the stopped task doesn't exit, the stopped task can't wakeup because it has SIGNAL_GROUP_EXIT set, hence the deadlock. To me it seems that the problem is that the force_sig_specific(SIGKILL) in zap_threads is a noop if the task has PF_PTRACED set (like in this case because gdb is attached). The __ptrace_unlink does nothing because the signal->flags is set to SIGNAL_GROUP_EXIT|SIGNAL_STOP_DEQUEUED (verified). The above info also shows that the stopped task hit a race and got the stop signal (presumably by the ptrace_attach, only the attach, state is still TASK_STOPPED and gdb hangs waiting the core before it can set it to TASK_TRACED) after one of the thread invoked the core dump (it's the core dump that sets signal->flags to SIGNAL_GROUP_EXIT). So beside the fact nobody would wakeup the task in __ptrace_unlink (the state is _not_ TASK_TRACED), there's a secondary problem in the signal handling code, where a task should ignore the ptrace-sigstops as long as SIGNAL_GROUP_EXIT is set (or the wakeup in __ptrace_unlink path wouldn't be enough). So I attempted to make this patch that seems to fix the problem. There were various ways to fix it, perhaps you prefer a different one, I just opted to the one that looked safer to me. I also removed the clearing of the stopped bits from the zap_other_threads (zap_other_threads was safe unlike zap_threads). I don't like useless code, this whole NPTL signal/ptrace thing is already unreadable enough and full of corner cases without confusing useless code into it to make it even less readable. And if this code is really needed, then you may want to explain why it's not being done in the other paths that sets SIGNAL_GROUP_EXIT at least. Even after this patch I still wonder who serializes the read of p->ptrace in zap_threads. Patch is called ptrace-core_dump-exit_group-deadlock-1. This was the trace I've got: test T ffff81003e8118c0 0 14305 1 14311 14309 (NOTLB) ffff810058ccdde8 0000000000000082 000001f4000037e1 ffff810000000013 00000000000000f8 ffff81003e811b00 ffff81003e8118c0 ffff810011362100 0000000000000012 ffff810017ca4180 Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff80141677>{finish_stop+87} <ffffffff8014367f>{get_signal_to_deliver+1359} <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80111575>{sys_ptrace+2293} <ffffffff80131810>{default_wake_function+0} <ffffffff80196399>{sys_ioctl+73} <ffffffff8010dd27>{sysret_signal+28} <ffffffff8010e00f>{ptregscall_common+103} test D ffff810011362100 0 14309 1 14305 14312 (NOTLB) ffff810053c81cf8 0000000000000082 0000000000000286 0000000000000001 0000000000000195 ffff810011362340 ffff810011362100 ffff81002e338040 ffff810001e0ca80 0000000000000001 Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173} <ffffffff80131810>{default_wake_function+0} <ffffffff80137435>{exit_mm+149} <ffffffff801381af>{do_exit+479} <ffffffff80138d0c>{do_group_exit+252} <ffffffff801436db>{get_signal_to_deliver+1451} <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+2 <ffffffff8014208a>{force_sig_info+186} <ffffffff804479a0>{do_int3+112} <ffffffff8010e308>{retint_signal+61} test D ffff81002e338040 0 14311 1 14716 14305 (NOTLB) ffff81005ca8dcf8 0000000000000082 0000000000000286 0000000000000001 0000000000000120 ffff81002e338280 ffff81002e338040 ffff8100481cb740 ffff810001e0ca80 0000000000000001 Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173} <ffffffff80131810>{default_wake_function+0} <ffffffff80137435>{exit_mm+149} <ffffffff801381af>{do_exit+479} <ffffffff80142d0e>{__dequeue_signal+558} <ffffffff80138d0c>{do_group_exit+252} <ffffffff801436db>{get_signal_to_deliver+1451} <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+208} <ffffffff8014208a>{force_sig_info+186} <ffffffff804479a0>{do_int3+112} <ffffffff8010e308>{retint_signal+61} test D ffff810017ca4180 0 14312 1 14309 13882 (NOTLB) ffff81005d15fcb8 0000000000000082 ffff81005d15fc58 ffffffff80130816 0000000000000897 ffff810017ca43c0 ffff810017ca4180 ffff81003e8118c0 0000000000000082 ffffffff801317ed Call Trace:<ffffffff80130816>{activate_task+150} <ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173} <ffffffff80131810>{default_wake_function+0} <ffffffff8018cdc3>{do_coredump+819} <ffffffff80445f52>{thread_return+82} <ffffffff801436d4>{get_signal_to_deliver+1444} <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+2 <ffffffff804472e5>{_spin_unlock_irqrestore+5} <ffffffff8014208a>{force_sig_info+186} <ffffffff804476ff>{do_general_protection+159} <ffffffff8010e308>{retint_signal+61} Signed-off-by: Andrea Arcangeli <andrea@suse.de> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] cpusets: automatic numa mempolicy rebindingPaul Jackson1-0/+4
This patch automatically updates a tasks NUMA mempolicy when its cpuset memory placement changes. It does so within the context of the task, without any need to support low level external mempolicy manipulation. If a system is not using cpusets, or if running on a system with just the root (all-encompassing) cpuset, then this remap is a no-op. Only when a task is moved between cpusets, or a cpusets memory placement is changed does the following apply. Otherwise, the main routine below, rebind_policy() is not even called. When mixing cpusets, scheduler affinity, and NUMA mempolicies, the essential role of cpusets is to place jobs (several related tasks) on a set of CPUs and Memory Nodes, the essential role of sched_setaffinity is to manage a jobs processor placement within its allowed cpuset, and the essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs memory placement within its allowed cpuset. However, CPU affinity and NUMA memory placement are managed within the kernel using absolute system wide numbering, not cpuset relative numbering. This is ok until a job is migrated to a different cpuset, or what's the same, a jobs cpuset is moved to different CPUs and Memory Nodes. Then the CPU affinity and NUMA memory placement of the tasks in the job need to be updated, to preserve their cpuset-relative position. This can be done for CPU affinity using sched_setaffinity() from user code, as one task can modify anothers CPU affinity. This cannot be done from an external task for NUMA memory placement, as that can only be modified in the context of the task using it. However, it easy enough to remap a tasks NUMA mempolicy automatically when a task is migrated, using the existing cpuset mechanism to trigger a refresh of a tasks memory placement after its cpuset has changed. All that is needed is the old and new nodemask, and notice to the task that it needs to rebind its mempolicy. The tasks mems_allowed has the old mask, the tasks cpuset has the new mask, and the existing cpuset_update_current_mems_allowed() mechanism provides the notice. The bitmap/cpumask/nodemask remap operators provide the cpuset relative calculations. This patch leaves open a couple of issues: 1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies: These mempolicies may reference nodes outside of those allowed to the current task by its cpuset. Tasks are migrated as part of jobs, which reside on what might be several cpusets in a subtree. When such a job is migrated, all NUMA memory policy references to nodes within that cpuset subtree should be translated, and references to any nodes outside that subtree should be left untouched. A future patch will provide the cpuset mechanism needed to mark such subtrees. With that patch, we will be able to correctly migrate these other memory policies across a job migration. 2) Updating cpuset, affinity and memory policies in user space: This is harder. Any placement state stored in user space using system-wide numbering will be invalidated across a migration. More work will be required to provide user code with a migration-safe means to manage its cpuset relative placement, while preserving the current API's that pass system wide numbers, not cpuset relative numbers across the kernel-user boundary. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] cpusets: simple renamePaul Jackson1-0/+16
Add support for renaming cpusets. Only allow simple rename of cpuset directories in place. Don't allow moving cpusets elsewhere in hierarchy or renaming the special cpuset files in each cpuset directory. The usefulness of this simple rename became apparent when developing task migration facilities. It allows building a second cpuset hierarchy using new names and containing new CPUs and Memory Nodes, moving tasks from the old to the new cpusets, removing the old cpusets, and then renaming the new cpusets to be just like the old names, so that any knowledge that the tasks had of their cpuset names will still be valid. Leaf node cpusets can be migrated to other CPUs or Memory Nodes by just updating their 'cpus' and 'mems' files, but because no cpuset can contain CPUs or Nodes not in its parent cpuset, one cannot do this in a cpuset hierarchy without first expanding all the non-leaf cpusets to contain the union of both the old and new CPUs and Nodes, which would obfuscate the one-to-one migration of a task from one cpuset to another required to correctly migrate the physical page frames currently allocated to that task. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] cpusets: dual semaphore locking overhaulPaul Jackson1-137/+281
Overhaul cpuset locking. Replace single semaphore with two semaphores. The suggestion to use two locks was made by Roman Zippel. Both locks are global. Code that wants to modify cpusets must first acquire the exclusive manage_sem, which allows them read-only access to cpusets, and holds off other would-be modifiers. Before making actual changes, the second semaphore, callback_sem must be acquired as well. Code that needs only to query cpusets must acquire callback_sem, which is also a global exclusive lock. The earlier problems with double tripping are avoided, because it is allowed for holders of manage_sem to nest the second callback_sem lock, and only callback_sem is needed by code called from within __alloc_pages(), where the double tripping had been possible. This is not quite the same as a normal read/write semaphore, because obtaining read-only access with intent to change must hold off other such attempts, while allowing read-only access w/o such intention. Changing cpusets involves several related checks and changes, which must be done while allowing read-only queries (to avoid the double trip), but while ensuring nothing changes (holding off other would be modifiers.) This overhaul of cpuset locking also makes careful use of task_lock() to guard access to the task->cpuset pointer, closing a couple of race conditions noticed while reading this code (thanks, Roman). I've never seen these races fail in any use or test. See further the comments in the code. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] cpusets: remove depth counted locking hackPaul Jackson1-65/+40
Remove a rather hackish depth counter on cpuset locking. The depth counter was avoiding a possible double trip on the global cpuset_sem semaphore. It worked, but now an improved version of cpuset locking is available, to come in the next patch, using two global semaphores. This patch reverses "cpuset semaphore depth check deadlock fix" The kernel still works, even after this patch, except for some rare and difficult to reproduce race conditions when agressively creating and destroying cpusets marked with the notify_on_release option, on very large systems. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] cpuset cleanupPaul Jackson1-1/+0
Remove one more useless line from cpuset_common_file_read(). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] posix-timers: use schedule_timeout() in common_nsleep()Oleg Nesterov1-18/+1
common_nsleep() reimplements schedule_timeout_interruptible() for unknown reason. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] Unify sys_tkill() and sys_tgkill()Vadim Lobanov1-46/+25
The majority of the sys_tkill() and sys_tgkill() function code is duplicated between the two of them. This patch pulls the duplication out into a separate function -- do_tkill() -- and lets sys_tkill() and sys_tgkill() be simple wrappers around it. This should make it easier to maintain in light of future changes. Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] kill sigqueue->lockOleg Nesterov1-6/+4
This lock is used in sigqueue_free(), but it is always equal to current->sighand->siglock, so we don't need to keep it in the struct sigqueue. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] remove timer debug fieldAndrew Morton1-35/+0
Remove timer_list.magic and associated debugging code. I originally added this when a spinlock was added to timer_list - this meant that an all-zeroes timer became illegal and init_timer() was required. That spinlock isn't even there any more, although timer.base must now be initialised. I'll keep this debugging code in -mm. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] Use alloc_percpu to allocate workqueues locallyChristoph Lameter1-13/+20
This patch makes the workqueus use alloc_percpu instead of an array. The workqueues are placed on nodes local to each processor. The workqueue structure can grow to a significant size on a system with lots of processors if this patch is not applied. 64 bit architectures with all debugging features enabled and configured for 512 processors will not be able to boot without this patch. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] ntp whitespace cleanupAndrew Morton1-126/+126
Fix bizarre 4-space coding style in the NTP code. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] NTP shift_right cleanupjohn stultz2-61/+17
Create a macro shift_right() that avoids the numerous ugly conditionals in the NTP code that look like: if(a < 0) b = -(-a >> shift); else b = a >> shift; Replacing it with: b = shift_right(a, shift); This should have zero effect on the logic, however it should probably have a bit of testing just to be sure. Also replace open-coded min/max with the macros. Signed-off-by : John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] Add kthread_stop_sem()Alan Stern1-2/+11
Enhance the kthread API by adding kthread_stop_sem, for use in stopping threads that spend their idle time waiting on a semaphore. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] introduce setup_timer() helperOleg Nesterov1-6/+2
Every user of init_timer() also needs to initialize ->function and ->data fields. This patch adds a simple setup_timer() helper for that. The schedule_timeout() is patched as an example of usage. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] introduce .valid callback for pm_opsShaohua Li1-1/+4
Add pm_ops.valid callback, so only the available pm states show in /sys/power/state. And this also makes an earlier states error report at enter_state before we do actual suspend/resume. Signed-off-by: Shaohua Li<shaohua.li@intel.com> Acked-by: Pavel Machek<pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] swsusp: two simplificationsRafael J. Wysocki2-7/+3
The following patch simplifies the progress meter in disk.c:free_some_memory() and makes disk.c:pm_suspend_disk() call device_resume() explicitly in the suspend path. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] swsusp: get rid of unnecessary wrapper functionRafael J. Wysocki1-7/+1
The following patch merges two functions in a trivial way. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] swsusp: cleanupsPavel Machek1-12/+10
Reduce number of ifdefs somehow, and fix whitespace a bit. No real code changes. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] swsusp: remove unneccessary includesPavel Machek2-31/+3
Cleanup comments and remove unneccessary includes. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] swsusp: rework memory freeing on resumeRafael J. Wysocki4-83/+45
The following patch makes swsusp use the PG_nosave and PG_nosave_free flags to mark pages that should be freed in case of an error during resume. This allows us to simplify the code and to use swsusp_free() in all of the swsusp's resume error paths, which makes them actually work. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] swsusp: reduce the use of global variablesRafael J. Wysocki3-43/+43
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] swsusp: move snapshot functionality to separate fileRafael J. Wysocki4-437/+486
The following patch moves the functionality of swsusp related to creating and handling the snapshot of memory to a separate file, snapshot.c This should enable us to untangle the code in the future and eventually to implement some parts of swsusp.c in the user space. The patch does not change the code. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] swsusp: rework image freeingRafael J. Wysocki1-84/+75
The following patch makes swsusp use PG_nosave and PG_nosave_free flags to mark pages that should be freed after the state of the system has been restored from the image (or in case of an error during suspend). This allows us to avoid storing metadata in swap twice and to reduce the amount of memory needed by swsusp.  Additionally, it allows us to simplify the code by removing a couple of functions that are no longer necessary. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] create and destroy cpufreq sysfs entries based on cpu notifiersAshok Raj1-0/+1
cpufreq entries in sysfs should only be populated when CPU is online state. When we either boot with maxcpus=x and then boot the other cpus by echoing to sysfs online file, these entries should be created and destroyed when CPU_DEAD is notified. Same treatement as cache entries under sysfs. We place the processor in the lowest frequency, so hw managed P-State transitions can still work on the other threads to save power. Primary goal was to just make these directories appear/disapper dynamically. There is one in this patch i had to do, which i really dont like myself but probably best if someone handling the cpufreq infrastructure could give this code right treatment if this is not acceptable. I guess its probably good for the first cut. - Converting lock_cpu_hotplug()/unlock_cpu_hotplug() to disable/enable preempt. The locking was smack in the middle of the notification path, when the hotplug is already holding the lock. I tried another solution to avoid this so avoid taking locks if we know we are from notification path. The solution was getting very ugly and i decided this was probably good for this iteration until someone who understands cpufreq could do a better job than me. (akpm: export cpucontrol to GPL modules: drivers/cpufreq/cpufreq_stats.c now does lock_cpu_hotplug()) Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Zwane Mwaikambo <zwane@holomorphy.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] Remove redundant configs.oBrian Gerst1-1/+0
Since CONFIG_IKCONFIG_PROC already depends on CONFIG_IKCONFIG, adding configs.o again is redundant. Signed-off-by: Brian Gerst <bgerst@didntduck.org> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: split page table lockHugh Dickins1-2/+2
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: follow_page with inner ptlockHugh Dickins1-4/+2
Final step in pushing down common core's page_table_lock. follow_page no longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself; and so no page_table_lock is taken in get_user_pages itself. But get_user_pages (and get_futex_key) do then need follow_page to pin the page for them: take Daniel's suggestion of bitflags to follow_page. Need one for WRITE, another for TOUCH (it was the accessed flag before: vanished along with check_user_page_readable, but surely get_numa_maps is wrong to mark every page it finds as accessed), another for GET. And another, ANON to dispose of untouched_anonymous_page: it seems silly for that to descend a second time, let follow_page observe if there was no page table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED - make_pages_present ought to make readonly anonymous present. Give get_numa_maps a cond_resched while we're there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: ptd_alloc take ptlockHugh Dickins1-2/+0
Second step in pushing down the page_table_lock. Remove the temporary bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not to hold page_table_lock, whether it's on init_mm or a user mm; take page_table_lock internally to check if a racing task already allocated. Convert their callers from common code. But avoid coming back to change them again later: instead of moving the spin_lock(&mm->page_table_lock) down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which encapsulate the mapping+locking and unlocking+unmapping together, and in the end may use alternatives to the mm page_table_lock itself. These callers all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them; and pte_alloc uses the "atomic" pmd_present to test whether it needs to allocate. It appears that on all arches we can safely descend without page_table_lock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: update_hiwaters just in timeHugh Dickins2-3/+4
update_mem_hiwater has attracted various criticisms, in particular from those concerned with mm scalability. Originally it was called whenever rss or total_vm got raised. Then many of those callsites were replaced by a timer tick call from account_system_time. Now Frank van Maarseveen reports that to be found inadequate. How about this? Works for Frank. Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually by 1): those are hot paths. Do the opposite, update only when about to lower rss (usually by many), or just before final accounting in do_exit. Handle mm->hiwater_vm in the same way, though it's much less of an issue. Demand that whoever collects these hiwater statistics do the work of taking the maximum with rss or total_vm. And there has been no collector of these hiwater statistics in the tree. The new convention needs an example, so match Frank's usage by adding a VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS (High-Water-Mark or High-Water-Memory). There was a particular anomaly during mremap move, that hiwater_vm might be captured too high. A fleeting such anomaly remains, but it's quickly corrected now, whereas before it would stick. What locking? None: if the app is racy then these statistics will be racy, it's not worth any overhead to make them exact. But whenever it suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under page_table_lock (for now) or with preemption disabled (later on): without going to any trouble, minimize the time between reading current values and updating, to minimize those occasions when a racing thread bumps a count up and back down in between. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] core remove PageReservedNick Piggin1-9/+16
Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and is be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from kernel/power/swsusp.c, and all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Last real user of PageReserved is swsusp, which uses PageReserved to determine whether a struct page points to valid memory or not. This still needs to be addressed (a generic page_is_ram() should work). A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. Signed-off-by: Nick Piggin <npiggin@suse.de> Refcount bug fix for filemap_xip.c Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: dup_mmap down new mmap_semHugh Dickins1-5/+4
One anomaly remains from when Andrea rationalized the responsibilities of mmap_sem and page_table_lock: in dup_mmap we add vmas to the child holding its page_table_lock, but not the mmap_sem which normally guards the vma list and rbtree. Which could be an issue for unuse_mm: though since it just walks down the list (today with page_table_lock, tomorrow not), it's probably okay. Will need a memory barrier? Oh, keep it simple, Nick and I agreed, no harm in taking child's mmap_sem here. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: dup_mmap use oldmm moreHugh Dickins1-6/+6
Use the parent's oldmm throughout dup_mmap, instead of perversely going back to current->mm. (Can you hear the sigh of relief from those mpnts? Usually I squash them, but not today.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: rss = file_rss + anon_rssHugh Dickins2-3/+3
I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: mm_init set_mm_countersHugh Dickins1-2/+2
How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but that's not so good if an mm_counter_t is a special type. And how is rss initialized? By set_mm_counter, all over the place. Come on, we just need to initialize them both at once by set_mm_counter in mm_init (which follows the memcpy when forking). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: vm_stat_account unshackledHugh Dickins1-1/+1
The original vm_stat_account has fallen into disuse, with only one user, and only one user of vm_stat_unaccount. It's easier to keep track if we convert them all to __vm_stat_account, then free it from its __shackles. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] TIMERS: add missing compensation for HZ == 250YOSHIFUJI Hideaki1-0/+9
Add missing compensation for (HZ == 250) != (1 << SHIFT_HZ) in second_overflow(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] missing exports of do_settimeofday() variantsAl Viro1-0/+1
frv, sh64, ia64 and sparc64 do not have do_settimeofday() exported (the last two are using variant in kernel/time.c). Exports added to match the rest of architectures. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>