aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2006-03-17[PATCH] posix-timers: fix requeue accounting when signal is ignoredRoman Zippel1-0/+1
When the posix-timer signal is ignored then the timer is rearmed by the callback function. The requeue pending accounting has to be fixed up else the state might be wrong. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] time_interpolator: add __read_mostlyChristoph Lameter1-2/+2
The pointer to the current time interpolator and the current list of time interpolators are typically only changed during bootup. Adding __read_mostly takes them away from possibly hot cachelines. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] unshare: Use rcu_assign_pointer when setting sighandEric W. Biederman1-1/+1
The sighand pointer only needs the rcu_read_lock on the read side. So only depending on task_lock protection when setting this pointer is not enough. We also need a memory barrier to ensure the initialization is seen first. Use rcu_assign_pointer as it does this for us, and clearly documents that we are setting an rcu readable pointer. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-14[PATCH] Fix sigaltstack corruption among cloned threadsGOTO Masanori1-0/+6
This patch fixes alternate signal stack corruption among cloned threads with CLONE_SIGHAND (and CLONE_VM) for linux-2.6.16-rc6. The value of alternate signal stack is currently inherited after a call of clone(... CLONE_SIGHAND | CLONE_VM). But if sigaltstack is set by a parent thread, and then if multiple cloned child threads (+ parent threads) call signal handler at the same time, some threads may be conflicted - because they share to use the same alternative signal stack region. Finally they get sigsegv. It's an undesirable race condition. Note that child threads created from NPTL pthread_create() also hit this conflict when the parent thread uses sigaltstack, without my patch. To fix this problem, this patch clears the child threads' sigaltstack information like exec(). This behavior follows the SUSv3 specification. In SUSv3, pthread_create() says "The alternate stack shall not be inherited (when new threads are initialized)". It means that sigaltstack should be cleared when sigaltstack memory space is shared by cloned threads with CLONE_SIGHAND. Note that I chose "if (clone_flags & CLONE_SIGHAND)" line because: - If clone_flags line is not existed, fork() does not inherit sigaltstack. - CLONE_VM is another choice, but vfork() does not inherit sigaltstack. - CLONE_SIGHAND implies CLONE_VM, and it looks suitable. - CLONE_THREAD is another candidate, and includes CLONE_SIGHAND + CLONE_VM, but this flag has a bit different semantics. I decided to use CLONE_SIGHAND. [ Changed to test for CLONE_VM && !CLONE_VFORK after discussion --Linus ] Signed-off-by: GOTO Masanori <gotom@sanori.org> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Linus Torvalds <torvalds@osdl.org> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-11[PATCH] remove __put_task_struct_cb export againChristoph Hellwig2-8/+3
The patch '[PATCH] RCU signal handling' [1] added an export for __put_task_struct_cb, a put_task_struct helper newly introduced in that patch. But the put_task_struct couldn't be used modular previously as __put_task_struct wasn't exported. There are not callers of it in modular code, and it shouldn't be exported because we don't want drivers to hold references to task_structs. This patch removes the export and folds __put_task_struct into __put_task_struct_cb as there's no other caller. [1] http://www2.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e56d090310d7625ecb43a1eeebd479f04affb48b Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08[PATCH] fix file countingDipankar Sarma1-1/+4
I have benchmarked this on an x86_64 NUMA system and see no significant performance difference on kernbench. Tested on both x86_64 and powerpc. The way we do file struct accounting is not very suitable for batched freeing. For scalability reasons, file accounting was constructor/destructor based. This meant that nr_files was decremented only when the object was removed from the slab cache. This is susceptible to slab fragmentation. With RCU based file structure, consequent batched freeing and a test program like Serge's, we just speed this up and end up with a very fragmented slab - llm22:~ # cat /proc/sys/fs/file-nr 587730 0 758844 At the same time, I see only a 2000+ objects in filp cache. The following patch I fixes this problem. This patch changes the file counting by removing the filp_count_lock. Instead we use a separate percpu counter, nr_files, for now and all accesses to it are through get_nr_files() api. In the sysctl handler for nr_files, we populate files_stat.nr_files before returning to user. Counting files as an when they are created and destroyed (as opposed to inside slab) allows us to correctly count open files with RCU. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08[PATCH] rcu batch tuningDipankar Sarma1-18/+58
This patch adds new tunables for RCU queue and finished batches. There are two types of controls - number of completed RCU updates invoked in a batch (blimit) and monitoring for high rate of incoming RCUs on a cpu (qhimark, qlowmark). By default, the per-cpu batch limit is set to a small value. If the input RCU rate exceeds the high watermark, we do two things - force quiescent state on all cpus and set the batch limit of the CPU to INTMAX. Setting batch limit to INTMAX forces all finished RCUs to be processed in one shot. If we have more than INTMAX RCUs queued up, then we have bigger problems anyway. Once the incoming queued RCUs fall below the low watermark, the batch limit is set to the default. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08[PATCH] idle threads should have a sane ->timestamp valueIngo Molnar1-0/+1
Idle threads should have a sane ->timestamp value, to avoid init kernel thread(s) from inheriting it and causing miscalculations in try_to_wake_up(). Reported-by: Mike Galbraith <efault@gmx.de>. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06[PATCH] time: add barrier after updating jiffies_64Atsushi Nemoto1-0/+2
Add a compiler barrier so that we don't read jiffies before updating jiffies_64. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06[PATCH] fix next_timer_interrupt() for hrtimerTony Lindgren2-0/+51
Also from Thomas Gleixner <tglx@linutronix.de> Function next_timer_interrupt() got broken with a recent patch 6ba1b91213e81aa92b5cf7539f7d2a94ff54947c as sys_nanosleep() was moved to hrtimer. This broke things as next_timer_interrupt() did not check hrtimer tree for next event. Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ, VST) implementations, as the system can be in idle when next hrtimer event was supposed to happen. At least ARM and S390 currently use next_timer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06Add early-boot-safety check to cond_resched()Linus Torvalds1-0/+2
Just to be safe, we should not trigger a conditional reschedule during the early boot sequence. We've historically done some questionable early on, and the safety warnings in __might_sleep() are generally turned off during that period, so there might be problems lurking. This affects CONFIG_PREEMPT_VOLUNTARY, which takes over might_sleep() to cause a voluntary conditional reschedule. Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-02[PATCH] time_interpolator: Use readq_relaxed() instead of readq().Christoph Lameter1-2/+2
On some platforms readq performs additional work to make sure I/O is done in a coherent way. This is not needed for time retrieval as done by the time interpolator. So we can use readq_relaxed instead which will improve performance. It affects sparc64 and ia64 only. Apparently it makes a significant difference on ia64. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: john stultz <johnstul@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-02[PATCH] fix acpi_video_flags on x86-64Stefan Seyfried1-1/+1
acpi_video_flags variable is unsigned long, so it should be set as such. This actually matters on x86-64. Signed-off-by: Stefan Seyfried <seife@suse.de> Signed-off-by: Pavel Machek <pavel@suse.cz> Cc: "Brown, Len" <len.brown@intel.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-28[IA64] sysctl option to silence unaligned trap warningsJes Sorensen1-0/+14
Allow sysadmin to disable all warnings about userland apps making unaligned accesses by using: # echo 1 > /proc/sys/kernel/ignore-unaligned-usertrap Rather than having to use prctl on a process by process basis. Default behaivour leaves the warnings enabled. Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-02-20[PATCH] kjournald keeps reference to namespaceBjörn Steinbrink1-0/+3
In daemonize() a new thread gets cleaned up and 'merged' with init_task. The current fs_struct is handled there, but not the current namespace. This adds the namespace part. [ Eric Biederman pointed out the namespace wrappers, and also notes that we can't ever count on using our parents namespace because we already have called exit_fs(), which is the only way to the namespace from a process. ] Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Acked-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20Merge branch 'fixes.b8' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/bird
2006-02-20[PATCH] Fix compile for CONFIG_SYSVIPC=n or CONFIG_SYSCTL=nStephen Rothwell1-0/+2
The compat syscalls are added to sys_ni.c since they are not defined if the above CONFIG options are off. Also, nfs would not build with CONFIG_SYSCTL off. Noticed by Arthur Othieno. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20[PATCH] Fix undefined symbols for nommu architectureLuke Yang1-0/+2
Signed-off-by: Luke Yang <luke.adi@gmail.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20[PATCH] suspend-to-ram: allow video options to be set at runtimePavel Machek1-4/+12
Currently, acpi video options can only be set on kernel command line. That's little inflexible; I'd like userland s2ram application that just works, and modifying kernel command line according to whitelist is not fun. It is better to just allow s2ram application to set video options just before suspend (according to the whitelist). This implements sysctl to allow setting suspend video options without reboot. (akpm: Documentation updates for this new sysctl are pending..) Signed-off-by: Pavel Machek <pavel@suse.cz> Cc: "Brown, Len" <len.brown@intel.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-18[PATCH] GFP_KERNEL allocations in atomic (auditsc)Al Viro1-3/+3
audit_log_exit() is called from atomic contexts and gets explicit gfp_mask argument; it should use it for all allocations rather than doing some with gfp_mask and some with GFP_KERNEL. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-17[PATCH] swsusp: fix breakage with swap on LVMRafael J. Wysocki1-3/+1
Restore the compatibility with the older code and make it possible to suspend if the kernel command line doesn't contain the "resume=" argument Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17[PATCH] Introduce CONFIG_DEFAULT_MIGRATION_COSTIngo Molnar1-1/+12
Heiko Carstens <heiko.carstens@de.ibm.com> wrote: The boot sequence on s390 sometimes takes ages and we spend a very long time (up to one or two minutes) in calibrate_migration_costs. The time spent there differs from boot to boot. Also the calculated costs differ a lot. I've seen differences by up to a factor of 15 (yes, factor not percent). Also I doubt that making these measurements make much sense on a completely virtualized architecture where you cannot tell how much cpu time you will get anyway. So introduce the CONFIG_DEFAULT_MIGRATION_COST method for an architecture to set the scheduler migration costs. This turns off automatic detection of migration costs. Makes sense on virtual platforms, where migration costs are hard to measure accurately. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17[PATCH] Provide an interface for getting the current tick lengthPaul Mackerras1-5/+34
This provides an interface for arch code to find out how many nanoseconds are going to be added on to xtime by the next call to do_timer. The value returned is a fixed-point number in 52.12 format in nanoseconds. The reason for this format is that it gives the full precision that the timekeeping code is using internally. The motivation for this is to fix a problem that has arisen on 32-bit powerpc in that the value returned by do_gettimeofday drifts apart from xtime if NTP is being used. PowerPC is now using a lockless do_gettimeofday based on reading the timebase register and performing some simple arithmetic. (This method of getting the time is also exported to userspace via the VDSO.) However, the factor and offset it uses were calculated based on the nominal tick length and weren't being adjusted when NTP varied the tick length. Note that 64-bit powerpc has had the lockless do_gettimeofday for a long time now. It also had an extremely hairy routine that got called from the 32-bit compat routine for adjtimex, which adjusted the factor and offset according to what it thought the timekeeping code was going to do. Not only was this only called if a 32-bit task did adjtimex (i.e. not if a 64-bit task did adjtimex), it was also duplicating computations from kernel/timer.c and it wasn't clear that it was (still) correct. The simple solution is to ask the timekeeping code how long the current jiffy will be on each timer interrupt, after calling do_timer. If this jiffy will be a different length from the last one, we then need to compute new values for the factor and offset used in the lockless do_gettimeofday. In this way we can keep xtime and do_gettimeofday in sync, even when NTP is varying the tick length. Note that when adjtimex varies the tick length, it almost always introduces the variation from the next tick on. The only case I could see where adjtimex would vary the length of the current tick is when an old-style adjtime adjustment is being cancelled. (It's not clear to me why the adjustment has to be cancelled immediately rather than from the next tick on.) Thus I don't see any real need for a hook in adjtimex; the rare case of an old-style adjustment being cancelled can be fixed up at the next tick. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: john stultz <johnstul@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17[PATCH] x86_64: Add boot option to disable randomized mappings and cleanupAndi Kleen1-2/+0
AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution installation it's easiest if it's a boot time option. Also I moved the variable to a more appropiate place and make it independent from sysctl And marked __read_mostly which it is. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] swsusp: nuke noisy messageAndrew Morton1-3/+1
I get about 88 squillion of these when suspending an old ad450nx server. Cc: Pavel Roskin <proski@gnu.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] cpuset: oops in exit on null cpuset fixPaul Jackson1-1/+34
Fix a latent bug in cpuset_exit() handling. If a task tried to allocate memory after calling cpuset_exit(), it oops'd in cpuset_update_task_memory_state() on a NULL cpuset pointer. So set the exiting tasks cpuset to the root cpuset instead of to NULL. A distro kernel hit this with an added kernel package that had just such a hook (allocating memory) in the exit code path. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] fix zap_thread's ptrace related problemsOleg Nesterov1-10/+15
1. The tracee can go from ptrace_stop() to do_signal_stop() after __ptrace_unlink(p). 2. It is unsafe to __ptrace_unlink(p) while p->parent may wait for tasklist_lock in ptrace_detach(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Hellwig <hch@lst.de> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] fix kill_proc_info() vs fork() theoretical raceOleg Nesterov1-2/+2
copy_process: attach_pid(p, PIDTYPE_PID, p->pid); attach_pid(p, PIDTYPE_TGID, p->tgid); What if kill_proc_info(p->pid) happens in between? copy_process() holds current->sighand.siglock, so we are safe in CLONE_THREAD case, because current->sighand == p->sighand. Otherwise, p->sighand is unlocked, the new process is already visible to the find_task_by_pid(), but have a copy of parent's 'struct pid' in ->pids[PIDTYPE_TGID]. This means that __group_complete_signal() may hang while doing do ... while (next_thread() != p) We can solve this problem if we reverse these 2 attach_pid()s: attach_pid() does wmb() group_send_sig_info() calls spin_lock(), which provides a read barrier. // Yes ? I don't think we can hit this race in practice, but still. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] fix kill_proc_info() vs CLONE_THREAD raceOleg Nesterov1-3/+2
There is a window after copy_process() unlocks ->sighand.siglock and before it adds the new thread to the thread list. In that window __group_complete_signal(SIGKILL) will not see the new thread yet, so this thread will start running while the whole thread group was supposed to exit. I beleive we have another good reason to place attach_pid(PID/TGID) under ->sighand.siglock. We can do the same for release_task()->__unhash_process() de_thread()->switch_exec_pids() After that we don't need tasklist_lock to iterate over the thread list, and we can simplify things, see for example do_sigaction() or sys_times(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14[PATCH] hrtimer: round up relative start time on low-res archesIngo Molnar1-1/+12
CONFIG_TIME_LOW_RES is a temporary way for architectures to signal that they simply return xtime in do_gettimeoffset(). In this corner-case we want to round up by resolution when starting a relative timer, to avoid short timeouts. This will go away with the GTOD framework. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14[PATCH] sched: revert "filter affine wakeups"Chen, Kenneth W1-9/+1
Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6: [PATCH] sched: filter affine wakeups Apparently caused more than 10% performance regression for aim7 benchmark. The setup in use is 16-cpu HP rx8620, 64Gb of memory and 12 MSA1000s with 144 disks. Each disk is 72Gb with a single ext3 filesystem (courtesy of HP, who supplied benchmark results). The problem is, for aim7, the wake-up pattern is random, but it still needs load balancing action in the wake-up path to achieve best performance. With the above commit, lack of load balancing hurts that workload. However, for workloads like database transaction processing, the requirement is exactly opposite. In the wake up path, best performance is achieved with absolutely zero load balancing. We simply wake up the process on the CPU that it was previously run. Worst performance is obtained when we do load balancing at wake up. There isn't an easy way to auto detect the workload characteristics. Ingo's earlier patch that detects idle CPU and decide whether to load balance or not doesn't perform with aim7 either since all CPUs are busy (it causes even bigger perf. regression). Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6, which causes more than 10% performance regression with aim7. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14[PATCH] compound page: no access_process_vm checkHugh Dickins1-2/+1
The PageCompound check before access_process_vm's set_page_dirty_lock is no longer necessary, so remove it. But leave the PageCompound checks in bio_set_pages_dirty, dio_bio_complete and nfs_free_user_pages: at least some of those were introduced as a little optimization on hugetlb pages. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-10[PATCH] prevent recursive panic from softlockup watchdogJan Beulich1-0/+1
When panic_timeout is zero, suppress triggering a nested panic due to soft lockup detection. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-10[PATCH] sched: remove smpniceNick Piggin1-111/+18
I don't think the code is quite ready, which is why I asked for Peter's additions to also be merged before I acked it (although it turned out that it still isn't quite ready with his additions either). Basically I have had similar observations to Suresh in that it does not play nicely with the rest of the balancing infrastructure (and raised similar concerns in my review). The samples (group of 4) I got for "maximum recorded imbalance" on a 2x2 SMP+HT Xeon are as follows: | Following boot | hackbench 20 | hackbench 40 -----------+----------------+---------------------+--------------------- 2.6.16-rc2 | 30,37,100,112 | 5600,5530,6020,6090 | 6390,7090,8760,8470 +nosmpnice | 3, 2, 4, 2 | 28, 150, 294, 132 | 348, 348, 294, 347 Hackbench raw performance is down around 15% with smpnice (but that in itself isn't a huge deal because it is just a benchmark). However, the samples show that the imbalance passed into move_tasks is increased by about a factor of 10-30. I think this would also go some way to explaining latency blips turning up in the balancing code (though I haven't actually measured that). We'll probably have to revert this in the SUSE kernel. Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: "Martin J. Bligh" <mbligh@aracnet.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-09[PATCH] do_sigaction: cleanup ->sa_mask manipulationOleg Nesterov1-5/+3
Clear unblockable signals beforehand. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-09[PATCH] sys_signal: initialize ->sa_maskOleg Nesterov1-0/+1
Pointed out by Linus Torvalds. sys_signal() forgets to initialize ->sa_mask. ( I suspect arch/ia64/ia32/ia32_signal.c:sys32_signal() also needs this fix ) Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] kernel/sys.c NULL noise removalAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-07[PATCH] timer.c NULL noise removalAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-07[PATCH] remove bogus asm/bug.h includes.Al Viro1-1/+0
A bunch of asm/bug.h includes are both not needed (since it will get pulled anyway) and bogus (since they are done too early). Removed. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-07[PATCH] unshare system call -v5: unshare filesJANAK DESAI1-30/+51
If the file descriptor structure is being shared, allocate a new one and copy information from the current, shared, structure. Signed-off-by: Janak Desai <janak@us.ibm.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] unshare system call -v5: unshare vmJANAK DESAI1-31/+56
If vm structure is being shared, allocate a new one and copy information from the current, shared, structure. Signed-off-by: Janak Desai <janak@us.ibm.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] unshare system call -v5: unshare namespaceJANAK DESAI1-6/+11
If the namespace structure is being shared, allocate a new one and copy information from the current, shared, structure. Signed-off-by: Janak Desai <janak@us.ibm.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] unshare system call -v5: unshare filesystem infoJANAK DESAI1-3/+6
If filesystem structure is being shared, allocate a new one and copy information from the current, shared, structure. Signed-off-by: Janak Desai <janak@us.ibm.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] unshare system call -v5: system call handler functionJANAK DESAI1-0/+232
sys_unshare system call handler function accepts the same flags as clone system call, checks constraints on each of the flags and invokes corresponding unshare functions to disassociate respective process context if it was being shared with another task. Here is the link to a program for testing unshare system call. http://prdownloads.sourceforge.net/audit/unshare_test.c?download Please note that because of a problem in rmdir associated with bind mounts and clone with CLONE_NEWNS, the test fails while trying to remove temporary test directory. You can remove that temporary directory by doing rmdir, twice, from the command line. The first will fail with EBUSY, but the second will succeed. I have reported the problem to Ram Pai and Al Viro with a small program which reproduces the problem. Al told us yesterday that he will be looking at the problem soon. I have tried multiple rmdirs from the unshare_test program itself, but for some reason that is not working. Doing two rmdirs from command line does seem to remove the directory. Signed-off-by: Janak Desai <janak@us.ibm.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Andi Kleen <ak@muc.de> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] Fix build failure in recent pm_prepare_* changes.Rafael J. Wysocki2-17/+3
Fix compilation problem in PM headers. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] module: strlen_user() race fixAndrew Morton1-0/+3
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] swsusp: kill unneeded/unbalanced bio_getPavel Machek1-4/+2
- Remove unneeded bio_get() which would cause a bio leak - Writing doesn't dirty pages. Reading dirties pages. - We should dirty the pages after the IO completion, not before (Busy-waiting for disk I/O completion isn't very polite.) Signed-off-by: Pavel Machek <pavel@suse.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05[PATCH] missing license tag in intermoduleDave Jones1-0/+3
It may suck something awful, but it shouldn't taint the kernel. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05[PATCH] sched: only print migration_cost once per bootChuck Ebbert1-6/+8
migration_cost prints after every CPU hotplug event. Make it print only once at boot. Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05[PATCH] percpu data: only iterate over possible CPUsEric Dumazet1-1/+1
percpu_data blindly allocates bootmem memory to store NR_CPUS instances of cpudata, instead of allocating memory only for possible cpus. As a preparation for changing that, we need to convert various 0 -> NR_CPUS loops to use for_each_cpu(). (The above only applies to users of asm-generic/percpu.h. powerpc has gone it alone and is presently only allocating memory for present CPUs, so it's currently corrupting memory). Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Bottomley <James.Bottomley@steeleye.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@suse.de> Cc: Anton Blanchard <anton@samba.org> Acked-by: William Irwin <wli@holomorphy.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03[PATCH] uninline __sigqueue_free()Andrew Morton1-1/+1
Five callsites. I dunno how all this crap got back in there :( Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03[PATCH] cpuset: fix sparse warningRandy Dunlap1-1/+1
kernel/cpuset.c:644:38: warning: non-ANSI function declaration of function 'cpuset_update_task_memory_state' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03[PATCH] Normalize timespec for negative values in ns_to_timespecGeorge Anzinger1-6/+7
- In case of a negative nsec value the result of the division must be normalized. - Remove inline from an exported function. Signed-off-by: George Anzinger <george@wildturkeyranch.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03[PATCH] Tell kallsyms_lookup_name() to ignore type U entriesKeith Owens1-1/+2
When one module exports a function symbol and another module uses that symbol then kallsyms shows the symbol twice. Once from the consumer with a type of 'U' and once from the provider with a type of 't' or 'T'. On most architectures, both entries have the same address so it does not matter which one is returned by kallsyms_lookup_name(). But on architectures with function descriptors, the 'U' entry points to the descriptor, not to the code body, which is not what we want. IA64 # grep -w qla2x00_remove_one /proc/kallsyms a000000208c25ef8 U qla2x00_remove_one [qla2300] <= descriptor a000000208bf44c0 t qla2x00_remove_one [qla2xxx] <= function body Tell kallsyms_lookup_name() to ignore type U entries in modules. Signed-off-by: Keith Owens <kaos@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03[PATCH] Kprobes: Fix deadlock in function-return probesAnanth N Mavinakayanahalli1-1/+1
When two function-return probes are inserted on kfree()[1] and the second on say, sys_link()[2], and later [2] is unregistered, we have a deadlock as kfree is called with the kretprobe_lock held and the function-return probe on kfree will also try to grab the same lock. However, we can move the kfree() during unregistration to outside the spinlock as we are sure that no instances from the free list will be used after synchronized_sched() returns during the unregistration process. Thanks to Masami Hiramatsu for spotting this. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03[PATCH] kernel/kprobes.c: fix a warning #ifndef ARCH_SUPPORTS_KRETPROBESAdrian Bunk1-17/+17
kernel/kprobes.c:353: warning: 'pre_handler_kretprobe' defined but not used Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01Merge branch 'release' of ↵Linus Torvalds3-25/+13
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
2006-02-01[PATCH] zone_reclaim: configurable off node allocation period.Christoph Lameter1-0/+9
Currently the zone_reclaim code has a fixed window of 30 seconds of off node allocations should a local zone have no unused pagecache pages left. Reclaim will be attempted again after this timeout period to avoid repeated useless scans for memory. This is also useful to established sufficiently large off node allocation chunks to relieve the local node. It may be beneficial to adjust that time period for some special situations. For example if memory use was exceeding node capacity one may want to give up for longer periods of time. If memory spikes intermittendly then one may want to shorten the time period to reduce the number of off node allocations. This patch allows just that.... Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] zone_reclaim: minor fixesChristoph Lameter1-1/+2
- If we only reclaim nr_pages then its okay to stay on node. Switch from > to >= for the comparison. - vm_table[] entry for zone_reclaim_mode is a bit screwed up. - Add empty lines around shrink_zone to show that this is the central function to be called. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] swsusp: do not change log level during suspend/resumeRafael J. Wysocki2-11/+6
Prevent the kernel from setting the log level to 10 unconditionally during suspend/resume which was needed in the past for debugging, but generally is undesirable. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] sys_sched_getaffinity() & hotplugJack Steiner1-1/+1
Change sched_getaffinity() so that it returns a bitmap that indicates the legally schedulable cpus that a task is allowed to run on. Without this patch, if CONFIG_HOTPLUG_CPU is enabled, sched_getaffinity() unconditionally returns (at least on IA64) a mask with NR_CPUS bits set. This conveys no useful infornmation except for a kernel compile option. This fixes a breakage we obseved running recent kernels. We have MPI jobs that use sched_getaffinity() to determine where to place their threads. Placing them on non-existant cpus is problematic :-) Signed-off-by: Jack Steiner <steiner@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nathan Lynch <ntl@pobox.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] kernel/posix-timers.c: remove do_posix_clock_notimer_create()Adrian Bunk1-6/+0
This function is neither used nor has any real contents. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] hrtimers: set correct initial expiry time for relative SIGEV_NONE timersThomas Gleixner1-1/+6
The expiry time for relative timers with SIGEV_NONE set was never updated to the correct value. Pointed out by George Anzinger. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] hrtimers: add back lost credit linesThomas Gleixner1-0/+6
At some point we added credits to people who actively helped to bring k/hr-timers along. This was lost in the big code revamp. Add it back. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] hrtimers: cleanups and simplificationsGeorge Anzinger3-64/+34
Clean up the interface to hrtimers by changing the init code to pass the mode as well as the clock. This allow the init code to select the correct base and eliminates extra timer re-init code in posix-timers. We also simplify the restart interface nanosleep use. Signed-off-by: George Anzinger <george@mvista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] hrtimers: fix posix-timer requeue raceakpm@osdl.org1-0/+5
From: Steven Rostedtrostedt@goodmis.org <rostedt@goodmis.org> CPU0 expires a posix-timer and runs the callback function. The signal is queued. After releasing the posix-timer lock and before returning to hrtimer_run_queue CPU0 gets interrupted. CPU1 delivers the queued signal and rearms the timer. CPU0 comes back to hrtimer_run_queue and sets the timer state to expired. The next modification of the timer can result in an oops, because the state information is wrong. Keep track of state = RUNNING and check if the state has been in the return path of hrtimer_run_queue. In case the state has been changed, ignore a restart request and do not touch the state variable. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] hrtimers: fix oldvalue return in setitimerThomas Gleixner1-5/+5
This resolves bugzilla bug#5617. The oldvalue of the timer was read after the timer was cancelled, so the remaining time was always zero. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] hrtimers: fix possible use of NULL pointer in posix-timersThomas Gleixner1-1/+2
Fixup the conversion of posix-timers to hrtimers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] hrtimers: fixup itimer conversionThomas Gleixner1-1/+10
The itimer conversion removed the locking which protects the timer and variables in the shared signal structure. Steven Rostedt found the problem in the latest -rt patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01[PATCH] swsusp: use bytes as image size unitsRafael J. Wysocki3-9/+9
Make swsusp use bytes as the image size units, which is needed for future compatibility. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-31[PATCH] "Fix uidhash_lock <-> RXU deadlock" fixAndrew Morton1-10/+17
I get storms of warnings from local_bh_enable(). Better-tested patches, please. Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-31[PATCH] rcu_torture_lock deadlock fixIngo Molnar1-5/+5
rcu_torture_lock is used in a softirq-unsafe manner, but it is also taken by rcu_torture_cb(), which may execute in softirq-context, resulting in potential deadlocks. The fix is to acquire rcu_torture_lock in a softirq-safe manner. With this fix applied, the rcu-torture code passes validation. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-31[PATCH] fix uidhash_lock <-> RCU deadlockIngo Molnar1-8/+17
RCU task-struct freeing can call free_uid(), which is taking uidhash_lock - while other users of uidhash_lock are softirq-unsafe. The fix is to always take the uidhash_spinlock in a softirq-safe manner. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-31[PATCH] Fix boot-time slowdown for measure_migration_costIngo Molnar1-3/+3
This reduces the amount of time the migration cost calculations cost during bootup. Based on numbers by Tony Luck <tony.luck@intel.com>. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2006-01-31Don't try to "validate" a non-existing timeval.Linus Torvalds1-1/+1
settime() with a NULL timeval is silly but legal. Noticed by Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-24[ACPI] merge 3549 4320 4485 4588 4980 5483 5651 acpica asus fops pnpacpi ↵Len Brown3-25/+13
branches into release Signed-off-by: Len Brown <len.brown@intel.com>
2006-01-18[PATCH] EDAC: atomic scrub operationsAlan Cox2-2/+2
EDAC requires a way to scrub memory if an ECC error is found and the chipset does not do the work automatically. That means rewriting memory locations atomically with respect to all CPUs _and_ bus masters. That means we can't use atomic_add(foo, 0) as it gets optimised for non-SMP This adds a function to include/asm-foo/atomic.h for the platforms currently supported which implements a scrub of a mapped block. It also adjusts a few other files include order where atomic.h is included before types.h as this now causes an error as atomic_scrub uses u32. Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] Generic sys_rt_sigsuspend()David Woodhouse2-0/+54
The TIF_RESTORE_SIGMASK flag allows us to have a generic implementation of sys_rt_sigsuspend() instead of duplicating it for each architecture. This provides such an implementation and makes arch/powerpc use it. It also tidies up the ppc32 sys_sigsuspend() to use TIF_RESTORE_SIGMASK. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] fix sched_setscheduler semanticsJason Baron1-0/+4
Currently, a negative policy argument passed into the 'sys_sched_setscheduler()' system call, will return with success. However, the manpage for 'sys_sched_setscheduler' says: EINVAL The scheduling policy is not one of the recognized policies, or the parameter p does not make sense for the policy. Signed-off-by: Jason Baron <jbaron@redhat.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] Zone reclaim: proc overrideChristoph Lameter1-0/+11
proc support for zone reclaim This patch creates a proc entry /proc/sys/vm/zone_reclaim_mode that may be used to override the automatic determination of the zone reclaim made on bootup. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-16[PATCH] kernel/hrtimer.c sparse warning fixIngo Molnar1-1/+2
fix the following sparse warning: kernel/hrtimer.c:665:34: warning: incorrect type in argument 2 (different address spaces) kernel/hrtimer.c:665:34: expected void const *from kernel/hrtimer.c:665:34: got struct timespec [noderef] *<noident><asn:1> kernel/hrtimer.c:664:2: warning: dereference of noderef expression Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-16[PATCH] build kernel/intermodule.c only when requiredAdrian Bunk1-1/+2
Build kernel/intermodule.c only when required. Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-16[PATCH] hrtimer comment tweakJonathan Corbet1-1/+1
Fix a comment which missed an update cycle somewhere. Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds2-2/+2
2006-01-14[PATCH] cpuset oom lock fixPaul Jackson1-5/+28
The problem, reported in: http://bugzilla.kernel.org/show_bug.cgi?id=5859 and by various other email messages and lkml posts is that the cpuset hook in the oom (out of memory) code can try to take a cpuset semaphore while holding the tasklist_lock (a spinlock). One must not sleep while holding a spinlock. The fix seems easy enough - move the cpuset semaphore region outside the tasklist_lock region. This required a few lines of mechanism to implement. The oom code where the locking needs to be changed does not have access to the cpuset locks, which are internal to kernel/cpuset.c only. So I provided a couple more cpuset interface routines, available to the rest of the kernel, which simple take and drop the lock needed here (cpusets callback_sem). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14[PATCH] s390: spinlock fixesMartin Schwidefsky1-1/+1
Remove useless spin_retry_counter and fix compilation for UP kernels. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14[PATCH] Unlinline a bunch of other functionsArjan van de Ven6-21/+21
Remove the "inline" keyword from a bunch of big functions in the kernel with the goal of shrinking it by 30kb to 40kb Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14[PATCH] sched: add new SCHED_BATCH policyIngo Molnar2-16/+36
Add a new SCHED_BATCH (3) scheduling policy: such tasks are presumed CPU-intensive, and will acquire a constant +5 priority level penalty. Such policy is nice for workloads that are non-interactive, but which do not want to give up their nice levels. The policy is also useful for workloads that want a deterministic scheduling policy without interactivity causing extra preemptions (between that workload's tasks). Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-15correct email address of Manfred SpraulChristian Kujau1-1/+1
I tried to send the forcedeth maintainer an email, but it came back with: "The mail address manfreds@colorfullife.com is not read anymore. Please resent your mail to manfred@ instead of manfreds@." This patch fixes this. Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-01-15SOFTWARE_SUSPEND: fix a typo in the dependenciesAdrian Bunk1-1/+1
This patch fixes a typo in the dependencies of SOFTWARE_SUSPEND. This patch is based on a report by Jean-Luc Leger <reiga@dspnet.fr.eu.org>. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Pavel Machek <pavel@ucw.cz>
2006-01-12Merge master.kernel.org:/pub/scm/linux/kernel/git/tglx/hrtimer-2.6Linus Torvalds1-15/+19
2006-01-12[PATCH] sched: filter affine wakeupsakpm@osdl.org1-1/+9
) From: Nick Piggin <nickpiggin@yahoo.com.au> Track the last waker CPU, and only consider wakeup-balancing if there's a match between current waker CPU and the previous waker CPU. This ensures that there is some correlation between two subsequent wakeup events before we move the task. Should help random-wakeup workloads on large SMP systems, by reducing the migration attempts by a factor of nr_cpus. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12[PATCH] scheduler cache-hot-autodetectakpm@osdl.org1-0/+468
) From: Ingo Molnar <mingo@elte.hu> This is the latest version of the scheduler cache-hot-auto-tune patch. The first problem was that detection time scaled with O(N^2), which is unacceptable on larger SMP and NUMA systems. To solve this: - I've added a 'domain distance' function, which is used to cache measurement results. Each distance is only measured once. This means that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT distances 0 and 1, and on SMP distance 0 is measured. The code walks the domain tree to determine the distance, so it automatically follows whatever hierarchy an architecture sets up. This cuts down on the boot time significantly and removes the O(N^2) limit. The only assumption is that migration costs can be expressed as a function of domain distance - this covers the overwhelming majority of existing systems, and is a good guess even for more assymetric systems. [ People hacking systems that have assymetries that break this assumption (e.g. different CPU speeds) should experiment a bit with the cpu_distance() function. Adding a ->migration_distance factor to the domain structure would be one possible solution - but lets first see the problem systems, if they exist at all. Lets not overdesign. ] Another problem was that only a single cache-size was used for measuring the cost of migration, and most architectures didnt set that variable up. Furthermore, a single cache-size does not fit NUMA hierarchies with L3 caches and does not fit HT setups, where different CPUs will often have different 'effective cache sizes'. To solve this problem: - Instead of relying on a single cache-size provided by the platform and sticking to it, the code now auto-detects the 'effective migration cost' between two measured CPUs, via iterating through a wide range of cachesizes. The code searches for the maximum migration cost, which occurs when the working set of the test-workload falls just below the 'effective cache size'. I.e. real-life optimized search is done for the maximum migration cost, between two real CPUs. This, amongst other things, has the positive effect hat if e.g. two CPUs share a L2/L3 cache, a different (and accurate) migration cost will be found than between two CPUs on the same system that dont share any caches. (The reliable measurement of migration costs is tricky - see the source for details.) Furthermore i've added various boot-time options to override/tune migration behavior. Firstly, there's a blanket override for autodetection: migration_cost=1000,2000,3000 will override the depth 0/1/2 values with 1msec/2msec/3msec values. Secondly, there's a global factor that can be used to increase (or decrease) the autodetected values: migration_factor=120 will increase the autodetected values by 20%. This option is useful to tune things in a workload-dependent way - e.g. if a workload is cache-insensitive then CPU utilization can be maximized by specifying migration_factor=0. I've tested the autodetection code quite extensively on x86, on 3 P3/Xeon/2MB, and the autodetected values look pretty good: Dual Celeron (128K L2 cache): --------------------- migration cost matrix (max_cache_size: 131072, cpu: 467 MHz): --------------------- [00] [01] [00]: - 1.7(1) [01]: 1.7(1) - --------------------- cacheflush times [2]: 0.0 (0) 1.7 (1784008) --------------------- Here the slow memory subsystem dominates system performance, and even though caches are small, the migration cost is 1.7 msecs. Dual HT P4 (512K L2 cache): --------------------- migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz): --------------------- [00] [01] [02] [03] [00]: - 0.4(1) 0.0(0) 0.4(1) [01]: 0.4(1) - 0.4(1) 0.0(0) [02]: 0.0(0) 0.4(1) - 0.4(1) [03]: 0.4(1) 0.0(0) 0.4(1) - --------------------- cacheflush times [2]: 0.0 (33900) 0.4 (448514) --------------------- Here it can be seen that there is no migration cost between two HT siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory system makes inter-physical-CPU migration pretty cheap: 0.4 msecs. 8-way P3/Xeon [2MB L2 cache]: --------------------- migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz): --------------------- [00] [01] [02] [03] [04] [05] [06] [07] [00]: - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) [01]: 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) [02]: 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) [03]: 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) [04]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) [05]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) [06]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) [07]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - --------------------- cacheflush times [2]: 0.0 (0) 19.2 (19281756) --------------------- This one has huge caches and a relatively slow memory subsystem - so the migration cost is 19 msecs. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Cc: <wilder@us.ibm.com> Signed-off-by: John Hawkes <hawkes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12[hrtimer] Enforce resolution as lower limit of intervalsThomas Gleixner1-1/+4
Roman Zippel pointed out that the missing lower limit of intervals leads to an accounting error in the overrun count. Enforce the lower limit of intervals to resolution in the timer forwarding code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2006-01-12[hrtimer] Change resolution storage to ktime_t formatThomas Gleixner1-2/+1
Change the storage format of the per base resolution to ktime_t to make it easier accessible in the hrtimers code. Change the resolution from (NSEC_PER_SEC/HZ) to TICK_NSEC as Roman pointed out. TICK_NSEC is closer to the real resolution. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2006-01-12[hrtimer] Remove listhead from hrtimer structThomas Gleixner1-12/+14
The list_head in the hrtimer structure was introduced for easy access to the first timer with the further extensions of real high resolution timers in mind, but it turned out in the course of development that it is not necessary for the standard use case. Remove the list head and access the first expiry timer by a datafield in the timer base. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2006-01-11[PATCH] x86_64: Inclusion of ScaleMP vSMP architecture patches - vsmp_alignRavikiran G Thirumalai1-1/+5
vSMP specific alignment patch to 1. Define INTERNODE_CACHE_SHIFT for vSMP 2. Use this for alignment of critical structures 3. Use INTERNODE_CACHE_SHIFT for ARCH_MIN_TASKALIGN, and let the slab align task_struct allocations to the internode cacheline size 4. Introduce and use ARCH_MIN_MMSTRUCT_ALIGN for mm_struct slab allocations. Signed-off-by: Ravikiran Thirumalai <kiran@scalemp.com> Signed-off-by: Shai Fultheim <shai@scalemp.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11[PATCH] x86_64: Make the cpu_*_maps in kernel/sched.c read mostlyAndi Kleen1-3/+3
They are referred to often so avoid potential false sharing for them. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11[PATCH] move capable() to capability.hRandy.Dunlap13-0/+13
- Move capable() from sched.h to capability.h; - Use <linux/capability.h> where capable() is used (in include/, block/, ipc/, kernel/, a few drivers/, mm/, security/, & sound/; many more drivers/ to go) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11[PATCH] uninline capable()Ingo Molnar1-0/+12
Uninline capable(). Saves 2K of kernel text on a generic .config, and 1K on a tiny config. In addition it makes the use of capable more consistent between CONFIG_SECURITY and !CONFIG_SECURITY Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11[PATCH] kprobes: fix unloading of self probed moduleKeshavamurthy Anil S1-10/+32
When a kprobes modules is written in such a way that probes are inserted on itself, then unload of that moudle was not possible due to reference couning on the same module. The below patch makes a check and incrementes the module refcount only if it is not a self probed module. We need to allow modules to probe themself for kprobes performance measurements This patch has been tested on several x86_64, ppc64 and IA64 architectures. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11[PATCH] fix/simplify mutex debugging codeDavid Woodhouse1-2/+3
Let's switch mutex_debug_check_no_locks_freed() to take (addr, len) as arguments instead, since all its callers were just calculating the 'to' address for themselves anyway... (and sometimes doing so badly). Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] mutex: trivial whitespace cleanupsIngo Molnar2-5/+1
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] mark mutex_lock*() as might_sleep()Ingo Molnar1-0/+2
Mark mutex_lock() and mutex_lock_interruptible() as might_sleep() functions. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] fix i386 mutex fastpath on FRAME_POINTER && !DEBUG_MUTEXESIngo Molnar1-9/+0
Call the mutex slowpath more conservatively - e.g. FRAME_POINTERS can change the calling convention, in which case a direct branch to the slowpath becomes illegal. Bug found by Hugh Dickins. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] remove unnecessary asm/mutex.h from kernel/mutex-debug.cIngo Molnar1-2/+0
Remove unnecessary (and incorrect) inclusion of asm/mutex.h, pointed out by David Howells. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] rcu: fix hotplug-cpu ->donelist leakOleg Nesterov1-1/+2
Pointed out by Srivatsa Vaddagiri <vatsa@in.ibm.com>. rcu_do_batch() stops after processing maxbatch callbacks on ->donelist leaving rcu_tasklet in TASKLET_STATE_SCHED state. If CPU_DEAD event happens remaining ->donelist entries are lost, rcu_offline_cpu() kills this tasklet. With this patch ->donelist migrates along with ->curlist and ->nxtlist to the current cpu. Compile tested. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] rcu: join rcu_ctrlblk and rcu_stateOleg Nesterov1-44/+38
This patch moves rcu_state into the rcu_ctrlblk. I think there are no reasons why we should have 2 different variables to control rcu state. Every user of rcu_state has also "rcu_ctrlblk *rcp" in the parameter list. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] kernel/resource.c: __check_region(): remove pointless __deprecatedAdrian Bunk1-1/+1
If a __deprecated is desired it should go to the prototype in the header (where it currently isn't). But at this place it's pointless. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] Decrease number of pointer derefs in exit.cJesper Juhl1-16/+21
Decrease the number of pointer derefs in kernel/exit.c Benefits of the patch: - Fewer pointer dereferences should make the code slightly faster. - Size of generated code is smaller - improved readability Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] Kprobes: conversion from kcalloc to kzallocKeshavamurthy Anil S1-1/+1
Signed-of-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] kprobes: fix build breakageAnanth N Mavinakayanahalli1-2/+2
The following patch (against 2.6.15-rc5-mm3) fixes a kprobes build break due to changes introduced in the kprobe locking in 2.6.15-rc5-mm3. In addition, the patch reverts back the open-coding of kprobe_mutex. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] kprobes: arch_remove_kprobeAnil S Keshavamurthy1-3/+1
Currently arch_remove_kprobes() is only implemented/required for x86_64 and powerpc. All other architecture like IA64, i386 and sparc64 implementes a dummy function which is being called from arch independent kprobes.c file. This patch removes the dummy functions and replaces it with #define arch_remove_kprobe(p, s) do { } while(0) Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] kprobes-changed-from-using-spinlock-to-mutex fixKeshavamurthy Anil S1-14/+18
Based on some feedback from Oleg Nesterov, I have made few changes to previously posted patch. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] kprobes: changed from using spinlock to mutexAnil S Keshavamurthy1-48/+43
Since Kprobes runtime exception handlers is now lock free as this code path is now using RCU to walk through the list, there is no need for the register/unregister{_kprobe} to use spin_{lock/unlock}_isr{save/restore}. The serialization during registration/unregistration is now possible using just a mutex. In the above process, this patch also fixes a minor memory leak for x86_64 and powerpc. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] kprobes: enable funcions only for required archAnil S Keshavamurthy1-0/+2
Kernel/kprobes.c defines get_insn_slot() and free_insn_slot() which are currently required _only_ for x86_64 and powerpc (which has no-exec support). FYI, get{free}_insn_slot() functions manages the memory page which is mapped as executable, required for instruction emulation. This patch moves those two functions under __ARCH_WANT_KPROBES_INSN_SLOT and defines __ARCH_WANT_KPROBES_INSN_SLOT in arch specific kprobes.h file. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] Remove getnstimestamp()Matt Helsley1-22/+0
Remove getnstimestamp() in favor of ktime.h's ktime_get_ts() Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] Export ktime_get_ts()Matt Helsley1-0/+1
This series removes the getnstimestamp() function from kernel/time.c in favor of kernel/hrtimer.c's ktime_get_ts() function which currently does exactly the same thing: retrieves a high-resolution (ns) timespec structure and performs the wall_to_monotonic adjustment. This patch: Export ktime_get_ts() to be used as a timestamp function since it uses getnstimefoday() and does the wall_to_monotonic adjustment. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: convert posix timers completelyThomas Gleixner1-582/+135
- convert posix-timers.c to use hrtimers - remove the now obsolete abslist code Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: switch clock_nanosleep to hrtimer nanosleep APIThomas Gleixner2-133/+41
Switch clock_nanosleep to use the new nanosleep functions in hrtimer.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: switch sys_nanosleep to hrtimerThomas Gleixner2-56/+14
convert sys_nanosleep() to use hrtimer_nanosleep() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: create hrtimer nanosleep APIThomas Gleixner1-0/+127
introduce the hrtimer_nanosleep() and hrtimer_nanosleep_real() APIs. Not yet used by any code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: switch itimers to hrtimerThomas Gleixner3-59/+55
switch itimers to a hrtimers-based implementation Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: hrtimer core codeThomas Gleixner3-1/+682
hrtimer subsystem core. It is initialized at bootup and expired by the timer interrupt, but is otherwise not utilized by any other subsystem yet. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: introduce nsec_t type and conversion functionsThomas Gleixner1-0/+36
- introduce the nsec_t type - basic nsec conversion routines: timespec_to_ns(), timeval_to_ns(), ns_to_timespec(), ns_to_timeval(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: validate timespec of do_sys_settimeofdayThomas Gleixner1-0/+3
Check if the timespec which is provided from user space is normalized. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: create and use timespec_valid macroThomas Gleixner1-3/+2
add timespec_valid(ts) [returns false if the timespec is denorm] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: make clockid_t arguments constThomas Gleixner2-35/+43
add const arguments to the posix-timers.h API functions Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: export deinlined mktimeAndrew Morton1-0/+2
This is now uninlined, but some modules use it. Make it a non-GPL export, since the inlined mktime() was also available that way. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: clean up mktime and make arguments constIngo Molnar1-6/+9
add 'const' to mktime arguments, and clean it up a bit Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: deinline mktime and set_normalized_timespecThomas Gleixner1-0/+61
mktime() and set_normalized_timespec() are large inline functions used in many places: deinline them. From: George Anzinger, off-by-1 bugfix Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: remove duplicate div_long_long_rem implementationThomas Gleixner1-9/+1
make posix-timers.c use the generic calc64.h facility Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] common compat_sys_timer_createChristoph Hellwig1-2/+18
The comment in compat.c is wrong, every architecture provides a get_compat_sigevent() for the IPC compat code already. This basically moves the x86_64 version to common code and removes all the others. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Paul Mackerras <paulus@samba.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] kdump: read previous kernel's memoryVivek Goyal2-65/+0
- Moving the crash_dump.c file to arch dependent part as kmap_atomic_pfn is specific to i386 and highmem may not exist in other archs. - Use ioremap for x86_64 to map the previous kernel memory. - In copy_oldmem_page(), we now directly copy to the user/kernel buffer and avoid the unneccesary copy to a kmalloc'd page. Signed-off-by: Rachita Kothiyal <rachita@in.ibm.com> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] kdump: save registers early (inline functions)Vivek Goyal1-1/+3
- If system panics then cpu register states are captured through funciton crash_get_current_regs(). This is not a inline function hence a stack frame is pushed on to the stack and then cpu register state is captured. Later this frame is popped and new frames are pushed (machine_kexec). - In theory this is not very right as we are capturing register states for a frame and that frame is no more valid. This seems to have created back trace problems for ppc64. - This patch fixes it up. The very first thing it does after entering crash_kexec() is to capture the register states. Anyway we don't want the back trace beyond crash_kexec(). crash_get_current_regs() has been made inline - crash_setup_regs() is the top architecture dependent function which should be responsible for capturing the register states as well as to do some architecture dependent tricks. For ex. fixing up ss and esp for i386. crash_setup_regs() has also been made inline to ensure no new call frame is pushed onto stack. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] kdump: export per cpu crash notes pointer through sysfsVivek Goyal1-13/+0
- Kexec on panic functionality allocates memory for saving cpu registers in case of system crash event. Address of this allocated memory needs to be exported to user space, which is used by kexec-tools. - Previously, a single /sys/kernel/crash_notes entry was being exported as memory allocated was a single continuous array. Now memory allocation being dyanmic and per cpu based, address of per cpu buffer is exported through "/sys/devices/system/cpu/cpuX/crash_notes" Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] kdump: dynamic per cpu allocation of memory for saving cpu registersVivek Goyal1-0/+16
- In case of system crash, current state of cpu registers is saved in memory in elf note format. So far memory for storing elf notes was being allocated statically for NR_CPUS. - This patch introduces dynamic allocation of memory for storing elf notes. It uses alloc_percpu() interface. This should lead to better memory usage. - Introduced based on Andi Kleen's and Eric W. Biederman's suggestions. - This patch also moves memory allocation for elf notes from architecture dependent portion to architecture independent portion. Now crash_notes is architecture independent. The whole idea is that size of memory to be allocated per cpu (MAX_NOTE_BYTES) can be architecture dependent and allocation of this memory can be architecture independent. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] Remove set_fs() in stop_machine()akpm@osdl.org1-5/+1
) From: Brian Gerst <bgerst@didntduck.org> Call sched_setscheduler() directly instead. Signed-off-by: Brian Gerst <bgerst@didntduck.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09Merge master.kernel.org:/pub/scm/linux/kernel/git/mingo/mutex-2.6Linus Torvalds9-6/+975
2006-01-09[PATCH] rcu: don't set ->next_pending in rcu_start_batch()Oleg Nesterov1-7/+4
I think it is better to set ->next_pending in the caller, when it is needed. This saves one parameter, and this coincides with cpu_quiet() beahaviour, which sets ->completed = ->cur itself. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09[PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_semJes Sorensen1-5/+5
This patch converts the inode semaphore to a mutex. I have tested it on XFS and compiled as much as one can consider on an ia64. Anyway your luck with it might be different. Modified-by: Ingo Molnar <mingo@elte.hu> (finished the conversion) Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2006-01-09[PATCH] mutex subsystem, more debugging codeIngo Molnar2-0/+6
more mutex debugging: check for held locks during memory freeing, task exit, enable sysrq printouts, etc. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-09[PATCH] mutex subsystem, debugging codeIngo Molnar4-0/+603
mutex implementation - add debugging code. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-09[PATCH] mutex subsystem, coreIngo Molnar3-1/+361
mutex implementation, core files: just the basic subsystem, no users of it. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc-mergeLinus Torvalds2-0/+5
2006-01-09[PATCH] rcu: uninline __rcu_pending()Oleg Nesterov1-0/+30
__rcu_pending() is rather fat and called twice from rcu_pending(). rcu_pending() has multiple callers, and not that small too. This patch uninlines both of them. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Make vm86 support optionalMatt Mackall1-0/+2
This adds an option to remove vm86 support under CONFIG_EMBEDDED. Saves about 5k. This version eliminates most of the #ifdefs of the previous version and instead uses function stubs in vm86.h. Also, release_vm86_irqs is moved from asm-i386/irq.h to a more appropriate home in vm86.h so that the stubs can live together. $ size vmlinux-baseline vmlinux-novm86 text data bss dec hex filename 2920821 523232 190652 3634705 377611 vmlinux-baseline 2916268 523100 190492 3629860 376324 vmlinux-novm86 Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] tiny: Make *[ug]id16 support optionalMatt Mackall1-0/+19
Configurable 16-bit UID and friends support This allows turning off the legacy 16 bit UID interfaces on embedded platforms. text data bss dec hex filename 3330172 529036 190556 4049764 3dcb64 vmlinux-baseline 3328268 529040 190556 4047864 3dc3f8 vmlinux From: Adrian Bunk <bunk@stusta.de> UID16 was accidentially disabled for !EMBEDDED. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] simplify k_getrusage()Oleg Nesterov1-18/+11
Factor out common code for different RUSAGE_xxx cases. Don't take ->sighand->siglock in RUSAGE_SELF case, suggested by Ravikiran G Thirumalai <kiran@scalex86.org>. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] fix workqueue oops during cpu offlineNathan Lynch1-6/+10
Use first_cpu(cpu_possible_map) for the single-thread workqueue case. We used to hardcode 0, but that broke on systems where !cpu_possible(0) when workqueue_struct->cpu_workqueue_struct was changed from a static array to alloc_percpu. Commit id bce61dd49d6ba7799be2de17c772e4c701558f14 ("Fix hardcoded cpu=0 in workqueue for per_cpu_ptr() calls") fixed that for Ben's funky sparc64 system, but it regressed my Power5. Offlining cpu 0 oopses upon the next call to queue_work for a single-thread workqueue, because now we try to manipulate per_cpu_ptr(wq->cpu_wq, 1), which is uninitialized. So we need to establish an unchanging "slot" for single-thread workqueues which will have a valid percpu allocation. Since alloc_percpu keys off of cpu_possible_map, which must not change after initialization, make this slot == first_cpu(cpu_possible_map). Signed-off-by: Nathan Lynch <ntl@pobox.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] kernel/module.c: remove redundant spinlock in resolve_symbol()Ashutosh Naik1-2/+0
Remove the redundant spinlock in the function resolve_symbol() as we are not altering the module list, and we already hold the semaphore. Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] modules: mark TAINT_FORCED_RMMOD correctlyAkinobu Mita1-5/+5
Currently TAINT_FORCED_RMMOD is totally unused. Because it is marked as TAINT_FORCED_MODULE instead when user forced a module unload. This patch marks it correctly Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] modules: prevent overriding of symbolsAshutosh Naik1-0/+39
Ensure that an exported symbol does not already exist in the kernel or in some other module's exported symbol table. This is done by checking the symbol tables for the exported symbol at the time of loading the module. Currently this is done after the relocation of the symbol. Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com> Signed-off-by: Anand Krishnan <anandhkrishnan@yahoo.co.in> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] copy_process: error path cleanupOleg Nesterov1-6/+2
This patch moves 'fork_out:' under 'bad_fork_free:', and removes now unneeded 'if (retval)' check. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] setpgid: should not accept ptraced childsOleg Nesterov1-1/+1
sys_setpgid() allows to change ->pgrp of ptraced childs. 'man setpgid' does not tell anything about that, so I consider this behaviour is a bug. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Oren Laadan <orenl@cs.columbia.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] setpgid: should work for sub-threadsOren Laadan2-10/+8
setsid() does not work unless the calling process is a thread_group_leader(). 'man setpgid' does not tell anything about that, so I consider this behaviour is a bug. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] setpgid: should work for sub-threadsOleg Nesterov1-5/+6
setpgid(0, pgid) or setpgid(forked_child_pid, pgid) does not work unless the calling process is a thread_group_leader(). 'man setpgid' does not tell anything about that, so I consider this behaviour is a bug. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Oren Laadan <orenl@cs.columbia.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] fork: fix race in setting child's pgrp and ttyOren Laadan1-6/+3
In fork, child should recopy parent's pgrp/tty after it has tasklist_lock. Otherwise following a setpgid() on the parent, *after* copy_signal(), the child will own a stale pgrp (which may be reused); (eg. if copy_mm() sleeps a long while due to memory pressure). Similar issue for the tty. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Don't attempt to power off if power off is not implementedEric W. Biederman1-0/+6
The problem. It is expected that /sbin/halt -p works exactly like /sbin/halt, when the kernel does not implement power off functionality. The kernel can do a lot of work in the reboot notifiers and in device_shutdown before we even get to machine_power_off. Some of that shutdown is not safe if you are leaving the power on, and it definitely gets in the way of using sysrq or pressing ctrl-alt-del. Since the shutdown happens in generic code there is no way to fix this in architecture specific code :( Some machines are kernel oopsing today because of this. The simple solution is to turn LINUX_REBOOT_CMD_POWER_OFF into LINUX_REBOOT_CMD_HALT if power_off functionality is not implemented. This has the unfortunate side effect of disabling the power off functionality on architectures that leave pm_power_off to null and still implement something in machine_power_off. And it will break the build on some architectures that don't have a pm_power_off variable at all. On both counts I say tough. For architectures like alpha that don't implement the pm_power_off variable pm_power_off is declared in linux/pm.h and it is a generic part of our power management code, and all architectures should implement it. For architectures like parisc that have a default power off method in machine_power_off if pm_power_off is not implemented or fails. It is easy enough to set the pm_power_off variable. And nothing bad happens there, the machines just stop powering off. The current semantics are impossible without a flag at the top level so we can avoid the problem code if a power off is not implemented. pm_power_off is as good a flag as any with the bonus that it works without modification on at least x86, x86_64, powerpc, and ppc today. Andrew can you pick this up and put this in the mm tree. Kernels that don't compile or don't power off seem saner than kernels that oops or panic. Until we get the arch specific patches for the problem architectures this probably isn't smart to push into the stable kernel. Unfortunately I don't have the time at the moment to walk through every architecture and make them work. And even if I did I couldn't test it :( From: Hirokazu Takata <takata@linux-m32r.org> Add pm_power_off() for build fix of arch/m32r/kernel/process.c. From: Miklos Szeredi <miklos@szeredi.hu> UML build fix Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Hayato Fujiwara <fujiwara@linux-m32r.org> Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Extend RCU torture module to test tickless idle CPUSrivatsa Vaddagiri1-5/+91
This patch forces RCU torture threads off various CPUs in the system allowing them to become idle and go tickless. Meant to test support for such tickless idle CPU in RCU. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Add tainting for proprietary helper modulesDave Jones1-0/+5
Kernels that have had Windows drivers loaded into them are undebuggable. I've wasted a number of hours chasing bugs filed in Fedora bugzilla only to find out much later that the user had used such 'helpers', and their problems were unreproducable without them loaded. Acked-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] shrink dentry structEric Dumazet1-2/+2
Some long time ago, dentry struct was carefully tuned so that on 32 bits UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple of memory cache lines. Then RCU was added and dentry struct enlarged by two pointers, with nice results for SMP, but not so good on UP, because breaking the above tuning (128 + 8 = 136 bytes) This patch reverts this unwanted side effect, by using an union (d_u), where d_rcu and d_child are placed so that these two fields can share their memory needs. At the time d_free() is called (and d_rcu is really used), d_child is known to be empty and not touched by the dentry freeing. Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so the previous content of d_child is not needed if said dentry was unhashed but still accessed by a CPU because of RCU constraints) As dentry cache easily contains millions of entries, a size reduction is worth the extra complexity of the ugly C union. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Maneesh Soni <maneesh@in.ibm.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Ian Kent <raven@themaw.net> Cc: Paul Jackson <pj@sgi.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Neil Brown <neilb@cse.unsw.edu.au> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@epoch.ncsc.mil> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] remove unneeded sig->curr_target recalculationOleg Nesterov1-2/+0
This patch removes unneeded sig->curr_target recalculation under 'if (atomic_dec_and_test(&sig->count))' in __exit_signal(). When sig->count == 0 the signal can't be sent to this task and next_thread(tsk) == tsk anyway. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] little do_group_exit() cleanupOleg Nesterov1-1/+0
zap_other_threads() sets SIGNAL_GROUP_EXIT at the very start, do_group_exit() doesn't need to do it. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] kill_proc_info_as_uid: don't use hardcoded constantsOleg Nesterov1-2/+1
Use symbolic names instead of hardcoded constants. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Harald Welte <laforge@gnumonks.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Unchecked alloc_percpu() return in __create_workqueue()Ben Collins1-0/+5
__create_workqueue() not checking return of alloc_percpu() NULL dereference was possible. Signed-off-by: Ben Collins <bcollins@ubuntu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] sigaction should clear all signals on SIG_IGN, not just < 32George Anzinger1-2/+32
While rooting aroung in the signal code trying to understand how to fix the SIG_IGN ploy (set sig handler to SIG_IGN and flood system with high speed repeating timers) I came across what, I think, is a problem in sigaction() in that when processing a SIG_IGN request it flushes signals from 1 to SIGRTMIN and leaves the rest. Attempt to fix this. Signed-off-by: George Anzinger <george@mvista.com> Cc: Roland McGrath <roland@redhat.com> Cc: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] printk return value: fix itGuillaume Chazarain1-3/+3
What's the true meaning of the printk return value? Should it include the priority prefix length of 3? and what about the timing information? In both cases it was broken: strace -e write echo 1 > /dev/kmsg => write(1, "1\n", 2) = 5 strace -e write echo "<1>1" > /dev/kmsg => write(1, "<1>1\n", 5) = 8 The returned length was "length of input string + 3", I made it "length of string output to the log buffer". Note that I couldn't find any printk caller in the kernel interested by its return value besides kmsg_write. Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Acked-By: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] use ptrace_get_task_struct in various placesChristoph Hellwig1-31/+46
The ptrace_get_task_struct() helper that I added as part of the ptrace consolidation is useful in variety of places that currently opencode it. Switch them to the common helpers. Add a ptrace_traceme() helper that needs to be explicitly called, and simplify the ptrace_get_task_struct() interface. We don't need the request argument now, and we return the task_struct directly, using ERR_PTR() for error returns. It's a bit more code in the callers, but we have two sane routines that do one thing well now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] rcu file: use atomic primitivesNick Piggin2-15/+0
Use atomic_inc_not_zero for rcu files instead of special case rcuref. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] kernel/: small cleanupsAdrian Bunk4-2/+5
This patch contains the following cleanups: - make needlessly global functions static - every file should include the headers containing the prototypes for it's global functions Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: skip rcu check if task is in root cpusetPaul Jackson1-4/+9
For systems that aren't using cpusets, but have them CONFIG_CPUSET enabled in their kernel (eventually this may be most distribution kernels), this patch removes even the minimal rcu_read_lock() from the memory page allocation path. Actually, it removes that rcu call for any task that is in the root cpuset (top_cpuset), which on systems not actively using cpusets, is all tasks. We don't need the rcu check for tasks in the top_cpuset, because the top_cpuset is statically allocated, so at no risk of being freed out from underneath us. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: mark number_of_cpusets read_mostlyPaul Jackson1-1/+1
Mark cpuset global 'number_of_cpusets' as __read_mostly. This global is accessed everytime a zone is considered in the zonelist loops beneath __alloc_pages, looking for a free memory page. If number_of_cpusets is just one, then we can short circuit the mems_allowed check. Since this global is read alot on a hot path, and written rarely, it is an excellent candidate for __read_mostly. Thanks to Christoph Lameter for the suggestion. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: use rcu directly optimizationPaul Jackson1-10/+30
Optimize the cpuset impact on page allocation, the most performance critical cpuset hook in the kernel. On each page allocation, the cpuset hook needs to check for a possible change in the current tasks cpuset. It can now handle the common case, of no change, without taking any spinlock or semaphore, thanks to RCU. Convert a spinlock on the current task to an rcu_read_lock(), saving approximately a memory barrier and an atomic op, depending on architecture. This is done by adding rcu_assign_pointer() and synchronize_rcu() calls to the write side of the task->cpuset pointer, in cpuset.c:attach_task(), to delay freeing up a detached cpuset until after any critical sections referencing that pointer. Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: remove test for null cpuset from alloc code pathPaul Jackson1-6/+16
Remove a couple of more lines of code from the cpuset hooks in the page allocation code path. There was a check for a NULL cpuset pointer in the routine cpuset_update_task_memory_state() that was only needed during system boot, after the memory subsystem was initialized, before the cpuset subsystem was initialized, to catch a NULL task->cpuset pointer. Add a cpuset_init_early() routine, just before the mem_init() call in init/main.c, that sets up just enough of the init tasks cpuset structure to render cpuset_update_task_memory_state() calls harmless. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: migrate all tasks in cpuset at oncePaul Jackson1-13/+16
Given the mechanism in the previous patch to handle rebinding the per-vma mempolicies of all tasks in a cpuset that changes its memory placement, it is now easier to handle the page migration requirements of such tasks at the same time. The previous code didn't actually attempt to migrate the pages of the tasks in a cpuset whose memory placement changed until the next time each such task tried to allocate memory. This was undesirable, as users invoking memory page migration exected to happen when the placement changed, not some unspecified time later when the task needed more memory. It is now trivial to handle the page migration at the same time as the per-vma rebinding is done. The routine cpuset.c:update_nodemask(), which handles changing a cpusets memory placement ('mems') now checks for the special case of being asked to write a placement that is the same as before. It was harmless enough before to just recompute everything again, even though nothing had changed. But page migration is a heavy weight operation - moving pages about. So now it is worth avoiding that if asked to move a cpuset to its current location. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: rebind vma mempolicies fixPaul Jackson1-0/+90
Fix more of longstanding bug in cpuset/mempolicy interaction. NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset to just the Memory Nodes allowed by that cpuset. The kernel maintains internal state for each mempolicy, tracking what nodes are used for the MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies. When a tasks cpuset memory placement changes, whether because the cpuset changed, or because the task was attached to a different cpuset, then the tasks mempolicies have to be rebound to the new cpuset placement, so as to preserve the cpuset-relative numbering of the nodes in that policy. An earlier fix handled such mempolicy rebinding for mempolicies attached to a task. This fix rebinds mempolicies attached to vma's (address ranges in a tasks address space.) Due to the need to hold the task->mm->mmap_sem semaphore while updating vma's, the rebinding of vma mempolicies has to be done when the cpuset memory placement is changed, at which time mmap_sem can be safely acquired. The tasks mempolicy is rebound later, when the task next attempts to allocate memory and notices that its task->cpuset_mems_generation is out-of-date with its cpusets mems_generation. Because walking the tasklist to find all tasks attached to a changing cpuset requires holding tasklist_lock, a spinlock, one cannot update the vma's of the affected tasks while doing the tasklist scan. In general, one cannot acquire a semaphore (which can sleep) while already holding a spinlock (such as tasklist_lock). So a list of mm references has to be built up during the tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem acquired, and the vma's in that mm rebound. Once the tasklist lock is dropped, affected tasks may fork new tasks, before their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to point to the cpuset being rebound (there can only be one; cpuset modifications are done under a global 'manage_sem' semaphore), and the mpol_copy code that is used to copy a tasks mempolicies during fork catches such forking tasks, and ensures their children are also rebound. When a task is moved to a different cpuset, it is easier, as there is only one task involved. It's mm->vma's are scanned, using the same mpol_rebind_policy() as used above. It may happen that both the mpol_copy hook and the update done via the tasklist scan update the same mm twice. This is ok, as the mempolicies of each vma in an mm keep track of what mems_allowed they are relative to, and safely no-op a second request to rebind to the same nodes. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: number_of_cpusets optimizationPaul Jackson1-1/+11
Easy little optimization hack to avoid actually having to call cpuset_zone_allowed() and check mems_allowed, in the main page allocation routine, __alloc_pages(). This saves several CPU cycles per page allocation on systems not using cpusets. A counter is updated each time a cpuset is created or removed, and whenever there is only one cpuset in the system, it must be the root cpuset, which contains all CPUs and all Memory Nodes. In that case, when the counter is one, all allocations are allowed. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: numa_policy_rebind cleanupPaul Jackson1-1/+1
Cleanup, reorganize and make more robust the mempolicy.c code to rebind mempolicies relative to the containing cpuset after a tasks memory placement changes. The real motivator for this cleanup patch is to lay more groundwork for the upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's after the containing cpuset memory placement changes. NUMA mempolicies are constrained by the cpuset their task is a member of. When either (1) a task is moved to a different cpuset, or (2) the 'mems' mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to be recalculated, relative to their new cpuset placement. The old code used an unreliable method of determining what was the old mems_allowed constraining the mempolicy. It just looked at the tasks mems_allowed value. This sort of worked with the present code, that just rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken, referring to the old nodes. But in an upcoming patch, the vma mempolicies will be rebound as well. Then the order in which the various task and vma mempolicies are updated will no longer be deterministic, and one can no longer count on the task->mems_allowed holding the old value for as long as needed. It's not even clear if the current code was guaranteed to work reliably for task mempolicies. So I added a mems_allowed field to each mempolicy, stating exactly what mems_allowed the policy is relative to, and updated synchronously and reliably anytime that the mempolicy is rebound. Also removed a useless wrapper routine, numa_policy_rebind(), and had its caller, cpuset_update_task_memory_state(), call directly to the rewritten policy_rebind() routine, and made that rebind routine extern instead of static, and added a "mpol_" prefix to its name, making it mpol_rebind_policy(). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: implement cpuset_mems_allowedPaul Jackson1-3/+26
Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code needed, to obtain the mems_allowed vector of a cpuset, and replaced the workaround in sys_migrate_pages() to call this new method. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: combine refresh_mems and update_memsPaul Jackson1-54/+41
The important code paths through alloc_pages_current() and alloc_page_vma(), by which most kernel page allocations go, both called cpuset_update_current_mems_allowed(), which in turn called refresh_mems(). -Both- of these latter two routines did a tasklock, got the tasks cpuset pointer, and checked for out of date cpuset->mems_generation. That was a silly duplication of code and waste of CPU cycles on an important code path. Consolidated those two routines into a single routine, called cpuset_update_task_memory_state(), since it updates more than just mems_allowed. Changed all callers of either routine to call the new consolidated routine. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: fork hook fixPaul Jackson2-5/+5
Fix obscure, never seen in real life, cpuset fork race. The cpuset_fork() call in fork.c was setting up the correct task->cpuset pointer after the tasklist_lock was dropped, which briefly exposed the newly forked process with an unsafe (copied from parent without locks or usage counter increment) cpuset pointer. In theory, that exposed cpuset pointer could have been pointing at a cpuset that was already freed and removed, and in theory another task that had been sitting on the tasklist_lock waiting to scan the task list could have raced down the entire tasklist, found our new child at the far end, and dereferenced that bogus cpuset pointer. To fix, setup up the correct cpuset pointer in the new child by calling cpuset_fork() before the new task is linked into the tasklist, and with that, add a fork failure case, to dereference that cpuset, if the fork fails along the way, after cpuset_fork() was called. Had to remove a BUG_ON() from cpuset_exit(), because it was no longer valid - the call to cpuset_exit() from a failed fork would not have PF_EXITING set. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: update_nodemask code reformatPaul Jackson1-10/+15
Restructure code layout of the kernel/cpuset.c update_nodemask() routine, removing embedded returns and nested if's in favor of goto completion labels. This is being done in anticipation of adding more logic to this routine, which will favor the goto style structure. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: minor spacing initializer fixesPaul Jackson1-6/+3
Four trivial cpuset fixes: remove extra spaces, remove useless initializers, mark one __read_mostly. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: memory pressure meterPaul Jackson1-2/+191
Provide a simple per-cpuset metric of memory pressure, tracking the -rate- that the tasks in a cpuset call try_to_free_pages(), the synchronous (direct) memory reclaim code. This enables batch managers monitoring jobs running in dedicated cpusets to efficiently detect what level of memory pressure that job is causing. This is useful both on tightly managed systems running a wide mix of submitted jobs, which may choose to terminate or reprioritize jobs that are trying to use more memory than allowed on the nodes assigned them, and with tightly coupled, long running, massively parallel scientific computing jobs that will dramatically fail to meet required performance goals if they start to use more memory than allowed to them. This patch just provides a very economical way for the batch manager to monitor a cpuset for signs of memory pressure. It's up to the batch manager or other user code to decide what to do about it and take action. ==> Unless this feature is enabled by writing "1" to the special file /dev/cpuset/memory_pressure_enabled, the hook in the rebalance code of __alloc_pages() for this metric reduces to simply noticing that the cpuset_memory_pressure_enabled flag is zero. So only systems that enable this feature will compute the metric. Why a per-cpuset, running average: Because this meter is per-cpuset, rather than per-task or mm, the system load imposed by a batch scheduler monitoring this metric is sharply reduced on large systems, because a scan of the tasklist can be avoided on each set of queries. Because this meter is a running average, instead of an accumulating counter, a batch scheduler can detect memory pressure with a single read, instead of having to read and accumulate results for a period of time. Because this meter is per-cpuset rather than per-task or mm, the batch scheduler can obtain the key information, memory pressure in a cpuset, with a single read, rather than having to query and accumulate results over all the (dynamically changing) set of tasks in the cpuset. A per-cpuset simple digital filter (requires a spinlock and 3 words of data per-cpuset) is kept, and updated by any task attached to that cpuset, if it enters the synchronous (direct) page reclaim code. A per-cpuset file provides an integer number representing the recent (half-life of 10 seconds) rate of direct page reclaims caused by the tasks in the cpuset, in units of reclaims attempted per second, times 1000. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpuset: mempolicy one more nodemask conversionPaul Jackson1-10/+0
Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous conversion had left one routine using bitmaps, since it involved a corresponding change to kernel/cpuset.c Fix that interface by replacing with a simple macro that calls nodes_subset(), or if !CONFIG_CPUSET, returns (1). Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <christoph@lameter.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Simpler signal-exit concurrency handlingPaul E. McKenney1-5/+6
Some simplification in checking signal delivery against concurrent exit. Instead of using get_task_struct_rcu(), which increments the task_struct reference count, check the reference count after acquiring sighand lock. Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] RCU signal handlingIngo Molnar6-27/+111
RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked instead of tasklist_lock read-locked. This is a scalability improvement on SMP and a preemption-latency improvement under PREEMPT_RCU. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: William Irwin <wli@holomorphy.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp ↵Ravikiran G Thirumalai1-2/+2
macros ____cacheline_maxaligned_in_smp is currently used to align critical structures and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find L1_CACHE_SHIFT_MAX useless. However, we have been using ____cacheline_maxaligned_in_smp to align structures on the internode cacheline size. As per Andi's suggestion, following patch kills ____cacheline_maxaligned_in_smp and introduces INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches. Arches needing L3/Internode cacheline alignment can define INTERNODE_CACHE_SHIFT in the arch asm/cache.h. Patch replaces ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp With this patch, L1_CACHE_SHIFT_MAX can be killed Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] cpusets: swap migration interfacePaul Jackson1-2/+36
Add a boolean "memory_migrate" to each cpuset, represented by a file containing "0" or "1" in each directory below /dev/cpuset. It defaults to false (file contains "0"). It can be set true by writing "1" to the file. If true, then anytime that a task is attached to the cpuset so marked, the pages of that task will be moved to that cpuset, preserving, to the extent practical, the cpuset-relative placement of the pages. Also anytime that a cpuset so marked has its memory placement changed (by writing to its "mems" file), the tasks in that cpuset will have their pages moved to the cpusets new nodes, preserving, to the extent practical, the cpuset-relative placement of the moved pages. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Swap Migration V5: sys_migrate_pages interfaceChristoph Lameter1-0/+1
sys_migrate_pages implementation using swap based page migration This is the original API proposed by Ray Bryant in his posts during the first half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org. The intent of sys_migrate is to migrate memory of a process. A process may have migrated to another node. Memory was allocated optimally for the prior context. sys_migrate_pages allows to shift the memory to the new node. sys_migrate_pages is also useful if the processes available memory nodes have changed through cpuset operations to manually move the processes memory. Paul Jackson is working on an automated mechanism that will allow an automatic migration if the cpuset of a process is changed. However, a user may decide to manually control the migration. This implementation is put into the policy layer since it uses concepts and functions that are also needed for mbind and friends. The patch also provides a do_migrate_pages function that may be useful for cpusets to automatically move memory. sys_migrate_pages does not modify policies in contrast to Ray's implementation. The current code here is based on the swap based page migration capability and thus is not able to preserve the physical layout relative to it containing nodeset (which may be a cpuset). When direct page migration becomes available then the implementation needs to be changed to do a isomorphic move of pages between different nodesets. The current implementation simply evicts all pages in source nodeset that are not in the target nodeset. Patch supports ia64, i386 and x86_64. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] add schedule_on_each_cpu()Christoph Lameter1-0/+19
swap migration's isolate_lru_page() currently uses an IPI to notify other processors that the lru caches need to be drained if the page cannot be found on the LRU. The IPI interrupt may interrupt a processor that is just processing lru requests and cause a race condition. This patch introduces a new function run_on_each_cpu() that uses the keventd() to run the LRU draining on each processor. Processors disable preemption when dealing the LRU caches (these are per processor) and thus executing LRU draining from another process is safe. Thanks to Lee Schermerhorn <lee.schermerhorn@hp.com> for finding this race condition. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Make high and batch sizes of per_cpu_pagelists configurableRohit Seth1-0/+12
As recently there has been lot of traffic on the right values for batch and high water marks for per_cpu_pagelists. This patch makes these two variables configurable through /proc interface. A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry controls the fraction of pages at most in each zone that are allocated for each per cpu page list. The min value for this is 8. It means that we don't allow more than 1/8th of pages in each zone to be allocated in any single per_cpu_pagelist. The batch value of each per cpu pagelist is also updated as a result. It is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) Signed-off-by: Rohit Seth <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] drop-pagecacheAndrew Morton1-0/+10
Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to discard as much pagecache and/or reclaimable slab objects as it can. THis operation requires root permissions. It won't drop dirty data, so the user should run `sync' first. Caveats: a) Holds inode_lock for exorbitant amounts of time. b) Needs to be taught about NUMA nodes: propagate these all the way through so the discarding can be controlled on a per-node basis. This is a debugging feature: useful for getting consistent results between filesystem benchmarks. We could possibly put it under a config option, but it's less than 300 bytes. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09[PATCH] powerpc: Add arch-dependent copy_oldmem_pageMichael Ellerman1-0/+3
Signed-off-by: Haren Myneni <haren@us.ibm.com> Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-01-09[PATCH] spufs: The SPU file system, baseArnd Bergmann1-0/+2
This is the current version of the spu file system, used for driving SPEs on the Cell Broadband Engine. This release is almost identical to the version for the 2.6.14 kernel posted earlier, which is available as part of the Cell BE Linux distribution from http://www.bsc.es/projects/deepcomputing/linuxoncell/. The first patch provides all the interfaces for running spu application, but does not have any support for debugging SPU tasks or for scheduling. Both these functionalities are added in the subsequent patches. See Documentation/filesystems/spufs.txt on how to use spufs. Signed-off-by: Arnd Bergmann <arndb@de.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-01-06[PATCH] Fix posix-cpu-timers sched_time accumulationDavid S. Miller1-12/+1
I've spent the past 3 days digging into a glibc testsuite failure in current CVS, specifically libc/rt/tst-cputimer1.c The thr1 and thr2 timers fire too early in the second pass of this test. The second pass is noteworthy because it makes use of intervals, whereas the first pass does not. All throughout the posix-cpu-timers.c code, the calculation of the process sched_time sum is implemented roughly as: unsigned long long sum; sum = tsk->signal->sched_time; t = tsk; do { sum += t->sched_time; t = next_thread(t); } while (t != tsk); In fact this is the exact scheme used by check_process_timers(). In the case of check_process_timers(), current->sched_time has just been updated (via scheduler_tick(), which is invoked by update_process_times(), which subsequently invokes run_posix_cpu_timers()) So there is no special processing necessary wrt. that. In other contexts, we have to allot for the fact that tsk->sched_time might be a bit out of date if we are current. And the posix-cpu-timers.c code uses current_sched_time() to deal with that. Unfortunately it does so in an erroneous and inconsistent manner in one spot which is what results in the early timer firing. In cpu_clock_sample_group_locked(), it does this: cpu->sched = p->signal->sched_time; /* Add in each other live thread. */ while ((t = next_thread(t)) != p) { cpu->sched += t->sched_time; } if (p->tgid == current->tgid) { /* * We're sampling ourselves, so include the * cycles not yet banked. We still omit * other threads running on other CPUs, * so the total can always be behind as * much as max(nthreads-1,ncpus) * (NSEC_PER_SEC/HZ). */ cpu->sched += current_sched_time(current); } else { cpu->sched += p->sched_time; } The problem is the "p->tgid == current->tgid" test. If "p" is not current, and the tgids are the same, we will add the process t->sched_time twice into cpu->sched and omit "p"'s sched_time which is very very very wrong. posix-cpu-timers.c has a helper function, sched_ns(p) which takes care of this, so my fix is to use that here instead of this special tgid test. The fact that current can be one of the sub-threads of "p" points out that we could make things a little bit more accurate, perhaps by using sched_ns() on every thread we process in these loops. It also points out that we don't use the most accurate value for threads in the group actively running other cpus (and this is mentioned in the comment). But that is a future enhancement, and this fix here definitely makes sense. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] kernel/module.c: removed dead codeJayachandran C1-2/+1
This patch fixes an issue reported by Coverity in kernel/module.c Error reported: Cannot reach this line of code "else return ptr;" Patch description: This is the error path, so 'err' will be negative, the else case is not required, this patch removes it. Signed-off-by: Jayachandran C. <c.jayachandran@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] s390: cleanup KconfigMartin Schwidefsky2-5/+5
Sanitize some s390 Kconfig options. We have ARCH_S390, ARCH_S390X, ARCH_S390_31, 64BIT, S390_SUPPORT and COMPAT. Replace these 6 options by S390, 64BIT and COMPAT. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] s390: cputime_t fixesMartin Schwidefsky1-7/+9
There are some more places where the use of cputime_t instead of an integer type and the associated macros is necessary for the virtual cputime accounting on s390. Affected are the s390 specific appldata code and BSD process accounting. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>