kernel/git/tglx/history.git - Linux kernel history

Age	Commit message (Collapse)	Author	Files	Lines
2005-01-14	[PATCH] reintroduce task_nice export for binfmt_elf32	Christian Bornträger	1	-0/+9
	S/390 needs this for its binfmt_elf32 module. Signed-off-by: Christian Borntraeger <cborntra@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-14	[PATCH] swsusp/dm: Use right levels for device_suspend()	Pavel Machek	3	-7/+10
	This changes almost no code (the constant is still "3"), but at least it uses the right constants for device_suspend() and fixes types at a few points. It also adds an explanation of the constants to the Documentation. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-14	[PATCH] swsusp: more small fixes	Pavel Machek	3	-12/+12
	This adds a few missing statics to swsusp.c, prints errors even when not debugging, and fixes the last "pmdisk: " message. Also fixes a few comments. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-14	[PATCH] generic irq code missing export of probe_irq_mask()	James Bottomley	1	-0/+1
	Matthew Wilcox just converted parisc over to doing the generic irq code and we ran across the symbol probe_irq_mask being undefined (and thus preventing yenta_socket from loading). It looks like the EXPORT_SYMBOL() was accidentally missed from kernel/irq/autoprobe.c and no-one noticed on x86 because it's still in i386_ksyms.c This patch corrects the problem so that the generic irq code now works completely on parisc. Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-14	[PATCH] cputime: s/390: fix account_steal_time.	Ulrich Weigand	1	-5/+9
	account_steal_time called for idle doesn't work correctly: 1) steal time while idle needs to be added to the system time of idle to get correct uptime numbers 2) if there is an i/o request outstanding the steal time should be added to iowait, even if the hypervisor scheduled another virtual cpu since we are still waiting for i/o. 3) steal time while idle without an i/o request outstanding has to be added to cpustat->idle and not to cpustat->system. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
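The accounting rules in the message above can be sketched in user-space C. This is only an illustration: the `demo_` structures and the function signature are hypothetical, and sending non-idle steal time to a separate `steal` bucket is an assumption based on the "cpu steal time" field that the cputime patch describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-cpu counters; not the kernel's cpustat layout. */
struct demo_cpustat {
    uint64_t steal;
    uint64_t idle;
    uint64_t iowait;
};

/* Hypothetical stand-in for the idle task's system time. */
struct demo_task {
    uint64_t stime;
};

void demo_account_steal(struct demo_cpustat *st, struct demo_task *idle_task,
                        int cpu_is_idle, int io_outstanding, uint64_t delta)
{
    assert(st != NULL && idle_task != NULL);

    if (!cpu_is_idle) {
        /* Assumption: a busy cpu charges stolen time to the steal bucket. */
        st->steal += delta;
        return;
    }

    /* Steal time while idle counts toward the idle task's system time. */
    idle_task->stime += delta;

    if (io_outstanding)
        st->iowait += delta;  /* still waiting for i/o */
    else
        st->idle += delta;    /* truly idle: not cpustat->system */
}
```

Calling `demo_account_steal()` with `cpu_is_idle` set and `io_outstanding` clear grows only `idle` and the idle task's `stime`.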
2005-01-14	[PATCH] Make compat_rt_sigtimedwait conform	Matthew Wilcox	1	-1/+1
	Compat syscalls need to start compat_sys_ otherwise PA-RISC's compat syscall wrappers don't work. Not that the individual involved bothered to patch PA-RISC ... Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-14	[PATCH] Don't busy-lock-loop in preemptable spinlocks	Ingo Molnar	1	-6/+8
	Paul Mackerras points out that doing the _raw_spin_trylock each time through the loop will generate tons of unnecessary bus traffic. Instead, after we fail to get the lock we should poll it with simple loads until we see that it is clear and then retry the atomic op. Assuming a reasonable cache design, the loads won't generate any bus traffic until another cpu writes to the cacheline containing the lock. Agreed. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
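The polling pattern described above is often called test-and-test-and-set. A minimal user-space sketch with C11 atomics follows; `demo_lock_t` and the `demo_` functions are hypothetical names, not the kernel's spinlock code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space lock; not the kernel's spinlock_t. */
typedef struct {
    atomic_int locked;  /* 0 = free, 1 = held */
} demo_lock_t;

void demo_lock_init(demo_lock_t *l)
{
    assert(l != NULL);
    atomic_init(&l->locked, 0);
}

/* The atomic op: it writes to the cacheline on every attempt. */
bool demo_trylock(demo_lock_t *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

void demo_lock(demo_lock_t *l)
{
    while (!demo_trylock(l)) {
        /* After a failed attempt, poll with plain loads. These can be
         * served from the local cache until another cpu writes the line. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            ;
        /* The lock looked free: retry the atomic op. */
    }
}

void demo_unlock(demo_lock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The inner loop is the point of the change: a waiter retries the expensive atomic exchange only after a cheap load has seen the lock clear.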
2005-01-11	[PATCH] Catch module parameter parsing failures	Rusty Russell	1	-0/+3
	Radheka Godse <radheka.godse@intel.com> pointed out that parameter parsing failures allow a module still to be loaded. Trivial fix. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-11	[PATCH] audit return code and log format fix	Peter Martuccelli	2	-2/+2
	A couple of one-liners to resolve two issues that have come up regarding audit. Roger reported a problem with audit.c:audit_receive_skb which improperly negates the errno argument when netlink_ack is called. The second issue was reported by Steve on the linux-audit list, auditsc.c:audit_log_exit using %u instead of %d in the audit_log_format call. Please note, there is a mailing list available for audit discussion at https://www.redhat.com/archives/linux-audit/ Signed-off-by: Peter Martuccelli <peterm@redhat.com> Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Roger Luethi <rl@hellgate.ch> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-11	[PATCH] cputime: microsecond based cputime for s390	Martin Schwidefsky	1	-0/+1
	This patch adds the architecture magic to replace the jiffies based cputime with microsecond based cputime and it adds code to calculate involuntary wait time. With this patch the numbers reported by top and ps when running on LPAR or z/VM are finally not junk anymore. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-11	[PATCH] cputime: introduce cputime	Martin Schwidefsky	9	-150/+238
	This patch introduces the concept of (virtual) cputime. Each architecture can define its method to measure cputime. The main idea is to define a cputime_t type and a set of operations on it (see asm-generic/cputime.h). Then use the type for utime, stime, cutime, cstime, it_virt_value, it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations for each access to these variables. The default implementation is jiffies based and the effect of this patch for architectures which use the default implementation should be negligible. There is a second type cputime64_t which is necessary for the kernel_stat cpu statistics. The default cputime_t is 32 bit and based on HZ, this will overflow after 49.7 days. This is not enough for kernel_stat (imho not enough for a process either), so it is necessary to have a 64 bit type. The third thing that gets introduced by this patch is an additional field for the /proc/stat interface: cpu steal time. An architecture can account cpu steal time by calls to the account_steal_time function. The cpu which backs a virtual processor doesn't spend all of its time for the virtual cpu. To get meaningful cpu usage numbers this involuntary wait time needs to be accounted and exported to user space. From: Hugh Dickins <hugh@veritas.com> The p->signal check in account_system_time is insufficient. If the timer interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set, another cpu may release_task (NULLifying p->signal) in between account_system_time's check and check_rlimit's dereference. Nor should account_it_prof risk send_sig. But surely account_user_time is safe? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
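The 49.7-day figure can be checked with a little arithmetic. The message does not state the tick rate; the sketch below assumes HZ = 1000, which is the value that reproduces the quoted number. `wrap_days_32bit` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Days until a 32-bit tick counter wraps at a given tick rate (ticks/s). */
double wrap_days_32bit(unsigned int hz)
{
    assert(hz > 0);
    const double ticks = 4294967296.0;       /* 2^32 distinct counter values */
    const double seconds_per_day = 86400.0;
    return ticks / (double)hz / seconds_per_day;
}
```

With `hz = 1000` the result is roughly 49.7 days; with `hz = 100` it is roughly 497 days, so the wrap horizon of a tick-based 32-bit counter depends directly on the tick rate.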
2005-01-11	[PATCH] acct_update_integrals speedup	Jay Lan	1	-0/+2
	This patch is to provide extra check in acct_update_integrals() function. The routine would return if 'delta' is 0 to take quick exit if nothing to be done. Signed-off-by: Jay Lan <jlan@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-10	[PATCH] signal.c: convert assertion to BUG_ON()	Pavel Machek	1	-2/+1
	Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-10	[PATCH] swsusp: properly suspend and resume all devices	Barry K. Nathan	1	-0/+2
	During resume, my previous patch switches over to the saved swsusp image without suspending all devices first. This patch fixes that oversight, so that the state of the hardware upon resume more closely matches the state it had at suspend time. While my previous patch alone seemed to work fine in my testing, it is not fully correct without this as well. Signed-off-by: Barry K. Nathan <barryn@pobox.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-10	[PATCH] swsusp: device power management fix	Barry K. Nathan	1	-0/+11
	Since at least kernel 2.6.9, if not earlier, swsusp fails to properly suspend and resume all devices. The most notable effect is that resuming fails to properly reconfigure interrupt routers. In 2.6.9 this was obscured by other kernel code, but in 2.6.10 this often causes post-resume APIC errors and near-total failure of some PCI devices (e.g. network, sound and USB controllers). Even in cases where interrupt routing is unaffected, this bug causes other problems. For instance, on one of my systems I have to run "ifdown eth0;ifup eth0" after resume in order to have functional networking, if I do not apply this patch. By itself, this patch is not theoretically complete; my next patch fixes that. However, this patch is the critical one for fixing swsusp's behavior in the real world. Signed-off-by: Barry K. Nathan <barryn@pobox.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-10	[PATCH] don't let PTRACE_EVENT_EXIT stop hold up SIGKILL	Roland McGrath	1	-5/+6
	When a thread stops for ptrace exit tracing, it cannot be resumed by SIGKILL. Once PF_EXITING is set, SIGKILL will not cause a wakeup from stop (see wants_signal in kernel/signal.c). This patch moves the ptrace stop for exit tracing before the setting of PF_EXITING. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-10	[PATCH] let SIGKILL wake TASK_TRACED	Roland McGrath	2	-19/+16
	Upon reevaluation we think it is indeed safe to permit the race between a ptrace call and the traced thread waking up, as long as it will never get back to user mode. This patch makes SIGKILL wake up threads in TASK_TRACED. That alone resolves most of the deadlock issues that became possible with the introduction of TASK_TRACED, getting us back to the killing behavior of 2.6.8 and before. This patch also further cleans up ptrace detaching, so that threads are left in TASK_STOPPED only if a job control stop is actually in effect, and otherwise resume. This removes the past nuisances requiring a SIGCONT to resume a thread even when it had a pending SIGKILL. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-09	[PATCH] fix __ptrace_unlink TASK_TRACED recovery for real parent	Roland McGrath	2	-23/+26
	The __ptrace_unlink code that checks for TASK_TRACED fixed the problem of a thread being left in TASK_TRACED when no longer being ptraced. However, an oversight in the original fix made it fail to handle the case where the child is ptraced by its real parent. Fixed thus. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-09	Merge kroah.com:/home/greg/linux/BK/bleed-2.6	Greg Kroah-Hartman	3	-198/+107
	into kroah.com:/home/greg/linux/BK/usb-2.6
2005-01-07	[PATCH] sched.c: remove an unused function	Adrian Bunk	1	-6/+0
	The patch below removes an unused function from kernel/sched.c Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched.c: remove an unused macro	Adrian Bunk	1	-6/+0
	Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] Fix kernel/timer.c comment typo	Vasia Pupkin	1	-1/+1
	Signed-off-by: Vasia Pupkin <ptushnik@gmail.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] Lock initializer cleanup (Core)	Thomas Gleixner	17	-22/+22
	Kernel core files converted to use the new lock initializers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] cpu_down() warning fix	Nathan Lynch	1	-1/+2
	Fix (harmless?) smp_processor_id() usage in preemptible section of cpu_down. Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] remove the BKL by turning it into a semaphore	Ingo Molnar	6	-15/+70
	This is the current remove-BKL patch. I test-booted it on x86 and x64, trying every conceivable combination of SMP, PREEMPT and PREEMPT_BKL. All other architectures should compile as well. (most of the testing was done with the zaphod patch undone but it applies cleanly on vanilla -mm3 as well and should work fine.) this is the debugging-enabled variant of the patch which has two main debugging features: - debug potentially illegal smp_processor_id() use. Has caught a number of real bugs - e.g. look at the printk.c fix in the patch. - make it possible to enable/disable the BKL via a .config. If this goes upstream we dont want this of course, but for now it gives people a chance to find out whether any particular problem was caused by this patch. This patch has one important fix over the previous BKL patch: on PREEMPT kernels if we preempted BKL-using code then the code still auto-dropped the BKL by mistake. This caused a number of breakages for testers, which breakages went away once this bug was fixed. Also the debugging mechanism has been improved alot relative to the previous BKL patch. Would be nice to test-drive this in -mm. There will likely be some more smp_processor_id() false positives but they are 1) harmless 2) easy to fix up. We could as well find more real smp_processor_id() related breakages as well. The most noteworthy fact is that no BKL-using code was found yet that relied on smp_processor_id(), which is promising from a compatibility POV. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] vmtrunc: vm_truncate_count race caution	Hugh Dickins	1	-0/+1
	Fix some unlikely races in respect of vm_truncate_count. Firstly, it's supposed to be guarded by i_mmap_lock, but some places copy a vma structure by new_vma = old_vma: if the compiler implements that with a bytewise copy, new_vma->vm_truncate_count could be munged, and new_vma later appear up-to-date when it's not; so set it properly once under lock. vma_link set vm_truncate_count to mapping->truncate_count when adding an empty vma: if new vmas are being added profusely while vmtruncate is in progress, this lets them be skipped without scanning. vma_adjust has vm_truncate_count problem much like it had with anon_vma under mprotect merge: when merging be careful not to leave vma marked as up-to-date when it might not be, lest unmap_mapping_range in progress - set vm_truncate_count 0 when in doubt. Similarly when mremap moving ptes from one vma to another. Cut a little code from __anon_vma_merge: now vma_adjust sets "importer" in the remove_next case (to get its vm_truncate_count right), its anon_vma is already linked by the time __anon_vma_merge is called. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] debug sched domains before attach	Nick Piggin	1	-95/+91
	Change the sched-domain debug routine to be called on a per-CPU basis, and executed before the domain is actually attached to the CPU. Previously, all CPUs would have their new domains attached, and then the debug routine would loop over all of them. This has two advantages: First, there is no longer any theoretical races: we are running the debug routine on a domain that isn't yet active, and should have no racing access from another CPU. Second, if there is a problem with a domain, the validator will have a better chance to catch the error and print a diagnostic _before_ the domain is attached, which may take down the system. Also, change reporting of detected error conditions to KERN_ERR instead of KERN_DEBUG, so they have a better chance of being seen in a hang on boot situation. The patch also does an unrelated (and harmless) cleanup in migration_thread(). Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: fix scheduling latencies for !PREEMPT kernels	Ingo Molnar	1	-0/+3
	This patch adds a handful of cond_resched() points to a number of key, scheduling-latency related non-inlined functions. This reduces preemption latency for !PREEMPT kernels. These are scheduling points complementary to PREEMPT_VOLUNTARY scheduling points (might_sleep() places) - i.e. these are all points where an explicit cond_resched() had to be added. Has been tested as part of the -VP patchset. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] fix keventd execution dependency	Ingo Molnar	1	-4/+21
	We dont want to execute off keventd since it might hold a semaphore our callers hold too. This can happen when kthread_create() is called from within keventd. This happened due to the IRQ threading patches but it could happen with other code too. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: add cond_resched_softirq()	Ingo Molnar	1	-0/+16
	It adds cond_resched_softirq() which can be used by _process context_ softirqs-disabled codepaths to preempt if necessary. The function will enable softirqs before scheduling. (Later patches will use this primitive.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] preempt cleanup	Ingo Molnar	1	-6/+17
	This is another generic fallout from the voluntary-preempt patchset: a cleanup of the cond_resched() infrastructure, in preparation of the latency reduction patches. The changes: - uninline cond_resched() - this makes the footprint smaller, especially once the number of cond_resched() points increase. - add a 'was rescheduled' return value to cond_resched. This makes it symmetric to cond_resched_lock() and later latency reduction patches rely on the ability to tell whether there was any preemption. - make cond_resched() more robust by using the same mechanism as preempt_kernel(): by using PREEMPT_ACTIVE. This preserves the task's state - e.g. if the task is in TASK_ZOMBIE but gets preempted via cond_resched() just prior scheduling off then this approach preserves TASK_ZOMBIE. - the patch also adds need_lockbreak() which critical sections can use to detect lock-break requests. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] improve preemption on SMP	Ingo Molnar	2	-88/+178
	SMP locking latencies are one of the last architectural problems that cause millisec-category scheduling delays. CONFIG_PREEMPT tries to solve some of the SMP issues but there are still lots of problems remaining: spinlocks nested at multiple levels, spinning with irqs turned off, and non-nested spinning with preemption turned off permanently. The nesting problem goes like this: if a piece of kernel code (e.g. the MM or ext3's journalling code) does the following: spin_lock(&spinlock_1); ... spin_lock(&spinlock_2); ... then even with CONFIG_PREEMPT enabled, current kernels may spin on spinlock_2 indefinitely. A number of critical sections break their long paths by using cond_resched_lock(), but this does not break the path on SMP, because need_resched() of the other CPU is not set so cond_resched_lock() doesnt notice that a reschedule is due. to solve this problem i've introduced a new spinlock field, lock->break_lock, which signals towards the holding CPU that a spinlock-break is requested by another CPU. This field is only set if a CPU is spinning in a spinlock function [at any locking depth], so the default overhead is zero. I've extended cond_resched_lock() to check for this flag - in this case we can also save a reschedule. I've added the lock_need_resched(lock) and need_lockbreak(lock) methods to check for the need to break out of a critical section. Another latency problem was that the stock kernel, even with CONFIG_PREEMPT enabled, didnt have any spin-nicely preemption logic for the following, commonly used SMP locking primitives: read_lock(), spin_lock_irqsave(), spin_lock_irq(), spin_lock_bh(), read_lock_irqsave(), read_lock_irq(), read_lock_bh(), write_lock_irqsave(), write_lock_irq(), write_lock_bh(). Only spin_lock() and write_lock() [the two simplest cases] where covered. 
In addition to the preemption latency problems, the _irq() variants in the above list didnt do any IRQ-enabling while spinning - possibly resulting in excessive irqs-off sections of code! preempt-smp.patch fixes all these latency problems by spinning irq-nicely (if possible) and by requesting lock-breaks if needed. Two architecture-level changes were necessary for this: the addition of the break_lock field to spinlock_t and rwlock_t, and the addition of the _raw_read_trylock() function. Testing done by Mark H Johnson and myself indicate SMP latencies comparable to the UP kernel - while they were basically indefinitely high without this patch. i successfully test-compiled and test-booted this patch ontop of BK-curr using the following .config combinations: SMP && PREEMPT, !SMP && PREEMPT, SMP && !PREEMPT and !SMP && !PREEMPT on x86, !SMP && !PREEMPT and SMP && PREEMPT on x64. I also test-booted x86 with the generic_read_trylock function to check that it works fine. Essentially the same patch has been in testing as part of the voluntary-preempt patches for some time already. NOTE to architecture maintainers: generic_raw_read_trylock() is a crude version that should be replaced with the proper arch-optimized version ASAP. From: Hugh Dickins <hugh@veritas.com> The i386 and x86_64 _raw_read_trylocks in preempt-smp.patch are too successful: atomic_read() returns a signed integer. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
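The break_lock idea from this message can be sketched in user-space C: a waiter that fails to take the lock raises a flag, and the holder polls that flag to decide whether to leave its critical section early. All names below (`bl_lock_t`, the `bl_` functions) are hypothetical; this is not the kernel's spinlock_t or its need_lockbreak() implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock with a break-request flag. */
typedef struct {
    atomic_int locked;      /* 0 = free, 1 = held */
    atomic_int break_lock;  /* set by a waiter that is spinning */
} bl_lock_t;

void bl_init(bl_lock_t *l)
{
    assert(l != NULL);
    atomic_init(&l->locked, 0);
    atomic_init(&l->break_lock, 0);
}

/* One acquisition attempt. On failure, ask the holder to break the lock. */
bool bl_lock_attempt(bl_lock_t *l)
{
    if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0) {
        /* We own the lock now, so any earlier request is satisfied. */
        atomic_store_explicit(&l->break_lock, 0, memory_order_relaxed);
        return true;
    }
    atomic_store_explicit(&l->break_lock, 1, memory_order_relaxed);
    return false;
}

void bl_lock(bl_lock_t *l)
{
    while (!bl_lock_attempt(l)) {
        /* Poll with plain loads until the lock looks free. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            ;
    }
}

void bl_unlock(bl_lock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* The holder checks this inside long critical sections. */
bool bl_need_lockbreak(bl_lock_t *l)
{
    return atomic_load_explicit(&l->break_lock, memory_order_relaxed) != 0;
}
```

The flag is written only on the contended path, which matches the message's point that the uncontended overhead is zero.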
2005-01-07	[PATCH] introduce idle_task_exit	Nathan Lynch	1	-0/+14
	Heiko Carstens figured out that offlining a cpu can leak mm_structs because the dying cpu's idle task fails to switch to init_mm and mmdrop its active_mm before the cpu is down. This patch introduces idle_task_exit, which allows the idle task to do this as Ingo suggested. I will follow this up with a patch for ppc64 which calls idle_task_exit from cpu_die. Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: remove outdated/misleading comments	Josh Aas	1	-9/+1
	This patch removes two outdated/misleading comments from the CPU scheduler. 1) The first comment removed is simply incorrect. The function it comments on is not used for what the comments says it is anymore. 2) The second comment is a leftover from when the "if" block it comments on contained a goto. It does not any more, and the comment doesn't make sense. There isn't really a reason to add different comments, though someone might feel differently in the case of the second one. I'll leave adding a comment to anybody who wants to - more important to just get rid of them now. Signed-off-by: Josh Aas <josha@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] export sched_setscheduler() for kernel module use	Dean Nelson	1	-46/+45
	This patch exports sched_setscheduler() so that it can be used by a kernel module to set a kthread's scheduling policy and associated parameters. Signed-off-by: Dean Nelson <dcn@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: no need to recalculate rq	Robert Love	1	-2/+2
	no need to call task_rq in setscheduler; just use rq Signed-Off-By: Robert Love <rml@novell.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: use cached current value	Oleg Nesterov	1	-3/+3
	schedule() can use prev instead of get_current(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: remove_interactive_credit	Con Kolivas	1	-37/+9
	Special casing tasks by interactive credit was helpful for preventing fully cpu bound tasks from easily rising to interactive status. However it did not select out tasks that had periods of being fully cpu bound and then sleeping while waiting on pipes, signals etc. This led to a more disproportionate share of cpu time. Backing this out will no longer special case only fully cpu bound tasks, and prevents the variable behaviour that occurs at startup before tasks declare themselves interactive or not, and speeds up application startup slightly under certain circumstances. It does cost in interactivity slightly as load rises but it is worth it for the fairness gains. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: requeue_granularity	Con Kolivas	1	-3/+1
	Change the granularity code to requeue tasks at their best priority instead of changing priority while they're running. This keeps tasks at their top interactive level during their whole timeslice. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: add_requeue_task	Con Kolivas	1	-4/+18
	We can requeue tasks more cheaply than by doing a complete dequeue followed by an enqueue. Add the requeue_task function and perform it where possible. This will be hit frequently by upcoming changes to the requeueing in timeslice granularity. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
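One way to picture a cheap requeue is moving a node to the tail of the circular list it already sits on. The sketch below uses a hypothetical `demo_node` list, not the kernel's list_head or run-queue structures; any extra bookkeeping a full dequeue and enqueue would perform is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical circular doubly linked list with a sentinel head. */
struct demo_node {
    struct demo_node *prev, *next;
    int id;
};

void demo_list_init(struct demo_node *head)
{
    head->prev = head->next = head;
}

void demo_list_add_tail(struct demo_node *n, struct demo_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

void demo_list_del(struct demo_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Requeue: unlink and relink at the tail of the same list. */
void demo_requeue(struct demo_node *n, struct demo_node *head)
{
    assert(n != NULL && head != NULL && n != head);
    demo_list_del(n);
    demo_list_add_tail(n, head);
}
```

Requeueing the first of three nodes leaves the other two in order, with the moved node last.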
2005-01-07	[PATCH] sched: adjust_timeslice_granularity	Con Kolivas	1	-2/+4
	The minimum timeslice was decreased from 10ms to 5ms. In the process, the timeslice granularity was leading to much more rapid round robinning of interactive tasks at cache-thrashing levels. Restore minimum granularity to 10ms. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: alter_kthread_prio	Con Kolivas	1	-1/+1
	Timeslice proportion has been increased substantially for -niced tasks. As a result of this kernel threads have much larger timeslices than they previously had. Change kernel threads' nice value to -5 to bring their timeslice back in line with previous behaviour. This means kernel threads will be less likely to cause large latencies under periods of system stress for normal nice 0 tasks. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched.c whitespace mangler	Con Kolivas	1	-2/+2
	Convert whitespace in sched.c to tabs Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: active_load_balance() fixlet	Matthew Dobson	1	-33/+32
	There is a small problem with the active_load_balance() patch that Darren sent out last week. As soon as we discover a potential 'target_cpu' from 'cpu_group' to try to push tasks to, we cease considering other CPUs in that group as potential 'target_cpu's. We break out of the for_each_cpu_mask() loop and try to push tasks to that CPU. The problem is that there may well be other idle cpus in that group that we should also try to push tasks to. Here is a patch to fix that small problem. The solution is to simply move the code that tries to push the tasks into the for_each_cpu_mask() loop and do away with the whole 'target_cpu' thing entirely. Compiled & booted on a 16-way x440. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: can_migrate exception for idle cpus	Andrew Theurer	1	-23/+27
	Fix can_migrate to allow aggressive steal for idle cpus. This -was- in mainline, but I believe sched_domains kind of blasted it outta there. IMO, it's a no brainer for an idle cpu (with all that cache going to waste) to be granted to steal a task. The one enhancement I have made was to make sure the whole cpu was idle. Signed-off-by: <habanero@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: more agressive wake_idle()	Andrew Theurer	1	-15/+15
	This patch addresses some problems with wake_idle(). Currently wake_idle() will wake a task on an alternate cpu if: 1) task->cpu is not idle 2) an idle cpu can be found However the span of cpus to look for is very limited (only the task->cpu's sibling). The scheduler should find the closest idle cpu, starting with the lowest level domain, then going to higher level domains if allowed (domain has flag SD_WAKE_IDLE). This patch does this. This and the other two patches (also to be submitted) combined have provided as much as a 5% improvement on that "online transaction DB workload" and 2% on the industry standard J2EE workload. I asked Martin Bligh to test these for regression, and he did not find any. I would like to submit for inclusion to -mm and barring any problems eventually to mainline. Signed-off-by: <habanero@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-06	merge	Greg Kroah-Hartman	3	-198/+107

2005-01-06	[PATCH] x86-64: kernel/sys.c build fix	Jeff Garzik	1	-0/+1
	On x86-64, the attached patch is required to fix > kernel/sys.c: In function `sys_setsid': > kernel/sys.c:1078: error: `tty_sem' undeclared (first use in this function) > kernel/sys.c:1078: error: (Each undeclared identifier is reported only once > kernel/sys.c:1078: error: for each function it appears in.) kernel/sys.c needs the tty_sem declaration from linux/tty.h.
2005-01-06	[PATCH] First cut at setsid/tty locking	Alan Cox	2	-0/+4
	Use the existing "tty_sem" to protect against the process tty changes too.
2005-01-04	[PATCH] Make page allocator aware of requests for zeroed memory	Christoph Lameter	1	-8/+4
	This introduces __GFP_ZERO as an additional gfp_mask element to allow requesting zeroed pages from the page allocator: - Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set - Replaces all page zeroing done after allocating pages with allocations using __GFP_ZERO Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
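The design can be illustrated with a user-space analogue: a flag bit tells the allocator to zero the memory, so callers no longer zero it themselves after allocating. `DEMO_GFP_ZERO` and `demo_alloc` are hypothetical names built on `malloc`; they are not the kernel's __GFP_ZERO or its page allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag bit standing in for a gfp_mask element. */
#define DEMO_GFP_ZERO 0x1u

/* Allocate size bytes; zero them only when the caller asks for it. */
void *demo_alloc(size_t size, unsigned int flags)
{
    assert(size > 0);

    void *p = malloc(size);
    if (p != NULL && (flags & DEMO_GFP_ZERO))
        memset(p, 0, size);
    return p;
}
```

A caller that previously allocated and then cleared the buffer would pass `DEMO_GFP_ZERO` instead, which keeps the zeroing decision in one place.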
2005-01-05	Merge shinybook.infradead.org:/home/dwmw2/bk/linus-2.6	David Woodhouse	1	-1/+1
	into shinybook.infradead.org:/home/dwmw2/bk/mtd-2.6
2005-01-04	Merge shinybook.infradead.org:/home/dwmw2/bk/linus-2.6	David Woodhouse	1	-1/+1
	into shinybook.infradead.org:/home/dwmw2/bk/mtd-2.6
2005-01-04	[PATCH] uninline/kill __exit_mm()	Oleg Nesterov	1	-7/+2
	__exit_mm() is an inlined version of exit_mm(). This patch unifies them. Saves 356 byte in exit.o. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] task_struct.exit_state usage	Roland McGrath	4	-9/+9
	I just did a quick audit of the use of exit_state and the EXIT_* bit macros. I guess I didn't really review these changes very closely when you did them originally. :-( I found several places that seem like lossy cases of query-replace without enough thought about the code. Linus has previously said the >= tests ought to be & tests instead. But for exit_state, it can only ever be 0, EXIT_DEAD, or EXIT_ZOMBIE--so a nonzero test is actually the same as testing & (EXIT_DEAD|EXIT_ZOMBIE), and maybe its code is a tiny bit better. The case like in choose_new_parent is just confusing, to have the always-false test for EXIT_* bits in ->state there too. The two cases in wants_signal and do_process_times are actual regressions that will give us back old bugs in race conditions. These places had s/TASK/EXIT/ but not s/state/exit_state/, and now their tests for exiting tasks are wrong and never catch them. I take it back: there is no regression in wants_signal in practice I think, because of the PF_EXITING test that makes the EXIT_* state checks superfluous anyway. So that is just another cosmetic case of confusing code. But in do_process_times, there is that SIGXCPU-while-exiting race condition back again. Signed-off-by: Roland McGrath <roland@redhat.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
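The equivalence claimed in this message (a nonzero test matches the mask test when exit_state is only ever 0, EXIT_DEAD or EXIT_ZOMBIE) can be checked with a small sketch. The bit values below are illustrative assumptions, and both helper functions are hypothetical.

```c
#include <stdbool.h>

/* Illustrative bit values; treat them as assumptions. */
#define EXIT_ZOMBIE 16
#define EXIT_DEAD   32

/* Test written as a mask check. */
bool exiting_by_mask(int exit_state)
{
    return (exit_state & (EXIT_DEAD | EXIT_ZOMBIE)) != 0;
}

/* Test written as a plain nonzero check. */
bool exiting_by_nonzero(int exit_state)
{
    return exit_state != 0;
}
```

The two functions agree on every value exit_state can take under that assumption; they would differ only if some other bit were ever stored in the field.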
2005-01-04	[PATCH] move waitchld_exit from task_struct to signal_struct	Roland McGrath	3	-22/+6
	There is really no point in each task_struct having its own waitchld_exit. In the only use of it, the waitchld_exit of each thread in a group gets woken up at the same time. So, there might as well just be one wait queue for the whole thread group. This patch does that by moving the field from task_struct to signal_struct. It should have no effect on the behavior, but saves a little work and a little storage in the multithreaded case. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] fix ptracer death race yielding bogus BUG_ON	Roland McGrath	1	-8/+22
	There is a BUG_ON in ptrace_stop that hits if the thread is not ptraced. However, there is no synchronization between a thread deciding to do a ptrace stop and so going here, and its ptracer dying and so detaching from it and clearing its ->ptrace field. The RHEL3 2.4-based kernel has a backport of a slightly older version of the 2.6 signals code, which has a different but equivalent BUG_ON. This actually bit users in practice (when the debugger dies), but was exceedingly difficult to reproduce in contrived circumstances. We moved forward in RHEL3 just by removing the BUG_ON, and that fixed the real user problems even though I was never able to reproduce the scenario myself. So, to my knowledge this scenario has never actually been seen in practice under 2.6. But it's plain to see from the code that it is indeed possible. This patch removes that BUG_ON, but also goes further and tries to handle this case more gracefully than simply avoiding the crash. By removing the BUG_ON alone, it becomes possible for the real parent of a process to see spurious SIGCHLD notifications intended for the debugger that has just died, and have its child wind up stopped unexpectedly. This patch avoids that possibility by detecting the case when we are about to do the ptrace stop but our ptracer has gone away, and simply eliding that ptrace stop altogether as if we hadn't been ptraced when we hit the interesting event (signal or ptrace_notify call for syscall tracing or something like that). Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] move group_exit flag into signal_struct.flags word	Roland McGrath	3	-12/+16
	After my last change, there are plenty of unused bits available in the new flags word in signal_struct. This patch moves the `group_exit' flag into one of those bits, saving a word in signal_struct. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] fix stop signal race	Roland McGrath	3	-52/+72
	The `sig_avoid_stop_race' checks fail to catch a related race scenario that can happen. I don't think this has been seen in nature, but it could happen in the same sorts of situations where the observed problems come up that those checks work around. This patch takes a different approach to catching this race condition. The new approach plugs the hole, and I think is also cleaner. The issue is a race between one CPU processing a stop signal while another CPU processes a SIGCONT or SIGKILL. There is a window in stop-signal processing where the siglock must be released. If a SIGCONT or SIGKILL comes along here on another CPU, then the stop signal in the midst of being processed needs to be discarded rather than having the stop take place after the SIGCONT or SIGKILL has been generated. The existing workaround checks for this case explicitly by looking for a pending SIGCONT or SIGKILL after reacquiring the lock. However, there is another problem related to the same race issue. In the window where the processing of the stop signal has released the siglock, the stop signal is not represented in the pending set any more, but it is still "pending" and not "delivered" in POSIX terms. The SIGCONT coming in this window is required to clear all pending stop signals. But, if a stop signal has been dequeued but not yet processed, the SIGCONT generation will fail to clear it (in handle_stop_signal). Likewise, a SIGKILL coming here should prevent the stop processing and make the thread die immediately instead. The `sig_avoid_stop_race' code checks for this by examining the pending set to see if SIGCONT or SIGKILL is in it. But this fails to handle the case where another CPU running another thread in the same process has already dequeued the signal (so it no longer can be found in the pending set). We must catch this as well, so that the same problems do not arise when another thread on another CPU acted real fast. 
	I've fixed this by dumping the `sig_avoid_stop_race' kludge in favor of a little explicit bookkeeping. Now, dequeuing any stop signal sets a flag saying that a pending stop signal has been taken on by some CPU since the last time all pending stop signals were cleared due to SIGCONT/SIGKILL. The processing of stop signals checks the flag after the window where it released the lock, and abandons the signal if the flag has been cleared. The code that clears pending stop signals on SIGCONT generation also clears this flag. The various places that are trying to ensure the process dies quickly (SIGKILL or other unhandled signals) also clear the flag. I've made this a general flags word in signal_struct, and replaced the stop_state field with flag bits in this word. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] remove redundant sys_delete_module()	Coywolf Qi Hunt	1	-7/+0
	Peter Chubb recently split out a standalone sys_ni.c file for the not implemented syscalls. This patch removes the redundant sys_delete_module() in module.c. Signed-off-by: Coywolf Qi Hunt <coywolf@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] compat: sigtimedwait	Zou Nanhai	1	-0/+86
	- Merge the sys32_rt_sigtimedwait functions in the X86_64, IA64, PPC64, MIPS, SPARC64 and S390 32 bit layers into one compat_rt_sigtimedwait function. This also fixes a bug that copied wrong information to the 32 bit userspace siginfo structure on X86_64, IA64 and SPARC64 when calling sigtimedwait on the 32 bit layer. - Change all names of the siginfo_t32 structure in X86_64, IA64, MIPS, SPARC64 and S390 to the name compat_siginfo_t as used in PPC64. - The patch introduces a macro __COMPAT_ENDIAN_SWAP__ in include/asm-mips/compat.h when the MIPS kernel is compiled in little-endian mode. This macro is used to do byte swapping in function sigset_from_compat. - This patch is only tested on X86_64 and IA_64. Signed-off-by: Zou Nan hai <Nanhai.zou@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] cpumask: range check before using value	Randy Dunlap	1	-1/+3
	When setting the 'cpu_isolated_map' mask, check that the user input value is valid (in range 0 .. NR_CPUS - 1). Also fix up kernel-parameters.txt for this parameter. Signed-off-by: Randy Dunlap <rddunlap@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
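A range check of this kind can be sketched in user space as below. `parse_cpu()` is a hypothetical helper and the `NR_CPUS` value is an arbitrary assumption for the example; the kernel's actual option parsing differs.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <errno.h>

#define NR_CPUS 8   /* hypothetical build-time limit for the sketch */

/*
 * Parse a decimal CPU number and accept it only if it lies in
 * 0 .. NR_CPUS - 1. On success store it in *cpu and return true.
 */
bool parse_cpu(const char *s, int *cpu)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno || end == s || *end != '\0')
        return false;           /* not a clean decimal number */
    if (v < 0 || v >= NR_CPUS)
        return false;           /* out of range: never touch the mask */
    *cpu = (int)v;
    return true;
}
```

The value is validated before it is used as a bit index, which is the point of the fix described above.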
2005-01-04	[PATCH] Add PR_GET_NAME	Prasanna Meda	1	-0/+9
	A while back we added the PR_SET_NAME prctl, but no PR_GET_NAME. I guess we should add this, if only to enable testing of PR_SET_NAME. Signed-off-by: Prasanna Meda <pmeda@akamai.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
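From user space, the pair of prctl operations can be exercised roughly as follows. This is a minimal Linux-only sketch; `set_and_get_name()` is a hypothetical wrapper, and the 16-byte buffer size follows the documented requirement for PR_GET_NAME.

```c
#include <string.h>
#include <sys/prctl.h>

/*
 * Set the calling thread's name, then read it back into buf, which must
 * hold at least 16 bytes. Returns 0 on success, -1 on failure.
 */
int set_and_get_name(const char *name, char buf[16])
{
    if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0) != 0)
        return -1;
    memset(buf, 0, 16);
    if (prctl(PR_GET_NAME, (unsigned long)buf, 0, 0, 0) != 0)
        return -1;
    return 0;
}
```

Reading the name back is exactly the kind of test of PR_SET_NAME that the message has in mind.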
2005-01-04	[PATCH] panic_timeout: move to kernel.h	Randy Dunlap	1	-1/+0
	Move 'panic_timeout' to linux/kernel.h. ipmi_watchdog.c wanted to know why panic_timeout isn't in some header file. However, ipmi_watchdog.c doesn't even use it, so that reference was deleted. Other references now use kernel.h instead of straight extern int. Signed-off-by: Randy Dunlap <rddunlap@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] rcu: simplify quiescent state detection	Manfred Spraul	1	-6/+5
	Based on an initial patch from Oleg Nesterov <oleg@tv-sign.ru> rcu_data.last_qsctr is not needed. Actually, not even a counter is needed, just a flag that indicates that there was a quiescent state. Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] rcu: make two internal structs static	Manfred Spraul	1	-2/+2
	The patch below makes two needlessly global structs static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] rcu: eliminate rcu_ctrlblk.lock	Oleg Nesterov	1	-14/+16
	rcu_ctrlblk.lock is used to read the ->cur and ->next_pending atomically in __rcu_process_callbacks(). It can be replaced by a couple of memory barriers. rcu_start_batch: rcp->next_pending = 0; smp_wmb(); rcp->cur++; __rcu_process_callbacks: rdp->batch = rcp->cur + 1; smp_rmb(); if (!rcp->next_pending) rcu_start_batch(rcp, rsp, 1); This way, if __rcu_process_callbacks() sees incremented ->cur value, it must also see that ->next_pending == 0 (or rcu_start_batch() is already in progress on another cpu). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
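The ordering argument in this message can be modelled in user space with C11 atomics. In this sketch a release fence plays the role of smp_wmb() and an acquire fence the role of smp_rmb(); `struct ctrl`, `start_batch()` and `pick_batch()` are hypothetical names, and the model covers only the two fields discussed above, not the full RCU state machine.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two rcu_ctrlblk fields discussed above. */
struct ctrl {
    atomic_long cur;
    atomic_int next_pending;
};

/* Writer side: clear next_pending, then publish the new batch number. */
void start_batch(struct ctrl *c)
{
    atomic_store_explicit(&c->next_pending, 0, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* role of smp_wmb() */
    atomic_fetch_add_explicit(&c->cur, 1, memory_order_relaxed);
}

/*
 * Reader side: sample cur first, then next_pending. Returns the batch
 * number to wait for and reports whether a new batch must be started.
 */
long pick_batch(struct ctrl *c, bool *need_start)
{
    long batch = atomic_load_explicit(&c->cur, memory_order_relaxed) + 1;

    atomic_thread_fence(memory_order_acquire);   /* role of smp_rmb() */
    *need_start = !atomic_load_explicit(&c->next_pending,
                                        memory_order_relaxed);
    return batch;
}
```

If the reader observes the incremented `cur`, the paired fences guarantee it also observes `next_pending == 0`, which is the property the lock used to provide.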
2005-01-04	[PATCH] Sync in core time granularity with filesystems	Andi Kleen	1	-1/+46
	This patch corrects a problem that was originally added with the nanosecond timestamps in stat patch. The problem is that some file systems don't have enough space in their on disk inode to save nanosecond timestamps, so they truncate the c/a/mtime to seconds when flushing a dirty inode. In core the inode would have full jiffies granularity. This can be observed by programs as a timestamp that jumps backwards under specific loads when an inode is flushed and then reloaded from disk. The problem was already known when the original patch went in, but it wasn't deemed important enough at that time. So far there has been only one report of it causing problems. Now Tridge is worried that it will break running Excel over samba4 because Excel seems to do very anal timestamp checking and samba4 will supply 100ns timestamps over the network. This patch solves it by putting the time resolution into the superblock of a fs and always rounding the in core timestamps to that granularity. This also supersedes some previous ext2/3 hacks to flush the inode less often when only the subsecond timestamp changes. I tried to keep the overhead low, in particular it tries to keep divisions out of fast paths as far as possible. The patch is quite big but 99% of it is just relatively straightforward search'n'replace in a lot of fs. Unconverted filesystems will default to a 1ns granularity, but may still show the problem if they continue to use CURRENT_TIME. I converted all in tree fs. One possible future extension of this would be to have two time granularities per superblock - one that specifies the visible resolution, and the other to specify how often timestamps should be flushed to disk, which could be tuned with a mount option per fs (e.g. often m/atimes don't need to be flushed every second). Would be easy to do as an addon if someone is interested. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
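Rounding an in-core timestamp to a filesystem's resolution can be sketched as below. `struct sketch_ts` and `trunc_to_gran()` are hypothetical names used for illustration; the sketch assumes the granularity is given in nanoseconds and lies between 1 and one second.

```c
#define NSEC_PER_SEC 1000000000L

/* Hypothetical timestamp type: seconds plus nanoseconds. */
struct sketch_ts {
    long sec;
    long nsec;
};

/*
 * Round t down to a multiple of gran nanoseconds
 * (assumes 1 <= gran <= NSEC_PER_SEC).
 */
struct sketch_ts trunc_to_gran(struct sketch_ts t, long gran)
{
    if (gran >= NSEC_PER_SEC)
        t.nsec = 0;                 /* whole-second filesystems, no division */
    else if (gran > 1)
        t.nsec -= t.nsec % gran;    /* e.g. 100ns resolution */
    return t;                       /* gran == 1: nothing to do */
}
```

Handling the one-second and 1ns cases separately keeps the modulo out of the most common paths, in the spirit of the remark about avoiding divisions.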
2005-01-04	[PATCH] sys_stime needs a compat function	Martin Schwidefsky	2	-4/+42
	I realized that the best way to get the sys_time/sys_stime problem fixed is to make sys_time 64 bit safe by using "time_t *" instead of "int *" and to introduce two proper compat functions compat_sys_time and compat_sys_stime. The prototype change of sys_time is transparent for 32 bit architectures because both "int" and "time_t" are 32 bit. For 64 bit the type change would be wrong but luckily no 64 bit architecture uses sys_time/sys_stime in 64 bit mode. The patch makes the following change: ia64 : Remove sys32_time, use compat_sys_time and add (!!) compat_sys_stime to compat syscall table. mips : Use compat_sys_time/compat_sys_stime in 32 bit syscall table. Add #ifdef magic to compile sys_time/sys_stime and compat_sys_time/compat_sys_stime only if needed. parisc : Remove sys32_time, use compat_sys_time and compat_sys_stime. ppc64 : remove sys32_time, ppc64_sys32_stime and ppc64_sys_stime. Use common compat_sys_time, compat_sys_stime and sys_stime. s390 : Use compat_sys_stime. Add #ifdef magic to compile sys_time/sys_stime and compat_sys_time/compat_sys_stime only if needed. sparc64 : Use compat_sys_time/compat_sys_stime in 32 bit syscall table. um : Remove um_time and um_stime. Use common functions sys_time and sys_stime. This adds a CAP_SYS_TIME check to UMs stime call. x86_64 : Remove sys32_time. Use compat_sys_time and compat_sys_stime in 32 bit syscall table. The original stime bug is fixed for mips, parisc, s390, sparc64 and x86_64. Can the arch-maintainers please take a look at this? From: Martin Schwidefsky <schwidefsky@de.ibm.com> Convert compat_time_t to time_t in 32 bit emulation for sys_stime and consolidate all the different implementations of sys_time, sys_stime and their 32-bit emulation parts. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] suppress might_sleep() if oopsing	Andrew Morton	1	-1/+1
	We can call might_sleep() functions on the oops handling path (under do_exit). There seems little point in emitting spurious might_sleep() warnings into the logs after the kernel has oopsed. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] fork: total_forks not counted under tasklist_lock	Prasanna Meda	1	-6/+6
	Bring total_forks under tasklist_lock. When most of the fork code, including nr_threads, was moved to copy_process() from do_fork() in 2.6, this was left out. Although the accuracy of total_forks is not important, it would be nice to add this. It does not involve additional cost, and the code will be cleaner if it is grouped with nr_threads. The difference is that total_forks will increase on fork, but nr_threads will increase on fork and decrease on exit. I also moved the extern declaration to sched.h from proc_misc.c. Signed-off-by: Prasanna Meda <pmeda@akamai.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] move irq_enter and irq_exit to common code	Christoph Hellwig	2	-11/+17
	This code is the same for all architectures with the following invariants: - arm guarantees irqs are disabled when calling irq_exit so it can call __do_softirq directly instead of do_softirq - arm26 is totally broken for about half a year, I didn't care for it - some architectures use softirq_pending(smp_processor_id()) instead of local_softirq_pending, but they always evaluate to the same. This patch moves the out of line irq_exit implementation from kernel/irq/handle.c which depends on CONFIG_GENERIC_HARDIRQS to kernel/softirq.c which is always compiled, tweaks it for the arm special case and moves the irq_enter/irq_exit/nmi_enter/nmi_exit bits from asm-*/hardirq.h to linux/hardirq.h Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] handle quoted module parameters	Randy Dunlap	1	-3/+12
	Fix module parameter quote handling. Module parameter strings (with spaces) are quoted like so: "modprm=this test" and not like this: modprm="this test" Signed-off-by: Randy Dunlap <rddunlap@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] enhanced Memory accounting data collection	Jay Lan	3	-0/+39
	This patch is to offer common accounting data collection method at memory usage for various accounting packages including BSD accounting, ELSA, CSA and any other acct packages that use a common layer of data collection. New struct fields are added to mm_struct to save high watermarks of rss usage as well as virtual memory usage. New struct fields are added to task_struct to collect accumulated rss usage and vm usages. These data are collected on per process basis. Signed-off-by: Jay Lan <jlan@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] enhanced I/O accounting data patch	Jay Lan	1	-3/+12
	This patch is to offer common accounting data collection method at I/O for various accounting packages including BSD accounting, ELSA, CSA and any other acct packages that use a common layer of data collection. Patch is made to fs/read_write.c to collect per process data on character read/written in bytes and number of read/write syscalls made. New struct fields are added to task_struct to store the data. These data are collected on per process basis. Signed-off-by: Jay Lan <jlan@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] VM routine fixes	David Howells	2	-0/+5
	The attached patch fixes a number of problems in the VM routines: (1) Some inline funcs don't compile if CONFIG_MMU is not set. (2) swapper_pml4 needn't exist if CONFIG_MMU is not set. (3) __free_pages_ok() doesn't counter set_page_refs() different behaviour if CONFIG_MMU is not set. (4) swsusp.c invokes TLB flushing functions without including the header file that declares them. CONFIG_SHMEM semantics: - If MMU: Always enabled if !EMBEDDED - If MMU && EMBEDDED: configurable - If !MMU: disabled Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] GP-REL data support	David Howells	2	-5/+4
	The attached patch makes it possible to support gp-rel addressing for small variables. Since the FR-V cpu's have fixed-length instructions and plenty of general-purpose registers, one register is nominated as a base for the small data area. This makes it possible to use single-insn accesses to access global and static variables instead of having to use multiple instructions. This, however, causes problems with small variables used to pinpoint the beginning and end of sections. The compiler assumes it can use gp-rel addressing for these, but the linker then complains because the displacement is out of range. By declaring certain variables as arrays or by forcing them into named sections, the compiler is persuaded to access them as if they can be outside the displacement range. Declaring the variables as "const void" type also works. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] capset returns -EPERM when pid==current->pid	Serge Hallyn	1	-1/+1
	In the current kernel/capability.c:sys_capset() code, permission is denied if CAP_SETPCAP is not held and pid is positive. pid=0 means use the current process, and this is allowed. But using the current process' pid is not allowed. The man page for capsetp simply says that CAP_SETPCAP is required to use this function, and does not mention the exception for pid=0. The current behavior seems inconsistent. The attached patch also allows a process to call capset() on itself. Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
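The permission rule after this change can be modelled with a short sketch. `capset_permission()` and its parameters are hypothetical; the sketch only encodes the rule stated in the message (pid 0 or the caller's own pid needs no CAP_SETPCAP) and is not the kernel's actual code.

```c
#include <stdbool.h>
#include <errno.h>

/*
 * Return 0 if the caller may set capabilities on pid, -EPERM otherwise.
 * pid == 0 means "the current process".
 */
int capset_permission(int pid, int current_pid, bool has_setpcap)
{
    if (pid != 0 && pid != current_pid && !has_setpcap)
        return -EPERM;
    return 0;
}
```

With this rule, passing 0 and passing the caller's own pid behave the same way, which removes the inconsistency described above.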
2005-01-04	[PATCH] properly split capset_check+capset_set	Serge Hallyn	1	-25/+38
	The attached patch removes checks from kernel/capability.c which are redundant with cap_capset_check() code, and moves the capset_check() calls to immediately before the capset_set() calls. This allows capset_check() to accurately check the setter's permission to set caps on the target. Please apply. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Stephen Smalley <sds@epoch.ncsc.mil> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-03	Merge bk://linux-sam.bkbits.net/kbuild	Linus Torvalds	1	-4/+11
	into ppc970.osdl.org:/home/torvalds/v2.6/linux
2005-01-03	[PATCH] swsusp: Kill O(n^2) algorithm in swsusp	Pavel Machek	1	-75/+56
	Some machines are spending minutes of CPU time during suspend in stupid O(n^2) algorithm. This patch replaces it with O(n) algorithm, making swsusp usable to some people. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-03	[PATCH] swsusp: Small cleanups	Pavel Machek	1	-3/+3
	This adds statics at few places and fixes stale references to pmdisk. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-03	[PATCH] swsusp: kill one-line helpers, handle read errors	Pavel Machek	1	-15/+8
	swsusp contains few one-line helpers that only make reading/understanding code more difficult. Also warn the user when something goes wrong, instead of waking machine with corrupt data. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-03	[PATCH] swsusp: kill unused variable	Pavel Machek	1	-2/+0
	Variable used only for writing is bad idea. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-03	[PATCH] fix naming in swsusp	Pavel Machek	1	-2/+2
	At a few points we still refer to swsusp as "pmdisk"... it might confuse someone not knowing the full history. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-03	[PATCH] /proc/sys/kernel/bootloader_type	H. Peter Anvin	1	-0/+10
	This patch exports to userspace the boot loader ID which has been exported by (b)zImage boot loaders since boot protocol version 2. It is needed so that update tools that update kernels from vendors know which bootloader file they need to update; eg right now those tools do all kinds of hairy heuristics to find out if it's grub or lilo or .. that installed the kernel. Those heuristics are fragile in the presence of more than one bootloader (which isn't that uncommon in OS upgrade situations). Tested on i386 and x86-64; as far as I know those are the only architectures which use zImage/bzImage format. Signed-Off-By: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-29	kallsyms: gate page is part of the kernel, honour CONFIG_KALLSYMS_ALL	Keith Owens	1	-4/+11
	* Treat the gate page as part of the kernel, to improve kernel backtraces. * Honour CONFIG_KALLSYMS_ALL, all symbols are valid, not just text. Signed-off-by: Keith Owens <kaos@ocs.com.au> Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2004-12-23	[PATCH] AB-BA deadlock between uidhash_lock and tasklist_lock.	Andrew Morton	1	-2/+1
	switch_uid() doesn't care about tasklist_lock, so do it outside the lock and avoid a subtle (and very very unlikely to trigger) AB-BA deadlock. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-21	Merge kroah.com:/home/greg/linux/BK/bleed-2.6	Greg Kroah-Hartman	3	-198/+107
	into kroah.com:/home/greg/linux/BK/usb-2.6
2004-12-21	sysfs: export the /sys/kernel subsystem for people to use.	Greg Kroah-Hartman	1	-1/+2
	Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
2004-12-20	[PATCH] back out CPU clock additions to posix-timers	Roland McGrath	1	-111/+8
	This patch reverts the additions of an ABI supporting thread and process CPU clocks in the posix-timers code. This returns us to 2.6.9's condition, there is no support for any new clockid_t values for process CPU clocks. This also fixes the return value for clock_nanosleep when unsupported (I think this is used only by sgi-timer at the moment). The POSIX-specified code for valid clocks that don't support the sleep operation is ENOTSUP. On most architectures the kernel doesn't define ENOTSUP and this name is defined in userland the same as the kernel's EOPNOTSUPP. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-20	[PATCH] module sysfs: module parameters reimplemented using attr group	Tejun Heo	1	-125/+54
	Reimplement parameter attributes using attribute group. This makes more sense, for, while they reside in a separate subdirectory, they belong to the owning module and their lifetime exactly equals the lifetime of the owning module, and it's simpler. Signed-off-by: Tejun Heo <tj@home-tj.org> Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
2004-12-20	[PATCH] module sysfs: sections attr reimplemented using attr group	Tejun Heo	1	-40/+35
	Reimplement section attributes using attribute group. This makes more sense, for, while they reside in a separate subdirectory, they belong to the owning module and their lifetime exactly equals the lifetime of the owning module, and it's simpler. Signed-off-by: Tejun Heo <tj@home-tj.org> Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
2004-12-20	[PATCH] module sysfs: expand module_attribute methods	Tejun Heo	2	-2/+3
	Modify module_attribute show/store methods to accept self argument to enable further extensions. Signed-off-by: Tejun Heo <tj@home-tj.org> Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
2004-12-20	[PATCH] module sysfs: make module.mkobj inline	Tejun Heo	2	-32/+15
	Make module.mkobj inline. As this is simpler and what's usually done with kobjs when it's representing an entity. Signed-off-by: Tejun Heo <tj@home-tj.org> Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
2004-12-16	[PATCH] fix bogus ECHILD return from wait* with zombie group leader	Roland McGrath	1	-2/+13
	Klaus Dittrich observed this bug and posted a test case for it. This patch fixes both that failure mode and some others possible. What Klaus saw was a false negative (i.e. ECHILD when there was a child) when the group leader was a zombie but delayed because other children live; in the test program this happens in a race between the two threads dying on a signal. The change to the TASK_TRACED case avoids a potential false positive (blocking, or WNOHANG returning 0, when there are really no children left), in the race condition where my_ptrace_child returns zero. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-13	Merge shinybook.infradead.org:/home/dwmw2/bk/linus-2.6	David Woodhouse	1	-1/+1
	into shinybook.infradead.org:/home/dwmw2/bk/mtd-2.6
2004-12-12	[PATCH] swsusp: fix types	Pavel Machek	1	-1/+1
	This fixes types so that sparse has less stuff to complain about. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-12	[PATCH] swsusp: Fix header typo	Pavel Machek	1	-1/+1
	Fixes typo in header, please apply, Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-12	[PATCH] swsusp fixes: fix confusing printk	Pavel Machek	1	-1/+1
	This fixes confusing printk. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-12	[PATCH] swsusp bugfixes: fix memory leak	Pavel Machek	1	-1/+2
	This fixes memory leak when we are low on memory during suspend. Ouch and nr_needed_pages is only used twice, and only written :-(. I guess that can wait for 2.6.10. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-12	[PATCH] swsusp bugfixes: do not oops when not enough memory during resume	Pavel Machek	1	-0/+2
	This prevents oops when not enough memory is available during resume. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-07	[PATCH] Fix broken domain debugging (aka "isolcpus option broken")	Nick Piggin	1	-2/+4
	Fix an oops in sched_domain_debug when using the isolcpus= option. Also move a debug check for validating groups into the "for-each-group" loop, where it should be. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-07	Revert isolcpus option fix, pending better fix from Nick.	Linus Torvalds	1	-17/+0
	The real bug was in the debugging code, not the actual domain data structure setup. Cset exclude: sivanich@sgi.com[torvalds]|ChangeSet|20041207160443|30564
2004-12-06	[PATCH] isolcpus option fix	Dimitri Sivanich	1	-0/+17
	The isolcpus option is broken in 2.6.10-rc2-bk2. The domains are no longer being properly initialized (which results in a panic at bootup). Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-02	[PATCH] fix uninitialized variable in waitid(2)	Joe Korty	1	-0/+1
	Specify an initial value for signal_struct's field stop_state whenever a signal_struct variable is created. Bug was discovered through the occasional failure of telnet(1) to connect. Signed-off-by: Joe Korty <joe.korty@ccur.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-02	[PATCH] sys_set/getpriority PRIO_USER semantics fix and optimisation	Prasanna Meda	1	-12/+10
	This change makes the semantics equivalent to 2.4 and to what the man page says. It also optimises by avoiding an unneeded lookup in the uid cache when who is the same as current->uid. sys_set/getpriority was rewritten in 2.5/2.6, perhaps while transitioning to the pid maps, and now has a semantic bug when uid is zero. Note that akpm also fixed a refcount leak and locking in the new functions in changeset http://linus.bkbits.net:8080/linux-2.5/cset@1.1608.10.84 Signed-off-by: <pmeda@akamai.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-02	[PATCH] Allow multiple cpus in irq affinity call	Anton Blanchard	1	-2/+1
	The generic irq affinity code limits us to a single cpu target regardless of what the architecture supports. If required this should be done in the architecture specific ->set_affinity call. With this patch ppc64 is able to select all cpus affinity again. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-01	[PATCH] Fix occasional stop_machine() lockup with > 2 CPUs	Rusty Russell	1	-1/+6
	Stephen Rothwell noted a case where one CPU was sitting in userspace, one in stop_machine() waiting for everyone to enter stopmachine(). This can happen if migration occurs at exactly the wrong time with more than 2 CPUS. Say we have 4 CPUS:
	1) stop_machine() on CPU 0 creates stopmachine() threads for CPUS 1, 2 and 3, and yields waiting for them to migrate to their CPUs and ack.
	2) stopmachine(2) gets rebalanced (probably on exec) to CPU 1.
	3) stopmachine(2) calls set_cpus_allowed on CPU 1, sleeps awaiting migration thread.
	4) stopmachine(1) calls set_cpus_allowed on CPU 0, moves onto CPU1 and starts spinning.
	Now the migration thread never runs, and we deadlock. The simplest solution is for stopmachine() to yield until they are all in place. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-01	[PATCH] swsusp kconfig: Change in wording	Pavel Machek	1	-8/+7
	Vadim says: I was reading through the kernel/power/Kconfig file, and noticed that the wording was slightly unclear. I poked at it a bit, hopefully making the description a tad more straightforward, but you be the judge. :) Diffed against 2.6.10-rc2. From: Vadim Lobanov <vlobanov@speakeasy.net> Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-29	[PATCH] Remove Futex Warning	Rusty Russell	1	-2/+2
	If we're waiting on a futex and we are woken up, it's either because someone did FUTEX_WAKE, we timed out, or have been signalled. However, the WARN_ON(!signal_pending(current)) test is overzealous: with threads (a common use of futexes), we share the signal handler and the other thread might get to the signal before us. In addition, exit_notify() can do a recalc_sigpending_tsk() on us, which will then clear our TIF_SIGPENDING bit, making signal_pending(current) return false. Returning EINTR is a little strange in this case, since this thread hasn't handled a signal. However, with threads it's the best we can do: there's always a race where another thread could have been the actual one to handle the signal. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-28	Merge dwmw2.baythorne.internal:/inst/bk/linus-2.6	David Woodhouse	1	-1/+1
	into dwmw2.baythorne.internal:/inst/bk/mtd-2.6
2004-11-21	[PATCH] del_timer() vs. mod_timer() SMP race	Benjamin Herrenschmidt	1	-0/+2
	We just spent some days fighting a rare race in one of the distros who backported some of timer.c from 2.6 to 2.4 (though they missed a bit). The actual race we found didn't happen in 2.6 _but_ code inspection showed that a similar race is still present in 2.6, explanation below: Code removing a timer from a list (run_timers or del_timer) takes that CPU list lock, does list_del, then timer->base = NULL. It is mandatory that this timer->base = NULL is visible to other CPUs only after the list_del() is complete. If not, then mod_timer() could see it NULL, thus take its own CPU list lock and not the one for the CPU the timer was being removed from, and thus the list_add in mod_timer() could race with the list_del() from run_timers() or del_timer(). Our race happened with run_timers(), which _DOES_ contain a proper smp_wmb() in the right spot in 2.6, but didn't in the "backport" we were fighting with. However, del_timer() doesn't have such a barrier, and thus is subject to this race in 2.6 as well. This patch fixes it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-18	[PATCH] early uart console support	Bjorn Helgaas	1	-1/+1
	This adds an early polled-mode "uart" console driver, based on Andi Kleen's early_printk work. The difference is that this locates the UART device directly by its MMIO or I/O port address, so we don't have to make assumptions about how ttyS devices will be named. After the normal serial driver starts, we try to locate the matching ttyS device and start a console there. Sample usage: console=uart,io,0x3f8 console=uart,mmio,0xff5e0000,115200n8 If the baud rate isn't specified, we peek at the UART to figure it out. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
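The two sample strings above follow the pattern console=uart,<io|mmio>,<address>[,<options>]. Purely as an illustration, a small Python sketch of splitting such an argument into its fields might look like the following. parse_uart_console is a hypothetical helper written for this note; it is not the kernel's parser and it does not model the baud-rate probing mentioned in the commit.

```python
def parse_uart_console(arg):
    """Split 'console=uart,<io|mmio>,<addr>[,<options>]' into its fields.

    Hypothetical helper for illustration only; the kernel's own parsing is
    written in C and may accept forms that this sketch rejects.
    """
    prefix = "console=uart,"
    if not arg.startswith(prefix):
        raise ValueError("not an early uart console argument: %r" % arg)

    # At most three fields: access type, address, optional line settings.
    fields = arg[len(prefix):].split(",", 2)
    if len(fields) < 2 or fields[0] not in ("io", "mmio"):
        raise ValueError("expected 'io' or 'mmio' followed by an address")

    kind = fields[0]
    address = int(fields[1], 16)          # accepts a leading 0x
    options = fields[2] if len(fields) == 3 else None
    return kind, address, options
```

With the samples from the commit message, the first string yields no options field (the case where the driver would have to work out the baud rate itself), and the second carries 115200n8.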
2004-11-18	[PATCH] sched: fix ->nr_uninterruptible handling bugs	Ingo Molnar	1	-7/+45
	PREEMPT_RT on SMP systems triggered weird (very high) load average values rather easily, which turned out to be a mainline kernel ->nr_uninterruptible handling bug in try_to_wake_up(). the following code:
	if (old_state == TASK_UNINTERRUPTIBLE) { old_rq->nr_uninterruptible--;
	potentially executes with old_rq potentially being != rq, and hence updating ->nr_uninterruptible without the lock held. Given a sufficiently concurrent preemption workload the count can get out of whack and updates might get lost, permanently skewing the global count. Nothing except the load-average uses nr_uninterruptible() so this condition can go unnoticed quite easily. the fix is to update ->nr_uninterruptible always on the runqueue where the task currently is. (this is also a tiny performance plus for try_to_wake_up() as a stackslot gets freed up.) while fixing this bug i found three other ->nr_uninterruptible related bugs:
	- the update should be moved from deactivate_task() into schedule(), because e.g. setscheduler() does deactivate_task()+activate_task(), which in turn may result in a -1 counter-skew if setscheduler() is done on a task asynchronously, which task is still on the runqueue but has already set ->state to TASK_UNINTERRUPTIBLE. sys_sched_setscheduler() is used rarely, but the bug is real. (The fix is also a small performance enhancement.) The rules for ->nr_uninterruptible updating are the following: it gets increased by schedule() only, when a task is moved off the runqueue and it has a state of TASK_UNINTERRUPTIBLE. It is decreased by try_to_wake_up(), by the first wakeup that materially changes the state from TASK_UNINTERRUPTIBLE back to TASK_RUNNING, and moves the task to the runqueue.
	- on CPU-hotplug down we might zap a CPU that has a nonzero counter. Due to the fuzzy nature of the global counter a CPU might hold a nonzero ->nr_uninterruptible count even if it has no tasks anymore. The solution is to 'migrate' the counter to another runqueue.
	- we should not return negative counter values from the nr_uninterruptible() function, since it accesses them without taking the runqueue locks, so the total sum might be slightly above or slightly below the real count.
	I tested the attached patch on x86 SMP and it solves the load-average problem. (I have tested CPU_HOTPLUG compilation but not functionality.) I think this is a must-have for 2.6.10, because there are apps that go berserk if load-average is too high (e.g. sendmail). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
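The accounting rules in this commit message can be pictured with a toy model. The Python sketch below is an assumption-laden illustration, not the scheduler code: the list of integers stands in for the per-runqueue ->nr_uninterruptible counters, and both function names are invented for this note. It shows why an individual counter may legitimately go negative while only the sum is meaningful, and why the reported total is clamped at zero.

```python
def sleep_then_wake(counts, sleep_cpu, wake_cpu):
    """Model one task blocking on sleep_cpu and being woken while it is
    accounted on wake_cpu. Returns a new list of per-runqueue counters."""
    counts = list(counts)
    counts[sleep_cpu] += 1   # schedule(): task leaves the runqueue in TASK_UNINTERRUPTIBLE
    counts[wake_cpu] -= 1    # try_to_wake_up(): first wakeup back to TASK_RUNNING
    return counts


def nr_uninterruptible(counts):
    """Sum the per-runqueue counters without locking. A racy read may see a
    slightly negative sum, so never report less than zero."""
    return max(sum(counts), 0)
```

For example, a task that sleeps on CPU 0 and is woken on CPU 1 leaves the counters at [1, -1], which still sums to zero; a reader that happens to observe [0, -1] mid-update reports 0 rather than -1.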
2004-11-17	Fix reading /proc/<pid>/mem when parent dies.	Linus Torvalds	1	-1/+0
	We should not touch "self_exec_id" here. The parent changed, not we.
2004-11-16	Email address update.	David Woodhouse	1	-1/+1
	The work address is increasingly unreliable and incompetently run. Time to remove all visible instances of it and rely only on one which isn't run by crack-monkeys. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2004-11-16	[PATCH] Fork fix fix	David Howells	1	-4/+1
	The attached patch fixes the fork fix to avoid the divide-by-zero error I'd previously fixed, but without using any sort of conditional. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-14	[PATCH] revert recent futex_wait fix	Jamie Lokier	1	-8/+24
	The patch was wrong. Back it out, and add some commentary explaining why we need to run queue_me() prior to the get_user(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-10	[PATCH] unexport task_nice	Arjan van de Ven	1	-3/+0
	task_nice() was exported for binfmt_elf, however that's no longer modular. normalize_rt_tasks() is used by the sysreq code only, which isn't modular. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-10	[PATCH] futex_wait hang fix	Hidetoshi Seto	1	-6/+5
	NPTL has 3 control counters (total/wake/woken). so NPTL can know: "how many threads enter to wait"(total), "how many threads receive wake signal"(wake), and "how many threads exit waiting"(woken). Abstraction of pthread_cond_wait and pthread_cond_signal are:
	A01 pthread_cond_wait {
	A02     timeout = 0;
	A03     lock(counters);
	A04     total++;
	A05     val = get_from(futex);
	A06     unlock(counters);
	A07
	A08     sys_futex(futex, FUTEX_WAIT, val, timeout);
	A09
	A10     lock(counters);
	A11     woken++;
	A12     unlock(counters);
	A13 }
	B01 pthread_cond_signal {
	B02     lock(counters);
	B03     if(total>wake) { /* if there is waiter */
	B04         wake++;
	B05         update_val(futex);
	B06         sys_futex(futex, FUTEX_WAKE, 1);
	B07     }
	B08     unlock(counters);
	B09 }
	What we have to notice is: FUTEX_WAKE could be called before FUTEX_WAIT have called (at A07). In such case, FUTEX_WAKE will fail if there is no thread in waitqueue. However, since pthread_cond_signal do not only wake++ but also update_val(futex), next FUTEX_WAIT will fail with -EWOULDBLOCK because the val passed to WAIT is now not equal to updated val. Therefore, as the result, it seems that the WAKE wakes the WAIT.
	===
	The bug will appear if 2 pair of wait & wake called at (nearly)once: Assume 4 threads, wait_A, wait_B, wake_X, and wake_Y
	* counters start from [total/wake/woken]=[0/0/0]
	* the val of futex starts from (0), update means increment of the val.
	* there is no thread in waitqueue on the futex.
	[simulation]
	wait_A: calls pthread_cond_wait: total++, prepare to call FUTEX_WAIT with val=0.
	# status: [1/0/0] (0) queue={}(empty) #
	wake_X: calls pthread_cond_signal: no one in waitqueue, just wake++ and update futex val.
	# status: [1/1/0] (1) queue={}(empty) #
	wait_B: calls pthread_cond_wait: total++, prepare to call FUTEX_WAIT with val=1.
	# status: [2/1/0] (1) queue={}(empty) #
	wait_A: calls FUTEX_WAIT with val=0: after queueing, compare val. 0!=1 ... this should be blocked...
	# status: [2/1/0] (1) queue={A} #
	wait_B: calls FUTEX_WAIT with val=1: after queueing, compare val. 1==1 ... OK, let's schedule()...
	# status: [2/1/0] (1) queue={A,B} (B=sleeping) #
	wake_Y: calls pthread_cond_signal: A is in waitqueue ... dequeue A, wake++ and update futex val.
	# status: [2/2/0] (2) queue={B} (B=sleeping) #
	wait_A: end of FUTEX_WAIT with val=0: try to dequeue but already dequeued, return anyway.
	# status: [2/2/0] (2) queue={B} (B=sleeping) #
	wait_A: end of pthread_cond_wait: woken++.
	# status: [2/2/1] (2) queue={B} (B=sleeping) #
	This is bug:
	wait_A: wakeup
	wait_B: sleeping
	wake_X: wake A
	wake_Y: wake A again
	if subsequent wake_Z try to wake B:
	wake_Z: calls pthread_cond_signal: since total==wake, do nothing.
	# status: [2/2/1] (2) queue={B} (B=sleeping) #
	If wait_C comes, B become to can be woken, but C... This bug makes the waitqueue to trap some threads in it all time.
	====
	> - According to man of futex:
	> "If the futex was not equal to the expected value, the operation
	> returns -EWOULDBLOCK."
	> but now, here is no description about the rare case:
	> "returns 0 if the futex was not equal to the expected value, but
	> the process was woken by a FUTEX_WAKE call."
	> this behavior on rare case causes the hang which I found.
	So to avoid this problem, my patch shut up the window that you said:
	> The patch certainly looks sensible - I can see that without the patch,
	> there is a window in which this process is pointlessly queued up on the
	> futex and that in this window a wakeup attempt might do a bad thing.
	=====
	In short: There is an un-documented behavior of futex_wait. This behavior misleads NPTL to wake a thread doubly, as the result, causes an application hang. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
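The simulation in this commit message can be replayed mechanically. The Python sketch below is a single-threaded toy model written for this note under stated assumptions: CondModel and run_trace are invented names, the "queue before compare" behaviour is modelled as described in the message, and FUTEX_WAKE is assumed to dequeue the oldest waiter. It is not NPTL or kernel code.

```python
from collections import deque


class CondModel:
    """Toy model of the [total/wake/woken] counters, the futex value and
    the futex wait queue described in the commit message."""

    def __init__(self):
        self.total = self.wake = self.woken = 0
        self.futex = 0
        self.queue = deque()

    def wait_prepare(self):
        """A03-A06: count the waiter and sample the futex value."""
        self.total += 1
        return self.futex

    def futex_wait_enqueue(self, name, expected):
        """A08, old behaviour: queue first, compare afterwards.
        Returns True if the caller would go to sleep."""
        self.queue.append(name)
        return self.futex == expected

    def futex_wait_finish(self, name):
        """End of FUTEX_WAIT plus A10-A12: dequeue if still queued, woken++."""
        if name in self.queue:
            self.queue.remove(name)
        self.woken += 1

    def signal(self):
        """B02-B08: wake one waiter if the counters say one exists.
        Returns the name that was dequeued, or None."""
        if self.total > self.wake:
            self.wake += 1
            self.futex += 1
            if self.queue:
                return self.queue.popleft()
        return None

    def status(self):
        return (self.total, self.wake, self.woken, self.futex, list(self.queue))


def run_trace():
    m = CondModel()
    val_a = m.wait_prepare()                     # wait_A prepares with val=0
    m.signal()                                   # wake_X: queue empty
    val_b = m.wait_prepare()                     # wait_B prepares with val=1
    a_sleeps = m.futex_wait_enqueue("A", val_a)  # 0 != 1, but A is queued
    b_sleeps = m.futex_wait_enqueue("B", val_b)  # 1 == 1, B sleeps
    woken_by_y = m.signal()                      # wake_Y dequeues A, not B
    m.futex_wait_finish("A")                     # A returns, woken++
    woken_by_z = m.signal()                      # wake_Z: total == wake
    return m.status(), a_sleeps, b_sleeps, woken_by_y, woken_by_z
```

Running run_trace() ends with B still queued and a later signal doing nothing, which matches the final "# status: [2/2/1] (2) queue={B}" line of the message.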
2004-11-10	[PATCH] fix page size assumption in fork()	David Howells	1	-1/+5
	The attached patch fixes fork to get rid of the assumption that THREAD_SIZE >= PAGE_SIZE (on the FR-V the smallest available page size is 16KB). Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-10	[PATCH] remove contention on profile_lock	Jesse Barnes	1	-28/+17
	profile_hook unconditionally takes a read lock on profile_lock if kernel profiling is enabled. The lock protects the profile_hook notifier chain from being written while it's being called. The routine profile_hook is called in a very hot path though: every timer tick on every CPU. As you can imagine, on a large system, this makes the cacheline containing profile_lock pretty hot. Since oprofile was the only user of the profile_hook, I removed the notifier chain altogether in favor of a simple function pointer with the help of John Levon. This removes all of the contention in the hot path since the variable is very seldom written and simplifies things a little to boot. Acked-by: John Levon <levon@movementarian.org> Signed-off-by: Jesse Barnes <jbarnes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-10	[PATCH] compat syscalls naming standardisation	Matthew Wilcox	2	-9/+9
	On PA-RISC, we have a unified syscall table for 32 and 64 bit that uses macros to generate the appropriate syscall names (native vs compat). For this to work, we need consistent compat syscall names. Unfortunately, some recent additions drop the 'sys_'. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-09	[PATCH] Fix do_wait race	Dinakar Guniguntala	1	-1/+3
	Only set the flag in the cases when the exit state is not either TASK_DEAD or TASK_ZOMBIE. (TASK_DEAD or TASK_ZOMBIE will either race or we'll return the information, so no need to note them). I confirmed that this fixes the problem and I also ran some LTP tests Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-08	wait_task_stopped() must not just return 0 when it has	Linus Torvalds	1	-2/+11
	released the tasklist_lock. Since it released the lock, the process lists may not be valid any more, and we must repeat the loop rather than continue with the next parent. Use -EAGAIN to show this condition (separate from the normal -EFAULT that may happen if rusage information could not be copied to user space).
2004-11-07	[PATCH] panic_blink and i8042 unloading	Dmitry Torokhov	1	-2/+5
	At unload i8042 sets panic_blink to 0. This will cause problems if kernel panics later as it will just use it assuming that the pointer is correct. Signed-off-by: Dmitry Torokhov <dtor@mail.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-07	[PATCH] unexport do_settimeofday	Christoph Hellwig	1	-2/+0
	Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-07	[PATCH] kprobes: Minor i386 changes required for porting kprobes to x86_64	Prasanna S. Panchamukhi	1	-1/+5
	- Kprobes structure has been modified to support copying of original instruction as required by the architecture. On x86_64 normal pages we get from kmalloc or vmalloc are not executable. Single-stepping an instruction on such a page yields an oops. So instead of storing the instruction copies in their respective kprobe objects, we allocate a page, map it executable, and store all the instruction copies there and store the pointer of the copied instruction in the specific kprobes object. - jprobe_return_end is moved into inline assembly to avoid compiler optimization. - arch_prepare_kprobe() now returns an integer, since arch_prepare_kprobe() might fail on other architectures. - added arch_remove_kprobe() routine, since other architectures require it. Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-07	[PATCH] SysRq-n changes RT tasks to normal	Måns Rullgård	1	-0/+32
	Teach sysrq-N to switch all rt-policy tasks to SCHED_OTHER. For recovering from (and diagnosing) userspace bugs. Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-07	[PATCH] Don't ignore try_stop_module return	Rusty Russell	1	-0/+2
	Since 2.6.4 we've been ignoring the failure of try_stop_module: it will normally fail if the module reference count is non-zero. This would have been mainly unnoticed, since "modprobe -r" checks the usage count before calling sys_delete_module(), however there is a race which would cause a hang in this case. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-03	[PATCH] fix wrong kfifo_init buffer size argument	Martin Waitz	1	-4/+2
	kfifo_alloc tries to round up the buffer size to the next power of two. But it accidentally uses the original size when calling kfifo_init, which will BUG. Acked-by: Stelian Pop <stelian@popies.net> Signed-off-by: Martin Waitz <tali@admingilde.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
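The bug described here is that the rounded-up size was computed but the unrounded one was passed on. A minimal Python sketch of the intended flow follows, assuming (as the message implies) that the init step rejects any size that is not a power of two. fifo_init and fifo_alloc are stand-ins invented for this note, and the ValueError stands in for the kernel's BUG.

```python
def roundup_pow_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("size must be positive")
    return 1 << (n - 1).bit_length()


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def fifo_init(size):
    """Stand-in for the init step: refuses sizes that are not a power of two."""
    if not is_power_of_two(size):
        raise ValueError("size must be a power of two")
    return {"size": size}


def fifo_alloc(size):
    """Stand-in for the alloc step after the fix: round up, then pass the
    rounded size (not the caller's original size) to the init step."""
    return fifo_init(roundup_pow_of_two(size))
```

In this model a request for 100 bytes yields a 128-byte FIFO, whereas handing the original 100 straight to fifo_init fails, which is the failure mode the patch removes.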
2004-11-03	x86: regparm calling convention for exceptions and interrupts.	Linus Torvalds	2	-3/+3
	This clarifies more of the x86 caller/callee stack ownership issues by making the exception and interrupt handler assembler interfaces use register calling conventions. System calls still use the stack. Tested with "crashme" on UP/SMP.
2004-11-01	[PATCH] Add panic blinking to 2.6	Andi Kleen	1	-5/+19
	This patch readds the panic blinking that was in 2.4 to 2.6. This is useful to see when you're in X that the machine has panicked. It addresses previous criticism. It should now work when the keyboard interrupt is off. It doesn't fully emulate the handler, but has a timeout for this case. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-01	[PATCH] x86_64: add nmi button support	Andi Kleen	1	-2/+2
	Ported from i386. Support a sysctl to raise an oops with an NMI. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-01	[PATCH] standalone sys_ni.c for not-implemented syscalls	Peter Chubb	3	-81/+86
	Sticking the not-implemented syscall stuff in sys.c is a pain because the cond_syscall()s explode when certain prototypes are in scope. And we need those prototypes' header files for the C code in sys.c. Fix all that up by moving all the sys_ni_syscall code into its own .c file. Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-31	[PATCH] fix IBM cyclone clock and some cleanup	Christoph Lameter	1	-17/+29
	- fix broken IBM cyclone time interpolator support - add support for cyclic timers through an addition of a mask in the timer interpolator structure - Allow time_interpolator_update() and time_interpolator_get_offset() to be invoked without an active time interpolator (necessary since the cyclone clock is initialized late in ACPI processing) - remove obsolete function time_interpolator_resolution() - add a mask to all struct time_interpolator setups in the kernel - Make time interpolators work on 32bit platforms Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-31	[PATCH] take me home, hotplug_path[]	Kay Sievers	3	-25/+1
	Move hotplug_path[] out of kmod.[ch] to kobject_uevent.[ch] where it belongs now. At some time in the future we should fix the remaining bad hotplug calls (no SEQNUM, no netlink uevent): ./drivers/input/input.c (no DEVPATH on some hotplug events!) ./drivers/pnp/pnpbios/core.c ./drivers/s390/crypto/z90main.c Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
2004-10-30	Lock-annotate some kernel functions as an example of how it works.	Linus Torvalds	1	-0/+2
	In particular, a function that is called with a lock held, and releases it only to re-acquire it needs to be annotated as such, since otherwise sparse will complain about an unexpected unlock, even though "globally" the lock is constant over the call.
2004-10-30	Annotate scheduler locking behaviour.	Linus Torvalds	1	-2/+19
	This annotates the scheduler routines for locking, telling what locks a function releases or acquires, allowing sparse to check the lock usage (and documenting it at the same time).
2004-10-29	[PATCH] uninline __sigqueue_alloc	Chris Wright	1	-1/+1
	Christoph suggests letting the compiler choose. No real compelling reason to inline anyhow. I had some vmlinux size numbers suggesting inline was better, but re-running them on newer kernel is giving different results, favoring uninline. Best let compiler choose. Un-inline __sigqueue_alloc. Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-28	[PATCH] make dnotify a configure-time option	Robert Love	1	-0/+2
	make dnotify configurable, via CONFIG_DNOTIFY. CONFIG_EMBEDDED is required for disabling dnotify. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-28	[PATCH] Add typechecking to suspend types and powerdown types	Pavel Machek	2	-8/+9
	This adds typechecking to suspend types and powerdown types. This should solve at least part of suspend type confusion. There should be no code changes generated by this one. Acked-by: Patrick Mochel <mochel@digitalimplant.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] remove double newline from sysrq action_msg	Olaf Hering	1	-1/+1
	__handle_sysrq already prints a newline, so the action_msg string doesn't need yet another newline. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] remove invoke_softirq	Christoph Hellwig	1	-1/+1
	This was used by the early irqstacks implementation on s390 and has been replaced by __ARCH_HAS_DO_SOFTIRQ now. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] fix show_refcnt return value type	Christoph Hellwig	1	-1/+1
	module_attribute.show is defined to return ssize_t Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] Lock initializer unifying (Core)	Thomas Gleixner	2	-5/+5
	To make spinlock/rwlock initialization consistent all over the kernel, this patch converts explicit lock-initializers into spin_lock_init() and rwlock_init() calls. Currently, spinlocks and rwlocks are initialized in two different ways: lock = SPIN_LOCK_UNLOCKED spin_lock_init(&lock) rwlock = RW_LOCK_UNLOCKED rwlock_init(&rwlock) this patch converts all explicit lock initializations to spin_lock_init() or rwlock_init(). (Besides consistency this also helps automatic lock validators and debugging code.) The conversion was done with a script, it was verified manually and it was reviewed, compiled and tested as far as possible on x86, ARM, PPC. There is no runtime overhead or actual code change resulting out of this patch, because spin_lock_init() and rwlock_init() are macros and are thus equivalent to the explicit initialization method. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] signal.c: gcc-3.4 fix	Pawel Sikora	1	-1/+1
	Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] unexport add_timer_on()	Arjan van de Ven	1	-1/+0
	add_timer_on() isn't used by modules (in fact it's only used ONCE, in workqueue.c) and it's not even a good api for drivers, in fact, the comment for it says * This is not very scalable on SMP. Double adds are not possible. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] unexport kick_process	Christoph Hellwig	1	-2/+0
	This isn't exactly the kind of interface modules should use. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] unexport getnstimeofday	Christoph Hellwig	1	-2/+0
	This recently added function is only used by the posix timers code, no need to be exported. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] unexport raise_softirq	Arjan van de Ven	1	-2/+0
	The patch below unexports raise_softirq(). raise_softirq() is not the right api for drivers to use, instead raise_softirq_irqoff() is, and thankfully all in-kernel code is using that variant already. To avoid future "accidents", unexport. Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] scheduler: remove redundant #ifdef	Paul E. McKenney	1	-2/+0
	Removes a redundant #ifdef CONFIG_SMP that is nested within an enclosing #ifdef CONFIG_SMP. Signed-off-by: <paulmck@us.ibm.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] swsusp: print error message when swapping is disabled	Yi Zhu	1	-1/+4
	This patch gives some clues to the user when swapping is not enabled during swsusp. Please apply. Signed-off-by: Zhu Yi <yi.zhu@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27	[PATCH] statm: shared = rss - anon_rss	Hugh Dickins	1	-0/+1
	The third "shared" field of /proc/$pid/statm in 2.4 was a count of pages in the mm whose page_count is more than 1 (oddly, including pages shared just with swapcache). That's too costly to calculate each time, so 2.6 changed it to the total file-backed extent. But Andrea knows apps and users surprised when (rss - shared) goes negative: we need to provide an rss-like statistic, close to the 2.4 interpretation. Something that's quick and easy to maintain accurately is mm->anon_rss, the count of anonymous pages in the mm. Then shared = rss - anon_rss gives a pretty good and meaningful approximation to 2.4's intention: wli confirms that this will be useful to Oracle too. Where to show it? I think it's best to treat this as a bugfix and show it in the third field of /proc/$pid/statm, after resident, as before - there's no evidence that the total file-backed extent was found useful. Albert would like other fields to revert to page counts, but that's a lot harder: if mprotect can change the category of a page, then it can't be accounted as simply as this. Only go that route if real need shown. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
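The formula in the subject line is simple enough to check by hand. As a sketch only, assuming page counts as inputs and assuming that anonymous pages are a subset of the resident set, it can be written in Python as follows; statm_shared is a name invented for this note.

```python
def statm_shared(rss_pages, anon_rss_pages):
    """Third statm field as described above: resident pages that are not
    anonymous. Assumes 0 <= anon_rss_pages <= rss_pages."""
    if not 0 <= anon_rss_pages <= rss_pages:
        raise ValueError("anon_rss must lie between 0 and rss")
    return rss_pages - anon_rss_pages
```

With rss = 1000 pages and anon_rss = 700 pages, shared comes out as 300, and (rss - shared) equals anon_rss, so it cannot go negative the way it could when shared was the total file-backed extent.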
2004-10-27	Fix up "compat_sys_keyctl()" system call.	Linus Torvalds	1	-0/+1
	Fix name, and make sure that it's listed as a conditional system call so that we stub it out to ENOSYS if the kernel isn't compiled with key management support.
2004-10-27	arm: Fix ARM kernel build with permitted binutils versions	Russell King	1	-2/+14
	All ARM binutils versions post 2.11.90 contain an extra "feature" which interferes with the kernel in various ways - extra "mapping symbols" in the ELF symbol table '$a', '$t' and '$d'. This causes two problems: 1. Since '$a' symbols have the same value as function names, this causes anything which uses the kallsyms infrastructure to report wrong values. 2. programs which parse System.map do not expect symbols to start with '$'. Signed-off-by: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Sam Ravnborg <sam@ravnborg.org> ===== kernel/module.c 1.120 vs edited =====
2004-10-25	[PATCH] Wake up signalled tasks when exiting ptrace	Roland McGrath	1	-0/+14
	In general it is not safe to do any non-ptrace wakeup of a thread in TASK_TRACED, because the waking thread could race with a ptrace call that could be doing things like mucking directly with its kernel stack. AFAIK no one has established that whatever clobberation ptrace can do to a running thread is safe even if it will never return to user mode, so we can't allow this even for SIGKILL. What we _can_ safely do is make a thread switching out of TASK_TRACED resume rather than sitting in TASK_STOPPED if it has a pending SIGKILL or SIGCONT. The following patch does this. This should be sufficient for the shutdown case. When killing all processes, if the tracer gets killed first, the tracee goes into TASK_STOPPED and will be woken and killed by the SIGKILL (same as before). If the tracee gets killed first, it gets a pending SIGKILL and doesn't wake up immediately--but, now, when the tracer gets killed, the tracee will then wake up to die. This will also fix the (same) situations that can arise now where you have used gdb (or whatever ptrace caller), killed -9 the gdb and the process being debugged, but still have to kill -CONT the process before it goes away (now it should just go away either the first time or when you kill gdb). Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-25	[PATCH] Posix layer <-> clock driver API fix	Christoph Lameter	1	-55/+34
	This is needed for an mmtimer driver update that we are currently working on. The mmtimer driver provides CLOCK_SGI_CYCLE via clock_gettime and clock_settime. With this api fix one will be able to use timer_create, timer_settime and friends from userspace to schedule and receive signals via timer interrupts of mmtimer. Changelog * Clean up timer api for drivers that use register_posix_clock. Drivers will then be able to use posix timers to schedule interrupts. * Change API for posix_clocks[].timer_create to only pass one pointer to a k_itimer structure that is now allocated and managed by the posix layer in the same way as for the other posix timer functions. * Isolate a posix_timer_event(timr) function in posix-timers.c that may be called by the interrupt routine of a timer to signal that the scheduled event has taken place. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-25	[PATCH] Builtin Module Parameters in sysfs too	Rusty Russell	2	-105/+420
	Currently, only module parameters in loaded modules are exported in /sys/modules/, while those of "modules" built into the kernel can be set by the kernel command line, but not read or set via sysfs. - move module parameters from /sys/modules/$(module_name)/$(parameter_name) to /sys/modules/$(module_name)/parameters/$(parameter_name) - remove dummy kernel_param for exporting refcnt, add "struct module *"-based attribute instead - also export module parameters for "modules" which are built into the kernel, so parameters are always accessible at /sys/modules/$(KBUILD_MODNAME)/$(parameter_name) Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (modified) Signed-off-by: Dominik Brodowski <linux@brodo.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-25	[PATCH] session leader tty disassociation fix	Roland McGrath	1	-2/+4
	The session leader should disassociate from its controlling terminal and send SIGHUP signals only when the whole session leader process dies. Currently, this gets done when any thread in that process dies, which is wrong. This patch fixes it. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-25	[PATCH] acct: report single record for multithreaded process	Roland McGrath	3	-7/+16
	This patch changes process accounting to write just one record for a process with many NPTL threads, rather than one record for each thread. No record is written until the last thread exits. The process's record shows the cumulative time of all the threads that ever lived in that process (thread group). This seems like the clearly right thing and I assume it is what anyone using process accounting really would like to see. There is a race condition between multiple threads exiting at the same time to decide which one should write the accounting record. I couldn't think of anything clever using existing bookkeeping that would get this right, so I added another counter for this. (There may be some potential to clean up existing places that figure out how many non-zombie threads are in the group, now that this count is available.) Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
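The "which exiting thread writes the record" decision described above can be sketched with a per-group counter. This is a minimal userspace illustration only: `struct thread_group_sketch`, `live`, and `exit_should_write_record()` are hypothetical names, and the use of C11 atomics is an assumption; the commit message says only that a counter was added, not how it is protected.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-process (thread group) bookkeeping. */
struct thread_group_sketch {
    atomic_int live;   /* threads that have not yet reached the accounting step */
};

/*
 * Called once by each exiting thread. atomic_fetch_sub() returns the
 * value held before the decrement, so exactly one caller observes 1,
 * and that caller is the last thread, which writes the single record.
 */
bool exit_should_write_record(struct thread_group_sketch *g)
{
    return atomic_fetch_sub(&g->live, 1) == 1;
}
```

Because the decrement and the test are a single atomic step, two threads exiting at the same time cannot both conclude that they are last.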
2004-10-25	[PATCH] sched: active_load_balance fixes	Darren Hart	1	-55/+70
	The following patch against the latest mm fixes several problems with active_load_balance(). Rather than starting with the highest allowable domain (SD_LOAD_BALANCE is still set) and depending on the order of the cpu groups, we start at the lowest domain and work up until we find a suitable CPU or run out of options (SD_LOAD_BALANCE is no longer set). This is a more robust approach as it is more explicit and not subject to the construction order of the cpu groups. We move the test for busiest_rq->nr_running <=1 into the domain loop so we don't continue to try and move tasks when there are none left to move. This new logic (testing for nr_running in the domain loop) should make the busiest_rq==target_rq condition really impossible, so we have replaced the graceful continue on fail with a BUG_ON. (Bjorn Helgaas, please confirm) We eliminate the exclusion of the busiest_cpu's group from the pool of available groups to push to as it is the ideal group to push to, even if not very likely to be available. Note that by removing the test for group==busy_group and allowing it to also be tested for suitability, the running time is nearly the same. We no longer force the destination CPU to be in a group of completely idle CPUs, nor to be the last in that group. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-25	[PATCH] schedstat: fix schedule() statistics	Akinobu Mita	1	-1/+2
	The number of times schedule() left the processor idle in the /proc/schedstat (runqueue.sched_goidle) seems to be wrong. The schedule() statistics should satisfy the equation: sched_cnt == sched_noswitch + sched_switch + sched_goidle (http://eaglet.rain.com/rick/linux/schedstat/v10/format-10.html) The patch below fixes this, and I have confirmed the fix with: # grep ^cpu /proc/schedstat | awk '{print $6+$7+$9, $8}' Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
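The awk check quoted above can also be written as a small C helper. This is a sketch that assumes the same field positions as the awk command (counting "cpuN" as field 1, fields 6, 7 and 9 must add up to field 8); the function name is hypothetical.

```c
#include <stdio.h>

/*
 * Check one "cpuN ..." line from /proc/schedstat the way the awk
 * one-liner does: $6 + $7 + $9 must equal $8.
 * Returns 1 if the equation holds, 0 if it does not, and -1 if the
 * line is not a cpu line or has fewer than nine fields.
 */
int schedstat_line_consistent(const char *line)
{
    unsigned long long f[8];   /* f[0] is awk's $2, f[7] is awk's $9 */
    int got;

    got = sscanf(line, "cpu%*u %llu %llu %llu %llu %llu %llu %llu %llu",
                 &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6], &f[7]);
    if (got != 8)
        return -1;

    /* $6 + $7 + $9 == $8 */
    return f[4] + f[5] + f[7] == f[6];
}
```

Feeding every line of /proc/schedstat through this helper and flagging a 0 result gives the same verdict as comparing the two columns printed by awk.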
2004-10-25	[PATCH] sched: improved load_balance() tolerance for pinned tasks	John Hawkes	1	-4/+12
	A large number of processes that are pinned to a single CPU results in every other CPU's load_balance() seeing this overloaded CPU as "busiest", yet move_tasks() never finds a task to pull-migrate. This condition occurs during module unload, but can also occur as a denial-of-service using sys_sched_setaffinity(). Several hundred CPUs performing this fruitless load_balance() will livelock on the busiest CPU's runqueue lock. A smaller number of CPUs will livelock if the pinned task count gets high. This simple patch remedies the more common first problem: after a move_tasks() failure to migrate anything, the balance_interval increments. Using a simple increment, vs. the more dramatic doubling of the balance_interval, is conservative and yet also effective. Signed-off-by: John Hawkes <hawkes@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-25	[PATCH] sched: small load balance fix	Jesse Barnes	1	-2/+1
	Small bug fix for domains that don't load balance (like those that only balance on exec for example). Signed-off-by: John Hawkes <hawkes@sgi.com> Signed-off-by: Jesse Barnes <jbarnes@sgi.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-25	[PATCH] small SOFTWARE_SUSPEND help text fixes	Adrian Bunk	1	-1/+2
	Some small fixes for the SOFTWARE_SUSPEND help text. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-25	[PATCH] power/disk.c: small fixups	Pavel Machek	1	-14/+5
	power_down may never ever fail, so it does not really need to return anything. Kill obsolete code and fixup old comments. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-25	Allow BKL re-acquire to fail, causing us to re-schedule.	Linus Torvalds	1	-2/+5
	This allows for low-latency BKL contention even with preemption. Previously, since preemption is disabled over context switches, re-acquiring the kernel lock when resuming a process would be non-preemptible.
2004-10-24	[PATCH] Fix msleep to sleep _at_least_ the requested amount	Benjamin Herrenschmidt	1	-2/+2
	Makes sure msleep() sleeps at least the amount provided, since schedule_timeout() doesn't guarantee a full jiffy. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
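The rounding problem can be illustrated in plain C. A timeout of N ticks that is armed partway through a tick expires at the Nth tick boundary, so the real delay can be anywhere from just over N-1 ticks up to N ticks; adding one tick turns N into a lower bound. The sketch below assumes a tick rate of 100 per second, and the names `SKETCH_HZ`, `ms_to_ticks_round_up()` and `ticks_for_at_least()` are hypothetical, not kernel functions.

```c
#define SKETCH_HZ 100ULL   /* assumed tick rate, ticks per second */

/* Convert milliseconds to whole ticks, rounding up. */
unsigned long long ms_to_ticks_round_up(unsigned int ms)
{
    return (ms * SKETCH_HZ + 999ULL) / 1000ULL;
}

/*
 * Ticks to request so that at least `ms` milliseconds elapse even if
 * the timeout is armed in the middle of a tick: the partial first
 * tick is paid for with one extra tick.
 */
unsigned long long ticks_for_at_least(unsigned int ms)
{
    return ms_to_ticks_round_up(ms) + 1;
}
```

With the assumed rate, a 10 ms request becomes 2 ticks instead of 1, so the caller waits between 10 and 20 ms rather than between 0 and 10 ms.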
2004-10-24	Un-inline the big kernel lock.	Linus Torvalds	1	-15/+0
	Now that spinlocks are uninlined, it is silly to keep the BKL inlined. And this should make it a lot easier for people to play around with variations on the locking (ie Ingo's semaphores etc).
2004-10-22	[PATCH] Fix ptrace problem	Roland McGrath	1	-1/+2
	This is indeed a new bug, and it is not architecture-specific. In my recent changes to close some race conditions, I overlooked the case of a process using PTRACE_ATTACH on its own children. The new PT_ATTACHED flag does not really mean "PTRACE_ATTACH was used", it means "PTRACE_ATTACH is changing the ->parent link". This fixes the problem that Stephane Eranian's program demonstrates. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-22	[PATCH] Invalid BUG_ONs in signal.c	Roland McGrath	1	-14/+8
	Oh, duh. The race is obvious. Sorry for the confusion there. The BUG_ON's were useful for debugging, since they trigger on a lot of errors, but they _also_ trigger on some unlikely (but valid) races. So just remove them - just fall through to the regular exit code after core-dumping (which does everything right). Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-22	[PATCH] delay rq_lock acquisition in setscheduler	Chris Wright	1	-16/+17
	Doing access control checks with rq_lock held can cause deadlock when audit messages are created (via printk or audit infrastructure) which trigger a wakeup and deadlock, as noted by both SELinux and SubDomain folks. This patch will let the security checks happen w/out lock held, then re-sample the p->policy in case it was raced. Originally from John Johansen <johansen@immunix.com>, reworked by me. AFAIK, this version drew no objections from Ingo or Andrea. From: John Johansen <johansen@immunix.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-22	[PATCH] remove cpu_run_sbin_hotplug()	Andrew Morton	1	-35/+0
	From: Keshavamurthy Anil S <anil.s.keshavamurthy@intel.com> Remove cpu_run_sbin_hotplug() - use kobject_hotplug() instead. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
2004-10-22	[PATCH] avoid problems with kobject_set_name and name with %	Stephen Hemminger	1	-1/+1
	kobject_set_name takes a printf style argument list. There are many callers that pass only one string; if this string contained a '%' character then bad things would happen. The fix is simple. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
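The hazard is the usual one for printf-style interfaces: a caller-supplied string passed in the format position is interpreted, not copied. The userspace sketch below uses a hypothetical `set_name()` as a stand-in for any printf-style name setter (it is not the kobject API) and shows the safe calling pattern of passing "%s" as the format.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical printf-style name setter, standing in for the real one. */
int set_name(char *dst, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(dst, len, fmt, ap);
    va_end(ap);
    return n;
}

/*
 * Safe pattern: the caller's string is always data, never a format,
 * so a '%' inside `name` is copied through unchanged.
 */
int set_name_literal(char *dst, size_t len, const char *name)
{
    return set_name(dst, len, "%s", name);
}
```

Calling `set_name(buf, len, name)` directly with a name such as "eth%d" would make vsnprintf read an argument that was never passed, which is undefined behavior.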
2004-10-21	[PATCH] make __sigqueue_alloc() a general helper	Chris Wright	1	-13/+7
	Posix timers preallocate sigqueue structures during timer creation and keep them for reuse. This allocation happens in user context with no locks held, however it's designated as an atomic allocation. Loosen this restriction, and while we're at it let's do a bit of code consolidation so signal sending uses the same __sigqueue_alloc() helper. Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-21	[PATCH] ppc: Disable IRQ probe on ppc	Benjamin Herrenschmidt	1	-1/+2
	The current "generic" implementation of IRQ probing isn't well suited for ppc in its current form, and causes issues with yenta_socket (and possibly others) on pmac laptops. We didn't have a probe implementation in the past, we probably don't need one anyway, so for now, the fix is to make this optional and enable it on x86 and x86_64 but not ppc and ppc64 (the 4 archs to use the generic IRQ code). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-20	[PATCH] vm thrashing control tuning CONFIG_SWAP=n build fix	Hideo Aoki	1	-2/+2
	Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-20	Update tty layer to not mix kernel and user pointers.	Linus Torvalds	1	-1/+1
	Instead, tty_io.c will always copy user space data to kernel space, leaving the drivers to worry only about normal kernel buffers. No more "from_user" flag, and having the user copy in each driver. This cleans up the code and also fixes a number of locking bugs.
2004-10-20	Fix posix timer direct user space access	Linus Torvalds	1	-7/+11
	This makes us do the proper copy_to_user() for the new posix timers code. Acked by Christoph Lameter <clameter@sgi.com>.
2004-10-19	Merge bk://kernel.bkbits.net/davem/sparc-2.6	Linus Torvalds	1	-0/+1
	into ppc970.osdl.org:/home/torvalds/v2.6/linux
2004-10-19	Merge bk://kernel.bkbits.net/davem/net-2.6	Linus Torvalds	1	-0/+1
	into ppc970.osdl.org:/home/torvalds/v2.6/linux
2004-10-19	[PATCH] make CONFIG_PM_DEBUG depend on CONFIG_PM	Adrian Bunk	1	-0/+1
	Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-19	[PATCH] "console=" parameter ignored	Maciej W. Rozycki	1	-3/+7
	I've noticed that under specific circumstances the "console=" kernel parameter is ignored. This happens when EARLY_PRINTK is enabled and the serial console is the only available. In this case unregister_console() when called for the early console sets preferred_console back to -1 replacing the value that was recorded by console_setup() -- the order of calls is as follows: 1. register_console() -- for the early console, 2. console_setup() -- recording the console index for the real console, 3. unregister_console() -- for the early console, erasing the console index recorded above, 4. register_console() -- for the real console, picking up the first device available, instead of the selected one. I've observed this problem with a DECstation system using ttyS3 -- its default console device from the firmware's point of view. The solution is to restore the setting of "console=" upon unregister_console(). This made a snapshot of 2.4.26 work for me. I wasn't able to test the changes with 2.6 because DECstation drivers don't support it yet, but the code responsible for console selection appears functionally the same. So I've concluded it needs the same change. Here's a patch. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
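The four-step call order above can be modelled in a few lines of C. This is a simplified model, not the kernel code: the structure, its `selected` field, and the `sketch_` functions are hypothetical, and the `restore` flag only switches between the behavior described as broken and the behavior described as the fix.

```c
/* Simplified model of console selection state (hypothetical names). */
struct console_state {
    int preferred;   /* index used at registration; -1 means "first available" */
    int selected;    /* remembered copy of the console= choice; -1 means none */
};

void sketch_state_init(struct console_state *s)
{
    s->preferred = -1;
    s->selected = -1;
}

/* Step 2: console= is parsed and the requested index is recorded. */
void sketch_console_setup(struct console_state *s, int index)
{
    s->preferred = index;
    s->selected = index;
}

/* Step 3: the early console goes away. Without `restore` the recorded
 * index is lost; with it, the console= setting is put back. */
void sketch_unregister_early(struct console_state *s, int restore)
{
    s->preferred = restore ? s->selected : -1;
}

/* Step 4: the real console registers and an index is chosen. */
int sketch_register_real(const struct console_state *s, int first_available)
{
    return s->preferred >= 0 ? s->preferred : first_available;
}
```

With console index 3 requested and index 0 as the first available device, the model picks 0 when the setting is not restored and 3 when it is.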
2004-10-19	[PATCH] #include <asm/bitops.h> -> #include <linux/bitops.h>	Adrian Bunk	1	-1/+1
	There's no reason to directly #include <asm/bitops.h> since it's available on all architectures and also included by #include <linux/bitops.h>. This patch changes #include <asm/bitops.h> to #include <linux/bitops.h>. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-19	[PATCH] vm thrashing control tuning	Hideo Aoki	1	-0/+11
	This patch adds "swap_token_timeout" parameter in /proc/sys/vm. The parameter means expired time of token. Unit of the value is HZ, and the default value is the same as current SWAP_TOKEN_TIMEOUT (i.e. HZ * 300). Signed-off-by: Hideo Aoki <aoki@sdl.hitachi.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-19	[PATCH] detach_pid(): eliminate one find_pid() call	Oleg Nesterov	1	-3/+4
	Now there is no point in calling costly find_pid(type) if __detach_pid(type) returned non zero value. Acked-By: Kirill Korotaev <dev@sw.ru> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-19	[PATCH] detach_pid(): restore optimization	Oleg Nesterov	1	-4/+7
	Kirill's kernel/pid.c rework broke optimization logic in detach_pid(). Non zero return from __detach_pid() was used to indicate that this pid can probably be freed. Current version always (modulo idle threads) returns a non zero value, thus resulting in unnecessary pid_hash scanning. Also, uninlining __detach_pid() reduces pid.o text size from 2492 to 1600 bytes. Acked-By: Kirill Korotaev <dev@sw.ru> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-19	[PATCH] Posix compliant cpu clocks	Christoph Lameter	1	-15/+131
	POSIX clocks are to be implemented in the following way according to V3 of the Single Unix Specification: 1. CLOCK_PROCESS_CPUTIME_ID Implementations shall also support the special clockid_t value CLOCK_PROCESS_CPUTIME_ID, which represents the CPU-time clock of the calling process when invoking one of the clock_() or timer_() functions. For these clock IDs, the values returned by clock_gettime() and specified by clock_settime() represent the amount of execution time of the process associated with the clock. 2. CLOCK_THREAD_CPUTIME_ID Implementations shall also support the special clockid_t value CLOCK_THREAD_CPUTIME_ID, which represents the CPU-time clock of the calling thread when invoking one of the clock_() or timer_() functions. For these clock IDs, the values returned by clock_gettime() and specified by clock_settime() shall represent the amount of execution time of the thread associated with the clock. These times mentioned are CPU processing times and not the time that has passed since the startup of a process. Glibc currently provides its own implementation of these two clocks which is designed to return the time that passed since the startup of a process or a thread. Moreover Glibc's clocks are bound to CPU timers which is problematic when the frequency of the clock changes or the process is moved to a different processor whose cpu timer may not be fully synchronized to the cpu timer of the current CPU. This patchset results in both clocks working reliably. The patch also implements access to the thread and process clocks of other Linux processes by using negative clockids: 1. For CLOCK_PROCESS_CPUTIME_ID: -pid 2. For CLOCK_THREAD_CPUTIME_ID: -(pid + PID_MAX_LIMIT) This allows clock_getcpuclockid(pid) to return -pid and pthread_getcpuclockid(pid) to return -(pid + PID_MAX_LIMIT) to allow access to the corresponding clocks. Todo: - The timer API to generate events by a non tick based timer is not usable in its current state. 
The posix timer API seems to be only useful at this point to define clock_get/set. Need to revise this. - Implement timed interrupts in mmtimer after API is revised. The mmtimer patch is unchanged from V6 and stays as is in 2.6.9-rc3-mm2. But I expect to update the driver as soon as the interface to setup hardware timer interrupts is usable. Single Thread Testing CLOCK_THREAD_CPUTIME_ID= 0.494140878 resolution= 0.000976563 CLOCK_PROCESS_CPUTIME_ID= 0.494140878 resolution= 0.000976563 Multi Thread Testing Starting Thread: 0 1 2 3 4 5 6 7 8 9 Joining Thread: 0 1 2 3 4 5 6 7 8 9 0 Cycles= 0 Thread= 0.000000000ns Process= 0.495117441ns 1 Cycles=1000000 Thread= 0.140625072ns Process= 2.523438792ns 2 Cycles=2000000 Thread= 0.966797370ns Process= 8.512699671ns 3 Cycles=3000000 Thread= 0.806641038ns Process= 7.561527309ns 4 Cycles=4000000 Thread= 1.865235330ns Process= 12.891608163ns 5 Cycles=5000000 Thread= 1.604493009ns Process= 11.528326215ns 6 Cycles=6000000 Thread= 2.086915131ns Process= 13.500983475ns 7 Cycles=7000000 Thread= 2.245118337ns Process= 13.947272766ns 8 Cycles=8000000 Thread= 1.604493009ns Process= 12.252935961ns 9 Cycles=9000000 Thread= 2.160157356ns Process= 13.977546219ns Clock status at the end of the timer tests: Gettimeofday() = 1097084999.489938000 CLOCK_REALTIME= 1097084999.490116229 resolution= 0.000000040 CLOCK_MONOTONIC= 177.071675109 resolution= 0.000000040 CLOCK_PROCESS_CPUTIME_ID= 13.978522782 resolution= 0.000976563 CLOCK_THREAD_CPUTIME_ID= 0.497070567 resolution= 0.000976563 CLOCK_SGI_CYCLE= 229.967982280 resolution= 0.000000040 PROCESS clock of 1 (init)= 4.833986850 resolution= 0.000976563 THREAD clock of 1 (init)= 0.009765630 resolution= 0.000976563 Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
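The negative clock id scheme described in this commit (-pid for a process clock, -(pid + PID_MAX_LIMIT) for a thread clock) can be sketched as a pair of encode helpers plus a decoder. The value of `SKETCH_PID_MAX_LIMIT` is an assumption for illustration, the function names are hypothetical, and the decoder is inferred from the two formulas rather than taken from the patch; it assumes 1 <= pid < SKETCH_PID_MAX_LIMIT.

```c
#define SKETCH_PID_MAX_LIMIT 32768L   /* assumed value; the real limit is configuration dependent */

/* Clock id for the process CPU-time clock of `pid`: -pid. */
long process_clock_id(long pid)
{
    return -pid;
}

/* Clock id for the thread CPU-time clock of `pid`: -(pid + PID_MAX_LIMIT). */
long thread_clock_id(long pid)
{
    return -(pid + SKETCH_PID_MAX_LIMIT);
}

/*
 * Inverse mapping. Returns the pid and sets *is_thread, or returns -1
 * if `id` is not one of the negative ids described above.
 */
long clock_id_to_pid(long id, int *is_thread)
{
    long magnitude;

    if (id >= 0)
        return -1;

    magnitude = -id;
    if (magnitude > SKETCH_PID_MAX_LIMIT) {
        *is_thread = 1;
        return magnitude - SKETCH_PID_MAX_LIMIT;
    }
    *is_thread = 0;
    return magnitude;
}
```

Because valid pids stay below the limit, the two ranges never overlap, which is what lets a single negative number identify both the target and the kind of clock.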
2004-10-19	[PATCH] BSD Secure Levels LSM: add time hooks	Michael A. Halcrow	1	-5/+13
	I have received positive feedback from various individuals who have applied my BSD Secure Levels LSM patch, and so at this point I am submitting it to you with a request to merge it in. Nothing has changed in this patch since when I last posted it to the LKML, so I am not re-sending it there. This first patch adds hooks to catch attempts to set the system clock back. Signed-off-by: Michael A. Halcrow <mahalcro@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-19	[PATCH] lighten mmlist_lock	Hugh Dickins	1	-26/+11
	Let's lighten the global spinlock mmlist_lock. What's it for? 1. Its original role is to guard mmlist. 2. It later got a second role, to prevent get_task_mm from raising mm_users from the dead, just after it went down to 0. Firstly consider the second: __exit_mm sets tsk->mm NULL while holding task_lock before calling mmput; so mmlist_lock only guards against the exceptional case, of get_task_mm on a kernel workthread which did AIO's use_mm (which transiently sets its tsk->mm without raising mm_users) on an mm now exiting. Well, I don't think get_task_mm should succeed at all on use_mm tasks. It's mainly used by /proc/pid and ptrace, seems at best confusing for those to present the kernel thread as having a user mm, which it won't have a moment later. Define PF_BORROWED_MM, set in use_mm, clear in unuse_mm (though we could just leave it), get_task_mm give NULL if set. Secondly consider the first: and what's mmlist for? 1. Its original role was for swap_out to scan: rmap ended that in 2.5.27. 2. In 2.4.10 it got a second role, for try_to_unuse to scan for swapoff. So, make mmlist a list of mms which maybe have pages on swap: add mm to mmlist when first swap entry is assigned in try_to_unmap_one (pageout), or in copy_page_range (fork); and mmput remove it from mmlist as before, except usually list_empty and there's no need to lock. drain_mmlist added to swapoff, to empty out the mmlist if no swap is then in use. mmput leave mm on mmlist until after its exit_mmap, so try_to_unmap_one can still add mm to mmlist without worrying about the mm_users 0 case; but try_to_unuse must avoid the mm_users 0 case (when an mm might be removed from mmlist, and freed, while it's down in unuse_process): use atomic_inc_return now all architectures support that. Some of the detailed comments in try_to_unuse have grown out of date: updated and trimmed some, but leave SWAP_MAP_MAX for another occasion. 
Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-19	[PATCH] module_param_array() should take a pointer	Rusty Russell	1	-2/+2
	module_param_array() takes a variable to put the number of elements in. Looking through the uses, many people don't care, so they declare a dummy or share one variable between several parameters. The latter is problematic because sysfs uses that number to decide how many to display. The solution is to change the variable arg to a pointer, and if the pointer is NULL, use the "max" value. This change is fairly small, but fixing up the callers is a lot of (trivial) churn. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
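The rule stated above, "if the pointer is NULL, use the max value", comes down to a single conditional. The helper below is a hypothetical illustration of that rule only; it is not the kernel macro or any function from the patch.

```c
#include <stddef.h>

/*
 * Number of array elements to display: the caller's count if a
 * counter was supplied, otherwise the array's declared maximum.
 */
unsigned int elements_to_show(const unsigned int *nump, unsigned int max)
{
    return nump ? *nump : max;
}
```

A caller that does not care about the count passes NULL and no longer needs a dummy or shared variable, which is the problem the commit message describes.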
2004-10-19	Merge	David S. Miller	1	-0/+1

2004-10-19	[SPARC64]: Re-export force_sig to modules.	David S. Miller	1	-0/+1
	Used by sparc envctl drivers, specifically envctl.c and bbc_envctrl.c under drivers/sbus/char/ Signed-off-by: David S. Miller <davem@davemloft.net>
2004-10-18	Trivial Makefile merge	Linus Torvalds	16	-351/+928

2004-10-18	[PATCH] profile: 512x Altix timer interrupt livelock fix	William Lee Irwin III	1	-1/+257
	I've been informed that /proc/profile livelocks some systems in the timer interrupt, usually at boot. The following patch attempts to amortize the atomic operations done on the profile buffer to address this stability concern. This patch has nothing to do with performance; kernels using periodic timer interrupts are under realtime constraints to complete whatever work they perform within timer interrupts before the next timer interrupt arrives lest they livelock, performing no work whatsoever apart from servicing timer interrupts. The latency of the cacheline bounce for prof_buffer contributes to the time spent in the timer interrupt, hence it must be amortized when remote access latencies or deviations from fair exclusive cacheline acquisition may cause cacheline bounces to take longer than the interval between timer ticks. What this patch does is to create a pair of per-cpu open-addressed hashtables indexed by profile buffer slot holding values representing the number of pending profile buffer hits for the profile buffer slot. When this hashtable overflows, one iterates over the hashtable accounting each of the pairs of profile buffer slots and hit counts to the global profile buffer. Zero is a legitimate profile buffer slot, so zero hit counts represent unused hashtable entries. The hashtable is furthermore protected from flush IPI's by interrupt disablement. In order to flush the pending profile hits for read_profile(), this patch flips between the pairs of per-cpu profile buffer by signalling all cpus to flip via IPI at the time of read_profile(), followed by doing all the work to flush the profile hits from the older per-cpu buffers in the context of the caller of read_profile(), with exclusion provided by a semaphore ensuring that only one caller of profile_flip_buffers() may execute at a time, and using interrupt disablement to prevent buffer flip IPI's from altering the hashtables or flip state while an update is in progress. 
The flip state is per-cpu so that remote cpus need only disable interrupts locally for synchronization, which is both simple and busywait-free for remote cpus. The flip states all change in tandem when some cpu requests the hashtables be flipped, and the requester waits for the completion of smp_call_function() for notification that all cpus have finished flipping between their hashtables. The IPI handler merely toggles the flip state (which is an array index) between 0 and 1. This is expected to be a much stronger amortization than merely reducing the frequency of profile buffer access by a factor of the size of the hashtable because numerous hits may be held for each of its entries. This reduces what was before the patch a number of atomic increments equal to what after the patch becomes the sum of the hits held for each entry in the hashtable, to a number of atomic_add()'s equal to the number of entries in the per_cpu hashtable. This is nondeterministic, but as the profile hits tend to be concentrated in a very small number of profile buffer slots during any given timing interval, is likely to represent a very large number of atomic increments. This amortization of atomic increments does not depend on the hash function, only the sharp peakedness of the distribution of profile buffer hits. This algorithm has two advantages over full-size per-cpu profile buffers. The first is that the space footprint is much smaller. Per-cpu profile buffers would increase the space requirements by a factor of num_online_cpus(), where this algorithm only requires one page per cpu. The second is that reading the profile state is much faster, because the state that must be traversed is exactly the above space consumers, and the relative reduction in size concomitantly reduces the time required for a read operation. 
I also took the liberty of adding some commentary to the comments at the beginning of the file reflecting the major work done on profile.c in recent months and describing what the file implements. The reporters of this issue have verified that this resolves their timer interrupt livelock on 512x Altixen. In my own testing on 4x logical x86-64, this patch saw a rate of about 18 flushes per minute under load, or about one flush every 3 seconds, for about 38.4 atomic accesses to the profile buffer per second per cpu in one of the algorithm's worst cases, about 3.84% of the number of atomic profile buffer accesses per second per cpu as a normal kernel would commit. This represents a twenty-six-fold increase in the scalability on SMP systems with 4KB PAGE_SIZE, i.e. with a 4KB PAGE_SIZE, the number of atomic profile buffer accesses per second per cpu is reduced by a factor of 26, thereby increasing the number of cpus a system must have before it would experience a timer interrupt livelock by a factor of 26, with the proviso that cacheline bounces must take the same amount of time to service. This increase in the scalability of the kernel is expected to be much larger for ia64, which has a large PAGE_SIZE, because the distribution of profile buffer hits is so sharply peaked that doubling the hashtable size will much more than double the amortization factor. In fact, only 19 flushes were observed on a 64x Altix over an approximately 10 minute AIM7 run, and 1 flush on a 512x Altix over the course of an entire AIM7 run, for truly vast effective amortization factors. A prior version of this patch, which did not include the node-local hashtable allocation and bounded collision chains has been successfully tested on 64x and 512x ia64 vs 2.6.9-rc2, 8x ia64 vs. 2.6.9-rc2-mm1, 4x x86-64 vs. 2.6.9-rc2-mm1, and 6x sparc64 vs. 2.6.9-rc2-mm1. 
This patch minus the hashtable initialization fix has been successfully tested on 2x ppc64, 2x alpha, 8x ia64, 6x sparc64, and 4x x86-64, all vs. 2.6.9-rc2-mm1. This precise version of the patch has been successfully tested on 8x ia32 against 2.6.9-rc2-mm1 and 6x sparc64 vs. both 2.6.9-rc2-mm1 and 2.6.9-rc2-mm2. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
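The amortization idea in this commit can be sketched in userspace C: hits are first counted in a small open-addressed table keyed by profile buffer slot, and the shared buffer is touched only when the table is flushed. This is a single-table illustration under stated assumptions. The sizes, names and linear probing are hypothetical, and the per-cpu pairing, buffer flipping, IPIs and atomic operations of the real patch are left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS   1024   /* size of the shared profile buffer (assumed) */
#define NR_ENTRIES 8      /* entries in the small pending table (assumed) */

struct pending_hit {
    unsigned int slot;    /* profile buffer slot this entry counts */
    unsigned int hits;    /* pending hits; 0 marks an unused entry */
};

unsigned long global_buffer[NR_SLOTS];   /* stands in for the shared buffer */
unsigned long global_updates;            /* how often the shared buffer was written */

/* Account every pending (slot, hits) pair to the shared buffer. */
void flush_pending(struct pending_hit *table)
{
    for (size_t i = 0; i < NR_ENTRIES; i++) {
        if (table[i].hits) {
            global_buffer[table[i].slot] += table[i].hits;
            global_updates++;
            table[i].hits = 0;
        }
    }
}

/* Record one hit, writing to the shared buffer only on overflow. */
void record_hit(struct pending_hit *table, unsigned int slot)
{
    size_t start = slot % NR_ENTRIES;

    assert(slot < NR_SLOTS);
    for (size_t probe = 0; probe < NR_ENTRIES; probe++) {
        struct pending_hit *e = &table[(start + probe) % NR_ENTRIES];

        if (e->hits == 0) {        /* unused entry: claim it */
            e->slot = slot;
            e->hits = 1;
            return;
        }
        if (e->slot == slot) {     /* existing entry: just count */
            e->hits++;
            return;
        }
    }
    /* Table full of other slots: flush everything, then start over. */
    flush_pending(table);
    table[start].slot = slot;
    table[start].hits = 1;
}
```

When hits concentrate on a few slots, as the commit message reports, many calls to `record_hit()` collapse into one shared-buffer update per occupied entry at flush time.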
2004-10-18	[PATCH] taint on bad_page	Nick Piggin	1	-2/+4
	Hugh and I both thought this would be generally useful. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18	[PATCH] taint: fix forced rmmod	Nick Piggin	1	-1/+3
	This taint didn't appear to be reported. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18	[PATCH] x86-64/i386: add mce tainting	Andi Kleen	1	-2/+10
	This patch adds machine check tainting. When a handled machine check occurs the oops gets a new 'M' flag. This is useful to ignore machines with hardware problems in oops reports. On i386 a thermal failure also sets this flag. Done for x86-64 and i386 so far. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18	[PATCH] cleanup: move call to update_process_times.	Martin Schwidefsky	1	-5/+0
	For non-smp kernels the call to update_process_times is done in the do_timer function. It is more consistent with smp kernels to move this call to the architecture file which calls do_timer. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>