path: root/kernel
Age | Commit message | Author | Files | Lines
2004-08-22 | [PATCH] token based thrashing control | Rik van Riel | 1 | -0/+2
The following experimental patch implements token based thrashing protection, using the algorithm described in: http://www.cs.wm.edu/~sjiang/token.htm When there are pageins going on, a task can grab a token that protects the task from pageout (except by itself) until it is no longer doing heavy pageins, or until the maximum hold time of the token is over. If the maximum hold time is exceeded, the task isn't eligible to hold the token for a while longer, since holding it wasn't doing the task much good anyway. I have run a very unscientific benchmark on my system to test the effectiveness of the patch, timing how long a 230MB two-process qsbench run takes, with and without the token thrashing protection present.
normal 2.6.8-rc6: 6m45s
2.6.8-rc6 + token: 4m24s
This is a quick hack, implemented without having talked to the inventor of the algorithm. He's copied on the mail and I suspect we'll be able to do better than my quick implementation ... Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22 | [PATCH] rcu: document RCU api | Dipankar Sarma | 1 | -15/+24
Patch from Paul for additional documentation of the API. Updated based on feedback, and to apply to 2.6.8-rc3. I will be adding more detailed documentation to the Documentation directory in a separate patch. Signed-off-by: Paul McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22 | [PATCH] rcu: introduce call_rcu_bh() | Dipankar Sarma | 2 | -6/+54
Introduces call_rcu_bh() to be used when critical sections are mostly in softirq context. This patch introduces a new API - call_rcu_bh(). This is to be used for RCU callbacks for which the critical sections are mostly in softirq context. These callbacks consider completion of a softirq handler to be a quiescent state. So, in order to make reader critical sections safe in process context, rcu_read_lock_bh() and rcu_read_unlock_bh() must be used. Use of softirq handler completion as a quiescent state speeds up RCU grace periods and prevents too many callbacks getting queued up in softirq-heavy workloads like the network stack. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
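For reference, a minimal sketch of how the new primitives pair up (the struct and function names here are hypothetical, not from the patch):

    #include <linux/kernel.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct flow {
        struct rcu_head rcu;
        int id;
    };

    /* reader, typically running in softirq context (e.g. network RX) */
    static void flow_reader(void)
    {
        rcu_read_lock_bh();
        /* ... dereference RCU-protected flow pointers ... */
        rcu_read_unlock_bh();
    }

    static void flow_free_rcu(struct rcu_head *head)
    {
        kfree(container_of(head, struct flow, rcu));
    }

    static void flow_delete(struct flow *f)
    {
        /* unlink f from the lookup structure first, then: */
        call_rcu_bh(&f->rcu, flow_free_rcu);
    }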
2004-08-22 | [PATCH] rcu: clean up code | Dipankar Sarma | 2 | -109/+122
Avoids per_cpu calculations and also prepares for call_rcu_bh(). At OLS, Rusty had suggested getting rid of many per_cpu() calculations in RCU code and making the code simpler. I had already done that for the rcu-softirq patch earlier, so I am splitting that into two patches. This first patch cleans up the macros and uses pointers to the rcu per-cpu data directly to manipulate the callback queues. This is useful for the call-rcu-bh patch (to follow) which introduces a new RCU mechanism - call_rcu_bh(). Both generic and softirq rcu can then use the same code; they just work on different global and per-cpu data. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22 | [PATCH] RCU: low latency rcu | Dipankar Sarma | 1 | -8/+19
This patch makes RCU callbacks friendly to the scheduler. It helps low latency by limiting the number of callbacks invoked per tasklet handler. Since we cannot schedule during a single softirq handler, this reduces the size of the non-preemptible section significantly, especially under heavy RCU updates. The limiting is done through a kernel parameter, rcupdate.maxbatch, which is the maximum number of RCU callbacks to invoke during a single tasklet handler. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22 | [PATCH] RCU - cpu offline fix | Dipankar Sarma | 1 | -10/+9
This fixes the RCU cpu offline code which was broken by singly-linked RCU changes. Nathan pointed out the problems and submitted a patch for this. This is an optimal fix - no need to iterate through the list of callbacks, just use the tail pointers and attach the list from the dead cpu. Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com> Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22 | [PATCH] RCU - cpu-offline-cleanup | Dipankar Sarma | 1 | -2/+6
There is a series of patches in my tree and these 3 are the first ones that should probably be merged down the road. Descriptions are on top of the patches. Please include them in -mm. A lot of RCU code will be cleaned up later in order to support call_rcu_bh(), the separate RCU interface that considers softirq handler completion a quiescent state. This patch: Minor cleanup of the hotplug code to remove #ifdef in cpu event notifier handler. If CONFIG_HOTPLUG_CPU is not defined, CPU_DEAD case will be optimized off. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22 | [PATCH] prio_tree: kill vma_prio_tree_init() | Rajesh Venkatasubramanian | 1 | -1/+0
vma_prio_tree_insert() relies on the fact that the vma was vma_prio_tree_init()'ed. The content of vma->shared should be considered undefined until this vma is inserted into i_mmap/i_mmap_nonlinear. It's better to do proper initialization in vma_prio_tree_add/insert. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Rajesh Venkatasubramanian <vrajesh@umich.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22 | [PATCH] vprintk support | Matt Mackall | 1 | -2/+12
Add a vprintk call. This lets us pass varargs straight through to the console without using vsnprintf and an intermediate buffer. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
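A sketch of the kind of wrapper this enables (the wrapper name is hypothetical; vprintk() takes a format string and a va_list):

    #include <linux/kernel.h>
    #include <stdarg.h>

    /* forward varargs to the console without a local vsnprintf() buffer */
    static int mydrv_log(const char *fmt, ...)
    {
        va_list args;
        int r;

        va_start(args, fmt);
        r = vprintk(fmt, args);
        va_end(args);
        return r;
    }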
2004-08-22 | [PATCH] gettimeofday nanoseconds patch | Christoph Lameter | 3 | -17/+41
This issue was discussed on lkml and linux-ia64. The patch introduces "getnstimeofday" and removes all the code scaling gettimeofday to nanoseconds. It makes it possible for the posix-timer functions to return higher accuracy. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
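A minimal usage sketch (assuming the interface takes a struct timespec out-parameter, as described):

    #include <linux/kernel.h>
    #include <linux/time.h>

    static void log_now(void)
    {
        struct timespec ts;

        getnstimeofday(&ts);    /* nanosecond resolution, no scaling */
        printk(KERN_INFO "now: %ld.%09ld\n",
               (long)ts.tv_sec, ts.tv_nsec);
    }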
2004-08-22 | [PATCH] Move cache_reap out of timer context | Dimitri Sivanich | 1 | -0/+20
I'm submitting two patches associated with moving cache_reap functionality out of timer context. Note that these patches do not make any further optimizations to cache_reap at this time. The first patch adds a function similar to schedule_delayed_work to allow work to be scheduled on another cpu. The second patch makes use of schedule_delayed_work_on to schedule cache_reap to run from keventd. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
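A sketch of the new call (assuming the 2.6.8-era workqueue API, where INIT_WORK takes a data argument and delayed work uses plain struct work_struct; names are illustrative):

    #include <linux/workqueue.h>

    static void reap_fn(void *data)
    {
        /* runs in keventd (process) context on the chosen cpu */
    }

    static struct work_struct reap_work;

    static void start_reap(int cpu)
    {
        INIT_WORK(&reap_work, reap_fn, NULL);
        schedule_delayed_work_on(cpu, &reap_work, HZ);  /* ~1s from now */
    }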
2004-08-22 | [PATCH] remove sync() from panic | Christian Bornträger | 1 | -7/+1
Various people have reported deadlocks, and it has always seemed a bit risky to try to sync the filesystems at this stage anyway. "I have seen panic failing two times lately on an SMP system. The box panic'ed but was running happily on the other cpus. The culprit of this failure is the fact that these panics have been caused by a block device or a filesystem (e.g. using errors=panic). In these cases the likelihood of a failure/hang of sys_sync() is high. This is exactly what happened in both cases I have seen. Meanwhile the other cpus are happily continuing destroying data, as the kernel has a severe problem but it's not aware of that, since smp_send_stop happens after sys_sync." Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22 | [PATCH] Enable all events for initramfs | Hannes Reinecke | 1 | -3/+1
Currently most driver events are not sent out when using initramfs as driver_init() (which triggers the events) is called before init_workqueues. This patch rearranges the init calls so that the hotplug event queue is enabled prior to calling driver_init(), hence we're getting all hotplug events again. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22 | [PATCH] NMI trigger switch support for debugging (updated) | Akiyama Nobuyuki | 1 | -0/+16
I made a patch for debugging with the help of an NMI trigger switch. When the kernel hangs severely, keyboard operation (e.g. Ctrl-Alt-Del) doesn't work properly. This patch enables debugging information to be displayed on the console in this case. I think this feature is necessary as standard functionality. Please feel free to use this patch and let me know if you have any comments.
Background: when trouble occurs in the kernel, we usually begin to investigate with the following information:
- panic >> panic message.
- oops >> CPU registers and stack trace.
- hang >> **NONE** no standard method established.
How it works: most IA32 servers have an NMI switch that raises an NMI interrupt. The NMI interrupt can get through even when the kernel is in a serious state, for example a deadlock with interrupts disabled. When the NMI switch is pressed after this feature is activated, CPU registers and a stack trace are displayed on the console and then a panic occurs. This feature is activated or deactivated with sysctl. On the IA32 architecture, only the following are defined as reasons for an NMI interrupt:
- memory parity error
- I/O check error
The reason code of the NMI switch is not defined, so this patch assumes that all undefined NMI interrupts are fired by the NMI switch. However, oprofile and the NMI watchdog also use undefined NMI interrupts; therefore this feature cannot be used at the same time as oprofile or the NMI watchdog. This feature hands the NMI interrupt over to oprofile and the NMI watchdog, so when they have been activated, this feature doesn't work even if it is activated.
Supported architecture: IA32
Setup: set up the system control parameter as follows:
# sysctl -w kernel.unknown_nmi_panic=1
kernel.unknown_nmi_panic = 1
If the NMI switch is pressed, CPU registers and a stack trace will be displayed on the console and then a panic occurs. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22 | [PATCH] fix reading string module parameters in sysfs | Arnd Bergmann | 1 | -0/+7
Reading the contents of a module_param_string through sysfs currently oopses because the param_get_charp() function cannot operate on a kparam_string struct. This introduces the required param_get_string. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
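For reference, the declaration side that exercises this path (the parameter here is hypothetical; module_param_string() binds a fixed-size char array to a parameter):

    #include <linux/moduleparam.h>

    static char ifname[16] = "eth0";
    /* now readable via /sys/module/<module>/parameters/ifname */
    module_param_string(ifname, ifname, sizeof(ifname), 0444);
    MODULE_PARM_DESC(ifname, "interface name to bind to");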
2004-08-12 | [PATCH] ppc32: Fix warning on CONFIG_PPC32 && CONFIG_6xx | Tom Rini | 1 | -1/+1
In the *ppos cleanups, proc_dol2crvec was updated, but the prototype found at the top of kernel/sysctl.h was not, generating a warning. This corrects the prototype to match the code. (I'm gonna take a stab at moving these into arch/ppc shortly) Signed-off-by: Tom Rini <trini@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-07 | Make sysctl pass the pos pointer around properly. | Linus Torvalds | 1 | -50/+50
Nobody ever fixed the big FIXME in sysctl - but we really need to pass around the proper "loff_t *" to all the sysctl functions if we want them to be well-behaved wrt the file pointer position. This is all preparation for making direct f_pos accesses go away.
2004-08-01 | [PATCH] Off-by-one error for SIGXCPU / RLIMIT_CPU checking | Michael Kerrisk | 1 | -2/+2
There is a longstanding off-by-one error that results from an incorrect comparison when checking whether a process has consumed CPU time in excess of its RLIMIT_CPU limits. This means, for example, that if we use setrlimit() to set the soft CPU limit (rlim_cur) to 5 seconds and the hard limit (rlim_max) to 10 seconds, then the process only receives a SIGXCPU signal after consuming 6 seconds of CPU time, and, if it continues consuming CPU after handling that signal, only receives SIGKILL after consuming 11 seconds of CPU time. The fix is trivial. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
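The scenario from the description, as a userspace sketch:

    #include <sys/resource.h>

    int main(void)
    {
        /* soft limit 5s: SIGXCPU should now arrive after 5s of CPU time,
         * not 6s; hard limit 10s: SIGKILL after 10s, not 11s */
        struct rlimit rl = { .rlim_cur = 5, .rlim_max = 10 };

        setrlimit(RLIMIT_CPU, &rl);
        for (;;)
            ;       /* spin, consuming CPU time until the signals arrive */
    }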
2004-08-01 | [PATCH] Remove symbol_is() | Brian Gerst | 1 | -3/+0
Remove the unused symbol_is() macro. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-01 | [PATCH] Fix BSD accounting cross-platform compatibility | Tim Schmielau | 1 | -3/+2
BSD accounting cross-platform compatibility is a new feature of 2.6.8 and thus not crucial, but it'd be nice not to have kernels writing wrong file formats out in the wild. The endianness detection logic I wanted to propose for userspace turned out to be bogus. So just do it the simple way and store endianness info together with the version number. Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-01 | [PATCH] sched: use for_each_cpu | Anton Blanchard | 1 | -3/+3
The per-cpu schedule counters need to be summed up over all possible cpus. When testing hotplug cpu removal I saw the nr_uninterruptible sum over online cpus go negative, which made the load average go nuts. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-30 | [PATCH] sparse: misc cleanups | Alexander Viro | 1 | -1/+1
all sorts of minor stuff - basically, all chunks are independent here, but IMO that one is not worth splitting. Contains:
* pmac_cpufreq.c: declaration in the middle of a block.
* sys_ia32.c: couple of trivial annotations.
* ipmi_si_intf.c: should be using asm/irq.h instead of linux/irq.h
* synclink_cs.c: assignment-in-conditional with nobody ever looking at the variable we are assigning to afterwards; variable removed.
* sbni.c: s/__volatile/__volatile__
* matroxfb_base.h: got rid of ((u32 *)p)++
* asm-ppc/checksum.h and asm-sparc64/floppy.h: NULL noise removal
* amd64 compat.h: missing L in long constant.
* mtd-abi.h: annotated ioctl structure
* sysctl.c: corrected annotations in extern
Signed-off-by: Al Viro <viro@parcelfarce.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28 | [PATCH] remove dead code from copy_process() | Luiz Capitulino | 1 | -1/+0
Don't assign to `retval' twice in a row. Signed-off-by: Luiz Capitulino <lcapitulino@prefeitura.sp.gov.br> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28 | [PATCH] fix for buffer limit for long in sysctl.c | Stéphane Eranian | 1 | -2/+2
Fix a bug in do_proc_doulongvec_minmax() where the string buffer was too short to parse a 64-bit number expressed in decimal. That was causing problems with entries in /proc/sys that use long and allow large numbers (such as -1). Signed-off-by: Stephane Eranian <eranian@hpl.hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28 | [PATCH] compat_clock_getres shouldn't return -EFAULT if res == NULL | Arun Sharma | 1 | -1/+1
For clock_getres(clockid_t clock_id, struct timespec *res), the specification says "If res is NULL, the clock resolution is not returned." So this kind of call should succeed. The current implementation returns -EFAULT. The patch fixes the bug in compat_clock_getres(). Signed-off-by: Gordon Jin <gordon.jin@intel.com> Signed-off-by: Arun Sharma <arun.sharma@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
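The case being fixed, as a userspace sketch (link with -lrt on older glibc):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        /* per POSIX, a NULL res pointer is valid: the resolution is
         * simply not reported; the buggy compat path returned EFAULT */
        if (clock_getres(CLOCK_REALTIME, NULL) != 0)
            perror("clock_getres");
        return 0;
    }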
2004-07-28 | [PATCH] sched: initialize sched domain table | Jack Steiner | 1 | -0/+1
Here is a trivial patch that is required to boot the latest 2.6.7 tree on the SGI 512p system. Initialize the busy_factor in the sched_domain_init table. Otherwise, booting hangs doing excessive load balance operations. Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-28 | [PATCH] fixes for rcu_offline_cpu, rcu_move_batch | Nathan Lynch | 1 | -10/+11
rcu_offline_cpu and rcu_move_batch have been broken since the list_head's in struct rcu_head and struct rcu_data were replaced with singly-linked lists:
  CC      kernel/rcupdate.o
kernel/rcupdate.c: In function `rcu_move_batch':
kernel/rcupdate.c:222: warning: passing arg 2 of `list_add_tail' from incompatible pointer type
kernel/rcupdate.c: In function `rcu_offline_cpu':
kernel/rcupdate.c:239: warning: passing arg 1 of `rcu_move_batch' from incompatible pointer type
kernel/rcupdate.c:240: warning: passing arg 1 of `rcu_move_batch' from incompatible pointer type
kernel/rcupdate.c:236: warning: label `unlock' defined but not used
The kernel crashes when you try to offline a cpu, not surprisingly. It also looks like rcu_move_batch isn't preempt-safe, so I touched that up and got rid of an unused label in rcu_offline_cpu. Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-17 | Clean up ptrace child exit case. | Linus Torvalds | 1 | -11/+8
This also fixes it for when the real parent is ignoring SIGCHLD - noted by David Mosberger.
2004-07-15 | [PATCH] misc sparse cleanups | Alexander Viro | 1 | -1/+1
- missing ; between default: and } in sun4setup.c
- cast of pointer to unsigned long long instead of unsigned long in x86_64 signal.c
- missed annotations for ioctl structure in sparc64 openpromio.h (should've been in the same patch as the rest of drivers/sbus/* annotations)
- 0->NULL in list.h and pmdisk.c
2004-07-13 | [PATCH] pointer-to-int done the canonical way | Alexander Viro | 1 | -1/+1
Extraction of int from pointer is slightly broken in several places.
2004-07-12 | [PATCH] sparse: signal annotation | Alexander Viro | 1 | -1/+1
- ss_sp in struct sigaltstack made __user
- ->si_addr and ->sival_ptr made __user
- ->sa_restorer and ->sa_handler made __user
- changes propagated; users of these annotated on i386/amd64/alpha/sparc/sparc64
2004-07-12 | [PATCH] ia64: Reduce TLB flushing during process migration | Jack Steiner | 1 | -0/+2
This patch adds an architecture-specific callout after explicit processor migrations. The callout allows architectures (or platforms) to update TLB specific information (ex., cpu_vm_mask). Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: David Mosberger <davidm@hpl.hp.com>
2004-07-10 | [PATCH] kill IKCONFIG_VERSION | Adrian Bunk | 1 | -2/+0
The patch below (already ACK'ed by Randy Dunlap) kills the unused IKCONFIG_VERSION from kernel/configs.c . This patch is based on a previous patch by Anton Blanchard and an idea of Bartlomiej Zolnierkiewicz. (I hope I haven't forgotten anyone who contributed to this patch. ;-) ) Signed-off-by: Adrian Bunk <bunk@fs.tum.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-06 | sparse: annotate signal handler and ss_sp as user pointers | Linus Torvalds | 1 | -3/+3
2004-07-06 | [PATCH] NUMA API: fix use-after-free bug | Andi Kleen | 1 | -3/+4
Move the memory policy freeing to later in exit to make sure the last memory allocations don't use an already-freed policy. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-04 | [PATCH] gcc 3.5 fixes #2 | Anton Blanchard | 1 | -1/+1
gcc 3.5 is warning about unused static variables; add __attribute_unused__ in the 2 places to silence it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-03 | [PATCH] ppc32: compilation failure on ppc32 | Christoph Hellwig | 1 | -2/+4
This fixes compilation on ppc32. The power/smp.o file should be linked only if both SMP and SWSUSPEND are configured in. It used to be linked even without SWSUSPEND.
2004-07-02 | [PATCH] sparse: remaining integer zero / NULL fixes in allmodconfig & vmlinux | Mika Kukkonen | 1 | -1/+1
This fixes the remaining 0 to NULL things that were found with 'make allmodconfig' and 'make C=1 vmlinux'.
2004-07-01 | [PATCH] Remaining sparse warnings in allnoconfig | Mika Kukkonen | 1 | -1/+1
Attached is a smallish patch for a couple of trivial sparse warnings in the allnoconfig build and, more importantly, an "excuses" text file explaining why the rest have not been fixed. Basically all of them (with the exception of the one in Andrew's tree) need some serious re-engineering. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-01 | [PATCH] sparse: fix sparse warnings in kernel/power/* | Mika Kukkonen | 2 | -5/+8
  CHECK   kernel/power/swsusp.c
kernel/power/swsusp.c:320:15: warning: expected lvalue for member dereference
kernel/power/swsusp.c:337:15: warning: expected lvalue for member dereference
kernel/power/swsusp.c:359:14: warning: expected lvalue for member dereference
kernel/power/swsusp.c:925:12: warning: assignment expression in conditional
[...]
  CHECK   kernel/power/pmdisk.c
kernel/power/pmdisk.c:795:12: warning: assignment expression in conditional
Trivial sparse fixes for two files under kernel/power. Patch attached. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-01 | [PATCH] sparse: define max kernel symbol length and clean up errors in kernel/kallsyms.c | Mika Kukkonen | 1 | -30/+19
  CHECK   kernel/kallsyms.c
kernel/kallsyms.c:136:7: warning: bad constant expression
kernel/kallsyms.c:136:7: warning: bad constant expression
kernel/kallsyms.c:136:7: warning: bad constant expression
kernel/kallsyms.c:143:22: warning: bad constant expression
kernel/kallsyms.c:143:22: warning: bad constant expression
kernel/kallsyms.c:143:22: warning: bad constant expression
Now the cause of the sparse warnings is that it does not handle runtime array dimensioning (which I take it is a sparse problem), but in this particular case it _might_ make sense to change the runtime allocation to compile time, as the upper size of the array is known, because the code in kernel/kallsyms.c clearly uses 127 (or 128) as a "magic constant" for kernel symbol (array) length, and on the other hand in include/linux/module.h there is: #define MODULE_NAME_LEN (64 - sizeof(unsigned long)) The only concern is that the array becomes quite big (the original comment about it being "pretty small" no longer applies ...). One way to help that would be to use buffer[] also in place of namebuf[], but that would be a little tricky as the format string should come before the symbol name ... Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-07-01 | [PATCH] Bugfix for CLOCK_REALTIME absolute timer | George Anzinger | 1 | -30/+245
As required by the standard, this patch adds to POSIX ABSOLUTE timers the functionality of adjusting the timer when the clock is set, so that it still expires at the specified time (provided that time has not passed, in which case the timer expires immediately). The standard is, IMNSOHO, a bit vague on just how repeating timers are to be handled, so I made some choices:
1) If an absolute timer is to expire every N intervals, we assume that the expiries should happen at those specified times after clock setting. I.e. we adjust the repeat timer as well as the initial timer. (The other option would be to treat the repeating timers as relative and not to adjust them.)
2) If a clock set moves the clock prior to the initial expiry time AND that time has already passed and been signaled, the current repeat timer is adjusted, i.e. we DO NOT go back to the initial time and repeat that. (The other option is to treat this case as a new request with the initial timer parameters (which by this time we have lost).)
3) If time is advanced such that it appears that several expiries have been missed, the overrun count will reflect the misses. (The other option is to not reflect this in the overrun.) At the same time, nothing is done to acknowledge, to the user, that we are repeating expiries when the clock is retarded.
Signed-off-by: George Anzinger <george@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
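For reference, the kind of timer this affects, as a userspace sketch (values are arbitrary; link with -lrt on older glibc):

    #include <signal.h>
    #include <time.h>

    static void arm_absolute(time_t when)
    {
        timer_t t;
        struct itimerspec its = {
            .it_value.tv_sec    = when, /* absolute wall-clock expiry */
            .it_interval.tv_sec = 60,   /* repeats; also re-aimed on clock set */
        };

        timer_create(CLOCK_REALTIME, NULL, &t); /* SIGALRM by default */
        timer_settime(t, TIMER_ABSTIME, &its, NULL);
    }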
2004-07-01 | [PATCH] swsusp: preparation for smp support & fix device suspending | Pavel Machek | 3 | -7/+98
It fixes the levels for calling the driver model, puts devices to sleep before powering down (so that emergency parking does not happen), and actually introduces SMP support, but it's disabled for now. Plus, no one should try to freeze_processes() when that's not implemented; we now BUG() -- we do not want Heisenbugs. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-30 | [PATCH] zombie with CLONE_THREAD and strace | Andrea Arcangeli | 1 | -3/+31
'strace' shows a problem with a missing release_task for self-reaping clones that have been traced. We need to defer releasing them until the tracer is done with them, but if the tracer dies, we need to handle that case gracefully too. We do that by having 'forget_original_parent()' generate a list of tasks to release when this case happens. Patch based on discussions on linux-kernel, and suggestions from Roland McGrath <roland@redhat.com>.
2004-06-30 | [PATCH] sparse: NULL vs 0 - the rest of it | Mika Kukkonen | 2 | -2/+2
2004-06-29 | [PATCH] Provide console_suspend() and console_resume() | Russell King | 1 | -0/+21
Add console_stop() and console_start() methods so the serial drivers can disable console output before suspending a port, and re-enable output afterwards. We also add locking to ensure that we synchronise with any in-progress printk. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-29 | [PATCH] Provide console_device() | Russell King | 1 | -0/+20
[This patch series has also been separately sent to the architecture maintainers] Add console_device() to return the console tty driver structure and the index. Acquire the console lock while scanning the list of console drivers to protect us against console driver list manipulations. Signed-off-by: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-29 | sparse: fix pointer/integer confusion | Linus Torvalds | 4 | -16/+9
I don't think we're in K&R any more, Toto. If you want a NULL pointer, use NULL. Don't use an integer. Most of the users really didn't seem to know the proper type.
2004-06-26 | [PATCH] Fix race between CONFIG_DEBUG_SLABALLOC and modules | Rusty Russell | 2 | -2/+27
store_stackinfo() does an unlocked module list walk during normal runtime which opens up a race with the module load/unload code. This can be triggered by simply unloading and loading a module in a loop with CONFIG_DEBUG_PAGEALLOC resulting in store_stackinfo() tripping over bad list pointers. kernel_text_address doesn't take any locks, because during an OOPS we don't want to deadlock. Rename that to __kernel_text_address, and make kernel_text_address take the lock. Signed-off-by: Zwane Mwaikambo <zwane@fsmlabs.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (modified) Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23 | [PATCH] consolidate in-kernel configuration | Andrew Morton | 2 | -23/+20
From: Andy Whitcroft <apw@shadowen.org> Being able to recover the configuration from a kernel is very useful and it would be nice to default this option to Yes. Currently, to have the config available both from the image (using extract-ikconfig) and via /proc we keep two copies of the original .config in the kernel. One in plain text and one gzip compressed. This is not optimal. This patch removes the plain text version of the configuration and updates the extraction tools to locate and use the gzip'd version of the file. This has the added bonus of providing us with the exact same results in both cases, the original .config; including the comments. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23 | [PATCH] Prepare for SMP suspend | Andrew Morton | 2 | -1/+1
From: Pavel Machek <pavel@ucw.cz> It's a very bad idea to freeze the migration threads, as it crashes the machine upon the next call to "schedule()". In the refrigerator, I had one "wake_up_process()" too many. This fixes it. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23 | [PATCH] vm: vfs shrinkage tuning | Andrew Morton | 1 | -0/+12
Some people want the dentry and inode caches shrunk harder, others want them shrunk more reluctantly. The patch adds /proc/sys/vm/vfs_cache_pressure, which tunes the vfs cache versus pagecache scanning pressure.
- at vfs_cache_pressure=0 we don't shrink dcache and icache at all.
- at vfs_cache_pressure=100 there is no change in behaviour.
- at vfs_cache_pressure > 100 we reclaim dentries and inodes harder.
The number of megabytes of slab left after a slocate.cron run on my 256MB test box:
vfs_cache_pressure=100000   33480
vfs_cache_pressure=10000    61996
vfs_cache_pressure=1000    104056
vfs_cache_pressure=200     166340
vfs_cache_pressure=100     190200
vfs_cache_pressure=50      206168
Of course, this just left more directory and inode pagecache behind instead of vfs cache. Interestingly, on this machine the entire slocate run fits into pagecache, but not into VFS caches. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23 | [PATCH] clean up cpumask_t temporaries | Andrew Morton | 3 | -7/+4
From: Rusty Russell <rusty@rustcorp.com.au> Paul Jackson's cpumask tour-de-force allows us to get rid of those stupid temporaries which we used to hold CPU_MASK_ALL to hand them to functions. This used to break NR_CPUS > BITS_PER_LONG. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23 | [PATCH] cpumask: optimize various uses of new cpumasks | Andrew Morton | 1 | -11/+7
From: Paul Jackson <pj@sgi.com> Make use of the for_each_cpu_mask() macro to simplify and optimize a couple of sparc64 per-CPU loops. Optimize a bit of cpumask code for asm-i386/mach-es7000. Convert physids_complement() to use both args in include/asm-i386/mpspec.h and include/asm-x86_64/mpspec.h. Remove the cpumask hack from the asm-x86_64/topology.h routine pcibus_to_cpumask(). Clarify and slightly optimize several cpumask manipulations in kernel/sched.c. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
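For reference, the iteration pattern involved (a hedged sketch; for_each_cpu_mask() visits only the set bits of the mask):

    #include <linux/cpumask.h>

    static int count_cpus_in(cpumask_t mask)
    {
        int cpu, n = 0;

        for_each_cpu_mask(cpu, mask)    /* skips clear bits entirely */
            n++;
        return n;
    }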
2004-06-23 | [PATCH] cpumask: rewrite cpumask.h - single bitmap based implementation | Andrew Morton | 2 | -5/+7
From: Paul Jackson <pj@sgi.com> Major rewrite of cpumask to use a single implementation, as a struct-wrapped bitmap. This patch leaves some 26 include/asm-*/cpumask*.h header files orphaned - to be removed in the next patch. Some nine cpumask macros for const variants and to coerce and promote between an unsigned long and a cpumask are obsolete. Simple emulation wrappers are provided in this patch for these obsolete macros, which can be removed once each of the 3 archs (i386, ppc64, x86_64) using them are recoded in follow-on patches to not need them. The CPU_MASK_ALL macro now avoids leaving possible garbage one bits in any unused portion of the high word. An improved comment lists all available operators, for convenient browsing. From: Mikael Pettersson <mikpe@csd.uu.se> 2.6.7-rc3-mm1 changed CPU_MASK_NONE into something that isn't a valid rvalue (it only works inside struct initializers). This caused compile-time errors in perfctr in UP x86 builds. From: Arnd Bergmann <arnd@arndb.de> cpumask-5-10-rewrite-cpumaskh-single-bitmap-based from 2.6.7-rc3-mm1 causes include2/asm/smp.h:54:1: warning: "cpu_online" redefined Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23 | [PATCH] cpumask: make cpu_present_map real even on non-smp | Andrew Morton | 2 | -8/+10
From: Paul Jackson <pj@sgi.com> This patch makes cpu_present_map a real map for all configurations, instead of a constant for non-SMP. It also moves the definition of cpu_present_map out of kernel/cpu.c into kernel/sched.c, because cpu.c isn't compiled into non-SMP kernels. The pattern is that each of the possible, present and online cpu maps are actual kernel global cpumask_t variables, for all configurations. They are documented in include/linux/cpumask.h. Some of the UP (NR_CPUS=1) code cheats, and hardcodes the assumption that the single bit position of these maps is always set, as an optimization. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23 | [PATCH] rcu: avoid passing an argument to the callback function | Andrew Morton | 2 | -13/+20
From: Dipankar Sarma <dipankar@in.ibm.com> This patch changes the call_rcu() API and avoids passing an argument to the callback function as suggested by Rusty. Instead, it is assumed that the user has embedded the rcu head into a structure that is useful in the callback and the rcu_head pointer is passed to the callback. The callback can use container_of() to get the pointer to its structure and work with it. Together with the rcu-singly-link patch, it reduces the rcu_head size by 50%. Considering that we use these in things like struct dentry and struct dst_entry, this is good savings in space. An example:

    struct my_struct {
        struct rcu_head rcu;
        int x;
        int y;
    };

    void my_rcu_callback(struct rcu_head *head)
    {
        struct my_struct *p = container_of(head, struct my_struct, rcu);
        free(p);
    }

    void my_delete(struct my_struct *p)
    {
        ...
        call_rcu(&p->rcu, my_rcu_callback);
        ...
    }

Signed-Off-By: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23 | [PATCH] reduce rcu_head size - core | Andrew Morton | 1 | -21/+21
From: Dipankar Sarma <dipankar@in.ibm.com> This reduces the RCU head size by using a singly linked list to maintain them. The ordering of the callbacks is still maintained as before by using a tail pointer for the next list. Signed-Off-By: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23 | [PATCH] rcu lock update: Code move & cleanup | Andrew Morton | 1 | -36/+45
From: Manfred Spraul <manfred@colorfullife.com> Step three for reducing cacheline thrashing within rcupdate.c: cleanup and code move from <linux/rcupdate.h> to kernel/rcupdate.c. Remove internal details from the header file. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-23 | [PATCH] rcu lock update: Use a sequence lock for starting batches | Andrew Morton | 1 | -8/+21
From: Manfred Spraul <manfred@colorfullife.com> Step two for reducing cacheline thrashing within rcupdate.c: rcu_process_callbacks always acquires rcu_ctrlblk.state.mutex and calls rcu_start_batch, even if the batch is already running or already scheduled to run. This can be avoided with a sequence lock: a sequence lock allows reading the current batch number and next_pending atomically. If next_pending is already set, then there is no need to acquire the global mutex. This means that for each grace period, there will be
- one write access to the rcu_ctrlblk.batch cacheline
- lots of read accesses to rcu_ctrlblk.batch (3-10*cpus_online()). Behavior similar to the jiffies cacheline; shouldn't be a problem.
- cpus_online()+1 write accesses to rcu_ctrlblk.state, all of them starting with spin_lock(&rcu_ctrlblk.state.mutex). For large enough cpus_online() this will be a problem, but all except two of the spin_lock calls only protect the rcu_cpu_mask bitmap, thus a hierarchical bitmap would allow splitting the write accesses to multiple cachelines.
Tested on an 8-way with reaim. Unfortunately it probably won't help with Jack Steiner's 'ls' test since in this test only one cpu generates rcu entries. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
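A sketch of the read side described above (illustrative names; the real fields live inside rcu_ctrlblk):

    #include <linux/seqlock.h>

    static seqlock_t batch_lock = SEQLOCK_UNLOCKED;
    static long cur_batch;
    static int next_pending;

    /* read batch number and next_pending atomically, without the mutex */
    static int batch_scheduled(long *batchp)
    {
        unsigned seq;
        int pending;

        do {
            seq = read_seqbegin(&batch_lock);
            *batchp = cur_batch;
            pending = next_pending;
        } while (read_seqretry(&batch_lock, seq));

        return pending; /* if set, the caller can skip the global mutex */
    }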
2004-06-23 | [PATCH] rcu lock update: Add per-cpu batch counter | Andrew Morton | 1 | -54/+88
From: Manfred Spraul <manfred@colorfullife.com> Below is one of the patches from my rcu lock update. Jack Steiner tested the first one on a 512p and it resolved the rcu cache line thrashing. All were tested on osdl with STP.
Step one for reducing cacheline thrashing within rcupdate.c: The current code uses the rcu_cpu_mask bitmap both for keeping track of the cpus that haven't gone through a quiescent state and for checking if a cpu should look for quiescent states. The bitmap is frequently changed and the check is done by polling - together this causes cache line thrashing. If it's cheaper to access a (mostly) read-only cacheline than a cacheline that is frequently dirtied, then it's possible to reduce the thrashing by splitting the rcu_cpu_mask bitmap into two cachelines: the patch adds a generation counter and moves it into a separate cacheline. This allows removing all accesses to rcu_cpumask (in the read-write cacheline) from rcu_pending and at least 50% of the accesses from rcu_check_quiescent_state. rcu_pending and all but one call per cpu to rcu_check_quiescent_state access the read-only cacheline. Probably not enough for 512p, but it's a start, just for 128 bytes more memory use, without slowing down rcu grace periods. Obviously the read-only cacheline is not really read-only: it's written once per grace period to indicate that a new grace period is running.
Tests on an 8-way Pentium III with reaim showed some improvement. oprofile hits:
Reference: http://khack.osdl.org/stp/293075/
Hits     %
23741    0.0994  rcu_pending
19057    0.0798  rcu_check_quiescent_state
 6530    0.0273  rcu_check_callbacks
Patched: http://khack.osdl.org/stp/293076/
 8291    0.0579  rcu_pending
 5475    0.0382  rcu_check_quiescent_state
 3604    0.0252  rcu_check_callbacks
The total runtime differs between both runs, thus the % numbers must be compared: around 50% faster. I've uninlined rcu_pending for the test. Tested with reaim and kernbench.
Description:
- per-cpu quiescbatch and qs_pending fields introduced: quiescbatch contains the number of the last quiescent period that the cpu has seen and qs_pending is set if the cpu has not yet reported the quiescent state for the current period. With these two fields a cpu can test if it should report a quiescent state without having to look at the frequently written rcu_cpu_mask bitmap.
- curbatch split into two fields: rcu_ctrlblk.batch.completed and rcu_ctrlblk.batch.cur. This makes it possible to figure out if a grace period is running (completed != cur) without accessing the rcu_cpu_mask bitmap.
- rcu_ctrlblk.maxbatch removed and replaced with a true/false next_pending flag: next_pending=1 means that another grace period should be started immediately after the end of the current period. Previously, this was achieved by maxbatch: curbatch==maxbatch means don't start, curbatch!=maxbatch means start. A flag improves the readability: the only possible values for maxbatch were curbatch and curbatch+1.
- rcu_ctrlblk split into two cachelines for better performance.
- common code from rcu_offline_cpu and rcu_check_quiescent_state merged into cpu_quiet.
- rcu_offline_cpu: replace spin_lock_irq with spin_lock_bh; there are no accesses from irq context (and there are accesses to the spinlock with enabled interrupts from tasklet context).
- rcu_restart_cpu introduced, s390 should call it after changing nohz: theoretically the global batch counter could wrap around and end up at RCU_quiescbatch(cpu). Then the cpu would not look for a quiescent state and rcu would lock up.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-21 | merge | Greg Kroah-Hartman | 1 | -0/+100
2004-06-20 | [PATCH] Avoid rebuild of IKCFG when using O= | Sam Ravnborg | 1 | -1/+1
When using a separate output directory, the in-kernel config was rebuilt each time the kernel was compiled. Fix this by specifying the correct path to the Makefile in the prerequisite of the ikconfig.h file. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-19 | Follow 2.4.x semantics for in-kernel signal sending. | Linus Torvalds | 1 | -0/+7
2004-06-18 | [PATCH] sparse: kernel/module.c sparse fix | Randy Dunlap | 1 | -1/+1
Add __user annotation for !CONFIG_MODULE_UNLOAD case. From: Mika Kukkonen <mika@osdl.org> Signed-off-by: Randy Dunlap <rddunlap@osdl.org>
2004-06-17 | [PATCH] RLIM: remove unused queued_signals global accounting | Chris Wright | 2 | -21/+0
Remove unused queued_signals global accounting. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] RLIM: enforce rlimits on queued signals | Chris Wright | 1 | -6/+11
Add a user_struct pointer to the sigqueue structure. Charge sigqueue allocation and destruction to the user_struct rather than a global pool. This per-user rlimit accounting obsoletes the global queued_signals accounting. As it stands, the patch charges the sigqueue struct allocation to the queue that it's pending on (the receiver of the signal). So the owner of the queue is charged for whoever writes to it (much like quota for a 777 file). The patch started out charging the task which allocated the sigqueue struct. In most cases, these are always the same user (permission for sending a signal), so those cases are moot. In the cases where it isn't the same user, it's a privileged user sending a signal to another user. It seems wrong to charge the allocation to the privileged user, when the other user could block receipt as long as it feels. The flipside is, someone else can fill your queue (the expectation is that someone else is privileged). I think it's right the way it is. The change to revert is very small. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] RLIM: pass task_struct in send_signal() | Chris Wright | 1 | -3/+4
Update send_signal() api to allow passing the task receiving the signal. This is necessary to ensure signals generated out of process context can be charged to the correct user. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | Fix kill_pg_info(): return success if _any_ signal succeeded. | Linus Torvalds | 1 | -11/+7
2004-06-17 | [PATCH] constify some scheduler functions | Keith Owens | 2 | -6/+6
Several scheduler macros only read from the task struct; mark them const. It may help the compiler generate better code. Signed-off-by: Keith Owens <kaos@ocs.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] remove EXPORT_SYMBOL(kallsyms_lookup) | Greg Kroah-Hartman | 1 | -1/+0
Distros have started to ship kernels with this patch, as it seems that some unnamed binary module authors are already abusing this function (as well as some open source modules, like the openib code.) I could not find any valid reason why this symbol should be exported, so here's a patch against 2.6.7 that removes it. Signed-off-by: Greg Kroah-Hartman <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] Make update_one_process() static | Andrew Morton | 1 | -1/+1
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] swsusp: remove copy_pagedir | Herbert Xu | 2 | -32/+7
It can be replaced by a simple memcpy. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] remove unnecessary memsets from swsusp and pmdisk | Herbert Xu | 2 | -2/+0
Here's the patch that removes the memset calls from both pmdisk and swsusp. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] pmdisk memory leak fix | Herbert Xu | 1 | -11/+11
Fix a couple of memory leaks in the pmdisk driver. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] Fix memory leak in swsusp | Pavel Machek | 1 | -11/+16
This fixes 2 memory leaks in swsusp: during pagedir relocation, eaten pages were not properly freed in the error path, and even the regular freeing path was freeing one page too few. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] BSD accounting format rework | Tim Schmielau | 1 | -8/+97
BSD accounting format rework: use all explicit and implicit padding in struct acct to
- correctly report 32 bit uid/gid,
- correctly report jobs (e.g., daemons) running longer than 497 days,
- increase the precision of ac_etime from 2^-13 to 2^-20 (i.e., from ~6 hours to ~1 min. after a year),
- store the current AHZ value,
- allow cross-platform processing of the accounting file (limited for m68k which has a different size struct acct),
- introduce versioning for smooth transition to incompatible formats in the future.
Currently the following version numbers are defined:
0: old format (until 2.6.7) with 16 bit uid/gid
1: extended variant (binary compatible to v0 on M68K)
2: extended variant (binary compatible to v0 on everything except M68K)
3: a new binary incompatible format (64 bytes)
4: new binary incompatible format (128 bytes); the layout of its first 64 bytes is the same as for v3
5: marks the second half of the new binary incompatible format (128 bytes) (layout is not yet defined)
All this is accomplished without breaking binary compatibility. 32 bit uid/gid support is compatible with the patch previously floating around and used e.g. by Red Hat. This patch also introduces a config option for a new, binary incompatible "version 3" format that
- is uniform across and properly aligned on all platforms
- stores pid and ppid
- uses AHZ==100 on all platforms (allows to report longer times)
Much of the compatibility glue goes away when v1/v2 support is removed from the kernel. Such a patch is at http://www.physik3.uni-rostock.de/tim/kernel/2.7/acct-cleanup-04.patch and might be applied in the 2.7 timeframe. The new v3 format is source compatible with current GNU acct tools (6.3.5). However, current GNU acct tools can be compiled for only one format. As there is no way to pass the kernel configuration to userspace, with my patch it will still only support the old v2 format. Only if v1/v2 support is removed from the kernel will recompiling the GNU acct tools yield v3 support. A preliminary take at the corresponding work on cross-platform userspace tools (GNU acct package) is at http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/ This version of the package is able to read any of the v0/v2/v3 formats, regardless of byte-order (untested), even within the same file. Cross-platform compatibility with m68k (v1 format) is not yet implemented, but native use on m68k should work (untested). pid and ppid are currently only shown by the dump-acct utility. Thanks to Arthur Corliss, Albert Cahalan and Ragnar Kjørstad for their comments, and to Albert Cahalan for the u64->IEEE float conversion code. Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] getgroups16() fix | Tomas Olsson | 1 | -6/+6
sys_getgroups16 (or rather groups16_to_user()) returns large gids truncated. Needs to be fixed, one way or another. Don't know why the other similar casts are still there. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] RLIM: add mq_bytes to user_struct | Chris Wright | 1 | -0/+3
Add mq_bytes field to user_struct, and make sure it's properly initialized. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] RLIM: add sigpending field to user_struct | Chris Wright | 1 | -1/+3
Add sigpending field to user_struct, and make sure it's properly initialized. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17 | [PATCH] Fixes for idr code | Corey Minyard | 1 | -34/+52
* On a 32-bit architecture, the idr code will cease to work if you add more than 2^20 entries. You will not be able to find many of the entries. The problem is that the IDR code uses 5-bit chunks of the number and the lower portion used by IDR is 24 bits, so you have one bit that leaks over into the comparisons that should not be there. The solution is to mask off that bit before doing IDR processing. This actually causes the POSIX timer code to crash if you create that many timers. I have included an idr_test.tar.gz file that demonstrates this with and without the fix, in case you need more evidence :).
* When the IDR fills up, it returns -1. However, there was no way to check for this condition. This patch adds the ability to check for the idr being full and fixes all the users. It also fixes a problem in fs/super.c where the idr code wasn't checking for -1.
* There was a race condition creating POSIX timers. The timer was added to a task struct for another process, then the data for the timer was filled out. The other task could use/destroy the timer as soon as it is in the task's queue and the lock is released. This moves setting up the timer data to before the timer is enqueued or (for some data) into the lock.
* Change things so that the caller doesn't need to run idr_full() to find out the reason for an idr_get_new() failure. Just return -ENOSPC if the tree was full, or -EAGAIN if the caller needs to re-run idr_pre_get() and try again.
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
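The resulting calling convention, sketched (hypothetical caller; idr_pre_get() preallocates, and idr_get_new() can still lose a race):

    #include <linux/idr.h>
    #include <linux/slab.h>

    static struct idr my_idr;   /* idr_init(&my_idr) at init time */

    static int new_id(void *ptr)
    {
        int id, err;

    retry:
        if (!idr_pre_get(&my_idr, GFP_KERNEL))
            return -ENOMEM;             /* preallocation failed */
        err = idr_get_new(&my_idr, ptr, &id);
        if (err == -EAGAIN)
            goto retry;                 /* raced; preallocate again */
        if (err)
            return err;                 /* -ENOSPC: the tree is full */
        return id;
    }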
2004-06-17 | [PATCH] Clean up asm/pgalloc.h include | Russell King | 1 | -1/+0
This patch cleans up needless includes of asm/pgalloc.h from the fs/ kernel/ and mm/ subtrees. Compile tested on multiple ARM platforms, and x86, this patch appears safe. This patch is part of a larger patch aiming towards getting the include of asm/pgtable.h out of linux/mm.h, so that asm/pgtable.h can sanely get at things like mm_struct and friends. I suggest testing in -mm for a while to ensure there aren't any hidden arch issues. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-15 | [PATCH] insert_resource fix | John Rose | 1 | -2/+2
I noticed that insert_resource() incorrectly handles the case of an existing parent resource with the same ending address as a newly added child. This results in incorrect nesting, like the following:
# cat /proc/ioports
<snip>
002f0000-002fffff : PCI Bus #48
  00200000-002fffff : /pci@800000020000003
</snip>
Signed-off-by: John Rose <johnrose@austin.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-14 | Revert wakeup-affinity fixes | Linus Torvalds | 1 | -3/+3
This patch results in too much idle time under certain loads, and while that is being looked into we're better off just reverting the change. Cset exclude: nickpiggin@yahoo.com.au[torvalds]|ChangeSet|20040605175839|02419
2004-06-12 | [PATCH] dup_mmap() memory accounting fix | Andrew Morton | 1 | -0/+5
From: Hugh Dickins <hugh@veritas.com> Oleg's patch was good in that exit_mmap usually does the un-accounting; but dup_mmap still needs its own un-accounting for the case when it has charged for a vma, but error before it's inserted into child mm's list. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-12 | [PATCH] fix the exit-vs-timer race fix | Andrew Morton | 1 | -1/+1
As Roland McGrath <roland@redhat.com> points out, we need to zero task->it_virt_value to prevent timer-based signal delivery, not ->it_virt_incr. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-12 | [PATCH] fix modprobe_path and hotplug_path sizes and sysctl | Andrew Morton | 2 | -4/+4
From: Andy Whitcroft <apw@shadowen.org> Both modprobe_path and hotplug_path are arbitrarily sized at 256 bytes, and that size is also expressed directly in the sysctl code. It seems reasonable to define a standard length and use that for consistency. This patch introduces the constant KMOD_PATH_LEN and uses that. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
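The resulting declarations, roughly (mirroring kernel/kmod.c; the sysctl entries can then reference the same constant):

    #include <linux/kmod.h>

    char modprobe_path[KMOD_PATH_LEN] = "/sbin/modprobe";
    char hotplug_path[KMOD_PATH_LEN]  = "/sbin/hotplug";

    /* ... and the sysctl table uses .maxlen = KMOD_PATH_LEN instead of 256 */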
2004-06-08 | merge i2c-2.6 into driver-2.6 trees due to problems people reported. | Greg Kroah-Hartman | 1 | -0/+100
2004-06-08 | [PATCH] dup_mmap() double memory accounting | Oleg Nesterov | 1 | -5/+1
dup_mmap() unnecessarily tries to account for memory of the vma's it has created if it fails in the middle. However, that's pointless (and wrong), since the exit_mmap() path called through mmput() will do so anyway in the failure path. Just remove the bogus un-accounting code.
2004-06-08 | [PATCH] kernel/sysctl annotations for sparse | Randy Dunlap | 1 | -11/+11
Add __user annotations to kernel/sysctl.c to satisfy sparse for !CONFIG_SYSCTL, !CONFIG_PROC_FS. Signed-off-by: Randy Dunlap <rddunlap@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-08 | [PATCH] fix uts sysctl write size | Andrew Morton | 1 | -5/+5
From: Andy Whitcroft <apw@shadowen.org> The sysctl interfaces for updating the uts entries such as hostname and domainname are using the wrong length for these buffers; they are hard coded to 64. Although safe, this artificially limits the size of these fields to one less than the true maximum. This generates an inconsistency between the various methods of update for these fields.
# hostname 12345678901234567890123456789012345678901234567890123456789012345
hostname: name too long
# hostname 1234567890123456789012345678901234567890123456789012345678901234
# hostname
1234567890123456789012345678901234567890123456789012345678901234
# sysctl -w kernel.hostname=1234567890123456789012345678901234567890123456789012345678901234567890
kernel.hostname = 1234567890123456789012345678901234567890123456789012345678901234567890
# hostname
123456789012345678901234567890123456789012345678901234567890123
#
The error originates from the fact that the handler for strings (proc_dostring) already allows for the string terminator. This patch corrects the limit, taking the opportunity to convert to use of sizeof(). Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
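A sketch of the corrected kind of table entry (an illustrative 2.6-era ctl_table entry, reconstructed from memory rather than quoted from the patch; sizeof() replaces the hard-coded 64):

    /* in the kernel sysctl table: */
    {
        .ctl_name     = KERN_NODENAME,
        .procname     = "hostname",
        .data         = system_utsname.nodename,
        .maxlen       = sizeof(system_utsname.nodename),
        .mode         = 0644,
        .proc_handler = &proc_doutsstring,
        .strategy     = &sysctl_string,
    },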
2004-06-08 | [PATCH] flush_workqueue locking simplification | Andrew Morton | 1 | -3/+2
From: "Anil" <anil.s.keshavamurthy@intel.com> We don't need lock_cpu_hotplug()/unlock_cpu_hotplug for singlethreaded workqueues. Signed-off-by: Anil Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-08 | [PATCH] __ARCH_WANT_SYS_RT_SIGACTION fix | Andrew Morton | 1 | -2/+2
From: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Recent syscall stubs cleanup broke alpha, as it has its own version of sys_rt_sigaction(). This defines __ARCH_WANT_SYS_RT_SIGACTION for all architectures except alpha, sparc and sparc64. Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-08 | [PATCH] speedup flush_workqueue for singlethread_workqueue | Andrew Morton | 1 | -31/+35
From: "Anil" <anil.s.keshavamurthy@intel.com> In flush_workqueue(), for the single-threaded workqueue case, the code flushes the same cpu_workqueue_struct for each online cpu. Change things so that we only perform the flush once in this case. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-05 | [PATCH] sched: honor the "sync" wakeup bit | Ingo Molnar | 1 | -0/+7
The scheduler changes had another thing missing: the appreciation of sync wakeups. (I had this in one of the earlier sched-domains cleanup patches before but it got lost in the shuffle.) When a sync waker is waking, we should subtract its load from the current load - it will schedule away for sure in the near future. That's what the "sync" bit means. This change is necessary because with the sched-domains balancer we have a much more sensitive cpu-load estimator, and in this particular context of try_to_wake_up() the sync waker's effect will always be part of the load. Patch against your patch attached. In my testing there's an additional increase in bw_pipe numbers on a dual P2 box, it went from 110-120 MB/sec to 120-130 MB/sec. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
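A sketch of the idea (illustrative only; not the exact kernel/sched.c variables or guards):

    /* when the waker flagged the wakeup as sync, it will block right
     * after waking us, so discount its own contribution to the load */
    static unsigned long effective_waker_load(unsigned long raw_load, int sync)
    {
        if (sync && raw_load >= SCHED_LOAD_SCALE)
            raw_load -= SCHED_LOAD_SCALE;   /* one fully-running task's worth */
        return raw_load;
    }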
2004-06-04Merge bk://kernel.bkbits.net/davem/sparc-2.6Linus Torvalds1-22/+41
into ppc970.osdl.org:/home/torvalds/v2.6/linux
2004-06-04[PATCH] sched: improve wakeup-affinityNick Piggin1-3/+3
David Mosberger noticed bw_pipe was way down on sched-domains kernels on SMP systems. That is due to two things: first, the previous wake-affine logic would *always* move a pipe wakee onto the waker's CPU. With the scheduler rework, this was toned down a lot (but extended to all types of wakeups). One of the ways this was damped was with the logic: don't move the wakee if its CPU is relatively idle compared to the waker's CPU. Without this, some workloads would pile everything up onto a few CPUs and get lots of idle time.

However, the fix was a bit of a blunt hack: if the wakee runqueue was below 50% busy, and the waker's was above 50% busy, we wouldn't do the move. I think a better way to capture it is what this patch does: if the wakee runqueue is below 100% busy, and the sum of the two runqueues' loads is above 100% busy, and the wakee runqueue is less busy than the waker runqueue (ie. CPU utilisation would drop if we do the move), then we don't do the move.

After I fixed this, I found things were still getting bounced around quite a bit. The reason is that we were attempting very aggressive idle balancing in order to cut down idle time in a dbt2-pgsql workload, which is particularly sensitive to idle. After having Mark Wong (markw@osdl.org) retest this load with this patch, it looks like we don't need to be so aggressive. I'm glad to be rid of this because it never sat too well with me. We should see slightly lower cost of schedule and slightly improved cache impact with this change too.

Mark said:
---
This looks pretty good:

    metric  kernel
    2334    2.6.7-rc2
    2298    2.6.7-rc2-mm2
    2329    2.6.7-rc2-mm2-sched-more-wakeaffine
---
ie. within the noise.

David said:
---
Oooh, me likeee!

    Host      OS            Pipe  AF UNIX
    --------- ------------- ----  -------
    caldera.h Linux 2.6.6   3424  2057     (plain 2.6.6)
    caldera.h Linux 2.6.7-r 333.  1402     (original 2.6.7-rc1)
    caldera.h Linux 2.6.7-r 3086  4301     (2.6.7-rc1 with your patch)

Pipe-bandwidth is still down about 10% but that may be due to unrelated changes (or perhaps warmup effects?). The AF UNIX bandwidth is just mindboggling. Moreover, with your patch 2.6.7-rc1 shows better context-switch times and lower communication latencies (more like the numbers you're getting on UP). So it seems like the overall balance of keeping things on the same CPU vs. distributing them across CPUs is improved.
---

I also ran some tests on the NUMAQ. kernbench, dbench, hackbench, reaim were much the same. tbench was improved, very much so when clients < NR_CPU.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
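The new damping rule reads roughly as follows (a sketch; here load is the wakee's current runqueue load and this_load is the waker's, both scaled so that SCHED_LOAD_SCALE means 100% busy):

    /* sketch: skip the affine move when utilisation would drop */
    if (load < SCHED_LOAD_SCALE &&
        load + this_load > SCHED_LOAD_SCALE &&
        load < this_load) {
        /* wakee CPU below 100%, combined load above 100%, and the
         * wakee's CPU is the less busy one: leave the task where it is */
    }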
2004-06-04[COMPAT]: Add __user attributes for pointers passed while KERNEL_DS.David S. Miller1-22/+41
2004-06-04[PATCH] Module section offsets in /sys/moduleJonathan Corbet1-0/+100
So here I am trying to write about how one can apply gdb to a running kernel, and I'd like to tell people how to debug loadable modules. Only with the 2.6 module loader, there's no way to find out where the various sections in the module image ended up, so you can't do much.

This patch attempts to fix that by adding a "sections" subdirectory to every module's entry in /sys/module; each attribute in that directory associates a beginning address with the section name. Those attributes can be used by a simple script to generate an add-symbol-file command for gdb, something like:

    #!/bin/bash
    #
    # gdbline module image
    #
    # Outputs an add-symbol-file line suitable for pasting into gdb to examine
    # a loaded module.
    #
    cd /sys/module/$1/sections
    echo -n add-symbol-file $2 `/bin/cat .text`
    for section in .[a-z]* *; do
        if [ $section != ".text" ]; then
            echo " \\"
            echo -n " -s" $section `/bin/cat $section`
        fi
    done
    echo

Currently, this feature is absent if CONFIG_KALLSYMS is not set. I do wonder if CONFIG_DEBUG_INFO might not be a better choice, now that I think about it. Section names are unmunged, so "ls -a" is needed to see most of them.

Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
2004-06-03sparse: annotate (and comment on) kmod.c user pointer usageLinus Torvalds1-3/+13
Big comment, because it wasn't clear why this cast was valid. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-03[PATCH] Add the sixth arg to the sys_futex() prototype.Andrew Morton2-1/+2
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-02sparse: fix up futex address space warningLinus Torvalds1-1/+1
2004-06-02[PATCH] move #endif to correct placeAndrew Morton1-1/+1
From: David Mosberger <davidm@napali.hpl.hp.com>

Darren Williams <dsw@gelato.unsw.edu.au> noticed that the #endif for __ARCH_WANT_SYS_SIGPROCMASK was off by one routine.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-02Merge bk://kernel.bkbits.net/davem/sparc-2.6Linus Torvalds1-2/+2
into ppc970.osdl.org:/home/torvalds/v2.6/linux
2004-06-01[PATCH] Fix signal race during process exitJeremy Kerr1-0/+8
Fix a race identified by Jeremy Kerr <jeremy@redfishsoftware.com.au>: if update_process_times() decides to deliver a signal due to process timer expiry, it can race with __exit_sighand()'s freeing of task->sighand. Fix that by clearing the per-process timer state in exit_notify(), while under local_irq_disable() and under tasklist_lock. tasklist_lock provides exclusion wrt release_task()'s freeing of task->sighand and local_irq_disable() provides exclusion wrt update_process_times()'s inspection of the per-process timer state. We also need to deal with the send_sig() calls in do_process_times() by setting rlim_cur to RLIM_INFINITY. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
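A sketch of the exit_notify() sequence described above, assuming the 2.6.6-era per-process timer fields (it_virt_value/it_prof_value) and rlim array:

    write_lock_irq(&tasklist_lock);
    /* irqs off excludes update_process_times(); tasklist_lock
     * excludes release_task()'s freeing of ->sighand */
    current->it_virt_value = 0;
    current->it_prof_value = 0;
    /* stop do_process_times() from sending signals at expiry */
    current->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;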
2004-06-01[SPARC64]: Compat syscall overhaul.David S. Miller1-2/+2
1) Make syscall entry zero-extend all arguments.
2) Sign extend those needed in sys32.S.
3) Kill the A() AA() macros, replace with compat_ptr() et al.
2004-05-31Add comments on load balancing special cases.Linus Torvalds1-1/+13
Ingo explains:

The condition is 'impossible', but the whole balancing code is (intentionally) a bit racy:

    cpus_and(tmp, group->cpumask, cpu_online_map);
    if (!cpus_weight(tmp))
        goto next_group;

    for_each_cpu_mask(i, tmp) {
        if (!idle_cpu(i))
            goto next_group;
        push_cpu = i;
    }

    rq = cpu_rq(push_cpu);
    double_lock_balance(busiest, rq);
    move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);

in the for_each_cpu_mask() loop we specifically check for each CPU in the target group to be idle - so push_cpu's runqueue == busiest [== current runqueue] cannot be true, because the current CPU is not idle, we are running in the migration thread ...

But this is not a real problem: we do load-balancing in a racy way to reduce overhead [and it's all statistics anyway so absolute accuracy is impossible], and active balancing itself is somewhat racy due to the migration-thread wakeup (and the active_balance flag) going outside the runqueue locks [for similar reasons].

so it all looks quite plausible - the normal SMP boxes don't trigger it, but Bjorn's 128-CPU setup with a non-trivial domain hierarchy triggers it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-31[PATCH] active_load_balance() deadlockBjorn Helgaas1-0/+2
active_load_balance() looks susceptible to deadlock when busiest==rq. Without the following patch, my 128-way box deadlocks consistently during boot-time driver init.
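The guard amounts to a bail-out before taking the second runqueue lock (sketch; surrounding labels assumed):

    /* sketch: never double-lock a runqueue against itself */
    if (unlikely(busiest == rq))
        goto next_group;
    double_lock_balance(busiest, rq);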
2004-05-31[PATCH] sched: remove noinline workaroundAndrew Morton1-4/+1
From: Ingo Molnar <mingo@elte.hu>

Now that the x86_64 bitop memory clobber problem has been fixed, we can remove this.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-31[PATCH] s/tkill/tgkill/ in /** documentation */Andrew Morton1-1/+1
From: bert hubert <ahu@ds9a.nl> Documentation is in fact for tgkill and not for tkill Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-31[PATCH] Add FUTEX_CMP_REQUEUE futex opAndrew Morton2-9/+47
From: Jakub Jelinek <jakub@redhat.com>

FUTEX_REQUEUE operation has been added to the kernel mainly to improve pthread_cond_broadcast which previously used FUTEX_WAKE INT_MAX op. pthread_cond_broadcast releases the internal condvar mutex before the FUTEX_REQUEUE operation, as otherwise the woken up thread most likely immediately sleeps again on the internal condvar mutex until the broadcasting thread releases it.

Unfortunately this is racy and causes e.g. http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/nptl/tst-cond16.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=glibc to hang on SMP. http://listman.redhat.com/archives/phil-list/2004-May/msg00023.html contains an analysis of how the hang happens. The problem is that if any thread does pthread_cond_*wait in between the releasing of the internal condvar mutex and the FUTEX_REQUEUE operation, the wrong thread might be woken (and immediately go to sleep again because it doesn't satisfy the conditions for returning from pthread_cond_*wait) while the right thread is requeued on the associated mutex, and there would be nobody to wake that thread up.

The patch below extends the FUTEX_REQUEUE operation with something FUTEX_WAIT already uses: FUTEX_CMP_REQUEUE is passed an additional argument which is the expected value of *futex. The kernel then, while holding the futex locks, checks if *futex != expected and returns -EAGAIN in that case, while if it is equal, it continues with a normal FUTEX_REQUEUE operation.

If the syscall returns -EAGAIN, NPTL can fall back to the FUTEX_WAKE INT_MAX operation which doesn't have this problem but is less efficient, while in the likely case that nobody hit the (small) window the efficient FUTEX_REQUEUE operation is used.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
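Condensed, the new op adds one guarded compare in front of the existing requeue path (a sketch of the idea, not the exact kernel diff):

    /* sketch: FUTEX_CMP_REQUEUE = compare *uaddr1 first, then requeue */
    if (op == FUTEX_CMP_REQUEUE) {
        u32 curval;

        if (get_user(curval, (u32 __user *)uaddr1))
            return -EFAULT;
        if (curval != expected)
            return -EAGAIN;  /* NPTL falls back to FUTEX_WAKE INT_MAX */
    }
    /* ... proceed with the normal FUTEX_REQUEUE of waiters ... */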
2004-05-31[PATCH] Export kthread primitivesRusty Russell1-1/+5
kthreads are not just for breakfast anymore. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (creator) Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-31[PATCH] Fix x86-64 compilation without CONFIG_NUMAAndi Kleen1-0/+1
This fixes compilation of x86-64 without CONFIG_NUMA again (it was broken by the previous patchkit).
2004-05-29[PATCH] sparse: kernel/sysctl.c annotation and cleanupAlexander Viro1-30/+30
2004-05-28[PATCH] sparse: trivial part of kernel/* __user annotationAlexander Viro6-45/+55
2004-05-26[PATCH] CPU Hotplug: restore Idle task's priority during CPU_DEAD notificationAndrew Morton1-1/+2
From: Srivatsa Vaddagiri <vatsa@in.ibm.com> Fix a CPU Hotplug problem wherein idle task's "->prio" value is not restored to MAX_PRIO during CPU_DEAD handling. Without this patch, once a CPU is offlined and then later onlined, it becomes "more or less" useless (does not run any task other than its idle task!) Ingo said: The __setscheduler() call is (technically) incorrect because in the SCHED_NORMAL case the prio should be zero. So it's a bit cleaner to set up the static priority to MAX_PRIO and then revert the policy to SCHED_NORMAL via __setscheduler(). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org>
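Following Ingo's suggestion, the CPU_DEAD handler ends up doing roughly this (sketch):

    /* sketch: put the dead CPU's idle task back into a sane state */
    rq = cpu_rq(cpu);
    rq->idle->static_prio = MAX_PRIO;
    __setscheduler(rq->idle, SCHED_NORMAL, 0);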
2004-05-24[PATCH] sched_yield() microoptimisationAndrew Morton1-2/+1
Signed-off-by: Ingo Molnar <mingo@elte.hu> We can avoid the local_irq_enable() in sched_yield() because schedule() unconditionally enables interrupts anyway.
2004-05-24[PATCH] minor sched.c cleanupAndrew Morton1-2/+1
Signed-off-by: Christian Meder <chris@onestepahead.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> The following obviously correct patch from Christian Meder simplifies the DELTA() define.
2004-05-24[PATCH] Fix race condition with current->group_infoAndrew Morton1-0/+5
From: Olaf Kirch <okir@suse.de> I have been chasing a corruption of current->group_info on PPC during NFS stress tests. The problem seems to be that nfsd is messing with its group_info quite a bit, while some monitoring processes look at /proc/<pid>/status and do a get_group_info/put_group_info without any locking. This problem can be reproduced on ppc platforms within a few seconds if you generate some NFS load and do a "cat /proc/XXX/status" of an nfsd thread in a tight loop. I therefore think changes to current->group_info, and querying it from a different process, needs to be protected using the task_lock. (akpm: task->group_info here is safe against exit() because the task holds a ref on group_info which is released in __put_task_struct, and the /proc file has a ref on the task_struct).
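The reader side then takes the task_lock around its snapshot; a minimal sketch:

    /* sketch: safely snapshot another task's group_info for /proc */
    task_lock(task);
    group_info = task->group_info;
    get_group_info(group_info);    /* take a reference under the lock */
    task_unlock(task);

    /* ... format the supplementary groups ... */
    put_group_info(group_info);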
2004-05-24[PATCH] Fix the mangled-oops-output-on-SMP problemAndrew Morton1-6/+23
From: Ingo Molnar <mingo@elte.hu>

printk currently does:

    if (oops_in_progress)
        bust_printk_locks();

which means that once we oops, the printk locking is 100% ineffective and multiple CPUs make an unreadable mess on a serial console. It's a significant development hassle.

Fix that up by only popping locks once per ten seconds.

akpm@osdl.org did:

- Bump the timeout to 30 seconds - 9600 baud is slow.
- Handle jiffy wraps: change the logic so that we only skip the lockbust if the current time is within 30 seconds of the previous lockbusting attempt.
2004-05-23[PATCH] pa-risc: kernel/fork.c broken by the new rmapJames Bottomley1-2/+2
Any architecture (like pa-risc) that makes use of the helper function flush_dcache_mmap_lock() won't compile with the new rmap due to use of the wrong "mapping". Trivial fix.
2004-05-22[PATCH] rmap 39 add anon_vma rmapAndrew Morton1-1/+2
From: Hugh Dickins <hugh@veritas.com> Andrea Arcangeli's anon_vma object-based reverse mapping scheme for anonymous pages. Instead of tracking anonymous pages by pte_chains or by mm, this tracks them by vma. But because vmas are frequently split and merged (particularly by mprotect), a page cannot point directly to its vma(s), but instead to an anon_vma list of those vmas likely to contain the page - a list on which vmas can easily be linked and unlinked as they come and go. The vmas on one list are all related, either by forking or by splitting. This has three particular advantages over anonmm: that it can cope effortlessly with mremap moves; and no longer needs page_table_lock to protect an mm's vma tree, since try_to_unmap finds vmas via page -> anon_vma -> vma instead of using find_vma; and should use less cpu for swapout since it can locate its anonymous vmas more quickly. It does have disadvantages too: a lot more change in mmap.c to deal with anon_vmas, though small straightforward additions now that the vma merging has been refactored there; more lowmem needed for each anon_vma and vma structure; an additional restriction on the merging of vmas (cannot be merged if already assigned different anon_vmas, since then their pages will be pointing to different heads). (There would be no need to enlarge the vma structure if anonymous pages belonged only to anonymous vmas; but private file mappings accumulate anonymous pages by copy-on-write, so need to be listed in both anon_vma and prio_tree at the same time. A different implementation could avoid that by using anon_vmas only for purely anonymous vmas, and use the existing prio_tree to locate cow pages - but that would involve a long search for each single private copy, probably not a good idea.) Where before the vm_pgoff of a purely anonymous (not file-backed) vma was meaningless, now it represents the virtual start address at which that vma is mapped - which the standard file pgoff manipulations treat linearly as vmas are split and merged. But if mremap moves the vma, then it generally carries its original vm_pgoff to the new location, so pages shared with the old location can still be found. Magic. Hugh has massaged it somewhat: building on the earlier rmap patches, this patch is a fifth of the size of Andrea's original anon_vma patch. Please note that this posting will be his first sight of this patch, which he may or may not approve.
2004-05-22[PATCH] rmap 38 remove anonmm rmapAndrew Morton1-13/+0
From: Hugh Dickins <hugh@veritas.com> Before moving on to anon_vma rmap, remove now what's peculiar to anonmm rmap: the anonmm handling and the mremap move cows. Temporarily reduce page_referenced_anon and try_to_unmap_anon to stubs, so a kernel built with this patch will not swap anonymous at all.
2004-05-22[PATCH] rmap 22 flush_dcache_mmap_lockAndrew Morton1-0/+2
From: Hugh Dickins <hugh@veritas.com> arm and parisc __flush_dcache_page have been scanning the i_mmap(_shared) list without locking or disabling preemption. That may be even more unsafe now it's a prio tree instead of a list. It looks like we cannot use i_shared_lock for this protection: most uses of flush_dcache_page are okay, and only one would need lock ordering fixed (get_user_pages holds page_table_lock across flush_dcache_page); but there's a few (e.g. in net and ntfs) which look as if they're using it in I/O completion - and it would be restrictive to disallow it there. So, on arm and parisc only, define flush_dcache_mmap_lock(mapping) as spin_lock_irq(&(mapping)->tree_lock); on i386 (and other arches left to the next patch) define it away to nothing; and use where needed. While updating locking hierarchy in filemap.c, remove two layers of the fossil record from add_to_page_cache comment: no longer used for swap. I believe all the #includes will work out, but have only built i386. I can see several things about this patch which might cause revulsion: the name flush_dcache_mmap_lock? the reuse of the page radix_tree's tree_lock for this different purpose? spin_lock_irqsave instead? can't we somehow get i_shared_lock to handle the problem?
2004-05-22[PATCH] rmap 16: pretend prio_treeAndrew Morton1-2/+2
From: Hugh Dickins <hugh@veritas.com> Pave the way for prio_tree by switching over to its interfaces, but actually still implement them with the same old lists as before. Most of the vma_prio_tree interfaces are straightforward. The interesting one is vma_prio_tree_next, used to search the tree for all vmas which overlap the given range: unlike the list_for_each_entry it replaces, it does not find every vma, just those that match. But this does leave handling of nonlinear vmas in a very unsatisfactory state: for now we have to search again over the maximum range to find all the nonlinear vmas which might contain a page, which of course takes away the point of the tree. Fixed in later patch of this batch. There is no need to initialize vma linkage all over, just do it before inserting the vma in list or tree. /proc/pid/statm had an odd test for its shared count: simplified to an equivalent test on vm_file.
2004-05-22[PATCH] small numa api fixupsAndrew Morton2-0/+8
From: Christoph Hellwig <hch@lst.de>

- don't include mempolicy.h in sched.h and mm.h when a forward declaration is enough. Andi argued against that in the past, but I'd really hate to add another header to two of the includes used in basically every driver when we can include it in the six files actually needing it instead (that number is for my ppc32 system, maybe other arches need more includes in their directories)

- make numa api fields in task_struct conditional on CONFIG_NUMA, this gives us a few ugly ifdefs but avoids wasting memory on non-NUMA systems.
2004-05-22[PATCH] numa api: Add VMA hooks for policyAndrew Morton2-1/+18
From: Andi Kleen <ak@suse.de>

NUMA API adds a policy to each VMA. During VMA creation, merging and splitting these policies must be handled properly. This patch adds the calls to this. It is a nop when CONFIG_NUMA is not defined.
2004-05-22[PATCH] numa api: Core NUMA API codeAndrew Morton1-0/+3
From: Andi Kleen <ak@suse.de>

The following patches add support for configurable NUMA memory policy for user processes. It is based on the proposal from last kernel summit with feedback from various people.

This NUMA API does not attempt to implement page migration or anything else complicated: all it does is to police the allocation when a page is first allocated or when a page is reallocated after swapping. Currently only support for shared memory and anonymous memory is there; policy for file based mappings is not implemented yet (although they get implicitly policied by the default process policy).

It adds three new system calls: mbind to change the policy of a VMA, set_mempolicy to change the policy of a process, get_mempolicy to retrieve memory policy.

User tools (numactl, libnuma, test programs, manpages) can be found in ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.6.tar.gz

For details on the system calls see the manpages
http://www.firstfloor.org/~andi/mbind.html
http://www.firstfloor.org/~andi/set_mempolicy.html
http://www.firstfloor.org/~andi/get_mempolicy.html

Most user programs should actually not use the system calls directly, but use the higher level functions in libnuma (http://www.firstfloor.org/~andi/numa.html) or the command line tools (http://www.firstfloor.org/~andi/numactl.html).

The system calls allow user programs and administrators to set various NUMA memory policies for putting memory on specific nodes. Here is a short description of the policies copied from the kernel patch:

 * NUMA policy allows the user to give hints in which node(s) memory should
 * be allocated.
 *
 * Support four policies per VMA and per process:
 *
 * The VMA policy has priority over the process policy for a page fault.
 *
 * interleave     Allocate memory interleaved over a set of nodes,
 *                with normal fallback if it fails.
 *                For VMA based allocations this interleaves based on the
 *                offset into the backing object or offset into the mapping
 *                for anonymous memory. For process policy a process counter
 *                is used.
 * bind           Only allocate memory on a specific set of nodes,
 *                no fallback.
 * preferred      Try a specific node first before normal fallback.
 *                As a special case node -1 here means do the allocation
 *                on the local CPU. This is normally identical to default,
 *                but useful to set in a VMA when you have a non default
 *                process policy.
 * default        Allocate on the local node first, or when on a VMA
 *                use the process policy. This is what Linux always did
 *                in a NUMA aware kernel and still does by, ahem, default.
 *
 * The process policy is applied for most non interrupt memory allocations
 * in that process' context. Interrupts ignore the policies and always
 * try to allocate on the local CPU. The VMA policy is only applied for memory
 * allocations for a VMA in the VM.
 *
 * Currently there are a few corner cases in swapping where the policy
 * is not applied, but the majority should be handled. When process policy
 * is used it is not remembered over swap outs/swap ins.
 *
 * Only the highest zone in the zone hierarchy gets policied. Allocations
 * requesting a lower zone just use default policy. This implies that
 * on systems with highmem kernel lowmem allocation don't get policied.
 * Same with GFP_DMA allocations.
 *
 * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
 * all users and remembered even when nobody has memory mapped.

This patch:

This is the core NUMA API code. This includes NUMA policy aware wrappers for get_free_pages and alloc_page_vma(). On non NUMA kernels these are defined away.

The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html), get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are implemented here.

Adds a vm_policy field to the VMA and to the process. The process also has a field for interleaving. VMA interleaving uses the offset into the VMA, but that's not possible for process allocations.

From: Andi Kleen <ak@muc.de>

> Andi, how come policy_vma() calls ->set_policy under i_shared_sem?

I think this can be actually dropped now. In an earlier version I did walk the vma shared list to change the policies of other mappings to the same shared memory region. This turned out too complicated with all the corner cases, so I eventually gave in and added ->get_policy to the fast path. Also there is still the mmap_sem which prevents races in the same MM. Patch to remove it attached. Also adds documentation and removes the bogus __alloc_page_vma() prototype noticed by hch.

From: Andi Kleen <ak@suse.de>

A few incremental fixes for NUMA API.

- Fix a few comments
- Add a compat_ function for get_mem_policy. I considered changing the ABI to avoid this, but that would have made the API too ugly. I put it directly into the file because a mm/compat.c didn't seem worth it just for this.
- Fix the algorithm for VMA interleave.

From: Matthew Dobson <colpatch@us.ibm.com>

1) Move the extern of alloc_pages_current() into #ifdef CONFIG_NUMA. The only references to the function are in NUMA code in mempolicy.c
2) Remove the definitions of __alloc_page_vma(). They aren't used.
3) Move forward declaration of struct vm_area_struct to top of file.
2004-05-22[PATCH] Convert i_shared_sem back to a spinlockAndrew Morton1-2/+2
Having a semaphore in there causes modest performance regressions on heavily mmap-intensive workloads on some hardware. Specifically, up to 30% in SDET on NUMAQ and big PPC64. So switch it back to being a spinlock.

This does mean that unmap_vmas() needs to be told whether or not it is allowed to schedule away; that's simple to do via the zap_details structure. This change means that there will be high scheduling latencies when someone truncates a large file which is currently mmapped, but nobody does that anyway. The scheduling points in unmap_vmas() are mainly for munmap() and exit(), and they still will work OK for that.

From: Hugh Dickins <hugh@veritas.com>

Sorry, my premature optimizations (trying to pass down NULL zap_details except when needed) have caught you out doubly: unmap_mapping_range_list was NULLing the details even though atomic was set; and if it hadn't, then zap_pte_range would have missed free_swap_and_cache and pte_clear when pte not present. Moved the optimization into zap_pte_range itself. Plus massive documentation update.

From: Hugh Dickins <hugh@veritas.com>

Here's a second patch to add to the first: mremap's cows can't come home without releasing the i_mmap_lock, better move the whole "Subtle point" locking from move_vma into move_page_tables. And it's possible for the file that was behind an anonymous page to be truncated while we drop that lock, don't want to abort mremap because of VM_FAULT_SIGBUS. (Eek, should we be checking do_swap_page of a vm_file area against the truncate_count sequence? Technically yes, but I doubt we need bother.)

- We cannot hold i_mmap_lock across move_one_page() because move_one_page() needs to perform __GFP_WAIT allocations of pagetable pages.
- Move the cond_resched() out so we test it once per page rather than only when move_one_page() returns -EAGAIN.
2004-05-22[PATCH] rmap 10 add anonmm rmapAndrew Morton1-2/+16
From: Hugh Dickins <hugh@veritas.com> Hugh's anonmm object-based reverse mapping scheme for anonymous pages. We have not yet decided whether to adopt this scheme, or Andrea's more advanced anon_vma scheme. anonmm is easier for me to merge quickly, to replace the pte_chain rmap taken out in the previous patch; a patch to install Andrea's anon_vma will follow in due course. Why build up and tear down chains of pte pointers for anonymous pages, when a page can only appear at one particular address, in a restricted group of mms that might share it? (Except: see next patch on mremap.) Introduce struct anonmm per mm to track anonymous pages, all forks from one exec sharing the same bundle of linked anonmms. Anonymous pages originate in one mm, but may be forked into another mm of the bundle later on. Callouts from fork.c to allocate, dup and exit the anonmm structure private to rmap.c. From: Hugh Dickins <hugh@veritas.com> Two concurrent exits (of the last two mms sharing the anonhd). First exit_rmap brings anonhd->count down to 2, gets preempted (at the spin_unlock) by second, which brings anonhd->count down to 1, sees it's 1 and frees the anonhd (without making any change to anonhd->count itself), cpu goes on to do something new which reallocates the old anonhd as a new struct anonmm (probably not a head, in which case count will start at 1), first resumes after the spin_unlock and sees anonhd->count 1, frees "anonhd" again, it's used for something else, a later exit_rmap list_del finds list corrupt.
2004-05-22[PATCH] slab: consolidate panic codeAndrew Morton3-37/+12
Many places do:

    if (kmem_cache_create(...) == NULL)
        panic(...);

We can consolidate all that by passing another flag to kmem_cache_create() which says "panic if it doesn't work".
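The flag went in as SLAB_PANIC, collapsing the idiom to a single call (sketch; cache name and object type are placeholders, six-argument 2.6-era kmem_cache_create()):

    /* sketch: no NULL check needed - the allocator panics on failure */
    cachep = kmem_cache_create("my_cache", sizeof(struct my_obj),
                               0, SLAB_PANIC, NULL, NULL);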
2004-05-21[PATCH] swsusp: fix devfs breakage introduced in 2.6.6Andrew Morton1-9/+22
From: Pavel Machek <pavel@ucw.cz> This fixes bad interaction between devfs and swsusp. Check whether the swap device is the specified resume device, irrespective of whether they are specified by identical names. (Thus, device inode aliasing is allowed. You can say /dev/hda4 instead of /dev/ide/host0/bus0/target0/lun0/part4 [if using devfs] and they'll be considered the same device. This is *necessary* for devfs, since the resume code can only recognize the form /dev/hda4, but the suspend code would like the long name [as shown in 'cat /proc/mounts'].) [Thanks to devfs hero whose name I forgot.]
2004-05-21[PATCH] swsusp: kill unnecessary debuggingAndrew Morton1-7/+0
From: Pavel Machek <pavel@ucw.cz>

This is no longer necessary. We have enough pauses elsewhere, and it works well enough that this is not needed.
2004-05-21[PATCH] Sanitise handling of unneeded syscall stubsAndrew Morton6-17/+34
From: David Mosberger <davidm@napali.hpl.hp.com>

Below is a patch that tries to sanitize the dropping of unneeded system-call stubs in generic code. In some instances, it would be possible to move the optional system-call stubs into a library routine which would avoid the need for #ifdefs, but in many cases, doing so would require making several functions global (and possibly exporting additional data-structures in header-files). Furthermore, it would inhibit (automatic) inlining in the cases where the stubs are needed. For these reasons, the patch keeps the #ifdef-approach.

This has been tested on ia64 and there were no objections from the arch-maintainers (and one positive response). The patch should be safe but arch-maintainers may want to take a second look to see if some __ARCH_WANT_foo macros should be removed for their architecture (I'm quite sure that's the case, but I wanted to play it safe and only preserved the status-quo in that regard).
2004-05-20[PATCH] trivial: swsusp section usageAndrew Morton1-1/+1
From: Rusty Russell <rusty@rustcorp.com.au> From: Pavel Machek <pavel@ucw.cz> This patch fixes init section usage in swsusp.c: "read_suspend_image()" can be __init.
2004-05-20[PATCH] Debugging option to put data symbols in kallsymsAndrew Morton1-1/+4
From: Rusty Russell <rusty@rustcorp.com.au> kallsyms contains only function names, but some debuggers (eg. xmon on PPC/PPC64) use it to lookup symbols: it'd be much nicer if it included data symbols too.
2004-05-20[PATCH] fix for stuck cpus at bootAndrew Morton1-1/+1
From: Anton Blanchard <anton@samba.org> From: Rusty Russell <rusty@rustcorp.com.au> When hotplug cpu isn't enabled, cpu_is_offline is always false. I had a stuck cpu at boot that resulted in a lockup because we tried to start a migration thread on it. Instead of cpu_is_offline we can use !cpu_online which should cover both the hotplug cpu enabled and disabled cases.
2004-05-19[PATCH] Work around gcc 3.3.3-hammer sched miscompilation on x86-64Andrew Morton1-1/+4
From: Andi Kleen <ak@muc.de> The new domain scheduler got miscompiled on x86-64 with gcc 3.3.3-hammer, which is shipping with some distributions. The kernel deadlocks eventually under light stress on SMP systems with the right options. After some experiments it seems this simple change avoids the miscompilation. It also doesn't pessimize the code unduly for other architectures.
2004-05-19[PATCH] system_state splitupAndrew Morton1-4/+4
Split the system_state state `SYSTEM_SHUTDOWN' into SYSTEM_HALT, SYSTEM_POWER_OFF and SYSTEM_RESTART and export system_state to modules. This allows driver shutdown routines to know why they are being shut down. The IDE subsystem wants this so that it knows to not spin the disks down across a reboot.
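The resulting states look like this (a sketch of the split described above; the pre-existing boot/run states are shown for context):

    enum system_states {
        SYSTEM_BOOTING,
        SYSTEM_RUNNING,
        SYSTEM_HALT,
        SYSTEM_POWER_OFF,
        SYSTEM_RESTART,
    };
    extern enum system_states system_state;  /* now exported to modules */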
2004-05-18Add msleep function to the kernel core to prevent duplication.Greg Kroah-Hartman1-0/+17
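The core of such a helper is only a few lines; a sketch of the 2.6-era version:

    /* sketch: sleep for at least msecs milliseconds, uninterruptible */
    void msleep(unsigned int msecs)
    {
        unsigned long timeout = msecs_to_jiffies(msecs);

        while (timeout) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            timeout = schedule_timeout(timeout);
        }
    }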
2004-05-14Fix gidsetsize == 0 for real this time.Linus Torvalds1-2/+3
We need to always allocate at least one indirect block pointer, since we always fill out blocks[0] even if we don't have any groups.
2004-05-14[PATCH] groups_alloc(0) clobbers memory past end of blockAndrew Morton1-2/+1
From: Olaf Kirch <okir@suse.de>

Authentication code in net/sunrpc makes frequent use of groups_alloc(0), which seems to clobber memory past the end of what it allocated. If called with gidsetsize == 0, groups_alloc will set nblocks = 0, but still does:

    group_info->blocks[0] = group_info->small_block;
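The eventual fix (see Linus' follow-up entry above) always allocates at least one indirect block pointer; a sketch:

    /* sketch: groups_alloc() block count */
    nblocks = (gidsetsize + NGROUPS_PER_BLOCK - 1) / NGROUPS_PER_BLOCK;
    /* always allocate at least one indirect block pointer, since
     * blocks[0] is filled out even when there are no groups */
    nblocks = nblocks ? : 1;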
2004-05-14[PATCH] implement print_modules()Andrew Morton1-0/+11
From: Arjan van de Ven <arjanv@redhat.com>, Rusty Russell <rusty@rustcorp.com.au> The patch below resolves the "Not Yet Implemented" print_modules() thing. This is a really useful feature for distros; it allows us to do statistical analysis on which modules are present how often in oopses compared to how often they are used normally. In addition it helps to spot candidates for certain bugs without having to go back to the customer asking for this information.
2004-05-14[PATCH] create_workqueue locking fixAndrew Morton1-2/+2
Fix some silliness in there.
2004-05-14[PATCH] Include Aliases in kallsymsAndrew Morton1-4/+10
From: Rusty Russell <rusty@rustcorp.com.au>

Kallsyms discards symbols with the same address, but these are sometimes useful. Skip this minor optimization and make kallsyms_lookup deal with aliases.
2004-05-14[PATCH] show last kernel-image symbol in /proc/kallsymsAndrew Morton1-6/+9
From: Rusty Russell <rusty@rustcorp.com.au> The current code doesn't show the last symbol (usually _einittext) in /proc/kallsyms. The reason for this is subtle: s_start() returns an empty string for position 0 (ignored by s_show()), and s_next() returns the first symbol for position 1. What should happen is that update_iter() for position 0 should fill in the first symbol. Unfortunately, the get_ksymbol_core() fills in the symbol information, *and* updates the iterator: we have to split these functions, which we do by making it return the length of the name offset. Then we can call get_ksymbol_core() without moving the iterator, meaning that we can call it at position 0 (ie. s_start()).
2004-05-14[PATCH] sched: less locking in balancingAndrew Morton1-5/+13
From: Nick Piggin <nickpiggin@yahoo.com.au> Analysis and basic idea from Suresh Siddha <suresh.b.siddha@intel.com> "This small change in load_balance() brings the performance back upto base scheduler(infact I see a ~1.5% performance improvement now). Basically this fix removes the unnecessary double_lock.." Workload is SpecJBB on 16-way Altix.
2004-05-14[PATCH] sched: fix scheduler for unsynched processor sched_clockAndrew Morton1-10/+28
From: Nick Piggin <nickpiggin@yahoo.com.au> Fine-tune the unsynched sched_clock handling. Basically, you need to be careful about ensuring timestamps get correctly adjusted when moving CPUs, and you *can't* look at your unadjusted sched_clock() and a remote task's ->timestamp and try to come up with anything meaningful. I think this second problem will really hit hard in the activate_task path on systems with unsynched sched_clock when you're waking up a remote task, which happens very often. Andi, I thought some Opterons have unsynched tscs? Maybe this is causing your unexplained bad interactivity? Another problem is a fixup in pull_task. When adjusting ->timestamp from one processor to another, you must use timestamp_last_tick for the local processor too. Using sched_clock() will cause ->timestamp to creep forward. A final small fix is for sync wakeups. They were using __activate_task for some reason, thus they don't get credited for sleeping at all AFAIKS. And another thing, do we want to #ifdef timestamp_last_tick so it doesn't show on UP?
2004-05-14[PATCH] sched: improved cpu_load roundingAndrew Morton1-0/+7
From: Nick Piggin <nickpiggin@yahoo.com.au> "Siddha, Suresh B" <suresh.b.siddha@intel.com> noticed a problem in the cpu_load averaging where the integer truncation could sometimes cause cpu_load to never quite reach its target. I'm not sure that you could demonstrate a real world problem, but I quite like this fix.
2004-05-14[PATCH] s390: coreAndrew Morton2-4/+4
From: Martin Schwidefsky <schwidefsky@de.ibm.com>

s390 core changes:
- Rename idle_cpu_mask to nohz_cpu_mask as agreed with Dipankar.
- Refine compiler version check for "Q" constraints in uaccess.h.
- Store per process ptrace information to the correct place.
- Fix per cpu data access for 64-bit modules.
- Add topology_init function for cpu hotplug.
- Define TASK_SIZE dependent on TIF_31BIT and define MM_VM_SIZE to 4TB to get rid of elf_map32 and arch_get_unmapped_area.
2004-05-14[PATCH] Add del_single_shot_timer()Andrew Morton1-4/+38
From: Geoff Gustafson <geoff@linux.jf.intel.com>, "Chen, Kenneth W" <kenneth.w.chen@intel.com>, Ingo Molnar <mingo@elte.hu>, me.

The big-SMP guys are seeing high CPU load due to del_timer_sync()'s inefficiencies. The callers are fs/aio.c and schedule_timeout(). We note that neither of these callers' timer handlers actually re-add the timer - they are single-shot.

So we don't need all that complexity in del_timer_sync() - we can just run del_timer() and if that worked we know the timer is dead. Add del_single_shot_timer(), export it to modules and use it in AIO and schedule_timeout().

(these numbers are for an earlier patch, but they'll be close)

    Before:        32p      4p
    Warm cache  29,000     505
    Cold cache  37,800    1220

    After:         32p      4p
    Warm cache      95      88
    Cold cache   1,800     140

[Measurements are CPU cycles spent in a call to del_timer_sync, the average of 1000 calls. 32p is 16-node NUMA, 4p is SMP.]

(I cleaned up a few things and added some commentary)
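A condensed sketch of the resulting primitive (the idea, not the literal patch):

    /* sketch: delete a timer whose handler never re-adds it */
    int del_single_shot_timer(struct timer_list *timer)
    {
        int ret = del_timer(timer);

        if (ret)
            return ret;  /* timer was still pending: it is now dead */
        /* not pending: the handler may be running right now, so fall
         * back to the expensive synchronous wait */
        return del_timer_sync(timer);
    }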
2004-05-14[PATCH] Revisited: ia64-cpu-hotplug-cpu_present.patchAndrew Morton4-6/+14
From: Paul Jackson <pj@sgi.com>

With a hotplug capable kernel, there is a requirement to distinguish a possible CPU from one actually present. The set of possible CPU numbers doesn't change during a single system boot, but the set of present CPUs changes as CPUs are physically inserted into or removed from a system. The cpu_possible_map does not change once initialized at boot, but the cpu_present_map changes dynamically as CPUs are inserted or removed.

Paul Jackson <pj@sgi.com> provided an expanded explanation:

Ashok's cpu hot plug patch adds a cpu_present_map, resulting in the following cpu maps being available. All the following maps are fixed size bitmaps of size NR_CPUS.

    #ifdef CONFIG_HOTPLUG_CPU
    cpu_possible_map - map with all NR_CPUS bits set
    cpu_present_map  - map with bit 'cpu' set iff cpu is populated
    cpu_online_map   - map with bit 'cpu' set iff cpu available to scheduler
    #else
    cpu_possible_map - map with bit 'cpu' set iff cpu is populated
    cpu_present_map  - copy of cpu_possible_map
    cpu_online_map   - map with bit 'cpu' set iff cpu available to scheduler
    #endif

In either case, NR_CPUS is fixed at compile time, as the static size of these bitmaps. The cpu_possible_map is fixed at boot time, as the set of CPU id's that it is possible might ever be plugged in at anytime during the life of that system boot. The cpu_present_map is dynamic(*), representing which CPUs are currently plugged in. And cpu_online_map is the dynamic subset of cpu_present_map, indicating those CPUs available for scheduling.

If HOTPLUG is enabled, then cpu_possible_map is forced to have all NR_CPUS bits set, otherwise it is just the set of CPUs that ACPI reports present at boot. If HOTPLUG is enabled, then cpu_present_map varies dynamically, depending on what ACPI reports as currently plugged in, otherwise cpu_present_map is just a copy of cpu_possible_map.

(*) Well, cpu_present_map is dynamic in the hotplug case. If not hotplug, it's the same as cpu_possible_map, hence fixed at boot.
2004-05-14[PATCH] ia64 cpu hotplug: core kernel initialisationAndrew Morton1-1/+1
From: Ashok Raj <ashok.raj@intel.com>

This patch changes __init to __devinit for init_idle so that when a new cpu arrives, it can call these functions at a later time.
2004-05-14[PATCH] filtered wakeups: wakeup enhancementsAndrew Morton1-8/+9
From: William Lee Irwin III <wli@holomorphy.com> This patch provides an additional argument to __wake_up_common() so that the information wakefunc.patch made waiters ready to receive may be passed to them by wakers. This is provided as a separate patch so that the overhead of the additional argument to __wake_up_common() can be measured in isolation. No change in performance was observable here.
2004-05-14[PATCH] filtered wakeupsAndrew Morton2-4/+4
From: William Lee Irwin III <wli@holomorphy.com>

This patch series is solving the "thundering herd" problem that occurs in the mainline implementation of hashed waitqueues. There are two sources of spurious wakeups in such arrangements:

(a) Hash collisions that place waiters on different objects on the same waitqueue, which wakes threads falsely when any of the objects hashed to the same queue receives a wakeup. i.e. loss of information about which object a wakeup event is related to.

(b) Loss of information about which object a given waiter is waiting on. This precludes wake-one semantics for mutual exclusion scenarios. For instance, a lock bit may be slept on. If there are any waiters on the object, a lock bit release event must wake at least one of them so as to prevent deadlock. But without information as to which waiter is waiting on which object, we must resort to waking all waiters who could possibly be waiting on it. Now, as the lock bit provides mutual exclusion, only one of the waiters woken can proceed, and the remainder will go back to sleep and wait for another event, creating unnecessary system load.

Once wake-one semantics are established, only one of the waiters waiting to acquire a lock bit needs to be woken, which measurably reduces system load and improves efficiency (i.e. it's the subject of the benchmarking I've been sending to you).

Even beyond the measurable efficiency gains, there are reasons of robustness and responsiveness to motivate addressing the issue of thundering herds. In a real-life scenario I've been personally involved in resolving, the thundering herd issue caused powerful modern SMP machines with fast IO systems to be unresponsive to user input for a minute at a time or more. Analogues of these patches for the distro kernels involved fully resolved the issue to the customer's satisfaction and obviated workarounds to limit the pagecache's size.

The latest spin of these patches basically shoves more pieces of the logic into the wakeup functions, with some efficiency gains from sharing the hot codepath with the rest of the kernel, and a slightly larger diff than the patches with the newly-introduced entrypoint. Writing these was motivated by the push to insulate sched.c from more of the details of wakeup semantics by putting more of the logic into the wakeup functions. In order to accomplish this while still solving (b), the wakeup functions grew a new argument for communication about what object a wakeup event is related to to be passed by the waker.

=========

This patch provides an additional argument to wakeup functions so that information may be passed from the waker to the waiter. This is provided as a separate patch so that the overhead of the additional argument can be measured in isolation. No change in performance was observable here.
2004-05-14[PATCH] revert the process-migration-speedup patchAndrew Morton1-10/+0
David Mosberger asked that this be backed out: "I do not believe that flushing the TLB before migration is be the right thing to do on ia64 machines which support global TLB purges (i.e., all but SGI's machines)." It was of huge benefit for the SGI machines, so work is ongoing.
2004-05-14[PATCH] MSEC_TO_JIFFIES to msec_to_jiffiesAndrew Morton1-1/+1
Switch all users of MSEC[S]_TO_JIFFIES and JIFFIES_TO_MSEC[S] over to use jiffies_to_msecs() and msecs_to_jiffies(). Withdraw MSECS_TO_JIFFIES() and JIFFIES_TO_MSECS() from the kernel API.
2004-05-14[PATCH] MSEC_TO_JIFFIES consolidationAndrew Morton1-8/+1
From: Ingo Molnar <mingo@elte.hu>

We have various different implementations of MSEC[S]_TO_JIFFIES and JIFFIES_TO_MSEC[S]. We recently had a compile-time clash in USB. Fix all that up.

- The SCTP version was very inefficient. Hopefully this version is accurate enough.
- Optimise for the HZ=100 and HZ=1000 cases
- This version does round-up, so sleep(9 milliseconds) works OK on 100HZ.
- We still have lots of jiffies_to_msec and msec_to_jiffies implementations.

From: William Lee Irwin III <wli@holomorphy.com>

Optimize the cases where HZ is a divisor of 1000 or vice-versa in JIFFIES_TO_MSECS() and MSECS_TO_JIFFIES() by allowing the nonvanishing(!) integral ratios to appear as parenthesized expressions eligible for constant folding optimizations.

From: me

Use typesafe inlines for the jiffies-to-millisecond conversion functions. This means that milliseconds officially takes the type `unsigned int'. All current callers seem to be OK with that. Drivers need to be fixed up to use this instead of their private versions.
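The constant-folding trick reduces the conversion to arithmetic the compiler can evaluate at build time; a sketch:

    /* sketch: round-up conversion, constant-folded for common HZ values */
    static inline unsigned long msecs_to_jiffies(const unsigned int m)
    {
    #if HZ <= 1000 && !(1000 % HZ)
        return (m + (1000 / HZ) - 1) / (1000 / HZ);
    #elif HZ > 1000 && !(HZ % 1000)
        return m * (HZ / 1000);
    #else
        return (m * HZ + 999) / 1000;
    #endif
    }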
2004-05-14[PATCH] sched: add missing local_irq_enable()Andrew Morton1-0/+1
From: Nick Piggin <nickpiggin@yahoo.com.au> this_rq_lock does a local_irq_disable, and sched_yield() needs to undo that.
2004-05-14Merge kroah.com:/home/greg/linux/BK/bleed-2.6Greg Kroah-Hartman2-1/+161
into kroah.com:/home/greg/linux/BK/driver-2.6
2004-05-10Module attributes: fix build error if CONFIG_MODULE_UNLOAD=nGreg Kroah-Hartman1-16/+16
Thanks to Andrew Morton for pointing this out to me.
2004-05-10[PATCH] Make usermodehelper_init() use core_initcall()Andrew Morton1-1/+1
We may as well make usermodehelper_init() core_initcall as well, to make sure its services are available to all the other initcall levels.
2004-05-10[PATCH] minor RCU optimizationAndrew Morton1-2/+2
From: Stephen Hemminger <shemminger@osdl.org> Minor tweak to rcu, use __list_splice instead of list_splice because the list has already been checked for empty.
2004-05-10[PATCH] Add sysctl to define a hugetlb-capable groupAndrew Morton1-0/+8
From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>, "Seth, Rohit" <rohit.seth@intel.com> This patch addresses the longstanding problem wherein Oracle needs CAP_IPC_LOCK to allocate SHM_HUGETLB shm memory, but people don't want to run Oracle as root, and capabilties are busted. Various ideas with rlimits didn't work out, mainly because these objects live beyond the lifetime of the user processes which establish them. What we do is to create root-writeable /proc/sys/vm/hugetlb_shm_group which specifies a single group ID. Users who belong to that group may allocate hugepages for SHM_HUGETLB shm segments. So the sysadmin will greate a new group, say `hugepageusers', will add the oracle user to that group and will write that group's ID into /proc/sys/vm/hugetlb_shm_group.
2004-05-10[PATCH] worker_thread race fixAndrew Morton1-3/+4
Fix a waitqueue-handling race in worker_thread().
2004-05-10[PATCH] fix deadlock in create_workqueue()Andrew Morton1-1/+1
Fix bug identified by Srivatsa Vaddagiri <vatsa@in.ibm.com>: There's a deadlock in __create_workqueue when CONFIG_HOTPLUG_CPU is set. This can happen when create_workqueue_thread fails to create a worker thread. In that case, we call destroy_workqueue with cpu hotplug lock held. destroy_workqueue however also attempts to take the same lock.
2004-05-10[PATCH] Only Print Taint Message OnceAndrew Morton1-1/+1
From: Rusty Russell <rusty@rustcorp.com.au> Only print the tainted message the first time. Its purpose is to warn users that we can't support them, not to fill their logs.
2004-05-10[PATCH] find_user locking and leak fixAndrew Morton2-1/+16
find_user() is being called from set/get_priority(), but it doesn't take the needed lock, and those callers were forgetting to drop the refcount which find_user() took.
2004-05-09[PATCH] sched: in_sched_functions() cleanupAndrew Morton1-7/+8
From: Rusty Russell <rusty@rustcorp.com.au>

1) Create an in_sched_functions() function in sched.c and make the archs use it (see the sketch below). (Two archs have wchan #if 0'd out: left them alone).
2) Move __sched from linux/init.h to linux/sched.h and add comment.
3) Rename __scheduling_functions_start_here/end_here to __sched_text_start/end.

Thanks to wli and Sam Ravnborg for clue donation.
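Item 1 boils down to a bounds check against the new linker symbols; a sketch:

    /* sketch: is this address inside the scheduler's __sched text? */
    int in_sched_functions(unsigned long addr)
    {
        /* the linker script places all __sched code between these */
        extern char __sched_text_start[], __sched_text_end[];

        return addr >= (unsigned long)__sched_text_start &&
               addr < (unsigned long)__sched_text_end;
    }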
2004-05-09[PATCH] migration_thread() race fixAndrew Morton1-1/+3
From: Srivatsa Vaddagiri <vatsa@in.ibm.com> Noticed that migration_thread can examine "kthread_should_stop()?" without setting its state to TASK_INTERRUPTIBLE first. This can cause kthread_stop on that thread to block forever ... P.S - I assumed that having the task state set to TASK_INTERRUTIBLE while it is doing active_load_balance is fine. It seemed to be the case earlier also.
2004-05-09[PATCH] sched_getaffinity vs cpu hotplug race fixAndrew Morton1-0/+2
From: Srivatsa Vaddagiri <vatsa@in.ibm.com> Fix the race in sys_sched_getaffinity. Patch below takes cpu_hotplug lock before reading cpus_allowed mask of a task.
2004-05-09[PATCH] Move migrate_all_tasks to CPU_DEAD handlingAndrew Morton3-39/+110
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>

migrate_all_tasks is currently run with the rest of the machine stopped. It iterates thr' the complete task table, turning off cpu affinity of any task that it finds affine to the dying cpu. Depending on the task table size this can take considerable time. All this time the machine is stopped, doing nothing.

Stopping the machine for such extended periods can be avoided if we do task migration in CPU_DEAD notification and that's precisely what this patch does.

The patch puts the idle task to the _front_ of the dying CPU's runqueue at the highest priority possible. This causes the idle thread to run _immediately_ after the kstopmachine thread yields. The idle thread notices that its cpu is offline and dies quickly. Task migration can then be done at leisure in CPU_DEAD notification, when the rest of the CPUs are running.

Some advantages with this approach are:

- More scalable. Predictable amount of time that the machine is stopped.

- No changes to hot path/core code. We are just exploiting scheduler rules which run the next high-priority task on the runqueue. Also since I put the idle task to the _front_ of the runqueue, there are no races when an equally high priority task is woken up and added to the runqueue. It gets in at the back of the runqueue, _after_ the idle task!

- The cpu_is_offline check that is presently required in try_to_wake_up, idle_balance and rebalance_tick can be removed, thus speeding them up a bit.

From: Srivatsa Vaddagiri <vatsa@in.ibm.com>

Rusty mentioned that the unlikely hints against cpu_is_offline are redundant since the macro already has that hint. Patch below removes those redundant hints I added.
2004-05-09[PATCH] sched: Look at another CPU's domainAndrew Morton1-4/+4
From: Nick Piggin <nickpiggin@yahoo.com.au> The SMT wake_idle code really wants to look at a non-local CPU's domain in order to check for idle siblings. So change the domain attachment code a little bit so we continue to hold a runqueue's lock while attaching a new domain. This means the locking rules have changed to: you may access your own domain without any lock, you must hold a remote runqueue's lock in order to view its domain.
2004-05-09[PATCH] sched: micro-optimisation for wake_upAndrew Morton1-4/+5
From: Nick Piggin <nickpiggin@yahoo.com.au> This actually does produce better code, especially under the locked section. Turns a conditional + unconditional jump under the lock in the unlikely case into a cmov outside the lock.
2004-05-09[PATCH] sched: reduce idle timeAndrew Morton1-1/+2
From: Nick Piggin <nickpiggin@yahoo.com.au>

It makes NEWLY_IDLE balances cause find_busiest_group to return the busiest available group even if there isn't an imbalance. Basically - try a bit harder to prevent schedule emptying the runqueue.

It is quite aggressive, but that isn't so bad because we don't (by default) do NEWLY_IDLE balancing across NUMA nodes, and NEWLY_IDLE balancing is always restricted to cache_hot tasks.

It picked up a little bit of idle time that dbt2-pgsql was seeing...
2004-05-09[PATCH] sched: balance-on-cloneAndrew Morton2-39/+150
From: Ingo Molnar <mingo@elte.hu>

Implement balancing during clone(). It does the following things:

- introduces SD_BALANCE_CLONE that can serve as a tool for an architecture to limit the search-idlest-CPU scope on clone(). E.g. the 512-CPU systems should rather not enable this.

- uses the highest sd for the imbalance_pct, not this_rq (which didn't make sense).

- unifies balance-on-exec and balance-on-clone via the find_idlest_cpu() function. Gets rid of sched_best_cpu() which was still a bit inconsistent IMO, it used 'min_load < load' as a condition for balancing - while a more correct approach would be to use half of the imbalance_pct, like passive balancing does.

- the patch also reintroduces the possibility to do SD_BALANCE_EXEC on SMP systems, and activates it - to get testing.

- NOTE: there's one thing in this patch that is slightly unclean: i introduced wake_up_forked_thread. I did this to make it easier to get rid of this patch later (wake_up_forked_process() has lots of dependencies in various architectures). If this capability remains in the kernel then i'll clean it up and introduce one function for wake_up_forked_process/thread.

- NOTE2: i added the SD_BALANCE_CLONE flag to the NUMA CPU template too. Some NUMA architectures probably want to disable this.
2004-05-09[PATCH] sched: cpu load management cleanupAndrew Morton1-10/+16
From: Ingo Molnar <mingo@elte.hu> This does the source/target cleanup. This is a no-functionality patch which also adds more comments to explain these functions.
2004-05-09[PATCH] sched: passive balancing dampingAndrew Morton1-16/+19
From: Nick Piggin <nickpiggin@yahoo.com.au> This patch starts to balance woken processes when half the relevant domain's imbalance_pct is reached. Previously balancing would start after a small, constant difference in waker/wakee runqueue loads was reached, which would cause too much process movement when there are lots of processes running. It also turns wake balancing into a domain flag while previously it was always on. Now sched domains can "soft partition" an SMP system without using processor affinities.
2004-05-09[PATCH] sched: cleanupsAndrew Morton1-17/+14
From: Ingo Molnar <mingo@elte.hu> This re-adds cleanups which were lost in splitups of an earlier patch.
2004-05-09[PATCH] sched: lock cpu_attach_domain for hotplugAndrew Morton1-0/+4
From: Nick Piggin <nickpiggin@yahoo.com.au> The attached patch is required to work correctly with the CPU hotplug framework. John Hawkes reports successful booting with this.
2004-05-09[PATCH] sched: extend sync wakeupsAndrew Morton1-2/+2
From: Ingo Molnar <mingo@elte.hu> The attached patch extends sync wakeups to the process sys_exit() path too: the chldwait wakeup can be done sync, since we know that the process is going to exit (and thus deschedule). The most visible effect of this change is strace's behavior on SMP systems: it now stays on a single CPU, together with the traced child. (previously it would run in parallel to the child, bouncing around madly.)
2004-05-09[PATCH] sched: add enqueue_task_head()Andrew Morton1-0/+15
From: Ingo Molnar <mingo@elte.hu> Helper function for later patches
2004-05-09[PATCH] sched: uninliningsAndrew Morton1-12/+12
From: Ingo Molnar <mingo@elte.hu> Uninline things
2004-05-09[PATCH] sched: minor cleanupsAndrew Morton1-28/+19
From: Nick Piggin <nickpiggin@yahoo.com.au> Minor cleanups from Ingo's patch including task_hot (do it right in try_to_wake_up too).
2004-05-09[PATCH] sched: fix setup racesAndrew Morton1-45/+118
From: Nick Piggin <nickpiggin@yahoo.com.au> De-racify the sched domain setup code. This involves creating a dummy "init" domain during sched_init (which is called early). When topology information becomes available, the sched domains are then built and attached. The attach mechanism is asynchronous and uses the migration threads, which perform the switch with interrupts off. This is a quiescent state, so domains can still be lockless on the read side. It also allows us to change the domains at runtime without much more work. This is something SGI is interested in to elegantly do soft partitioning of their systems without having to use hard cpu affinities (which cause balancing problems of their own). The current setup code also has a race somewhere because it is unable to boot on a 384 CPU system. From: Anton Blanchard <anton@samba.org> This is basically a mindless ppc64 merge of the x86 changes to sched domain init code. Actually if I produce a sibling_map[] then the x86 code and the ppc64 will be identical. Maybe we can merge it.
2004-05-09[PATCH] sched: oops fixAndrew Morton1-6/+5
From: Nick Piggin <nickpiggin@yahoo.com.au>

After the for_each_domain change, the warn here won't trigger; instead it will oops in the if statement. Also, make sure we don't pass an empty cpumask to for_each_cpu().
2004-05-09[PATCH] sched: fix imbalance calculationsAndrew Morton1-16/+22
From: Nick Piggin <nickpiggin@yahoo.com.au>

Imbalance calculations were not right. This would cause unneeded migration.
2004-05-09[PATCH] sched: wakeup balancing fixesAndrew Morton1-7/+17
From: Nick Piggin <nickpiggin@yahoo.com.au>

Make affine wakeups and "passive load balancing" more conservative. Aggressive affine wakeups were causing huge regressions in dbt3-pgsql on 8-way non-NUMA systems at OSDL's STP.
2004-05-09[PATCH] Hotplug CPU sched_balance_exec FixAndrew Morton1-9/+28
From: Rusty Russell <rusty@rustcorp.com.au>
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
From: Andrew Morton <akpm@osdl.org>
From: Rusty Russell <rusty@rustcorp.com.au>

We want to get rid of lock_cpu_hotplug() in sched_migrate_task. Found that lockless migration of an execing task is _extremely_ racy. The races I hit are described below, along with probable solutions. Task migration done elsewhere should be safe (?) since it either holds the lock (sys_sched_setaffinity) or is done entirely with preemption disabled (load_balance).

sched_balance_exec does:

  a. disables preemption
  b. finds new_cpu for current
  c. enables preemption
  d. calls sched_migrate_task to migrate current to new_cpu

and sched_migrate_task does:

  e. task_rq_lock(p)
  f. migrate_task(p, dest_cpu ..) (if we have to wait for migration thread)
  g. task_rq_unlock()
  h. wake_up_process(rq->migration_thread)
  i. wait_for_completion()

Several things can happen here:

  1. new_cpu can go down after h and before the migration thread has got around to handling the request ==> we need to add a cpu_is_offline check in __migrate_task (see the sketch after this entry).

  2. new_cpu can go down between c and d, or before f ==> even though this case is automatically handled by the above change (migrate_task being called on a running task, current, will delegate migration to the migration thread), it would be good practice to avoid calling migrate_task in the first place when dest_cpu is offline. This means adding another cpu_is_offline check after e in sched_migrate_task.

  3. The 'current' task can get preempted _immediately_ after g, and when it comes back, task_cpu(p) can be dead. In that case it is invalid to do a wake_up on a non-existent migration thread (rq->migration_thread can be NULL) ==> we should disable preemption through g and h.

  4. Before the migration thread gets around to handling the request, its cpu goes dead. This will leave unhandled migration requests on the dead cpu ==> we need to wake up sleeping requestors (if any) in the CPU_DEAD notification.

I really wonder if we can get rid of these issues by avoiding balancing at exec time and instead have it balanced during load_balance. Alternately, if this is valuable and we want to retain it, I think we still need to consider a read/write sem, with sched_migrate_task doing down_read_trylock. This may eliminate the deadlock I hit between cpu_up and the CPU_UP_PREPARE notification, which had forced me away from the r/w sem. Anyway, the patch below addresses the above races. It is against 2.6.6-rc2-mm1 and has been tested on a 4-way Intel Pentium SMP machine.

Rusty sez: Two other changes:

  1) I grabbed a reference to the thread, rather than using preempt_disable(). It's the more obvious way, I think.

  2) Why the wait_to_die code? It might be needed if we move tasks after stop_machine, but for now I don't see the problem with the migration thread running on the wrong CPU for a bit: nothing is on this runqueue so active_load_balance is safe, and __migrate_task will be a noop (due to the cpu_is_offline() check). If there is a problem, your fix is racy, because we could be preempted immediately afterwards. So I just stop the kthread, then wake up any remaining...
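Race 1 reduces to one guard where the migration thread actually moves the task. A sketch (the function outline here is assumed; only the cpu_is_offline() check is the point):

static int __migrate_task(struct task_struct *p, int dest_cpu)
{
	/* dest_cpu may have gone down between the request being
	 * queued and this thread running it (race 1 above); refuse
	 * to move a task onto a dead runqueue. */
	if (cpu_is_offline(dest_cpu))
		return 0;

	/* ... double-lock the two runqueues and move the task ... */
	return 1;
}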
2004-05-09[PATCH] sched: trivial fixes, cleanupsAndrew Morton1-243/+242
From: Ingo Molnar <mingo@elte.hu>

The trivial fixes.

- added recent trivial bits from Nick's and my patches
- hotplug CPU fix
- early init cleanup
2004-05-09[PATCH] Reduce TLB flushing during process migrationAndrew Morton1-0/+10
From: Martin Hicks <mort@wildopensource.com>

Another optimization patch from Jack Steiner, intended to reduce TLB flushes during process migration. Most architectures should define tlb_migrate_prepare() to be flush_tlb_mm(), but on i386 it would be a wasted flush, because i386 disconnects previous cpus from the TLB flush automatically.
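Since the changelog spells out both cases, the hook is a one-liner either way; a sketch (placement in each architecture's own headers is assumed):

/* Most architectures: flush the mm before it starts running elsewhere. */
#define tlb_migrate_prepare(mm)		flush_tlb_mm(mm)

/* i386 (in its own header): a no-op, since a departing CPU is
 * dropped from the mm's TLB-flush set automatically, so the
 * flush would be wasted. */
#define tlb_migrate_prepare(mm)		do { } while (0)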
2004-05-09[PATCH] sched: add local load metricsAndrew Morton1-43/+30
From: Nick Piggin <piggin@cyberone.com.au>

This patch removes the per-runqueue NR_CPUS-sized load arrays. Each time we want to check a remote CPU's load we check nr_running as well anyway, so introduce a cpu_load which is the load of the local runqueue and is kept updated in the timer tick. Put them in the same cacheline.

This has the additional benefits of having the cpu_load consistent across all CPUs and more up to date. It is sampled better too, being updated once per timer tick.

This shouldn't make much difference in scheduling behaviour, but all benchmarks are either as good or better on the 16-way NUMAQ: hackbench, reaim and volanomark are about the same; tbench and dbench are maybe a bit better. kernbench is about one percent better.

John reckons it isn't a big deal, but it does save 4K per CPU, or 2MB total on his big systems, so I figure it must be a bit kinder on the caches. I think it is just nicer in general anyway.
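A sketch of the per-tick update this implies (the averaging form is an assumption; the changelog only says cpu_load is kept updated in the timer tick):

	/* In scheduler_tick(): refresh the local runqueue's load,
	 * averaging with the previous value to damp transients. */
	unsigned long old_load = this_rq->cpu_load;
	unsigned long this_load = this_rq->nr_running * SCHED_LOAD_SCALE;

	this_rq->cpu_load = (old_load + this_load) / 2;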
2004-05-09[PATCH] sched: SMT niceness handlingAndrew Morton1-2/+115
From: Con Kolivas <kernel@kolivas.org>

This patch provides full per-package priority support for SMT processors (aka pentium4 hyperthreading) when combined with CONFIG_SCHED_SMT. It maintains cpu percentage distribution within each physical cpu package by limiting the time a lower priority task can run on a sibling cpu concurrently with a higher priority task.

It introduces a new flag into the scheduler domain:

	unsigned int per_cpu_gain;	/* CPU % gained by adding domain cpus */

This is empirically set to 15% for pentium4 at the moment and can be modified to support different values dynamically as newer processors come out with improved SMT performance. It should not matter how many siblings there are.

How it works: it compares tasks running on sibling cpus, and when a lower static priority task is running it will delay that task until

	high_priority_timeslice * (100 - per_cpu_gain) / 100 <= low_prio_timeslice

e.g. a nice 19 task's timeslice is 10ms and a nice 0 timeslice is 102ms. On vanilla, the nice 0 task runs on one logical cpu while the nice 19 task runs unabated on the other logical cpu. With smtnice, the nice 0 task runs on one logical cpu for 102ms, and the nice 19 task sleeps until the nice 0 task has 12ms remaining, and only then schedules.

Real time tasks and kernel threads are not altered by this code, and kernel threads do not delay lower priority user tasks.

With lots of thanks to Zwane Mwaikambo and Nick Piggin for help with the coding of this version.

If this is merged, it is probably best to delay pushing this upstream into mainline until sched_domains gets tested for at least one major release.
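A sketch of the sibling test this implies (hook placement and names such as smt_curr are assumptions; the inequality is the one given above). For the example: with per_cpu_gain = 15, the nice 19 task stays delayed while the nice 0 sibling task has more than ~12ms left, since remaining * 85 / 100 > 10ms.

	/* p wants to run here; smt_curr is running on the sibling cpu.
	 * Higher static_prio value means lower priority. */
	if (p->static_prio > smt_curr->static_prio &&
	    smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100 >
			task_timeslice(p))
		return 1;	/* delay p until the condition flips */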
2004-05-09[PATCH] sched_domains: use cpu_possible_mapAndrew Morton1-29/+24
From: Nick Piggin <piggin@cyberone.com.au>

This changes sched domains to contain all possible CPUs, and to check for online as needed. It's in order to play nicely with CPU hotplug.
2004-05-09[PATCH] sched-group-powerAndrew Morton1-62/+66
From: Nick Piggin <piggin@cyberone.com.au>

The following patch adds a cpu_power member to struct sched_group. This allows the special-casing for SMT groups in the balancing code to be removed. It does not take CPU hotplug into account yet, but that shouldn't be too hard.

I have tested it on the NUMAQ by pretending it has SMT. Works as expected: it active-balances across nodes.
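A sketch of the struct with its new member (the neighbouring fields are assumptions from context):

struct sched_group {
	struct sched_group *next;	/* circular list of groups */
	cpumask_t cpumask;		/* CPUs covered by this group */

	/* Effective CPU power of the group, where one full CPU is
	 * SCHED_LOAD_SCALE; e.g. two SMT siblings together count for
	 * less than 2*SCHED_LOAD_SCALE, which removes the SMT
	 * special-casing from the balancing code. */
	unsigned long cpu_power;
};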
2004-05-09[PATCH] sched_balance_exec(): don't fiddle with the cpus_allowed maskAndrew Morton1-34/+33
From: Rusty Russell <rusty@rustcorp.com.au>, Nick Piggin <piggin@cyberone.com.au>

The current sched_balance_exec() sets the task's cpus_allowed mask temporarily to move it to a different CPU. This has several issues, including the fact that a task will see its affinity at a bogus value.

So we change the migration_req_t to explicitly specify a destination CPU, rather than having the migration thread derive it from cpus_allowed. If the requested CPU is no longer valid (racing with another set_cpus_allowed, say), the request can be ignored: if the task is not allowed on this CPU, there will be another migration request pending. This change allows sched_balance_exec() to tell the migration thread what to do without changing the cpus_allowed mask.

So we rename __set_cpus_allowed() to move_task(), as the cpus_allowed mask is now set by the caller. And move_task_away(), which the migration thread uses to actually perform the move, is renamed __move_task(). I also ignore offline CPUs in sched_best_cpu(), so sched_migrate_task() doesn't need to check for offline CPUs.

Ulterior motive: this approach also plays well with CPU hotplug. Previously that patch might have seen a task with cpus_allowed only containing the dying CPU (temporarily, due to sched_balance_exec) and forcibly reset it to all CPUs, which might be wrong. The other approach is to hold the cpucontrol sem around sched_balance_exec(), which is too much of a bottleneck.
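A sketch of the request type after the change (only dest_cpu is named by the changelog; the other fields are assumptions about what such a request needs):

typedef struct {
	struct list_head list;		/* queued on the source rq */
	struct task_struct *task;	/* the task to move */
	int dest_cpu;			/* explicit destination; no longer
					 * derived from cpus_allowed */
	struct completion done;		/* requestor waits on this */
} migration_req_t;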
2004-05-09[PATCH] sched: handle inter-CPU jiffies skewAndrew Morton1-8/+8
From: Nick Piggin <piggin@cyberone.com.au>

John Hawkes described this problem to me:

There *is* a small problem in this area, though, that SuSE avoids. "jiffies" gets updated by cpu0. The other CPUs may, over time, get out of sync (and they're initialized on ia64 to start out being out of sync), so there's no guarantee that every CPU will wake up from its timer interrupt and see a "jiffies" value that is guaranteed to be last_jiffies+1. Sometimes the jiffies value may be unchanged since the last wakeup. Sometimes the jiffies value may have incremented by 2 (or more, especially if cpu0's interrupts are disabled for long stretches of time). So an algorithm that says, "I'll call load_balance() only when jiffies is *exactly* N" is going to fail on occasion, either by calling load_balance() too often or not often enough.

I fixed this by adding a last_balance field to struct sched_domain, and working off that.
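The fix turns an exact-match test into an elapsed-time test. A sketch (the interval computation and surroundings are assumed; last_balance is the field the changelog adds):

	/* Tolerates skewed clocks and skipped ticks: balance once at
	 * least 'interval' jiffies have passed, never "at tick N". */
	if (jiffies - sd->last_balance >= interval) {
		if (load_balance(this_cpu, this_rq, sd, idle))
			idle = NOT_IDLE;	/* we pulled something */
		sd->last_balance = jiffies;
	}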
2004-05-09[PATCH] sched: implement domains for i386 HTAndrew Morton1-25/+10
From: Nick Piggin <piggin@cyberone.com.au>

The following patch builds a scheduling description for the i386 architecture using cpu_sibling_map to set up SMT, if CONFIG_SCHED_SMT is set. It could be made fancier and collapse degenerate domains at runtime (i.e. 1 sibling per CPU, or 1 NUMA node in the computer).

From: Zwane Mwaikambo <zwane@arm.linux.org.uk>

This fixes an oops due to cpu_sibling_map being uninitialised when a system with no MP table (most UP boxen) boots a CONFIG_SCHED_SMT kernel. What also happens is that the cpu_group lists end up not being terminated properly, but this oops kills it first. Patch tested on UP without an MP table, 2x P2 and a UP Xeon with no siblings.

From: "Martin J. Bligh" <mbligh@aracnet.com>, Nick Piggin <piggin@cyberone.com.au>

Change arch_init_sched_domains to use cpu_online_map.

From: Anton Blanchard <anton@samba.org>

Fix the build with NR_CPUS > BITS_PER_LONG.
2004-05-09[PATCH] scheduler domain balancing improvementsAndrew Morton1-19/+36
From: Nick Piggin <piggin@cyberone.com.au>

This patch gets the sched_domain scheduler working better WRT balancing. It's been tested on the NUMAQ. Among other things, it changes the way SMT load calculation works so as not to trigger active load balancing when it shouldn't.

It still has a problem with SMT and NUMA: it will put a task on each sibling in a node before moving tasks to another node. It should probably start moving tasks after each *physical* CPU is filled. To fix that, you need to know "how much CPU power is in this domain?" At the moment we approximate # runqueues == CPU power, and hack around it at the physical-CPU domain level by counting all sibling runqueues as 1.

It isn't hard to work the CPU power out correctly, but once CPU hotplug is in the equation it becomes much more work, due to hotplug events. If anyone is actually interested in getting this fixed, that is.
2004-05-09[PATCH] sched_domain debuggingAndrew Morton1-0/+78
From: Nick Piggin <piggin@cyberone.com.au>

Anton was attempting to make a sched domain topology for his POWER5 and was having some trouble. This patch only includes code which is ifdefed out, but hopefully it will be of some use to implementors.