|
The following experimental patch implements token based thrashing
protection, using the algorithm described in:
http://www.cs.wm.edu/~sjiang/token.htm
When there are pageins going on, a task can grab a token, that protects the
task from pageout (except by itself) until it is no longer doing heavy
pageins, or until the maximum hold time of the token is over.
If the maximum hold time is exceeded, the task isn't eligible to hold the
token again for a while, since it wasn't doing it much good anyway.
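For illustration, a minimal sketch of the token idea (all names and the
timeout constant here are hypothetical; the actual patch may differ):
	/* one global token; pageout leaves the holder's pages alone */
	static struct mm_struct *swap_token_mm;
	static unsigned long swap_token_expiry;	/* jiffies */

	void grab_swap_token(struct mm_struct *mm)
	{
		/* free, or held past its maximum hold time? take it */
		if (!swap_token_mm || time_after(jiffies, swap_token_expiry)) {
			swap_token_mm = mm;
			swap_token_expiry = jiffies + SWAP_TOKEN_TIMEOUT;
		}
	}

	int has_swap_token(struct mm_struct *mm)
	{
		return mm == swap_token_mm;
	}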
I have run a very unscientific benchmark on my system to test the
effectiveness of the patch, timing how long a 230MB two-process qsbench run
takes with and without the token thrashing protection present.
normal 2.6.8-rc6: 6m45s
2.6.8-rc6 + token: 4m24s
This is a quick hack, implemented without having talked to the inventor of
the algorithm. He's copied on the mail and I suspect we'll be able to do
better than my quick implementation ...
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Patch from Paul for additional documentation of the API.
Updated based on feedback, and to apply to 2.6.8-rc3. I will be adding more
detailed documentation to the Documentation directory in a separate patch.
Signed-off-by: Paul McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Introduces call_rcu_bh() to be used when critical sections are mostly in
softirq context.
This patch introduces a new API - call_rcu_bh(). This is to be used for RCU
callbacks for which the critical sections are mostly in softirq context. These
callbacks consider completion of a softirq handler to be a quiescent state.
So, in order to make reader critical sections safe in process context,
rcu_read_lock_bh() and rcu_read_unlock_bh() must be used. Use of softirq
handler completion as a quiescent state speeds up RCU grace periods and
prevents too many callbacks from getting queued up in softirq-heavy workloads
like the network stack.
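A hedged usage sketch (the struct and function names below are invented for
illustration):
	struct foo {
		struct rcu_head rcu;
		/* ... */
	};

	static void foo_reclaim(struct rcu_head *head)
	{
		kfree(container_of(head, struct foo, rcu));
	}

	static void foo_read(void)
	{
		rcu_read_lock_bh();	/* process-context read side */
		/* ... dereference RCU-protected foo pointers ... */
		rcu_read_unlock_bh();
	}

	static void foo_delete(struct foo *old)
	{
		/* after unlinking 'old' from the data structure: */
		call_rcu_bh(&old->rcu, foo_reclaim);
	}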
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Avoids per_cpu calculations and also prepares for call_rcu_bh().
At OLS, Rusty had suggested getting rid of many per_cpu() calculations in RCU
code and making the code simpler. I had already done that for the rcu-softirq
patch earlier, so I am splitting that into two patches. This first patch cleans
up the macros and uses pointers to the rcu per-cpu data directly to manipulate
the callback queues. This is useful for the call-rcu-bh patch (to follow)
which introduces a new RCU mechanism - call_rcu_bh(). Both generic and
softirq RCU can then use the same code; they just work on different global
and per-cpu data.
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch makes RCU callbacks friendly to the scheduler. It helps low
latency by limiting the number of callbacks invoked per tasklet handler.
Since we cannot schedule during a single softirq handler, this significantly
reduces the size of the non-preemptible section, especially under heavy RCU
updates.
The limiting is done through a kernel parameter rcupdate.maxbatch which is
the maximum number of RCU callbacks to invoke during a single tasklet
handler.
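Roughly, the limiting works like this (a sketch with assumed names, not the
literal patch):
	static int maxbatch = 10;	/* the rcupdate.maxbatch parameter */

	static void rcu_do_batch(struct rcu_head *list)
	{
		int count = 0;

		while (list) {
			struct rcu_head *next = list->next;

			list->func(list);	/* invoke the callback */
			list = next;
			if (++count >= maxbatch)
				break;	/* reschedule the tasklet for the rest */
		}
	}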
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This fixes the RCU cpu offline code which was broken by singly-linked RCU
changes. Nathan pointed out the problems and submitted a patch for this.
This is an optimal fix - no need to iterate through the list of callbacks,
just use the tail pointers and attach the list from the dead cpu.
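The splice is O(1) with the tail pointers, along these lines (field names
assumed):
	/* attach the dead cpu's whole callback list at this cpu's tail */
	*this_rdp->tail = dead_rdp->list;
	if (dead_rdp->list)
		this_rdp->tail = dead_rdp->tail;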
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
There is a series of patches in my tree and these 3 are the first ones that
should probably be merged down the road. Descriptions are on top of the
patches. Please include them in -mm.
A lot of RCU code will be cleaned up later in order to support
call_rcu_bh(), the separate RCU interface that considers softirq handler
completion a quiescent state.
This patch:
Minor cleanup of the hotplug code to remove #ifdef in cpu event notifier
handler. If CONFIG_HOTPLUG_CPU is not defined, the CPU_DEAD case will be
optimized away.
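Schematically (a hedged sketch; the helper names are assumptions):
	static int rcu_cpu_notify(struct notifier_block *self,
				  unsigned long action, void *hcpu)
	{
		long cpu = (long)hcpu;

		switch (action) {
		case CPU_UP_PREPARE:
			rcu_online_cpu(cpu);
			break;
		case CPU_DEAD:
			/* an empty stub when CONFIG_HOTPLUG_CPU is off,
			 * so the compiler drops this case entirely */
			rcu_offline_cpu(cpu);
			break;
		default:
			break;
		}
		return NOTIFY_OK;
	}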
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
vma_prio_tree_insert() relies on the fact that the vma was
vma_prio_tree_init()'ed.
The content of vma->shared should be considered undefined until the vma is
inserted into i_mmap/i_mmap_nonlinear. It's better to do proper
initialization in vma_prio_tree_add/insert.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Rajesh Venkatasubramanian <vrajesh@umich.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add vprintk call. This lets us directly pass varargs stuff to the console
without using vsnprintf to an intermediate buffer.
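Typical use, sketched (the wrapper name is made up):
	int my_printk(const char *fmt, ...)
	{
		va_list args;
		int len;

		va_start(args, fmt);
		len = vprintk(fmt, args);	/* no intermediate buffer */
		va_end(args);
		return len;
	}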
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This issue was discussed on lkml and linux-ia64. The patch introduces
"getnstimeofday" and removes all the code scaling gettimeofday to
nanoseconds. It makes it possible for the posix-timer functions to return
higher accuracy.
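For example (a sketch; the new call fills in a struct timespec directly):
	struct timespec ts;

	getnstimeofday(&ts);
	/* ts.tv_sec/ts.tv_nsec now hold the time at nanosecond
	 * resolution - no scaling of a microsecond value needed */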
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I'm submitting two patches associated with moving cache_reap functionality
out of timer context. Note that these patches do not make any further
optimizations to cache_reap at this time.
The first patch adds a function similar to schedule_delayed_work to allow
work to be scheduled on another cpu.
The second patch makes use of schedule_delayed_work_on to schedule
cache_reap to run from keventd.
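Usage might look like this (hedged sketch; the per-cpu reap_work variable is
an assumption):
	/* run cache_reap from keventd on 'cpu', staggered by cpu number */
	schedule_delayed_work_on(cpu, &per_cpu(reap_work, cpu),
				 HZ + 3 * cpu);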
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Various people have reported deadlocks and it has always seemed a bit risky
to try to sync the filesystems at this stage anyway.
"I have seen panic failing two times lately on an SMP system. The box
panic'ed but was running happily on the other cpus. The culprit of this
failure is the fact that these panics have been caused by a block device
or a filesystem (e.g. using errors=panic). In these cases the likelihood
of a failure/hang of sys_sync() is high. This is exactly what happened in
both cases I have seen. Meanwhile the other cpus are happily continuing
destroying data, as the kernel has a severe problem but it's not aware of
that since smp_send_stop happens after sys_sync."
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Currently most driver events are not sent out when using initramfs as
driver_init() (which triggers the events) is called before init_workqueues.
This patch rearranges the init calls so that the hotplug event queue is
enabled prior to calling driver_init(), hence we're getting all hotplug
events again.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I made a patch for debugging with the help of an NMI trigger switch.
When the kernel hangs severely, keyboard operation (e.g. Ctrl-Alt-Del)
doesn't work properly. This patch enables debugging information
to be displayed on the console in this case.
I think this feature is necessary as standard functionality.
Please feel free to use this patch and let me know if you have
any comments.
Background:
When trouble occurs in the kernel, we usually begin to investigate
with the following information:
- panic >> panic message.
- oops >> CPU registers and stack trace.
- hang >> **NONE** no standard method established.
How it works:
Most IA32 servers have an NMI switch that raises an NMI interrupt.
The NMI interrupt can get through even when the kernel is in a serious
state, for example deadlocked with interrupts disabled.
When the NMI switch is pressed after this feature is activated,
CPU registers and a stack trace are displayed on the console and then
a panic occurs.
This feature is activated or deactivated with sysctl.
On the IA32 architecture, only the following are defined as reasons
for an NMI interrupt:
- memory parity error
- I/O check error
The reason code for the NMI switch is not defined, so this patch assumes
that all undefined NMI interrupts are fired by the NMI switch.
However, oprofile and the NMI watchdog also use undefined NMI interrupts.
Therefore this feature cannot be used at the same time as oprofile or the
NMI watchdog. This feature hands the NMI interrupt over to oprofile and
the NMI watchdog, so when they have been activated, this feature doesn't
work even if it is activated.
Supported architecture:
IA32
Setup:
Set up the system control parameter as follows:
# sysctl -w kernel.unknown_nmi_panic=1
kernel.unknown_nmi_panic = 1
If the NMI switch is pressed, CPU registers and a stack trace will
be displayed on the console and then a panic occurs.
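Inside the unknown-NMI path, the decision is roughly this (a sketch; the
function names are approximations):
	/* reason has neither the parity nor the I/O-check bit set */
	if (unknown_nmi_panic) {
		show_registers(regs);	/* CPU registers + stack trace */
		panic("NMI: unknown reason");
	}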
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Reading the contents of a module_param_string through sysfs currently
oopses because the param_get_charp() function cannot operate on a
kparam_string struct. This introduces the required param_get_string.
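The getter presumably mirrors the setter, something like (a sketch):
	int param_get_string(char *buffer, struct kernel_param *kp)
	{
		const struct kparam_string *kps = kp->arg;

		/* read via the kparam_string, not as a plain char pointer */
		return strlcpy(buffer, kps->string, kps->maxlen);
	}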
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
In the *ppos cleanups, proc_dol2crvec was updated, but the prototype
found at the top of kernel/sysctl.h was not, generating a warning. This
corrects the prototype to match the code.
(I'm gonna take a stab at moving these into arch/ppc shortly)
Signed-off-by: Tom Rini <trini@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Nobody ever fixed the big FIXME in sysctl - but we really need
to pass around the proper "loff_t *" to all the sysctl functions
if we want them to be well-behaved wrt the file pointer position.
This is all preparation for making direct f_pos accesses go
away.
|
|
There is a longstanding off-by-one error that results from an incorrect
comparison when checking whether a process has consumed CPU time in
excess of its RLIMIT_CPU limits.
This means, for example, that if we use setrlimit() to set the soft CPU
limit (rlim_cur) to 5 seconds and the hard limit (rlim_max) to 10 seconds,
then the process only receives a SIGXCPU signal after consuming 6 seconds
of CPU time, and, if it continues consuming CPU after handling that
signal, only receives SIGKILL after consuming 11 seconds of CPU time.
The fix is trivial.
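Conceptually (a sketch, not the literal diff):
	/* 'secs' is the CPU time the process has consumed, in seconds;
	 * the old code used '>', which fires one second too late */
	if (secs >= p->rlim[RLIMIT_CPU].rlim_cur)
		send_sig(SIGXCPU, p, 1);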
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Remove the unused symbol_is() macro.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
BSD accounting cross-platform compatibility is a new feature of 2.6.8 and
thus not crucial, but it'd be nice not to have kernels writing wrong file
formats out in the wild.
The endianness detection logic I wanted to propose for userspace turned out
to be bogus. So just do it the simple way and store endianness info
together with the version number.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The per-cpu schedule counters need to be summed up over all possible cpus.
When testing hotplug cpu remove I saw the sum over the online cpus of
nr_uninterruptible go negative, which made the load average go nuts.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
all sorts of minor stuff - basically, all chunks are independent here,
but IMO that one is not worth splitting. Contains:
* pmac_cpufreq.c: declaration in the middle of a block.
* sys_ia32.c: couple of trivial annotations.
* ipmi_si_intf.c: should be using asm/irq.h instead of linux/irq.h
* synclink_cs.c: assignment-in-conditional with nobody ever looking
at the variable we are assigning to afterwards; variable removed.
* sbni.c: s/__volatile/__volatile__
* matroxfb_base.h: got rid of ((u32 *)p)++
* asm-ppc/checksum.h and asm-sparc64/floppy.h: NULL noise removal
* amd64 compat.h: missing L in long constant.
* mtd-abi.h: annotated ioctl structure
* sysctl.c: corrected annotations in extern
Signed-off-by: Al Viro <viro@parcelfarce.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Don't assign to `retval' twice in a row.
Signed-off-by: Luiz Capitulino <lcapitulino@prefeitura.sp.gov.br>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix a bug in do_proc_doulongvec_minmax() where the string buffer was
too short to parse a 64-bit number expressed in decimal. That was causing
problems with entries in /proc/sys using long and allowing large numbers
(such as -1).
Signed-off-by: Stephane Eranian <eranian@hpl.hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
For clock_getres(clockid_t clock_id, struct timespec *res), the specification
says "If res is NULL, the clock resolution is not returned." So this kind of
call should succeed. The current implementation returns -EFAULT.
The patch fixes the bug in compat_clock_getres().
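Conceptually, the fix only copies the result out when the pointer is
non-NULL (a hedged sketch; the helper name is invented):
	long compat_sys_clock_getres(clockid_t which_clock,
				     struct compat_timespec __user *tp)
	{
		struct timespec ts;
		long err = clock_getres_common(which_clock, &ts); /* invented */

		/* POSIX: a NULL tp means "don't report the resolution" */
		if (!err && tp && put_compat_timespec(&ts, tp))
			return -EFAULT;
		return err;
	}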
Signed-off-by: Gordon Jin <gordon.jin@intel.com>
Signed-off-by: Arun Sharma <arun.sharma@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Here is a trivial patch that is required to boot the latest 2.6.7 tree
on the SGI 512p system.
Initialize the busy_factor in the sched_domain_init table. Otherwise,
booting hangs doing excessive load balance operations.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
rcu_offline_cpu and rcu_move_batch have been broken since the list_head's
in struct rcu_head and struct rcu_data were replaced with singly-linked
lists:
CC kernel/rcupdate.o
kernel/rcupdate.c: In function `rcu_move_batch':
kernel/rcupdate.c:222: warning: passing arg 2 of `list_add_tail' from
incompatible pointer type
kernel/rcupdate.c: In function `rcu_offline_cpu':
kernel/rcupdate.c:239: warning: passing arg 1 of `rcu_move_batch' from
incompatible pointer type
kernel/rcupdate.c:240: warning: passing arg 1 of `rcu_move_batch' from
incompatible pointer type
kernel/rcupdate.c:236: warning: label `unlock' defined but not used
The kernel crashes when you try to offline a cpu, not surprisingly.
It also looks like rcu_move_batch isn't preempt-safe, so I touched that up
and got rid of an unused label in rcu_offline_cpu.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This also fixes it for when the real parent is ignoring
SIGCHLD - noted by David Mosberger.
|
|
- missing ; between default: and } in sun4setup.c
- cast of pointer to unsigned long long instead of unsigned long in
x86_64 signal.c
- missed annotations for ioctl structure in sparc64 openpromio.h
(should've been in the same patch as the rest of drivers/sbus/*
annotations)
- 0->NULL in list.h and pmdisk.c
|
|
Extraction of int from pointer is slightly broken in several places.
|
|
ss_sp in struct sigaltstack made __user
->si_addr and ->sival_ptr made __user
your ->sa_restorer and ->sa_handler changes propagated
users of these guys annotated on i386/amd64/alpha/sparc/sparc64
|
|
This patch adds an architecture-specific callout after explicit
processor migrations. The callout allows architectures (or platforms)
to update TLB specific information (ex., cpu_vm_mask).
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: David Mosberger <davidm@hpl.hp.com>
|
|
The patch below (already ACK'ed by Randy Dunlap) kills the unused
IKCONFIG_VERSION from kernel/configs.c.
This patch is based on a previous patch by Anton Blanchard and an idea of
Bartlomiej Zolnierkiewicz. (I hope I haven't forgotten anyone who contributed
to this patch. ;-) )
Signed-off-by: Adrian Bunk <bunk@fs.tum.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
Move the memory policy freeing to later in exit to make sure the last
memory allocations don't use an uninitialized policy.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
gcc 3.5 is warning about unused static variables, add __attribute_unused__
to the 2 places to silence it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This fixes compilation on ppc32.
The power/smp.o file should be linked only if both SMP and SWSUSPEND
are configured in. It used to be linked even without SWSUSPEND.
|
|
This fixes the remaining 0 to NULL things that were found with 'make
allmodconfig' and 'make C=1 vmlinux'.
|
|
Attached is a smallish patch for a couple of trivial sparse warnings in
an allnoconfig build and, more importantly, an "excuses" text file explaining
why the rest have not been fixed.
Basically all of them (with the exception of the one in Andrew's tree) need
some serious re-engineering.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
CHECK kernel/power/swsusp.c
kernel/power/swsusp.c:320:15: warning: expected lvalue for member dereference
kernel/power/swsusp.c:337:15: warning: expected lvalue for member dereference
kernel/power/swsusp.c:359:14: warning: expected lvalue for member dereference
kernel/power/swsusp.c:925:12: warning: assignment expression in conditional
[...]
CHECK kernel/power/pmdisk.c
kernel/power/pmdisk.c:795:12: warning: assignment expression in conditional
Trivial sparse fixes for two files under kernel/power. Patch attached.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
kernel/kallsyms.c
CHECK kernel/kallsyms.c
kernel/kallsyms.c:136:7: warning: bad constant expression
kernel/kallsyms.c:136:7: warning: bad constant expression
kernel/kallsyms.c:136:7: warning: bad constant expression
kernel/kallsyms.c:143:22: warning: bad constant expression
kernel/kallsyms.c:143:22: warning: bad constant expression
kernel/kallsyms.c:143:22: warning: bad constant expression
Now the cause of the sparse warnings is that sparse does not handle runtime
array dimensioning (which I take to be a sparse problem), but in this
particular case it _might_ make sense to change the runtime allocation to
compile time, as the upper size of the array is known: the code in
kernel/kallsyms.c clearly uses 127 (or 128) as a "magic constant" for the
kernel symbol (array) length, and on the other hand in include/linux/module.h
there is: #define MODULE_NAME_LEN (64 - sizeof(unsigned long))
The only concern is that the array becomes quite big (the original comment
about it being "pretty small" no longer applies ...). One way to help that
would be to use buffer[] also in place of namebuf[], but that would be a
little tricky as the format string should come before the symbol name ...
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
As required by the standard, this patch adds to POSIX ABSOLUTE timers the
functionality of adjusting the timer when the clock is set, so that it still
expires at the specified time (provided that time has not passed, in which
case the timer expires immediately).
The standard is, IMNSOHO, a bit vague on just how repeating timers are to
be handled so I made some choices:
1) If an absolute timer is to expire every N intervals, we assume that
the expiries should happen at those specified times after clock setting.
I.e. we adjust the repeat timer as well as the initial timer. (The
other option would be to treat the repeating timers as relative and not
to adjust them.)
2) If a clock set moves the clock prior to the initial expiry time
AND that time has already passed and been signaled, the current repeat
timer is adjusted, i.e. we DO NOT go back to the initial time and
repeat that. (The other option is to treat this case as a new request
with the initial timer parameters (which by this time we have lost).)
3) If time is advanced such that it appears that several expiries have
been missed, the overrun count will reflect the misses. (The other
option is to not reflect this in the overrun.) At the same time, nothing
is done to acknowledge, to the user, that we are repeating expiries when
the clock is retarded.
Signed-off-by: George Anzinger <george@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It fixes levels for calling the driver model, puts devices to sleep before
powering down (so that emergency parking does not happen), and actually
introduces SMP support, but it's disabled for now. Plus no one should try to
freeze_processes() when that's not implemented; we now BUG() -- we do not
want Heisenbugs.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
'strace' shows a problem with a missing release_task for self-reaping
clones that have been traced. We need to defer releasing them until the
tracer is done with them, but if the tracer dies, we need to handle that
case gracefully too.
We do that by having 'forget_original_parent()' generate a list of tasks
to release when this case happens.
Patch based on discussions on linux-kernel, and suggestions from Roland
McGrath <roland@redhat.com>.
|
|
|
|
Add console_stop() and console_start() methods so the serial drivers
can disable console output before suspending a port, and re-enable output
afterwards.
We also add locking to ensure that we synchronise with any in-progress
printk.
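A driver's suspend path would then look roughly like this (sketch):
	static int my_uart_suspend(struct uart_port *port)
	{
		if (uart_console(port))
			console_stop(port->cons);	/* quiesce console output */
		/* ... power down the port; the resume path calls
		 * console_start(port->cons) after re-powering it ... */
		return 0;
	}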
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
[This patch series has also been separately sent to the architecture
maintainers]
Add console_device() to return the console tty driver structure and the
index. Acquire the console lock while scanning the list of console drivers
to protect us against console driver list manipulations.
Signed-off-by: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I don't think we're in K&R any more, Toto.
If you want a NULL pointer, use NULL. Don't use an integer.
Most of the users really didn't seem to know the proper type.
|
|
store_stackinfo() does an unlocked module list walk during normal runtime
which opens up a race with the module load/unload code. This can be
triggered by simply unloading and loading a module in a loop with
CONFIG_DEBUG_PAGEALLOC resulting in store_stackinfo() tripping over bad
list pointers.
kernel_text_address doesn't take any locks, because during an OOPS we don't
want to deadlock. Rename that to __kernel_text_address, and make
kernel_text_address take the lock.
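The split presumably looks like this (a sketch; the lock name and the
lookup helper are assumptions):
	/* lockless variant: safe from oops/backtrace context */
	int __kernel_text_address(unsigned long addr)
	{
		return text_address_lookup(addr);	/* invented helper */
	}

	/* locked variant for normal runtime, e.g. store_stackinfo() */
	int kernel_text_address(unsigned long addr)
	{
		unsigned long flags;
		int ret;

		spin_lock_irqsave(&modlist_lock, flags);
		ret = __kernel_text_address(addr);
		spin_unlock_irqrestore(&modlist_lock, flags);
		return ret;
	}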
Signed-off-by: Zwane Mwaikambo <zwane@fsmlabs.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (modified)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Andy Whitcroft <apw@shadowen.org>
Being able to recover the configuration from a kernel is very useful and it
would be nice to default this option to Yes. Currently, to have the config
available both from the image (using extract-ikconfig) and via /proc we
keep two copies of the original .config in the kernel. One in plain text
and one gzip compressed. This is not optimal.
This patch removes the plain text version of the configuration and updates
the extraction tools to locate and use the gzip'd version of the file.
This has the added bonus of providing us with the exact same results in
both cases, the original .config; including the comments.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Pavel Machek <pavel@ucw.cz>
It's a very bad idea to freeze migration threads, as it crashes the machine
upon the next call to "schedule()". In the refrigerator, I had one
"wake_up_process()" too many. This fixes it.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Some people want the dentry and inode caches shrunk harder, others want them
shrunk more reluctantly.
The patch adds /proc/sys/vm/vfs_cache_pressure, which tunes the vfs cache
versus pagecache scanning pressure.
- at vfs_cache_pressure=0 we don't shrink dcache and icache at all.
- at vfs_cache_pressure=100 there is no change in behaviour.
- at vfs_cache_pressure > 100 we reclaim dentries and inodes harder.
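The knob presumably scales the shrinker's idea of how many entries are
freeable, roughly (a sketch; names approximate):
	/* 100 == unchanged, 0 == never shrink, >100 == shrink harder */
	freeable = (dentry_stat.nr_unused * sysctl_vfs_cache_pressure) / 100;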
The number of megabytes of slab left after a slocate.cron on my 256MB test
box:
vfs_cache_pressure=100000 33480
vfs_cache_pressure=10000 61996
vfs_cache_pressure=1000 104056
vfs_cache_pressure=200 166340
vfs_cache_pressure=100 190200
vfs_cache_pressure=50 206168
Of course, this just left more directory and inode pagecache behind instead of
vfs cache. Interestingly, on this machine the entire slocate run fits into
pagecache, but not into VFS caches.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
Paul Jackson's cpumask tour-de-force allows us to get rid of those stupid
temporaries which we used to hold CPU_MASK_ALL to hand them to functions.
This used to break NR_CPUS > BITS_PER_LONG.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Paul Jackson <pj@sgi.com>
Make use of for_each_cpu_mask() macro to simplify and optimize a couple of
sparc64 per-CPU loops.
Optimize a bit of cpumask code for asm-i386/mach-es7000
Convert physids_complement() to use both args in the files
include/asm-i386/mpspec.h, include/asm-x86_64/mpspec.h.
Remove cpumask hack from asm-x86_64/topology.h routine pcibus_to_cpumask().
Clarify and slightly optimize several cpumask manipulations in kernel/sched.c
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Paul Jackson <pj@sgi.com>
Major rewrite of cpumask to use a single implementation, as a struct-wrapped
bitmap.
This patch leaves some 26 include/asm-*/cpumask*.h header files orphaned - to
be removed next patch.
Some nine cpumask macros for const variants and to coerce and promote between
an unsigned long and a cpumask are obsolete. Simple emulation wrappers are
provided in this patch for these obsolete macros, which can be removed once
each of the 3 archs (i386, ppc64, x86_64) using them are recoded in follow-on
patches to not need them.
The CPU_MASK_ALL macro now avoids leaving possible garbage one bits in any
unused portion of the high word.
An improved comment lists all available operators, for convenient browsing.
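The core of the rewrite is the struct wrapper, along these lines (a sketch,
not the full header):
	typedef struct { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

	static inline void __cpu_set(int cpu, volatile cpumask_t *dstp)
	{
		set_bit(cpu, dstp->bits);
	}
	#define cpu_set(cpu, dst) __cpu_set((cpu), &(dst))

	static inline int __cpu_isset(int cpu, const cpumask_t *addr)
	{
		return test_bit(cpu, addr->bits);
	}
	#define cpu_isset(cpu, mask) __cpu_isset((cpu), &(mask))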
From: Mikael Pettersson <mikpe@csd.uu.se>
2.6.7-rc3-mm1 changed CPU_MASK_NONE into something that isn't a valid
rvalue (it only works inside struct initializers). This caused compile-time
errors in perfctr in UP x86 builds.
From: Arnd Bergmann <arnd@arndb.de>
cpumask-5-10-rewrite-cpumaskh-single-bitmap-based from 2.6.7-rc3-mm1
causes include2/asm/smp.h:54:1: warning: "cpu_online" redefined
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Paul Jackson <pj@sgi.com>
This patch makes cpu_present_map a real map for all configurations, instead of
a constant for non-SMP. It also moves the definition of cpu_present_map out
of kernel/cpu.c into kernel/sched.c, because cpu.c isn't compiled into non-SMP
kernels.
The pattern is that each of the possible, present and online cpu maps are
actual kernel global cpumask_t variables, for all configurations. They are
documented in include/linux/cpumask.h. Some of the UP (NR_CPUS=1) code
cheats, and hardcodes the assumption that the single bit position of these
maps is always set, as an optimization.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Dipankar Sarma <dipankar@in.ibm.com>
This patch changes the call_rcu() API and avoids passing an argument to the
callback function as suggested by Rusty. Instead, it is assumed that the
user has embedded the rcu head into a structure that is useful in the
callback and the rcu_head pointer is passed to the callback. The callback
can use container_of() to get the pointer to its structure and work with
it. Together with the rcu-singly-link patch, it reduces the rcu_head size
by 50%. Considering that we use these in things like struct dentry and
struct dst_entry, this is a good saving in space.
An example:
	struct my_struct {
		struct rcu_head rcu;
		int x;
		int y;
	};

	void my_rcu_callback(struct rcu_head *head)
	{
		struct my_struct *p = container_of(head, struct my_struct, rcu);
		kfree(p);
	}

	void my_delete(struct my_struct *p)
	{
		...
		call_rcu(&p->rcu, my_rcu_callback);
		...
	}
Signed-Off-By: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Dipankar Sarma <dipankar@in.ibm.com>
This reduces the RCU head size by using a singly linked list to maintain them.
The ordering of the callbacks is still maintained as before by using a tail
pointer for the next list.
Signed-Off-By : Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Step three for reducing cacheline thrashing within rcupdate.c:
Cleanup and code move from <linux/rcupdate.h> to kernel/rcupdate.c: Remove
internal details from the header file.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Step two for reducing cacheline thrashing within rcupdate.c:
rcu_process_callbacks always acquires rcu_ctrlblk.state.mutex and calls
rcu_start_batch, even if the batch is already running or already scheduled to
run.
This can be avoided with a sequence lock: a sequence lock allows reading the
current batch number and next_pending atomically. If next_pending is already
set, then there is no need to acquire the global mutex.
This means that for each grace period, there will be
- one write access to the rcu_ctrlblk.batch cacheline
- lots of read accesses to rcu_ctrlblk.batch (3-10*cpus_online()). Behavior
similar to the jiffies cacheline, shouldn't be a problem.
- cpus_online()+1 write accesses to rcu_ctrlblk.state, all of them starting
with spin_lock(&rcu_ctrlblk.state.mutex).
For large enough cpus_online() this will be a problem, but all except two
of the spin_lock calls only protect the rcu_cpu_mask bitmap, so a
hierarchical bitmap would allow the write accesses to be split across
multiple cachelines.
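The read side is then the usual seqlock pattern (a sketch; the field names
are assumptions):
	unsigned seq;
	int next_pending;
	long batch;

	do {
		seq = read_seqbegin(&rcu_ctrlblk.lock);
		next_pending = rcu_ctrlblk.next_pending;
		batch = rcu_ctrlblk.cur;
	} while (read_seqretry(&rcu_ctrlblk.lock, seq));

	if (!next_pending) {
		/* slow path: take state.mutex and start the next batch */
	}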
Tested on an 8-way with reaim. Unfortunately it probably won't help with Jack
Steiner's 'ls' test since in this test only one cpu generates rcu entries.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Below is one of my patches from my rcu lock update. Jack Steiner tested
the first one on a 512p and it resolved the rcu cacheline thrashing. All were
tested on osdl with STP.
Step one for reducing cacheline thrashing within rcupdate.c:
The current code uses the rcu_cpu_mask bitmap both for keeping track of the
cpus that haven't gone through a quiescent state and for checking if a cpu
should look for quiescent states. The bitmap is frequently changed and the
check is done by polling - together this causes cacheline thrashing.
If it's cheaper to access a (mostly) read-only cacheline than a cacheline that
is frequently dirtied, then it's possible to reduce the thrashing by splitting
the rcu_cpu_mask bitmap into two cachelines:
The patch adds a generation counter and moves it into a separate cacheline.
This allows the removal of all accesses to rcu_cpumask (in the read-write
cacheline) from rcu_pending and at least 50% of the accesses from
rcu_check_quiescent_state. rcu_pending and all but one call per cpu to
rcu_check_quiescent_state access the read-only cacheline. Probably not enough
for 512p, but it's a start, just for 128 byte more memory use, without slowing
down rcu grace periods. Obviously the read-only cacheline is not really
read-only: it's written once per grace period to indicate that a new grace
period is running.
Tests on an 8-way Pentium III with reaim showed some improvement:
oprofile hits:
Reference: http://khack.osdl.org/stp/293075/
Hits %
23741 0.0994 rcu_pending
19057 0.0798 rcu_check_quiescent_state
6530 0.0273 rcu_check_callbacks
Patched: http://khack.osdl.org/stp/293076/
8291 0.0579 rcu_pending
5475 0.0382 rcu_check_quiescent_state
3604 0.0252 rcu_check_callbacks
The total runtime differs between the two runs, thus the % numbers must
be compared: around 50% faster. I've uninlined rcu_pending for the
test.
Tested with reaim and kernbench.
Description:
- per-cpu quiescbatch and qs_pending fields introduced: quiescbatch contains
the number of the last quiescent period that the cpu has seen and qs_pending
is set if the cpu has not yet reported the quiescent state for the current
period. With these two fields a cpu can test if it should report a
quiescent state without having to look at the frequently written
rcu_cpu_mask bitmap.
- curbatch split into two fields: rcu_ctrlblk.batch.completed and
rcu_ctrlblk.batch.cur. This makes it possible to figure out if a grace
period is running (completed != cur) without accessing the rcu_cpu_mask
bitmap.
- rcu_ctrlblk.maxbatch removed and replaced with a true/false next_pending
flag: next_pending=1 means that another grace period should be started
immediately after the end of the current period. Previously, this was
achieved by maxbatch: curbatch==maxbatch means don't start, curbatch!=
maxbatch means start. A flag improves the readability: The only possible
values for maxbatch were curbatch and curbatch+1.
- rcu_ctrlblk split into two cachelines for better performance.
- common code from rcu_offline_cpu and rcu_check_quiescent_state merged into
cpu_quiet.
- rcu_offline_cpu: replace spin_lock_irq with spin_lock_bh, there are no
accesses from irq context (and there are accesses to the spinlock with
enabled interrupts from tasklet context).
- rcu_restart_cpu introduced, s390 should call it after changing nohz:
Theoretically the global batch counter could wrap around and end up at
RCU_quiescbatch(cpu). Then the cpu would not look for a quiescent state and
rcu would lock up.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
When using a separate output directory, the in-kernel config was rebuilt
each time the kernel was compiled. Fix this by specifying the correct path
to the Makefile in the prerequisites of the ikconfig.h file.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
Add __user annotation for !CONFIG_MODULE_UNLOAD case.
From: Mika Kukkonen <mika@osdl.org>
Signed-off-by: Randy Dunlap <rddunlap@osdl.org>
|
|
Remove unused queued_signals global accounting.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a user_struct pointer to the sigqueue structure. Charge sigqueue
allocation and destruction to the user_struct rather than a global pool. This
per user rlimit accounting obsoletes the global queued_signals accouting.
As it stands, the patch charges the sigqueue struct allocation to the queue
that it's pending on (the receiver of the signal). So the owner of the queue
is charged for whoever writes to it (much like quota for a 777 file).
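In code terms the idea is roughly this (a sketch; get_uid()/free_uid() pin
and release the user_struct):
	struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);

	if (q) {
		/* charge the receiver's user and pin it for the free path,
		 * where __sigqueue_free() does the dec and free_uid() */
		q->user = get_uid(t->user);
		atomic_inc(&q->user->sigpending);
	}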
The patch started out charging the task which allocated the sigqueue struct.
In most cases, these are always the same user (permission for sending a
signal), so those cases are moot. In the cases where it isn't the same user,
it's a privileged user sending a signal to another user.
It seems wrong to charge the allocation to the privileged user, when the
other user could block receipt as long as it feels. The flipside is, someone
can fill your queue (expectation is that someone else is privileged). I think
it's right the way it is. The change to revert is very small.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Update the send_signal() API to allow passing the task receiving the signal. This
is necessary to ensure signals generated out of process context can be charged
to the correct user.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
Several scheduler macros only read from the task struct, mark them const.
It may help the compiler generate better code.
Signed-off-by: Keith Owens <kaos@ocs.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Distros have started to ship kernels with this patch, as it seems that some
unnamed binary module authors are already abusing this function (as well as
some open source modules, like the openib code.) I could not find any valid
reason why this symbol should be exported, so here's a patch against 2.6.7
that removes it.
Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It can be replaced by a simple memcpy.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Here's the patch that removes the memset calls from both pmdisk and swsusp.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix a couple of memory leaks in the pmdisk driver.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This fixes 2 memory leaks in swsusp: during pagedir relocation, eaten pages
were not properly freed in the error path, and even the regular freeing path
was freeing one page too few.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
BSD accounting format rework:
Use all explicit and implicit padding in struct acct to
- correctly report 32 bit uid/gid,
- correctly report jobs (e.g., daemons) running longer than 497 days,
- increase the precision of ac_etime from 2^-13 to 2^-20
(i.e., from ~6 hours to ~1 min. after a year)
- store the current AHZ value.
- allow cross-platform processing of the accounting file
(limited for m68k which has a different size struct acct).
- introduce versioning for smooth transition to incompatible formats in
the future. Currently the following version numbers are defined:
0: old format (until 2.6.7) with 16 bit uid/gid
1: extended variant (binary compatible to v0 on M68K)
2: extended variant (binary compatible to v0 on everything except M68K)
3: a new binary incompatible format (64 bytes)
4: new binary incompatible format (128 bytes).
layout of its first 64 bytes is the same as for v3.
5: marks second half of new binary incompatible format (128 bytes)
(layout is not yet defined)
All this is accomplished without breaking binary compatibility. 32 bit
uid/gid support is compatible with the patch previously floating around and
used e.g. by Red Hat.
This patch also introduces a config option for a new, binary incompatible
"version 3" format that
- is uniform across and properly aligned on all platforms
- stores pid and ppid
- uses AHZ==100 on all platforms (allows reporting of longer times)
Much of the compatibility glue goes away when v1/v2 support is removed from
the kernel. Such a patch is at
http://www.physik3.uni-rostock.de/tim/kernel/2.7/acct-cleanup-04.patch
and might be applied in the 2.7 timeframe.
The new v3 format is source compatible with current GNU acct tools (6.3.5).
However, current GNU acct tools can be compiled for only one format. As there
is no way to pass the kernel configuration to userspace, with my patch it will
still only support the old v2 format. Only if v1/v2 support is removed from
the kernel, recompiling GNU acct tools will yield v3 support.
A preliminary take at the corresponding work on cross-platform userspace tools
(GNU acct package) is at
http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/
This version of the package is able to read any of the v0/v2/v3 formats,
regardless of byte-order (untested), even within the same file.
Cross-platform compatibility with m68k (v1 format) is not yet implemented, but
native use on m68k should work (untested). pid and ppid are currently only
shown by the dump-acct utility.
Thanks to Arthur Corliss, Albert Cahalan and Ragnar Kjørstad for their
comments, and to Albert Cahalan for the u64->IEEE float conversion code.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
sys_getgroups16 (or rather groups16_to_user()) returns large gids
truncated. Needs to be fixed, one way or another. Don't know why the
other similar casts are still there.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add mq_bytes field to user_struct, and make sure it's properly initialized.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add sigpending field to user_struct, and make sure it's properly initialized.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
* On a 32-bit architecture, the idr code will cease to work if you add
more than 2^20 entries. You will not be able to find many of the
entries. The problem is that the IDR code uses 5-bit chunks of the
number and the lower portion used by IDR is 24 bits, so you have one bit
that leaks over into the comparisons that should not be there. The
solution is to mask off that bit before doing IDR processing (see the
sketch after this list). This bug actually causes the POSIX timer code to
crash if you create that many timers. I have included an idr_test.tar.gz
file that demonstrates this with and without the fix, in case you need
more evidence :).
* When the IDR fills up, it returns -1. However, there was no way to
check for this condition. This patch adds the ability to check for the
idr being full and fixes all the users. It also fixes a problem in
fs/super.c where the idr code wasn't checking for -1.
* There was a race condition in creating POSIX timers. The timer was added
to a task struct for another process, then the data for the timer was
filled out. The other task could use/destroy the timer as soon as it was
in the task's queue and the lock was released. This moves setting up the
timer data to before the timer is enqueued or (for some data) into the
lock.
* Change things so that the caller doesn't need to run idr_full() to find
out the reason for an idr_get_new() failure.
Just return -ENOSPC if the tree was full, or -EAGAIN if the caller needs
to re-run idr_pre_get() and try again.
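Sketches of the two visible changes (constant and label names assumed):
	/* in the idr lookup/removal paths: strip the stray high bit */
	id &= MAX_ID_MASK;

	/* the new error convention for creation */
retry:
	if (!idr_pre_get(&my_idr, GFP_KERNEL))
		return -ENOMEM;		/* no memory for new layers */
	err = idr_get_new(&my_idr, ptr, &id);
	if (err == -EAGAIN)
		goto retry;		/* lost a race, pre-get again */
	if (err == -ENOSPC)
		return err;		/* the tree is genuinely full */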
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch cleans up needless includes of asm/pgalloc.h from the fs/
kernel/ and mm/ subtrees. Compile tested on multiple ARM platforms, and
x86, this patch appears safe.
This patch is part of a larger patch aiming towards getting the include of
asm/pgtable.h out of linux/mm.h, so that asm/pgtable.h can sanely get at
things like mm_struct and friends.
I suggest testing in -mm for a while to ensure there aren't any hidden arch
issues.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I noticed that insert_resource() incorrectly handles the case of an
existing parent resource with the same ending address as a newly added
child. This results in incorrect nesting, like the following:
# cat /proc/ioports
<snip>
002f0000-002fffff : PCI Bus #48
00200000-002fffff : /pci@800000020000003
</snip>
Signed-off-by: John Rose <johnrose@austin.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch results in too much idle time under certain
loads, and while that is being looked into we're better
off just reverting the change.
Cset exclude: nickpiggin@yahoo.com.au[torvalds]|ChangeSet|20040605175839|02419
|
|
From: Hugh Dickins <hugh@veritas.com>
Oleg's patch was good in that exit_mmap usually does the un-accounting; but
dup_mmap still needs its own un-accounting for the case when it has charged
for a vma but hits an error before the vma is inserted into the child mm's
list.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
As Roland McGrath <roland@redhat.com> points out, we need to zero
task->it_virt_value to prevent timer-based signal delivery, not
->it_virt_incr.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Andy Whitcroft <apw@shadowen.org>
Both modprobe_path and hotplug_path are arbitrarily sized at 256 bytes and
that size is also expressed directly in the sysctl code. It seems
reasonable to define a standard length and use that for consistency. This
patch introduces the constant KMOD_PATH_LEN and uses that.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
dup_mmap() unnecessarily tries to account for memory of the vma's it has
created if it fails in the middle.
However, that's pointless (and wrong), since the exit_mmap() path called
through mmput() will do so anyway in the failure path.
Just remove the bogus un-accounting code.
|
|
Add __user annotations to kernel/sysctl.c to satisfy sparse
for !CONFIG_SYSCTL, !CONFIG_PROC_FS.
Signed-off-by: Randy Dunlap <rddunlap@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Andy Whitcroft <apw@shadowen.org>
The sysctl interfaces for updating the uts entries such as hostname and
domainname are using the wrong length for these buffers; they are hard
coded to 64. Although safe, this artifically limits the size of these
fields to one less than the true maximum. This generates an inconsistency
between the various methods of update for these fields.
# hostname 12345678901234567890123456789012345678901234567890123456789012345
hostname: name too long
# hostname 1234567890123456789012345678901234567890123456789012345678901234
# hostname
1234567890123456789012345678901234567890123456789012345678901234
# sysctl -w kernel.hostname=1234567890123456789012345678901234567890123456789012345678901234567890
kernel.hostname = 1234567890123456789012345678901234567890123456789012345678901234567890
# hostname
123456789012345678901234567890123456789012345678901234567890123
#
The error originates from the fact that the handler for strings
(proc_dostring) already allows for the string terminator. This patch
corrects the limit, taking the opportunity to convert to use of sizeof().
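The uts entries then presumably read something like this (a sketch of a
ctl_table entry; the handler in the real tree may be a uts-specific wrapper
around proc_dostring):
	{
		.ctl_name	= KERN_NODENAME,
		.procname	= "hostname",
		.data		= system_utsname.nodename,
		.maxlen		= sizeof(system_utsname.nodename),
		.mode		= 0644,
		.proc_handler	= &proc_dostring,
	},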
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: "Anil" <anil.s.keshavamurthy@intel.com>
We don't need lock_cpu_hotplug()/unlock_cpu_hotplug for singlethreaded
workqueues.
Signed-off-by: Anil Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Recent syscall stubs cleanup broke alpha, as it has its own version of
sys_rt_sigaction(). This defines __ARCH_WANT_SYS_RT_SIGACTION for all
architectures except alpha, sparc and sparc64.
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: "Anil" <anil.s.keshavamurthy@intel.com>
In flush_workqueue() for the single-threaded workqueue case, the code
flushes the same cpu_workqueue_struct for each online cpu.
Change things so that we only perform the flush once in this case.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The scheduler changes had another thing missing: the appreciation of
sync wakeups. (I had this in one of the earlier sched-domains cleanup
patches before but it got lost in the shuffle.)
When a sync waker is waking, we should subtract its load from the
current load - it will schedule away for sure in the near future.
That's what the "sync" bit means.
This change is necessary because with the sched-domains balancer we have
a much more sensitive cpu-load estimator, and in this particular context
of try_to_wake_up() the sync waker's effect will always be part of the
load. Patch against your patch attached.
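In try_to_wake_up() this amounts to a couple of lines (sketch):
	if (sync)
		/* the waker is about to sleep: discount its own load */
		this_load -= SCHED_LOAD_SCALE;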
In my testing there's an additional increase in bw_pipe numbers on a
dual P2 box, it went from 110-120 MB/sec to 120-130 MB/sec.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
into ppc970.osdl.org:/home/torvalds/v2.6/linux
|
|
David Mosberger noticed bw_pipe was way down on sched-domains kernels on
SMP systems.
That is due to two things: first, the previous wake-affine logic would
*always* move a pipe wakee onto the waker's CPU. With the scheduler
rework, this was toned down a lot (but extended to all types of wakeups).
One of the ways this was damped was with the logic: don't move the wakee if
its CPU is relatively idle compared to the waker's CPU. Without this, some
workloads would pile everything up onto a few CPUs and get lots of idle
time.
However, the fix was a bit of a blunt hack: if the wakee runqueue was below
50% busy, and the waker's was above 50% busy, we wouldn't do the move. I
think a better way to capture it is what this patch does: if the wakee
runqueue is below 100% busy, and the sum of the two runqueue's loads is
above 100% busy, and the wakee runqueue is less busy than the waker
runqueue (ie. CPU utilisation would drop if we do the move), then we don't
do the move.
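Expressed as a condition (a sketch, treating SCHED_LOAD_SCALE as '100%
busy'; the variable names are invented):
	/* don't pull the wakee over if CPU utilisation would drop */
	if (wakee_load < SCHED_LOAD_SCALE &&
	    wakee_load + waker_load > SCHED_LOAD_SCALE &&
	    wakee_load < waker_load)
		goto leave_wakee_alone;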
After I fixed this, I found things were still getting bounced around quite
a bit. The reason is that we were attempting very aggressive idle
balancing in order to cut down idle time in a dbt2-pgsql workload, which is
particularly sensitive to idle.
After having Mark Wong (markw@osdl.org) retest this load with this patch,
it looks like we don't need to be so aggressive. I'm glad to be rid of
this because it never sat too well with me. We should see slightly lower
cost of schedule and slightly improved cache impact with this change too.
Mark said:
---
This looks pretty good:
metric kernel
2334 2.6.7-rc2
2298 2.6.7-rc2-mm2
2329 2.6.7-rc2-mm2-sched-more-wakeaffine
---
ie. within the noise.
David said:
---
Oooh, me likeee!
Host OS Pipe AF
UNIX
--------- ------------- ---- ----
caldera.h Linux 2.6.6 3424 2057 (plain 2.6.6)
caldera.h Linux 2.6.7-r 333. 1402 (original 2.6.7-rc1)
caldera.h Linux 2.6.7-r 3086 4301 (2.6.7-rc1 with your patch)
Pipe-bandwidth is still down about 10% but that may be due to
unrelated changes (or perhaps warmup effects?). The AF UNIX bandwidth
is just mindboggling. Moreover, with your patch 2.6.7-rc1 shows
better context-switch times and lower communication latencies (more
like the numbers you're getting on UP).
So it seems like the overall balance of keeping things on the same CPU
vs. distributing them across CPUs is improved.
---
I also ran some tests on the NUMAQ. kernbench, dbench, hackbench, reaim
were much the same. tbench was improved, very much so when clients < NR_CPU.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
So here I am trying to write about how one can apply gdb to a running
kernel, and I'd like to tell people how to debug loadable modules. Only
with the 2.6 module loader, there's no way to find out where the various
sections in the module image ended up, so you can't do much. This patch
attempts to fix that by adding a "sections" subdirectory to every module's
entry in /sys/module; each attribute in that directory associates a
beginning address with the section name. Those attributes can be used by
a simple script to generate an add-symbol-file command for gdb, something
like:
#!/bin/bash
#
# gdbline module image
#
# Outputs an add-symbol-file line suitable for pasting into gdb to examine
# a loaded module.
#
cd /sys/module/$1/sections
echo -n add-symbol-file $2 `/bin/cat .text`
for section in .[a-z]* *; do
if [ $section != ".text" ]; then
echo " \\"
echo -n " -s" $section `/bin/cat $section`
fi
done
echo
Currently, this feature is absent if CONFIG_KALLSYMS is not set. I do
wonder if CONFIG_DEBUG_INFO might not be a better choice, now that I think
about it. Section names are unmunged, so "ls -a" is needed to see most of
them.
Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
|
|
Big comment, because it wasn't clear why this cast was valid.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
From: David Mosberger <davidm@napali.hpl.hp.com>
Darrene Williams <dsw@gelato.unsw.edu.au> noticed that the #endif for
__ARCH_WANT_SYS_SIGPROCMASK was off by one routine.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
into ppc970.osdl.org:/home/torvalds/v2.6/linux
|
|
Fix a race identified by Jeremy Kerr <jeremy@redfishsoftware.com.au>: if
update_process_times() decides to deliver a signal due to process timer
expiry, it can race with __exit_sighand()'s freeing of task->sighand.
Fix that by clearing the per-process timer state in exit_notify(), while under
local_irq_disable() and under tasklist_lock. tasklist_lock provides exclusion
wrt release_task()'s freeing of task->sighand and local_irq_disable() provides
exclusion wrt update_process_times()'s inspection of the per-process timer
state.
We also need to deal with the send_sig() calls in do_process_times() by
setting rlim_cur to RLIM_INFINITY.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
1) Make syscall entry zero-extend all arguments.
2) Sign extend those needed in sys32.S
3) Kill the A() AA() macros, replace with compat_ptr() et al.
|
|
Ingo explains:
The condition is 'impossible', but the whole balancing code is
(intentionally) a bit racy:
	cpus_and(tmp, group->cpumask, cpu_online_map);
	if (!cpus_weight(tmp))
		goto next_group;
	for_each_cpu_mask(i, tmp) {
		if (!idle_cpu(i))
			goto next_group;
		push_cpu = i;
	}
	rq = cpu_rq(push_cpu);
	double_lock_balance(busiest, rq);
	move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);
in the for_each_cpu_mask() loop we specifically check for each CPU in
the target group to be idle - so push_cpu's runqueue == busiest [==
current runqueue] cannot be true because the current CPU is not idle, we
are running in the migration thread ... But this is not a real problem,
load-balancing we do in a racy way to reduce overhead [and it's all
statistics anyway so absolute accuracy is impossible], and active
balancing itself is somewhat racy due to the migration-thread wakeup
(and the active_balance flag) going outside the runqueue locks [for
similar reasons].
so it all looks quite plausible - the normal SMP boxes don't trigger it,
but Bjorn's 128-CPU setup with a non-trivial domain hierarchy triggers
it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
active_load_balance() looks susceptible to deadlock when busiest==rq.
Without the following patch, my 128-way box deadlocks consistently
during boot-time driver init.
|
|
From: Ingo Molnar <mingo@elte.hu>
Now the x86_64 bitop memory clobber problem has been fixed we can remove
this.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: bert hubert <ahu@ds9a.nl>
Documentation is in fact for tgkill and not for tkill
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Jakub Jelinek <jakub@redhat.com>
FUTEX_REQUEUE operation has been added to the kernel mainly to improve
pthread_cond_broadcast which previously used FUTEX_WAKE INT_MAX op.
pthread_cond_broadcast releases the internal condvar mutex before the
FUTEX_REQUEUE operation, as otherwise the woken-up thread would most likely
immediately go back to sleep on the internal condvar mutex until the
broadcasting thread releases it.
Unfortunately this is racy and causes e.g.
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/nptl/tst-cond16.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=glibc
to hang on SMP.
http://listman.redhat.com/archives/phil-list/2004-May/msg00023.html contains
an analysis of how the hang happens: if any thread does pthread_cond_*wait
between the release of the internal condvar mutex and the FUTEX_REQUEUE
operation, the wrong thread might be woken (and immediately go back to
sleep, because it doesn't satisfy the conditions for returning from
pthread_cond_*wait) while the right thread is requeued on the associated
mutex, leaving nobody to wake it up.
The patch below extends FUTEX_REQUEUE operation with something FUTEX_WAIT
already uses:
FUTEX_CMP_REQUEUE is passed an additional argument, the expected value of
*futex. The kernel, while holding the futex locks, checks whether *futex !=
expected and returns -EAGAIN in that case; if they are equal it continues
with a normal FUTEX_REQUEUE operation. If the syscall returns -EAGAIN, NPTL
can fall back to the FUTEX_WAKE INT_MAX operation, which doesn't have this
problem but is less efficient; in the likely case that nobody hit the
(small) window, the efficient requeue path is used.
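A minimal user-space sketch of that fallback logic, assuming a raw futex()
wrapper around the syscall (this is not the actual NPTL code; broadcast_wake
and the argument layout just follow the futex(2) convention, with the
expected value in the last argument):

        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <limits.h>
        #include <errno.h>
        #include <stddef.h>

        static long futex(int *uaddr, int op, int val, unsigned long val2,
                          int *uaddr2, int val3)
        {
                return syscall(SYS_futex, uaddr, op, val, val2, uaddr2, val3);
        }

        /* Wake one waiter and requeue the rest onto the mutex, but only if
         * *cond still holds the value we last saw; on -EAGAIN somebody
         * raced us, so fall back to the correct-but-slower wake-all path. */
        static void broadcast_wake(int *cond, int *mutex, int expected)
        {
                if (futex(cond, FUTEX_CMP_REQUEUE, 1, INT_MAX,
                          mutex, expected) < 0 && errno == EAGAIN)
                        futex(cond, FUTEX_WAKE, INT_MAX, 0, NULL, 0);
        }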
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
kthreads are not just for breakfast anymore.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (creator)
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This fixes compilation of x86-64 without CONFIG_NUMA again (got broken
by the previous patchkit)
|
|
|
|
|
|
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Fix a CPU Hotplug problem wherein idle task's "->prio" value is not
restored to MAX_PRIO during CPU_DEAD handling. Without this patch, once a
CPU is offlined and then later onlined, it becomes "more or less" useless
(does not run any task other than its idle task!)
Ingo said:
The __setscheduler() call is (technically) incorrect because in the
SCHED_NORMAL case the prio should be zero. So it's a bit cleaner to set up
the static priority to MAX_PRIO and then revert the policy to SCHED_NORMAL
via __setscheduler().
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
|
|
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We can avoid the local_irq_enable() in sched_yield() because schedule()
unconditionally enables interrupts anyway.
|
|
Signed-off-by: Christian Meder <chris@onestepahead.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The following obviously correct patch from Christian Meder simplifies the
DELTA() define.
|
|
From: Olaf Kirch <okir@suse.de>
I have been chasing a corruption of current->group_info on PPC during NFS
stress tests. The problem seems to be that nfsd is messing with its
group_info quite a bit, while some monitoring processes look at
/proc/<pid>/status and do a get_group_info/put_group_info without any locking.
This problem can be reproduced on ppc platforms within a few seconds if you
generate some NFS load and do a "cat /proc/XXX/status" of an nfsd thread in a
tight loop.
I therefore think that changes to current->group_info, and querying it from
a different process, need to be protected using task_lock.
(akpm: task->group_info here is safe against exit() because the task holds a
ref on group_info which is released in __put_task_struct, and the /proc file
has a ref on the task_struct).
|
|
From: Ingo Molnar <mingo@elte.hu>
printk currently does
        if (oops_in_progress)
                bust_printk_locks();
which means that once we oops, the printk locking is 100% ineffective and
multiple CPUs make an unreadable mess on a serial console. It's a significant
development hassle.
Fix that up by only popping locks once per ten seconds.
akpm@osdl.org did:
- Bump the timeout to 30 seconds - 9600 baud is slow.
- Handle jiffy wraps: change the logic so that we only skip the lockbust
if the current time is within 30 seconds of the previous lockbusting
attempt.
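A minimal sketch of the resulting logic (the helper and variable names are
illustrative, not the patch's actual ones); time_after_eq() keeps the
comparison safe across jiffy wraps:

        static unsigned long last_bust;         /* jiffies of last bust */

        if (oops_in_progress &&
            time_after_eq(jiffies, last_bust + 30 * HZ)) {
                last_bust = jiffies;
                bust_printk_locks();            /* illustrative name */
        }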
|
|
Any architecture (like pa-risc) that makes use of the helper function
flush_dcache_mmap_lock() won't compile with the new rmap due to use of
the wrong "mapping".
Trivial fix.
|
|
From: Hugh Dickins <hugh@veritas.com>
Andrea Arcangeli's anon_vma object-based reverse mapping scheme for anonymous
pages. Instead of tracking anonymous pages by pte_chains or by mm, this
tracks them by vma. But because vmas are frequently split and merged
(particularly by mprotect), a page cannot point directly to its vma(s), but
instead to an anon_vma list of those vmas likely to contain the page - a list
on which vmas can easily be linked and unlinked as they come and go. The vmas
on one list are all related, either by forking or by splitting.
This has three particular advantages over anonmm: that it can cope
effortlessly with mremap moves; and no longer needs page_table_lock to protect
an mm's vma tree, since try_to_unmap finds vmas via page -> anon_vma -> vma
instead of using find_vma; and should use less cpu for swapout since it can
locate its anonymous vmas more quickly.
It does have disadvantages too: a lot more change in mmap.c to deal with
anon_vmas, though small straightforward additions now that the vma merging has
been refactored there; more lowmem needed for each anon_vma and vma structure;
an additional restriction on the merging of vmas (cannot be merged if already
assigned different anon_vmas, since then their pages will be pointing to
different heads).
(There would be no need to enlarge the vma structure if anonymous pages
belonged only to anonymous vmas; but private file mappings accumulate
anonymous pages by copy-on-write, so need to be listed in both anon_vma and
prio_tree at the same time. A different implementation could avoid that by
using anon_vmas only for purely anonymous vmas, and use the existing prio_tree
to locate cow pages - but that would involve a long search for each single
private copy, probably not a good idea.)
Where before the vm_pgoff of a purely anonymous (not file-backed) vma was
meaningless, now it represents the virtual start address at which that vma is
mapped - which the standard file pgoff manipulations treat linearly as vmas
are split and merged. But if mremap moves the vma, then it generally carries
its original vm_pgoff to the new location, so pages shared with the old
location can still be found. Magic.
Hugh has massaged it somewhat: building on the earlier rmap patches, this
patch is a fifth of the size of Andrea's original anon_vma patch. Please note
that this posting will be his first sight of this patch, which he may or may
not approve.
|
|
From: Hugh Dickins <hugh@veritas.com>
Before moving on to anon_vma rmap, remove now what's peculiar to anonmm rmap:
the anonmm handling and the mremap move cows. Temporarily reduce
page_referenced_anon and try_to_unmap_anon to stubs, so a kernel built with
this patch will not swap anonymous at all.
|
|
From: Hugh Dickins <hugh@veritas.com>
arm and parisc __flush_dcache_page have been scanning the i_mmap(_shared) list
without locking or disabling preemption. That may be even more unsafe now
it's a prio tree instead of a list.
It looks like we cannot use i_shared_lock for this protection: most uses of
flush_dcache_page are okay, and only one would need lock ordering fixed
(get_user_pages holds page_table_lock across flush_dcache_page); but there's a
few (e.g. in net and ntfs) which look as if they're using it in I/O
completion - and it would be restrictive to disallow it there.
So, on arm and parisc only, define flush_dcache_mmap_lock(mapping) as
spin_lock_irq(&(mapping)->tree_lock); on i386 (and other arches left to the
next patch) define it away to nothing; and use where needed.
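In outline, a sketch of those definitions (the header placement and the
unlock counterpart are assumptions here):

        /* arm, parisc: */
        #define flush_dcache_mmap_lock(mapping) \
                spin_lock_irq(&(mapping)->tree_lock)
        #define flush_dcache_mmap_unlock(mapping) \
                spin_unlock_irq(&(mapping)->tree_lock)

        /* i386 (and, after the next patch, the other arches): */
        #define flush_dcache_mmap_lock(mapping)         do { } while (0)
        #define flush_dcache_mmap_unlock(mapping)       do { } while (0)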
While updating locking hierarchy in filemap.c, remove two layers of the fossil
record from add_to_page_cache comment: no longer used for swap.
I believe all the #includes will work out, but have only built i386. I can
see several things about this patch which might cause revulsion: the name
flush_dcache_mmap_lock? the reuse of the page radix_tree's tree_lock for this
different purpose? spin_lock_irqsave instead? can't we somehow get
i_shared_lock to handle the problem?
|
|
From: Hugh Dickins <hugh@veritas.com>
Pave the way for prio_tree by switching over to its interfaces, but actually
still implement them with the same old lists as before.
Most of the vma_prio_tree interfaces are straightforward. The interesting one
is vma_prio_tree_next, used to search the tree for all vmas which overlap the
given range: unlike the list_for_each_entry it replaces, it does not find
every vma, just those that match.
But this does leave handling of nonlinear vmas in a very unsatisfactory state:
for now we have to search again over the maximum range to find all the
nonlinear vmas which might contain a page, which of course takes away the
point of the tree. Fixed in later patch of this batch.
There is no need to initialize vma linkage all over, just do it before
inserting the vma in list or tree. /proc/pid/statm had an odd test for its
shared count: simplified to an equivalent test on vm_file.
|
|
From: Christoph Hellwig <hch@lst.de>
- don't include mempolicy.h in sched.h and mm.h when a forward declaration
is enough. Andi argued against that in the past, but I'd really hate to add
another header to two of the includes used in basically every driver when we
can include it in the six files actually needing it instead (that number is
for my ppc32 system; maybe other arches need more includes in their
directories).
- make the numa api fields in task_struct conditional on CONFIG_NUMA; this
gives us a few ugly ifdefs but avoids wasting memory on non-NUMA systems.
|
|
From: Andi Kleen <ak@suse.de>
NUMA API adds a policy to each VMA. During VMA creation, merging and
splitting these policies must be handled properly. This patch adds the
calls for this.
It is a nop when CONFIG_NUMA is not defined.
|
|
From: Andi Kleen <ak@suse.de>
The following patches add support for configurable NUMA memory policy
for user processes. It is based on the proposal from last kernel summit
with feedback from various people.
This NUMA API doesn't attempt to implement page migration or anything else
complicated: all it does is police the allocation when a page is first
allocated, or when a page is reallocated after swapping. Currently only
support for shared memory and anonymous memory is there; policy for
file-based mappings is not implemented yet (although they get implicitly
policied by the default process policy).
It adds three new system calls: mbind to change the policy of a VMA,
set_mempolicy to change the policy of a process, get_mempolicy to retrieve
memory policy. User tools (numactl, libnuma, test programs, manpages) can be
found in ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.6.tar.gz
For details on the system calls see the manpages
http://www.firstfloor.org/~andi/mbind.html
http://www.firstfloor.org/~andi/set_mempolicy.html
http://www.firstfloor.org/~andi/get_mempolicy.html
Most user programs should actually not use the system calls directly,
but use the higher level functions in libnuma
(http://www.firstfloor.org/~andi/numa.html) or the command line tools
(http://www.firstfloor.org/~andi/numactl.html).
The system calls allow user programs and administrators to set various NUMA
memory policies for putting memory on specific nodes. Here is a short
description of the policies, copied from the kernel patch:
* NUMA policy allows the user to give hints in which node(s) memory should
* be allocated.
*
* Support four policies per VMA and per process:
*
* The VMA policy has priority over the process policy for a page fault.
*
* interleave Allocate memory interleaved over a set of nodes,
* with normal fallback if it fails.
* For VMA based allocations this interleaves based on the
* offset into the backing object or offset into the mapping
* for anonymous memory. For process policy a process counter
* is used.
* bind Only allocate memory on a specific set of nodes,
* no fallback.
* preferred Try a specific node first before normal fallback.
* As a special case node -1 here means do the allocation
* on the local CPU. This is normally identical to default,
* but useful to set in a VMA when you have a non default
* process policy.
* default Allocate on the local node first, or when on a VMA
* use the process policy. This is what Linux always did
* in a NUMA aware kernel and still does by, ahem, default.
*
* The process policy is applied for most non interrupt memory allocations
* in that process' context. Interrupts ignore the policies and always
* try to allocate on the local CPU. The VMA policy is only applied for memory
* allocations for a VMA in the VM.
*
* Currently there are a few corner cases in swapping where the policy
* is not applied, but the majority should be handled. When process policy
* is used it is not remembered over swap outs/swap ins.
*
* Only the highest zone in the zone hierarchy gets policied. Allocations
* requesting a lower zone just use default policy. This implies that
* on systems with highmem, kernel lowmem allocations don't get policied.
* Same with GFP_DMA allocations.
*
* For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
* all users and remembered even when nobody has memory mapped.
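To make the interface concrete, a small hypothetical user-space sketch
using the raw system calls via libnuma's <numaif.h> declarations (as noted
above, real programs should use the higher-level libnuma functions
instead):

        #include <numaif.h>
        #include <sys/mman.h>
        #include <stdio.h>

        int main(void)
        {
                unsigned long nodes = 0x3;      /* nodemask: nodes 0 and 1 */
                void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                /* Interleave this process's future allocations over nodes 0-1. */
                if (set_mempolicy(MPOL_INTERLEAVE, &nodes, 8 * sizeof(nodes)))
                        perror("set_mempolicy");

                /* Bind one VMA to node 0 only, with no fallback. */
                nodes = 0x1;
                if (mbind(buf, 1 << 20, MPOL_BIND, &nodes,
                          8 * sizeof(nodes), 0))
                        perror("mbind");
                return 0;
        }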
This patch:
This is the core NUMA API code. This includes NUMA policy aware
wrappers for get_free_pages and alloc_page_vma(). On non NUMA kernels
these are defined away.
The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html),
get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and
set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are
implemented here.
Adds a vm_policy field to the VMA and to the process. The process also has
a field for interleaving. VMA interleaving uses the offset into the VMA,
but that's not possible for process allocations.
From: Andi Kleen <ak@muc.de>
> Andi, how come policy_vma() calls ->set_policy under i_shared_sem?
I think this can be actually dropped now. In an earlier version I did
walk the vma shared list to change the policies of other mappings to the
same shared memory region. This turned out too complicated with all the
corner cases, so I eventually gave in and added ->get_policy to the fast
path. Also there is still the mmap_sem which prevents races in the same MM.
Patch to remove it attached. Also adds documentation and removes the
bogus __alloc_page_vma() prototype noticed by hch.
From: Andi Kleen <ak@suse.de>
A few incremental fixes for NUMA API.
- Fix a few comments
- Add a compat_ function for get_mem_policy. I considered changing the
ABI to avoid this, but that would have made the API too ugly. I put it
directly into the file because a mm/compat.c didn't seem worth it just for
this.
- Fix the algorithm for VMA interleave.
From: Matthew Dobson <colpatch@us.ibm.com>
1) Move the extern of alloc_pages_current() into #ifdef CONFIG_NUMA.
The only references to the function are in NUMA code in mempolicy.c
2) Remove the definitions of __alloc_page_vma(). They aren't used.
3) Move forward declaration of struct vm_area_struct to top of file.
|
|
Having a semaphore in there causes modest performance regressions on heavily
mmap-intensive workloads on some hardware. Specifically, up to 30% in SDET on
NUMAQ and big PPC64.
So switch it back to being a spinlock. This does mean that unmap_vmas() needs
to be told whether or not it is allowed to schedule away; that's simple to do
via the zap_details structure.
This change means that there will be high scheduling latencies when someone
truncates a large file which is currently mmapped, but nobody does that
anyway. The scheduling points in unmap_vmas() are mainly for munmap() and
exit(), and they still work OK for that.
From: Hugh Dickins <hugh@veritas.com>
Sorry, my premature optimizations (trying to pass down NULL zap_details
except when needed) have caught you out doubly: unmap_mapping_range_list was
NULLing the details even though atomic was set; and if it hadn't, then
zap_pte_range would have missed free_swap_and_cache and pte_clear when pte
not present. Moved the optimization into zap_pte_range itself. Plus
massive documentation update.
From: Hugh Dickins <hugh@veritas.com>
Here's a second patch to add to the first: mremap's cows can't come home
without releasing the i_mmap_lock, better move the whole "Subtle point"
locking from move_vma into move_page_tables. And it's possible for the file
that was behind an anonymous page to be truncated while we drop that lock,
don't want to abort mremap because of VM_FAULT_SIGBUS.
(Eek, should we be checking do_swap_page of a vm_file area against the
truncate_count sequence? Technically yes, but I doubt we need bother.)
- We cannot hold i_mmap_lock across move_one_page() because
move_one_page() needs to perform __GFP_WAIT allocations of pagetable pages.
- Move the cond_resched() out so we test it once per page rather than only
when move_one_page() returns -EAGAIN.
|
|
From: Hugh Dickins <hugh@veritas.com>
Hugh's anonmm object-based reverse mapping scheme for anonymous pages. We
have not yet decided whether to adopt this scheme, or Andrea's more advanced
anon_vma scheme. anonmm is easier for me to merge quickly, to replace the
pte_chain rmap taken out in the previous patch; a patch to install Andrea's
anon_vma will follow in due course.
Why build up and tear down chains of pte pointers for anonymous pages, when a
page can only appear at one particular address, in a restricted group of mms
that might share it? (Except: see next patch on mremap.)
Introduce struct anonmm per mm to track anonymous pages, all forks from one
exec sharing the same bundle of linked anonmms. Anonymous pages originate in
one mm, but may be forked into another mm of the bundle later on. Callouts
from fork.c to allocate, dup and exit the anonmm structure private to rmap.c.
From: Hugh Dickins <hugh@veritas.com>
Two concurrent exits (of the last two mms sharing the anonhd): the first
exit_rmap brings anonhd->count down to 2 and gets preempted (at the
spin_unlock) by the second, which brings anonhd->count down to 1, sees it
is 1 and frees the anonhd (without making any change to anonhd->count
itself); the cpu goes on to do something new which reallocates the old
anonhd as a new struct anonmm (probably not a head, in which case count
starts at 1); the first resumes after its spin_unlock, sees anonhd->count
1, and frees "anonhd" again; it is reused for something else; and a later
exit_rmap list_del finds the list corrupt.
|
|
Many places do:
        if (kmem_cache_create(...) == NULL)
                panic(...);
We can consolidate all that by passing another flag to kmem_cache_create()
which says "panic if it doesn't work".
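A sketch of the before/after, under the assumption that the new flag is
spelled SLAB_PANIC (struct foo and the cache name are hypothetical):

        /* before: */
        foo_cachep = kmem_cache_create("foo", sizeof(struct foo),
                                       0, 0, NULL, NULL);
        if (foo_cachep == NULL)
                panic("cannot create foo cache");

        /* after: the allocator itself panics on failure */
        foo_cachep = kmem_cache_create("foo", sizeof(struct foo),
                                       0, SLAB_PANIC, NULL, NULL);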
|
|
From: Pavel Machek <pavel@ucw.cz>
This fixes bad interaction between devfs and swsusp.
Check whether the swap device is the specified resume device, irrespective of
whether they are specified by identical names.
(Thus, device inode aliasing is allowed. You can say /dev/hda4 instead of
/dev/ide/host0/bus0/target0/lun0/part4 [if using devfs] and they'll be
considered the same device. This is *necessary* for devfs, since the resume
code can only recognize the form /dev/hda4, but the suspend code would like
the long name [as shown in 'cat /proc/mounts'].)
[Thanks to devfs hero whose name I forgot.]
|
|
From: Pavel Machek <pavel@ucw.cz>
This is no longer necessary. We have enough pauses elsewhere, and it works
well enough that this is not needed.
|
|
From: David Mosberger <davidm@napali.hpl.hp.com>
Below is a patch that tries to sanitize the dropping of unneeded system-call
stubs in generic code. In some instances, it would be possible to move the
optional system-call stubs into a library routine which would avoid the need
for #ifdefs, but in many cases, doing so would require making several
functions global (and possibly exporting additional data-structures in
header-files). Furthermore, it would inhibit (automatic) inlining in the
cases where the stubs are needed. For these reasons, the patch
keeps the #ifdef-approach.
This has been tested on ia64 and there were no objections from the
arch-maintainers (and one positive response). The patch should be safe but
arch-maintainers may want to take a second look to see if some __ARCH_WANT_foo
macros should be removed for their architecture (I'm quite sure that's the
case, but I wanted to play it safe and only preserved the status-quo in that
regard).
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
From: Pavel Machek <pavel@ucw.cz>
This patch fixes init section usage in swsusp.c: "read_suspend_image()" can
be __init.
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
kallsyms contains only function names, but some debuggers (eg. xmon on
PPC/PPC64) use it to lookup symbols: it'd be much nicer if it included data
symbols too.
|
|
From: Anton Blanchard <anton@samba.org>
From: Rusty Russell <rusty@rustcorp.com.au>
When hotplug cpu isn't enabled, cpu_is_offline is always false. I had a stuck
cpu at boot that resulted in a lockup because we tried to start a migration
thread on it. Instead of cpu_is_offline we can use !cpu_online which should
cover both the hotplug cpu enabled and disabled cases.
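In other words, the guard becomes simply (a sketch; the surrounding loop is
assumed):

        if (!cpu_online(cpu))
                continue;       /* don't start a migration thread on it */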
|
|
From: Andi Kleen <ak@muc.de>
The new domain scheduler got miscompiled on x86-64 with gcc 3.3.3-hammer,
which is shipping with some distributions. The kernel deadlocks eventually
under light stress on SMP systems with the right options.
After some experiments it seems this simple change avoids the
miscompilation. It also doesn't pessimize the code unduly for other
architectures.
|
|
Split the system_state state `SYSTEM_SHUTDOWN' into SYSTEM_HALT,
SYSTEM_POWER_OFF and SYSTEM_RESTART and export system_state to modules.
This allows driver shutdown routines to know why they are being shutdown. The
IDE subsystem wants this so that it knows to not spin the disks down across a
reboot.
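An illustrative shutdown hook (the helper names here are hypothetical; only
system_state and the three new constants come from this change):

        static void ide_dev_shutdown(struct device *dev)
        {
                /* Keep the disk spinning across a reboot; only spin it
                 * down for a real halt or power-off. */
                if (system_state == SYSTEM_HALT ||
                    system_state == SYSTEM_POWER_OFF)
                        ide_spin_down_disk(dev);        /* hypothetical */
        }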
|
|
|
|
We need to always allocate at least one indirect block
pointer, since we always fill out blocks[0] even if
we don't have any groups.
|
|
From: Olaf Kirch <okir@suse.de>
Authentication code in net/sunrpc makes frequent use of groups_alloc(0),
which seems to clobber memory past the end of what it allocated.
If called with gidsetsize == 0, groups_alloc will set nblocks = 0,
but still does:
        group_info->blocks[0] = group_info->small_block;
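A minimal sketch of the implied fix (the exact form in the patch may
differ): round nblocks up to at least one, so that blocks[0] always refers
to storage the allocation covers.

        int nblocks = (gidsetsize + NGROUPS_PER_BLOCK - 1)
                        / NGROUPS_PER_BLOCK;
        if (nblocks == 0)
                nblocks = 1;    /* blocks[0] is always written */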
|
|
From: Arjan van de Ven <arjanv@redhat.com>,
Rusty Russell <rusty@rustcorp.com.au>
The patch below resolves the "Not Yet Implemented" print_modules() thing.
This is a really useful feature for distros; it allows us to do statistical
analysis on which modules are present how often in oopses compared to how
often they are used normally. In addition it helps to spot candidates for
certain bugs without having to go back to the customer asking for this
information.
|
|
Fix some silliness in there.
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
Kallsyms discards symbols with the same address, but these are sometimes
useful. Skip this minor optimization and make kallsyms_lookup deal with
aliases.
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
The current code doesn't show the last symbol (usually _einittext) in
/proc/kallsyms. The reason for this is subtle: s_start() returns an empty
string for position 0 (ignored by s_show()), and s_next() returns the first
symbol for position 1.
What should happen is that update_iter() for position 0 should fill in the
first symbol. Unfortunately, the get_ksymbol_core() fills in the symbol
information, *and* updates the iterator: we have to split these functions,
which we do by making it return the length of the name offset.
Then we can call get_ksymbol_core() without moving the iterator, meaning
that we can call it at position 0 (ie. s_start()).
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
Analysis and basic idea from Suresh Siddha <suresh.b.siddha@intel.com>
"This small change in load_balance() brings the performance back upto base
scheduler(infact I see a ~1.5% performance improvement now). Basically
this fix removes the unnecessary double_lock.."
Workload is SpecJBB on 16-way Altix.
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
Fine-tune the unsynched sched_clock handling.
Basically, you need to be careful about ensuring timestamps get correctly
adjusted when moving CPUs, and you *can't* look at your unadjusted
sched_clock() and a remote task's ->timestamp and try to come up with
anything meaningful.
I think this second problem will really hit hard in the activate_task path
on systems with unsynched sched_clock when you're waking up a remote task,
which happens very often. Andi, I thought some Opterons have unsynched
tscs? Maybe this is causing your unexplained bad interactivity?
Another problem is a fixup in pull_task. When adjusting ->timestamp from
one processor to another, you must use timestamp_last_tick for the local
processor too. Using sched_clock() will cause ->timestamp to creep
forward.
A final small fix is for sync wakeups. They were using __activate_task for
some reason, thus they don't get credited for sleeping at all AFAIKS.
And another thing, do we want to #ifdef timestamp_last_tick so it doesn't
show on UP?
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
"Siddha, Suresh B" <suresh.b.siddha@intel.com> noticed a problem in the
cpu_load averaging where the integer truncation could sometimes cause cpu_load
to never quite reach its target.
I'm not sure that you could demonstrate a real world problem, but I quite
like this fix.
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
s390 core changes:
- Rename idle_cpu_mask to nohz_cpu_mask as agreed with Dipankar.
- Refine compiler version check for "Q" constraints in uaccess.h.
- Store per process ptrace information to the correct place.
- Fix per cpu data access for 64-bit modules.
- Add topology_init function for cpu hotplug.
- Define TASK_SIZE dependent on TIF_31BIT and define MM_VM_SIZE
to 4TB to get rid of elf_map32 and arch_get_unmapped_area.
|
|
From: Geoff Gustafson <geoff@linux.jf.intel.com>,
"Chen, Kenneth W" <kenneth.w.chen@intel.com>,
Ingo Molnar <mingo@elte.hu>,
me.
The big-SMP guys are seeing high CPU load due to del_timer_sync()'s
inefficiencies. The callers are fs/aio.c and schedule_timeout().
We note that neither of these callers' timer handlers actually re-add the
timer - they are single-shot.
So we don't need all that complexity in del_timer_sync() - we can just run
del_timer() and if that worked we know the timer is dead.
Add del_single_shot_timer(), export it to modules and use it in AIO and
schedule_timeout().
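A sketch of the idea (not necessarily the exact implementation): if
del_timer() succeeds the timer was still pending, and since single-shot
handlers never re-add it, it cannot be running anywhere; only on failure do
we need the expensive synchronization.

        void del_single_shot_timer(struct timer_list *timer)
        {
                if (del_timer(timer))
                        return;         /* was pending: now dead for sure */
                del_timer_sync(timer);  /* may be running: wait it out */
        }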
(these numbers are for an earlier patch, but they'll be close)
        Before:          32p      4p
          Warm cache  29,000     505
          Cold cache  37,800    1220
        After:           32p      4p
          Warm cache      95      88
          Cold cache   1,800     140
[Measurements are CPU cycles spent in a call to del_timer_sync, the average
of 1000 calls. 32p is 16-node NUMA, 4p is SMP.]
(I cleaned up a few things and added some commentary)
|
|
From: Paul Jackson <pj@sgi.com>
With a hotplug capable kernel, there is a requirement to distinguish a
possible CPU from one actually present. The set of possible CPU numbers
doesn't change during a single system boot, but the set of present CPUs
changes as CPUs are physically inserted into or removed from a system. The
cpu_possible_map does not change once initialized at boot, but the
cpu_present_map changes dynamically as CPUs are inserted or removed.
Paul Jackson <pj@sgi.com> provided an expanded explanation:
Ashok's cpu hot plug patch adds a cpu_present_map, resulting in the following
cpu maps being available. All the following maps are fixed size bitmaps of
size NR_CPUS.
#ifdef CONFIG_HOTPLUG_CPU
        cpu_possible_map - map with all NR_CPUS bits set
        cpu_present_map  - map with bit 'cpu' set iff cpu is populated
        cpu_online_map   - map with bit 'cpu' set iff cpu available to scheduler
#else
        cpu_possible_map - map with bit 'cpu' set iff cpu is populated
        cpu_present_map  - copy of cpu_possible_map
        cpu_online_map   - map with bit 'cpu' set iff cpu available to scheduler
#endif
In either case, NR_CPUS is fixed at compile time, as the static size of these
bitmaps. The cpu_possible_map is fixed at boot time, as the set of CPU ids
that could possibly be plugged in at any time during the life of that
system boot. The cpu_present_map is dynamic(*), representing which CPUs
are currently plugged in. And cpu_online_map is the dynamic subset of
cpu_present_map, indicating those CPUs available for scheduling.
If HOTPLUG is enabled, then cpu_possible_map is forced to have all NR_CPUS
bits set, otherwise it is just the set of CPUs that ACPI reports present at
boot.
If HOTPLUG is enabled, then cpu_present_map varies dynamically, depending on
what ACPI reports as currently plugged in, otherwise cpu_present_map is just a
copy of cpu_possible_map.
(*) Well, cpu_present_map is dynamic in the hotplug case. If not hotplug,
it's the same as cpu_possible_map, hence fixed at boot.
|
|
From: Ashok Raj <ashok.raj@intel.com>
This patch changes __init to __devinit on init_idle so that when a new cpu
arrives, it can call these functions at a later time.
|
|
From: William Lee Irwin III <wli@holomorphy.com>
This patch provides an additional argument to __wake_up_common() so that the
information wakefunc.patch made waiters ready to receive may be passed to them
by wakers. This is provided as a separate patch so that the overhead of the
additional argument to __wake_up_common() can be measured in isolation. No
change in performance was observable here.
|
|
From: William Lee Irwin III <wli@holomorphy.com>
This patch series is solving the "thundering herd" problem that occurs in the
mainline implementation of hashed waitqueues. There are two sources of
spurious wakeups in such arrangements:
(a) Hash collisions that place waiters on different objects on the same
waitqueue, which wakes threads falsely when any of the objects hashed to
the same queue receives a wakeup. i.e. loss of information about which
object a wakeup event is related to.
(b) Loss of information about which object a given waiter is waiting on.
This precludes wake-one semantics for mutual exclusion scenarios. For
instance, a lock bit may be slept on. If there are any waiters on the
object, a lock bit release event must wake at least one of them so as to
prevent deadlock. But without information as to which waiter is waiting
on which object, we must resort to waking all waiters who could possibly
be waiting on it. Now, as the lock bit provides mutual exclusion, only
one of the waiters woken can proceed, and the remainder will go back to
sleep and wait for another event, creating unnecessary system load. Once
wake-one semantics are established, only one of the waiters waiting to
acquire a lock bit needs to be woken, which measurably reduces system load
and improves efficiency (i.e. it's the subject of the benchmarking I've
been sending to you).
Even beyond the measurable efficiency gains, there are reasons of robustness
and responsiveness to motivate addressing the issue of thundering herds. In a
real-life scenario I've been personally involved in resolving, the thundering
herd issue caused powerful modern SMP machines with fast IO systems to be
unresponsive to user input for a minute at a time or more. Analogues of these
patches for the distro kernels involved fully resolved the issue to the
customer's satisfaction and obviated workarounds to limit the pagecache's
size.
The latest spin of these patches basically shoves more pieces of the logic
into the wakeup functions, with some efficiency gains from sharing the hot
codepath with the rest of the kernel, and a slightly larger diff than the
patches with the newly-introduced entrypoint. Writing these was motivated by
the push to insulate sched.c from more of the details of wakeup semantics by
putting more of the logic into the wakeup functions. In order to accomplish
this while still solving (b), the wakeup functions grew a new argument
through which the waker can communicate which object a wakeup event is
related to.
=========
This patch provides an additional argument to wakeup functions so that
information may be passed from the waker to the waiter. This is provided as a
separate patch so that the overhead of the additional argument can be measured
in isolation. No change in performance was observable here.
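As a hedged illustration of the shape such a keyed wakeup function can
take (the struct, field and function names below are invented for the
example, not taken from the patch):

        struct keyed_wait {
                wait_queue_t wait;      /* embedded standard entry */
                struct page *page;      /* object this waiter sleeps on */
                int bit_nr;
        };

        static int keyed_wake_function(wait_queue_t *wait, unsigned mode,
                                       int sync, void *key)
        {
                struct keyed_wait *w =
                        container_of(wait, struct keyed_wait, wait);
                struct keyed_wait *event = key; /* waker passes a match key */

                /* Ignore wakeups aimed at a different object that merely
                 * hashed to the same waitqueue. */
                if (w->page != event->page || w->bit_nr != event->bit_nr)
                        return 0;
                return default_wake_function(wait, mode, sync, key);
        }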
|
|
David Mosberger asked that this be backed out:
"I do not believe that flushing the TLB before migration is be the right thing
to do on ia64 machines which support global TLB purges (i.e., all but SGI's
machines)."
It was of huge benefit for the SGI machines, so work is ongoing.
|
|
Switch all users of MSEC[S]_TO_JIFFIES and JIFFIES_TO_MSEC[S] over to use
jiffies_to_msecs() and msecs_to_jiffies(). Withdraw MSECS_TO_JIFFIES() and
JIFFIES_TO_MSECS() from the kernel API.
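Typical call sites after the switch look like this (illustrative; `start'
is assumed to be a jiffies value sampled earlier):

        /* sleep for roughly 50 milliseconds: */
        set_current_state(TASK_INTERRUPTIBLE);
        schedule_timeout(msecs_to_jiffies(50));

        /* report a jiffies delta in milliseconds: */
        printk("took %u ms\n", jiffies_to_msecs(jiffies - start));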
|
|
From: Ingo Molnar <mingo@elte.hu>
We have various different implementations of MSEC[S]_TO_JIFFIES and
JIFFIES_TO_MSEC[S]. We recently had a compile-time clash in USB.
Fix all that up.
- The SCTP version was very inefficient. Hopefully this version is accurate
enough.
- Optimise for the HZ=100 and HZ=1000 cases
- This version does round-up, so sleep(9 milliseconds) works OK on 100HZ.
- We still have lots of jiffies_to_msec and msec_to_jiffies implementations.
From: William Lee Irwin III <wli@holomorphy.com>
Optimize the cases where HZ is a divisor of 1000 or vice-versa in
JIFFIES_TO_MSECS() and MSECS_TO_JIFFIES() by allowing the nonvanishing(!)
integral ratios to appear as a parenthesized expressions eligible for
constant folding optimizations.
From: me
Use typesafe inlines for the jiffies-to-millisecond conversion functions.
This means that milliseconds officially takes the type `unsigned int'.
All current callers seem to be OK with that.
Drivers need to be fixed up to use this instead of their private versions.
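A minimal sketch of the shape of the typesafe, constant-folding conversion
(rounding up; the real patch covers more cases):

        static inline unsigned long msecs_to_jiffies(const unsigned int m)
        {
        #if HZ <= 1000 && !(1000 % HZ)
                return (m + (1000 / HZ) - 1) / (1000 / HZ);
        #elif HZ > 1000 && !(HZ % 1000)
                return m * (HZ / 1000);
        #else
                return (m * HZ + 999) / 1000;
        #endif
        }

When HZ is known at compile time, the divisions by (1000 / HZ) reduce to
constants, so the common HZ=100 and HZ=1000 cases cost one add and one
shift-or-so rather than a runtime divide.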
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
this_rq_lock does a local_irq_disable, and sched_yield() needs to undo that.
|
|
into kroah.com:/home/greg/linux/BK/driver-2.6
|
|
Thanks to Andrew Morton for pointing this out to me.
|
|
We may as well make usermodehelper_init() a core_initcall as well, to make
sure its services are available to all the other initcall levels.
|
|
From: Stephen Hemminger <shemminger@osdl.org>
Minor tweak to rcu, use __list_splice instead of list_splice because the
list has already been checked for empty.
|
|
From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>,
"Seth, Rohit" <rohit.seth@intel.com>
This patch addresses the longstanding problem wherein Oracle needs
CAP_IPC_LOCK to allocate SHM_HUGETLB shm memory, but people don't want to run
Oracle as root, and capabilities are busted.
Various ideas with rlimits didn't work out, mainly because these objects live
beyond the lifetime of the user processes which establish them.
What we do is to create root-writeable /proc/sys/vm/hugetlb_shm_group which
specifies a single group ID. Users who belong to that group may allocate
hugepages for SHM_HUGETLB shm segments.
So the sysadmin will create a new group, say `hugepageusers', add the
oracle user to that group and write that group's ID into
/proc/sys/vm/hugetlb_shm_group.
|
|
Fix a waitqueue-handling race in worker_thread().
|
|
Fix bug identified by Srivatsa Vaddagiri <vatsa@in.ibm.com>:
There's a deadlock in __create_workqueue when CONFIG_HOTPLUG_CPU is set. This
can happen when create_workqueue_thread fails to create a worker thread. In
that case, we call destroy_workqueue with cpu hotplug lock held.
destroy_workqueue however also attempts to take the same lock.
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
Only print the tainted message the first time. Its purpose is to warn
users that we can't support them, not to fill their logs.
|
|
find_user() is being called from set/get_priority(), but it doesn't take the
needed lock, and those callers were forgetting to drop the refcount which
find_user() took.
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
1) Create an in_sched_functions() function in sched.c and make the
archs use it. (Two archs have wchan #if 0'd out: left them alone).
2) Move __sched from linux/init.h to linux/sched.h and add comment.
3) Rename __scheduling_functions_start_here/end_here to __sched_text_start/end.
Thanks to wli and Sam Ravnborg for clue donation.
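A sketch of the helper from (1), using the renamed linker symbols from (3)
(details such as also covering the lock functions are elided):

        extern char __sched_text_start[], __sched_text_end[];

        int in_sched_functions(unsigned long addr)
        {
                return addr >= (unsigned long)__sched_text_start &&
                       addr <  (unsigned long)__sched_text_end;
        }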
|
|
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Noticed that migration_thread can examine "kthread_should_stop()?" without
setting its state to TASK_INTERRUPTIBLE first. This can cause kthread_stop
on that thread to block forever ...
P.S. - I assumed that having the task state set to TASK_INTERRUPTIBLE
while it is doing active_load_balance is fine. It seemed to be
the case earlier also.
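The safe ordering is the standard kthread pattern (a sketch, not the patch
itself): set the task state before testing kthread_should_stop(), so a
concurrent kthread_stop() wakeup cannot be lost.

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (kthread_should_stop())
                        break;
                schedule();
                /* ... process pending migration requests ... */
        }
        __set_current_state(TASK_RUNNING);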
|
|
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Fix the race in sys_sched_getaffinity. Patch below takes cpu_hotplug lock
before reading cpus_allowed mask of a task.
|
|
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
migrate_all_tasks is currently run with rest of the machine stopped.
It iterates thr' the complete task table, turning off cpu affinity of any task
that it finds affine to the dying cpu. Depending on the task table
size this can take considerable time. All this time machine is stopped, doing
nothing.
Stopping the machine for such extended periods can be avoided if we do
task migration in CPU_DEAD notification and that's precisely what this patch
does.
The patch puts the idle task at the _front_ of the dying CPU's runqueue at
the highest priority possible. This causes the idle thread to run
_immediately_ after the kstopmachine thread yields. The idle thread notices
that its cpu is offline and dies quickly. Task migration can then be done
at leisure in the CPU_DEAD notification, when the rest of the CPUs are
running.
Some advantages with this approach are:
- More scalable. Predictable amount of time that the machine is stopped.
- No changes to hot path/core code. We are just exploiting the scheduler
rule that runs the highest-priority task on the runqueue next. Also,
since I put the idle task at the _front_ of the runqueue, there
are no races when an equally high priority task is woken up
and added to the runqueue. It gets in at the back of the runqueue,
_after_ the idle task!
- the cpu_is_offline check that is presently required in try_to_wake_up,
idle_balance and rebalance_tick can be removed, thus speeding them
up a bit.
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Rusty mentioned that the unlikely hints against cpu_is_offline are
redundant since the macro already has that hint. The patch below removes
those redundant hints I added.
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
The SMT wake_idle code really wants to look at a non-local CPU's domain in
order to check for idle siblings.
So change the domain attachment code a little bit so we continue to hold a
runqueue's lock while attaching a new domain. This means the locking rules
have changed to: you may access your own domain without any lock, you must
hold a remote runqueue's lock in order to view its domain.
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
This actually does produce better code, especially under the locked
section.
Turns a conditional + unconditional jump under the lock in the unlikely
case into a cmov outside the lock.
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
It makes NEWLY_IDLE balances cause find_busiest_group to return the busiest
available group even if there isn't an imbalance. Basically - try a bit
harder to prevent schedule emptying the runqueue.
It is quite aggressive, but that isn't so bad because we don't (by default)
do NEWLY_IDLE balancing across NUMA nodes, and NEWLY_IDLE balancing is always
restricted to cache_hot tasks.
It picked up a little bit of idle time that dbt2-pgsql was seeing...
|
|
From: Ingo Molnar <mingo@elte.hu>
Implement balancing during clone(). It does the following things:
- introduces SD_BALANCE_CLONE that can serve as a tool for an
architecture to limit the search-idlest-CPU scope on clone().
E.g. the 512-CPU systems should rather not enable this.
- uses the highest sd for the imbalance_pct, not this_rq (which didn't
make sense).
- unifies balance-on-exec and balance-on-clone via the find_idlest_cpu()
function. Gets rid of sched_best_cpu() which was still a bit
inconsistent IMO, it used 'min_load < load' as a condition for
balancing - while a more correct approach would be to use half of the
imbalance_pct, like passive balancing does.
- the patch also reintroduces the possibility to do SD_BALANCE_EXEC on
SMP systems, and activates it - to get testing.
- NOTE: there's one thing in this patch that is slightly unclean: i
introduced wake_up_forked_thread. I did this to make it easier to get
rid of this patch later (wake_up_forked_process() has lots of
dependencies in various architectures). If this capability remains in
the kernel then i'll clean it up and introduce one function for
wake_up_forked_process/thread.
- NOTE2: i added the SD_BALANCE_CLONE flag to the NUMA CPU template too.
Some NUMA architectures probably want to disable this.
|
|
From: Ingo Molnar <mingo@elte.hu>
This does the source/target cleanup. This is a no-functionality patch which
also adds more comments to explain these functions.
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
This patch starts to balance woken processes when half the relevant domain's
imbalance_pct is reached. Previously balancing would start after a small,
constant difference in waker/wakee runqueue loads was reached, which would
cause too much process movement when there are lots of processes running.
It also turns wake balancing into a domain flag while previously it was always
on. Now sched domains can "soft partition" an SMP system without using
processor affinities.
|
|
From: Ingo Molnar <mingo@elte.hu>
This re-adds cleanups which were lost in splitups of an earlier patch.
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
The attached patch is required to work correctly with the CPU hotplug
framework. John Hawkes reports successful booting with this.
|
|
From: Ingo Molnar <mingo@elte.hu>
The attached patch extends sync wakeups to the process sys_exit() path too:
the chldwait wakeup can be done sync, since we know that the process is
going to exit (and thus deschedule).
The most visible effect of this change is strace's behavior on SMP systems:
it now stays on a single CPU, together with the traced child. (previously
it would run in parallel to the child, bouncing around madly.)
|
|
From: Ingo Molnar <mingo@elte.hu>
Helper function for later patches
|
|
From: Ingo Molnar <mingo@elte.hu>
Uninline things
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
Minor cleanups from Ingo's patch including task_hot (do it right in
try_to_wake_up too).
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
De-racify the sched domain setup code. This involves creating a dummy
"init" domain during sched_init (which is called early).
When topology information becomes available, the sched domains are then
built and attached. The attach mechanism is asynchronous and uses the
migration threads, which perform the switch with interrupts off. This is a
quiescent state, so domains can still be lockless on the read side. It
also allows us to change the domains at runtime without much more work.
This is something SGI is interested in to elegantly do soft partitioning of
their systems without having to use hard cpu affinities (which cause
balancing problems of their own).
The current setup code also has a race somewhere because it is unable to
boot on a 384 CPU system.
From: Anton Blanchard <anton@samba.org>
This is basically a mindless ppc64 merge of the x86 changes to sched
domain init code.
Actually if I produce a sibling_map[] then the x86 code and the ppc64
will be identical. Maybe we can merge it.
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
After the for_each_domain change, the warn here won't trigger, instead it
will oops in the if statement. Also, make sure we don't pass an empty
cpumask to for_each_cpu.
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
Imbalance calculations were not right. This would cause unneeded migration.
|
|
From: Nick Piggin <nickpiggin@yahoo.com.au>
Make affine wakes and "passive load balancing" more conservative. Aggressive
affine wakeups were causing huge regressions in dbt3-pgsql on 8-way non NUMA
systems at OSDL's STP.
|
|
From: Rusty Russell <rusty@rustcorp.com.au>
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
From: Andrew Morton <akpm@osdl.org>
From: Rusty Russell <rusty@rustcorp.com.au>
We want to get rid of lock_cpu_hotplug() in sched_migrate_task. Found
that lockless migration of an execing task is _extremely_ racy. The
races I hit are described below, along with probable solutions.
Task migration done elsewhere should be safe (?) since they either
hold the lock (sys_sched_setaffinity) or are done entirely with preemption
disabled (load_balance).
sched_balance_exec does:
a. disables preemption
b. finds new_cpu for current
c. enables preemption
d. calls sched_migrate_task to migrate current to new_cpu
and sched_migrate_task does:
e. task_rq_lock(p)
f. migrate_task(p, dest_cpu ..)
(if we have to wait for migration thread)
g. task_rq_unlock()
h. wake_up_process(rq->migration_thread)
i. wait_for_completion()
Several things can happen here:
1. new_cpu can go down after h and before migration thread has
got around to handle the request
==> we need to add a cpu_is_offline check in __migrate_task
2. new_cpu can go down between c and d or before f.
===> Even though this case is automatically handled by the above
change (migrate_task being called on a running task, current,
will delegate migration to migration thread), would it be
good practice to avoid calling migrate_task in the first place
itself when dest_cpu is offline. This means adding another
cpu_is_offline check after e in sched_migrate_task
3. The 'current' task can get preempted _immediately_ after
g and when it comes back, task_cpu(p) can be dead. In
which case, it is invalid to do wake_up on a non-existent migration
thread. (rq->migration_thread can be NULL).
===> We should disable preemption thr' g and h
4. Before migration thread gets around to handle the request, its cpu
goes dead. This will leave unhandled migration requests in the dead
cpu.
===> We need to wakeup sleeping requestors (if any) in CPU_DEAD
notification.
I really wonder if we can get rid of these issues by avoiding balancing at
exec time and instead have it balanced during load_balance ... Alternately,
if this is valuable and we want to retain it, I think we still need to
consider a read/write sem, with sched_migrate_task doing down_read_trylock.
This may eliminate the deadlock I hit between cpu_up and CPU_UP_PREPARE
notification, which had forced me away from r/w sem.
Anyway, the patch below addresses the above races. It's against
2.6.6-rc2-mm1 and has been tested on a 4-way Intel Pentium SMP machine.
Rusty sez:
Two other changes:
1) I grabbed a reference to the thread, rather than using
preempt_disable(). It's the more obvious way I think.
2) Why the wait_to_die code? It might be needed if we move tasks after
stop_machine, but for now I don't see the problem with the migration
thread running on the wrong CPU for a bit: nothing is on this runqueue
so active_load_balance is safe, and __migrate_task will be a noop (due
to the cpu_is_offline() check). If there is a problem, your fix is racy,
because we could be preempted immediately afterwards.
So I just stop the kthread then wakeup any remaining...
|
|
From: Ingo Molnar <mingo@elte.hu>
The trivial fixes.
- added recent trivial bits from Nick's and my patches.
- hotplug CPU fix
- early init cleanup
|
|
From: Martin Hicks <mort@wildopensource.com>
Another optimization patch from Jack Steiner, intended to reduce TLB
flushes during process migration.
Most architectures should define tlb_migrate_prepare() to be flush_tlb_mm(),
but on i386 it would be a wasted flush, because i386 disconnects previous
cpus from the tlb flush automatically.
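In outline, the per-arch definitions would look like this (a sketch;
header placement is an assumption):

        /* most architectures: */
        #define tlb_migrate_prepare(mm)         flush_tlb_mm(mm)

        /* i386: the old cpu drops out of the mm's flush set by itself */
        #define tlb_migrate_prepare(mm)         do { } while (0)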
|
|
From: Nick Piggin <piggin@cyberone.com.au>
This patch removes the per runqueue array of NR_CPU arrays. Each time we
want to check a remote CPU's load we check nr_running as well anyway, so
introduce a cpu_load which is the load of the local runqueue and is kept
updated in the timer tick. Put them in the same cacheline.
This has additional benefits of having the cpu_load consistent across all
CPUs and more up to date. It is sampled better too, being updated once per
timer tick.
This shouldn't make much difference in scheduling behaviour, but all
benchmarks are either as good or better on the 16-way NUMAQ: hackbench,
reaim, volanomark are about the same, tbench and dbench are maybe a bit
better. kernbench is about one percent better.
John reckons it isn't a big deal, but it does save 4K per CPU or 2MB total
on his big systems, so I figure it must be a bit kinder on the caches. I
think it is just nicer in general anyway.
|
|
From: Con Kolivas <kernel@kolivas.org>
This patch provides full per-package priority support for SMT processors
(aka pentium4 hyperthreading) when combined with CONFIG_SCHED_SMT.
It maintains cpu percentage distribution within each physical cpu package
by limiting the time a lower priority task can run on a sibling cpu
concurrently with a higher priority task.
It introduces a new flag into the scheduler domain
unsigned int per_cpu_gain; /* CPU % gained by adding domain cpus */
This is empirically set to 15% for pentium4 at the moment and can be
modified to support different values dynamically as newer processors come
out with improved SMT performance. It should not matter how many siblings
there are.
How it works: it compares tasks running on sibling cpus, and when a lower
static priority task is running it will delay it until
        high_priority_timeslice * (100 - per_cpu_gain) / 100 <= low_prio_timeslice
e.g. a nice 19 task's timeslice is 10ms and a nice 0 timeslice is 102ms.
On vanilla, the nice 0 task runs on one logical cpu while the nice 19 task
runs unabated on the other logical cpu. With smtnice the nice 0 runs on one
logical cpu for 102ms, and the nice 19 sleeps till the nice 0 task has 12ms
remaining, and then will schedule.
Real time tasks and kernel threads are not altered by this code, and kernel
threads do not delay lower priority user tasks.
with lots of thanks to Zwane Mwaikambo and Nick Piggin for help with the
coding of this version.
If this is merged, it is probably best to delay pushing this upstream in
mainline till sched_domains gets tested for at least one major release.
|
|
From: Nick Piggin <piggin@cyberone.com.au>
This changes sched domains to contain all possible CPUs, and check for
online as needed. It's in order to play nicely with CPU hotplug.
|
|
From: Nick Piggin <piggin@cyberone.com.au>
The following patch implements a cpu_power member to struct sched_group.
This allows special casing to be removed for SMT groups in the balancing
code. It does not take CPU hotplug into account yet, but that shouldn't be
too hard.
I have tested it on the NUMAQ by pretending it has SMT. Works as expected.
Active balances across nodes.
|
|
From: Rusty Russell <rusty@rustcorp.com.au>,
Nick Piggin <piggin@cyberone.com.au>
The current sched_balance_exec() sets the task's cpus_allowed mask
temporarily to move it to a different CPU. This has several issues,
including the fact that a task will see its affinity at a bogus value.
So we change the migration_req_t to explicitly specify a destination CPU,
rather than the migration thread deriving it from cpus_allowed. If the
requested CPU is no longer valid (racing with another set_cpus_allowed,
say), it can be ignored: if the task is not allowed on this CPU, there will
be another migration request pending.
This change allows sched_balance_exec() to tell the migration thread what
to do without changing the cpus_allowed mask.
So we rename __set_cpus_allowed() to move_task(), as the cpus_allowed mask
is now set by the caller. And move_task_away(), which the migration thread
uses to actually perform the move, is renamed __move_task().
I also ignore offline CPUs in sched_best_cpu(), so sched_migrate_task()
doesn't need to check for offline CPUs.
Ulterior motive: this approach also plays well with CPU Hotplug.
Previously that patch might have seen a task with cpus_allowed only
containing the dying CPU (temporarily due to sched_balance_exec) and
forcibly reset it to all cpus, which might be wrong. The other approach is
to hold the cpucontrol sem around sched_balance_exec(), which is too much
of a bottleneck.
|
|
From: Nick Piggin <piggin@cyberone.com.au>
John Hawkes described this problem to me:
There *is* a small problem in this area, though, that SuSE avoids.
"jiffies" gets updated by cpu0. The other CPUs may, over time, get out of
sync (and they're initialized on ia64 to start out being out of sync), so
there's no guarantee that every CPU will wake up from its timer interrupt
and see a "jiffies" value that is guaranteed to be last_jiffies+1. Sometimes
the jiffies value may be unchanged since the last wakeup. Sometimes the
jiffies value may have incremented by 2 (or more, especially if cpu0's
interrupts are disabled for long stretches of time). So an algorithm that
says, "I'll call load_balance() only when jiffies is *exactly* N" is going
to fail on occasion, either by calling load_balance() too often or not
often enough.
I fixed this by adding a last_balance field to struct sched_domain, and
working off that.
|
|
From: Nick Piggin <piggin@cyberone.com.au>
The following patch builds a scheduling description for the i386
architecture using cpu_sibling_map to set up SMT if CONFIG_SCHED_SMT is
set.
It could be made more fancy and collapse degenerate domains at runtime (ie.
1 sibling per CPU, or 1 NUMA node in the computer).
From: Zwane Mwaikambo <zwane@arm.linux.org.uk>
This fixes an oops due to cpu_sibling_map being uninitialised when a
system with no MP table (most UP boxen) boots a CONFIG_SMT kernel. What
also happens is that the cpu_group lists end up not being terminated
properly, but this oops kills it first. Patch tested on UP w/o MP table,
2x P2 and UP Xeon w/ no siblings.
From: "Martin J. Bligh" <mbligh@aracnet.com>,
Nick Piggin <piggin@cyberone.com.au>
Change arch_init_sched_domains to use cpu_online_map
From: Anton Blanchard <anton@samba.org>
Fix build with NR_CPUS > BITS_PER_LONG
|
|
From: Nick Piggin <piggin@cyberone.com.au>
This patch gets the sched_domain scheduler working better WRT balancing.
It's been tested on the NUMAQ. Among other things it changes the way SMT
load calculation works so as not to trigger active load balancing when it
shouldn't.
It still has a problem with SMT and NUMA: it will put a task on each
sibling in a node before moving tasks to another node. It should probably
start moving tasks after each *physical* CPU is filled.
To fix, you need "how much CPU power in this domain?" At the moment we
approximate # runqueues == CPU power, and hack around it at the CPU
physical domain by counting all sibling runqueues as 1.
It isn't hard to work the CPU power out correctly, but once CPU hotplug is
in the equation it becomes much more complicated, since hotplug events must
keep it up to date. If anyone is actually interested in getting this fixed,
that is.
|
|
From: Nick Piggin <piggin@cyberone.com.au>
Anton was attempting to make a sched domain topology for his POWER5 and was
having some trouble.
This patch only includes code which is ifdefed out, but hopefully it will
be of some use to implementors.
|