|
Remove now-unneeded open-coded unlikelies around IS_ERR().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Use abstracted RCU API to dereference RCU protected data. Hides barrier
details. Patch from Paul McKenney.
This patch introduced an rcu_dereference() macro that replaces most uses of
smp_read_barrier_depends(). The new macro has the advantage of explicitly
documenting which pointers are protected by RCU -- in contrast, it is
sometimes difficult to figure out which pointer is being protected by a given
smp_read_barrier_depends() call.
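As a rough user-space illustration of the idea (not the kernel implementation), the sketch below wraps the ordered pointer load in one helper so that the protected pointer is named explicitly. It uses C11 atomics with acquire ordering, which is stronger than the dependency ordering the kernel macro relies on; `struct config`, `current_config`, `publish_config()` and `deref_config()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical RCU-protected object; not a kernel structure. */
struct config {
	int value;
};

/* The shared pointer that readers follow without taking a lock. */
static _Atomic(struct config *) current_config;

/* Writer side: initialise the object fully, then publish the pointer. */
static void publish_config(struct config *c, int value)
{
	c->value = value;
	atomic_store_explicit(&current_config, c, memory_order_release);
}

/*
 * Reader side: user-space stand-in for rcu_dereference(). The helper hides
 * the ordering detail and documents which pointer is being protected.
 * Assumption: acquire ordering is used here instead of the kernel's barrier.
 */
static struct config *deref_config(void)
{
	return atomic_load_explicit(&current_config, memory_order_acquire);
}

/* Returns the published value, or -1 if nothing has been published yet. */
static int read_value(void)
{
	struct config *c = deref_config();

	return c ? c->value : -1;
}
```

A reader calls `read_value()`; before any `publish_config()` call it sees no object, afterwards it sees a fully initialised one.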
Signed-off-by: Paul McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Attached is a cleanup of the main loops in sys_msgrcv and sys_msgsnd, based on
ipc_lock_by_ptr(). Most backward gotos are gone; they are replaced by normal
"for(;;)" loops that run until a suitable message is found.
Description:
- General cleanup of sys_msgrcv and sys_msgsnd: the functions were too
convoluted.
- Enable lockless receive, update comments.
- Use ipc_getref for sys_msgsnd(), it's better than rechecking that the
msqid is still valid.
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Independent from the other patches:
undo operations should not result in out of range semaphore values. The test
for newval > SEMVMX is missing. The attached patch adds the test and a
comment.
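A minimal sketch of such a range check follows. Clamping the result into range is an assumed policy chosen for illustration; the commit text only says that the upper-bound test was missing. `apply_undo()` is a hypothetical helper, and 32767 is the Linux value of SEMVMX.

```c
/* Largest value a System V semaphore may hold on Linux. */
#define SEMVMX 32767

/*
 * Apply an undo adjustment to a semaphore value and keep the result inside
 * [0, SEMVMX]. Assumption: out-of-range results are clamped, not rejected.
 */
static int apply_undo(int semval, int adjust)
{
	long newval = (long)semval + adjust;

	if (newval < 0)
		newval = 0;
	if (newval > SEMVMX)	/* the test the message says was missing */
		newval = SEMVMX;
	return (int)newval;
}
```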
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The attached patch removes sem_revalidate and replaces it with
ipc_rcu_getref() calls followed by ipc_lock_by_ptr().
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The lifetime of the ipc objects (sem array, msg queue, shm mapping) is
controlled by kern_ipc_perms->lock - a spinlock. There is no simple way to
reacquire this spinlock after it was dropped to
schedule()/kmalloc/copy_{to,from}_user/whatever.
The attached patch adds a reference count as a preparation to get rid of
sem_revalidate().
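The reference-count idea can be sketched in user space as follows. The structure and function names are hypothetical, and the plain integer counter assumes that callers already serialise access (in the kernel this would be done by the object's lock).

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an IPC object; field names are illustrative. */
struct ipc_obj {
	int refcount;	/* assumed to be protected by the object's lock */
	bool deleted;	/* set once the id has been removed */
};

/* Allocate an object holding one reference for the creator. */
static struct ipc_obj *obj_alloc(void)
{
	struct ipc_obj *o = calloc(1, sizeof(*o));

	if (o)
		o->refcount = 1;
	return o;
}

/* Take a reference before dropping the lock to sleep or copy user data. */
static void obj_getref(struct ipc_obj *o)
{
	o->refcount++;
}

/* Drop a reference; returns true if this was the last one and o was freed. */
static bool obj_putref(struct ipc_obj *o)
{
	if (--o->refcount > 0)
		return false;
	free(o);
	return true;
}
```

With a reference held, the memory stays valid across a sleep, so the caller can retake the lock and simply test `deleted` instead of looking the id up again.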
Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
ipc compat code switched to compat_alloc_user_space() and annotated.
|
|
|
|
From: Dipankar Sarma <dipankar@in.ibm.com>
This patch changes the call_rcu() API and avoids passing an argument to the
callback function as suggested by Rusty. Instead, it is assumed that the
user has embedded the rcu head into a structure that is useful in the
callback and the rcu_head pointer is passed to the callback. The callback
can use container_of() to get the pointer to its structure and work with
it. Together with the rcu-singly-link patch, it reduces the rcu_head size
by 50%. Considering that we use these in things like struct dentry and
struct dst_entry, this is good savings in space.
An example :
        struct my_struct {
                struct rcu_head rcu;
                int x;
                int y;
        };

        void my_rcu_callback(struct rcu_head *head)
        {
                struct my_struct *p = container_of(head, struct my_struct, rcu);
                kfree(p);
        }

        void my_delete(struct my_struct *p)
        {
                ...
                call_rcu(&p->rcu, my_rcu_callback);
                ...
        }
Signed-Off-By: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Lower default sizes for POSIX mqueue allocation now that rlimits are in place.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a user_struct to the mq_inode_info structure. Charge the maximum number
of bytes that could be allocated to a mqueue to the user who creates the
mqueue. This is checked against the per user rlimit.
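A hedged sketch of this accounting is shown below. The byte formula (one pointer per message slot plus the message payloads), the `-EMFILE` error code and all names are assumptions made for illustration, not taken from the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-user accounting record. */
struct user_acct {
	unsigned long mq_bytes;	/* bytes currently charged to this user */
};

/* Assumed worst case: one pointer per slot plus every message at full size. */
static unsigned long mq_max_bytes(unsigned long maxmsg, unsigned long msgsize)
{
	return maxmsg * sizeof(void *) + maxmsg * msgsize;
}

/* Charge bytes to the user unless that wraps or exceeds the limit. */
static int mq_charge(struct user_acct *u, unsigned long bytes,
		     unsigned long limit)
{
	unsigned long total = u->mq_bytes + bytes;

	if (total < u->mq_bytes || total > limit)
		return -EMFILE;	/* assumed error code */
	u->mq_bytes = total;
	return 0;
}
```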
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add helper function mq_attr_ok() to do mq_attr sanity checking, and do some
extra overflow checking.
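One way such a sanity check could look is sketched below. The structure, the limit parameters and the exact set of tests are assumptions; only the idea of validating the attributes and guarding the size multiplication against overflow comes from the message.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical subset of the queue attributes that need checking. */
struct mq_attr_sketch {
	long mq_maxmsg;		/* maximum number of messages */
	long mq_msgsize;	/* maximum size of one message */
};

/*
 * Returns true if the attributes are positive, within the given limits,
 * and maxmsg * msgsize cannot overflow a long.
 */
static bool mq_attr_ok_sketch(const struct mq_attr_sketch *a,
			      long max_maxmsg, long max_msgsize)
{
	if (a->mq_maxmsg <= 0 || a->mq_msgsize <= 0)
		return false;
	if (a->mq_maxmsg > max_maxmsg || a->mq_msgsize > max_msgsize)
		return false;
	/* Division form of the overflow test; never multiplies. */
	if (a->mq_msgsize > LONG_MAX / a->mq_maxmsg)
		return false;
	return true;
}
```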
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
From: Andi Kleen <ak@suse.de>
Add NUMA API support to tmpfs and hugetlbfs. Shared memory is a
bit of a special case for NUMA policy. Normally policy is associated to VMAs
or to processes, but for a shared memory segment you really want to share the
policy. The core NUMA API has code for that, this patch adds the necessary
changes to tmpfs and hugetlbfs.
First it changes the custom swapping code in tmpfs to follow the policy set
via VMAs.
It is also useful to have a "backing store" of policy that saves the policy
even when nobody has the shared memory segment mapped. This allows command
line tools to pre configure policy, which is then later used by programs.
Note that hugetlbfs needs more changes - it is also required to switch it to
lazy allocation, otherwise the prefault prevents mbind() from working.
|
|
From: David Mosberger <davidm@napali.hpl.hp.com>
Below is a patch that tries to sanitize the dropping of unneeded system-call
stubs in generic code. In some instances, it would be possible to move the
optional system-call stubs into a library routine which would avoid the need
for #ifdefs, but in many cases, doing so would require making several
functions global (and possibly exporting additional data-structures in
header-files). Furthermore, it would inhibit (automatic) inlining in the
cases where the stubs are needed. For these reasons, the patch
keeps the #ifdef-approach.
This has been tested on ia64 and there were no objections from the
arch-maintainers (and one positive response). The patch should be safe but
arch-maintainers may want to take a second look to see if some __ARCH_WANT_foo
macros should be removed for their architecture (I'm quite sure that's the
case, but I wanted to play it safe and only preserved the status-quo in that
regard).
|
|
From: Chris Wright <chrisw@osdl.org>
Currently, if a user creates an mqueue and passes an mq_attr, the
info->messages will be created twice (and the extra one is properly freed).
This patch simply delays the allocation so that it only ever happens once.
The relevant mq_attr data is passed to lower levels via the dentry->d_fsdata
fs private data. This also helps isolate the areas we'd need to touch to do
rlimits on mqueues.
|
|
During mqueue_get_inode(), it's possible that kmalloc() of the
info->messages array will fail. This failure mode will cause the
queues_count to be (incorrectly) decremented twice. This patch uses
info->messages on mqueue_delete_inode() to determine whether the
mqueue was ever truly created, and hence proper accounting is needed
on destruction.
|
|
Move error handling to capture all three possible error conditions on
sending to a full queue. Without this fix any unprivileged user can
leak arbitrary amounts of kernel memory.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Any user can delete any entries in a mqueue mounted filesystem. The attached
patch prevents that.
- remove the writable test from mq_unlink.
- set the sticky bit in the root inode. This affects both mq_unlink and
sys_unlink: only the owner (and root) should be allowed to remove queues.
|
|
From: Chris Wright <chrisw@osdl.org>
SUSv3 doesn't seem to specify one way or the other. I don't have the POSIX
specs, and the old docs I have suggest that mq_open() creates an object
which is to be closed upon exec.
Jakub said:
I think it is valid and required:
http://www.opengroup.org/onlinepubs/007904975/functions/exec.html
All open message queue descriptors in the calling process shall be
closed, as described in mq_close()
I'll add a new test for this into glibc testsuite.
|
|
From: Jakub Jelinek <jakub@redhat.com>
mq_notify (q, NULL)
and
struct sigevent ev = { .sigev_notify = SIGEV_NONE };
mq_notify (q, &ev)
are not the same thing in POSIX, yet the kernel treats them the same. Only
the former makes the notification available to other processes immediately,
see
http://www.opengroup.org/onlinepubs/007904975/functions/mq_notify.html
Without the patch below,
http://sources.redhat.com/ml/libc-hacker/2004-04/msg00028.html
glibc test fails.
I looked at mq in Solaris and they behave the same in this regard as Linux
with this patch. Kernel with this patch passes both Intel POSIX testsuite
(with testsuite fixes from Ulrich) and glibc mq testsuite.
|
|
Intro to these patches:
- Major surgery against the pagecache, radix-tree and writeback code. This
work is to address the O_DIRECT-vs-buffered data exposure horrors which
we've been struggling with for months.
As a side-effect, 32 bytes are saved from struct inode and eight bytes
are removed from struct page. At a cost of approximately 2.5 bits per page
in the radix tree nodes on 4k pagesize, assuming the pagecache is densely
populated. Not all pages are pagecache; other pages gain the full 8 byte
saving.
This change will break any arch code which is using page->list and will
also break any arch code which is using page->lru of memory which was
obtained from slab.
The basic problem which we (mainly Daniel McNeil) have been struggling
with is in getting a really reliable fsync() across the page lists while
other processes are performing writeback against the same file. It's like
juggling four bars of wet soap with your eyes shut while someone is
whacking you with a baseball bat. Daniel pretty much has the problem
plugged but I suspect that's just because we don't have testcases to
trigger the remaining problems. The complexity and additional locking
which those patches add is worrisome.
So the approach taken here is to remove the page lists altogether and
replace the list-based writeback and wait operations with in-order
radix-tree walks.
The radix-tree code has been enhanced to support "tagging" of pages, for
later searches for pages which have a particular tag set. This means that
we can ask the radix tree code "find me the next 16 dirty pages starting at
pagecache index N" and it will do that in O(log64(N)) time.
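The tagged lookup can be modelled in miniature as below. This is a simplified, fixed two-level tree (64 leaves of 64 slots, so indices 0..4095) rather than the kernel radix tree, and every name in it is hypothetical. The point it illustrates is that a tag bit propagated to the parent lets the search skip whole clean subtrees and return dirty indices in ascending order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 64	/* fan-out of each node in this sketch */

struct tag_leaf {
	uint64_t dirty;			/* one "dirty" tag bit per page slot */
};

struct tag_root {
	uint64_t dirty_children;	/* bit set if that leaf has a dirty slot */
	struct tag_leaf leaves[SLOTS];
};

/* Tag page idx dirty and propagate the tag to the parent node. */
static void tag_dirty(struct tag_root *r, unsigned idx)
{
	assert(idx < SLOTS * SLOTS);
	r->leaves[idx / SLOTS].dirty |= 1ull << (idx % SLOTS);
	r->dirty_children |= 1ull << (idx / SLOTS);
}

/* Clear the tag on idx; clear the parent bit once the leaf is clean. */
static void clear_dirty(struct tag_root *r, unsigned idx)
{
	struct tag_leaf *l;

	assert(idx < SLOTS * SLOTS);
	l = &r->leaves[idx / SLOTS];
	l->dirty &= ~(1ull << (idx % SLOTS));
	if (!l->dirty)
		r->dirty_children &= ~(1ull << (idx / SLOTS));
}

/* Store up to max dirty indices >= start in out[], in ascending order. */
static size_t lookup_dirty(const struct tag_root *r, unsigned start,
			   unsigned *out, size_t max)
{
	size_t n = 0;

	for (unsigned li = start / SLOTS; li < SLOTS && n < max; li++) {
		unsigned first = (li == start / SLOTS) ? start % SLOTS : 0;

		if (!(r->dirty_children & (1ull << li)))
			continue;	/* whole subtree clean: skip it */
		for (unsigned s = first; s < SLOTS && n < max; s++)
			if (r->leaves[li].dirty & (1ull << s))
				out[n++] = li * SLOTS + s;
	}
	return n;
}
```

Because results come back in index order, a writeback loop built on such a lookup naturally submits pages in file-offset order.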
This affects I/O scheduling potentially quite significantly. It is no
longer the case that the kernel will submit pages for I/O in the order in
which the application dirtied them. We instead submit them in file-offset
order all the time.
This is likely to be advantageous when applications are seeking all over
a large file randomly writing small amounts of data. I haven't performed
much benchmarking, but tiobench random write throughput seems to be
increased by 30%. Other tests appear to be unaltered. dbench may have got
10-20% quicker, but it's variable.
There is one large file which everyone seeks all over randomly writing
small amounts of data: the blockdev mapping which caches filesystem
metadata. The kernel's IO submission patterns for this are now ideal.
Because writeback and wait-for-writeback use a tree walk instead of a
list walk they are no longer livelockable. This probably means that we no
longer need to hold i_sem across O_SYNC writes and perhaps fsync() and
fdatasync(). This may be beneficial for databases: multiple processes
writing and syncing different parts of the same file at the same time can
now all submit and wait upon writes to just their own little bit of the
file, so we can get a lot more data into the queues.
It is trivial to implement a part-file-fdatasync() as well, so
applications can say "sync the file from byte N to byte M", and multiple
applications can do this concurrently. This is easy for ext2 filesystems,
but probably needs lots of work for data-journalled filesystems and XFS and
it probably doesn't offer much benefit over an i_semless O_SYNC write.
These patches can end up making ext3 (even) slower:
        for i in 1 2 3 4
        do
                dd if=/dev/zero of=$i bs=1M count=2000 &
        done
runs awfully slow on SMP. This is, yet again, because all the file
blocks are jumbled up and the per-file linear writeout causes tons of
seeking. The above test runs sweetly on UP because on UP we don't
allocate blocks to different files in parallel.
Mingming and Badari are working on getting block reservation working for
ext3 (preallocation on steroids). That should fix ext3 up.
This patch:
- Later, we'll need to access the radix trees from inside disk I/O
completion handlers. So make mapping->page_lock irq-safe. And rename it
to tree_lock to reliably break any missed conversions.
|
|
From: Arnd Bergmann <arnd@arndb.de>
I have tested the code with the open posix test suite and found the same
four failures for both 64-bit and compat mode, most tests pass. The patch
is against -mc1, but I guess it also applies to the other trees around.
What worries me more than mq_attr compatibility is the conversion of struct
sigevent, which might turn out really hard when more fields in there are
used. AFAICS, the only other part in the kernel ABI is sys_timer_create(),
so maybe it's not too late to deprecate the current structure and create a
structure that can be used properly for compat syscalls.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
SIGEV_THREAD means that a given callback should be called in the context of a
new thread. This must be done by the C library. The kernel must deliver a
notice of the event to the C library when the callback should be called.
This patch switches to a new, simpler interface: User space creates a socket
with socket(PF_NETLINK, SOCK_RAW,0) and passes the fd to the mq_notify call
together with a cookie. When the mq_notify() condition is satisfied, the
kernel "writes" the cookie to the socket. User space then reads the cookie
and calls the appropriate callback.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
I found a security bug in the new mqueue code: a process that has only
write permissions to a message queue could call mq_notify(SIGEV_THREAD) and
use the returned notification file descriptor to read from the message
queue.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
My discussion with Ulrich had one result:
- mq_setattr can accept implementation defined flags. Right now we have
none, but we might add some later (e.g. switch to CLOCK_MONOTONIC for
mq_timed{send,receive} or something similar). When we add flags, we
might need the fields for additional information. And they don't hurt.
Therefore add four __reserved fields to mq_attr.
- fail mq_setattr if we get unknown flags - otherwise glibc can't detect
if it's running on a future kernel that supports new features.
- use memset to initialize the mq_attr structure - theoretically we could
leak kernel memory.
- Only set O_NONBLOCK in mq_attr, explicitly clear O_RDWR & friends.
openposix uses getattr, attr |=O_NONBLOCK, setattr - a sane approach.
Without clearing O_RDWR, this fails.
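The flag handling described in the list above can be sketched as two small helpers. Both function names are hypothetical and the `-EINVAL` return for unknown flags is an assumption; the behaviour modelled is "report only O_NONBLOCK, reject anything else on set".

```c
#include <errno.h>
#include <fcntl.h>

/* Report only O_NONBLOCK to user space; O_RDWR and friends are masked out. */
static long mq_report_flags(int file_flags)
{
	return file_flags & O_NONBLOCK;
}

/*
 * Apply requested flags to the open file's flags. Unknown bits are rejected
 * so that a library can detect which features the running kernel supports.
 */
static int mq_apply_flags(long requested, int *file_flags)
{
	if (requested & ~(long)O_NONBLOCK)
		return -EINVAL;	/* assumed error code */
	if (requested & O_NONBLOCK)
		*file_flags |= O_NONBLOCK;
	else
		*file_flags &= ~O_NONBLOCK;
	return 0;
}
```

The getattr, `attr |= O_NONBLOCK`, setattr sequence then works, because the value read back never contains access-mode bits that the set path would refuse.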
I've retested all openposix conformance tests with the new patch - the two
new FAILED tests check undefined behavior. Note that I won't have net
access until Sunday - if the message queue patch breaks something important
either ask Krzysztof or drop it.
Ulrich had another good idea for SIGEV_THREAD, but I must think about it.
It would mean less complexity in glibc, but more code in the kernel. I'm
not yet convinced that it's overall better.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Make the posix message queue mountable by the user. This replaces ipcs and
ipcrm for posix message queue: The admin can check which queues exist with ls
and remove stale queues with rm.
I'd like a final confirmation from Ulrich that our SIGEV_THREAD approach is
the right thing(tm): He's aware of the design and didn't object, but I think
he hasn't seen the final API yet.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Linux specific extension: make the message queue identifiers pollable. It's
simple and could be useful.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Actual implementation of the posix message queues, written by Krzysztof
Benedyczak and Michal Wronski. The complete implementation is dependent on
CONFIG_POSIX_MQUEUE.
It passed the openposix test suite with two exceptions: one mq_unlink test
was bad and tested undefined behavior. And Linux succeeds
mq_close(open(,,,)). The spec mandates EBADF, but we have decided to ignore
that: we would have to add a new syscall just for the right error code.
The patch intentionally doesn't use all helpers from fs/libfs for kernel-only
filesystems: step 5 allows user space mounts of the file system.
Signal changes:
The patch redefines SI_MESGQ using __SI_CODE: The generic Linux ABI uses
a negative value (i.e. from user) for SI_MESGQ, but the kernel internal
value must be positive to pass check_kill_value. Additionally, the patch
adds support into copy_siginfo_to_user to copy the "new" signal type to
user space.
Changes in signal code caused by POSIX message queues patch:
General & rationale:
mqueue-generated signals (sent only upon notification) must have si_code
== SI_MESGQ. Such a signal is in fact sent from the process that
caused the notification (i.e. sent a message to an empty message queue) to
another process that requested it. The two processes can of course be
unrelated in terms of uids/euids. So SI_MESGQ signals must be classified as
SI_FROMKERNEL to pass check_kill_permissions (needless to say, these
signals ARE from the kernel).
Signals generated by message queues notification need the same
fields in siginfo struct's union _sifields as POSIX.1b signals and we
can reuse its union entry.
SI_MESGQ was previously defined to -3 in kernel and also in glibc.
So in userspace SI_MESGQ must be still visible as -3.
Solution:
SI_MESGQ is defined in the same style as SI_TIMER using __SI_CODE macro.
Details:
Fortunately copy_siginfo_to_user copies si_code as short. So we
can use remaining part of int value freely. __SI_CODE does the
work. SI_MESGQ is in kernel:
6<<16 | (-3 & 0xffff) what is > 0
but to userspace is copied
(short) SI_MESGQ == -3
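The arithmetic above can be checked with a few lines of C. The macro below mirrors the expression quoted in the text; the `SKETCH_` names are hypothetical, and the narrowing cast to short relies on the usual two's-complement truncation (implementation-defined in ISO C, but what GCC and Clang do).

```c
/* Combine an in-kernel type prefix with the 16-bit user-visible code. */
#define SKETCH_SI_CODE(type, code)	((type) | ((code) & 0xffff))

/* Prefix taken from the text: 6 << 16. */
#define SKETCH_SI_MESGQ_PREFIX		(6 << 16)

/* In-kernel value: 6<<16 | (-3 & 0xffff) == 0x6fffd, which is positive. */
static int si_mesgq_in_kernel(void)
{
	return SKETCH_SI_CODE(SKETCH_SI_MESGQ_PREFIX, -3);
}

/* What user space sees when only the low 16 bits are copied as a short. */
static short si_code_to_user(int kernel_code)
{
	return (short)kernel_code;
}
```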
Actual changes:
Changes in include/asm-generic/siginfo.h
__SI_MESGQ added in signal.h to represent inside-kernel prefix of
SI_MESGQ. SI_MESGQ is redefined from -3 to __SI_CODE(__SI_MESGQ, -3)
Except on the mips architecture, these changes should be arch independent
(asm-generic/siginfo.h is included in the arch versions). On mips
SI_MESGQ is redefined to -4 in order to be compatible with IRIX, but
the same scheme can be used.
Change in copy_siginfo_to_user: we add only one line, to get the
same copy semantics as for _SI_RT.
This change isn't very portable - some archs have their own
copy_siginfo_to_user. All of those should get a similar change (though
possibly not a one-liner, as the _SI_RT case was sometimes ignored because it
wasn't used yet, e.g. see ia64 signal.c).
Update:
mq: only fail with invalid timespec if mq_timed{send,receive} needs to block
From: Jakub Jelinek <jakub@redhat.com>
POSIX requires EINVAL to be set if:
"The process or thread would have blocked, and the abs_timeout parameter
specified a nanoseconds field value less than zero or greater than or equal
to 1000 million."
but 2.6.5-mm3 returns -EINVAL even if the process or thread would not block
(if the queue is not empty for timedreceive or not full for timedsend).
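The ordering of the checks can be sketched as follows: the timeout is validated only on the path where the caller would actually block. The function names and the 0/1 return convention are hypothetical; only the nanosecond range test is taken from the quoted POSIX wording.

```c
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* POSIX range for the nanoseconds field: 0 <= tv_nsec < 1000 million. */
static bool nsec_valid(const struct timespec *ts)
{
	return ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L;
}

/*
 * Returns 0 if the operation can complete immediately (the timeout is never
 * examined), 1 if the caller should sleep until abs_timeout, or -EINVAL if
 * it would have to block with an invalid timeout.
 */
static int timed_op_check(bool would_block, const struct timespec *abs_timeout)
{
	if (!would_block)
		return 0;
	if (!nsec_valid(abs_timeout))
		return -EINVAL;
	return 1;
}
```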
|
|
From: Manfred Spraul <manfred@colorfullife.com>
cleanup of sysv ipc as a preparation for posix message queues:
- replace !CONFIG_SYSVIPC wrappers for copy_semundo and exit_sem with
static inline wrappers. Now the whole ipc/util.c file is only used if
CONFIG_SYSVIPC is set, use makefile magic instead of #ifdef.
- remove the prototypes for copy_semundo and exit_sem from kernel/fork.c
- they belong in a header file.
- create a new msgutil.c with the helper functions for message queues.
- cleanup the helper functions: run Lindent, add __user tags.
|
|
From: badari <pbadari@us.ibm.com>
I ran into an ipc hang while trying to shutdown a database. The problem is
due to missing sem_unlock() in find_undo().
|
|
New inlined helper - file_accessed(file) (wrapper for update_atime())
|
|
From: Arun Sharma <arun.sharma@intel.com>
The current Linux implementation of shmat() insists on SHMLBA alignment even
when shmflg & SHM_RND == 0. This is not consistent with the man pages and
the single UNIX spec, which require only a page-aligned address.
However, some architectures require a SHMLBA alignment for correctness in all
cases. Such architectures use __ARCH_FORCE_SHMLBA.
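The resulting address rule might be sketched like this. Page size, SHMLBA and the "force" switch are passed as parameters so the example stays portable; in the kernel they are compile-time constants, and `shm_check_addr()` is a hypothetical name. Both sizes are assumed to be powers of two.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Decide the attach address for a caller-supplied addr.
 *  - SHMLBA-aligned addresses are always accepted.
 *  - With shm_rnd set (SHM_RND), addr is rounded down to SHMLBA.
 *  - Otherwise page alignment is enough, unless the architecture forces
 *    SHMLBA alignment (force_shmlba, modelling __ARCH_FORCE_SHMLBA).
 * Returns 0 and stores the address in *out, or -EINVAL.
 */
static int shm_check_addr(unsigned long addr, bool shm_rnd, bool force_shmlba,
			  unsigned long page_size, unsigned long shmlba,
			  unsigned long *out)
{
	if (addr & (shmlba - 1)) {
		if (shm_rnd)
			addr &= ~(shmlba - 1);
		else if (force_shmlba || (addr & (page_size - 1)))
			return -EINVAL;
	}
	*out = addr;
	return 0;
}
```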
|
|
From: Manfred Spraul <manfred@colorfullife.com>
There are a few unchecked do_munmap()s in the shm code. Manfred's comment
explains why they are OK.
|
|
From: Arnd Bergmann <arnd@arndb.de>
Adds a generic implementation of 32 bit emulation for IPC system calls. The
code is based on the existing implementations for sparc64, ia64, mips, s390,
ppc and x86_64, which can subsequently be converted to use this.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
sem_revalidate checks that a semaphore array didn't disappear while the
code was running without the semaphore array spinlock. If the array
disappeared, then it will return without holding a lock. find_undo calls
sem_revalidate and then sem_unlock, even if sem_revalidate failed. The
sem_unlock call must be removed.
Mingming Cao reported a spinlock deadlock with sysv semaphores. A
superfluous unlock doesn't explain the deadlock, but it's obviously a bug.
|
|
From: "Randy.Dunlap" <rddunlap@osdl.org>
Add syscalls.h, which contains prototypes for the kernel's system calls.
Replace open-coded declarations all over the place. This patch found a
couple of prior bugs. It appears to be more important with -mregparm=3 as we
discover more asmlinkage mismatches.
Some syscalls have arch-dependent arguments, so their prototypes are in the
arch-specific unistd.h. Maybe it should have been asm/syscalls.h, but there
were already arch-specific syscall prototypes in asm/unistd.h...
Tested on x86, ia64, x86_64, ppc64, s390 and sparc64. May cause
trivial-to-fix build breakage on other architectures.
|
|
This renames sys_shmat to do_shmat. Additionally, I've replaced the
cond_syscall with a conditional inline function.
It touches all archs - only i386 is tested.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Attached is a patch that replaces the #ifndef CONFIG_SYSV syscall stubs
with cond_syscall stubs.
|
|
From: Nick Piggin <piggin@cyberone.com.au>
sys_shmat() needs to be declared asmlinkage. This causes breakage when we
actually get the proper prototypes into caller's scope.
|
|
PA-RISC also uses the 64-bit version of the IPC structs.
|
|
The LSM changes broke the error checking for queue lengths in IPC_SET. The
LSM check would set err to 0, but the next check expected it to still be
-EPERM. The result was that no error was reported, but the new parameters
weren't correctly set.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
Attached is the lockless semop patch. I did another test run with
idle=poll on a Pentium III, and it remained unchanged: 99.9% direct
fast path, 0.1% race with wakeup against writing the final result code:
http://khack.osdl.org/stp/282936/environment/proc/slabinfo
That means there is no immediate need to add the two-stage
implementation to finish_wait.
It reduces the spinlock operations on the semaphore array spinlock by 1/3.
|
|
Backport this fix from 2.4
|
|
This fixes CONFIG_UID16 problems on x86-64 as discussed earlier.
CONFIG_UID16 now only selects the inclusion of kernel/uid16.c, all
conversions are triggered dynamically based on type sizes. This allows
x86-64 to both include uid16.c for emulation purposes, but not truncate
uids to 16bit in sys_newstat.
- Replace the old macros from linux/highuid.h with new SET_UID/SET_GID
macros that do type checking. Based on Linus' proposal.
- Fix everybody to use them.
- Clean up some cruft in the x86-64 32bit emulation allowed by this
(other 32bit emulations could be cleaned too, but I'm too lazy for
that right now)
- Add one missing EOVERFLOW check in x86-64 32bit sys_newstat while
I was at it.
|
|
From: Anton Blanchard <anton@samba.org>
I saw a lockup where 2 cpus were stuck in sem_lock(). It seems like we can
loop back to retry_undos with the lock held. That path takes the lock so
we will deadlock.
|
|
From: Andrea Arcangeli <andrea@suse.de>
aka: "vmalloc allocations in ipc needs smp initialized (and vm must be
allowed to schedule in 2.6)"
In short if you change SEMMNI to 8192 the kernel will crash at boot, because
it tries to call vmalloc before the smp is initialized. The reason is that
vmalloc calls into the pte alloc code, and the fast pte alloc is tried
first, but that reads into the pte_quicklist, that requires the cpu_data to
be initialized (and that happens in smp_init()).
the patch is obviously safe, since no piece of kernel (especially the code
in the check_bugs and smp_init paths ;) calls into the ipc subsystem.
The reason this started to trigger wasn't really that we increased SEMMNI,
but that some IPC data structure grew, and for some reason
the corruption due to the uninitialized pte_quicklist triggers only on smp
boxes with less than 1G (not very common anymore ;). So it wasn't
immediately reproducible on all setups.
2.6 doesn't suffer from the same problem, simply because 2.6 isn't using
the quicklist anymore, but I think it would be much more correct to make
the same change in 2.6 too, since whatever cond_resched() in the vm paths
(and they're definitely allowed to call it), will lead to a crash since the
init task isn't initialized and the scheduler can't be invoked yet. (and
2.6 already has the bigger data structures that should trigger the vmalloc
all the time on all setups)
|
|
One more overlooked area where the proper process ID has to be used:
SysV IPC "pid" values should use the thread group ID, not the per-thread
one.
|
|
|
|
From: Daniel McNeil <daniel@osdl.org>
This adds i_seqcount to the inode structure and then uses i_size_read() and
i_size_write() to provide atomic access to i_size. This is a port of
Andrea Arcangeli's i_size atomic access patch from 2.4. This only uses the
generic reader/writer consistent mechanism.
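A user-space sketch of a sequence-counter protected 64-bit size is shown below, written with C11 atomics. It assumes a single writer at a time (as if writers were serialised by some other lock), splits the size into two 32-bit halves to model a machine without atomic 64-bit stores, and uses hypothetical names throughout.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical inode fragment: a sequence counter plus a split 64-bit size. */
struct inode_sketch {
	atomic_uint seq;		/* odd while a write is in progress */
	_Atomic uint32_t size_lo;
	_Atomic uint32_t size_hi;
};

static void inode_sketch_init(struct inode_sketch *i)
{
	atomic_init(&i->seq, 0);
	atomic_init(&i->size_lo, 0);
	atomic_init(&i->size_hi, 0);
}

/* Writer: bump the counter to odd, store both halves, bump it to even. */
static void size_write(struct inode_sketch *i, uint64_t size)
{
	atomic_fetch_add(&i->seq, 1);
	atomic_store(&i->size_lo, (uint32_t)size);
	atomic_store(&i->size_hi, (uint32_t)(size >> 32));
	atomic_fetch_add(&i->seq, 1);
}

/* Reader: retry until the same even counter is seen before and after. */
static uint64_t size_read(struct inode_sketch *i)
{
	unsigned seq;
	uint64_t size;

	do {
		seq = atomic_load(&i->seq);
		size = atomic_load(&i->size_lo) |
		       ((uint64_t)atomic_load(&i->size_hi) << 32);
	} while ((seq & 1) || seq != atomic_load(&i->seq));
	return size;
}
```

A reader therefore never returns a value assembled from the low half of one update and the high half of another, which is the transient state the message is concerned about.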
Before:
mnm:/usr/src/25> size vmlinux
text data bss dec hex filename
2229582 1027683 162436 3419701 342e35 vmlinux
After:
mnm:/usr/src/25> size vmlinux
text data bss dec hex filename
2225642 1027655 162436 3415733 341eb5 vmlinux
3.9k more text, a lot of it fastpath :(
It's a very minor bug, and the fix has a fairly non-minor cost. The most
compelling reason for fixing this is that writepage() checks i_size. If it
sees a transient value it may decide that page is outside i_size and will
refuse to write it. Lost user data.
|
|
From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
This patch proposes a performance fix for the current IPC semaphore
implementation.
There are two shortcomings in the current implementation:
try_atomic_semop() is called twice to wake up a blocked process,
once from update_queue() (executed by the process that wakes up
the sleeping process) and once in the retry part of the blocked process
(executed by the blocked process that gets woken up).
A second issue is that when several sleeping processes are eligible
for wake-up, they wake up in daisy-chain formation, each one in turn
waking up the next process in line. However, every time a process
wakes up, it starts scanning the wait queue from the beginning, not from
where it was last scanned. This causes a large amount of unnecessary
scanning of the wait queue when the wait queue is deep.
Blocked processes come and go, but chances are there are still quite a
few blocked processes sitting at the beginning of that queue.
What we are proposing here is to merge the portion of the code in the
bottom part of sys_semtimedop() (code that gets executed when a sleeping
process gets woken up) into the update_queue() function. The benefit is
twofold: (1) it reduces redundant calls to try_atomic_semop() and (2) it
increases the efficiency of finding eligible processes to wake up and allows
higher concurrency for multiple wake-ups.
We have measured that this patch improves throughput for a large
application significantly on an industry standard benchmark.
This patch is relative to 2.5.72. Any feedback is very much
appreciated.
Some kernel profile data attached:
Kernel profile before optimization:
-----------------------------------------------
0.05 0.14 40805/529060 sys_semop [133]
0.55 1.73 488255/529060 ia64_ret_from_syscall [2]
[52] 2.5 0.59 1.88 529060 sys_semtimedop [52]
0.05 0.83 477766/817966 schedule_timeout [62]
0.34 0.46 529064/989340 update_queue [61]
0.14 0.00 1006740/6473086 try_atomic_semop [75]
0.06 0.00 529060/989336 ipcperms [149]
-----------------------------------------------
0.30 0.40 460276/989340 semctl_main [68]
0.34 0.46 529064/989340 sys_semtimedop [52]
[61] 1.5 0.64 0.87 989340 update_queue [61]
0.75 0.00 5466346/6473086 try_atomic_semop [75]
0.01 0.11 477676/576698 wake_up_process [146]
-----------------------------------------------
0.14 0.00 1006740/6473086 sys_semtimedop [52]
0.75 0.00 5466346/6473086 update_queue [61]
[75] 0.9 0.89 0.00 6473086 try_atomic_semop [75]
-----------------------------------------------
Kernel profile with optimization:
-----------------------------------------------
0.03 0.05 26139/503178 sys_semop [155]
0.46 0.92 477039/503178 ia64_ret_from_syscall [2]
[61] 1.2 0.48 0.97 503178 sys_semtimedop [61]
0.04 0.79 470724/784394 schedule_timeout [62]
0.05 0.00 503178/3301773 try_atomic_semop [109]
0.05 0.00 503178/930934 ipcperms [149]
0.00 0.03 32454/460210 update_queue [99]
-----------------------------------------------
0.00 0.03 32454/460210 sys_semtimedop [61]
0.06 0.36 427756/460210 semctl_main [75]
[99] 0.4 0.06 0.39 460210 update_queue [99]
0.30 0.00 2798595/3301773 try_atomic_semop [109]
0.00 0.09 470630/614097 wake_up_process [146]
-----------------------------------------------
0.05 0.00 503178/3301773 sys_semtimedop [61]
0.30 0.00 2798595/3301773 update_queue [99]
[109] 0.3 0.35 0.00 3301773 try_atomic_semop [109]
-----------------------------------------------
The numbers of function calls to both try_atomic_semop() and update_queue()
are reduced by roughly 50% as a result of the merge. Execution time of
sys_semtimedop is reduced because of the reduction in the low level
functions.
|
|
AMD64 like IA64 needs to force IPC_64 in the IPC functions. This makes
2.5 compatible with 2.4 again.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
The CLONE_SYSVSEM implementation is racy: it does an (atomic_read(->refcnt)
==1) instead of atomic_dec_and_test calls in the exit handling. The patch
fixes that.
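The difference matters because a separate "read, compare with 1" step lets two exiting tasks each observe a count that is not 1 (or both observe 1) depending on timing, whereas a combined decrement-and-test hands the "last reference" result to exactly one caller. Below is a user-space sketch of the combined form using C11 atomics; the structure and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared undo list with an atomic reference count. */
struct undo_list_sketch {
	atomic_int refcnt;
};

/* Take an additional reference (e.g. when a new task shares the list). */
static void undo_list_get(struct undo_list_sketch *u)
{
	atomic_fetch_add(&u->refcnt, 1);
}

/*
 * Drop a reference. atomic_fetch_sub() returns the previous value, so the
 * decrement and the test are one indivisible step: exactly one caller sees
 * 1 and is told to clean up.
 */
static bool undo_list_put(struct undo_list_sketch *u)
{
	return atomic_fetch_sub(&u->refcnt, 1) == 1;
}
```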
Additionally, the patch contains the following changes:
- lock_undo() locks the list of undo structures. The lock is held
throughout the semop() syscall, but that's unnecessary - we can drop it
immediately after the lookup.
- undo structures are only allocated when necessary. The need for undo
structures is only noticed in the middle of the semop operation, while
holding the semaphore array spinlock. The result is a convoluted
unlock&revalidate implementation. I've reordered the code, and now the
undo allocation can happen before acquiring the semaphore array spinlock.
As a bonus, less code runs under the semaphore array spinlock.
- sysvsem.sleep_list looks like code to handle oopses: if an oops kills a
thread that sleeps in sys_semtimedop(), then sem_exit tries to recover.
I've removed that - too fragile.
|
|
From: Manfred Spraul <manfred@colorfullife.com>
SysV sem operations that involve multiple semaphores can fail in the
middle, and then sempid (pid of the last successful operation) must be
restored. This happens with "sempid >>= 16" - broken due to the 32-bit pid
values. The attached patch fixes that by reordering the updates of the
semaphore fields.
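To see why a 16-bit shift cannot restore a 32-bit pid, consider the model below. It assumes the old scheme saved the previous pid by shifting it into the upper 16 bits of the same 32-bit field before storing the new pid; that packing step and both helper names are assumptions for illustration, only the "sempid >>= 16" restore is quoted in the text.

```c
#include <stdint.h>

/* Assumed old scheme: keep the previous pid in the upper 16 bits. */
static uint32_t sempid_push(uint32_t sempid, uint32_t pid)
{
	return (sempid << 16) | pid;
}

/* Restore step quoted in the text: "sempid >>= 16". */
static uint32_t sempid_pop(uint32_t sempid)
{
	return sempid >> 16;
}
```

With pids below 65536 the round trip works: pushing 300 over 1234 and popping gives 1234 again. Pushing pid 70000 over 1234 yields 0x04D31170, which pops to 1235, and a previous pid of 70000 loses its top bit and pops to 4464.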
Additionally, the patch fixes the corruption of the sempid value that occurs
if a wait-for-zero operation fails.
The patch is more than two years old, and was in -dj and -ak kernels.
|
|
From: Mingming Cao <cmm@us.ibm.com>
Basically, freeary() is called with the spinlock for that semaphore set
held. But after the semaphore set is removed from the ID array by
calling sem_rmid(), there is no lock to protect the waiting queue for
that semaphore set. So, if a waiter is woken up by a signal (not by the
wakeup from freeary()), it will check the q->status and q->prev fields.
At that moment, freeary() may not have a chance to update those fields
yet.
static void freeary (int id)
{
    .......
    sma = sem_rmid(id);
    ......
    /* Wake up all pending processes and let them fail with EIDRM. */
    for (q = sma->sem_pending; q; q = q->next) {
        q->status = -EIDRM;
        q->prev = NULL;
        wake_up_process(q->sleeper); /* doesn't sleep */
    }
    sem_unlock(sma);
    ......
}
So I propose moving sem_rmid() after the loop that wakes up all waiters.
That guarantees that when the waiters are woken up, the updates to
q->status and q->prev have already been done. The same applies to the
message queue case. The patch is attached below. Comments are very welcome.
I have tested this patch on a 2.5.68 kernel with the LTP tests, and it seems
fine to me. Paul, could you run the DOTS test on this again? Thanks!
|
|
|
|
From: Stewart Smith <stewartsmith@mac.com>
Remove the UPDATE_ATIME() macro, use update_atime() directly.
|
|
From: William Lee Irwin III <wli@holomorphy.com>
shm_get_stat() didn't know about hugetlbpage-backed shm.
|
|
From: William Lee Irwin III <wli@holomorphy.com>
Micro-optimize sys_shmdt(). There are methods of exploiting knowledge
of the vma's being searched to restrict the search space. These are:
(1) shm mappings always start their lives at file offset 0, so only
vma's above shmaddr need be considered. find_vma() can be used
to seek to the proper position in mm->mmap in O(lg(n)) time.
(2) The search is for a vma which could be a fragment of a broken-up
shm mapping, which would have been created starting at shmaddr
with vm_pgoff 0 and then continued no further into userspace
than shmaddr + size. So after having found an initial vma, find
the size of the shm segment it maps to calculate an upper bound
on the virtual address space that needs to be searched.
(3) mremap() would have caused the original checks to miss vma's mapping
the shm segment if shmaddr were the original address at which
the shm segments were attached. This does no better and no worse
than the original code in that situation.
(4) If the chain of references in vma->vm_file->f_dentry->d_inode->i_size
is not guaranteed by refcounting and/or the shm code then this is
oopsable; AFAICT an inode is always allocated.
|
|
This patch adds the remaining System V IPC hooks, including the inline
documentation for them in security.h. This includes a restored
sem_semop hook, as it does seem to be necessary to support fine-grained
access.
All of these System V IPC hooks are used by SELinux. The SELinux System
V IPC access controls were originally described in the technical report
available from http://www.nsa.gov/selinux/slinux-abs.html, and the
LSM-based implementation is described in the technical report available
from http://www.nsa.gov/selinux/module-abs.html.
|
|
|
|
From Rohit Seth
Attached is a patch that passes the correct number of attachments to a
shared memory segment back to user land. I could have made a few more
changes to the way nattach is set for the regular cases, but I want to
limit the scope at this point.
|
|
Makefiles no longer need to include Rules.make, which is currently an
empty file. This patch removes it from the remaining Makefiles, and
removes the empty Rules.make file.
|
|
Patch from Mark Fasheh <mark.fasheh@oracle.com> (plus a few cleanups
and a speedup from yours truly)
Adds the semtimedop() function - semop with a timeout. Solaris has
this. It's apparently worth a couple of percent to Oracle throughput
and given the simplicity, that is sufficient benefit for inclusion IMO.
This patch hooks up semtimedop() only for ia64 and ia32.
|
|
into conectiva.com.br:/home/BK/includes-2.5
|
|
Patch from Mingming Cao <cmm@us.ibm.com>
- ipc_lock() needs a read_barrier_depends() to prevent indexing an
uninitialized new array on the read side. This corresponds to
the write memory barrier added in grow_ary() from Dipankar's patch to
prevent indexing an uninitialized array.
- Replaced "wmb()" in IPC code with "smp_wmb()". "wmb()" produces a
full write memory barrier in both UP and SMP kernels, while
"smp_wmb()" provides a full write memory barrier in an SMP kernel,
but only a compiler directive in a UP kernel. The same change is
made for "rmb()".
- Removed rmb() in ipc_get(). We do not need a read memory barrier
there since ipc_get() is protected by ipc_ids.sem semaphore.
- Added more comments about why write barriers and read barriers are
needed (or not needed) here or there.
|
|
and net/* files.
|
|
|
|
|
|
From Dipankar Sarma.
Before setting the ids->entries to the new array, there must be a wmb()
to make sure that the memcpyed contents of the new array are visible
before the new array becomes visible.
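The ordering requirement above can be sketched in userspace with C11 atomics. This is a minimal illustration under stated assumptions, not the kernel code: struct id_array, table_grow() and table_lookup() are hypothetical names, a release store stands in for the wmb() before publishing the new array, and an acquire load stands in for the dependency barrier on the read side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical resizable table; stands in for ids->entries. */
struct id_array {
    int size;
    int slot[];                         /* flexible array member */
};

static _Atomic(struct id_array *) entries;

/*
 * Writer side (assumed to be serialized by some outer lock).
 * Copies the old contents, fills the new slots with `fill`, then
 * publishes the new array. On success *old receives the previous
 * array, which may only be freed once no reader can still hold it.
 */
int table_grow(int newsize, int fill, struct id_array **old)
{
    struct id_array *cur = atomic_load_explicit(&entries, memory_order_relaxed);
    int cursize = cur ? cur->size : 0;
    struct id_array *fresh;

    if (newsize < cursize)
        return -1;
    fresh = malloc(sizeof(*fresh) + (size_t)newsize * sizeof(int));
    if (!fresh)
        return -1;
    fresh->size = newsize;
    if (cur)
        memcpy(fresh->slot, cur->slot, (size_t)cursize * sizeof(int));
    for (int i = cursize; i < newsize; i++)
        fresh->slot[i] = fill;

    /* Release: the copied contents become visible before the pointer. */
    atomic_store_explicit(&entries, fresh, memory_order_release);
    *old = cur;
    return 0;
}

/* Reader side: returns the slot value, or -1 if idx is out of range. */
int table_lookup(int idx)
{
    struct id_array *a = atomic_load_explicit(&entries, memory_order_acquire);

    if (!a || idx < 0 || idx >= a->size)
        return -1;
    return a->slot[idx];
}
```

Without the release ordering on the store, a reader could see the new pointer before the copied contents and index an uninitialized array, which is the failure the barrier prevents.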
|
|
Patch from Hugh Dickins <hugh@veritas.com>
Fixes the Oracle startup problem reported by Alessandro Suardi.
Reverts a "simplification" to shmdt() which was wrong if subsequent
mprotects broke up the original VMA, or if parts of it were munmapped.
|
|
stat64 has been changed to return jiffies granularity as nsec in previously
unused fields. This allows make to make better decisions on when
to recompile a file. Follows loosely the Solaris API.
CURRENT_TIME has been redefined to return struct timespec. The users
who don't use it in an inode/attr context have been changed to use a new
get_seconds() function. CURRENT_TIME is implemented by an out-of-line
function.
There is a small performance penalty in this patch. The previous
filemap code had an optimization to flush atime only once a second.
This is currently gone, which will increase flushes a bit. I believe
the correct solution, if it should be a problem, is to have per-superblock
fields that give an arbitrary atime flush granularity - so that you
can set it to be only flushed once an hour if you prefer that. I will
work on that later in separate patches if the need should arise.
struct inode and the attr struct have been changed to store struct
timespec instead of time_t for [cma]time. Not all file systems support
this granularity, but some like XFS, NFSv3, CIFS, JFS do. The others will
currently truncate the nsec part on flushing to disk. There was some
discussion on this rounding on l-k previously. I went for simple
truncation because there is not much evidence IMHO that the more
complicated roundings have any advantages. In practice applications will
be rather unlikely to notice the rounding anyway - they can only see a
difference when an inode is flushed from memory and reloaded in less than
a second, which is rather unlikely.
|
|
Uninlines some large functions in the ipc code.
Before:
 text  data  bss    dec   hex  filename
30226   224  192  30642  77b2  ipc/built-in.o
After:
 text  data  bss    dec   hex  filename
20274   224  192  20690  50d2  ipc/built-in.o
|
|
Patch from Mingming, Rusty, Hugh, Dipankar, me:
- It greatly reduces the lock contention by having one lock per id.
The global spinlock is removed and a spinlock is added in
kern_ipc_perm structure.
- Uses ReadCopyUpdate in grow_ary() for lock-free resizing.
- In the places where ipc_rmid() is called, delay calling ipc_free()
to RCU callbacks. This is to prevent ipc_lock() returning an invalid
pointer after ipc_rmid(). In addition, use the workqueue to enable
RCU freeing of vmalloced entries.
Also some other changes:
- Remove redundant ipc_lockall/ipc_unlockall
- Now ipc_unlock() directly takes the IPC ID pointer as argument, avoiding
an extra lookup in the array.
The changes are made based on the input from Hugh Dickins, Manfred
Spraul and Dipankar Sarma. In addition, Cliff White has run OSDL's
dbt1 test on a 2-way machine against the earlier version of this patch.
Results show about 2-6% improvement in the average number of
transactions per second. Here is the summary of his tests:
                        2.5.42-mm2   2.5.42-mm2-ipclock
-------------------------------------------------------
Average over 5 runs     85.0 BT      89.8 BT
Std Deviation 5 runs    7.4 BT       1.0 BT
Average over 4 best     88.15 BT     90.2 BT
Std Deviation 4 best    2.8 BT       0.5 BT
Also, another test today from Bill Hartner:
I tested Mingming's RCU ipc lock patch using a *new* microbenchmark - semopbench.
semopbench was written to test the performance of Mingming's patch.
I also ran a 3 hour stress test and it completed successfully.
Explanation of the microbenchmark is below the results.
Here is a link to the microbenchmark source.
http://www-124.ibm.com/developerworks/opensource/linuxperf/semopbench/semopbench.c
SUT : 8-way 700 Mhz PIII
I tested 2.5.44-mm2 and 2.5.44-mm2 + RCU ipc patch
>semopbench -g 64 -s 16 -n 16384 -r > sem.results.out
>readprofile -m /boot/System.map | sort -n +0 -r > sem.profile.out
The metric is seconds per repetition. Lower is better.
kernel               run 1     run 2
                     seconds   seconds
==================   =======   =======
2.5.44-mm2           515.1     515.4
2.5.44-mm2+rcu-ipc   46.7      46.7
With Mingming's patch, the test completes 10X faster.
|
|
From Bill Irwin
Optionally back privileged processes' shm with hugetlbfs.
One of the more common requests for and/or users of hugetlb interfaces
in general are databases using shm. This patch exports functionality
mostly equivalent to tmpfs, adds the calling sequence to ipc/shm.c, and
hashes out a small support function in fs/hugetlbfs/inode.c so that shm
segments may be hugetlbpage-backed if userspace passes a flag to
shmget().
Access to this resource requires CAP_IPC_LOCK.
|
|
|
|
The patch below adds the base set of LSM hooks for System V IPC to the
2.5.41 kernel. These hooks permit a security module to label
semaphore sets, message queues, and shared memory segments and to
perform security checks on these objects that parallel the existing
IPC access checks. Additional LSM hooks for labeling and controlling
individual messages sent on a single message queue and for providing
fine-grained distinctions among IPC operations will be submitted
separately after this base set of LSM IPC hooks has been accepted.
|
|
It's gone almost everywhere else already, and will eventually make for
a nicer top-level Makefile.
|
|
The old form of designated initializers is obsolete: we need to
replace it with the ISO C form before 2.6. Gcc has always supported
both forms anyway.
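For reference, the two forms look like this. The struct and its fields are hypothetical and only illustrate the syntax; the obsolete GNU form is shown in a comment and the ISO C form is the one that is compiled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical structure, used only to show the initializer syntax. */
struct ipc_limits {
    int max_ids;
    int max_size;
    const char *name;
};

/*
 * Obsolete GNU form:
 *
 *     struct ipc_limits sem_limits = { max_ids: 128, name: "sem" };
 *
 * ISO C (C99) form, as used below. Members that are not named
 * (max_size here) are zero-initialized.
 */
const struct ipc_limits sem_limits = {
    .max_ids = 128,
    .name    = "sem",
};
```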
|
|
An acct flag was added to do_munmap, true everywhere but in mremap's
move_vma: instead of updating the arch and driver sources, revert that
change and temporarily mask VM_ACCOUNT around that one do_munmap.
Also, noticed that do_mremap fails needlessly if both shrinking _and_
moving a mapping: update old_len to pass vm area boundaries test.
|
|
If we support mmap MAP_NORESERVE, we should support it on shared
anonymous objects: too bad that needs a few changes. do_mmap_pgoff passes
VM_ACCOUNT (or not) down to shmem_file_setup, and the flag is stored in
the shmem info for use by shmem_delete_inode later. Also removed a
harmless but pointless call to shmem_truncate.
|
|
Alan's overcommit patch, brought to 2.5 by Robert Love.
Can't say I've tested its functionality at all, but it doesn't crash,
it has been in -ac and RH kernels for some time and I haven't observed
any of its functions on profiles.
"So what is strict VM overcommit? We introduce new overcommit
policies that attempt to never succeed an allocation that can not be
fulfilled by the backing store and consequently never OOM. This is
achieved through strict accounting of the committed address space and
a policy to allow/refuse allocations based on that accounting.
In the strictest of modes, it should be impossible to allocate more
memory than available and impossible to OOM. All memory failures
should be pushed down to the allocation routines -- malloc, mmap, etc.
The new modes are available via sysctl (same as before). See
Documentation/vm/overcommit-accounting for more information."
|
|
Martin Schwidefsky <schwidefsky@de.ibm.com> reported "Bug with shared
memory" to LKML 14 May: hang due to schedule in truncate_list_pages
called from .... shm_destroy holding shm_lock spinlock. shm_destroy
needs that lock for shm_rmid, but it can be safely unlocked once the link
from id to shp has been removed.
|
|
into kroah.com:/home/greg/linux/BK/lsm-2.5
|
|
This patch just makes some stuff in ipc/ static.
|
|
Also move where we set sma->sem_perm.mode and .key to before ipc_addid() gets called.
|
|
msg.c file to the msg.h file
Also move where the msg->q_perm.mode and .key values get set to before
ipc_addid() gets called to make placing a hook there easier.
|
|
Christopher Yeoh <cyeoh@samba.org>: (Made -p1 compliant by rusty) SUSv2 semctl compliance:
The semctl call with SETVAL currently does not set sempid (at the
moment sempid is only set during a successful semop call). An
explanation from Geoff Clare of the Open Group regarding why sempid
should be set during the semctl call:
"The spec isn't very clear, but there is a statement on the semget()
page which I think justifies the assumption made by the test. It says
that upon creation, the data structure associated with each semaphore
in the set is not initialised, and that the semctl() function with
SETVAL or SETALL can be used to initialise each semaphore.
Therefore semctl() with SETVAL has to set sempid to *something*, and
since sempid contains the "process ID of the last operation", setting
it to anything other than the pid of the calling process would mean
that sempid contained misleading information. It could be argued that
setting it to zero would not be misleading, but zero cannot be the
process ID of a process, and so is not a valid value for sempid anyway."
The following patch changes semctl so when called with SETVAL
sempid is set to the pid of the calling process:
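The behaviour can be checked from userspace with a sketch like the one below. It is an illustration under assumptions: setval_records_caller_pid() is a hypothetical helper, and it returns -1 when System V semaphores are not usable in the current environment.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <unistd.h>

/* On Linux the caller has to define union semun itself. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/*
 * Returns 1 if sempid equals the caller's pid after SETVAL,
 * 0 if it does not, and -1 if SysV semaphores are unavailable.
 */
int setval_records_caller_pid(void)
{
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    union semun arg = { .val = 1 };
    int result = -1;

    if (id < 0)
        return -1;
    if (semctl(id, 0, SETVAL, arg) == 0) {
        int pid = semctl(id, 0, GETPID);    /* reads sempid */
        if (pid >= 0)
            result = (pid == (int)getpid());
    }
    semctl(id, 0, IPC_RMID);                /* remove the private set */
    return result;
}
```

On a kernel with this change the helper is expected to return 1. Before the change, sempid was only written by a successful semop() call.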
|
|
The patch below fixes sem_exit() so that the BKL is always released.
|
|
As we discussed some time ago, here is a patch for the SEM_UNDO change
that can be applied to linux-2.5.9.
|
|
|
|
We always returned success even when we had no ->vm_ops
|
|
Push BKL down to the (few) routines that actually need it,
remove it from the do_exit() path.
|
|
Separates shmem_sb_info from struct super_block.
|
|
- Jens Axboe: more bio updates, fix some request list bogosity under load
- Al Viro: export seq_xxx functions
- Manfred Spraul: include file cleanups, pc110pad compile fix
- David Woodhouse: fix JFFS2 write error handling
- Dave Jones: start merging up with 2.4.x patches
- Manfred Spraul: coredump fixes, FS event counter cleanups
- me: fix SCSI CD-ROM sectorsize BIO breakage
|
|
- Greg KH: USB updates
- Jens Axboe: more bio updates
- Christoph Rohland: fix up proper shmat semantics
|
|
- Al Viro: mnt_list init
- Jeff Garzik: network driver update (license tags, tulip driver)
- David Miller: sparc, net updates
- Ben Collins: firewire update
- Gerd Knorr: btaudio/bttv update
- Tim Hockin: MD cleanups
- Greg KH, Petko Manolov: USB updates
- Leonard Zubkoff: DAC960 driver update
|
|
- me/Al Viro: fix bdget() oops with block device modules that don't
clean up after they exit
- Alan Cox: continued merging (drivers, license tags)
- David Miller: sparc update, network fixes
- Christoph Hellwig: work around broken drivers that add a gendisk more
than once
- Jakub Jelinek: handle more ELF loading special cases
- Trond Myklebust: NFS client and lockd reclaimer cleanups/fixes
- Greg KH: USB updates
- Mikael Pettersson: separate out local APIC / IO-APIC config options
|
|
- Alan Cox: continued merging
- Mingming Cao: make msgrcv/shmat check the queue/segment ID's properly
- Greg KH: USB serial init failure fix, Xircom serial converter driver
- Neil Brown: nfsd/raid/md/lockd cleanups
- Ingo Molnar: multipath RAID personality, raid xor update
- Hugh Dickins/Marcelo Tosatti: swapin read-ahead race fix
- Vojtech Pavlik: fix up some of the infrastructure for x86-64
- Robert Love: AMD 761 AGP GART support
- Jens Axboe: fix SCSI-generic queue handling race
- me: be sane about page reference bits
|
|
- me: fix forgotten nfsd usage of filldir off_t -> loff_t change
- Alan Cox: more driver merges
|
|
- Russell King: ARM updates
- Al Viro: more init cleanups
- Cort Dougan: more PPC updates
- David Miller: cleanups, pci mmap updates
- Neil Brown: raid resync by sector
- Alan Cox: more merging with -ac
- Johannes Erdfelt: USB updates
- Kai Germaschewski: ISDN updates
- Tobias Ringstrom: dmfe.c network driver update
- Trond Myklebust: NFS client updates and cleanups
|
|
- Rik van Riel and others: mm rw-semaphore (ps/top ok when swapping)
- IDE: 256 sectors at a time is legal, but apparently confuses some
drives. Max out at 255 sectors instead.
- Petko Manolov: USB pegasus driver update
- make the boottime memory map printout at least almost readable.
- USB driver updates
- pte_alloc()/pmd_alloc() need page_table_lock.
|
|
- sync up more with Alan
- Urban Widmark: smbfs and HIGHMEM fix
- Chris Mason: reiserfs tail unpacking fix ("null bytes in reiserfs files")
- Adam Richter: new cpia usb ID
- Hugh Dickins: misc small sysv ipc fixes
- Andries Brouwer: remove overly restrictive sector size check for
SCSI cd-roms
|
|
- Jens: better ordering of requests when unable to merge
- Neil Brown: make md work as a module again (we cannot autodetect
in modules, not enough background information)
- Neil Brown: raid5 SMP locking cleanups
- Neil Brown: nfsd: handle Irix NFS clients named pipe behavior and
dentry leak fix
- maestro3 shutdown fix
- fix dcache hash calculation that could cause bad hashes under certain
circumstances (Dean Gaudet)
- David Miller: networking and sparc updates
- Jeff Garzik: include file cleanups
- Andy Grover: ACPI update
- Coda-fs error return fixes
- rth: alpha Jensen update
|
|
- ReiserFS merge
- fix DRM R128/AGP dependency
|
|
|