aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2005-10-29[PATCH] mm: split page table lockHugh Dickins1-2/+2
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: follow_page with inner ptlockHugh Dickins1-4/+2
Final step in pushing down common core's page_table_lock. follow_page no longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself; and so no page_table_lock is taken in get_user_pages itself. But get_user_pages (and get_futex_key) do then need follow_page to pin the page for them: take Daniel's suggestion of bitflags to follow_page. Need one for WRITE, another for TOUCH (it was the accessed flag before: vanished along with check_user_page_readable, but surely get_numa_maps is wrong to mark every page it finds as accessed), another for GET. And another, ANON to dispose of untouched_anonymous_page: it seems silly for that to descend a second time, let follow_page observe if there was no page table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED - make_pages_present ought to make readonly anonymous present. Give get_numa_maps a cond_resched while we're there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: ptd_alloc take ptlockHugh Dickins1-2/+0
Second step in pushing down the page_table_lock. Remove the temporary bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not to hold page_table_lock, whether it's on init_mm or a user mm; take page_table_lock internally to check if a racing task already allocated. Convert their callers from common code. But avoid coming back to change them again later: instead of moving the spin_lock(&mm->page_table_lock) down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which encapsulate the mapping+locking and unlocking+unmapping together, and in the end may use alternatives to the mm page_table_lock itself. These callers all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them; and pte_alloc uses the "atomic" pmd_present to test whether it needs to allocate. It appears that on all arches we can safely descend without page_table_lock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: update_hiwaters just in timeHugh Dickins2-3/+4
update_mem_hiwater has attracted various criticisms, in particular from those concerned with mm scalability. Originally it was called whenever rss or total_vm got raised. Then many of those callsites were replaced by a timer tick call from account_system_time. Now Frank van Maarseveen reports that to be found inadequate. How about this? Works for Frank. Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually by 1): those are hot paths. Do the opposite, update only when about to lower rss (usually by many), or just before final accounting in do_exit. Handle mm->hiwater_vm in the same way, though it's much less of an issue. Demand that whoever collects these hiwater statistics do the work of taking the maximum with rss or total_vm. And there has been no collector of these hiwater statistics in the tree. The new convention needs an example, so match Frank's usage by adding a VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS (High-Water-Mark or High-Water-Memory). There was a particular anomaly during mremap move, that hiwater_vm might be captured too high. A fleeting such anomaly remains, but it's quickly corrected now, whereas before it would stick. What locking? None: if the app is racy then these statistics will be racy, it's not worth any overhead to make them exact. But whenever it suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under page_table_lock (for now) or with preemption disabled (later on): without going to any trouble, minimize the time between reading current values and updating, to minimize those occasions when a racing thread bumps a count up and back down in between. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] core remove PageReservedNick Piggin1-9/+16
Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and is be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from kernel/power/swsusp.c, and all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Last real user of PageReserved is swsusp, which uses PageReserved to determine whether a struct page points to valid memory or not. This still needs to be addressed (a generic page_is_ram() should work). A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. Signed-off-by: Nick Piggin <npiggin@suse.de> Refcount bug fix for filemap_xip.c Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: dup_mmap down new mmap_semHugh Dickins1-5/+4
One anomaly remains from when Andrea rationalized the responsibilities of mmap_sem and page_table_lock: in dup_mmap we add vmas to the child holding its page_table_lock, but not the mmap_sem which normally guards the vma list and rbtree. Which could be an issue for unuse_mm: though since it just walks down the list (today with page_table_lock, tomorrow not), it's probably okay. Will need a memory barrier? Oh, keep it simple, Nick and I agreed, no harm in taking child's mmap_sem here. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: dup_mmap use oldmm moreHugh Dickins1-6/+6
Use the parent's oldmm throughout dup_mmap, instead of perversely going back to current->mm. (Can you hear the sigh of relief from those mpnts? Usually I squash them, but not today.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: rss = file_rss + anon_rssHugh Dickins2-3/+3
I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: mm_init set_mm_countersHugh Dickins1-2/+2
How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but that's not so good if an mm_counter_t is a special type. And how is rss initialized? By set_mm_counter, all over the place. Come on, we just need to initialize them both at once by set_mm_counter in mm_init (which follows the memcpy when forking). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: vm_stat_account unshackledHugh Dickins1-1/+1
The original vm_stat_account has fallen into disuse, with only one user, and only one user of vm_stat_unaccount. It's easier to keep track if we convert them all to __vm_stat_account, then free it from its __shackles. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] TIMERS: add missing compensation for HZ == 250YOSHIFUJI Hideaki1-0/+9
Add missing compensation for (HZ == 250) != (1 << SHIFT_HZ) in second_overflow(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] missing exports of do_settimeofday() variantsAl Viro1-0/+1
frv, sh64, ia64 and sparc64 do not have do_settimeofday() exported (the last two are using variant in kernel/time.c). Exports added to match the rest of architectures. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] fix ->signal->live leak in copy_process()Oleg Nesterov1-0/+2
exit_signal() (called from copy_process's error path) should decrement ->signal->live, otherwise forking process will miss 'group_dead' in do_exit(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28[PATCH] gfp_t: kernel/*Al Viro4-9/+8
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-27[PATCH] Yet more posix-cpu-timer fixesRoland McGrath1-4/+7
This just makes sure that a thread's expiry times can't get reset after it clears them in do_exit. This is what allowed us to re-introduce the stricter BUG_ON() check in a362f463a6d316d14daed0f817e151835ce97ff7. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-27Revert "remove false BUG_ON() from run_posix_cpu_timers()"Linus Torvalds2-18/+26
This reverts commit 3de463c7d9d58f8cf3395268230cb20a4c15bffa. Roland has another patch that allows us to leave the BUG_ON() in place by just making sure that the condition it tests for really is always true. That goes in next. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26[PATCH] Fix cpu timers expiration timeOleg Nesterov1-3/+3
There's a silly off-by-one error in the code that updates the expiration of posix CPU timers, causing them to not be properly updated when they hit exactly on their expiration time (which should be the normal case). This causes them to then fire immediately again, and only _then_ get properly updated. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26posix cpu timers: fix timer orderingLinus Torvalds1-6/+4
Pointed out by Oleg Nesterov, who has been walking over the code forwards and backwards. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26[PATCH] export cpu_online_mapAndrew Morton1-0/+1
With CONFIG_SMP=n: *** Warning: "cpu_online_map" [drivers/firmware/dcdbas.ko] undefined! due to set_cpus_allowed(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24[PATCH] posix-timers: fix posix_cpu_timer_set() vs run_posix_cpu_timers() raceOleg Nesterov1-4/+8
This might be harmless, but looks like a race from code inspection (I was unable to trigger it). I must admit, I don't understand why we can't return TIMER_RETRY after 'spin_unlock(&p->sighand->siglock)' without doing bump_cpu_timer(), but this is what original code does. posix_cpu_timer_set: read_lock(&tasklist_lock); spin_lock(&p->sighand->siglock); list_del_init(&timer->it.cpu.entry); spin_unlock(&p->sighand->siglock); We are probaly deleting the timer from run_posix_cpu_timers's 'firing' local list_head while run_posix_cpu_timers() does list_for_each_safe. Various bad things can happen, for example we can just delete this timer so that list_for_each() will not notice it and run_posix_cpu_timers() will not reset '->firing' flag. In that case, .... if (timer->it.cpu.firing) { read_unlock(&tasklist_lock); timer->it.cpu.firing = -1; return TIMER_RETRY; } sys_timer_settime() goes to 'retry:', calls posix_cpu_timer_set() again, it returns TIMER_RETRY ... Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24[PATCH] posix-timers: exit path cleanupOleg Nesterov1-0/+6
No need to rebalance when task exited Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24[PATCH] posix-timers: remove false BUG_ON() from run_posix_cpu_timers()Oleg Nesterov2-26/+18
do_exit() clears ->it_##clock##_expires, but nothing prevents another cpu to attach the timer to exiting process after that. After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and before do_exit() calls 'schedule() local timer interrupt can find tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu does sys_wait4) interrupted task has ->signal == NULL. At this moment exiting task has no pending cpu timers, they were cleaned up in __exit_signal()->posix_cpu_timers_exit{,_group}(), so we can just return from irq. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24[PATCH] posix-timers: fix cleanup_timers() and run_posix_cpu_timers() racesOleg Nesterov1-19/+10
1. cleanup_timers() sets timer->task = NULL under tasklist + ->sighand locks. That means that this code in posix_cpu_timer_del() and posix_cpu_timer_set() lock_timer(timer); if (timer->task == NULL) return; read_lock(tasklist); put_task_struct(timer->task) is racy. With this patch timer->task modified and accounted only under timer->it_lock. Sadly, this means that dead task_struct won't be freed until timer deleted or armed. 2. run_posix_cpu_timers() collects expired timers into local list under tasklist + ->sighand again. That means that posix_cpu_timer_del() should check timer->it.cpu.firing under these locks too. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-23Posix timers: limit number of timers firing at onceLinus Torvalds1-6/+14
Bursty timers aren't good for anybody, very much including latency for other programs when we trigger lots of timers in interrupt context. So set a random limit, after which we'll handle the rest on the next timer tick. Noted by Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-21[PATCH] Call exit_itimers from do_exit, not __exit_signalRoland McGrath3-14/+3
When I originally moved exit_itimers into __exit_signal, that was the only place where we could reliably know it was the last thread in the group dying, without races. Since then we've gotten the signal_struct.live counter, and do_exit can reliably do group-wide cleanup work. This patch moves the call to do_exit, where it's made without locks. This avoids the deadlock issues that the old __exit_signal code's comment talks about, and the one that Oleg found recently with process CPU timers. [ This replaces e03d13e985d48ac4885382c9e3b1510c78bd047f, which is why it was just reverted. ] Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-21Revert "Fix cpu timers exit deadlock and races"Linus Torvalds1-11/+17
Revert commit e03d13e985d48ac4885382c9e3b1510c78bd047f, to be replaced by a much nicer fix from Roland.
2005-10-19[PATCH] Threads shouldn't inherit PF_NOFREEZEAlan Stern1-1/+1
The PF_NOFREEZE process flag should not be inherited when a thread is forked. This patch (as585) removes the flag from the child. This problem is starting to show up more and more as drivers turn to the kthread API instead of using kernel_thread(). As a result, their kernel threads are now children of the kthread worker instead of modprobe, and they inherit the PF_NOFREEZE flag. This can cause problems during system suspend; the kernel threads are not getting frozen as they ought to be. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-19[PATCH] Fix cpu timers exit deadlock and racesRoland McGrath1-17/+11
Oleg Nesterov reported an SMP deadlock. If there is a running timer tracking a different process's CPU time clock when the process owning the timer exits, we deadlock on tasklist_lock in posix_cpu_timer_del via exit_itimers. That code was using tasklist_lock to check for a race with __exit_signal being called on the timer-target task and clearing its ->signal. However, there is actually no such race. __exit_signal will have called posix_cpu_timers_exit and posix_cpu_timers_exit_group before it does that. Those will clear those k_itimer's association with the dying task, so posix_cpu_timer_del will return early and never reach the code in question. In addition, posix_cpu_timer_del called from exit_itimers during execve or directly from timer_delete in the process owning the timer can race with an exiting timer-target task to cause a double put on timer-target task struct. Make sure we always access cpu_timers lists with sighand lock held. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-17[PATCH] rcu: keep rcu callback event counterEric Dumazet1-0/+11
This makes call_rcu() keep track of how many events there are on the RCU list, and cause a reschedule event when the list gets too long. This helps keep RCU event lists down. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-17[PATCH] posix-timers: fix task accountingOleg Nesterov1-0/+3
Make sure we release the task struct properly when releasing pending timers. release_task() does write_lock_irq(&tasklist_lock), so it can't race with run_posix_cpu_timers() on any cpu. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-17Increase default RCU batching sharplyLinus Torvalds1-1/+1
Dipankar made RCU limit the batch size to improve latency, but that approach is unworkable: it can cause the RCU queues to grow without bounds, since the batch limiter ended up limiting the callbacks. So make the limit much higher, and start planning on instead limiting the batch size by doing RCU callbacks more often if the queue looks like it might be growing too long. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-14[PATCH] Add missing export of getnstimeofday()Takashi Iwai1-0/+1
Adds the missing EXPORT_SYMBOL_GPL for getnstimeofday() when CONFIG_TIME_INTERPOLATION isn't set. Needed by drivers/char/mmtimer.c Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-10[PATCH] Fix signal sending in usbdevio on async URB completionHarald Welte1-0/+34
If a process issues an URB from userspace and (starts to) terminate before the URB comes back, we run into the issue described above. This is because the urb saves a pointer to "current" when it is posted to the device, but there's no guarantee that this pointer is still valid afterwards. In fact, there are three separate issues: 1) the pointer to "current" can become invalid, since the task could be completely gone when the URB completion comes back from the device. 2) Even if the saved task pointer is still pointing to a valid task_struct, task_struct->sighand could have gone meanwhile. 3) Even if the process is perfectly fine, permissions may have changed, and we can no longer send it a signal. So what we do instead, is to save the PID and uid's of the process, and introduce a new kill_proc_info_as_uid() function. Signed-off-by: Harald Welte <laforge@gnumonks.org> [ Fixed up types and added symbol exports ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-10[PATCH] x86_64: Set up safe page tables during resumeRafael J. Wysocki1-3/+4
The following patch makes swsusp avoid the possible temporary corruption of page translation tables during resume on x86-64. This is achieved by creating a copy of the relevant page tables that will not be modified by swsusp and can be safely used by it on resume. The problem is that during resume on x86-64 swsusp may temporarily corrupt the page tables used for the direct mapping of RAM. If that happens, a page fault occurs and cannot be handled properly, which leads to the solid hang of the affected system. This leads to the loss of the system's state from before suspend and may result in the loss of data or the corruption of filesystems, so it is a serious issue. Also, it appears to happen quite often (for me, as often as 50% of the time). The problem is related to the fact that (at least) one of the PMD entries used in the direct memory mapping (starting at PAGE_OFFSET) points to a page table the physical address of which is much greater than the physical address of the PMD entry itself. Moreover, unfortunately, the physical address of the page table before suspend (i.e. the one stored in the suspend image) happens to be different to the physical address of the corresponding page table used during resume (i.e. the one that is valid right before swsusp_arch_resume() in arch/x86_64/kernel/suspend_asm.S is executed). Thus while the image is restored, the "offending" PMD entry gets overwritten, so it does not point to the right physical address any more (i.e. there's no page table at the address pointed to by it, because it points to the address the page table has been at during suspend). Consequently, if the PMD entry is used later on, and it _is_ used in the process of copying the image pages, a page fault occurs, but it cannot be handled in the normal way and the system hangs. In principle we can call create_resume_mapping() from swsusp_arch_resume() (ie. from suspend_asm.S), but then the memory allocations in create_resume_mapping(), resume_pud_mapping(), and resume_pmd_mapping() must be made carefully so that we use _only_ NosaveFree pages in them (the other pages are overwritten by the loop in swsusp_arch_resume()). Additionally, we are in atomic context at that time, so we cannot use GFP_KERNEL. Moreover, if one of the allocations fails, we should free all of the allocated pages, so we need to trace them somehow. All of this is done in the appended patch, except that the functions populating the page tables are located in arch/x86_64/kernel/suspend.c rather than in init.c. It may be done in a more elegan way in the future, with the help of some swsusp patches that are in the works now. [AK: move some externs into headers, renamed a function] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08[PATCH] gfp flags annotations - part 1Al Viro4-5/+5
- added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08[PATCH] fix do_coredump() vs SIGSTOP raceOleg Nesterov1-1/+2
Let's suppose we have 2 threads in thread group: A - does coredump B - has pending SIGSTOP thread A thread B do_coredump: get_signal_to_deliver: lock(->sighand) ->signal->flags = SIGNAL_GROUP_EXIT unlock(->sighand) lock(->sighand) signr = dequeue_signal() ->signal->flags |= SIGNAL_STOP_DEQUEUED return SIGSTOP; do_signal_stop: unlock(->sighand) coredump_wait: zap_threads: lock(tasklist_lock) send SIGKILL to B // signal_wake_up() does nothing unlock(tasklist_lock) lock(tasklist_lock) lock(->sighand) re-check sig->flags & SIGNAL_STOP_DEQUEUED, yes set_current_state(TASK_STOPPED); finish_stop: schedule(); // ->state == TASK_STOPPED wait_for_completion(&startup_done) // waits for complete() from B, // ->state == TASK_UNINTERRUPTIBLE We can't wake up 'B' in any way: SIGCONT will be ignored because handle_stop_signal() sees ->signal->flags & SIGNAL_GROUP_EXIT. sys_kill(SIGKILL)->__group_complete_signal() will choose uninterruptible 'A', so it can't help. sys_tkill(B, SIGKILL) will be ignored by specific_send_sig_info() because B already has pending SIGKILL. This scenario is not possbile if 'A' does do_group_exit(), because it sets sig->flags = SIGNAL_GROUP_EXIT and delivers SIGKILL to subthreads atomically, holding both tasklist_lock and sighand->lock. That means that do_signal_stop() will notice !SIGNAL_STOP_DEQUEUED after re-locking ->sighand. And it is not possible to any other thread to re-add SIGNAL_STOP_DEQUEUED later, because dequeue_signal() can only return SIGKILL. I think it is better to change do_coredump() to do sigaddset(SIGKILL) and signal_wake_up() under sighand->lock, but this patch is much simpler. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-01Fix inequality comparison against "task->state"Linus Torvalds1-1/+1
We should always use bitmask ops, rather than depend on some ordering of the different states. With the TASK_NONINTERACTIVE flag, the inequality doesn't really work. Oleg Nesterov argues (likely correctly) that this test is unnecessary in the first place. However, the minimal fix for now is to at least make it work in the presense of TASK_NONINTERACTIVE. Waiting for consensus from Roland & co on potential bigger cleanups. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-30[PATCH] cpuset crapectomyAl Viro1-11/+1
Switched cpuset_common_file_read() to simple_read_from_buffer(), killed a bunch of useless (and not quite correct - e.g. min(size_t,ssize_t)) code. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-29[PATCH] Fix task state testing properly in do_signal_stop()Roland McGrath1-1/+2
Any tests using < TASK_STOPPED or the like are left over from the time when the TASK_ZOMBIE and TASK_DEAD bits were in the same word, and it served to check for "stopped or dead". I think this one in do_signal_stop is the only such case. It has been buggy ever since exit_state was separated, and isn't testing the exit_state value. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28[PATCH] cpuset read past eof memory leak fixPaul Jackson1-5/+6
Don't leak a page of memory if user reads a cpuset file past eof. Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28[PATCH] swsusp: avoid problems if there are too many pages to saveRafael J. Wysocki2-1/+8
The following patch makes swsusp avoid problems during resume if there are too many pages to save on suspend. It adds a constant that allows us to verify if we are going to save too many pages and implements the check (this is done as early as we can tell that the check will trigger, which is in swsusp_alloc()). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28[PATCH] Ignore trailing whitespace on kernel parameters correctlyRusty Russell1-2/+8
Dave Jones says: ... if the modprobe.conf has trailing whitespace, modules fail to load with the following helpful message.. snd_intel8x0: Unknown parameter `' Previous version truncated last argument. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28[PATCH] swsusp: prevent possible memory leakRafael J. Wysocki1-3/+3
Prevent swsusp from leaking some memory in case of an error in read_pagedir(). It also prevents the BUG_ON() from triggering if there's an error while reading swap. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28[PATCH] swsusp: remove wrong code from data_freeRafael J. Wysocki1-4/+3
The following patch removes some wrong code from the data_free() function in swsusp. This function could only be called if there's an error while writing the suspend image to swap, so it is not triggered easily. However, if triggered, it would probably corrupt some memory. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-23Make sure SIGKILL gets proper respectLinus Torvalds1-17/+14
Bhavesh P. Davda <bhavesh@avaya.com> noticed that SIGKILL wouldn't properly kill a process under just the right cicumstances: a stopped task that already had another signal queued would get the SIGKILL queued onto the shared queue, and there it would remain until SIGCONT. This simplifies the signal acceptance logic, and fixes the bug in the process. Losely based on an earlier patch by Bhavesh. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22[PATCH] swsusp: fix commentsPavel Machek2-4/+8
Fix comments in swsusp. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22[PATCH] swsusp: do not trigger BUG_ON() if there is not enough memoryRafael J. Wysocki1-1/+1
The following patch makes swsusp avoid triggering the BUG_ON() in swsusp_suspend() if there is not enough memory for suspend. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22[PATCH] SOFTWARE_SUSPEND needs HOTPLUG_CPU on SMPRandy Dunlap1-1/+1
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22[PATCH] suspend: cleanup calling of power off methods.Eric W. Biederman1-4/+2
In the lead up to 2.6.13 I fixed a large number of reboot problems by making the calling conventions consistent. Despite checking and double checking my work it appears I missed an obvious one. The S4 suspend code for PM_DISK_PLATFORM was also calling device_shutdown without setting system_state, and was not calling the appropriate reboot_notifier. This patch fixes the bug by replacing the call of device_suspend with kernel_poweroff_prepare. Various forms of this failure have been fixed and tracked for a while. Thanks for tracking this down go to: Alexey Starikovskiy, Meelis Roos <mroos@linux.ee>, Nigel Cunningham <ncunningham@cyclades.com>, Pierre Ossman <drzeus-list@drzeus.cx> History of this bug is at: http://bugme.osdl.org/show_bug.cgi?id=4320 Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22[PATCH] reboot: comment and factor the main reboot functionsEric W. Biederman1-6/+46
In the lead up to 2.6.13 I fixed a large number of reboot problems by making the calling conventions consistent. Despite checking and double checking my work it appears I missed an obvious one. This first patch simply refactors the reboot routines so all of the preparation for various kinds of reboots are in their own functions. Making it very hard to get the various kinds of reboot out of sync. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-21[PATCH] Add printk_clock()Andrew Morton1-1/+6
ia64's sched_clock() accesses per-cpu data which isn't set up at boot time. Hence ia64 cannot use printk timestamping, because printk() will crash in sched_clock(). So make printk() use printk_clock(), defaulting to sched_clock(), overrideable by the architecture via attribute(weak). Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-17[PATCH] files: fix preemption issuesDipankar Sarma1-0/+6
With the new fdtable locking rules, you have to protect fdtable with either ->file_lock or rcu_read_lock/unlock(). There are some places where we aren't doing either. This patch fixes those places. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-17[PATCH] PR_GET_DUMPABLE returns incorrect infoMichael Kerrisk1-2/+1
2.6.13 incorporated Alan Cox's patch for /proc/sys/fs/suid_dumpable (one version of this patch can be found here http://marc.theaimsgroup.com/?l=linux-kernel&m=109647550421014&w=2 ). This patch also made corresponding changes in kernel/sys.c to change the prctl() PR_SET_DUMPABLE operation so that the permitted range of 'arg2' was modified from 0..1 to 0..2. However, a corresponding change was not made for PR_GET_DUMPABLE: if the dumpable flag is non-zero, then PR_GET_DUMPABLE always returns 1, so that the caller can't determine the true setting of this flag. Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-17[PATCH] CPU hotplug breaks wake_up_new_taskSrivatsa Vaddagiri1-1/+2
Fix a problem wherein a new-born task is added to a dead CPU. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13[PATCH] Fix spinlock owner debuggingIngo Molnar1-4/+4
fix up the runqueue lock owner only if we truly did a context-switch with the runqueue lock held. Impacts ia64, mips, sparc64 and arm. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13Merge master.kernel.org:/pub/scm/linux/kernel/git/dwmw2/audit-2.6 Linus Torvalds2-147/+308
2005-09-13[PATCH] use add_taint() for setting tainted bit flagsRandy Dunlap1-5/+6
Use the add_taint() interface for setting tainted bit flags instead of doing it manually. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13[PATCH] schedule_timeout_[un]interruptible() speedupAndrew Morton1-3/+6
These functions don't need schedule_timeout()'s barrier. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-12[PATCH] x86-64: Some cleanup and optimization to the processor data area.Andi Kleen1-1/+1
- Remove unused irqrsp field - Remove pda->me - Optimize set_softirq_pending slightly Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-12[PATCH] cpuset semaphore depth check optimizePaul Jackson1-4/+9
Optimize the deadlock avoidance check on the global cpuset semaphore cpuset_sem. Instead of adding a depth counter to the task struct of each task, rather just two words are enough, one to store the depth and the other the current cpuset_sem holder. Thanks to Nikita Danilov for the idea. Signed-off-by: Paul Jackson <pj@sgi.com> [ We may want to change this further, but at least it's now a totally internal decision to the cpusets code ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-12Mark ia64-specific MCA/INIT scheduler hooks as dangerousLinus Torvalds1-26/+44
..and only enable them for ia64. The functions are only valid when the whole system has been totally stopped and no scheduler activity is ongoing on any CPU, and interrupts are globally disabled. In other words, they aren't useful for anything else. So make sure that nobody can use them by mistake. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-11[PATCH] MCA/INIT: scheduler hooksKeith Owens1-0/+26
Scheduler hooks to see/change which process is deemed to be on a cpu. Signed-off-by: Keith Owens <kaos@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-09-10[PATCH] kernel: fix-up schedule_timeout() usageNishanth Aravamudan3-20/+10
Use schedule_timeout_{,un}interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] add schedule_timeout_{,un}interruptible() interfacesNishanth Aravamudan1-0/+14
Add schedule_timeout_{,un}interruptible() interfaces so that schedule_timeout() callers don't have to worry about forgetting to add the set_current_state() call beforehand. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] kernel/acct: add kerneldocRandy Dunlap1-16/+27
for kernel/acct.c: - fix typos - add kerneldoc for non-static functions Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: allow the load to grow upto its cpu_powerSiddha, Suresh B1-2/+7
Don't pull tasks from a group if that would cause the group's total load to drop below its total cpu_power (ie. cause the group to start going idle). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: don't kick ALB in the presence of pinned taskSiddha, Suresh B1-1/+14
Jack Steiner brought this issue at my OLS talk. Take a scenario where two tasks are pinned to two HT threads in a physical package. Idle packages in the system will keep kicking migration_thread on the busy package with out any success. We will run into similar scenarios in the presence of CMP/NUMA. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: use cached variable in sys_sched_yield()Renaud Lienhart1-1/+1
In sys_sched_yield(), we cache current->array in the "array" variable, thus there's no need to dereference "current" again later. Signed-Off-By: Renaud Lienhart <renaud.lienhart@free.fr> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: HT optimisationNick Piggin1-6/+28
If an idle sibling of an HT queue encounters a busy sibling, then make higher level load balancing of the non-idle variety. Performance of multiprocessor HT systems with low numbers of tasks (generally < number of virtual CPUs) can be significantly worse than the exact same workloads when running in non-HT mode. The reason is largely due to poor scheduling behaviour. This patch improves the situation, making the performance gap far less significant on one problematic test case (tbench). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: less lockingNick Piggin1-7/+2
During periodic load balancing, don't hold this runqueue's lock while scanning remote runqueues, which can take a non trivial amount of time especially on very large systems. Holding the runqueue lock will only help to stabilise ->nr_running, however this doesn't do much to help because tasks being woken will simply get held up on the runqueue lock, so ->nr_running would not provide a really accurate picture of runqueue load in that case anyway. What's more, ->nr_running (and possibly the cpu_load averages) of remote runqueues won't be stable anyway, so load balancing is always an inexact operation. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: less newidle lockingNick Piggin1-7/+10
Similarly to the earlier change in load_balance, only lock the runqueue in load_balance_newidle if the busiest queue found has a nr_running > 1. This will reduce frequency of expensive remote runqueue lock aquisitions in the schedule() path on some workloads. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: fix SMT scheduler latency bugIngo Molnar1-4/+15
William Weston reported unusually high scheduling latencies on his x86 HT box, on the -RT kernel. I managed to reproduce it on my HT box and the latency tracer shows the incident in action: _------=> CPU# / _-----=> irqs-off | / _----=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth |||| / ||||| delay cmd pid ||||| time | caller \ / ||||| \ | / du-2803 3Dnh2 0us : __trace_start_sched_wakeup (try_to_wake_up) .............................................................. ... we are running on CPU#3, PID 2778 gets woken to CPU#1: ... .............................................................. du-2803 3Dnh2 0us : __trace_start_sched_wakeup <<...>-2778> (73 1) du-2803 3Dnh2 0us : _raw_spin_unlock (try_to_wake_up) ................................................ ... still on CPU#3, we send an IPI to CPU#1: ... ................................................ du-2803 3Dnh1 0us : resched_task (try_to_wake_up) du-2803 3Dnh1 1us : smp_send_reschedule (try_to_wake_up) du-2803 3Dnh1 1us : send_IPI_mask_bitmask (smp_send_reschedule) du-2803 3Dnh1 2us : _raw_spin_unlock_irqrestore (try_to_wake_up) ............................................... ... 1 usec later, the IPI arrives on CPU#1: ... ............................................... <idle>-0 1Dnh. 2us : smp_reschedule_interrupt (c0100c5a 0 0) So far so good, this is the normal wakeup/preemption mechanism. But here comes the scheduler anomaly on CPU#1: <idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched) <idle>-0 1Dnh. 2us : preempt_schedule_irq (need_resched) <idle>-0 1Dnh. 3us : __schedule (preempt_schedule_irq) <idle>-0 1Dnh. 3us : profile_hit (__schedule) <idle>-0 1Dnh1 3us : sched_clock (__schedule) <idle>-0 1Dnh1 4us : _raw_spin_lock_irq (__schedule) <idle>-0 1Dnh1 4us : _raw_spin_lock_irqsave (__schedule) <idle>-0 1Dnh2 5us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh1 5us : preempt_schedule (__schedule) <idle>-0 1Dnh1 6us : _raw_spin_lock (__schedule) <idle>-0 1Dnh2 6us : find_next_bit (__schedule) <idle>-0 1Dnh2 6us : _raw_spin_lock (__schedule) <idle>-0 1Dnh3 7us : find_next_bit (__schedule) <idle>-0 1Dnh3 7us : find_next_bit (__schedule) <idle>-0 1Dnh3 8us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh2 8us : preempt_schedule (__schedule) <idle>-0 1Dnh2 8us : find_next_bit (__schedule) <idle>-0 1Dnh2 9us : trace_stop_sched_switched (__schedule) <idle>-0 1Dnh2 9us : _raw_spin_lock (trace_stop_sched_switched) <idle>-0 1Dnh3 10us : trace_stop_sched_switched <<...>-2778> (73 8c) <idle>-0 1Dnh3 10us : _raw_spin_unlock (trace_stop_sched_switched) <idle>-0 1Dnh1 10us : _raw_spin_unlock (__schedule) <idle>-0 1Dnh. 11us : local_irq_enable_noresched (preempt_schedule_irq) <idle>-0 1Dnh. 11us < (0) we didnt pick up pid 2778! It only gets scheduled much later: <...>-2778 1Dnh2 412us : __switch_to (__schedule) <...>-2778 1Dnh2 413us : __schedule <<idle>-0> (8c 73) <...>-2778 1Dnh2 413us : _raw_spin_unlock (__schedule) <...>-2778 1Dnh1 413us : trace_stop_sched_switched (__schedule) <...>-2778 1Dnh1 414us : _raw_spin_lock (trace_stop_sched_switched) <...>-2778 1Dnh2 414us : trace_stop_sched_switched <<...>-2778> (73 1) <...>-2778 1Dnh2 414us : _raw_spin_unlock (trace_stop_sched_switched) <...>-2778 1Dnh1 415us : trace_stop_sched_switched (__schedule) the reason for this anomaly is the following code in dependent_sleeper(): /* * If a user task with lower static priority than the * running task on the SMT sibling is trying to schedule, * delay it till there is proportionately less timeslice * left of the sibling task to prevent a lower priority * task from using an unfair proportion of the * physical cpu's resources. -ck */ [...] if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100) > task_timeslice(p))) ret = 1; Note that in contrast to the comment above, we dont actually do the check based on static priority, we do the check based on timeslices. But timeslices go up and down, and even highprio tasks can randomly have very low timeslices (just before their next refill) and can thus be judged as 'lowprio' by the above piece of code. This condition is clearly buggy. The correct test is to check for static_prio _and_ to check for the preemption priority. Even on different static priority levels, a higher-prio interactive task should not be delayed due to a higher-static-prio CPU hog. There is a symmetric bug in the 'kick SMT sibling' code of this function as well, which can be solved in a similar way. The patch below (against the current scheduler queue in -mm) fixes both bugs. I have build and boot-tested this on x86 SMT, and nice +20 tasks still get properly throttled - so the dependent-sleeper logic is still in action. btw., these bugs pessimised the SMT scheduler because the 'delay wakeup' property was applied too liberally, so this fix is likely a throughput improvement as well. I separated out a smt_slice() function to make the code easier to read. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: TASK_NONINTERACTIVEIngo Molnar1-1/+10
This patch implements a task state bit (TASK_NONINTERACTIVE), which can be used by blocking points to mark the task's wait as "non-interactive". This does not mean the task will be considered a CPU-hog - the wait will simply not have an effect on the waiting task's priority - positive or negative alike. Right now only pipe_wait() will make use of it, because it's a common source of not-so-interactive waits (kernel compilation jobs, etc.). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched cleanupsIngo Molnar1-19/+25
whitespace cleanups. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: make idlest_group/cpu cpus_allowed-awareM.Baris Demiray1-4/+13
Add relevant checks into find_idlest_group() and find_idlest_cpu() to make them return only the groups that have allowed CPUs and allowed CPUs respectively. Signed-off-by: M.Baris Demiray <baris@labristeknoloji.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] sched: run SCHED_NORMAL tasks with real time tasks on SMT siblingsCon Kolivas1-18/+47
The hyperthread aware nice handling currently puts to sleep any non real time task when a real time task is running on its sibling cpu. This can lead to prolonged starvation by having the non real time task pegged to the cpu with load balancing not pulling that task away. Currently we force lower priority hyperthread tasks to run a percentage of time difference based on timeslice differences which is meaningless when comparing real time tasks to SCHED_NORMAL tasks. We can allow non real time tasks to run with real time tasks on the sibling up to per_cpu_gain% if we use jiffies as a counter. Cleanups and micro-optimisations to the relevant code section should make it more understandable as well. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] cpuset semaphore depth check deadlock fixPaul Jackson1-40/+60
The cpusets-formalize-intermediate-gfp_kernel-containment patch has a deadlock problem. This patch was part of a set of four patches to make more extensive use of the cpuset 'mem_exclusive' attribute to manage kernel GFP_KERNEL memory allocations and to constrain the out-of-memory (oom) killer. A task that is changing cpusets in particular ways on a system when it is very short of free memory could double trip over the global cpuset_sem semaphore (get the lock and then deadlock trying to get it again). The second attempt to get cpuset_sem would be in the routine cpuset_zone_allowed(). This was discovered by code inspection. I can not reproduce the problem except with an artifically hacked kernel and a specialized stress test. In real life you cannot hit this unless you are manipulating cpusets, and are very unlikely to hit it unless you are rapidly modifying cpusets on a memory tight system. Even then it would be a rare occurence. If you did hit it, the task double tripping over cpuset_sem would deadlock in the kernel, and any other task also trying to manipulate cpusets would deadlock there too, on cpuset_sem. Your batch manager would be wedged solid (if it was cpuset savvy), but classic Unix shells and utilities would work well enough to reboot the system. The unusual condition that led to this bug is that unlike most semaphores, cpuset_sem _can_ be acquired while in the page allocation code, when __alloc_pages() calls cpuset_zone_allowed. So it easy to mistakenly perform the following sequence: 1) task makes system call to alter a cpuset 2) take cpuset_sem 3) try to allocate memory 4) memory allocator, via cpuset_zone_allowed, trys to take cpuset_sem 5) deadlock The reason that this is not a serious bug for most users is that almost all calls to allocate memory don't require taking cpuset_sem. Only some code paths off the beaten track require taking cpuset_sem -- which is good. Taking a global semaphore on the main code path for allocating memory would not scale well. This patch fixes this deadlock by wrapping the up() and down() calls on cpuset_sem in kernel/cpuset.c with code that tracks the nesting depth of the current task on that semaphore, and only does the real down() if the task doesn't hold the lock already, and only does the real up() if the nesting depth (number of unmatched downs) is exactly one. The previous required use of refresh_mems(), anytime that the cpuset_sem semaphore was acquired and the code executed while holding that semaphore might try to allocate memory, is no longer required. Two refresh_mems() calls were removed thanks to this. This is a good change, as failing to get all the necessary refresh_mems() calls placed was a primary source of bugs in this cpuset code. The only remaining call to refresh_mems() is made while doing a memory allocation, if certain task memory placement data needs to be updated from its cpuset, due to the cpuset having been changed behind the tasks back. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] spinlock consolidationIngo Molnar3-6/+14
This patch (written by me and also containing many suggestions of Arjan van de Ven) does a major cleanup of the spinlock code. It does the following things: - consolidates and enhances the spinlock/rwlock debugging code - simplifies the asm/spinlock.h files - encapsulates the raw spinlock type and moves generic spinlock features (such as ->break_lock) into the generic code. - cleans up the spinlock code hierarchy to get rid of the spaghetti. Most notably there's now only a single variant of the debugging code, located in lib/spinlock_debug.c. (previously we had one SMP debugging variant per architecture, plus a separate generic one for UP builds) Also, i've enhanced the rwlock debugging facility, it will now track write-owners. There is new spinlock-owner/CPU-tracking on SMP builds too. All locks have lockup detection now, which will work for both soft and hard spin/rwlock lockups. The arch-level include files now only contain the minimally necessary subset of the spinlock code - all the rest that can be generalized now lives in the generic headers: include/asm-i386/spinlock_types.h | 16 include/asm-x86_64/spinlock_types.h | 16 I have also split up the various spinlock variants into separate files, making it easier to see which does what. The new layout is: SMP | UP ----------------------------|----------------------------------- asm/spinlock_types_smp.h | linux/spinlock_types_up.h linux/spinlock_types.h | linux/spinlock_types.h asm/spinlock_smp.h | linux/spinlock_up.h linux/spinlock_api_smp.h | linux/spinlock_api_up.h linux/spinlock.h | linux/spinlock.h /* * here's the role of the various spinlock/rwlock related include files: * * on SMP builds: * * asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the * initializers * * linux/spinlock_types.h: * defines the generic type and initializers * * asm/spinlock.h: contains the __raw_spin_*()/etc. lowlevel * implementations, mostly inline assembly code * * (also included on UP-debug builds:) * * linux/spinlock_api_smp.h: * contains the prototypes for the _spin_*() APIs. * * linux/spinlock.h: builds the final spin_*() APIs. * * on UP builds: * * linux/spinlock_type_up.h: * contains the generic, simplified UP spinlock type. * (which is an empty structure on non-debug builds) * * linux/spinlock_types.h: * defines the generic type and initializers * * linux/spinlock_up.h: * contains the __raw_spin_*()/etc. version of UP * builds. (which are NOPs on non-debug, non-preempt * builds) * * (included on UP-non-debug builds:) * * linux/spinlock_api_up.h: * builds the _spin_*() APIs. * * linux/spinlock.h: builds the final spin_*() APIs. */ All SMP and UP architectures are converted by this patch. arm, i386, ia64, ppc, ppc64, s390/s390x, x64 was build-tested via crosscompilers. m32r, mips, sh, sparc, have not been tested yet, but should be mostly fine. From: Grant Grundler <grundler@parisc-linux.org> Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU). Builds 32-bit SMP kernel (not booted or tested). I did not try to build non-SMP kernels. That should be trivial to fix up later if necessary. I converted bit ops atomic_hash lock to raw_spinlock_t. Doing so avoids some ugly nesting of linux/*.h and asm/*.h files. Those particular locks are well tested and contained entirely inside arch specific code. I do NOT expect any new issues to arise with them. If someone does ever need to use debug/metrics with them, then they will need to unravel this hairball between spinlocks, atomic ops, and bit ops that exist only because parisc has exactly one atomic instruction: LDCW (load and clear word). From: "Luck, Tony" <tony.luck@intel.com> ia64 fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjanv@infradead.org> Signed-off-by: Grant Grundler <grundler@parisc-linux.org> Cc: Matthew Wilcox <willy@debian.org> Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se> Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] files: files struct with RCUDipankar Sarma2-13/+25
Patch to eliminate struct files_struct.file_lock spinlock on the reader side and use rcu refcounting rcuref_xxx api for the f_count refcounter. The updates to the fdtable are done by allocating a new fdtable structure and setting files->fdt to point to the new structure. The fdtable structure is protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A global list of fdtable to be freed is not scalable, so we use a per-cpu list. If keventd is already handling the current cpu's work, we use a timer to defer queueing of that work. Since the last publication, this patch has been re-written to avoid using explicit memory barriers and use rcu_assign_pointer(), rcu_dereference() premitives instead. This required that the fd information is kept in a separate structure (fdtable) and updated atomically. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] files: break up files structDipankar Sarma2-40/+63
In order for the RCU to work, the file table array, sets and their sizes must be updated atomically. Instead of ensuring this through too many memory barriers, we put the arrays and their sizes in a separate structure. This patch takes the first step of putting the file table elements in a separate structure fdtable that is embedded withing files_struct. It also changes all the users to refer to the file table using files_fdtable() macro. Subsequent applciation of RCU becomes easier after this. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] files: rcuref APIsDipankar Sarma1-0/+14
Adds a set of primitives to do reference counting for objects that are looked up without locks using RCU. Signed-off-by: Ravikiran Thirumalai <kiran_th@gmail.com> Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] fix for cpusets minor problemKUROSAWA Takahiro1-0/+4
This patch fixes minor problem that the CPUSETS have when files in the cpuset filesystem are read after lseek()-ed beyond the EOF. Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] Prefetch kernel stacks to speed up context switchChen, Kenneth W1-0/+1
For architecture like ia64, the switch stack structure is fairly large (currently 528 bytes). For context switch intensive application, we found that significant amount of cache misses occurs in switch_to() function. The following patch adds a hook in the schedule() function to prefetch switch stack structure as soon as 'next' task is determined. This allows maximum overlap in prefetch cache lines for that structure. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] fix disassociate_ctty vs. fork raceJason Baron1-0/+3
Race is as follows. Process A forks process B, both being part of the same session. Then, A calls disassociate_ctty while B forks C: A B ==== ==== fork() copy_signal() dissasociate_ctty() .... attach_pid(p, PIDTYPE_SID, p->signal->session); Now, C can have current->signal->tty pointing to a freed tty structure, as it hasn't yet been added to the session group (to have its controlling tty cleared on the diassociate_ctty() call). This has shown up as an oops but could be even more serious. I haven't tried to create a test case, but a customer has verified that the patch below resolves the issue, which was occuring quite frequently. I'll try and post the test case if i can. The patch simply checks for a NULL tty *after* it has been attached to the proper session group and clears it as necessary. Alternatively, we could simply do the tty assignment after the the process is added to the proper session group. Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] Clear task_struct->fs_excl on fork()Giancarlo Formicuccia1-0/+1
An oversight. We don't want to carry the IO scheduler's "we hold exclusive fs resources" hint over to the child across fork(). Acked-by: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-08Merge linux-2.6 with linux-acpi-2.6Len Brown28-290/+1128
2005-09-07Merge branch 'upstream' of ↵Linus Torvalds1-0/+1
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6
2005-09-07[PATCH] kprobes: fix bug when probed on task and isr functionsKeshavamurthy Anil S1-0/+22
This patch fixes a race condition where in system used to hang or sometime crash within minutes when kprobes are inserted on ISR routine and a task routine. The fix has been stress tested on i386, ia64, pp64 and on x86_64. To reproduce the problem insert kprobes on schedule() and do_IRQ() functions and you should see hang or system crash. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] Kprobes: prevent possible race conditions genericPrasanna S Panchamukhi1-29/+43
There are possible race conditions if probes are placed on routines within the kprobes files and routines used by the kprobes. For example if you put probe on get_kprobe() routines, the system can hang while inserting probes on any routine such as do_fork(). Because while inserting probes on do_fork(), register_kprobes() routine grabs the kprobes spin lock and executes get_kprobe() routine and to handle probe of get_kprobe(), kprobes_handler() gets executed and tries to grab kprobes spin lock, and spins forever. This patch avoids such possible race conditions by preventing probes on routines within the kprobes file and routines used by kprobes. I have modified the patches as per Andi Kleen's suggestion to move kprobes routines and other routines used by kprobes to a seperate section .kprobes.text. Also moved page fault and exception handlers, general protection fault to .kprobes.text section. These patches have been tested on i386, x86_64 and ppc64 architectures, also compiled on ia64 and sparc64 architectures. Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] introduce and use kzallocPekka J Enberg5-10/+6
This patch introduces a kzalloc wrapper and converts kernel/ to use it. It saves a little program text. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] remove duplicated code from proc and ptraceMiklos Szeredi1-13/+28
Extract common code used by ptrace_attach() and may_ptrace_attach() into a separate function. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: <viro@parcelfarce.linux.theplanet.co.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: re-enable "dynamic sched domains"John Hawkes1-12/+0
Revert the hack introduced last week. Signed-off-by: John Hawkes <hawkes@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: fix the "dynamic sched domains" bugJohn Hawkes1-20/+69
For a NUMA system with multiple CPUs per node, declaring a cpu-exclusive cpuset that includes only some, but not all, of the CPUs in a node will mangle the sched domain structures. Signed-off-by: John Hawkes <hawkes@sgi.com> Cc; Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: Move the ia64 domain setup code to the generic codeJohn Hawkes1-54/+236
Signed-off-by: John Hawkes <hawkes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: confine oom_killer to mem_exclusive cpusetPaul Jackson1-0/+33
Now the real motivation for this cpuset mem_exclusive patch series seems trivial. This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes. This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] cpusets: formalize intermediate GFP_KERNEL containmentPaul Jackson1-8/+72
This patch makes use of the previously underutilized cpuset flag 'mem_exclusive' to provide what amounts to another layer of memory placement resolution. With this patch, there are now the following four layers of memory placement available: 1) The whole system (interrupt and GFP_ATOMIC allocations can use this), 2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use), 3) The current tasks cpuset (GFP_USER allocations constrained to here), and 4) Specific node placement, using mbind and set_mempolicy. These nest - each layer is a subset (same or within) of the previous. Layer (2) above is new, with this patch. The call used to check whether a zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is extended to take a gfp_mask argument, and its logic is extended, in the case that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if placement is allowed. The definition of GFP_USER, which used to be identical to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous cpuset_gfp_hardwall_flag patch. GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks cpuset, so long as any node therein is not too tight on memory, but will escape to the larger layer, if need be. The intended use is to allow something like a batch manager to handle several jobs, each job in its own cpuset, but using common kernel memory for caches and such. Swapper and oom_kill activity is also constrained to Layer (2). A task in or below one mem_exclusive cpuset should not cause swapping on nodes in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a task in another such cpuset. Heavy use of kernel memory for i/o caching and such by one job should not impact the memory available to jobs in other non-overlapping mem_exclusive cpusets. This patch enables providing hardwall, inescapable cpusets for memory allocations of each job, while sharing kernel memory allocations between several jobs, in an enclosing mem_exclusive cpuset. Like Dinakar's patch earlier to enable administering sched domains using the cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag that had previously done nothing much useful other than restrict what cpuset configurations were allowed. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] futex: remove duplicate codePekka Enberg1-12/+9
This patch cleans up the error path of futex_fd() by removing duplicate code. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] fix send_sigqueue() vs thread exit raceOleg Nesterov2-23/+27
posix_timer_event() first checks that the thread (SIGEV_THREAD_ID case) does not have PF_EXITING flag, then it calls send_sigqueue() which locks task list. But if the thread exits in between the kernel will oops (->sighand == NULL after __exit_sighand). This patch moves the PF_EXITING check into the send_sigqueue(), it must be done atomically under tasklist_lock. When send_sigqueue() detects exiting thread it returns -1. In that case posix_timer_event will send the signal to thread group. Also, this patch fixes task_struct use-after-free in posix_timer_event. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] remove a redundant variable in sys_prctl()Jesper Juhl1-4/+2
The patch removes a redundant variable `sig' from sys_prctl(). For some reason, when sys_prctl is called with option == PR_SET_PDEATHSIG then the value of arg2 is assigned to an int variable named sig. Then sig is tested with valid_signal() and later used to set the value of current->pdeath_signal . There is no reason to use this intermediate variable since valid_signal() takes a unsigned long argument, so it can handle being passed arg2 directly, and if the call to valid_signal is OK, then we know the value of arg2 is in the range zero to _NSIG and thus it'll easily fit in a plain int and thus there's no problem assigning it later to current->pdeath_signal (which is an int). The patch gets rid of the pointless variable `sig'. This reduces the size of kernel/sys.o in 2.6.13-rc6-mm1 by 32 bytes on my system. Patch has been compile tested, boot tested, and just to make damn sure I didn't break anything I wrote a quick test app that calls prctl(PR_SET_PDEATHSIG ...) with the entire range of values for a unsigned long, and it behaves as expected with and without the patch. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] largefile support for accountingPeter Staubach1-1/+1
There is a problem in the accounting subsystem in the kernel can not correctly handle files larger than 2GB. The output file containing the process accounting data can grow very large if the system is large enough and active enough. If the 2GB limit is reached, then the system simply stops storing process accounting data. Another annoying problem is that once the system reaches this 2GB limit, then every process which exits will receive a signal, SIGXFSZ. This signal is generated because an attempt was made to write beyond the limit for the file descriptor. This signal makes it look like every process has exited due to a signal, when in fact, they have not. The solution is to add the O_LARGEFILE flag to the list of flags used to open the accounting file. The rest of the accounting support is already largefile safe. The changes were tested by constructing a large file (just short of 2GB), enabling accounting, and then running enough commands to cause the accounting data generated to increase the size of the file to 2GB. Without the changes, the file grows to 2GB and the last command run in the test script appears to exit due a signal when it has not. With the changes, things work as expected and quietly. There are some user level changes required so that it can deal with largefiles, but those are being handled separately. Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] do_notify_parent_cldstop() cleanupOleg Nesterov1-35/+26
This patch simplifies the usage of do_notify_parent_cldstop(), it lessens the source and .text size slightly, and makes the code (in my opinion) a bit more readable. I am sending this patch now because I'm afraid Paul will touch do_notify_parent_cldstop() really soon, It's better to cleanup first. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] CHECK_IRQ_PER_CPU() to avoid dead code in __do_IRQ()Karsten Wiese1-1/+1
IRQ_PER_CPU is not used by all architectures. This patch introduces the macros ARCH_HAS_IRQ_PER_CPU and CHECK_IRQ_PER_CPU() to avoid the generation of dead code in __do_IRQ(). ARCH_HAS_IRQ_PER_CPU is defined by architectures using IRQ_PER_CPU in their include/asm_ARCH/irq.h file. Through grepping the tree I found the following architectures currently use IRQ_PER_CPU: cris, ia64, ppc, ppc64 and parisc. Signed-off-by: Karsten Wiese <annabellesgarden@yahoo.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] create_workqueue_thread() signedness fixMika Kukkonen1-1/+1
With "-W -Wno-unused -Wno-sign-compare" I get the following compile warning: CC kernel/workqueue.o kernel/workqueue.c: In function `workqueue_cpu_callback': kernel/workqueue.c:504: warning: ordered comparison of pointer with integer zero On error create_workqueue_thread() returns NULL, not negative pointer, so following trivial patch suggests itself. Signed-off-by: Mika Kukkonen <mikukkon@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] flush icache early when loading moduleThomas Koeller1-14/+19
Change the sequence of operations performed during module loading to flush the instruction cache before module parameters are processed. If a module has parameters of an unusual type that cannot be handled using the standard accessor functions param_set_xxx and param_get_xxx, it has to to provide a set of accessor functions for this type. This requires module code to be executed during parameter processing, which is of course only possible after the icache has been flushed. Signed-off-by: Thomas Koeller <thomas@koeller.dyndns.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] optimize writer path in time_interpolator_get_counter()Alex Williamson1-4/+13
Christoph Lameter <clameter@engr.sgi.com> When using a time interpolator that is susceptible to jitter there's potentially contention over a cmpxchg used to prevent time from going backwards. This is unnecessary when the caller holds the xtime write seqlock as all readers will be blocked from returning until the write is complete. We can therefore allow writers to insert a new value and exit rather than fight with CPUs who only hold a reader lock. Signed-off-by: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] Provide better printk() support for SMP machinesDavid Howells1-1/+12
The attached patch prevents oopses interleaving with characters from other printks on other CPUs by only breaking the lock if the oops is happening on the machine holding the lock. It might be better if the oops generator got the lock and then called an inner vprintk routine that assumed the caller holds the lock, thus making oops reports "atomic". Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] detect soft lockupsIngo Molnar4-0/+154
This patch adds a new kernel debug feature: CONFIG_DETECT_SOFTLOCKUP. When enabled then per-CPU watchdog threads are started, which try to run once per second. If they get delayed for more than 10 seconds then a callback from the timer interrupt detects this condition and prints out a warning message and a stack dump (once per lockup incident). The feature is otherwise non-intrusive, it doesnt try to unlock the box in any way, it only gets the debug info out, automatically, and on all CPUs affected by the lockup. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-Off-By: Matthias Urlichs <smurf@smurf.noris.de> Signed-off-by: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] FUTEX_WAKE_OP: pthread_cond_signal() speedupJakub Jelinek1-0/+116
ATM pthread_cond_signal is unnecessarily slow, because it wakes one waiter (which at least on UP usually means an immediate context switch to one of the waiter threads). This waiter wakes up and after a few instructions it attempts to acquire the cv internal lock, but that lock is still held by the thread calling pthread_cond_signal. So it goes to sleep and eventually the signalling thread is scheduled in, unlocks the internal lock and wakes the waiter again. Now, before 2003-09-21 NPTL was using FUTEX_REQUEUE in pthread_cond_signal to avoid this performance issue, but it was removed when locks were redesigned to the 3 state scheme (unlocked, locked uncontended, locked contended). Following scenario shows why simply using FUTEX_REQUEUE in pthread_cond_signal together with using lll_mutex_unlock_force in place of lll_mutex_unlock is not enough and probably why it has been disabled at that time: The number is value in cv->__data.__lock. thr1 thr2 thr3 0 pthread_cond_wait 1 lll_mutex_lock (cv->__data.__lock) 0 lll_mutex_unlock (cv->__data.__lock) 0 lll_futex_wait (&cv->__data.__futex, futexval) 0 pthread_cond_signal 1 lll_mutex_lock (cv->__data.__lock) 1 pthread_cond_signal 2 lll_mutex_lock (cv->__data.__lock) 2 lll_futex_wait (&cv->__data.__lock, 2) 2 lll_futex_requeue (&cv->__data.__futex, 0, 1, &cv->__data.__lock) # FUTEX_REQUEUE, not FUTEX_CMP_REQUEUE 2 lll_mutex_unlock_force (cv->__data.__lock) 0 cv->__data.__lock = 0 0 lll_futex_wake (&cv->__data.__lock, 1) 1 lll_mutex_lock (cv->__data.__lock) 0 lll_mutex_unlock (cv->__data.__lock) # Here, lll_mutex_unlock doesn't know there are threads waiting # on the internal cv's lock Now, I believe it is possible to use FUTEX_REQUEUE in pthread_cond_signal, but it will cost us not one, but 2 extra syscalls and, what's worse, one of these extra syscalls will be done for every single waiting loop in pthread_cond_*wait. We would need to use lll_mutex_unlock_force in pthread_cond_signal after requeue and lll_mutex_cond_lock in pthread_cond_*wait after lll_futex_wait. Another alternative is to do the unlocking pthread_cond_signal needs to do (the lock can't be unlocked before lll_futex_wake, as that is racy) in the kernel. I have implemented both variants, futex-requeue-glibc.patch is the first one and futex-wake_op{,-glibc}.patch is the unlocking inside of the kernel. The kernel interface allows userland to specify how exactly an unlocking operation should look like (some atomic arithmetic operation with optional constant argument and comparison of the previous futex value with another constant). It has been implemented just for ppc*, x86_64 and i?86, for other architectures I'm including just a stub header which can be used as a starting point by maintainers to write support for their arches and ATM will just return -ENOSYS for FUTEX_WAKE_OP. The requeue patch has been (lightly) tested just on x86_64, the wake_op patch on ppc64 kernel running 32-bit and 64-bit NPTL and x86_64 kernel running 32-bit and 64-bit NPTL. With the following benchmark on UP x86-64 I get: for i in nptl-orig nptl-requeue nptl-wake_op; do echo time elf/ld.so --library-path .:$i /tmp/bench; \ for j in 1 2; do echo ( time elf/ld.so --library-path .:$i /tmp/bench ) 2>&1; done; done time elf/ld.so --library-path .:nptl-orig /tmp/bench real 0m0.655s user 0m0.253s sys 0m0.403s real 0m0.657s user 0m0.269s sys 0m0.388s time elf/ld.so --library-path .:nptl-requeue /tmp/bench real 0m0.496s user 0m0.225s sys 0m0.271s real 0m0.531s user 0m0.242s sys 0m0.288s time elf/ld.so --library-path .:nptl-wake_op /tmp/bench real 0m0.380s user 0m0.176s sys 0m0.204s real 0m0.382s user 0m0.175s sys 0m0.207s The benchmark is at: http://sourceware.org/ml/libc-alpha/2005-03/txt00001.txt Older futex-requeue-glibc.patch version is at: http://sourceware.org/ml/libc-alpha/2005-03/txt00002.txt Older futex-wake_op-glibc.patch version is at: http://sourceware.org/ml/libc-alpha/2005-03/txt00003.txt Will post a new version (just x86-64 fixes so that the patch applies against pthread_cond_signal.S) to libc-hacker ml soon. Attached is the kernel FUTEX_WAKE_OP patch as well as a simple-minded testcase that will not test the atomicity of the operation, but at least check if the threads that should have been woken up are woken up and whether the arithmetic operation in the kernel gave the expected results. Acked-by: Ingo Molnar <mingo@redhat.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Jamie Lokier <jamie@shareable.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] swsusp: update documentationPavel Machek1-1/+1
This updates documentation a bit (mostly removing obsolete stuff), and marks swsusp as no longer experimental in config. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] x86/x86_64: deferred handling of writes to /proc/irqxx/smp_affinityAshok Raj2-2/+16
When handling writes to /proc/irq, current code is re-programming rte entries directly. This is not recommended and could potentially cause chipset's to lockup, or cause missing interrupts. CONFIG_IRQ_BALANCE does this correctly, where it re-programs only when the interrupt is pending. The same needs to be done for /proc/irq handling as well. Otherwise user space irq balancers are really not doing the right thing. - Changed pending_irq_balance_cpumask to pending_irq_migrate_cpumask for lack of a generic name. - added move_irq out of IRQ_BALANCE, and added this same to X86_64 - Added new proc handler for write, so we can do deferred write at irq handling time. - Display of /proc/irq/XX/smp_affinity used to display CPU_MASKALL, instead it now shows only active cpu masks, or exactly what was set. - Provided a common move_irq implementation, instead of duplicating when using generic irq framework. Tested on i386/x86_64 and ia64 with CONFIG_PCI_MSI turned on and off. Tested UP builds as well. MSI testing: tbd: I have cards, need to look for a x-over cable, although I did test an earlier version of this patch. Will test in a couple days. Signed-off-by: Ashok Raj <ashok.raj@intel.com> Acked-by: Zwane Mwaikambo <zwane@holomorphy.com> Grudgingly-acked-by: Andi Kleen <ak@muc.de> Signed-off-by: Coywolf Qi Hunt <coywolf@lovecn.org> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[kernel-doc] fix various DocBook build problems/warningsJeff Garzik1-0/+1
Most serious is fixing include/sound/pcm.h, which breaks the DocBook build. The other stuff is just filling in things that cause warnings.
2005-09-05[PATCH] UML Support - Ptrace: adds the host SYSEMU support, for UML and ↵Laurent Vivier1-0/+3
general usage Jeff Dike <jdike@addtoit.com>, Paolo 'Blaisorblade' Giarrusso <blaisorblade_spam@yahoo.it>, Bodo Stroesser <bstroesser@fujitsu-siemens.com> Adds a new ptrace(2) mode, called PTRACE_SYSEMU, resembling PTRACE_SYSCALL except that the kernel does not execute the requested syscall; this is useful to improve performance for virtual environments, like UML, which want to run the syscall on their own. In fact, using PTRACE_SYSCALL means stopping child execution twice, on entry and on exit, and each time you also have two context switches; with SYSEMU you avoid the 2nd stop and so save two context switches per syscall. Also, some architectures don't have support in the host for changing the syscall number via ptrace(), which is currently needed to skip syscall execution (UML turns any syscall into getpid() to avoid it being executed on the host). Fixing that is hard, while SYSEMU is easier to implement. * This version of the patch includes some suggestions of Jeff Dike to avoid adding any instructions to the syscall fast path, plus some other little changes, by myself, to make it work even when the syscall is executed with SYSENTER (but I'm unsure about them). It has been widely tested for quite a lot of time. * Various fixed were included to handle the various switches between various states, i.e. when for instance a syscall entry is traced with one of PT_SYSCALL / _SYSEMU / _SINGLESTEP and another one is used on exit. Basically, this is done by remembering which one of them was used even after the call to ptrace_notify(). * We're combining TIF_SYSCALL_EMU with TIF_SYSCALL_TRACE or TIF_SINGLESTEP to make do_syscall_trace() notice that the current syscall was started with SYSEMU on entry, so that no notification ought to be done in the exit path; this is a bit of a hack, so this problem is solved in another way in next patches. * Also, the effects of the patch: "Ptrace - i386: fix Syscall Audit interaction with singlestep" are cancelled; they are restored back in the last patch of this series. Detailed descriptions of the patches doing this kind of processing follow (but I've already summed everything up). * Fix behaviour when changing interception kind #1. In do_syscall_trace(), we check the status of the TIF_SYSCALL_EMU flag only after doing the debugger notification; but the debugger might have changed the status of this flag because he continued execution with PTRACE_SYSCALL, so this is wrong. This patch fixes it by saving the flag status before calling ptrace_notify(). * Fix behaviour when changing interception kind #2: avoid intercepting syscall on return when using SYSCALL again. A guest process switching from using PTRACE_SYSEMU to PTRACE_SYSCALL crashes. The problem is in arch/i386/kernel/entry.S. The current SYSEMU patch inhibits the syscall-handler to be called, but does not prevent do_syscall_trace() to be called after this for syscall completion interception. The appended patch fixes this. It reuses the flag TIF_SYSCALL_EMU to remember "we come from PTRACE_SYSEMU and now are in PTRACE_SYSCALL", since the flag is unused in the depicted situation. * Fix behaviour when changing interception kind #3: avoid intercepting syscall on return when using SINGLESTEP. When testing 2.6.9 and the skas3.v6 patch, with my latest patch and had problems with singlestepping on UML in SKAS with SYSEMU. It looped receiving SIGTRAPs without moving forward. EIP of the traced process was the same for all SIGTRAPs. What's missing is to handle switching from PTRACE_SYSCALL_EMU to PTRACE_SINGLESTEP in a way very similar to what is done for the change from PTRACE_SYSCALL_EMU to PTRACE_SYSCALL_TRACE. I.e., after calling ptrace(PTRACE_SYSEMU), on the return path, the debugger is notified and then wake ups the process; the syscall is executed (or skipped, when do_syscall_trace() returns 0, i.e. when using PTRACE_SYSEMU), and do_syscall_trace() is called again. Since we are on the return path of a SYSEMU'd syscall, if the wake up is performed through ptrace(PTRACE_SYSCALL), we must still avoid notifying the parent of the syscall exit. Now, this behaviour is extended even to resuming with PTRACE_SINGLESTEP. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] pm: clean up /sys/power/diskPavel Machek1-2/+3
Clean code up a bit, and only show suspend to disk as available when it is configured in. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] pm: fix process freezingPavel Machek1-2/+22
If process freezing fails, some processes are frozen, and rest are left in "were asked to be frozen" state. Thats wrong, we should leave it in some consistent state. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swsusp: fix error handling and cleanupsPavel Machek2-35/+22
Drop printing during normal boot (when no image exists in swap), print message when drivers fail, fix error paths and consolidate near-identical functions in disk.c (and functions with just one statement). Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swsusp: add locking to software_resumeShaohua Li1-1/+9
It is trying to protect swsusp_resume_device and software_resume() from two users banging it from userspace at the same time. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swsusp: simpler calculation of number of pages in PBE listMichal Schmidt1-12/+1
The function calc_nr uses an iterative algorithm to calculate the number of pages needed for the image and the pagedir. Exactly the same result can be obtained with a one-line expression. Note that this was even proved correct ;-). Signed-off-by: Michal Schmidt <xschmi00@stud.feec.vutbr.cz> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] encrypt suspend data for easy wipingAndreas Steinmetz2-5/+171
The patch protects from leaking sensitive data after resume from suspend. During suspend a temporary key is created and this key is used to encrypt the data written to disk. When, during resume, the data was read back into memory the temporary key is destroyed which simply means that all data written to disk during suspend are then inaccessible so they can't be stolen lateron. Think of the following: you suspend while an application is running that keeps sensitive data in memory. The application itself prevents the data from being swapped out. Suspend, however, must write these data to swap to be able to resume lateron. Without suspend encryption your sensitive data are then stored in plaintext on disk. This means that after resume your sensitive data are accessible to all applications having direct access to the swap device which was used for suspend. If you don't need swap after resume these data can remain on disk virtually forever. Thus it can happen that your system gets broken in weeks later and sensitive data which you thought were encrypted and protected are retrieved and stolen from the swap device. Signed-off-by: Andreas Steinmetz <ast@domdv.de> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] remove busywait in refrigeratorPavel Machek1-2/+3
This should make refrigerator sleep properly, not busywait after the first schedule() returns. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: update swsusp use of swap_infoHugh Dickins1-6/+6
Aha, swsusp dips into swap_info[], better update it to swap_lock. It's bitflipping flags with 0xFF, so get_swap_page will allocate from only the one chosen device: let's change that to flip SWP_WRITEOK. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-03Merge linux-2.6 into linux-acpi-2.6 testLen Brown2-4/+3
2005-08-29[NET]: Fix sparse warningsArnaldo Carvalho de Melo1-3/+1
Of this type, mostly: CHECK net/ipv6/netfilter.c net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static? net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static? Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[NETLINK]: Add "groups" argument to netlink_kernel_createPatrick McHardy1-1/+1
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[NETLINK]: Add properly module refcounting for kernel netlink sockets.Harald Welte1-1/+2
- Remove bogus code for compiling netlink as module - Add module refcounting support for modules implementing a netlink protocol - Add support for autoloading modules that implement a netlink protocol as soon as someone opens a socket for that protocol Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-27Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.gitDavid Woodhouse4-4/+17
2005-08-27[AUDIT] Allow filtering on system call success _or_ failureDavid Woodhouse1-2/+6
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-08-26Auto-update from upstreamLen Brown1-19/+13
2005-08-26[PATCH] completely disable cpu_exclusive sched domainPaul Jackson1-0/+13
At the suggestion of Nick Piggin and Dinakar, totally disable the facility to allow cpu_exclusive cpusets to define dynamic sched domains in Linux 2.6.13, in order to avoid problems first reported by John Hawkes (corrupt sched data structures and kernel oops). This has been built for ppc64, i386, ia64, x86_64, sparc, alpha. It has been built, booted and tested for cpuset functionality on an SN2 (ia64). Dinakar or Nick - could you verify that it for sure does avoid the problems Hawkes reported. Hawkes is out of town, and I don't have the recipe to reproduce what he found. Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-26[PATCH] undo partial cpu_exclusive sched domain disablingPaul Jackson1-19/+0
The partial disabling of Dinakar's new facility to allow cpu_exclusive cpusets to define dynamic sched domains doesn't go far enough. At the suggestion of Nick Piggin and Dinakar, let us instead totally disable this facility for 2.6.13, in order to avoid problems first reported by John Hawkes (corrupt sched data structures and kernel oops). This patch removes the partial disabling code in 2.6.13-rc7, in anticipation of the next patch, which will totally disable it instead. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-25Merge HEAD from ../from-linus Len Brown1-0/+19
2005-08-25[ACPI] IA64-related ACPI Kconfig fixesLen Brown1-0/+1
Build issues were mostly in the ACPI=n case -- don't do that. Select ACPI from IA64_GENERIC. Add some missing dependencies on ACPI. Mark BLACKLIST_YEAR and some laptop-only ACPI drivers as X86-only. Let me know when you get an IA64 Laptop. Signed-off-by: Len Brown <len.brown@intel.com>
2005-08-24[PATCH] cpu_exclusive sched domains build fixPaul Jackson1-1/+3
As reported by Paul Mackerras <paulus@samba.org>, the previous patch "cpu_exclusive sched domains fix" broke the ppc64 build with CONFIC_CPUSET, yielding error messages: kernel/cpuset.c: In function 'update_cpu_domains': kernel/cpuset.c:648: error: invalid lvalue in unary '&' kernel/cpuset.c:648: error: invalid lvalue in unary '&' On some arch's, the node_to_cpumask() is a function, returning a cpumask_t. But the for_each_cpu_mask() requires an lvalue mask. The following patch fixes this build failure by making a copy of the cpumask_t on the stack. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-23[PATCH] cpu_exclusive sched domains on partial nodes temp fixPaul Jackson1-0/+17
This keeps the kernel/cpuset.c routine update_cpu_domains() from invoking the sched.c routine partition_sched_domains() if the cpuset in question doesn't fall on node boundaries. I have boot tested this on an SN2, and with the help of a couple of ad hoc printk's, determined that it does indeed avoid calling the partition_sched_domains() routine on partial nodes. I did not directly verify that this avoids setting up bogus sched domains or avoids the oops that Hawkes saw. This patch imposes a silent artificial constraint on which cpusets can be used to define dynamic sched domains. This patch should allow proceeding with this new feature in 2.6.13 for the configurations in which it is useful (node alligned sched domains) while avoiding trying to setup sched domains in the less useful cases that can cause the kernel corruption and oops. Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: John Hawkes <hawkes@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-23[PATCH] preempt race in getppidDavid Meybohm1-1/+1
With CONFIG_PREEMPT && !CONFIG_SMP, it's possible for sys_getppid to return a bogus value if the parent's task_struct gets reallocated after current->group_leader->real_parent is read: asmlinkage long sys_getppid(void) { int pid; struct task_struct *me = current; struct task_struct *parent; parent = me->group_leader->real_parent; RACE HERE => for (;;) { pid = parent->tgid; #ifdef CONFIG_SMP { struct task_struct *old = parent; /* * Make sure we read the pid before re-reading the * parent pointer: */ smp_rmb(); parent = me->group_leader->real_parent; if (old != parent) continue; } #endif break; } return pid; } If the process gets preempted at the indicated point, the parent process can go ahead and call exit() and then get wait()'d on to reap its task_struct. When the preempted process gets resumed, it will not do any further checks of the parent pointer on !CONFIG_SMP: it will read the bad pid and return. So, the same algorithm used when SMP is enabled should be used when preempt is enabled, which will recheck ->real_parent in this case. Signed-off-by: David Meybohm <dmeybohmlkml@bellsouth.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-18[PATCH] Make RLIMIT_NICE ranges consistent with getpriority(2)Matt Mackall1-2/+2
As suggested by Michael Kerrisk <mtk-manpages@gmx.net>, make RLIMIT_NICE consistent with getpriority before it becomes available in released glibc. Signed-off-by: Matt Mackall <mpm@selenic.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-17[PATCH] NPTL signal delivery deadlock fixBhavesh P. Davda1-1/+1
This bug is quite subtle and only happens in a very interesting situation where a real-time threaded process is in the middle of a coredump when someone whacks it with a SIGKILL. However, this deadlock leaves the system pretty hosed and you have to reboot to recover. Not good for real-time priority-preemption applications like our telephony application, with 90+ real-time (SCHED_FIFO and SCHED_RR) processes, many of them multi-threaded, interacting with each other for high volume call processing. Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-17AUDIT: Prevent duplicate syscall rulesAmy Griffis2-44/+56
The following patch against audit.81 prevents duplicate syscall rules in a given filter list by walking the list on each rule add. I also removed the unused struct audit_entry in audit.c and made the static inlines in auditsc.c consistent. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-08-17AUDIT: Speed up audit_filter_syscall() for the non-auditable case.David Woodhouse1-10/+12
It was showing up fairly high on profiles even when no rules were set. Make sure the common path stays as fast as possible. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-08-17AUDIT: Fix task refcount leak in audit_filter_syscall()David Woodhouse1-1/+2
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-08-17Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.gitDavid Woodhouse2-22/+48
2005-08-10[PATCH] remove name length check in a workqueueJames Bottomley1-2/+0
We have a chek in there to make sure that the name won't overflow task_struct.comm[], but it's triggering for scsi with lots of HBAs, only scsi is using single-threaded workqueues which don't append the "/%d" anyway. All too hard. Just kill the BUG_ON. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> [ kthread_create() uses vsnprintf() and limits the thing, so no actual overflow can actually happen regardless ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-09[PATCH] cpuset release ABBA deadlock fixPaul Jackson1-20/+48
Fix possible cpuset_sem ABBA deadlock if 'notify_on_release' set. For a particular usage pattern, creating and destroying cpusets fairly frequently using notify_on_release, on a very large system, this deadlock can be seen every few days. If you are not using the cpuset notify_on_release feature, you will never see this deadlock. The existing code, on task exit (or cpuset deletion) did: get cpuset_sem if cpuset marked notify_on_release and is ready to release: compute cpuset path relative to /dev/cpuset mount point call_usermodehelper() forks /sbin/cpuset_release_agent with path drop cpuset_sem Unfortunately, the fork in call_usermodehelper can allocate memory, and allocating memory can require cpuset_sem, if the mems_generation values changed in the interim. This results in an ABBA deadlock, trying to obtain cpuset_sem when it is already held by the current task. To fix this, I put the cpuset path (which must be computed while holding cpuset_sem) in a temporary buffer, to be used in the call_usermodehelper call of /sbin/cpuset_release_agent only _after_ dropping cpuset_sem. So the new logic is: get cpuset_sem if cpuset marked notify_on_release and is ready to release: compute cpuset path relative to /dev/cpuset mount point stash path in kmalloc'd buffer drop cpuset_sem call_usermodehelper() forks /sbin/cpuset_release_agent with path free path The sharp eyed reader might notice that this patch does not contain any calls to kmalloc. The existing code in the check_for_release() routine was already kmalloc'ing a buffer to hold the cpuset path. In the old code, it just held the buffer for a few lines, over the cpuset_release_agent() call that in turn invoked call_usermodehelper(). In the new code, with the application of this patch, it returns that buffer via the new char **ppathbuf parameter, for later use and freeing in cpuset_release_agent(), which is called after cpuset_sem is dropped. Whereas the old code has just one call to cpuset_release_agent(), right in the check_for_release() routine, the new code has three calls to cpuset_release_agent(), from the various places that a cpuset can be released. This patch has been build and booted on SN2, and passed a stress test that previously hit the deadlock within a few seconds. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-09Merge with /shiny/git/linux-2.6/.gitDavid Woodhouse12-60/+92
2005-08-04[PATCH] revert "timer exit cleanup"Andrew Morton2-2/+3
Revert this June 17 patch: it broke persistence of timers across execve(). Cc: Roland McGrath <roland@redhat.com> Cc: george anzinger <george@mvista.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04[PATCH] Remove suspend() calls from shutdown pathBenjamin Herrenschmidt1-2/+0
This removes the calls to device_suspend() from the shutdown path that were added sometime during 2.6.13-rc*. They aren't working properly on a number of configs (I got reports from both ppc powerbook users and x86 users) causing the system to not shutdown anymore. I think it isn't the right approach at the moment anyway. We have already a shutdown() callback for the drivers that actually care about shutdown and the suspend() code isn't yet in a good enough shape to be so much generalized. Also, the semantics of suspend and shutdown are slightly different on a number of setups and the way this was patched in provides little way for drivers to cleanly differenciate. It should have been at least a different message. For 2.6.13, I think we should revert to 2.6.12 behaviour and have a working suspend back. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01[PATCH] Module per-cpu alignment cannot always be metRusty Russell1-4/+11
The module code assumes noone will ever ask for a per-cpu area more than SMP_CACHE_BYTES aligned. However, as these cases show, gcc asks sometimes asks for 32-byte alignment for the per-cpu section on a module, and if CONFIG_X86_L1_CACHE_SHIFT is 4, we hit that BUG_ON(). This is obviously an unusual combination, as there have been few reports, but better to warn than die. See: http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/0768.html And more recently: http://bugs.gentoo.org/show_bug.cgi?id=97006 Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01[PATCH] remove sys_set_zone_reclaim()Ingo Molnar1-1/+0
This removes sys_set_zone_reclaim() for now. While i'm sure Martin is trying to solve a real problem, we must not hard-code an incomplete and insufficient approach into a syscall, because syscalls are pretty much for eternity. I am quite strongly convinced that this syscall must not hit v2.6.13 in its current form. Firstly, the syscall lacks basic syscall design: e.g. it allows the global setting of VM policy for unprivileged users. (!) [ Imagine an Oracle installation and a SAP installation on the same NUMA box fighting over the 'optimal' setting for this flag. What will they do? Will they try to set the flag to their own preferred value every second or so? ] Secondly, it was added based on a single datapoint from Martin: http://marc.theaimsgroup.com/?l=linux-mm&m=111763597218177&w=2 where Martin characterizes the numbers the following way: ' Run-to-run variability for "make -j" is huge, so these numbers aren't terribly useful except to see that with reclaim the benchmark still finishes in a reasonable amount of time. ' in other words: the fundamental problem has likely not been solved, only a tendential move into the right direction has been observed, and a handful of numbers were picked out of a set of hugely variable results, without showing the variability data. How much variance is there run-to-run? I'd really suggest to first walk the walk and see what's needed to get stable & predictable kernel compilation numbers on that NUMA box, before adding random syscalls to tune a particular aspect of the VM ... which approach might not even matter once the whole picture has been analyzed and understood! The third, most important point is that the syscall exposes VM tuning internals in a completely unstructured way. What sense does it make to have a _GLOBAL_ per-node setting for 'should we go to another node for reclaim'? If then it might make sense to do this per-app, via numalib or so. The change is minimalistic in that it doesnt remove the syscall and the underlying infrastructure changes, only the user-visible changes. We could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack, a'ka /proc/sys/vm/swappiness, but even that looks quite counterproductive when the generic approach is that we are trying to reduce the number of external factors in the VM balance picture. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-30[PATCH] revert bogus softirq changesAndrew Morton1-2/+2
This snuck in with an x86_64 change. Thanks to Richard Purdie <rpurdie@rpsys.net> for spotting it. Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-29[PATCH] reboot: remove device_suspend(PMSG_FREEZE) from kernel_kexecEric W. Biederman1-1/+0
If device_suspend(PMSG_FREEZE) is not ready to be called in kernel_restart it is definitely not ready to be called in the even more fickle kernel_kexec. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-28[PATCH] posix timers: fix normalization problemGeorge Anzinger1-14/+3
(We found this (after a customer complained) and it is in the kernel.org kernel. Seems that for CLOCK_MONOTONIC absolute timers and clock_nanosleep calls both the request time and wall_to_monotonic are subtracted prior to the normalize resulting in an overflow in the existing normalize test. This causes the result to be shifted ~4 seconds ahead instead of ~2 seconds back in time.) The normalize code in posix-timers.c fails when the tv_nsec member is ~1.2 seconds negative. This can happen on absolute timers (and clock_nanosleeps) requested on CLOCK_MONOTONIC (both the request time and wall_to_monotonic are subtracted resulting in the possibility of a number close to -2 seconds.) This fix uses the set_normalized_timespec() (which does not have an overflow problem) to fix the problem and as a side effect makes the code cleaner. Signed-off-by: George Anzinger <george@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-28[PATCH] x86_64: Switch to the interrupt stack when running a softirq in ↵Andi Kleen1-2/+2
local_bh_enable() This avoids some potential stack overflows with very deep softirq callchains. i386 does this too. TOADD CFI annotation Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] Avoid device suspend on rebootAndrew Morton1-1/+0
My fairly ordinary x86 test box gets stuck during reboot on the wait_for_completion() in ide_do_drive_cmd(): Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] clean up inline static vs static inlineJesper Juhl1-1/+1
`gcc -W' likes to complain if the static keyword is not at the beginning of the declaration. This patch fixes all remaining occurrences of "inline static" up with "static inline" in the entire kernel tree (140 occurrences in 47 files). While making this change I came across a few lines with trailing whitespace that I also fixed up, I have also added or removed a blank line or two here and there, but there are no functional changes in the patch. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] kernel/crash_dump.c: add kerneldocRandy Dunlap1-1/+10
Add kerneldoc to kernel/crash_dump.c Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] kernel/cpuset.c: add kerneldoc, fix typosRandy Dunlap1-7/+19
Add kerneldoc to kernel/cpuset.c Fix cpuset typos in init/Kconfig Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] kernel/capability.c: add kerneldocRandy Dunlap1-3/+17
Add kerneldoc to kernel/capability.c Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] s390: spin lock retryMartin Schwidefsky1-1/+11
Split spin lock and r/w lock implementation into a single try which is done inline and an out of line function that repeatedly tries to get the lock before doing the cpu_relax(). Add a system control to set the number of retries before a cpu is yielded. The reason for the spin lock retry is that the diagnose 0x44 that is used to give up the virtual cpu is quite expensive. For spin locks that are held only for a short period of time the costs of the diagnoses outweights the savings for spin locks that are held for a longer timer. The default retry count is 1000. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] itimer fixesGeorge Anzinger1-21/+16
Fix the recent off-by-one fix in the itimer code: 1. The repeating timer is figured using the requested time (not +1 as we know where we are in the jiffie). 2. The tests for interval too large are left to the time_val to jiffie code. Signed-off-by: George Anzinger <george@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] Address BUG: using smp_processor_id() in preemptible [00000001] codeNigel Cunningham1-1/+1
This patch fixes a warning in the disable_nonboot_cpus call in kernel/power/smp.c. Signed-off by: Nigel Cunningham <nigel@suspend2.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.gitDavid Woodhouse5-58/+85
2005-07-26[PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIOSteven Rostedt1-2/+3
Here's the patch again to fix the code to handle if the values between MAX_USER_RT_PRIO and MAX_RT_PRIO are different. Without this patch, an SMP system will crash if the values are different. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Dean Nelson <dcn@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26[PATCH] Fix RLIMIT_RTPRIO breakageAndreas Steinmetz1-1/+2
RLIMIT_RTPRIO is supposed to grant non privileged users the right to use SCHED_FIFO/SCHED_RR scheduling policies with priorites bounded by the RLIMIT_RTPRIO value via sched_setscheduler(). This is usually used by audio users. Unfortunately this is broken in 2.6.13rc3 as you can see in the excerpt from sched_setscheduler below: /* * Allow unprivileged RT tasks to decrease priority: */ if (!capable(CAP_SYS_NICE)) { /* can't change policy */ if (policy != p->policy) return -EPERM; After the above unconditional test which causes sched_setscheduler to fail with no regard to the RLIMIT_RTPRIO value the following check is made: /* can't increase priority */ if (policy != SCHED_NORMAL && param->sched_priority > p->rt_priority && param->sched_priority > p->signal->rlim[RLIMIT_RTPRIO].rlim_cur) return -EPERM; Thus I do believe that the RLIMIT_RTPRIO value must be taken into account for the policy check, especially as the RLIMIT_RTPRIO limit is of no use without this change. The attached patch fixes this problem. Signed-off-by: Andreas Steinmetz <ast@domdv.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26[PATCH] swpsuspend: Have suspend to disk use factors of sys_rebootEric W. Biederman1-6/+3
The suspend to disk code was a poor copy of the code in sys_reboot now that we have kernel_power_off, kernel_restart and kernel_halt use them instead of poorly duplicating them inline. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26[PATCH] Call emergency_reboot from panicEric W. Biederman1-5/+4
We know the system is in trouble so there is no question if this is an emergecy :) Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26[PATCH] Use kernel_power_off in sysrq-oEric W. Biederman1-2/+2
We already do all of the gymnastics to run from process context to call the power off code so call into the power off code cleanly. This especially helps acpi as part of it's shutdown logic should run acpi_shutdown called from device_shutdown which was not being called from here. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26[PATCH] Add emergency_restart()Eric W. Biederman1-0/+6
When the kernel is working well and we want to restart cleanly kernel_restart is the function to use. But in many instances the kernel wants to reboot when thing are expected to be working very badly such as from panic or a software watchdog handler. This patch adds the function emergency_restart() so that callers can be clear what semantics they expect when calling restart. emergency_restart() is expected to be callable from interrupt context and possibly reliable in even more trying circumstances. This is an initial generic implementation for all architectures. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26[PATCH] Make ctrl_alt_del call kernel_restart to get a proper reboot.Eric W. Biederman1-2/+1
It is obvious we wanted to call kernel_restart here but since we don't have it the code was expanded inline and hasn't been correct since sometime in 2.4. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26[PATCH] Refactor sys_reboot into reusable partsEric W. Biederman1-42/+64
Because the factors of sys_reboot don't exist people calling into the reboot path duplicate the code badly, leading to inconsistent expectations of code in the reboot path. This patch should is just code motion. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26[PATCH] Add missing device_suspsend(PMSG_FREEZE) calls.Eric W. Biederman1-0/+2
In the recent addition of device_suspend calls into sys_reboot two code paths were missed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-19Merge with /shiny/git/linux-2.6/.gitDavid Woodhouse1-40/+11
2005-07-18AUDIT: Reduce contention in audit_serial()David Woodhouse2-2/+6
... by generating serial numbers only if an audit context is actually _used_, rather than doing so at syscall entry even when the context isn't necessarily marked auditable. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-15AUDIT: Fix livelock in audit_serial().David Woodhouse1-11/+10
The tricks with atomic_t were bizarre. Just do it sensibly instead. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-14AUDIT: Fix compile error in audit_filter_syscallDavid Woodhouse1-1/+1
We didn't rename it to audit_tgid after all. Except once... Doh. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-13AUDIT: Avoid scheduling in idle threadDavid Woodhouse1-5/+8
When we flush a pending syscall audit record due to audit_free(), we might be doing that in the context of the idle thread. So use GFP_ATOMIC Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-13AUDIT: Exempt the whole auditd thread-group from auditingDavid Woodhouse1-2/+2
and not just the one thread. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-13[AUDIT] Fix sparse warning about gfp_mask typeVictor Fusco1-1/+1
Fix the sparse warning "implicit cast to nocast type" Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-13[PATCH] inotify: move sysctlRobert Love1-40/+11
This moves the inotify sysctl knobs to "/proc/sys/fs/inotify" from "/proc/sys/fs". Also some related cleanup. Signed-off-by: Robert Love <rml@novell.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-13Merge with /shiny/git/linux-2.6/.gitDavid Woodhouse11-34/+103
2005-07-12[PATCH] inotifyRobert Love3-1/+49
inotify is intended to correct the deficiencies of dnotify, particularly its inability to scale and its terrible user interface: * dnotify requires the opening of one fd per each directory that you intend to watch. This quickly results in too many open files and pins removable media, preventing unmount. * dnotify is directory-based. You only learn about changes to directories. Sure, a change to a file in a directory affects the directory, but you are then forced to keep a cache of stat structures. * dnotify's interface to user-space is awful. Signals? inotify provides a more usable, simple, powerful solution to file change notification: * inotify's interface is a system call that returns a fd, not SIGIO. You get a single fd, which is select()-able. * inotify has an event that says "the filesystem that the item you were watching is on was unmounted." * inotify can watch directories or files. Inotify is currently used by Beagle (a desktop search infrastructure), Gamin (a FAM replacement), and other projects. See Documentation/filesystems/inotify.txt. Signed-off-by: Robert Love <rml@novell.com> Cc: John McCutchan <ttb@tentacle.dhs.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12Merge master.kernel.org:/pub/scm/linux/kernel/git/lenb/linux-2.6Linus Torvalds1-1/+15
2005-07-12[PATCH] lower VM_DONTCOPY total_vmHugh Dickins1-1/+3
dup_mmap of a VM_DONTCOPY vma forgot to lower the child's total_vm. (But no way does this account for the recent report of total_vm seen too low.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12[PATCH] name_to_dev_t warning fixAndrew Morton2-2/+3
kernel/power/disk.c needs a declaration of name_to_dev_t() in scope. mount.h seems like an appropriate choice. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12[ACPI] merge acpi-2.6.12 branch into latest Linux 2.6.13-rc...Len Brown1-1/+15
Signed-off-by: Len Brown <len.brown@intel.com>
2005-07-11[ACPI] Suspend to RAM fixDavid Shaohua Li1-0/+14
Free some RAM before entering S3 so that upon resume we can be sure early allocations will succeed. http://bugzilla.kernel.org/show_bug.cgi?id=3469 Signed-off-by: David Shaohua Li <shaohua.li@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2005-07-11[ACPI] ACPI poweroff fixAlexey Starikovskiy1-1/+1
Register an "acpi" system device to be notified of shutdown preparation. This depends on CONFIG_PM http://bugzilla.kernel.org/show_bug.cgi?id=4041 Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Len Brown <len.brown@intel.com>
2005-07-07[PATCH] cond_resched(): fix bogus might_sleep() warningIngo Molnar1-0/+7
The BKS might be reacquired before we have dropped PREEMPT_ACTIVE, which could trigger a second could trigger a second cond_resched() call. Bug found by Hirofumi Ogawa. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07[PATCH] mostly_read data sectionChristoph Lameter1-2/+2
Add a new section called ".data.read_mostly" for data items that are read frequently and rarely written to like cpumaps etc. If these maps are placed in the .data section then these frequenly read items may end up in cachelines with data is is frequently updated. In that case all processors in an SMP system must needlessly reload the cachelines again and again containing elements of those frequently used variables. The ability to share these cachelines will allow each cpu in an SMP system to keep local copies of those shared cachelines thereby optimizing performance. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Christoph Lameter <christoph@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07[PATCH] pm: clean up process.cPavel Machek1-4/+2
freezeable() already tests for TRACED/STOPPED processes, no need to do it twice. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07[PATCH] swsusp: fix error handlingPavel Machek1-12/+11
Fix error handling and whitespace in swsusp.c. swsusp_free() was called when there was nothing allocating, leading to oops. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07[PATCH] pm: Fix resume from initrdPavel Machek2-10/+10
Move device name resolution code around so that it is not called from resume-from-initrd. name_to_dev_t may be unavailable at that point. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-05[PATCH] kprobes: fix namespace problem and sparc64 buildRusty Lynch1-1/+1
The following renames arch_init, a kprobes function for performing any architecture specific initialization, to arch_init_kprobes in order to cleanup the namespace. Also, this patch adds arch_init_kprobes to sparc64 to fix the sparc64 kprobes build from the last return probe patch. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-02AUDIT: Really don't audit auditd.David Woodhouse1-2/+2
The pid in the audit context isn't always set up. Use tsk->pid when checking whether it's auditd in audit_filter_syscall(), instead of ctx->pid. Remove a band-aid which did the same elsewhere. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-02AUDIT: Stop waiting for backlog after audit_panic() happensDavid Woodhouse1-5/+10
We force a rate-limit on auditable events by making them wait for space on the backlog queue. However, if auditd really is AWOL then this could potentially bring the entire system to a halt, depending on the audit rules in effect. Firstly, make sure the wait time is honoured correctly -- it's the maximum time the process should wait, rather than the time to wait _each_ time round the loop. We were getting re-woken _each_ time a packet was dequeued, and the timeout was being restarted each time. Secondly, reset the wait time after audit_panic() is called. In general this will be reset to zero, to allow progress to be made. If the system is configured to _actually_ panic on audit_panic() then that will already have happened; otherwise we know that audit records are being lost anyway. These two tunables can't be exposed via AUDIT_GET and AUDIT_SET because those aren't particularly well-designed. It probably should have been done by sysctls or sysfs anyway -- one for a later patch. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-02Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.gitDavid Woodhouse37-907/+3049
2005-06-28[PATCH] irqpollAlan Cox2-3/+112
Anyone reporting a stuck IRQ should try these options. Its effectiveness varies we've found in the Fedora case. Quite a few systems with misdescribed IRQ routing just work when you use irqpoll. It also fixes up the VIA systems although thats now fixed with the VIA quirk (which we could just make default as its what Redmond OS does but Linus didn't like it historically). A small number of systems have jammed IRQ sources or misdescribes that cause an IRQ that we have no handler registered anywhere for. In those cases it doesn't help. Signed-off-by: Alan Cox <number6@the-village.bc.nu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28[PATCH] ITIMER_REAL: fix possible deadlock and raceOleg Nesterov1-2/+6
As Steven Rostedt pointed out, there are 2 problems with ITIMER_REAL timers. 1. do_setitimer() does not call del_timer_sync() in case when the timer is not pending (it_real_value() returns 0). This is wrong, the timer may still be running, and it can rearm itself. 2. It calls del_timer_sync() with tsk->sighand->siglock held. This is deadlockable, because timer's handler needs this lock too. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28[PATCH] Using msleep() instead of HZLuca Falavigna1-5/+4
Use msleep() in a few places. Signed-off-by: Luca Falavigna <dktrkranz@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28[PATCH] Tweak idle thread setup semanticsIngo Molnar1-1/+8
This patch tweaks idle thread setup semantics a bit: instead of setting NEED_RESCHED in init_idle(), we do an explicit schedule() before calling into cpu_idle(). This patch, while having no negative side-effects, enables wider use of cond_resched()s. (which might happen in the stock kernel too, but it's particulary important for voluntary-preempt) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28[PATCH] kexec: fix sparse warningsAlexey Dobriyan1-5/+5
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27[PATCH] Return probe redesign: architecture independent changesRusty Lynch1-50/+19
The following is the second version of the function return probe patches I sent out earlier this week. Changes since my last submission include: * Fix in ppc64 code removing an unneeded call to re-enable preemption * Fix a build problem in ia64 when kprobes was turned off * Added another BUG_ON check to each of the architecture trampoline handlers My initial patch description ==> From my experiences with adding return probes to x86_64 and ia64, and the feedback on LKML to those patches, I think we can simplify the design for return probes. The following patch tweaks the original design such that: * Instead of storing the stack address in the return probe instance, the task pointer is stored. This gives us all we need in order to: - find the correct return probe instance when we enter the trampoline (even if we are recursing) - find all left-over return probe instances when the task is going away This has the side effect of simplifying the implementation since more work can be done in kernel/kprobes.c since architecture specific knowledge of the stack layout is no longer required. Specifically, we no longer have: - arch_get_kprobe_task() - arch_kprobe_flush_task() - get_rp_inst_tsk() - get_rp_inst() - trampoline_post_handler() <see next bullet> * Instead of splitting the return probe handling and cleanup logic across the pre and post trampoline handlers, all the work is pushed into the pre function (trampoline_probe_handler), and then we skip single stepping the original function. In this case the original instruction to be single stepped was just a NOP, and we can do without the extra interruption. The new flow of events to having a return probe handler execute when a target function exits is: * At system initialization time, a kprobe is inserted at the beginning of kretprobe_trampoline. kernel/kprobes.c use to handle this on it's own, but ia64 needed to do this a little differently (i.e. a function pointer is really a pointer to a structure containing the instruction pointer and a global pointer), so I added the notion of arch_init(), so that kernel/kprobes.c:init_kprobes() now allows architecture specific initialization by calling arch_init() before exiting. Each architecture now registers a kprobe on it's own trampoline function. * register_kretprobe() will insert a kprobe at the beginning of the targeted function with the kprobe pre_handler set to arch_prepare_kretprobe (still no change) * When the target function is entered, the kprobe is fired, calling arch_prepare_kretprobe (still no change) * In arch_prepare_kretprobe() we try to get a free instance and if one is available then we fill out the instance with a pointer to the return probe, the original return address, and a pointer to the task structure (instead of the stack address.) Just like before we change the return address to the trampoline function and mark the instance as used. If multiple return probes are registered for a given target function, then arch_prepare_kretprobe() will get called multiple times for the same task (since our kprobe implementation is able to handle multiple kprobes at the same address.) Past the first call to arch_prepare_kretprobe, we end up with the original address stored in the return probe instance pointing to our trampoline function. (This is a significant difference from the original arch_prepare_kretprobe design.) * Target function executes like normal and then returns to kretprobe_trampoline. * kprobe inserted on the first instruction of kretprobe_trampoline is fired and calls trampoline_probe_handler() (no change here) * trampoline_probe_handler() consumes each of the instances associated with the current task by calling the registered handler function and marking the instance as unused until an instance is found that has a return address different then the trampoline function. (change similar to my previous ia64 RFC) * If the task is killed with some left-over return probe instances (meaning that a target function was entered, but never returned), then we just free any instances associated with the task. (Not much different other then we can handle this without calling architecture specific functions.) There is a known problem that this patch does not yet solve where registering a return probe flush_old_exec or flush_thread will put us in a bad state. Most likely the best way to handle this is to not allow registering return probes on these two functions. (Significant change) This patch series applies to the 2.6.12-rc6-mm1 kernel, and provides: * kernel/kprobes.c changes * i386 patch of existing return probes implementation * x86_64 patch of existing return probe implementation * ia64 implementation * ppc64 implementation (provided by Ananth) This patch implements the architecture independant changes for a reworking of the kprobes based function return probes design. Changes include: * Removing functions for querying a return probe instance off a stack address * Removing the stack_addr field from the kretprobe_instance definition, and adding a task pointer * Adding architecture specific initialization via arch_init() * Removing extern definitions for the architecture trampoline functions (this isn't needed anymore since the architecture handles the initialization of the kprobe in the return probe trampoline function.) Signed-off-by: Rusty Lynch <rusty.lynch@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>