aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2006-03-31[PATCH] wrong error path in dup_fd() leading to oopses in RCUKirill Korotaev1-1/+1
Wrong error path in dup_fd() - it should return NULL on error, not an address of already freed memory :/ Triggered by OpenVZ stress test suite. What is interesting is that it was causing different oopses in RCU like below: Call Trace: [<c013492c>] rcu_do_batch+0x2c/0x80 [<c0134bdd>] rcu_process_callbacks+0x3d/0x70 [<c0126cf3>] tasklet_action+0x73/0xe0 [<c01269aa>] __do_softirq+0x10a/0x130 [<c01058ff>] do_softirq+0x4f/0x60 ======================= [<c0113817>] smp_apic_timer_interrupt+0x77/0x110 [<c0103b54>] apic_timer_interrupt+0x1c/0x24 Code: Bad EIP value. <0>Kernel panic - not syncing: Fatal exception in interrupt Signed-Off-By: Pavel Emelianov <xemul@sw.ru> Signed-Off-By: Dmitry Mishin <dim@openvz.org> Signed-Off-By: Kirill Korotaev <dev@openvz.org> Signed-Off-By: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] pidhash: Refactor the pid hash tableEric W. Biederman2-73/+155
Simplifies the code, reduces the need for 4 pid hash tables, and makes the code more capable. In the discussions I had with Oleg it was felt that to a large extent the cleanup itself justified the work. With struct pid being dynamically allocated meant we could create the hash table entry when the pid was allocated and free the hash table entry when the pid was freed. Instead of playing with the hash lists when ever a process would attach or detach to a process. For myself the fact that it gave what my previous task_ref patch gave for free with simpler code was a big win. The problem is that if you hold a reference to struct task_struct you lock in 10K of low memory. If you do that in a user controllable way like /proc does, with an unprivileged but hostile user space application with typical resource limits of 1000 fds and 100 processes I can trigger the OOM killer by consuming all of low memory with task structs, on a machine wight 1GB of low memory. If I instead hold a reference to struct pid which holds a pointer to my task_struct, I don't suffer from that problem because struct pid is 2 orders of magnitude smaller. In fact struct pid is small enough that most other kernel data structures dwarf it, so simply limiting the number of referring data structures is enough to prevent exhaustion of low memory. This splits the current struct pid into two structures, struct pid and struct pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one. struct pid_link is the per process linkage into the hash tables and lives in struct task_struct. struct pid is given an indepedent lifetime, and holds pointers to each of the pid types. The independent life of struct pid simplifies attach_pid, and detach_pid, because we are always manipulating the list of pids and not the hash table. In addition in giving struct pid an indpendent life it makes the concept much more powerful. Kernel data structures can now embed a struct pid * instead of a pid_t and not suffer from pid wrap around problems or from keeping unnecessarily large amounts of memory allocated. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] task: RCU protect task->usageEric W. Biederman1-1/+6
A big problem with rcu protected data structures that are also reference counted is that you must jump through several hoops to increase the reference count. I think someone finally implemented atomic_inc_not_zero(&count) to automate the common case. Unfortunately this means you must special case the rcu access case. When data structures are only visible via rcu in a manner that is not determined by the reference count on the object (i.e. tasks are visible until their zombies are reaped) there is a much simpler technique we can employ. Simply delaying the decrement of the reference count until the rcu interval is over. What that means is that the proc code that looks up a task and later wants to sleep can now do: rcu_read_lock(); task = find_task_by_pid(some_pid); if (task) { get_task_struct(task); } rcu_read_unlock(); The effect on the rest of the kernel is that put_task_struct becomes cheaper and immediate, and in the case where the task has been reaped it frees the task immediate instead of unnecessarily waiting an until the rcu interval is over. Cleanup of task_struct does not happen when its reference count drops to zero, instead cleanup happens when release_task is called. Tasks can only be looked up via rcu before release_task is called. All rcu protected members of task_struct are freed by release_task. Therefore we can move call_rcu from put_task_struct into release_task. And we can modify release_task to not immediately release the reference count but instead have it call put_task_struct from the function it gives to call_rcu. The end result: - get_task_struct is safe in an rcu context where we have just looked up the task. - put_task_struct() simplifies into its old pre rcu self. This reorganization also makes put_task_struct uncallable from modules as it is not exported but it does not appear to be called from any modules so this should not be an issue, and is trivially fixed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] resurrect __put_task_structAndrew Morton1-3/+7
This just got nuked in mainline. Bring it back because Eric's patches use it. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] Make setsid() more robustEric W. Biederman1-4/+15
The core problem: setsid fails if it is called by init. The effect in 2.6.16 and the earlier kernels that have this problem is that if you do a "ps -j 1 or ps -ej 1" you will see that init and several of it's children have process group and session == 0. Instead of process group == session == 1. Despite init calling setsid. The reason it fails is that daemonize calls set_special_pids(1,1) on kernel threads that are launched before /sbin/init is called. The only remaining effect in that current->signal->leader == 0 for init instead of 1. And the setsid call fails. No one has noticed because /sbin/init does not check the return value of setsid. In 2.4 where we don't have the pidhash table, and daemonize doesn't exist setsid actually works for init. I care a lot about pid == 1 not being a special case that we leave broken, because of the container/jail work that I am doing. - Carefully allow init (pid == 1) to call setsid despite the kernel using its session. - Use find_task_by_pid instead of find_pid because find_pid taking a pidtype is going away. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] futex: check and validate timevalsThomas Gleixner2-2/+6
The futex timeval is not checked for correctness. The change does not break existing applications as the timeval is supplied by glibc (and glibc always passes a correct value), but the glibc-internal tests for this functionality fail. Signed-off-by: Thomas Gleixner <tglx@tglx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] sched: activate SCHED BATCH expiredCon Kolivas1-3/+7
To increase the strength of SCHED_BATCH as a scheduling hint we can activate batch tasks on the expired array since by definition they are latency insensitive tasks. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] sched: remove on runqueue requeueingCon Kolivas1-2/+1
On runqueue time is used to elevate priority in schedule(). In the code it currently requeues tasks even if their priority is not elevated, which would end up placing them at the end of their runqueue array effectively delaying them instead of improving their priority. Bug spotted by Mike Galbraith <efault@gmx.de> This patch removes this requeueing. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] sched: include noninteractive sleep in idle detectCon Kolivas1-2/+1
Tasks waiting in SLEEP_NONINTERACTIVE state can now get to best priority so they need to be included in the idle detection code. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] sched: dont decrease idle sleep avgCon Kolivas1-5/+10
We watch for tasks that sleep extended periods and don't allow one single prolonged sleep period from elevating priority to maximum bonus to prevent cpu bound tasks from getting high priority with single long sleeps. There is a bug in the current code that also penalises tasks that already have high priority. Correct that bug. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] sched: make task_noninteractive use sleep_typeCon Kolivas1-8/+8
Alterations to the pipe code in the kernel made it possible for relative starvation to occur with tasks that slept waiting on a pipe getting unfair priority bonuses even if they were otherwise fully cpu bound so the TASK_NONINTERACTIVE flag was introduced which prevented any change to sleep_avg while sleeping waiting on a pipe. This change also leads to the converse though, preventing any priority boost from occurring in truly interactive tasks that wait on pipes. Convert the TASK_NONINTERACTIVE flag to set sleep_type to SLEEP_NONINTERACTIVE which will allow a linear bonus to priority based on sleep time thus allowing interactive tasks to get high priority if they sleep enough. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] sched: cleanup task_activated()Con Kolivas1-9/+15
The activated flag in task_struct is used to track different sleep types and its usage is somewhat obfuscated. Convert the variable to an enum with more descriptive names without altering the function. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] sched: reduce overhead of calc_loadJack Steiner2-1/+16
Currently, count_active_tasks() calls both nr_running() & nr_interruptible(). Each of these functions does a "for_each_cpu" & reads values from the runqueue of each cpu. Although this is not a lot of instructions, each runqueue may be located on different node. Depending on the architecture, a unique TLB entry may be required to access each runqueue. Since there may be more runqueues than cpu TLB entries, a scan of all runqueues can trash the TLB. Each memory reference incurs a TLB miss & refill. In addition, the runqueue cacheline that contains nr_running & nr_uninterruptible may be evicted from the cache between the two passes. This causes unnecessary cache misses. Combining nr_running() & nr_interruptible() into a single function substantially reduces the TLB & cache misses on large systems. This should have no measureable effect on smaller systems. On a 128p IA64 system running a memory stress workload, the new function reduced the overhead of calc_load() from 605 usec/call to 324 usec/call. Signed-off-by: Jack Steiner <steiner@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] hrtimer: call get_softirq_time() only when necessary in ↵Dimitri Sivanich1-0/+3
run_hrtimer_queue() It seems that run_hrtimer_queue() is calling get_softirq_time() more often than it needs to. With this patch, it only calls get_softirq_time() if there's a pending timer. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] hrtimer: use generic sleeper for nanosleepThomas Gleixner1-30/+9
Replace the nanosleep private sleeper functionality by the generic hrtimer sleeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] hrtimer: create generic sleeperThomas Gleixner1-0/+19
The removal of the data field in the hrtimer structure enforces the embedding of the timer into another data structure. nanosleep now uses a private implementation of the most common used timer callback function (simple task wakeup). In order to avoid the reimplentation of such functionality all over the place a generic hrtimer_sleeper functionality is created. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] modules: permit Dual-MIT/GPL licensesAndrew Morton1-0/+1
One of the LEDs driver files wants to use this. Probably drivers/mtd/maps/ipaq-flash.c wants to convert as well - right now it'll be tainting the kernel. Cc: David Woodhouse <dwmw2@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Bowler <jbowler@acm.org> Cc: "'Richard Purdie'" <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] cpuset: memory migration interaction fixPaul Jackson1-5/+52
Fix memory migration so that it works regardless of what cpuset the invoking task is in. If a task invoked a memory migration, by doing one of: 1) writing a different nodemask to a cpuset 'mems' file, or 2) writing a tasks pid to a different cpuset's 'tasks' file, where the cpuset had its 'memory_migrate' option turned on, then the allocation of the new pages for the migrated task(s) was constrained by the invoking tasks cpuset. If this task wasn't in a cpuset that allowed the requested memory nodes, the memory migration would happen to some other nodes that were in that invoking tasks cpuset. This was usually surprising and puzzling behaviour: Why didn't the pages move? Why did the pages move -there-? To fix this, temporarilly change the invoking tasks 'mems_allowed' task_struct field to the nodes the migrating tasks is moving to, so that new pages can be allocated there. Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] cpuset: unsafe mm reference fixPaul Jackson1-2/+2
Fix unsafe reference to a tasks mm struct, by moving the reference inside of a convenient nearby properly guarded code block. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] cpuset: task_lock comment fixPaul Jackson1-6/+4
Fix cpuset comment involving case of a tasks cpuset pointer being NULL. Thanks to "the_top_cpuset_hack", this code no longer sees NULL task->cpuset pointers. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] Fix pacct bug in multithreading case.KaiGai Kohei1-6/+6
I noticed a bug on the process accounting facility. In multi-threading process, some data would be recorded incorrectly when the group_leader dies earlier than one or more threads. The attached patch fixes this problem. See below. 'bugacct' is a test program that create a worker thread after 4 seconds sleeping, then the group_leader dies soon. The worker thread consume CPU/Memory for 6 seconds, then exit. We can estimate 10 seconds as etime and 6 seconds as stime + utime. This is a sample program which the group_leader dies earlier than other threads. The results of same binary execution on different kernel are below. -- accounted records -------------------- | btime | utime | stime | etime | minflt | majflt | comm | original | 13:16:40 | 0.00 | 0.00 | 6.10 | 171 | 0 | bugacct | patched | 13:20:21 | 5.83 | 0.18 | 10.03 | 32776 | 0 | bugacct | (*) bugacct allocates 128MB memory, thus 128MB / 4KB = 32768 of minflt is appropriate. -- Test results in original kernel ------ $ date; time -p ./bugacct Tue Mar 28 13:16:36 JST 2006 <- But pacct said btime is 13:16:40 real 10.11 <- But pacct said etime is 6.10 user 5.96 <- But pacct said utime is 0.00 sys 0.14 <- But pacct said stime is 0.00 $ -- Test results in patched kernel ------- $ date; time -p ./bugacct Tue Mar 28 13:20:21 JST 2006 real 10.04 user 5.83 sys 0.19 $ In the original 2.6.16 kernel, pacct records btime, utime, stime, etime and minflt incorrectly. In my opinion, this problem is caused by an assumption that group_leader dies last. The following section calculates process running time for etime and btime. But it means running time of the thread that dies last, not process. The start_time of the first thread in the process (group_leader) should be reduced from uptime to calculate etime and btime correctly. ---- do_acct_process() in kernel/acct.c: /* calculate run_time in nsec*/ do_posix_clock_monotonic_gettime(&uptime); run_time = (u64)uptime.tv_sec*NSEC_PER_SEC + uptime.tv_nsec; run_time -= (u64)current->start_time.tv_sec*NSEC_PER_SEC + current->start_time.tv_nsec; ---- The following section calculates stime and utime of the process. But it might count the utime and stime of the group_leader duplicatly and ignore the utime and stime of the thread dies last, when one or more threads remain after group_leader dead. The ac_utime should be calculated as the sum of the signal->utime and utime of the thread dies last. The ac_stime should be done also. ---- do_acct_process() in kernel/acct.c: jiffies = cputime_to_jiffies(cputime_add(current->group_leader->utime, current->signal->utime)); ac.ac_utime = encode_comp_t(jiffies_to_AHZ(jiffies)); jiffies = cputime_to_jiffies(cputime_add(current->group_leader->stime, current->signal->stime)); ac.ac_stime = encode_comp_t(jiffies_to_AHZ(jiffies)); ---- The part of the minflt/majflt calculation has same problem. This patch solves those problems, I think. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] Don't pass boot parameters to argv_init[]OGAWA Hirofumi1-1/+1
The boot cmdline is parsed in parse_early_param() and parse_args(,unknown_bootoption). And __setup() is used in obsolete_checksetup(). start_kernel() -> parse_args() -> unknown_bootoption() -> obsolete_checksetup() If __setup()'s callback (->setup_func()) returns 1 in obsolete_checksetup(), obsolete_checksetup() thinks a parameter was handled. If ->setup_func() returns 0, obsolete_checksetup() tries other ->setup_func(). If all ->setup_func() that matched a parameter returns 0, a parameter is seted to argv_init[]. Then, when runing /sbin/init or init=app, argv_init[] is passed to the app. If the app doesn't ignore those arguments, it will warning and exit. This patch fixes a wrong usage of it, however fixes obvious one only. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] __mod_timer: simplify ->base changingOleg Nesterov1-8/+6
Since base and new_base are of the same type now, we can save one 'if' branch and simplify the code a bit. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] kill __init_timer_base in favor of boot_tvec_basesOleg Nesterov1-49/+35
Commit a4a6198b80cf82eb8160603c98da218d1bd5e104: [PATCH] tvec_bases too large for per-cpu data introduced "struct tvec_t_base_s boot_tvec_bases" which is visible at compile time. This means we can kill __init_timer_base and move timer_base_s's content into tvec_t_base_s. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] Fix suspend with traced tasksPavel Machek2-2/+2
strace /bin/bash misbehaves after resume; this fixes it. (akpm: it's scary calling refrigerator() in state TASK_TRACED, but it seems to do the right thing). Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] send_sigqueue: simplify and fix the raceOleg Nesterov1-37/+4
send_sigqueue() checks PF_EXITING, then locks p->sighand->siglock. This is unsafe: 'p' can exit in between and set ->sighand = NULL. The race is theoretical, the window is tiny and irqs are disabled by the caller, so I don't think we need the fix for -stable tree. Convert send_sigqueue() to use lock_task_sighand() helper. Also, delete 'p->flags & PF_EXITING' re-check, it is unneeded and the comment is wrong. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] do_notify_parent_cldstop: remove 'to_self' paramOleg Nesterov1-21/+11
The previous patch has changed callsites of do_notify_parent_cldstop() so that to_self == (->ptrace & PT_PTRACED) always (as it should be). We can remove this parameter now. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] finish_stop: don't check stop_count < 0Oleg Nesterov1-1/+1
Remove an obscure 'stop_count < 0' check in finish_stop(). The previous patch made this case impossible. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] simplify do_signal_stop()Oleg Nesterov1-24/+8
do_signal_stop() considers 'thread_group_empty()' as a special case. This was needed to avoid taking tasklist_lock. Since this lock is unneeded any longer, we can remove this special case and simplify the code even more. Also, before this patch, finish_stop() was called with stop_count == -1 for 'thread_group_empty()' case. This is not strictly wrong, but confusing and unneeded. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] cleanup __exit_signal->cleanup_sighand pathOleg Nesterov2-7/+4
Move 'tsk->sighand = NULL' from cleanup_sighand() to __exit_signal(). This makes the exit path more understandable and allows us to do cleanup_sighand() outside of ->siglock protected section. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] make fork() atomic wrt pgrp/session signalsOleg Nesterov1-20/+17
Eric W. Biederman wrote: > > Ok. SUSV3/Posix is clear, fork is atomic with respect > to signals. Either a signal comes before or after a > fork but not during. (See the rationale section). > http://www.opengroup.org/onlinepubs/000095399/functions/fork.html > > The tasklist_lock does not stop forks from adding to a process > group. The forks stall while the tasklist_lock is held, but a fork > that began before we grabbed the tasklist_lock simply completes > afterwards, and the child does not receive the signal. This also means that SIGSTOP or sig_kernel_coredump() signal can't be delivered to pgrp/session reliably. With this patch copy_process() returns -ERESTARTNOINTR when it detects a pending signal, fork() will be restarted transparently after handling the signals. This patch also deletes now unneeded "group_stop_count > 0" check, copy_process() can no longer succeed while group stop in progress. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-By: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] pids: kill PIDTYPE_TGIDOleg Nesterov2-10/+4
This patch kills PIDTYPE_TGID pid_type thus saving one hash table in kernel/pid.c and speeding up subthreads create/destroy a bit. It is also a preparation for the further tref/pids rework. This patch adds 'struct list_head thread_group' to 'struct task_struct' instead. We don't detach group leader from PIDTYPE_PID namespace until another thread inherits it's ->pid == ->tgid, so we are safe wrt premature free_pidmap(->tgid) call. Currently there are no users of find_task_by_pid_type(PIDTYPE_TGID). Should the need arise, we can use find_task_by_pid()->group_leader. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-By: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] do_sigaction: don't take tasklist_lockOleg Nesterov1-19/+3
do_sigaction() does not need tasklist_lock anymore, we can simplify the code. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] do_group_exit: don't take tasklist_lockOleg Nesterov1-2/+0
do_group_exit() takes tasklist_lock for zap_other_threads(), this is unneeded now. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] do_signal_stop: don't take tasklist_lockOleg Nesterov1-52/+17
do_signal_stop() does not need tasklist_lock anymore. So it does not need to do misc re-checks, and we can simplify the code. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] relax sig_needs_tasklist()Oleg Nesterov1-2/+1
handle_stop_signal() does not need tasklist_lock for SIG_KERNEL_STOP_MASK signals anymore. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] sys_times: don't take tasklist_lockOleg Nesterov1-12/+1
sys_times: don't take tasklist_lock Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] do __unhash_process() under ->siglockOleg Nesterov1-2/+2
This patch moves __unhash_process() call from realease_task() to __exit_signal(), so __detach_pid() is called with ->siglock held. This means we don't need tasklist_lock to iterate over thread group anymore: copy_process() was already changed to do attach_pid() under ->siglock. Eric's "pidhash-kill-switch_exec_pids.patch" from -mm changed de_thread() so it doesn't touch PIDTYPE_TGID. NOTE: de_thread() still needs some attention. It still changes task->pid lockless. Taking ->sighand.siglock here allows to do more tasklist_lock removals. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] revert "Optimize sys_times for a single thread process"Oleg Nesterov2-65/+27
This patch reverts 'CONFIG_SMP && thread_group_empty()' optimization in sys_times(). The reason is that the next patch breaks memory ordering which is needed for that optimization. tasklist_lock in sys_times() will be eliminated completely by further patch. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] move __exit_signal() to kernel/exit.cOleg Nesterov2-64/+64
__exit_signal() is private to release_task() now. I think it is better to make it static in kernel/exit.c and export flush_sigqueue() instead - this function is much more simple and straightforward. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] rename __exit_sighand to cleanup_sighandOleg Nesterov2-18/+13
Cosmetic, rename __exit_sighand to cleanup_sighand and move it close to copy_sighand(). This matches copy_signal/cleanup_signal naming, and I think it is easier to follow. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] cleanup __exit_signal()Oleg Nesterov1-16/+15
This patch factors out duplicated code under 'if' branches. Also, BUG_ON() conversions and whitespace cleanups. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] copy_process: cleanup bad_fork_cleanup_signalOleg Nesterov2-18/+20
__exit_signal() does important cleanups atomically under ->siglock. It is also called from copy_process's error path. This is not good, for example we can't move __unhash_process() under ->siglock for that reason. We should not mix these 2 paths, just look at ugly 'if (p->sighand)' under 'bad_fork_cleanup_sighand:' label. For copy_process() case it is sufficient to just backout copy_signal(), nothing more. Again, nobody can see this task yet. For CLONE_THREAD case we just decrement signal->count, otherwise nobody can see this ->signal and we can free it lockless. This patch assumes it is safe to do exit_thread_group_keys() without tasklist_lock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] copy_process: cleanup bad_fork_cleanup_sighandOleg Nesterov2-15/+2
The only caller of exit_sighand(tsk) is copy_process's error path. We can call __exit_sighand() directly and kill exit_sighand(). This 'tsk' was not yet registered in pid_hash[] or init_task.tasks, it has no external references, nobody can see it, and IF (clone_flags & CLONE_SIGHAND) At least 'current' has a reference to ->sighand, this means atomic_dec_and_test(sighand->count) can't be true. ELSE Nobody can see this ->sighand, this means we can free it without any locking. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] introduce sig_needs_tasklist() helperOleg Nesterov1-1/+4
In my opinion this patch cleans up the code. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] introduce lock_task_sighand() helperOleg Nesterov1-14/+24
Add lock_task_sighand() helper and converts group_send_sig_info() to use it. Hopefully we will have more users soon. This patch also removes '!sighand->count' and '!p->usage' checks, I think they both are bogus, racy and unneeded (but probably it makes sense to restore them as BUG_ON()s). ->sighand is cleared and it's ->count is decremented in release_task() with sighand->siglock held, so it is a bug to have '!p->usage || !->count' after we already locked and verified it is the same. On the other hand, an already dead task without ->sighand can have a non-zero ->usage due to ptrace, for example. If we read the stale value of ->sighand we must see the change after spin_lock(), because that change was done while holding that same old ->sighand.siglock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] convert sighand_cache to use SLAB_DESTROY_BY_RCUOleg Nesterov2-11/+12
This patch borrows a clever Hugh's 'struct anon_vma' trick. Without tasklist_lock held we can't trust task->sighand until we locked it and re-checked that it is still the same. But this means we don't need to defer 'kmem_cache_free(sighand)'. We can return the memory to slab immediately, all we need is to be sure that sighand->siglock can't dissapear inside rcu protected section. To do so we need to initialize ->siglock inside ctor function, SLAB_DESTROY_BY_RCU does the rest. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] release_task: replace open-coded ptrace_unlink()Oleg Nesterov1-3/+2
Use ptrace_unlink() instead of open-coding. No changes in kernel/exit.o Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] wait_for_helper: trivial style cleanupOleg Nesterov1-1/+1
Use NULL instead of (... *)0 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] reparent_thread: use remove_parent/add_parentOleg Nesterov1-2/+2
Use remove_parent/add_parent instead of open coding. No changes in kernel/exit.o Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] pidhash: don't count idle threadsOleg Nesterov3-43/+20
fork_idle() does unhash_process() just after copy_process(). Contrary, boot_cpu's idle thread explicitely registers itself for each pid_type with nr = 0. copy_process() already checks p->pid != 0 before process_counts++, I think we can just skip attach_pid() calls and job control inits for idle threads and kill unhash_process(). We don't need to cleanup ->proc_dentry in fork_idle() because with this patch idle threads are never hashed in kernel/pid.c:pid_hash[]. We don't need to hash pid == 0 in pidmap_init(). free_pidmap() is never called with pid == 0 arg, so it will never be reused. So it is still possible to use pid == 0 in any PIDTYPE_xxx namespace from kernel/pid.c's POV. However with this patch we don't hash pid == 0 for PIDTYPE_PID case. We still have have PIDTYPE_PGID/PIDTYPE_SID entries with pid == 0: /sbin/init and kernel threads which don't call daemonize(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] kill SET_LINKS/REMOVE_LINKSOleg Nesterov2-2/+6
Both SET_LINKS() and SET_LINKS/REMOVE_LINKS() have exactly one caller, and these callers already check thread_group_leader(). This patch kills theese macros, they mix two different things: setting process's parent and registering it in init_task.tasks list. Callers are updated to do these actions by hand. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] don't use REMOVE_LINKS/SET_LINKS for reparentingOleg Nesterov2-6/+6
There are places where kernel uses REMOVE_LINKS/SET_LINKS while changing process's ->parent. Use add_parent/remove_parent instead, they don't abuse of global process list. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] remove add_parent()'s parent argumentOleg Nesterov1-1/+1
add_parent(p, parent) is always called with parent == p->parent, and it makes no sense to do it differently. This patch removes this argument. No changes in affected .o files. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] choose_new_parent: remove unused arg, sanitize exit_state checkOleg Nesterov1-4/+4
'child_reaper' arg is not used in choose_new_parent(). "->exit_state >= EXIT_ZOMBIE" check is a leftover, was valid when EXIT_ZOMBIE lived in ->state var. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] pidhash: kill switch_exec_pidsEric W. Biederman1-30/+0
switch_exec_pids is only called from de_thread by way of exec, and it is only called when we are exec'ing from a non thread group leader. Currently switch_exec_pids gives the leader the pid of the thread and unhashes and rehashes all of the process groups. The leader is already in the EXIT_DEAD state so no one cares about it's pids. The only concern for the leader is that __unhash_process called from release_task will function correctly. If we don't touch the leader at all we know that __unhash_process will work fine so there is no need to touch the leader. For the task becomming the thread group leader, we just need to give it the pid of the old thread group leader, add it to the task list, and attach it to the session and the process group of the thread group. Currently de_thread is also adding the task to the task list which is just silly. Currently the only leader of __detach_pid besides detach_pid is switch_exec_pids because of the ugly extra work that was being performed. So this patch removes switch_exec_pids because it is doing too much, it is creating an unnecessary special case in pid.c, duing work duplicated in de_thread, and generally obscuring what it is going on. The necessary work is added to de_thread, and it seems to be a little clearer there what is going on. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] exec: allow init to exec from any thread.Eric W. Biederman2-2/+2
After looking at the problem of init calling exec some more I figured out an easy way to make the code work. The actual symptom without out this patch is that all threads will die except pid == 1, and the thread calling exec. The thread calling exec will wait forever for pid == 1 to die. Since pid == 1 does not install a handler for SIGKILL it will never die. This modifies the tests for init from current->pid == 1 to the equivalent current == child_reaper. And then it causes exec in the ugly case to modify child_reaper. The only weird symptom is that you wind up with an init process that doesn't have the oldest start time on the box. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] compat_sys_futex() warning fixAndrew Morton1-1/+2
kernel/futex_compat.c: In function `compat_sys_futex': kernel/futex_compat.c:140: warning: passing arg 1 of `do_futex' makes integer from pointer without a cast kernel/futex_compat.c:140: warning: passing arg 5 of `do_futex' makes integer from pointer without a cast Not sure what Ingo was thinking of here. Put the casts back in. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] for_each_possible_cpu: fixes for generic partKAMEZAWA Hiroyuki2-6/+6
replaces for_each_cpu with for_each_possible_cpu(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] Change dash2underscore() return value to charEric Sesterhenn1-1/+1
Since dash2underscore() just operates and returns chars, I guess its safe to change the return value to a char. With my .config, this reduces its size by 5 bytes. text data bss dec hex filename 4155 152 0 4307 10d3 params.o.orig 4150 152 0 4302 10ce params.o Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] symversion warning fixAndrew Morton1-1/+1
gcc-4.2: kernel/module.c: In function '__find_symbol': kernel/module.c:158: warning: the address of '__start___kcrctab', will always evaluate as 'true' kernel/module.c:165: warning: the address of '__start___kcrctab_gpl', will always evaluate as 'true' kernel/module.c:182: warning: the address of '__start___kcrctab_gpl_future', will always evaluate as 'true' Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] Notifier chain update: API changesAlan Stern6-134/+301
The kernel's implementation of notifier chains is unsafe. There is no protection against entries being added to or removed from a chain while the chain is in use. The issues were discussed in this thread: http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2 We noticed that notifier chains in the kernel fall into two basic usage classes: "Blocking" chains are always called from a process context and the callout routines are allowed to sleep; "Atomic" chains can be called from an atomic context and the callout routines are not allowed to sleep. We decided to codify this distinction and make it part of the API. Therefore this set of patches introduces three new, parallel APIs: one for blocking notifiers, one for atomic notifiers, and one for "raw" notifiers (which is really just the old API under a new name). New kinds of data structures are used for the heads of the chains, and new routines are defined for registration, unregistration, and calling a chain. The three APIs are explained in include/linux/notifier.h and their implementation is in kernel/sys.c. With atomic and blocking chains, the implementation guarantees that the chain links will not be corrupted and that chain callers will not get messed up by entries being added or removed. For raw chains the implementation provides no guarantees at all; users of this API must provide their own protections. (The idea was that situations may come up where the assumptions of the atomic and blocking APIs are not appropriate, so it should be possible for users to handle these things in their own way.) There are some limitations, which should not be too hard to live with. For atomic/blocking chains, registration and unregistration must always be done in a process context since the chain is protected by a mutex/rwsem. Also, a callout routine for a non-raw chain must not try to register or unregister entries on its own chain. (This did happen in a couple of places and the code had to be changed to avoid it.) Since atomic chains may be called from within an NMI handler, they cannot use spinlocks for synchronization. Instead we use RCU. The overhead falls almost entirely in the unregister routine, which is okay since unregistration is much less frequent that calling a chain. Here is the list of chains that we adjusted and their classifications. None of them use the raw API, so for the moment it is only a placeholder. ATOMIC CHAINS ------------- arch/i386/kernel/traps.c: i386die_chain arch/ia64/kernel/traps.c: ia64die_chain arch/powerpc/kernel/traps.c: powerpc_die_chain arch/sparc64/kernel/traps.c: sparc64die_chain arch/x86_64/kernel/traps.c: die_chain drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list kernel/panic.c: panic_notifier_list kernel/profile.c: task_free_notifier net/bluetooth/hci_core.c: hci_notifier net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain net/ipv6/addrconf.c: inet6addr_chain net/netfilter/nf_conntrack_core.c: nf_conntrack_chain net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain net/netlink/af_netlink.c: netlink_chain BLOCKING CHAINS --------------- arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain arch/s390/kernel/process.c: idle_chain arch/x86_64/kernel/process.c idle_notifier drivers/base/memory.c: memory_chain drivers/cpufreq/cpufreq.c cpufreq_policy_notifier_list drivers/cpufreq/cpufreq.c cpufreq_transition_notifier_list drivers/macintosh/adb.c: adb_client_list drivers/macintosh/via-pmu.c sleep_notifier_list drivers/macintosh/via-pmu68k.c sleep_notifier_list drivers/macintosh/windfarm_core.c wf_client_list drivers/usb/core/notify.c usb_notifier_list drivers/video/fbmem.c fb_notifier_list kernel/cpu.c cpu_chain kernel/module.c module_notify_list kernel/profile.c munmap_notifier kernel/profile.c task_exit_notifier kernel/sys.c reboot_notifier_list net/core/dev.c netdev_chain net/decnet/dn_dev.c: dnaddr_chain net/ipv4/devinet.c: inetaddr_chain It's possible that some of these classifications are wrong. If they are, please let us know or submit a patch to fix them. Note that any chain that gets called very frequently should be atomic, because the rwsem read-locking used for blocking chains is very likely to incur cache misses on SMP systems. (However, if the chain's callout routines may sleep then the chain cannot be atomic.) The patch set was written by Alan Stern and Chandra Seetharaman, incorporating material written by Keith Owens and suggestions from Paul McKenney and Andrew Morton. [jes@sgi.com: restructure the notifier chain initialization macros] Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] lightweight robust futexes updatesIngo Molnar3-16/+16
- fix: initialize the robust list(s) to NULL in copy_process. - doc update - cleanup: rename _inuser to _inatomic - __user cleanups and other small cleanups Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] lightweight robust futexes: compatIngo Molnar4-23/+150
32-bit syscall compatibility support. (This patch also moves all futex related compat functionality into kernel/futex_compat.c.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Acked-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] lightweight robust futexes: coreIngo Molnar3-0/+179
Add the core infrastructure for robust futexes: structure definitions, the new syscalls and the do_exit() based cleanup mechanism. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Acked-by: Ulrich Drepper <drepper@redhat.com> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] sched: fix group power for allnodes_domainsSiddha, Suresh B1-33/+29
Current sched groups power calculation for allnodes_domains is wrong. We should really be using cumulative power of the physical packages in that group (similar to the calculation in node_domains) Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] sched: new sched domain for representing multi-coreSiddha, Suresh B1-5/+68
Add a new sched domain for representing multi-core with shared caches between cores. Consider a dual package system, each package containing two cores and with last level cache shared between cores with in a package. If there are two runnable processes, with this appended patch those two processes will be scheduled on different packages. On such systems, with this patch we have observed 8% perf improvement with specJBB(2 warehouse) benchmark and 35% improvement with CFP2000 rate(with 2 users). This new domain will come into play only on multi-core systems with shared caches. On other systems, this sched domain will be removed by domain degeneration code. This new domain can be also used for implementing power savings policy (see OLS 2005 CMP kernel scheduler paper for more details.. I will post another patch for power savings policy soon) Most of the arch/* file changes are for cpu_coregroup_map() implementation. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] Small schedule() optimizationAndreas Mohr1-7/+5
small schedule() microoptimization. Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27[PATCH] sched: fix task interactivity calculationMartin Andersson1-1/+2
Is a truncation error in kernel/sched.c triggered when the nice value is negative. The affected code is used in the TASK_INTERACTIVE macro. The code is: #define SCALE(v1,v1_max,v2_max) \ (v1) * (v2_max) / (v1_max) which is used in this way: SCALE(TASK_NICE(p), 40, MAX_BONUS) Comments in the code says: * This part scales the interactivity limit depending on niceness. * * We scale it linearly, offset by the INTERACTIVE_DELTA delta. * Here are a few examples of different nice levels: * * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0] * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0] * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0] * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0] * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0] * * (the X axis represents the possible -5 ... 0 ... +5 dynamic * priority range a task can explore, a value of '1' means the * task is rated interactive.) However, the current code does not scale it linearly and the result differs from the given examples. If the mathematical function "floor" is used when the nice value is negative instead of the truncation one gets when using integer division, the result conforms to the documentation. Output of TASK_INTERACTIVE when using the kernel code: nice dynamic priorities -20 1 1 1 1 1 1 1 1 1 0 0 -19 1 1 1 1 1 1 1 1 0 0 0 -18 1 1 1 1 1 1 1 1 0 0 0 -17 1 1 1 1 1 1 1 1 0 0 0 -16 1 1 1 1 1 1 1 1 0 0 0 -15 1 1 1 1 1 1 1 0 0 0 0 -14 1 1 1 1 1 1 1 0 0 0 0 -13 1 1 1 1 1 1 1 0 0 0 0 -12 1 1 1 1 1 1 1 0 0 0 0 -11 1 1 1 1 1 1 0 0 0 0 0 -10 1 1 1 1 1 1 0 0 0 0 0 -9 1 1 1 1 1 1 0 0 0 0 0 -8 1 1 1 1 1 1 0 0 0 0 0 -7 1 1 1 1 1 0 0 0 0 0 0 -6 1 1 1 1 1 0 0 0 0 0 0 -5 1 1 1 1 1 0 0 0 0 0 0 -4 1 1 1 1 1 0 0 0 0 0 0 -3 1 1 1 1 0 0 0 0 0 0 0 -2 1 1 1 1 0 0 0 0 0 0 0 -1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 2 1 1 1 1 0 0 0 0 0 0 0 3 1 1 1 1 0 0 0 0 0 0 0 4 1 1 1 0 0 0 0 0 0 0 0 5 1 1 1 0 0 0 0 0 0 0 0 6 1 1 1 0 0 0 0 0 0 0 0 7 1 1 1 0 0 0 0 0 0 0 0 8 1 1 0 0 0 0 0 0 0 0 0 9 1 1 0 0 0 0 0 0 0 0 0 10 1 1 0 0 0 0 0 0 0 0 0 11 1 1 0 0 0 0 0 0 0 0 0 12 1 0 0 0 0 0 0 0 0 0 0 13 1 0 0 0 0 0 0 0 0 0 0 14 1 0 0 0 0 0 0 0 0 0 0 15 1 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0 0 17 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 19 0 0 0 0 0 0 0 0 0 0 0 Output of TASK_INTERACTIVE when using "floor" nice dynamic priorities -20 1 1 1 1 1 1 1 1 1 0 0 -19 1 1 1 1 1 1 1 1 1 0 0 -18 1 1 1 1 1 1 1 1 1 0 0 -17 1 1 1 1 1 1 1 1 1 0 0 -16 1 1 1 1 1 1 1 1 0 0 0 -15 1 1 1 1 1 1 1 1 0 0 0 -14 1 1 1 1 1 1 1 1 0 0 0 -13 1 1 1 1 1 1 1 1 0 0 0 -12 1 1 1 1 1 1 1 0 0 0 0 -11 1 1 1 1 1 1 1 0 0 0 0 -10 1 1 1 1 1 1 1 0 0 0 0 -9 1 1 1 1 1 1 1 0 0 0 0 -8 1 1 1 1 1 1 0 0 0 0 0 -7 1 1 1 1 1 1 0 0 0 0 0 -6 1 1 1 1 1 1 0 0 0 0 0 -5 1 1 1 1 1 1 0 0 0 0 0 -4 1 1 1 1 1 0 0 0 0 0 0 -3 1 1 1 1 1 0 0 0 0 0 0 -2 1 1 1 1 1 0 0 0 0 0 0 -1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 2 1 1 1 1 0 0 0 0 0 0 0 3 1 1 1 1 0 0 0 0 0 0 0 4 1 1 1 0 0 0 0 0 0 0 0 5 1 1 1 0 0 0 0 0 0 0 0 6 1 1 1 0 0 0 0 0 0 0 0 7 1 1 1 0 0 0 0 0 0 0 0 8 1 1 0 0 0 0 0 0 0 0 0 9 1 1 0 0 0 0 0 0 0 0 0 10 1 1 0 0 0 0 0 0 0 0 0 11 1 1 0 0 0 0 0 0 0 0 0 12 1 0 0 0 0 0 0 0 0 0 0 13 1 0 0 0 0 0 0 0 0 0 0 14 1 0 0 0 0 0 0 0 0 0 0 15 1 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0 0 17 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 19 0 0 0 0 0 0 0 0 0 0 0 Signed-off-by: Martin Andersson <martin.andersson@control.lth.se> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds1-2/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: drivers/char/ftape/lowlevel/fdc-io.c: Correct a comment Kconfig help: MTD_JEDECPROBE already supports Intel Remove ugly debugging stuff do_mounts.c: Minor ROOT_DEV comment cleanup BUG_ON() Conversion in drivers/s390/block/dasd_devmap.c BUG_ON() Conversion in mm/mempool.c BUG_ON() Conversion in mm/memory.c BUG_ON() Conversion in kernel/fork.c BUG_ON() Conversion in ipc/sem.c BUG_ON() Conversion in fs/ext2/ BUG_ON() Conversion in fs/hfs/ BUG_ON() Conversion in fs/dcache.c BUG_ON() Conversion in fs/buffer.c BUG_ON() Conversion in input/serio/hp_sdc_mlc.c BUG_ON() Conversion in md/dm-table.c BUG_ON() Conversion in md/dm-path-selector.c BUG_ON() Conversion in drivers/isdn BUG_ON() Conversion in drivers/char BUG_ON() Conversion in drivers/mtd/
2006-03-26[PATCH] kretprobe instance recycled by parent processbibo mao2-6/+13
When kretprobe probes the schedule() function, if the probed process exits then schedule() will never return, so some kretprobe instances will never be recycled. In this patch the parent process will recycle retprobe instances of the probed function and there will be no memory leak of kretprobe instances. Signed-off-by: bibo mao <bibo.mao@intel.com> Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] hrtimers: remove data fieldRoman Zippel4-21/+17
The nanosleep cleanup allows to remove the data field of hrtimer. The callback function can use container_of() to get it's own data. Since the hrtimer structure is anyway embedded in other structures, this adds no overhead. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] hrtimers: remove nsec_t typedefRoman Zippel2-4/+4
nsec_t predates ktime_t and has mostly been superseded by it. In the few places that are left it's better to make it explicit that we're dealing with 64 bit values here. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] hrtimers: remove state fieldRoman Zippel1-9/+5
Remove the state field and encode this information in the rb_node similiar to normal timer. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] hrtimers: simplify nanosleepRoman Zippel1-80/+62
nanosleep is the only user of the expired state, so let it manage this itself, which makes the hrtimer code a bit simpler. The remaining time is also only calculated if requested. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] hrtimers: posix-timer: cleanup common_timer_get()Roman Zippel1-24/+26
Cleanup common_timer_get() a little. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] hrtimers: pass current time to hrtimer_forward()Roman Zippel3-9/+15
Pass current time to hrtimer_forward(). This allows to use the softirq time in the timer base when the forward function is called from the timer callback. Other places pass current time with a call to timer->base->get_time(). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] hrtimers: optimize softirq runqueuesThomas Gleixner1-2/+26
The hrtimer softirq is called from the timer softirq every tick. Retrieve the current time from xtime and wall_to_monotonic instead of calling base->get_time() for each timer base. Store the time in the base structure and provide a hook once clock source abstractions are in place and to keep the code open for new base clocks. Based on a patch from: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] consolidate sys32/compat_adjtimexStephen Rothwell1-0/+59
Create compat_sys_adjtimex and use it an all appropriate places. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Arnd Bergmann <arnd@arndb.de> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] swswsup: return correct load_image errorCon Kolivas1-3/+4
If there's an error in load_image() we should return that without checking snapshot_image_loaded. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] warn if free_irq() is called from IRQ contextIngo Molnar1-0/+1
Warn if free_irq() is called in IRQ context - free_irq() can execute /proc VFS work, which must not be done in IRQ context. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26BUG_ON() Conversion in kernel/fork.cEric Sesterhenn1-2/+1
this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-25Merge branch 'audit.b3' of ↵Linus Torvalds5-436/+1293
git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (22 commits) [PATCH] fix audit_init failure path [PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end and audit_format [PATCH] sem2mutex: audit_netlink_sem [PATCH] simplify audit_free() locking [PATCH] Fix audit operators [PATCH] promiscuous mode [PATCH] Add tty to syscall audit records [PATCH] add/remove rule update [PATCH] audit string fields interface + consumer [PATCH] SE Linux audit events [PATCH] Minor cosmetic cleanups to the code moved into auditfilter.c [PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALL [PATCH] Fix IA64 success/failure indication in syscall auditing. [PATCH] Miscellaneous bug and warning fixes [PATCH] Capture selinux subject/object context information. [PATCH] Exclude messages by message type [PATCH] Collect more inode information during syscall processing. [PATCH] Pass dentry, not just name, in fsnotify creation hooks. [PATCH] Define new range of userspace messages. [PATCH] Filter rule comparators ... Fixed trivial conflict in security/selinux/hooks.c
2006-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds1-2/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (21 commits) BUG_ON() Conversion in drivers/video/ BUG_ON() Conversion in drivers/parisc/ BUG_ON() Conversion in drivers/block/ BUG_ON() Conversion in sound/sparc/cs4231.c BUG_ON() Conversion in drivers/s390/block/dasd.c BUG_ON() Conversion in lib/swiotlb.c BUG_ON() Conversion in kernel/cpu.c BUG_ON() Conversion in ipc/msg.c BUG_ON() Conversion in block/elevator.c BUG_ON() Conversion in fs/coda/ BUG_ON() Conversion in fs/binfmt_elf_fdpic.c BUG_ON() Conversion in input/serio/hil_mlc.c BUG_ON() Conversion in md/dm-hw-handler.c BUG_ON() Conversion in md/bitmap.c The comment describing how MS_ASYNC works in msync.c is confusing rcu: undeclared variable used in documentation fix typos "wich" -> "which" typo patch for fs/ufs/super.c Fix simple typos tabify drivers/char/Makefile ...
2006-03-25[PATCH] remove pps supportRoman Zippel2-54/+18
This removes the support for pps. It's completely unused within the kernel and is basically in the way for further cleanups. It should be easier to readd proper support for it after the rest has been converted to NTP4 (where the pps mechanisms are quite different from NTP3 anyway). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Adrian Bunk <bunk@stusta.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] Add SA_PERCPU_IRQ flag supportDimitri Sivanich1-5/+18
Add support for SA_PERCPU_IRQ (only mmtimer.c uses this at this stage). Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] Use unsigned int types for a faster bsearchEric Dumazet1-2/+2
This patch avoids arithmetic on 'signed' types that are slower than 'unsigned'. This saves space and cpu cycles. size of kernel/sys.o before the patch (gcc-3.4.5) text data bss dec hex filename 10924 252 4 11180 2bac kernel/sys.o size of kernel/sys.o after the patch text data bss dec hex filename 10903 252 4 11159 2b97 kernel/sys.o I noticed that gcc-4.1.0 (from Fedora Core 5) even uses idiv instruction for (a+b)/2 if a and b are signed. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] Check if cpu can be onlined before calling smp_prepare_cpu()Ashok Raj1-3/+1
- Moved check for online cpu out of smp_prepare_cpu() - Moved default declaration of smp_prepare_cpu() to kernel/cpu.c - Removed lock_cpu_hotplug() from smp_prepare_cpu() to around it, since its called from cpu_up() as well now. - Removed clearing from cpu_present_map during cpu_offline as it breaks using cpu_up() directly during a subsequent online operation. Signed-off-by: Ashok Raj <ashok.raj@intel.com> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: "Li, Shaohua" <shaohua.li@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] No need to protect current->group_info in sys_getgroups(), ↵Eric Dumazet1-6/+0
in_group_p() and in_egroup_p() While doing some benchmarks of an Apache/PHP SMP server, I noticed high oprofile numbers in in_group_p() and _atomic_dec_and_lock(). rank percent 1 4.8911 % __link_path_walk 2 4.8503 % __d_lookup *3 4.2911 % _atomic_dec_and_lock 4 3.9307 % __copy_to_user_ll 5 4.9004 % sysenter_past_esp *6 3.3248 % in_group_p It appears that in_group_p() does an uncessary get_group_info(current->group_info); /* atomic_inc() */ ... /* access current->group_info */ put_group_info(current->group_info); /* _atomic_dec_and_lock */ It is not necessary to do this, because the current task holds a reference on its own group_info, and this reference cannot change during the lookup. This patch deletes the get_group_info()/put_group_info() pair from sys_getgroups(), in_group_p() and in_egroup_p() functions. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Tim Hockin <thockin@hockin.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] find_task_by_pid() needs tasklist_lockAndrew Morton1-0/+2
A couple of places are forgetting to take it. The kswapd case is probably unimportant. keventd_create_kthread() was racy. The whole thing is a bit flakey: you start a kernel thread, get its pid from kernel_thread() then look up its task_struct. a) It assumes that pid recycling takes a "long" time. b) We get a task_struct but no reference was taken on it. The owner of the kswapd and kthread task_struct*'s must assume that the new thread won't exit unexpectedly. Because if it does, they're left holding dead memory and any attempt to control or stop that task will crash. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] refactor capable() to one implementation, add __capable() helperChris Wright2-12/+16
Move capable() to kernel/capability.c and eliminate duplicate implementations. Add __capable() function which can be used to check for capabiilty of any process. Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] IRQ: prevent enabling of previously disabled interruptBryan Holty1-4/+17
This fix prevents re-disabling and enabling of a previously disabled interrupt. On an SMP system with irq balancing enabled; If an interrupt is disabled from within its own interrupt context with disable_irq_nosync and is also earmarked for processor migration, the interrupt is blindly moved to the other processor and enabled without regard for its current "enabled" state. If there is an interrupt pending, it will unexpectedly invoke the irq handler on the new irq owning processor (even though the irq was previously disabled) The more intuitive fix would be to invoke disable_irq_nosync and enable_irq, but since we already have the desc->lock from __do_IRQ, we cannot call them directly. Instead we can use the same logic to disable and enable found in disable_irq_nosync and enable_irq, with regards to the desc->depth. This now prevents a disabled interrupt from being re-disabled, and more importantly prevents a disabled interrupt from being incorrectly enabled on a different processor. Signed-off-by: Bryan Holty <lgeek@frontiernet.net> Cc: Andi Kleen <ak@muc.de> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] irq: uninline migration functionsAndrew Morton2-2/+53
Uninline some massive IRQ migration functions. Put them in the new kernel/irq/migration.c. Cc: Andi Kleen <ak@muc.de> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] kernel/params.c: make param_array() staticAdrian Bunk1-6/+6
param_array() in kernel/params.c can now become static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] Remove MODULE_PARMRusty Russell2-177/+16
MODULE_PARM was actually breaking: recent gcc version optimize them out as unused. It's time to replace the last users, which are generally in the most unloved drivers anyway. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] Validate and sanitze itimer timeval from userspaceThomas Gleixner1-0/+66
According to the specification the timevals must be validated and an errorcode -EINVAL returned in case the timevals are not in canonical form. This check was never done in Linux. The pre 2.6.16 code converted invalid timevals silently. Negative timeouts were converted by the timeval_to_jiffies conversion to the maximum timeout. hrtimers and the ktime_t operations expect timevals in canonical form. Otherwise random results might happen on 32 bits machines due to the optimized ktime_add/sub operations. Negative timeouts are treated as already expired. This might break applications which work on pre 2.6.16. To prevent random behaviour and API breakage the timevals are checked and invalid timevals sanitized in a simliar way as the pre 2.6.16 code did. Invalid timevals are reported with a per boot limited number of kernel messages so applications which use this misfeature can be corrected. After a grace period of one year the sanitizing should be replaced by a correct validation check. This is also documented in Documentation/feature-removal-schedule.txt The validation and sanitizing is done inside do_setitimer so all callers (sys_setitimer, compat_sys_setitimer, osf_setitimer) are catched. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] sys_alarm() unsigned signed conversion fixupThomas Gleixner2-13/+38
alarm() calls the kernel with an unsigend int timeout in seconds. The value is stored in the tv_sec field of a struct timeval to setup the itimer. The tv_sec field of struct timeval is of type long, which causes the tv_sec value to be negative on 32 bit machines if seconds > INT_MAX. Before the hrtimer merge (pre 2.6.16) such a negative value was converted to the maximum jiffies timeout by the timeval_to_jiffies conversion. It's not clear whether this was intended or just happened to be done by the timeval_to_jiffies code. hrtimers expect a timeval in canonical form and treat a negative timeout as already expired. This breaks the legitimate usage of alarm() with a timeout value > INT_MAX seconds. For 32 bit machines it is therefor necessary to limit the internal seconds value to avoid API breakage. Instead of doing this in all implementations of sys_alarm the duplicated sys_alarm code is moved into a common function in itimer.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] timer irq driven soft watchdog fixAndrew Morton1-0/+1
I seem to have lost this hunk in yesterday's patch. It brings the coming-online CPU's softlockup timer up to date so we don't get false-positive tripups during CPU hot-add. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24BUG_ON() Conversion in kernel/cpu.cEric Sesterhenn1-2/+1
this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-24[PATCH] fix build error if CONFIG_SYSFS=nAndrew Morton1-2/+2
uevent_seqnum and uevent_helper are only defined if CONFIG_HOTPLUG=y, CONFIG_NET=n. (I stole this back from Greg's tree - it makes allnoconfig work). Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] strndup_user: convert moduleDavi Arnaut1-16/+3
Change hand-coded userspace string copying to strndup_user. Signed-off-by: Davi Arnaut <davi.arnaut@gmail.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] timer-irq-driven soft-watchdog, cleanupsIngo Molnar2-25/+31
Make the softlockup detector purely timer-interrupt driven, removing softirq-context (timer) dependencies. This means that if the softlockup watchdog triggers, it has truly observed a longer than 10 seconds scheduling delay of a SCHED_FIFO prio 99 task. (the patch also turns off the softlockup detector during the initial bootup phase and does small style fixes) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] Fix module refcount leak in __set_personality()Sergey Vlasov1-0/+1
If the change of personality does not lead to change of exec domain, __set_personality() returned without releasing the module reference acquired by lookup_exec_domain(). Signed-off-by: Sergey Vlasov <vsu@altlinux.ru> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] RLIMIT_CPU: document wrong return valueAndrew Morton1-0/+7
Document the fact that setrlimit(RLIMIT_CPU) doesn't return error codes when it should. I don't think we can fix this without a 2.7.x.. Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ulrich Weigand <uweigand@de.ibm.com> Cc: Cliff Wickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] RLIMIT_CPU: fix handling of a zero limitAndrew Morton1-1/+12
At present the kernel doesn't honour an attempt to set RLIMIT_CPU to zero seconds. But the spec says it should, and that's what 2.4.x does. Fixing this for real would involve some complexity (such as adding a new it-has-been-set flag to the task_struct, and testing that everwhere, instead of overloading the value of it_prof_expires). Given that a 2.4 kernel won't actually send the signal until one second has expired anyway, let's just handle this case by treating the caller's zero-seconds as one second. Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ulrich Weigand <uweigand@de.ibm.com> Cc: Cliff Wickman <cpw@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] sys_setrlimit() cleanupAndrew Morton1-11/+15
- Whitespace cleanups - Make that expression comprehensible. There's a potential logic change here: we do the "is it_prof_expires equal to zero" test after converting it to seconds, rather than doing the comparison between raw cputime_t's. But given that it's in units of seconds anyway, that shouldn't change anything. Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ulrich Weigand <uweigand@de.ibm.com> Cc: Cliff Wickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] console_setup() depends (wrongly?) on CONFIG_PRINTKJohn Z. Bohach1-38/+38
It appears that console_setup() code only gets compiled into the kernel if CONFIG_PRINTK is enabled. One detrimental side-effect of this is that serial8250_console_setup() never gets invoked when CONFIG_PRINTK is not set, resulting in baud rate not being read/parsed from command line (i.e. console=ttyS0,115200n8 is ignored, at least the baud rate part...) Attached patch moves console_setup() code from inside #ifdef CONFIG_PRINTK to outside (in printk.c), removing dependence on said config. option. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] cpuset: remove useless local variable initializationPaul Jackson1-1/+1
Remove a useless variable initialization in cpuset __cpuset_zone_allowed(). The local variable 'allowed' is unconditionally set before use, later on in the code, so does not need to be initialized. Not that it seems to matter to the code generated any, as the compiler optimizes out the superfluous assignment anyway. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] cpuset: don't need to mark cpuset_mems_generation atomicPaul Jackson1-8/+11
Drop the atomic_t marking on the cpuset static global cpuset_mems_generation. Since all access to it is guarded by the global manage_mutex, there is no need for further serialization of this value. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] cpuset: remove unnecessary NULL checkPaul Jackson1-12/+6
Remove a no longer needed test for NULL cpuset pointer, with a little comment explaining why the test isn't needed. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] cpuset memory spread slab cache optimizationsPaul Jackson1-0/+1
The hooks in the slab cache allocator code path for support of NUMA mempolicies and cpuset memory spreading are in an important code path. Many systems will use neither feature. This patch optimizes those hooks down to a single check of some bits in the current tasks task_struct flags. For non NUMA systems, this hook and related code is already ifdef'd out. The optimization is done by using another task flag, set if the task is using a non-default NUMA mempolicy. Taking this flag bit along with the PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset memory spreading' patch set, one can check for the combination of any of these special case memory placement mechanisms with a single test of the current tasks task_struct flags. This patch also tightens up the code, to save a few bytes of kernel text space, and moves some of it out of line. Due to the nested inlines called from multiple places, we were ending up with three copies of this code, which once we get off the main code path (for local node allocation) seems a bit wasteful of instruction memory. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] cpuset memory spread basic implementationPaul Jackson1-6/+98
This patch provides the implementation and cpuset interface for an alternative memory allocation policy that can be applied to certain kinds of memory allocations, such as the page cache (file system buffers) and some slab caches (such as inode caches). The policy is called "memory spreading." If enabled, it spreads out these kinds of memory allocations over all the nodes allowed to a task, instead of preferring to place them on the node where the task is executing. All other kinds of allocations, including anonymous pages for a tasks stack and data regions, are not affected by this policy choice, and continue to be allocated preferring the node local to execution, as modified by the NUMA mempolicy. There are two boolean flag files per cpuset that control where the kernel allocates pages for the file system buffers and related in kernel data structures. They are called 'memory_spread_page' and 'memory_spread_slab'. If the per-cpuset boolean flag file 'memory_spread_page' is set, then the kernel will spread the file system buffers (page cache) evenly over all the nodes that the faulting task is allowed to use, instead of preferring to put those pages on the node where the task is running. If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the kernel will spread some file system related slab caches, such as for inodes and dentries evenly over all the nodes that the faulting task is allowed to use, instead of preferring to put those pages on the node where the task is running. The implementation is simple. Setting the cpuset flags 'memory_spread_page' or 'memory_spread_cache' turns on the per-process flags PF_SPREAD_PAGE or PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or subsequently joins that cpuset. In subsequent patches, the page allocation calls for the affected page cache and slab caches are modified to perform an inline check for these flags, and if set, a call to a new routine cpuset_mem_spread_node() returns the node to prefer for the allocation. The cpuset_mem_spread_node() routine is also simple. It uses the value of a per-task rotor cpuset_mem_spread_rotor to select the next node in the current tasks mems_allowed to prefer for the allocation. This policy can provide substantial improvements for jobs that need to place thread local data on the corresponding node, but that need to access large file system data sets that need to be spread across the several nodes in the jobs cpuset in order to fit. Without this patch, especially for jobs that might have one thread reading in the data set, the memory allocation across the nodes in the jobs cpuset can become very uneven. A couple of Copyright year ranges are updated as well. And a couple of email addresses that can be found in the MAINTAINERS file are removed. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] cpuset use combined atomic_inc_return callsPaul Jackson1-7/+4
Replace pairs of calls to <atomic_inc, atomic_read>, with a single call atomic_inc_return, saving a few bytes of source and kernel text. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] cpuset cleanup not not operatorsPaul Jackson1-5/+5
Since the test_bit() bit operator is boolean (return 0 or 1), the double not "!!" operations needed to convert a scalar (zero or not zero) to a boolean are not needed. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] rcutorture: tag success/failure line with module parametersPaul E. McKenney1-8/+15
A long-running rcutorture test can overflow dmesg, so that the line containing the module parameters is lost. Although it is usually possible to retrieve this information from the log files, it is much better to just tag it onto the final success/failure line so that it may be easily found. This patch does just that. Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] tvec_bases too large for per-cpu dataJan Beulich1-11/+34
With internal Xen-enabled kernels we see the kernel's static per-cpu data area exceed the limit of 32k on x86-64, and even native x86-64 kernels get fairly close to that limit. I generally question whether it is reasonable to have data structures several kb in size allocated as per-cpu data when the space there is rather limited. The biggest arch-independent consumer is tvec_bases (over 4k on 32-bit archs, over 8k on 64-bit ones), which now gets converted to use dynamically allocated memory instead. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] rcu_process_callbacks: don't cli() while testing ->nxtlistOleg Nesterov1-3/+2
__rcu_process_callbacks() disables interrupts to protect itself from call_rcu() which adds new entries to ->nxtlist. However we can check "->nxtlist != NULL" with interrupts enabled, we can't get "false positives" because call_rcu() can only change this condition from 0 to 1. Tested with rcutorture.ko. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] Range checking in do_proc_dointvec_(userhz_)jiffies_convBart Samwel1-0/+4
When (integer) sysctl values are in either seconds or centiseconds, but represented internally as jiffies, the allowable value range is decreased. This patch adds range checks to the conversion routines. For values in seconds: maximum LONG_MAX / HZ. For values in centiseconds: maximum (LONG_MAX / HZ) * USER_HZ. (BTW, does anyone else feel that an interface in seconds should not be accepting negative values?) Signed-off-by: Bart Samwel <bart@samwel.tk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] Represent laptop_mode as jiffies internallyBart Samwel1-3/+2
Make that the internal value for /proc/sys/vm/laptop_mode is stored as jiffies instead of seconds. Let the sysctl interface do the conversions, instead of doing on-the-fly conversions every time the value is used. Add a description of the fact that laptop_mode doubles as a flag and a timeout to the comment above the laptop_mode variable. Signed-off-by: Bart Samwel <bart@samwel.tk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] Represent dirty_*_centisecs as jiffies internallyBart Samwel1-5/+5
Make that the internal values for: /proc/sys/vm/dirty_writeback_centisecs /proc/sys/vm/dirty_expire_centisecs are stored as jiffies instead of centiseconds. Let the sysctl interface do the conversions with full precision using clock_t_to_jiffies, instead of doing overflow-sensitive on-the-fly conversions every time the values are used. Cons: apparent precision loss if HZ is not a multiple of 100, because of conversion back and forth. This is a common problem for all sysctl values that use proc_dointvec_userhz_jiffies. (There is only one other in-tree use, in net/core/neighbour.c.) Signed-off-by: Bart Samwel <bart@samwel.tk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24[PATCH] free_uid() locking improvementAndrew Morton1-3/+7
Reduce lock hold times in free_uid(). Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23Jens Axboe1-0/+1
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23[PATCH] relay: consolidate sendfile() and read() codeTom Zanussi1-84/+91
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23[PATCH] relay: add sendfile() supportJens Axboe1-29/+115
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23[PATCH] relay: migrate from relayfs to a generic relay APIJens Axboe2-0/+920
Original patch from Paul Mundt, sysfs parts removed by me since they were broken. Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23[PATCH] kernel/rcupdate.c: make two structs staticAdrian Bunk1-2/+2
This patch makes two needlessly global structs static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] BUILD_LOCK_OPS: cleanup preempt_disable() usageOleg Nesterov1-5/+4
This patch changes the code from: preempt_disable(); for (;;) { ... preempt_disable(); } to: for (;;) { preempt_disable(); ... } which seems more clean to me and saves a couple of bytes for each function. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] pause_on_oops command line optionAndrew Morton1-1/+96
Attempt to fix the problem wherein people's oops reports scroll off the screen due to repeated oopsing or to oopses on other CPUs. If this happens the user can reboot with the `pause_on_oops=<seconds>' option. It will allow the first oopsing CPU to print an oops record just a single time. Second oopsing attempts, or oopses on other CPUs will cause those CPUs to enter a tight loop until the specified number of seconds have elapsed. The patch implements the infrastructure generically in the expectation that architectures other than x86 will find it useful. Cc: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] make bug messages more consistentIngo Molnar1-2/+2
Consolidate all kernel bug printouts to begin with the "BUG: " string. Makes it easier to find them in large bootup logs. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] sigprocmask: kill unneeded temp varOleg Nesterov1-4/+4
Cleanup, remove unneeded double copying of current->blocked. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] kernel/module.c Semaphore to Mutex Conversion for module_mutexAshutosh Naik1-19/+19
This patch converts the module_mutex semaphore to a mutex. Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] sem2mutex: kprobesIngo Molnar1-7/+7
Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] sem2mutex: ttyIngo Molnar2-4/+4
Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] sem2mutex: kernel/Arjan van de Ven5-25/+30
Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] convert kernel/rcupdate.c:rcu_barrier_sema to mutexIngo Molnar1-5/+5
Convert kernel/rcupdate's rcu_barrier_sema to mutex. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] kernel/cpuset.c, mutex conversionIngo Molnar1-109/+103
convert cpuset.c's callback_sem and manage_sem to mutexes. Build and boot tested by Ingo. Build, boot, unit and stress tested by pj. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] Avoid taking global tasklist_lock for single threadedprocess at ↵Ravikiran G Thirumalai1-8/+34
getrusage() Avoid taking the global tasklist_lock when possible, if a process is single threaded during getrusage(). Any avoidance of tasklist_lock is good for NUMA boxes (and possibly for large SMPs). Thanks to Oleg Nesterov for review and suggestions. Signed-off-by: Nippun Goel <nippung@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] Shrinks sizeof(files_struct) and better layoutEric Dumazet1-4/+4
1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32bits platforms, lowering kmalloc() allocated space by 50%. 2) Reduce the size of (files_struct), using a special 32 bits (or 64bits) embedded_fd_set, instead of a 1024 bits fd_set for the close_on_exec_init and open_fds_init fields. This save some ram (248 bytes per task) as most tasks dont open more than 32 files. D-Cache footprint for such tasks is also reduced to the minimum. 3) Reduce size of allocated fdset. Currently two full pages are allocated, that is 32768 bits on x86 for example, and way too much. The minimum is now L1_CACHE_BYTES. UP and SMP should benefit from this patch, because most tasks will touch only one cache line when open()/close() stdin/stdout/stderr (0/1/2), (next_fd, close_on_exec_init, open_fds_init, fd_array[0 .. 2] being in the same cache line) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: add s2ram ioctl to userland interfaceLuca Tettamanti3-2/+40
Add the SNAPSHOT_S2RAM ioctl to the snapshot device. This ioctl allows a userland application to make the system (previously frozen with the SNAPSHOT_FREE ioctl) enter the S3 state without freezing processes and disabling nonboot CPUs for the second time. This will allow us to implement the suspend-to-disk-and-RAM (STDR) functionality in the userland suspend tools. Signed-off-by: Luca Tettamanti <kronos.it@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: let userland tools switch console on suspendRafael J. Wysocki1-3/+0
Remove the console-switching code from the suspend part of the swsusp userland interface and let the userland tools switch the console. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: drain high mem pagesShaohua Li1-0/+1
Highmem could be in pcp list as well. Signed-off-by: Shaohua Li<shaohua.li@intel.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: finally solve mysqld problemRafael J. Wysocki1-1/+2
This patch from Pavel moves userland freeze signals handling into more logical place. It now hits even with mysqld running. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] suspend: make progress printing prettierPavel Machek1-2/+3
Combination of printk/pr_debug led to <7> in the middle of the line, and we printed way too many dots. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: freeze user space processes firstRafael J. Wysocki3-17/+46
Allow swsusp to freeze processes successfully under heavy load by freezing userspace processes before kernel threads. [Thanks to Nigel Cunningham <nigel@suspend2.net> for suggesting the way to go.] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: userland interfaceRafael J. Wysocki4-5/+321
This patch introduces a user space interface for swsusp. The interface is based on a special character device, called the snapshot device, that allows user space processes to perform suspend and resume-related operations with the help of some ioctls and the read()/write() functions.  Additionally it allows these processes to allocate free swap pages from a selected swap partition, called the resume partition, so that they know which sectors of the resume partition are available to them. The interface uses the same low-level system memory snapshot-handling functions that are used by the built-it swap-writing/reading code of swsusp. The interface documentation is included in the patch. The patch assumes that the major and minor numbers of the snapshot device will be 10 (ie. misc device) and 231, the registration of which has already been requested. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: documentation updatesPavel Machek1-1/+1
Update suspend-to-RAM documentation with new machines, and makes message when processes can't be stopped little clearer. (In one case, waiting longer actually did help). From: "Rafael J. Wysocki" <rjw@sisk.pl> Warn in the documentation that data may be lost if there are some filesystems mounted from USB devices before suspend. [Thanks to Alan Stern for providing the answer to the question in the Q:-A: part.] Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] kernel/power: move externs to header filesRandy Dunlap2-12/+5
Move externs from C source files to header files. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: separate swap-writing/reading codeRafael J. Wysocki4-564/+581
Move the swap-writing/reading code of swsusp to a separate file. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] swsusp: low level interfaceRafael J. Wysocki4-488/+599
Introduce the low level interface that can be used for handling the snapshot of the system memory by the in-kernel swap-writing/reading code of swsusp and the userland interface code (to be introduced shortly). Also change the way in which swsusp records the allocated swap pages and, consequently, simplifies the in-kernel swap-writing/reading code (this is necessary for the userland interface too). To this end, it introduces two helper functions in mm/swapfile.c, so that the swsusp code does not refer directly to the swap internals. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] revert "swsusp: fix breakage with swap on lvm"Andrew Morton1-1/+3
This was a temporary thing for 2.6.16. Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] fix scheduler deadlockAnton Blanchard1-2/+7
We have noticed lockups during boot when stress testing kexec on ppc64. Two cpus would deadlock in scheduler code trying to grab already taken spinlocks. The double_rq_lock code uses the address of the runqueue to order the taking of multiple locks. This address is a per cpu variable: if (rq1 < rq2) { spin_lock(&rq1->lock); spin_lock(&rq2->lock); } else { spin_lock(&rq2->lock); spin_lock(&rq1->lock); } On the other hand, the code in wake_sleeping_dependent uses the cpu id order to grab locks: for_each_cpu_mask(i, sibling_map) spin_lock(&cpu_rq(i)->lock); This means we rely on the address of per cpu data increasing as cpu ids increase. While this will be true for the generic percpu implementation it may not be true for arch specific implementations. One way to solve this is to always take runqueues in cpu id order. To do this we add a cpu variable to the runqueue and check it in the double runqueue locking functions. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpcLinus Torvalds1-1/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (78 commits) [PATCH] powerpc: Add FSL SEC node to documentation [PATCH] macintosh: tidy-up driver_register() return values [PATCH] powerpc: tidy-up of_register_driver()/driver_register() return values [PATCH] powerpc: via-pmu warning fix [PATCH] macintosh: cleanup the use of i2c headers [PATCH] powerpc: dont allow old RTC to be selected [PATCH] powerpc: make powerbook_sleep_grackle static [PATCH] powerpc: Fix warning in add_memory [PATCH] powerpc: update mailing list addresses [PATCH] powerpc: Remove calculation of io hole [PATCH] powerpc: iseries: Add bootargs to /chosen [PATCH] powerpc: iseries: Add /system-id, /model and /compatible [PATCH] powerpc: Add strne2a() to convert a string from EBCDIC to ASCII [PATCH] powerpc: iseries: Make more stuff static in platforms/iseries/mf.c [PATCH] powerpc: iseries: Remove pointless iSeries_(restart|power_off|halt) [PATCH] powerpc: iseries: mf related cleanups [PATCH] powerpc: Replace platform_is_lpar() with a firmware feature [PATCH] powerpc: trivial: Cleanup whitespace in cputable.h [PATCH] powerpc: Remove unused iommu_off logic from pSeries_init_early() [PATCH] powerpc: Unconfuse htab_bolt_mapping() callers ...
2006-03-22Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6Linus Torvalds1-0/+29
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (138 commits) [SCSI] libata: implement minimal transport template for ->eh_timed_out [SCSI] eliminate rphy allocation in favour of expander/end device allocation [SCSI] convert mptsas over to end_device/expander allocations [SCSI] allow displaying and setting of cache type via sysfs [SCSI] add scsi_mode_select to scsi_lib.c [SCSI] 3ware 9000 add big endian support [SCSI] qla2xxx: update MAINTAINERS [SCSI] scsi: move target_destroy call [SCSI] fusion - bump version [SCSI] fusion - expander hotplug suport in mptsas module [SCSI] fusion - exposing raid components in mptsas [SCSI] fusion - memory leak, and initializing fields [SCSI] fusion - exclosure misspelled [SCSI] fusion - cleanup mptsas event handling functions [SCSI] fusion - removing target_id/bus_id from the VirtDevice structure [SCSI] fusion - static fix's [SCSI] fusion - move some debug firmware event debug msgs to verbose level [SCSI] fusion - loginfo header update [SCSI] add scsi_reprobe_device [SCSI] megaraid_sas: fix extended timeout handling ...
2006-03-22[PATCH] on_each_cpu(): disable local interruptsAndrew Morton1-0/+20
When on_each_cpu() runs the callback on other CPUs, it runs with local interrupts disabled. So we should run the function with local interrupts disabled on this CPU, too. And do the same for UP, so the callback is run in the same environment on both UP and SMP. (strictly it should do preempt_disable() too, but I think local_irq_disable is sufficiently equivalent). Also uninlines on_each_cpu(). softirq.c was the most appropriate file I could find, but it doesn't seem to justify creating a new file. Oh, and fix up that comment over (under?) x86's smp_call_function(). It drives me nuts. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] unshare: Error if passed unsupported flagsEric W. Biederman1-0/+6
A bare bones trivial patch to ensure we always get -EINVAL on the unsupported cases for sys_unshare. If this goes in before 2.6.16 it allows us to forward compatible with future applications using sys_unshare. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: JANAK DESAI <janak@us.ibm.com> Cc: <stable@kerenl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] sched: remove sleep_avg multiplierMike Galbraith1-6/+0
Remove the sleep_avg multiplier. This multiplier was necessary back when we had 10 seconds of dynamic range in sleep_avg, but now that we only have one second, it causes that one second to be compressed down to 100ms in some cases. This is particularly noticeable when compiling a kernel in a slow NFS mount, and I believe it to be a very likely candidate for other recently reported network related interactivity problems. In testing, I can detect no negative impact of this removal. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-21Merge ../linux-2.6James Bottomley8-104/+132
2006-03-20[PATCH] fix module sysfs files reference countingGreg Kroah-Hartman2-56/+31
The module files, refcnt, version, and srcversion did not properly increment the owner's module reference count, allowing the modules to be removed while the files were open, causing oopses. This patch fixes this, and also fixes the problem that the version and srcversion files were not showing up, unless CONFIG_MODULE_UNLOAD was enabled, which is not correct. Cc: Nathan Lynch <ntl@pobox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20[PATCH] add EXPORT_SYMBOL_GPL_FUTURE() to RCU subsystemGreg Kroah-Hartman1-3/+3
As the RCU symbols are going to be changed to GPL in the near future, lets warn users that this is going to happen. Cc: Paul McKenney <paulmck@us.ibm.com> Acked-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20[PATCH] add EXPORT_SYMBOL_GPL_FUTURE()Greg Kroah-Hartman1-2/+47
This patch adds the ability to mark symbols that will be changed in the future, so that kernel modules that don't include MODULE_LICENSE("GPL") and use the symbols, will be flagged and printed out to the system log. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20[PATCH] Clean up module.c symbol searching logicSam Ravnborg1-32/+41
Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20[PATCH] kobject: fix build error if CONFIG_SYSFS=nJun'ichi Nomura1-3/+0
Moving uevent_seqnum and uevent_helper to kobject_uevent.c because they are used even if CONFIG_SYSFS=n while kernel/ksysfs.c is built only if CONFIG_SYSFS=y, Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20[PATCH] fix audit_init failure pathAmy Griffis1-1/+2
Make audit_init() failure path handle situations where the audit_panic() action is not AUDIT_FAIL_PANIC (default is AUDIT_FAIL_PRINTK). Other uses of audit_sock are not reached unless audit's netlink message handler is properly registered. Bug noticed by Peter Staubach. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end ↵lorenzo@gnu.org1-0/+5
and audit_format Hi, This is a trivial patch that enables the possibility of using some auditing functions within loadable kernel modules (ie. inside a Linux Security Module). _ Make the audit_log_start, audit_log_end, audit_format and audit_log interfaces available to Loadable Kernel Modules, thus making possible the usage of the audit framework inside LSMs, etc. Signed-off-by: <Lorenzo Hernández García-Hierro <lorenzo@gnu.org>> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] sem2mutex: audit_netlink_semIngo Molnar3-12/+13
Semaphore to mutex conversion. The conversion was generated via scripts, and the result was validated automatically via a script as well. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] simplify audit_free() lockingIngo Molnar1-3/+7
Simplify audit_free()'s locking: no need to lock a task that we are tearing down. [the extra locking also caused false positives in the lock validator] Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] Fix audit operatorsDustin Kirkland1-6/+12
Darrel Goeddel initiated a discussion on IRC regarding the possibility of audit_comparator() returning -EINVAL signaling an invalid operator. It is possible when creating the rule to assure that the operator is one of the 6 sane values. Here's a snip from include/linux/audit.h Note that 0 (nonsense) and 7 (all operators) are not valid values for an operator. ... /* These are the supported operators. * 4 2 1 * = > < * ------- * 0 0 0 0 nonsense * 0 0 1 1 < * 0 1 0 2 > * 0 1 1 3 != * 1 0 0 4 = * 1 0 1 5 <= * 1 1 0 6 >= * 1 1 1 7 all operators */ ... Furthermore, prior to adding these extended operators, flagging the AUDIT_NEGATE bit implied !=, and otherwise == was assumed. The following code forces the operator to be != if the AUDIT_NEGATE bit was flipped on. And if no operator was specified, == is assumed. The only invalid condition is if the AUDIT_NEGATE bit is off and all of the AUDIT_EQUAL, AUDIT_LESS_THAN, and AUDIT_GREATER_THAN bits are on--clearly a nonsensical operator. Now that this is handled at rule insertion time, the default -EINVAL return of audit_comparator() is eliminated such that the function can only return 1 or 0. If this is acceptable, let's get this applied to the current tree. :-Dustin -- Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> (cherry picked from 9bf0a8e137040f87d1b563336d4194e38fb2ba1a commit)
2006-03-20[PATCH] Add tty to syscall audit recordsSteve Grubb1-2/+8
Hi, >From the RBAC specs: FAU_SAR.1.1 The TSF shall provide the set of authorized RBAC administrators with the capability to read the following audit information from the audit records: <snip> (e) The User Session Identifier or Terminal Type A patch adding the tty for all syscalls is included in this email. Please apply. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] add/remove rule updateSteve Grubb1-7/+9
Hi, The following patch adds a little more information to the add/remove rule message emitted by the kernel. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] audit string fields interface + consumerAmy Griffis4-141/+418
Updated patch to dynamically allocate audit rule fields in kernel's internal representation. Added unlikely() calls for testing memory allocation result. Amy Griffis wrote: [Wed Jan 11 2006, 02:02:31PM EST] > Modify audit's kernel-userspace interface to allow the specification > of string fields in audit rules. > > Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> (cherry picked from 5ffc4a863f92351b720fe3e9c5cd647accff9e03 commit)
2006-03-20[PATCH] Minor cosmetic cleanups to the code moved into auditfilter.cDavid Woodhouse1-7/+4
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALLDavid Woodhouse5-377/+454
This fixes the per-user and per-message-type filtering when syscall auditing isn't enabled. [AV: folded followup fix from the same author] Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] Miscellaneous bug and warning fixesDustin Kirkland1-9/+12
This patch fixes a couple of bugs revealed in new features recently added to -mm1: * fixes warnings due to inconsistent use of const struct inode *inode * fixes bug that prevent a kernel from booting with audit on, and SELinux off due to a missing function in security/dummy.c * fixes a bug that throws spurious audit_panic() messages due to a missing return just before an error_path label * some reasonable house cleaning in audit_ipc_context(), audit_inode_context(), and audit_log_task_context() Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] Capture selinux subject/object context information.Dustin Kirkland2-9/+135
This patch extends existing audit records with subject/object context information. Audit records associated with filesystem inodes, ipc, and tasks now contain SELinux label information in the field "subj" if the item is performing the action, or in "obj" if the item is the receiver of an action. These labels are collected via hooks in SELinux and appended to the appropriate record in the audit code. This additional information is required for Common Criteria Labeled Security Protection Profile (LSPP). [AV: fixed kmalloc flags use] [folded leak fixes] [folded cleanup from akpm (kfree(NULL)] [folded audit_inode_context() leak fix] [folded akpm's fix for audit_ipc_perm() definition in case of !CONFIG_AUDIT] Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] Exclude messages by message typeDustin Kirkland2-1/+37
- Add a new, 5th filter called "exclude". - And add a new field AUDIT_MSGTYPE. - Define a new function audit_filter_exclude() that takes a message type as input and examines all rules in the filter. It returns '1' if the message is to be excluded, and '0' otherwise. - Call the audit_filter_exclude() function near the top of audit_log_start() just after asserting audit_initialized. If the message type is not to be audited, return NULL very early, before doing a lot of work. [combined with followup fix for bug in original patch, Nov 4, same author] [combined with later renaming AUDIT_FILTER_EXCLUDE->AUDIT_FILTER_TYPE and audit_filter_exclude() -> audit_filter_type()] Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] Collect more inode information during syscall processing.Amy Griffis1-24/+118
This patch augments the collection of inode info during syscall processing. It represents part of the functionality that was provided by the auditfs patch included in RHEL4. Specifically, it: - Collects information for target inodes created or removed during syscalls. Previous code only collects information for the target inode's parent. - Adds the audit_inode() hook to syscalls that operate on a file descriptor (e.g. fchown), enabling audit to do inode filtering for these calls. - Modifies filtering code to check audit context for either an inode # or a parent inode # matching a given rule. - Modifies logging to provide inode # for both parent and child. - Protect debug info from NULL audit_names.name. [AV: folded a later typo fix from the same author] Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] Pass dentry, not just name, in fsnotify creation hooks.Amy Griffis1-1/+1
The audit hooks (to be added shortly) will want to see dentry->d_inode too, not just the name. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] Define new range of userspace messages.Steve Grubb1-0/+2
The attached patch updates various items for the new user space messages. Please apply. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] Filter rule comparatorsDustin Kirkland1-42/+75
Currently, audit only supports the "=" and "!=" operators in the -F filter rules. This patch reworks the support for "=" and "!=", and adds support for ">", ">=", "<", and "<=". This turned out to be a pretty clean, and simply process. I ended up using the high order bits of the "field", as suggested by Steve and Amy. This allowed for no changes whatsoever to the netlink communications. See the documentation within the patch in the include/linux/audit.h area, where there is a table that explains the reasoning of the bitmask assignments clearly. The patch adds a new function, audit_comparator(left, op, right). This function will perform the specified comparison (op, which defaults to "==" for backward compatibility) between two values (left and right). If the negate bit is on, it will negate whatever that result was. This value is returned. Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] AUDIT: kerneldoc for kernel/audit*.cRandy Dunlap2-46/+238
- add kerneldoc for non-static functions; - don't init static data to 0; - limit lines to < 80 columns; - fix long-format style; - delete whitespace at end of some lines; (chrisw: resend and update to current audit-2.6 tree) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] make vm86 call audit_syscall_exitJason Baron1-5/+0
hi, The motivation behind the patch below was to address messages in /var/log/messages such as: Jan 31 10:54:15 mets kernel: audit(:0): major=252 name_count=0: freeing multiple contexts (1) Jan 31 10:54:15 mets kernel: audit(:0): major=113 name_count=0: freeing multiple contexts (2) I can reproduce by running 'get-edid' from: http://john.fremlin.de/programs/linux/read-edid/. These messages come about in the log b/c the vm86 calls do not exit via the normal system call exit paths and thus do not call 'audit_syscall_exit'. The next system call will then free the context for itself and for the vm86 context, thus generating the above messages. This patch addresses the issue by simply adding a call to 'audit_syscall_exit' from the vm86 code. Besides fixing the above error messages the patch also now allows vm86 system calls to become auditable. This is useful since strace does not appear to properly record the return values from sys_vm86. I think this patch is also a step in the right direction in terms of cleaning up some core auditing code. If we can correct any other paths that do not properly call the audit exit and entries points, then we can also eliminate the notion of context chaining. I've tested this patch by verifying that the log messages no longer appear, and that the audit records for sys_vm86 appear to be correct. Also, 'read_edid' produces itentical output. thanks, -Jason Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-18[PATCH] don't do exit_io_context() until we know we won't be doing any IOAl Viro1-2/+5
testcase: mount /dev/sdb10 /mnt touch /mnt/tmp/b umount /mnt mount /dev/sdb10 /mnt rm /mnt/tmp/b </mnt/tmp/b umount /mnt and watch blkdev_ioc line in /proc/slabinfo. Vanilla kernel leaks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-18[PATCH] disable unshare(CLONE_VM) for nowOleg Nesterov1-3/+1
sys_unshare() does mmput(new_mm). This is not enough if we have mm->core_waiters. This patch is a temporary fix for soon to be released 2.6.16. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> [ Checked with Uli: "I'm not planning to use unshare(CLONE_VM). It's not needed for any functionality planned so far. What we (as in Red Hat) need unshare() for now is the filesystem side." ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] posix-timers: fix requeue accounting when signal is ignoredRoman Zippel1-0/+1
When the posix-timer signal is ignored then the timer is rearmed by the callback function. The requeue pending accounting has to be fixed up else the state might be wrong. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] time_interpolator: add __read_mostlyChristoph Lameter1-2/+2
The pointer to the current time interpolator and the current list of time interpolators are typically only changed during bootup. Adding __read_mostly takes them away from possibly hot cachelines. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] unshare: Use rcu_assign_pointer when setting sighandEric W. Biederman1-1/+1
The sighand pointer only needs the rcu_read_lock on the read side. So only depending on task_lock protection when setting this pointer is not enough. We also need a memory barrier to ensure the initialization is seen first. Use rcu_assign_pointer as it does this for us, and clearly documents that we are setting an rcu readable pointer. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17Merge ../linux-2.6Paul Mackerras2-8/+9
2006-03-14Merge ../linux-2.6James Bottomley6-29/+143
2006-03-14[PATCH] Fix sigaltstack corruption among cloned threadsGOTO Masanori1-0/+6
This patch fixes alternate signal stack corruption among cloned threads with CLONE_SIGHAND (and CLONE_VM) for linux-2.6.16-rc6. The value of alternate signal stack is currently inherited after a call of clone(... CLONE_SIGHAND | CLONE_VM). But if sigaltstack is set by a parent thread, and then if multiple cloned child threads (+ parent threads) call signal handler at the same time, some threads may be conflicted - because they share to use the same alternative signal stack region. Finally they get sigsegv. It's an undesirable race condition. Note that child threads created from NPTL pthread_create() also hit this conflict when the parent thread uses sigaltstack, without my patch. To fix this problem, this patch clears the child threads' sigaltstack information like exec(). This behavior follows the SUSv3 specification. In SUSv3, pthread_create() says "The alternate stack shall not be inherited (when new threads are initialized)". It means that sigaltstack should be cleared when sigaltstack memory space is shared by cloned threads with CLONE_SIGHAND. Note that I chose "if (clone_flags & CLONE_SIGHAND)" line because: - If clone_flags line is not existed, fork() does not inherit sigaltstack. - CLONE_VM is another choice, but vfork() does not inherit sigaltstack. - CLONE_SIGHAND implies CLONE_VM, and it looks suitable. - CLONE_THREAD is another candidate, and includes CLONE_SIGHAND + CLONE_VM, but this flag has a bit different semantics. I decided to use CLONE_SIGHAND. [ Changed to test for CLONE_VM && !CLONE_VFORK after discussion --Linus ] Signed-off-by: GOTO Masanori <gotom@sanori.org> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Linus Torvalds <torvalds@osdl.org> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-11[PATCH] remove __put_task_struct_cb export againChristoph Hellwig2-8/+3
The patch '[PATCH] RCU signal handling' [1] added an export for __put_task_struct_cb, a put_task_struct helper newly introduced in that patch. But the put_task_struct couldn't be used modular previously as __put_task_struct wasn't exported. There are not callers of it in modular code, and it shouldn't be exported because we don't want drivers to hold references to task_structs. This patch removes the export and folds __put_task_struct into __put_task_struct_cb as there's no other caller. [1] http://www2.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e56d090310d7625ecb43a1eeebd479f04affb48b Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-09Merge ../linux-2.6Paul Mackerras5-21/+134
2006-03-08[PATCH] fix file countingDipankar Sarma1-1/+4
I have benchmarked this on an x86_64 NUMA system and see no significant performance difference on kernbench. Tested on both x86_64 and powerpc. The way we do file struct accounting is not very suitable for batched freeing. For scalability reasons, file accounting was constructor/destructor based. This meant that nr_files was decremented only when the object was removed from the slab cache. This is susceptible to slab fragmentation. With RCU based file structure, consequent batched freeing and a test program like Serge's, we just speed this up and end up with a very fragmented slab - llm22:~ # cat /proc/sys/fs/file-nr 587730 0 758844 At the same time, I see only a 2000+ objects in filp cache. The following patch I fixes this problem. This patch changes the file counting by removing the filp_count_lock. Instead we use a separate percpu counter, nr_files, for now and all accesses to it are through get_nr_files() api. In the sysctl handler for nr_files, we populate files_stat.nr_files before returning to user. Counting files as an when they are created and destroyed (as opposed to inside slab) allows us to correctly count open files with RCU. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08[PATCH] rcu batch tuningDipankar Sarma1-18/+58
This patch adds new tunables for RCU queue and finished batches. There are two types of controls - number of completed RCU updates invoked in a batch (blimit) and monitoring for high rate of incoming RCUs on a cpu (qhimark, qlowmark). By default, the per-cpu batch limit is set to a small value. If the input RCU rate exceeds the high watermark, we do two things - force quiescent state on all cpus and set the batch limit of the CPU to INTMAX. Setting batch limit to INTMAX forces all finished RCUs to be processed in one shot. If we have more than INTMAX RCUs queued up, then we have bigger problems anyway. Once the incoming queued RCUs fall below the low watermark, the batch limit is set to the default. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08[PATCH] idle threads should have a sane ->timestamp valueIngo Molnar1-0/+1
Idle threads should have a sane ->timestamp value, to avoid init kernel thread(s) from inheriting it and causing miscalculations in try_to_wake_up(). Reported-by: Mike Galbraith <efault@gmx.de>. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06[PATCH] time: add barrier after updating jiffies_64Atsushi Nemoto1-0/+2
Add a compiler barrier so that we don't read jiffies before updating jiffies_64. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06[PATCH] fix next_timer_interrupt() for hrtimerTony Lindgren2-0/+51
Also from Thomas Gleixner <tglx@linutronix.de> Function next_timer_interrupt() got broken with a recent patch 6ba1b91213e81aa92b5cf7539f7d2a94ff54947c as sys_nanosleep() was moved to hrtimer. This broke things as next_timer_interrupt() did not check hrtimer tree for next event. Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ, VST) implementations, as the system can be in idle when next hrtimer event was supposed to happen. At least ARM and S390 currently use next_timer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06Add early-boot-safety check to cond_resched()Linus Torvalds1-0/+2
Just to be safe, we should not trigger a conditional reschedule during the early boot sequence. We've historically done some questionable early on, and the safety warnings in __might_sleep() are generally turned off during that period, so there might be problems lurking. This affects CONFIG_PREEMPT_VOLUNTARY, which takes over might_sleep() to cause a voluntary conditional reschedule. Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-02[PATCH] time_interpolator: Use readq_relaxed() instead of readq().Christoph Lameter1-2/+2
On some platforms readq performs additional work to make sure I/O is done in a coherent way. This is not needed for time retrieval as done by the time interpolator. So we can use readq_relaxed instead which will improve performance. It affects sparc64 and ia64 only. Apparently it makes a significant difference on ia64. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: john stultz <johnstul@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-02[PATCH] fix acpi_video_flags on x86-64Stefan Seyfried1-1/+1
acpi_video_flags variable is unsigned long, so it should be set as such. This actually matters on x86-64. Signed-off-by: Stefan Seyfried <seife@suse.de> Signed-off-by: Pavel Machek <pavel@suse.cz> Cc: "Brown, Len" <len.brown@intel.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-28[IA64] sysctl option to silence unaligned trap warningsJes Sorensen1-0/+14
Allow sysadmin to disable all warnings about userland apps making unaligned accesses by using: # echo 1 > /proc/sys/kernel/ignore-unaligned-usertrap Rather than having to use prctl on a process by process basis. Default behaivour leaves the warnings enabled. Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>