I believe I have identified a failure mode that Linus saw a couple of weeks
back while tracking down other fork/exit races.  We saw this come up on
rare occasions with the RHEL3 kernel's backport of the new code (while
trying to track down other race failure modes we have yet to fix, sigh).

I am talking about the following scenario:

> Btw, even with the fix, doing a "while : ; ./crash t 10 ; done" will
> eventually result in a stuck process:
> 
> 	 1415 tty1     D      0:00 ./crash
> 
> This is some kind of deadlock: most of the fifty threads are in "D"
> state, with a trace something like
> 
> 	 [<c011fbe3>] schedule+0x360/0x7f8
> 	 [<c0120539>] wait_for_completion+0xd4/0x1c3
> 	 [<c0128c9e>] do_exit+0x627/0x6a4
> 	 [<c0128ddd>] do_group_exit+0x3d/0x177
> 	 [<c0130c13>] dequeue_signal+0x2d/0x84
> 	 [<c0133911>] get_signal_to_deliver+0x390/0x575
> 	 [<c010a541>] do_signal+0x6c/0xf1
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c013d50f>] do_futex+0x6d/0x7d
> 	 [<c013d635>] sys_futex+0x116/0x12f
> 	 [<c010a601>] do_notify_resume+0x3b/0x3d
> 	 [<c010a82e>] work_notifysig+0x13/0x15
> 
> except for one that is trying to core-dump:
> 
> 	 [<c0120539>] wait_for_completion+0xd4/0x1c3
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c02101aa>] rwsem_wake+0x86/0x12d
> 	 [<c01738af>] coredump_wait+0xa8/0xaa
> 	 [<c0173a26>] do_coredump+0x175/0x26c
> 
> and three that are just doing a regular "exit()" system call:
> 
> 	 [<c011fbe3>] schedule+0x360/0x7f8
> 	 [<c011e19a>] recalc_task_prio+0x90/0x1aa
> 	 [<c0120539>] wait_for_completion+0xd4/0x1c3
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c0210207>] rwsem_wake+0xe3/0x12d
> 	 [<c0128c9e>] do_exit+0x627/0x6a4
> 	 [<c0128d4d>] next_thread+0x0/0x53
> 	 [<c010a7e3>] syscall_call+0x7/0xb
> 
> However, the rest of the system is totally unaffected by this deadlock:
> it's only deadlocked within the thread group itself, nobody else cares.

What happens here is a race between an exiting thread checking
mm->core_waiters in __exit_mm, and the thread taking the core-dump signal
(in coredump_wait) examining the exiting thread's ->mm pointer and
incrementing mm->core_waiters to account for it.  __exit_mm checks
mm->core_waiters with no synchronization at all.  If the coredump_wait
thread reads tsk->mm while tsk is in __exit_mm, after tsk has checked
mm->core_waiters but before it has cleared tsk->mm, then it increments
mm->core_waiters for a thread that will never decrement it.  The count
therefore never reaches zero, core_startup_done is never completed, and
the dumping thread blocks forever in coredump_wait.

The following patch fixes the problem by using mm->mmap_sem in __exit_mm.
The read lock must be held around checking mm->core_waiters and clearing
tsk->mm so that coredump_wait (which gets the write lock) cannot come in
between and do bogus bookkeeping.
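
For illustration only, here is a small userspace model of the bookkeeping,
not kernel code.  The names mm_model, task_model, dumper_account,
exit_check, exit_clear, racy_leftover and serialized_leftover are all made
up for this sketch, and the locking is replaced by simply fixing the order
of the steps by hand.  It assumes a single exiting thread and a single
dumper, and only tracks how many core_waiters are left unaccounted for.

```c
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct and task_struct; only the
 * fields the race touches are modelled. */
struct mm_model {
	int core_waiters;
};

struct task_model {
	struct mm_model *mm;
};

/* Dumper side (models coredump_wait): count a thread that still has ->mm. */
static void dumper_account(struct mm_model *mm, const struct task_model *tsk)
{
	if (tsk->mm == mm)
		mm->core_waiters++;
}

/* Exit side, first half: check core_waiters and check in if it is set. */
static void exit_check(struct mm_model *mm)
{
	if (mm->core_waiters)
		mm->core_waiters--;
}

/* Exit side, second half: drop the mm pointer. */
static void exit_clear(struct task_model *tsk)
{
	tsk->mm = NULL;
}

/* Unpatched ordering: the dumper slips in between check and clear. */
static int racy_leftover(void)
{
	struct mm_model mm = { 0 };
	struct task_model tsk = { &mm };

	exit_check(&mm);           /* sees 0, skips the handshake */
	dumper_account(&mm, &tsk); /* ->mm still set, so tsk is counted */
	exit_clear(&tsk);          /* too late: nobody will decrement */
	return mm.core_waiters;
}

/* Patched ordering: check and clear happen as one unit, so the dumper
 * runs either entirely before or entirely after them. */
static int serialized_leftover(int dumper_first)
{
	struct mm_model mm = { 0 };
	struct task_model tsk = { &mm };

	if (dumper_first) {
		dumper_account(&mm, &tsk);
		exit_check(&mm);
		exit_clear(&tsk);
	} else {
		exit_check(&mm);
		exit_clear(&tsk);
		dumper_account(&mm, &tsk);
	}
	return mm.core_waiters;
}
```

In this model racy_leftover() leaves one waiter that is never checked in,
which corresponds to the stuck dump, while serialized_leftover() leaves
none for either ordering.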



 25-akpm/kernel/exit.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletion(-)

diff -puN kernel/exit.c~mm_core_waiters-synchronisation kernel/exit.c
--- 25/kernel/exit.c~mm_core_waiters-synchronisation	Wed Dec 17 15:20:56 2003
+++ 25-akpm/kernel/exit.c	Wed Dec 17 15:20:56 2003
@@ -472,21 +472,29 @@ static inline void __exit_mm(struct task
 	if (!mm)
 		return;
 	/*
-	 * Serialize with any possible pending coredump:
+	 * Serialize with any possible pending coredump.
+	 * We must hold mmap_sem around checking core_waiters
+	 * and clearing tsk->mm.  The core-inducing thread
+	 * will increment core_waiters for each thread in the
+	 * group with ->mm != NULL.
 	 */
+	down_read(&mm->mmap_sem);
 	if (mm->core_waiters) {
+		up_read(&mm->mmap_sem);
 		down_write(&mm->mmap_sem);
 		if (!--mm->core_waiters)
 			complete(mm->core_startup_done);
 		up_write(&mm->mmap_sem);
 
 		wait_for_completion(&mm->core_done);
+		down_read(&mm->mmap_sem);
 	}
 	atomic_inc(&mm->mm_count);
 	if (mm != tsk->active_mm) BUG();
 	/* more a memory barrier than a real lock */
 	task_lock(tsk);
 	tsk->mm = NULL;
+	up_read(&mm->mmap_sem);
 	enter_lazy_tlb(mm, current);
 	task_unlock(tsk);
 	mmput(mm);

_