From: Manfred Spraul

(We think this might be the mystery bug which has been hanging about for months)

We found a [the?] task struct refcount error: A task that dies sets
tsk->state to TASK_ZOMBIE.  The next scheduled task checks prev->state, and
if it's ZOMBIE, then it decrements the reference count of prev.  The
prev->state & _ZOMBIE test is not atomic with schedule, thus if prev is
scheduled again and dies between dropping the runqueue lock and checking
prev->state, then the reference is dropped twice.

This is possible with either preemption [schedule_tail is called by
ret_from_fork with preemption count 1, finish_arch_switch drops it to 0] or
profiling [profile_exit_mmap can sleep on profile_rwsem, called by
mmdrop()] enabled.


 kernel/sched.c |   19 ++++++++++++++++++-
 1 files changed, 18 insertions(+), 1 deletion(-)

diff -puN kernel/sched.c~task-refcounting-fix kernel/sched.c
--- 25/kernel/sched.c~task-refcounting-fix	2003-08-09 13:05:37.000000000 -0700
+++ 25-akpm/kernel/sched.c	2003-08-09 13:05:37.000000000 -0700
@@ -742,12 +742,29 @@ static inline void finish_task_switch(ta
 {
 	runqueue_t *rq = this_rq();
 	struct mm_struct *mm = rq->prev_mm;
+	int drop_task_ref;
 
 	rq->prev_mm = NULL;
+
+	/*
+	 * A task struct has one reference for the use as "current".
+	 * If a task dies, then it sets TASK_ZOMBIE in tsk->state and calls
+	 * schedule one last time. The schedule call will never return,
+	 * and the scheduled task must drop that reference.
+	 * The test for TASK_ZOMBIE must occur while the runqueue locks are
+	 * still held, otherwise prev could be scheduled on another cpu, die
+	 * there before we look at prev->state, and then the reference would
+	 * be dropped twice.
+	 * 		Manfred Spraul
+	 */
+	drop_task_ref = 0;
+	if (unlikely(prev->state & (TASK_DEAD | TASK_ZOMBIE)))
+		drop_task_ref = 1;
+
 	finish_arch_switch(rq, prev);
 	if (mm)
 		mmdrop(mm);
-	if (prev->state & (TASK_DEAD | TASK_ZOMBIE))
+	if (drop_task_ref)
 		put_task_struct(prev);
 }
 

_