From: Hidetoshi Seto NPTL has 3 control counters (total/wake/woken). so NPTL can know: "how many threads enter to wait"(total), "how many threads receive wake signal"(wake), and "how many threads exit waiting"(woken). Abstraction of pthread_cond_wait and pthread_cond_signal are: A01 pthread_cond_wait { A02 timeout = 0; A03 lock(counters); A04 total++; A05 val = get_from(futex); A06 unlock(counters); A07 A08 sys_futex(futex, FUTEX_WAIT, val, timeout); A09 A10 lock(counters); A11 woken++; A12 unlock(counters); A13 } B01 pthread_cond_signal { B02 lock(counters); B03 if(total>wake) { /* if there is waiter */ B04 wake++; B05 update_val(futex); B06 sys_futex(futex, FUTEX_WAKE, 1); B07 } B08 unlock(counters); B09 } What we have to notice is: FUTEX_WAKE could be called before FUTEX_WAIT have called (at A07). In such case, FUTEX_WAKE will fail if there is no thread in waitqueue. However, since pthread_cond_signal do not only wake++ but also update_val(futex), next FUTEX_WAIT will fail with -EWOULDBLOCK because the val passed to WAIT is now not equal to updated val. Therefore, as the result, it seems that the WAKE wakes the WAIT. ----- The bug will appear if 2 pair of wait & wake called at (nearly)once: * Assume 4 threads, wait_A, wait_B, wake_X, and wake_Y * counters start from [total/wake/woken]=[0/0/0] * the val of futex starts from (0), update means inclement of the val. * there is no thread in waitqueue on the futex. [simulation] wait_A: calls pthread_cond_wait: total++, prepare to call FUTEX_WAIT with val=0. # status: [1/0/0] (0) queue={}(empty) # wake_X: calls pthread_cond_signal: no one in waitqueue, just wake++ and update futex val. # status: [1/1/0] (1) queue={}(empty) # wait_B: calls pthread_cond_wait: total++, prepare to call FUTEX_WAIT with val=1. # status: [2/1/0] (1) queue={}(empty) # wait_A: calls FUTEX_WAIT with val=0: after queueing, compare val. 0!=1 ... this should be blocked... # status: [2/1/0] (1) queue={A} # wait_B: calls FUTEX_WAIT with val=1: after queueing, compare val. 1==1 ... OK, let's schedule()... # status: [2/1/0] (1) queue={A,B} (B=sleeping) # wake_Y: calls pthread_cond_signal: A is in waitqueue ... dequeue A, wake++ and update futex val. # status: [2/2/0] (2) queue={B} (B=sleeping) # wait_A: end of FUTEX_WAIT with val=0: try to dequeue but already dequeued, return anyway. # status: [2/2/0] (2) queue={B} (B=sleeping) # wait_A: end of pthread_cond_wait: woken++. # status: [2/2/1] (2) queue={B} (B=sleeping) # This is bug: wait_A: wakeup wait_B: sleeping wake_X: wake A wake_Y: wake A again if subsequent wake_Z try to wake B: wake_Z: calls pthread_cond_signal: since total==wake, do nothing. # status: [2/2/1] (2) queue={B} (B=sleeping) # If wait_C comes, B become to can be woken, but C... This bug makes the waitqueue to trap some threads in it all time. ----- > - According to man of futex: > "If the futex was not equal to the expected value, the operation > returns -EWOULDBLOCK." > but now, here is no description about the rare case: > "returns 0 if the futex was not equal to the expected value, but > the process was woken by a FUTEX_WAKE call." > this behavior on rare case causes the hang which I found. So to avoid this problem, my patch shut up the window that you said: > The patch certainly looks sensible - I can see that without the patch, > there is a window in which this process is pointlessly queued up on the > futex and that in this window a wakeup attempt might do a bad thing. ----- In short: There is an un-documented behavior of futex_wait. This behavior misleads NPTL to wake a thread doubly, as the result, causes an application hang. Signed-off-by: Hidetoshi Seto Signed-off-by: Andrew Morton --- 25-akpm/kernel/futex.c | 11 +++++------ 1 files changed, 5 insertions(+), 6 deletions(-) diff -puN kernel/futex.c~futex_wait-fix kernel/futex.c --- 25/kernel/futex.c~futex_wait-fix Tue Nov 9 13:43:05 2004 +++ 25-akpm/kernel/futex.c Tue Nov 9 13:43:05 2004 @@ -486,8 +486,6 @@ static int futex_wait(unsigned long uadd if (unlikely(ret != 0)) goto out_release_sem; - queue_me(&q, -1, NULL); - /* * Access the page after the futex is queued. * We hold the mmap semaphore, so the mapping cannot have changed @@ -495,13 +493,15 @@ static int futex_wait(unsigned long uadd */ if (get_user(curval, (int __user *)uaddr) != 0) { ret = -EFAULT; - goto out_unqueue; + goto out_release_sem; } if (curval != val) { ret = -EWOULDBLOCK; - goto out_unqueue; + goto out_release_sem; } + queue_me(&q, -1, NULL); + /* * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep. @@ -542,11 +542,10 @@ static int futex_wait(unsigned long uadd WARN_ON(!signal_pending(current)); return -EINTR; - out_unqueue: /* If we were woken (and unqueued), we succeeded, whatever. */ if (!unqueue_me(&q)) ret = 0; - out_release_sem: +out_release_sem: up_read(¤t->mm->mmap_sem); return ret; } _