From: Nick Piggin
From: Rick Lindsley

In find_busiest_group(), after we exit the do/while, we select our
imbalance.  But max_load, avg_load, and this_load are all unsigned, so
min(x,y) will make a bad choice whenever max_load < avg_load < this_load:
both differences underflow, and we end up choosing between two negative
(that is, very large unsigned) numbers.

There is also a bug when max_load never gets changed from zero (look in
the loop and consider what happens if the only load on the machine is
being created by cpu groups of which we are a member).  Put the two
together and you have a recipe for some really bogus values of imbalance.

Even if you fix the max_load == 0 bug, there will still be times when
avg_load - this_load is negative (and thus very large), and we will
decide to move tasks when we shouldn't have.  This patch allows
this_load to set max_load, which, if I understand the logic properly,
is correct.

With this patch applied, the algorithm is *much* more conservative ...
maybe *too* conservative, but that's for another round of testing ...

---

 25-akpm/kernel/sched.c |   24 +++++++++++++++---------
 1 files changed, 15 insertions(+), 9 deletions(-)

diff -puN kernel/sched.c~sched-find_busiest_group-arith-fix kernel/sched.c
--- 25/kernel/sched.c~sched-find_busiest_group-arith-fix	Fri Feb 6 15:17:19 2004
+++ 25-akpm/kernel/sched.c	Fri Feb 6 15:17:19 2004
@@ -1418,14 +1418,13 @@ find_busiest_group(struct sched_domain *
 		total_nr_cpus += nr_cpus;
 		avg_load /= nr_cpus;
 
+		if (avg_load > max_load)
+			max_load = avg_load;
+
 		if (local_group) {
 			this_load = avg_load;
-			goto nextgroup;
-		}
-
-		if (avg_load >= max_load) {
+		} else if (avg_load >= max_load) {
 			busiest = group;
-			max_load = avg_load;
 			busiest_nr_cpus = nr_cpus;
 		}
 nextgroup:
@@ -1448,11 +1447,18 @@ nextgroup:
 	 * reduce the max loaded cpu below the average load, as either of these
 	 * actions would just result in more rebalancing later, and ping-pong
 	 * tasks around. Thus we look for the minimum possible imbalance.
+	 * Negative imbalances (*we* are more loaded than anyone else) will
+	 * be counted as no imbalance for these purposes -- we can't fix that
+	 * by pulling tasks to us. Be careful of negative numbers as they'll
+	 * appear as very large values with unsigned longs.
 	 */
-	*imbalance = min(max_load - avg_load, avg_load - this_load);
-
-	/* Get rid of the scaling factor now, rounding *up* as we divide */
-	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
+	if (avg_load >= this_load) {
+		*imbalance = min(max_load - avg_load, avg_load - this_load);
+		/* Get rid of the scaling factor, rounding *up* as we divide */
+		*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1)
+						>> SCHED_LOAD_SHIFT;
+	} else
+		*imbalance = 0;
 
 	if (*imbalance == 0) {
 		if (package_idle != NOT_IDLE && domain->flags & SD_FLAG_IDLE
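
For illustration only (not part of the patch): below is a minimal
userspace sketch of the unsigned-subtraction pitfall described above.
The variable names mirror the scheduler's, but the values are made up
and simply model the case where our own group carries all the load
(this_load > avg_load > max_load).

/*
 * Illustration only, not part of the patch: demonstrate how the
 * unsigned differences underflow when this_load exceeds avg_load.
 */
#include <stdio.h>

#define min(x, y)	((x) < (y) ? (x) : (y))

int main(void)
{
	unsigned long max_load = 100, avg_load = 200, this_load = 350;
	unsigned long imbalance;

	/* Both differences underflow and become huge unsigned values. */
	imbalance = min(max_load - avg_load, avg_load - this_load);
	printf("unguarded imbalance: %lu\n", imbalance);

	/*
	 * The guard added by the patch: a "negative" imbalance means we
	 * are the most loaded group, and pulling tasks to us can't fix
	 * that, so treat it as no imbalance at all.
	 */
	if (avg_load >= this_load)
		imbalance = min(max_load - avg_load, avg_load - this_load);
	else
		imbalance = 0;
	printf("guarded imbalance:   %lu\n", imbalance);

	return 0;
}

On a 64-bit box the first printf shows a value near ULONG_MAX, which is
exactly the kind of bogus imbalance the unguarded code would have tried
to "correct".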