Because of the batching logic in shrink_zone() it turns out that the
high-level `priority' variable is not a reliable indicator of the amount
of scanning stress.

This is because with small inactive lists, `priority' can pass through
quite a number of decrements before we do _any_ scanning at all: instead,
we've simply been increasing the value of zone->nr_inactive_scan, waiting
for it to exceed SWAP_CLUSTER_MAX, at which point we do some scanning for
real.

So the patch stops using `priority' as a measure of scanning distress and
uses (total pages scanned) / (total pages reclaimed) instead.  The code at
present fairly arbitrarily decides that if this ratio exceeds 1.5, reclaim
is struggling and we need to throttle.

---

 25-akpm/mm/vmscan.c |   17 +++++++++--------
 1 files changed, 9 insertions(+), 8 deletions(-)

diff -puN mm/vmscan.c~vmscan-less-sleepiness mm/vmscan.c
--- 25/mm/vmscan.c~vmscan-less-sleepiness	2004-03-25 22:25:48.367046584 -0800
+++ 25-akpm/mm/vmscan.c	2004-03-25 22:27:34.007986720 -0800
@@ -876,11 +876,9 @@ int try_to_free_pages(struct zone **zone
 		if (total_scanned > SWAP_CLUSTER_MAX + SWAP_CLUSTER_MAX/2) {
 			wakeup_bdflush(laptop_mode ? 0 : total_scanned);
 			do_writepage = 1;
+			blk_congestion_wait(WRITE, HZ/10);
 		}
 
-		/* Take a nap, wait for some writeback to complete */
-		if (scanned && priority < DEF_PRIORITY - 2)
-			blk_congestion_wait(WRITE, HZ/10);
 	}
 	if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
 		out_of_memory();
@@ -1009,12 +1007,15 @@ scan:
 			continue;	/* swsusp: need to do more work */
 		if (all_zones_ok)
 			break;		/* kswapd: all done */
-		/*
-		 * OK, kswapd is getting into trouble.  Take a nap, then take
-		 * another pass across the zones.
-		 */
-		if (total_scanned && priority < DEF_PRIORITY - 2)
+		if (do_writepage) {
+			/*
+			 * OK, kswapd is getting into trouble.  Take a nap,
+			 * then take another pass across the zones.  We reuse
+			 * do_writepage here, just because it happens to be
+			 * exactly what we want...
+			 */
 			blk_congestion_wait(WRITE, HZ/10);
+		}
 	}
 out:
 	for (i = 0; i < pgdat->nr_zones; i++) {
_
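
For illustration only, the distress test described in the changelog reduces
to integer arithmetic, since the kernel cannot use floating point.  Here is
a minimal standalone sketch, not kernel code: reclaim_struggling() is a
hypothetical helper, and cross-multiplying by 3/2 is one way to express the
"exceeds 1.5" check.

	#include <stdio.h>

	/*
	 * Reclaim is deemed to be struggling when
	 *     (total pages scanned) / (total pages reclaimed) > 1.5,
	 * rewritten as 2 * scanned > 3 * reclaimed to stay in integers.
	 */
	static int reclaim_struggling(unsigned long scanned,
					unsigned long reclaimed)
	{
		if (reclaimed == 0)	/* scanned but freed nothing */
			return scanned != 0;
		return 2 * scanned > 3 * reclaimed;
	}

	int main(void)
	{
		/* ratio 2.0: struggling, prints 1 */
		printf("%d\n", reclaim_struggling(32, 16));
		/* ratio ~1.33: healthy, prints 0 */
		printf("%d\n", reclaim_struggling(32, 24));
		return 0;
	}

Keeping the test in integers matches kernel practice, and the fairly
arbitrary 1.5 cutoff noted above is easy to retune by changing the two
multipliers.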