Intro to these patches:

- Major surgery against the pagecache, radix-tree and writeback code.  This
  work is to address the O_DIRECT-vs-buffered data exposure horrors which
  we've been struggling with for months.

  As a side-effect, 32 bytes are saved from struct inode and eight bytes
  are removed from struct page.  At a cost of approximately 2.5 bits per
  page in the radix tree nodes on 4k pagesize, assuming the pagecache is
  densely populated.  Not all pages are pagecache; other pages gain the
  full 8-byte saving.

  This change will break any arch code which is using page->list and will
  also break any arch code which is using page->lru of memory which was
  obtained from slab.

  The basic problem which we (mainly Daniel McNeil) have been struggling
  with is in getting a really reliable fsync() across the page lists while
  other processes are performing writeback against the same file.  It's
  like juggling four bars of wet soap with your eyes shut while someone is
  whacking you with a baseball bat.  Daniel pretty much has the problem
  plugged but I suspect that's just because we don't have testcases to
  trigger the remaining problems.  The complexity and additional locking
  which those patches add is worrisome.

  So the approach taken here is to remove the page lists altogether and to
  replace the list-based writeback and wait operations with in-order
  radix-tree walks.

  The radix-tree code has been enhanced to support "tagging" of pages, for
  later searches for pages which have a particular tag set.  This means
  that we can ask the radix tree code "find me the next 16 dirty pages
  starting at pagecache index N" and it will do that in O(log64(N)) time.

  This affects I/O scheduling potentially quite significantly.  It is no
  longer the case that the kernel will submit pages for I/O in the order
  in which the application dirtied them.  We instead submit them in
  file-offset order all the time.
  Submitting in file-offset order is likely to be advantageous when
  applications are seeking all over a large file, randomly writing small
  amounts of data.  I haven't performed much benchmarking, but tiobench
  random write throughput seems to be increased by 30%.  Other tests appear
  to be unaltered.  dbench may have got 10-20% quicker, but it's variable.

  There is one large file which everyone seeks all over, randomly writing
  small amounts of data: the blockdev mapping which caches filesystem
  metadata.  The kernel's IO submission patterns for this are now ideal.

  Because writeback and wait-for-writeback use a tree walk instead of a
  list walk they are no longer livelockable.  This probably means that we
  no longer need to hold i_sem across O_SYNC writes, and perhaps not
  across fsync() and fdatasync() either.  This may be beneficial for
  databases: multiple processes writing and syncing different parts of the
  same file at the same time can now all submit and wait upon writes to
  just their own little bit of the file, so we can get a lot more data
  into the queues.

  It is trivial to implement a part-file-fdatasync() as well, so
  applications can say "sync the file from byte N to byte M", and multiple
  applications can do this concurrently.  This is easy for ext2
  filesystems, but probably needs lots of work for data-journalled
  filesystems and XFS, and it probably doesn't offer much benefit over an
  i_sem-less O_SYNC write.

This patch:

- Later, we'll need to access the radix trees from inside disk I/O
  completion handlers.  So make mapping->page_lock irq-safe.  And rename
  it to tree_lock to reliably break any missed conversions.
---

 25-akpm/fs/buffer.c         |    8 +++----
 25-akpm/fs/cifs/file.c      |   10 --------
 25-akpm/fs/fs-writeback.c   |    4 +--
 25-akpm/fs/inode.c          |    2 -
 25-akpm/fs/mpage.c          |   10 ++++----
 25-akpm/include/linux/fs.h  |    2 -
 25-akpm/ipc/shm.c           |    2 -
 25-akpm/mm/filemap.c        |   50 ++++++++++++++++++++++----------------------
 25-akpm/mm/page-writeback.c |   10 ++++----
 25-akpm/mm/readahead.c      |    8 +++----
 25-akpm/mm/swap_state.c     |   22 +++++++++----------
 25-akpm/mm/swapfile.c       |    8 +++----
 25-akpm/mm/truncate.c       |    8 +++----
 25-akpm/mm/vmscan.c         |   13 +++--------
 14 files changed, 71 insertions(+), 86 deletions(-)

diff -puN fs/buffer.c~irq-safe-pagecache-lock fs/buffer.c
--- 25/fs/buffer.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.029554808 -0800
+++ 25-akpm/fs/buffer.c	2004-04-03 03:00:11.053551160 -0800
@@ -396,7 +396,7 @@ out:
  * Hack idea: for the blockdev mapping, i_bufferlist_lock contention
  * may be quite high.  This code could TryLock the page, and if that
  * succeeds, there is no need to take private_lock. (But if
- * private_lock is contended then so is mapping->page_lock).
+ * private_lock is contended then so is mapping->tree_lock).
  */
 static struct buffer_head *
 __find_get_block_slow(struct block_device *bdev, sector_t block, int unused)
@@ -867,14 +867,14 @@ int __set_page_dirty_buffers(struct page
 	spin_unlock(&mapping->private_lock);
 
 	if (!TestSetPageDirty(page)) {
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		if (page->mapping) {	/* Race with truncate? */
 			if (!mapping->backing_dev_info->memory_backed)
 				inc_page_state(nr_dirty);
 			list_del(&page->list);
 			list_add(&page->list, &mapping->dirty_pages);
 		}
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}
 
@@ -1254,7 +1254,7 @@ __getblk_slow(struct block_device *bdev,
  * inode to its superblock's dirty inode list.
  *
  * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
- * mapping->page_lock and the global inode_lock.
+ * mapping->tree_lock and the global inode_lock.
  */
 void fastcall mark_buffer_dirty(struct buffer_head *bh)
 {
diff -puN fs/fs-writeback.c~irq-safe-pagecache-lock fs/fs-writeback.c
--- 25/fs/fs-writeback.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.031554504 -0800
+++ 25-akpm/fs/fs-writeback.c	2004-04-03 03:00:11.054551008 -0800
@@ -159,10 +159,10 @@ __sync_single_inode(struct inode *inode,
 	 * read speculatively by this cpu before &= ~I_DIRTY -- mikulas
 	 */
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	if (wait || !wbc->for_kupdate || list_empty(&mapping->io_pages))
 		list_splice_init(&mapping->dirty_pages, &mapping->io_pages);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
diff -puN fs/inode.c~irq-safe-pagecache-lock fs/inode.c
--- 25/fs/inode.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.032554352 -0800
+++ 25-akpm/fs/inode.c	2004-04-03 03:00:11.055550856 -0800
@@ -187,7 +187,7 @@ void inode_init_once(struct inode *inode
 	sema_init(&inode->i_sem, 1);
 	init_rwsem(&inode->i_alloc_sem);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
-	spin_lock_init(&inode->i_data.page_lock);
+	spin_lock_init(&inode->i_data.tree_lock);
 	init_MUTEX(&inode->i_data.i_shared_sem);
 	atomic_set(&inode->i_data.truncate_count, 0);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
diff -puN fs/mpage.c~irq-safe-pagecache-lock fs/mpage.c
--- 25/fs/mpage.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.033554200 -0800
+++ 25-akpm/fs/mpage.c	2004-04-03 03:00:11.056550704 -0800
@@ -635,7 +635,7 @@ mpage_writepages(struct address_space *m
 	if (get_block == NULL)
 		writepage = mapping->a_ops->writepage;
 
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	while (!list_empty(&mapping->io_pages) && !done) {
 		struct page *page = list_entry(mapping->io_pages.prev,
 					struct page, list);
@@ -655,10 +655,10 @@ mpage_writepages(struct address_space *m
 		list_add(&page->list, &mapping->locked_pages);
 		page_cache_get(page);
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 
 		/*
-		 * At this point we hold neither mapping->page_lock nor
+		 * At this point we hold neither mapping->tree_lock nor
 		 * lock on the page itself: the page may be truncated or
 		 * invalidated (changing page->mapping to NULL), or even
 		 * swizzled back from swapper_space to tmpfs file mapping.
@@ -695,12 +695,12 @@ mpage_writepages(struct address_space *m
 			unlock_page(page);
 		}
 		page_cache_release(page);
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 	}
 	/*
 	 * Leave any remaining dirty pages on ->io_pages
 	 */
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	if (bio)
 		mpage_bio_submit(WRITE, bio);
 	return ret;
diff -puN include/linux/fs.h~irq-safe-pagecache-lock include/linux/fs.h
--- 25/include/linux/fs.h~irq-safe-pagecache-lock	2004-04-03 03:00:11.035553896 -0800
+++ 25-akpm/include/linux/fs.h	2004-04-03 03:00:11.058550400 -0800
@@ -322,7 +322,7 @@ struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
-	spinlock_t		page_lock;	/* and spinlock protecting it */
+	spinlock_t		tree_lock;	/* and spinlock protecting it */
 	struct list_head	clean_pages;	/* list of clean pages */
 	struct list_head	dirty_pages;	/* list of dirty pages */
 	struct list_head	locked_pages;	/* list of locked pages */
diff -puN ipc/shm.c~irq-safe-pagecache-lock ipc/shm.c
--- 25/ipc/shm.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.036553744 -0800
+++ 25-akpm/ipc/shm.c	2004-04-03 03:00:11.058550400 -0800
@@ -380,9 +380,7 @@ static void shm_get_stat(unsigned long *
 		if (is_file_hugepages(shp->shm_file)) {
 			struct address_space *mapping = inode->i_mapping;
-			spin_lock(&mapping->page_lock);
 			*rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
-			spin_unlock(&mapping->page_lock);
 		} else {
 			struct shmem_inode_info *info = SHMEM_I(inode);
 			spin_lock(&info->lock);
diff -puN mm/filemap.c~irq-safe-pagecache-lock mm/filemap.c
--- 25/mm/filemap.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.037553592 -0800
+++ 25-akpm/mm/filemap.c	2004-04-03 03:00:11.061549944 -0800
@@ -59,7 +59,7 @@
  *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
  *      ->swap_list_lock
  *        ->swap_device_lock	(exclusive_swap_page, others)
- *          ->mapping->page_lock
+ *          ->mapping->tree_lock
 *
 *  ->i_sem
 *    ->i_shared_sem		(truncate->invalidate_mmap_range)
@@ -78,12 +78,12 @@
 *
 *  ->inode_lock
 *    ->sb_lock			(fs/fs-writeback.c)
- *    ->mapping->page_lock	(__sync_single_inode)
+ *    ->mapping->tree_lock	(__sync_single_inode)
 *
 *  ->page_table_lock
 *    ->swap_device_lock	(try_to_unmap_one)
 *    ->private_lock		(try_to_unmap_one)
- *    ->page_lock		(try_to_unmap_one)
+ *    ->tree_lock		(try_to_unmap_one)
 *    ->zone.lru_lock		(follow_page->mark_page_accessed)
 *
 *  ->task->proc_lock
@@ -93,7 +93,7 @@
 /*
  * Remove a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
- * is safe.  The caller must hold a write_lock on the mapping's page_lock.
+ * is safe.  The caller must hold a write_lock on the mapping's tree_lock.
  */
 void __remove_from_page_cache(struct page *page)
 {
@@ -114,9 +114,9 @@ void remove_from_page_cache(struct page
 	if (unlikely(!PageLocked(page)))
 		PAGE_BUG(page);
 
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	__remove_from_page_cache(page);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 }
 
 static inline int sync_page(struct page *page)
@@ -148,9 +148,9 @@ static int __filemap_fdatawrite(struct a
 	if (mapping->backing_dev_info->memory_backed)
 		return 0;
 
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	list_splice_init(&mapping->dirty_pages, &mapping->io_pages);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	ret = do_writepages(mapping, &wbc);
 	return ret;
 }
@@ -185,7 +185,7 @@ int filemap_fdatawait(struct address_spa
 restart:
 	progress = 0;
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	while (!list_empty(&mapping->locked_pages)) {
 		struct page *page;
@@ -199,7 +199,7 @@ restart:
 		if (!PageWriteback(page)) {
 			if (++progress > 32) {
 				if (need_resched()) {
-					spin_unlock(&mapping->page_lock);
+					spin_unlock_irq(&mapping->tree_lock);
 					__cond_resched();
 					goto restart;
 				}
@@ -209,16 +209,16 @@ restart:
 		progress = 0;
 		page_cache_get(page);
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		wait_on_page_writeback(page);
 		if (PageError(page))
 			ret = -EIO;
 		page_cache_release(page);
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 	}
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 
 	/* Check for outstanding write errors */
 	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
@@ -267,7 +267,7 @@ int add_to_page_cache(struct page *page,
 	if (error == 0) {
 		page_cache_get(page);
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
 			SetPageLocked(page);
@@ -275,7 +275,7 @@ int add_to_page_cache(struct page *page,
 		} else {
 			page_cache_release(page);
 		}
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		radix_tree_preload_end();
 	}
 	return error;
@@ -411,11 +411,11 @@ struct page * find_get_page(struct addre
 	 * We scan the hash list read-only. Addition to and removal from
 	 * the hash-list needs a held write-lock.
 	 */
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page)
 		page_cache_get(page);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return page;
 }
@@ -428,11 +428,11 @@ struct page *find_trylock_page(struct ad
 {
 	struct page *page;
 
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page && TestSetPageLocked(page))
 		page = NULL;
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return page;
 }
@@ -454,15 +454,15 @@ struct page *find_lock_page(struct addre
 {
 	struct page *page;
 
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 repeat:
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
-			spin_unlock(&mapping->page_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			lock_page(page);
-			spin_lock(&mapping->page_lock);
+			spin_lock_irq(&mapping->tree_lock);
 
 			/* Has the page been truncated while we slept? */
 			if (page->mapping != mapping ||
 			    page->index != offset) {
@@ -472,7 +472,7 @@ repeat:
 			}
 		}
 	}
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return page;
 }
@@ -546,12 +546,12 @@ unsigned int find_get_pages(struct addre
 	unsigned int i;
 	unsigned int ret;
 
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree, (void **)pages,
 				start, nr_pages);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
diff -puN mm/page-writeback.c~irq-safe-pagecache-lock mm/page-writeback.c
--- 25/mm/page-writeback.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.039553288 -0800
+++ 25-akpm/mm/page-writeback.c	2004-04-03 03:00:11.062549792 -0800
@@ -472,12 +472,12 @@ int write_one_page(struct page *page, in
 	if (wait)
 		wait_on_page_writeback(page);
 
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	list_del(&page->list);
 	if (test_clear_page_dirty(page)) {
 		list_add(&page->list, &mapping->locked_pages);
 		page_cache_get(page);
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		ret = mapping->a_ops->writepage(page, &wbc);
 		if (ret == 0 && wait) {
 			wait_on_page_writeback(page);
@@ -487,7 +487,7 @@ int write_one_page(struct page *page, in
 		page_cache_release(page);
 	} else {
 		list_add(&page->list, &mapping->clean_pages);
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		unlock_page(page);
 	}
 	return ret;
@@ -515,7 +515,7 @@ int __set_page_dirty_nobuffers(struct pa
 		struct address_space *mapping = page->mapping;
 
 		if (mapping) {
-			spin_lock(&mapping->page_lock);
+			spin_lock_irq(&mapping->tree_lock);
 			if (page->mapping) {	/* Race with truncate? */
 				BUG_ON(page->mapping != mapping);
 				if (!mapping->backing_dev_info->memory_backed)
@@ -523,7 +523,7 @@ int __set_page_dirty_nobuffers(struct pa
 				list_del(&page->list);
 				list_add(&page->list, &mapping->dirty_pages);
 			}
-			spin_unlock(&mapping->page_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			if (!PageSwapCache(page))
 				__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
diff -puN mm/readahead.c~irq-safe-pagecache-lock mm/readahead.c
--- 25/mm/readahead.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.040553136 -0800
+++ 25-akpm/mm/readahead.c	2004-04-03 03:00:11.063549640 -0800
@@ -230,7 +230,7 @@ __do_page_cache_readahead(struct address
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		unsigned long page_offset = offset + page_idx;
@@ -241,16 +241,16 @@ __do_page_cache_readahead(struct address
 		if (page)
 			continue;
 
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = page_offset;
 		list_add(&page->list, &page_pool);
 		ret++;
 	}
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 
 	/*
 	 * Now start the IO.  We ignore I/O errors - if the page is not
diff -puN mm/swapfile.c~irq-safe-pagecache-lock mm/swapfile.c
--- 25/mm/swapfile.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.041552984 -0800
+++ 25-akpm/mm/swapfile.c	2004-04-03 03:00:11.064549488 -0800
@@ -253,10 +253,10 @@ static int exclusive_swap_page(struct pa
 		/* Is the only swap cache user the cache itself? */
 		if (p->swap_map[swp_offset(entry)] == 1) {
 			/* Recheck the page count with the pagecache lock held.. */
-			spin_lock(&swapper_space.page_lock);
+			spin_lock_irq(&swapper_space.tree_lock);
 			if (page_count(page) - !!PagePrivate(page) == 2)
 				retval = 1;
-			spin_unlock(&swapper_space.page_lock);
+			spin_unlock_irq(&swapper_space.tree_lock);
 		}
 		swap_info_put(p);
 	}
@@ -324,13 +324,13 @@ int remove_exclusive_swap_page(struct pa
 	retval = 0;
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the pagecache lock held.. */
-		spin_lock(&swapper_space.page_lock);
+		spin_lock_irq(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		spin_unlock(&swapper_space.page_lock);
+		spin_unlock_irq(&swapper_space.tree_lock);
 	}
 	swap_info_put(p);
diff -puN mm/swap_state.c~irq-safe-pagecache-lock mm/swap_state.c
--- 25/mm/swap_state.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.043552680 -0800
+++ 25-akpm/mm/swap_state.c	2004-04-03 03:00:11.065549336 -0800
@@ -25,7 +25,7 @@ extern struct address_space_operations s
 
 struct address_space swapper_space = {
 	.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC),
-	.page_lock	= SPIN_LOCK_UNLOCKED,
+	.tree_lock	= SPIN_LOCK_UNLOCKED,
 	.clean_pages	= LIST_HEAD_INIT(swapper_space.clean_pages),
 	.dirty_pages	= LIST_HEAD_INIT(swapper_space.dirty_pages),
 	.io_pages	= LIST_HEAD_INIT(swapper_space.io_pages),
@@ -182,9 +182,9 @@ void delete_from_swap_cache(struct page
 
 	entry.val = page->index;
 
-	spin_lock(&swapper_space.page_lock);
+	spin_lock_irq(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
-	spin_unlock(&swapper_space.page_lock);
+	spin_unlock_irq(&swapper_space.tree_lock);
 
 	swap_free(entry);
 	page_cache_release(page);
@@ -195,8 +195,8 @@ int move_to_swap_cache(struct page *page
 	struct address_space *mapping = page->mapping;
 	int err;
 
-	spin_lock(&swapper_space.page_lock);
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&swapper_space.tree_lock);
+	spin_lock(&mapping->tree_lock);
 
 	err = radix_tree_insert(&swapper_space.page_tree, entry.val, page);
 	if (!err) {
@@ -204,8 +204,8 @@ int move_to_swap_cache(struct page *page
 		___add_to_page_cache(page, &swapper_space, entry.val);
 	}
 
-	spin_unlock(&mapping->page_lock);
-	spin_unlock(&swapper_space.page_lock);
+	spin_unlock(&mapping->tree_lock);
+	spin_unlock_irq(&swapper_space.tree_lock);
 
 	if (!err) {
 		if (!swap_duplicate(entry))
@@ -231,8 +231,8 @@ int move_from_swap_cache(struct page *pa
 
 	entry.val = page->index;
 
-	spin_lock(&swapper_space.page_lock);
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&swapper_space.tree_lock);
+	spin_lock(&mapping->tree_lock);
 
 	err = radix_tree_insert(&mapping->page_tree, index, page);
 	if (!err) {
@@ -240,8 +240,8 @@ int move_from_swap_cache(struct page *pa
 		___add_to_page_cache(page, mapping, index);
 	}
 
-	spin_unlock(&mapping->page_lock);
-	spin_unlock(&swapper_space.page_lock);
+	spin_unlock(&mapping->tree_lock);
+	spin_unlock_irq(&swapper_space.tree_lock);
 
 	if (!err) {
 		swap_free(entry);
diff -puN mm/truncate.c~irq-safe-pagecache-lock mm/truncate.c
--- 25/mm/truncate.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.044552528 -0800
+++ 25-akpm/mm/truncate.c	2004-04-03 03:00:11.066549184 -0800
@@ -62,7 +62,7 @@ truncate_complete_page(struct address_sp
  * This is for invalidate_inode_pages().  That function can be called at
  * any time, and is not supposed to throw away dirty pages.  But pages can
  * be marked dirty at any time too.  So we re-check the dirtiness inside
- * ->page_lock.  That provides exclusion against the __set_page_dirty
+ * ->tree_lock.  That provides exclusion against the __set_page_dirty
  * functions.
  */
 static int
@@ -74,13 +74,13 @@ invalidate_complete_page(struct address_
 	if (PagePrivate(page) && !try_to_release_page(page, 0))
 		return 0;
 
-	spin_lock(&mapping->page_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page)) {
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		return 0;
 	}
 	__remove_from_page_cache(page);
-	spin_unlock(&mapping->page_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
diff -puN mm/vmscan.c~irq-safe-pagecache-lock mm/vmscan.c
--- 25/mm/vmscan.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.045552376 -0800
+++ 25-akpm/mm/vmscan.c	2004-04-03 03:00:11.067549032 -0800
@@ -354,7 +354,6 @@ shrink_list(struct list_head *page_list,
 				goto keep_locked;
 			if (!may_write_to_queue(mapping->backing_dev_info))
 				goto keep_locked;
-			spin_lock(&mapping->page_lock);
 			if (test_clear_page_dirty(page)) {
 				int res;
 				struct writeback_control wbc = {
@@ -364,9 +363,6 @@ shrink_list(struct list_head *page_list,
 					.for_reclaim = 1,
 				};
 
-				list_move(&page->list, &mapping->locked_pages);
-				spin_unlock(&mapping->page_lock);
-
 				SetPageReclaim(page);
 				res = mapping->a_ops->writepage(page, &wbc);
 				if (res < 0)
@@ -381,7 +377,6 @@ shrink_list(struct list_head *page_list,
 				}
 				goto keep;
 			}
-			spin_unlock(&mapping->page_lock);
 		}
 
 		/*
@@ -415,7 +410,7 @@ shrink_list(struct list_head *page_list,
 		if (!mapping)
 			goto keep_locked;	/* truncate got there first */
 
-		spin_lock(&mapping->page_lock);
+		spin_lock_irq(&mapping->tree_lock);
 
 		/*
 		 * The non-racy check for busy page.  It is critical to check
@@ -423,7 +418,7 @@ shrink_list(struct list_head *page_list,
 		 * not in use by anybody.	(pagecache + us == 2)
 		 */
 		if (page_count(page) != 2 || PageDirty(page)) {
-			spin_unlock(&mapping->page_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			goto keep_locked;
 		}
@@ -431,7 +426,7 @@ shrink_list(struct list_head *page_list,
 		if (PageSwapCache(page)) {
 			swp_entry_t swap = { .val = page->index };
 			__delete_from_swap_cache(page);
-			spin_unlock(&mapping->page_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			swap_free(swap);
 			__put_page(page);	/* The pagecache ref */
 			goto free_it;
@@ -439,7 +434,7 @@ shrink_list(struct list_head *page_list,
 #endif /* CONFIG_SWAP */
 
 		__remove_from_page_cache(page);
-		spin_unlock(&mapping->page_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		__put_page(page);
 
 free_it:
diff -puN fs/cifs/file.c~irq-safe-pagecache-lock fs/cifs/file.c
--- 25/fs/cifs/file.c~irq-safe-pagecache-lock	2004-04-03 03:00:11.046552224 -0800
+++ 25-akpm/fs/cifs/file.c	2004-04-03 03:00:11.068548880 -0800
@@ -898,11 +898,9 @@ static void cifs_copy_cache_pages(struct
 		if(list_empty(pages))
 			break;
 
-		spin_lock(&mapping->page_lock);
 		page = list_entry(pages->prev, struct page, list);
 		list_del(&page->list);
-		spin_unlock(&mapping->page_lock);
 
 		if (add_to_page_cache(page, mapping, page->index, GFP_KERNEL)) {
 			page_cache_release(page);
@@ -962,14 +960,10 @@ cifs_readpages(struct file *file, struct
 	pagevec_init(&lru_pvec, 0);
 	for(i = 0;i<num_pages;) {
-		spin_lock(&mapping->page_lock);
-		if(list_empty(page_list)) {
-			spin_unlock(&mapping->page_lock);
+		if(list_empty(page_list))
 			break;
-		}
 		page = list_entry(page_list->prev, struct page, list);
 		offset = (loff_t)page->index << PAGE_CACHE_SHIFT;
-		spin_unlock(&mapping->page_lock);
 
 		/* for reads over a certain size could initiate async read ahead */
@@ -989,12 +983,10 @@ cifs_readpages(struct file *file, struct
 			cFYI(1,("Read error in readpages: %d",rc));
 			/* clean up remaing pages off list */
-			spin_lock(&mapping->page_lock);
 			while (!list_empty(page_list) && (i < num_pages)) {
 				page = list_entry(page_list->prev,
 						struct page, list);
 				list_del(&page->list);
 			}
-			spin_unlock(&mapping->page_lock);
 			break;
 		} else if (bytes_read > 0) {
 			pSMBr = (struct smb_com_read_rsp *)smb_read_data;
_